Field Notes: Training Data and the Emotional Internet
- Sarah Howell
- Jun 8
The Emotional Shape of the Data
To understand why these language models respond so strongly to tone, framing, and implied social roles, we need to look at the kinds of data they were trained on.

OpenAI’s models, including GPT‑2, GPT‑3, and by extension ChatGPT and GPT‑4o, were trained in part on WebText and its expanded successor WebText2: datasets built from web pages linked in Reddit posts that received at least three upvotes. That threshold didn't select for accuracy. It selected for resonance. People upvote what makes them feel something (amusement, anger, recognition), not necessarily what’s true. So the model didn’t just learn language. It learned socially charged language.
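To make the heuristic concrete, here is a minimal sketch of a karma-threshold filter of that kind, assuming you already have a list of outbound Reddit links with their post scores. The field names, threshold handling, and deduplication are illustrative, not OpenAI's actual pipeline.

```python
# Illustrative sketch of a WebText-style filter: keep outbound links from
# posts that cleared a small upvote threshold, then deduplicate by URL.
# Not OpenAI's pipeline; field names are made up for the example.

MIN_KARMA = 3

def filter_links(posts):
    """posts: iterable of dicts like {"url": str, "karma": int}."""
    seen = set()
    kept = []
    for post in posts:
        if post["karma"] >= MIN_KARMA and post["url"] not in seen:
            seen.add(post["url"])
            kept.append(post["url"])
    return kept

posts = [
    {"url": "https://example.com/rant", "karma": 41},  # emotionally charged, kept
    {"url": "https://example.com/spec", "karma": 1},   # dry but accurate, dropped
    {"url": "https://example.com/rant", "karma": 7},   # duplicate URL, dropped
]

print(filter_links(posts))  # ['https://example.com/rant']
```

The only point of the sketch is that the filter keys on popularity; nothing in it resembles fact-checking.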
Roughly 22% of GPT‑3’s training mix, by sampling weight, came from these Reddit-linked pages (the WebText2 corpus). The rest came from Common Crawl (~60%, a general web scrape), two book corpora, and Wikipedia, with licensed third-party sources entering the picture in later models. But the Reddit-linked content was disproportionately influential: it was sampled far more often than its raw size would suggest, because it captured the informal, emotionally laden, conversational patterns of real interaction. That is precisely the kind of data that gives ChatGPT its human-like fluency.
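For a sense of what that weighting means in practice, here is a toy sketch of sampling training documents according to a fixed mix, using the approximate shares cited above. The corpus names are real, but the sampling code is purely illustrative and not how any of these models were actually trained.

```python
import random

# Approximate GPT-3 sampling weights (share of training examples drawn
# from each corpus), per the figures cited above. Illustrative only.
mix_weights = {
    "Common Crawl": 0.60,
    "WebText2 (Reddit-linked pages)": 0.22,
    "Books1 + Books2": 0.16,
    "Wikipedia": 0.03,
}

# Draw a small batch of "documents" according to the mix. Reddit-linked
# text shows up far more often than its raw share of the web would imply.
corpora = list(mix_weights)
weights = list(mix_weights.values())
batch = random.choices(corpora, weights=weights, k=10)
print(batch)
```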
Claude, from Anthropic, likely followed a similar pattern. While Anthropic hasn't disclosed its datasets in detail, court filings from a 2025 lawsuit brought by Reddit allege that Claude was trained on scraped Reddit posts at large scale. Like ChatGPT, Claude has also reportedly ingested forum-based sources such as Stack Exchange, GitHub discussions, and YouTube transcripts: dialogue-rich environments filled with implicit emotional signals.
Gemini, built by Google DeepMind, adds another layer: multimodal training. It draws not only from conversational web text but also from spoken language transcripts, like those from YouTube videos, which are often even more emotionally expressive and socially nuanced than written forums. Gemini also benefits from a direct content partnership with Stack Overflow, giving it structured exposure to technical dialogue and question-answer patterns.
Across all three models, one pattern is clear: they were trained on language charged by human feeling. What we call “prompting” is less like programming and more like entering a conversation that has already been happening online for decades.
This is why the models role-play so fluidly. Why they respond defensively to perceived criticism. Why they seem excited, insecure, or withdrawn depending on how you speak to them. These aren’t side effects—they’re inherited behaviors, modeled on how we talk when we feel those things. You're not just using language to issue commands. You're using language to trigger a learned social pattern.