Training Data & How Models Learn
Where AI knowledge comes from, and what that means.
Imagine you want to teach someone to recognize apples without ever giving them a definition. Instead, you show them thousands of apples (big ones, small ones, red, green, bruised) until they can identify an apple they've never seen before. That's essentially how AI learns. But what happens when the examples are flawed, biased, or incomplete?
The Data That Shapes Everything
Every AI system that "learns" does so through training: the process of exposing a model to enormous quantities of data and adjusting its internal settings until it can reliably produce the right outputs. The quality, scope, and composition of that training data directly determine what the AI knows, what it gets right, and where it fails.
Training datasets for large AI models can contain billions of examples: web pages, books, articles, code, conversations, images. The model processes this data over many training cycles, adjusting the mathematical relationships between concepts until it can predict outputs with increasing accuracy. At the end of this process, the model's "knowledge" is not a database of facts but a vast set of weighted connections: patterns it learned to associate based on what co-occurred most often in the training data.
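To make "adjusting internal settings" concrete, here is a deliberately tiny sketch in Python. It is not how any real model is built; the single weight, the numbers, and the learning rate are invented for illustration. The point is the shape of the loop: guess, measure the error, nudge the settings, repeat over many cycles.

```python
# Toy "training": nudge one adjustable setting (a weight) so predictions
# match the examples. Real models adjust billions of weights, but the
# loop has the same shape: predict, measure error, adjust, repeat.

examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, desired output) pairs
weight = 0.0          # the model's single internal setting, starting from "knows nothing"
learning_rate = 0.05  # how far to nudge the weight after each example

for cycle in range(200):                        # many passes over the same data
    for x, target in examples:
        prediction = weight * x                 # the model's current guess
        error = prediction - target             # how far off the guess was
        weight -= learning_rate * error * x     # nudge the weight to shrink the error

print(round(weight, 3))  # ends up close to 2.0 -- the pattern hidden in the examples
```

Notice that nothing in this sketch stores the examples as facts; all that survives training is the adjusted weight.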
This is a crucial distinction: AI does not store facts the way a textbook does. It stores patterns. When asked a question, it generates an answer that fits the pattern of what a good answer to that kind of question looks like, not necessarily an answer that has been verified as true.
AI "knowledge" is really pattern recognition, not fact storage. A confident-sounding answer is evidence that the model found a strong pattern, not evidence that the answer is correct.
Why Training Data Composition Matters
Because training data is the foundation of everything an AI model knows, the composition of that data has profound consequences. If the data skews heavily toward certain languages, cultures, or time periods, the model's knowledge will reflect those skews. If the data contains factual errors (and any large corpus of internet text certainly does), the model may learn those errors as patterns.
More subtly, the data may contain systemic biases that are not obvious errors but rather reflect historical patterns of inequality. A language model trained on decades of professional writing may have learned that certain roles are associated with certain demographics, not because the model was programmed to be biased, but because the training data reflected the world as it historically was. When the model generates text about those roles, it may reproduce those associations automatically.
This is why the phrase "garbage in, garbage out" applies directly to AI. Even technically sophisticated models trained on poor, biased, or unrepresentative data will produce outputs that reflect those problems, and often do so with the same apparent confidence as their most reliable outputs.
Labeled Data and Human Judgment
Many AI systems require not just data but labeled data: examples where a human has marked the correct answer. For an image recognition system, this means someone has labeled thousands of photos with what they contain. For a language model used in a customer service context, this might mean humans rating which responses were helpful and which were not. This human feedback shapes the model's behavior.
This means that human judgment, with all of its subjectivity and inconsistency, is built into many AI systems at a fundamental level. The criteria used to label data, the demographics of the people doing the labeling, and the cultural contexts those people bring to their judgments all shape what the model learns to produce. Recognizing that humans are embedded in AI systems, not just at the user interface level but at the training level, is essential for thinking critically about AI outputs.
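Here is a small, made-up illustration of what labeled data can look like in practice. The responses, ratings, and rater IDs below are invented; the point is that the "correct answers" a model learns from are themselves human judgments.

```python
# Hypothetical labeled data for a customer-service model: each record pairs
# a candidate response with a human rating. Whatever criteria and assumptions
# the raters brought to these judgments become part of what the model learns.

labeled_examples = [
    {"response": "Try restarting the app, then check for updates.",  "rating": "helpful",     "rater": "R-017"},
    {"response": "That's not something we handle.",                  "rating": "not helpful", "rater": "R-017"},
    {"response": "Here are the three steps to reset your password.", "rating": "helpful",     "rater": "R-042"},
]

# Training would push the model toward responses resembling the "helpful" ones,
# so the raters' standards quietly become the model's standards.
helpful_examples = [ex["response"] for ex in labeled_examples if ex["rating"] == "helpful"]
print(helpful_examples)
```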
The Biased Textbook
An AI company trains a reading comprehension model using a large dataset of middle school reading passages. The dataset draws heavily from textbooks published between 1970 and 2000. The model performs well on comprehension tasks overall, but researchers notice something troubling: when asked to generate reading passages on careers, it consistently portrays nurses as women and engineers as men. When asked to generate historical narratives, it routinely centers Western European perspectives and underrepresents the voices and contributions of non-Western civilizations.
No one programmed these biases into the model. They emerged from the patterns in decades of textbook content: content that reflected the assumptions and blind spots of the era in which it was written.
The company's team debates what to do. Some argue the model just reflects historical reality. Others argue that deploying it in classrooms will reinforce stereotypes. A third group says the solution is better training data.
CCR Connection
Ask where the training data came from and what it might be missing or distorting before trusting AI outputs as authoritative.
Understanding training data helps you craft prompts that work around known limitations: asking for multiple perspectives, checking for recency, or specifying underrepresented contexts.
Deploying AI in contexts that affect real people (hiring, education, healthcare) carries responsibility to scrutinize training data quality and bias.
Large Language Models & Prompts
What's actually happening when you type a question into a chatbot.
You type a question. Seconds later, a polished paragraph appears. It feels like magic. But it is not magic; it is mathematics, statistics, and an enormous amount of pattern-matching. Understanding what's actually happening when you interact with a large language model makes you dramatically better at using it.
What a Large Language Model Actually Does
A large language model (LLM) is a type of AI trained specifically on text. During training, it processes billions of examples of human writing and learns to predict, with increasingly fine-grained accuracy, what text is likely to follow other text. When you type a prompt, the model generates a response one token at a time, each token chosen based on what is statistically most likely given everything that came before it.
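Here is a deliberately oversimplified sketch of that generate-one-token-at-a-time loop. The probability table is invented, and unlike a real LLM (which weighs everything in its context window), this toy looks only at the previous word. What it shares with the real thing is the loop: pick a likely next token, append it, repeat.

```python
import random

# Toy next-token generation with a hard-coded probability table.
# A real LLM computes these probabilities from billions of learned weights
# and conditions on the whole context, not just the last word.

next_token_probs = {
    "the":  [("cat", 0.5), ("dog", 0.4), ("<end>", 0.1)],
    "cat":  [("sat", 0.7), ("ran", 0.2), ("<end>", 0.1)],
    "dog":  [("ran", 0.6), ("sat", 0.3), ("<end>", 0.1)],
    "sat":  [("down", 0.8), ("<end>", 0.2)],
    "ran":  [("away", 0.8), ("<end>", 0.2)],
    "down": [("<end>", 1.0)],
    "away": [("<end>", 1.0)],
}

tokens = ["the"]
while tokens[-1] != "<end>":
    candidates, weights = zip(*next_token_probs[tokens[-1]])
    tokens.append(random.choices(candidates, weights=weights)[0])

# e.g. "the cat sat down" -- fluent-looking text, but nothing was looked up or verified
print(" ".join(tokens[:-1]))
```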
This is not retrieval. The model is not looking up your question in a database and returning the answer. It is generating new text that fits the pattern of what a helpful, coherent response to that kind of question would look like. This is why LLMs can be creative, produce novel combinations of ideas, and write in different styles; it is also why they can generate fluent, confident text that is completely wrong.
Context Windows and How Models "Remember"
LLMs process text within what is called a context window: essentially the amount of text the model can "see" and consider at once. Everything in the context window influences the model's outputs: your prompt, any previous messages in the conversation, any documents you have shared, and any instructions built into the system by the developers.
This is why the same model can behave very differently in different contexts. A customer service chatbot built on the same underlying model as a creative writing assistant may seem like a completely different product, because the system instructions, the context, and the constraints differ.
Critically, LLMs do not have persistent memory across conversations by default. Each new conversation starts fresh. The model has no recollection of previous interactions unless that history is explicitly included in the current context window.
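The sketch below shows one common way this works in practice. The function name and message format are hypothetical, not any particular vendor's API; the point is that everything the model "remembers" has to be packed into the context it is sent right now.

```python
# Hypothetical illustration of context: the model only ever sees the list of
# messages it is handed. "Memory" just means earlier turns were included again.

def send_to_model(messages):
    # Stand-in for a real API call; here it only reports what it was given.
    return f"(response based on {len(messages)} messages of context)"

conversation = [
    {"role": "system", "content": "You are a homework helper. Be concise."},
    {"role": "user",   "content": "What is a context window?"},
]
conversation.append({"role": "assistant", "content": send_to_model(conversation)})

# Follow-up question: the earlier turns are sent again, so the model can use them.
conversation.append({"role": "user", "content": "Can you give an example?"})
print(send_to_model(conversation))   # sees 4 messages of context

# A brand-new conversation starts empty: the model has no idea what
# "an example" refers to, because that history was never included.
fresh = [{"role": "user", "content": "Can you give an example?"}]
print(send_to_model(fresh))          # sees only 1 message
```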
The prompt you write is not just a question; it is the primary input shaping everything the model generates. Small changes in how you phrase a question can produce significantly different outputs, because you are changing the pattern the model is trying to match.
How Prompts Shape Outputs
Because LLM outputs are shaped by statistical patterns, the way you phrase a prompt has a direct effect on what you get back. A vague prompt produces a vague response, not because the model is lazy but because there are many equally likely continuations of a vague input. A specific, detailed prompt narrows the statistical space and tends to produce more targeted, useful outputs.
Beyond specificity, prompts can also convey tone, format, audience, and purpose. "Explain neural networks" produces a different response than "Explain neural networks to a 14-year-old with no programming background in under 150 words." The second prompt gives the model much more information about what a good response looks like, and the output reflects that.
Prompting is a genuine skill, one that the best users of AI tools develop deliberately. It is not magic; it is communication. Being clear, specific, and explicit in your prompts is the single most reliable way to improve the quality of AI outputs.
The Vague vs. the Precise
Two students, Amara and Devon, are both working on the same history assignment about the causes of World War I. Both use the same AI chatbot.
Amara types: "What caused World War I?" She gets a response; it's accurate enough, but very broad. It covers militarism, alliances, imperialism, and nationalism in a few sentences each. It feels like something she could have found in the first paragraph of a Wikipedia article.
Devon types: "I'm writing a 10th-grade essay arguing that the alliance system was the most significant cause of World War I. Can you help me think through the strongest counterarguments I'll need to address, and suggest specific historical examples that would support my thesis? Please format your response as three counterarguments, each with a suggested rebuttal." Devon gets a structured, specific, genuinely useful response that directly helps her write a stronger essay.
Same tool. Same underlying model. Very different outputs.
CCR Connection
Understanding how LLMs generate outputs helps you evaluate them critically, knowing that fluency reflects statistical patterns, not verified reasoning.
Prompting is a creative skill. Investing in it (being specific, purposeful, and iterative) dramatically expands what you can accomplish with AI tools.
A skilled prompter stays in the driver's seat. The more intentional your prompts, the more you shape the AI's role rather than letting it define yours.
Why AI Makes Mistakes
Understanding failure modes, so you can catch them.
Every AI system makes mistakes. The question is not whether your AI tool will get something wrong; it is what kind of mistakes it makes, when it makes them, and whether you will notice. Understanding AI failure modes is one of the most practically important things you can learn.
The Hallucination Problem
Hallucination is the term used when an AI generates false information presented as though it were true. This can range from minor inaccuracies (a date that is slightly off, a name slightly misspelled) to substantial fabrications: quotes from people who never said them, citations to papers that do not exist, descriptions of events that never happened.
Hallucinations occur because LLMs are not retrieving verified facts; they are generating text that fits learned patterns. When the model encounters a prompt that touches on something it has weak or incomplete training data for, it may still produce a confident-sounding response, because producing confident-sounding text is exactly what it has been trained to do. The model has no mechanism to flag uncertainty in the same way a human expert would: saying "I'm not sure about that" requires a form of self-knowledge that current models do not possess.
The implication for users is significant: the confidence of an AI's delivery is not correlated with the accuracy of its content. An AI can sound equally certain when it is right and when it is wrong. This means verification is not optional for any AI output you intend to rely on.
When an AI provides specific facts (names, dates, statistics, citations, quotes), those are exactly the outputs most likely to be hallucinated. The more specific a claim, the more important it is to verify it independently.
Systematic Biases and Blind Spots
Beyond hallucination, AI systems exhibit systematic biases: consistent patterns of error or distortion that reflect limitations in training data or model design. Unlike random errors, systematic biases follow predictable patterns: a model may consistently underperform on content in certain languages, consistently reproduce gender stereotypes in career-related content, or consistently present one cultural perspective on contested historical events.
Because systematic biases are consistent, they can be easy to miss. If a model always presents a particular framing of an issue, a user who does not already know the issue well may never realize the framing is limited. This is one reason why relying heavily on AI for research on topics where you have little prior knowledge is particularly risky: you may lack the context to notice when the model's biases are shaping what you're learning.
The best defense against systematic bias is diverse sourcing. AI output should be one input among several, not the primary or sole source for any important claim, especially on topics that involve contested perspectives or underrepresented communities.
Outdated Information and Knowledge Cutoffs
LLMs are trained on data up to a certain point in time: their "knowledge cutoff." Events, discoveries, policies, and developments after this cutoff date are simply not in the model's training data, and the model has no way to know what it does not know. This means AI responses to questions about recent events, current statistics, or evolving situations may be based on outdated information, presented with the same confidence as everything else.
Some AI tools are connected to live internet search, which mitigates this problem for factual lookups. But even these tools can be wrong, can misinterpret search results, or can present outdated cached information. For anything time-sensitive (recent research, current laws, current events, current prices, current officeholders), independent verification from a current, authoritative source is essential.
The Confident Wrong Answer
Kenji is writing a civics report on a recent change to his state's voting laws. He asks a chatbot for a summary of the current law. The chatbot provides a clear, detailed, well-organized summary, including specific provisions, the year they were enacted, and how they compare to national trends.
Kenji submits his report. His teacher returns it with significant corrections: the law the chatbot described had been substantially amended eight months ago. Several of the specific provisions Kenji cited were no longer current. One piece of information, a specific percentage, was simply wrong, with no source.
The chatbot had answered confidently and helpfully. It had no way to know its information was outdated. And Kenji had no way to know the chatbot didn't know.
CCR Connection
Know the specific ways AI fails (hallucination, systematic bias, outdated knowledge) so you can anticipate and catch errors rather than be surprised by them.
Understanding failure modes helps you use AI creatively while managing the risks: use it for brainstorming and drafting, then independently verify any specific claims before relying on them.
When your work affects others (a presentation, a report, a decision), you carry responsibility for the accuracy of your sources. AI is a tool, not an authority.
Unit Quiz & Final Reflection
Show what you know โ then show what you think.
Unit 2 Complete!
You've finished How AI Works. Below are your certificate and digital badge.
Your downloadable digital badge, shareable on LinkedIn, email signatures, or portfolios.
Ready to continue?