Week 13 — Lecture Outline · Hypothesis Testing: Foundations
Course: Introduction to Statistics (MATH 11) · Silver Oak University (fictional sample) · Prof. Rivera
Objectives covered: Objective 7 — Conduct and interpret hypothesis tests. (This week builds the logic and interpretation; the mechanics of specific tests come in Week 14.)
SLOs touched: A (reason quantitatively from data) · B (communicate results to a non-technical audience)
Meeting pattern: 2 sessions × 75 min = 150 min. Segment minutes below total ~150; scale to your own pattern.
Campus note (Thanksgiving week): we meet Tuesday, Nov 24 only. Thursday, Nov 26 is Thanksgiving and the campus is closed Thu Nov 26 – Fri Nov 27 — no Thursday class, no office hours. Because of the holiday, the week's graded work is due the following Sunday, Nov 29 (an extended, post-break deadline). Say this out loud Tuesday so nobody is surprised. With only one meeting, this outline front-loads the must-teach beats into Session 1; Segments 7–8 are a short, optional "if you have a flipped/async Thursday slot" block — otherwise fold their key lines into Tuesday and assign the readings/tutorial over the break.
Week at a Glance
| The week's big question | "When a sample shows an effect, how do we decide whether it's real — or just the kind of wobble random chance produces all the time?" |
| By the end of the week, students can… | (1) state a null (H₀) and alternative (Hₐ) hypothesis for a described claim; (2) explain what a p-value is — how surprising the data would be if H₀ were true — and compare it to the significance level α to decide reject or fail to reject; (3) state the conclusion in context (not just "reject H₀"); (4) tell a Type I error (rejecting a true H₀) from a Type II error (failing to reject a false H₀), and name a consequence of each; (5) catch the three classic misreadings — the p-value is not "the probability H₀ is true," "fail to reject" is not "H₀ is proven," and "statistically significant" is not "large or important." |
| Key vocabulary | significance test (hypothesis test), null hypothesis (H₀), alternative hypothesis (Hₐ), test statistic, p-value, significance level (α), reject H₀, fail to reject H₀, statistically significant, Type I error (false positive), Type II error (false negative), statistical significance vs. practical significance, effect size |
| Materials | slides (Deck 13), the week's readings + video links, a spreadsheet (Google Sheets or Excel) for the AI-critique moment, one approved chatbot (Gemini / Claude / ChatGPT). No tables to look up this week — every p-value you need is handed to you; the work is the reasoning, not the arithmetic. |
| Timing note | 8 segments, ~150 min. This term Session 2 falls on the Thanksgiving holiday, so teach Segments 1–6 on Tue Nov 24. As written they total about 116 min, so trim the interactions to fit the 75-minute session. Treat Segments 7–8 as the wrap: either squeeze their key lines into Tuesday's last 15 minutes or push them to the readings/tutorial over break. |
The numbers we use all week (pre-computed — never compute on the fly)
Rule for this course: this is a conceptual week. Where a number appears, it is given and kept friendly. Every example below uses one of these exact, pre-decided values. Students never compute a p-value by hand this week — technology or the problem hands it to them, and the skill is comparing and interpreting it.
| The comparison | Decision | Plain-language reading |
|---|---|---|
| p = 0.03 vs α = 0.05 | 0.03 < 0.05 → reject H₀ | "Data this extreme would happen only ~3% of the time if H₀ were true — surprising enough; we reject H₀." |
| p = 0.20 vs α = 0.05 | 0.20 > 0.05 → fail to reject H₀ | "Data like this would happen ~20% of the time even if H₀ were true — not surprising; not enough evidence." |
| p = 0.048 vs α = 0.05 | 0.048 < 0.05 → reject H₀ (barely) | "Just under the cutoff: significant by the rule, but a razor-thin margin worth saying out loud." |
| p = 0.051 vs α = 0.01 | 0.051 > 0.01 → fail to reject H₀ | "At a stricter α = 0.01, p = 0.051 is nowhere near the bar (it narrowly misses even 0.05). The threshold you pick matters." |
The decision rule, said once and kept: p ≤ α → reject H₀; p > α → fail to reject H₀. Small p = surprising data = evidence against H₀.
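For instructors who want to check the table above mechanically, here is a minimal Python sketch of the same decision rule. The function name `decide` and the verdict strings are illustrative choices, not part of the course materials.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the week's decision rule: p <= alpha means reject H0."""
    if not (0.0 <= p_value <= 1.0 and 0.0 < alpha < 1.0):
        raise ValueError("p_value must be in [0, 1] and alpha in (0, 1)")
    return "reject H0" if p_value <= alpha else "fail to reject H0"


# The four pre-decided comparisons from the table above.
comparisons = [(0.03, 0.05), (0.20, 0.05), (0.048, 0.05), (0.051, 0.01)]
results = [decide(p, a) for p, a in comparisons]

for (p, a), verdict in zip(comparisons, results):
    print(f"p = {p} vs alpha = {a}: {verdict}")
```

The sketch only compares two numbers. It does not produce a p-value, which matches this week's rule that every p-value is handed to the student.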
Segment 1 — Hook & the Promise (10 min) · Session 1 (Tue Nov 24) opens
Hook. Put a real-sounding headline on the board: "New study finds students who use the campus tutoring center score significantly higher on finals."
- "Stop on one word: significantly. In everyday English that means 'by a lot.' In statistics it means something completely different — and the gap between those two meanings is one of the most expensive misunderstandings in the news. By Friday you'll never read that word the lazy way again."
- "Here's the deeper problem this week solves. Suppose a coin lands heads 6 times in 10 flips. Is the coin rigged — or is that just what a fair coin does sometimes? Suppose a new study app raises a class average by 2 points. Real effect, or random luck? Every sample wobbles. We need a disciplined way to decide when a result is more than wobble."
The promise (write it on the board): "By the end of this week you can take any 'a study found…' claim and run the logic yourself: state what 'nothing's going on' would mean (H₀), state the rival claim (Hₐ), look at how surprising the data are if nothing's going on (the p-value), compare that to a line we drew in advance (α), and say — in plain English — whether we have real evidence or just noise. And you'll catch the three sentences that get this wrong."
Why it matters line (memory hook): "A hypothesis test doesn't prove anything true. It asks one question: 'Are these data too surprising to blame on chance?'"
Segment 2 — The Logic of a Significance Test: the Courtroom (22 min)
Plain language first — start with the analogy, it carries the whole week.
A hypothesis test is a courtroom trial.
- The defendant is presumed innocent. That presumption is the null hypothesis, H₀ — "nothing unusual is going on; the effect is zero; it was just chance."
- The prosecutor's claim — "something is going on" — is the alternative hypothesis, Hₐ.
- The data are the evidence.
- We never "prove innocence." We ask one thing: is the evidence strong enough to reject the presumption of innocence "beyond a reasonable doubt"? That reasonable-doubt threshold is the significance level, α.
Why innocence is the default (the key move students miss). We put the boring, no-effect, status-quo claim in H₀ on purpose, because a test can only ever try to knock it down. We give chance the benefit of the doubt and make the data work to overturn it. The claim a researcher is hoping to support is almost always Hₐ.
Two verdicts, never three. A jury returns "guilty" (we reject H₀) or "not guilty" (we fail to reject H₀). Notice the second verdict is not "innocent." "Not guilty" means "the prosecution didn't meet the bar," not "we proved the defendant didn't do it." Hold that thought: it is the cure for the biggest misconception of the week.
Memory hook (put it on a slide):
H₀ = innocent (chance). Hₐ = guilty (a real effect). The data are the evidence. α is "beyond a reasonable doubt."
Land the idea: a significance test is a structured way to ask, "If nothing were really going on, would data this extreme be a shock?" If yes → we doubt 'nothing's going on.' If no → we can't.
Segment 3 — Stating H₀ and Hₐ (22 min)
Plain language first.
- Null hypothesis (H₀): the "nothing new / no difference / no effect" statement. It always contains an equals idea — the parameter equals the status-quo value (μ = 75, p = 0.5, "no difference between the two groups").
- Alternative hypothesis (Hₐ): the claim we'd accept if we reject H₀ — that the parameter is different, greater, or less (μ ≠ 75, p > 0.5, "the new method is better"). Hₐ is what the researcher set out to find support for.
The one rule that prevents most errors: H₀ always gets the "=", Hₐ never does. The thing you're trying to demonstrate goes in Hₐ; the thing you're trying to overturn goes in H₀.
One fully worked example (do every step out loud).
Claim a researcher wants to support: "A new study app raises the average final-exam score above the long-run department average of 75."
- The status quo / "no effect" statement: the app changes nothing, so the true mean stays 75 → H₀: μ = 75.
- The claim they hope to support: the true mean is higher → Hₐ: μ > 75.
- In words: "We'll keep believing the app does nothing (mean = 75) unless the exam data are too good to explain by chance — then we'll conclude the mean is really above 75."
(Notation lands only after the idea: μ is the population mean exam score; 75 is the status-quo value we test against.)
A second quick one (two-sided, to show the contrast).
Claim: "A coin might be unfair — in either direction." Status quo = fair → H₀: p = 0.5. The rival, with no direction specified → Hₐ: p ≠ 0.5. ("≠" because 'unfair' could mean too many heads or too many tails.)
Misconception + cure:
- ❌ "I'll put the thing I want to prove in H₀."
✅ Cure: backwards. H₀ is the claim you're trying to knock down; the exciting claim is Hₐ. A test can only build evidence against H₀, never for it.
- ❌ "H₀ and Hₐ are about my sample (x̄, p̂)."
✅ Cure: hypotheses are about the population parameter (μ, p) — the truth we can't see — never about the sample number we happened to get. We test a claim about the world, using the sample as evidence.
Segment 4 — The p-value, and Comparing it to α (24 min) · Session 1 core
Plain language first — the single most important sentence of the week.
The p-value is the probability of getting data at least as extreme as what we saw, assuming H₀ is true. It measures how surprising our data would be if nothing were really going on.
- Small p-value → surprising data → evidence against H₀ ("a fair coin almost never does this").
- Large p-value → unremarkable data → not evidence against H₀ ("a fair coin does this all the time").
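As an optional instructor demo (students still never compute a p-value this week), the coin from the hook can make "at least as extreme, assuming H₀" concrete. The sketch below assumes a one-sided alternative (too many heads) and a fair coin under H₀. The helper name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """P(X >= heads) when X ~ Binomial(flips, p_heads), i.e. assuming H0 is true."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


p_six = binomial_upper_tail(6, 10)   # 6 or more heads in 10 flips
p_ten = binomial_upper_tail(10, 10)  # all 10 flips land heads

print(f"6+ heads in 10 flips: p = {p_six:.3f}")   # 0.377
print(f"10 heads in 10 flips: p = {p_ten:.4f}")   # 0.0010
```

Six heads in ten is something a fair coin does more than a third of the time, so it is not evidence against H₀. Ten heads in ten happens about once in a thousand tries, which is surprising data.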
The significance level α — the line drawn in advance.
- α is the threshold we set before seeing the data for how surprising is surprising enough. The default is α = 0.05 (sometimes 0.01 for higher-stakes work).
- The decision rule (say it once, keep it): p ≤ α → reject H₀. p > α → fail to reject H₀.
One fully worked example — the whole pipeline, end to end (this is the centerpiece).
Scenario: The study app from Segment 3. We collect exam data and technology hands us p = 0.03. We set α = 0.05 in advance.
1. State the hypotheses. H₀: μ = 75 (app does nothing). Hₐ: μ > 75 (app raises the mean).
2. Read the p-value in words. p = 0.03 means: if the app truly did nothing, we'd see exam results this good only about 3 times in 100 just by chance. That's surprising.
3. Compare to α. 0.03 ≤ 0.05 → reject H₀.
4. Conclude in context (the step students skip). "At the 0.05 level, we have statistically significant evidence that the study app raises the average final-exam score above 75." — Not "we proved the app works"; not just "reject H₀."
Now flip one number. Suppose instead technology hands us p = 0.20. Reading: data this good would happen ~20% of the time even if the app did nothing — not surprising. 0.20 > 0.05 → fail to reject H₀. In context: "We don't have enough evidence to conclude the app raises scores." Not "the app definitely doesn't work" — just "this study didn't show it."
Drive the interpretation home — what the p-value is NOT (full Segment 5 treatment is next; flag it here): p = 0.03 does not mean "there's a 3% chance H₀ is true." It is computed assuming H₀ is true, so it can't also be the probability that H₀ is true. (Cure detailed in Segment 5.)
Segment 5 — The Three Classic Misinterpretations (and the Cure for Each) (22 min)
Frame it: "These three sentences sound right, appear in real news stories, and are all wrong. Learning to catch them is half of what this week is worth. Say the wrong version, then the cure, for each."
① The p-value is NOT "the probability that H₀ is true."
- ❌ "p = 0.03, so there's a 3% chance the null is true / a 97% chance the effect is real."
- ✅ Cure: the p-value is computed assuming H₀ is true — it answers "how surprising are the data if H₀ holds?", not "how likely is H₀?" The hypothesis is either true or it isn't; the p-value is a property of the data under H₀, not a probability about the hypothesis. Memory line: "The p-value assumes the null — so it can't measure the null."
② "Fail to reject H₀" is NOT "H₀ is proven true."
- ❌ "p was big, so we proved the app makes no difference / the coin is fair."
- ✅ Cure: failing to reject is the jury's "not guilty," not "innocent." It means "this study didn't find enough evidence," which can happen because there's truly no effect or because the sample was too small to detect a real one. Absence of evidence is not evidence of absence. Memory line: "We never accept H₀ — we just fail to reject it."
③ "Statistically significant" is NOT "large" or "important."
- ❌ "The result was statistically significant, so the effect is big and matters."
- ✅ Cure: "statistically significant" means only "too surprising to be chance at our α" — it says the effect is probably real, not that it is big. With a huge sample, a trivial effect (a 0.3-point bump on a 100-point exam) can be statistically significant and still be practically meaningless. That gap is Segment 6. Memory line: "Significant means 'probably real,' not 'probably big.'"
Quick interaction — "Spot the foul" (think-pair-share, ~8 min): Put four sentences on a slide; students decide legit or classic-misread, then name which misread (①/②/③). Suggested:
1. "p = 0.04, so we have evidence against H₀ at α = 0.05." → legit.
2. "p = 0.04 means there's a 4% chance the null hypothesis is true." → ① false.
3. "We failed to reject H₀, which proves the two groups are identical." → ② false.
4. "The effect was statistically significant, so it must be large enough to matter." → ③ false.
Debrief the two that always split the room: #2 (probability-of-H₀ trap) and #4 (significant ≠ big).
Segment 6 — Statistical vs. Practical Significance (16 min)
Plain language first.
- Statistical significance answers "is the effect probably real (not just chance)?" — a yes/no at level α.
- Practical significance answers "is the effect big enough to care about in the real world?" — a judgment about the size of the effect and its cost/benefit.
- They are different questions, and a result can be one without the other. Big samples make even tiny effects statistically significant; that never makes a tiny effect important.
One fully worked example (pre-computed, friendly).
A weight-loss program is tested on 40,000 people. After a year the program group lost, on average, 0.3 pounds more than the control group, and because the sample is enormous, p = 0.001 — wildly statistically significant.
- Statistically significant? Yes — p = 0.001 ≤ 0.05. The 0.3-pound difference is almost certainly real, not chance.
- Practically significant? No. 0.3 pounds over a year is meaningless to an actual dieter. A real effect, but not one worth the cost or effort.
- The lesson: "significant" answered the first question, not the second. Always ask both: Is it real? Is it big enough to matter?
Memory hook: "Statistical significance is about chance; practical significance is about size. A big enough sample can make a meaningless difference 'significant.'"
Misconception + cure:
- ❌ "It's significant, so we should act on it."
✅ Cure: ask for the effect size, not just the p-value. Significant + tiny effect = real but not worth acting on. (This is also why good reports show the estimate and a confidence interval, Weeks 11–12 — not just a p-value.)
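To show why sample size alone can push a tiny effect below α, here is a rough sketch using a normal approximation for a difference in two group means. The standard deviation of 10 pounds and the equal split into two groups are assumptions made for illustration. Neither is given in the example above, so the output is not meant to reproduce its p = 0.001.

```python
from math import erfc, sqrt


def two_sided_p(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in two group means (normal approximation)."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = diff / standard_error
    return erfc(abs(z) / sqrt(2.0))  # P(|Z| >= |z|) for a standard normal Z


DIFF_LB = 0.3  # observed difference in pounds, from the example above
SD_LB = 10.0   # ASSUMED spread of one-year weight change; not from the outline

p_small = two_sided_p(DIFF_LB, SD_LB, n_per_group=200)     # a modest study
p_large = two_sided_p(DIFF_LB, SD_LB, n_per_group=20_000)  # 40,000 people, split evenly (assumed)

print(f"n = 200 per group:    p = {p_small:.3f}")
print(f"n = 20,000 per group: p = {p_large:.4f}")
```

Under these assumptions the same 0.3-pound difference is unremarkable in the small study and statistically significant in the large one. The effect size never changed; only the sample did.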
Segment 7 — Type I vs. Type II Error: the Two Ways to Be Wrong (16 min) · Session 2 beat (push to Tue or to the break readings)
Plain language first — there are exactly two mistakes a test can make.
| | H₀ is actually TRUE (no real effect) | H₀ is actually FALSE (real effect) |
|---|---|---|
| We reject H₀ | ❌ Type I error (false positive) | ✅ correct |
| We fail to reject H₀ | ✅ correct | ❌ Type II error (false negative) |
- Type I error = rejecting a true H₀ → we cried "effect!" when there was none. A false positive. When H₀ is true, the probability of making this error is α (that's what α is).
- Type II error = failing to reject a false H₀ → we missed a real effect that was actually there. A false negative.
The memorable framing (put it on a slide) — the courtroom, finished:
Type I error = convicting an innocent person (you punished a defendant who did nothing — a false alarm).
Type II error = letting a guilty person go free (a real culprit walked — a missed catch).
Lowering α (demanding more evidence) makes false convictions rarer (fewer Type I) but lets more guilty go free (more Type II). You can't drive both to zero at once — that trade-off is why we pick α deliberately.
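The trade-off can be seen in a small simulation, sketched here under stated assumptions: each simulated study is reduced to one z-statistic, the test is one-sided, and the "real effect" is set to 2 standard errors. That effect size and the names `upper_tail_p` and `rejection_rate` are illustrative, not from the course materials.

```python
import random
from math import erfc, sqrt


def upper_tail_p(z: float) -> float:
    """One-sided p-value P(Z >= z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2.0))


def rejection_rate(true_shift: float, alpha: float, trials: int, seed: int) -> float:
    """Fraction of simulated studies that reject H0 at level alpha."""
    rng = random.Random(seed)  # fixed seed so the demo is repeatable
    rejections = sum(
        upper_tail_p(rng.gauss(true_shift, 1.0)) <= alpha for _ in range(trials)
    )
    return rejections / trials


TRIALS = 20_000
EFFECT = 2.0  # ASSUMED size of the real effect, in standard-error units

rates = {}
for alpha in (0.05, 0.01):
    type_1 = rejection_rate(0.0, alpha, TRIALS, seed=1)           # H0 true: any rejection is a false alarm
    type_2 = 1.0 - rejection_rate(EFFECT, alpha, TRIALS, seed=2)  # H0 false: any non-rejection is a miss
    rates[alpha] = (type_1, type_2)
    print(f"alpha = {alpha}: Type I rate ~ {type_1:.3f}, Type II rate ~ {type_2:.3f}")
```

In runs like this the Type I rate tracks α closely, while moving α from 0.05 to 0.01 raises the Type II rate for the same real effect.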
A consequence of each (make it concrete).
Medical test for a disease. H₀: "patient is healthy."
- Type I (reject a true H₀): tell a healthy patient they're sick → needless fear, cost, maybe risky follow-up treatment.
- Type II (fail to reject a false H₀): tell a sick patient they're fine → a real disease goes untreated.
Which is worse depends on the stakes — and that is exactly why we choose α (and sample size) on purpose for the situation.
Memory hook: "Type I = false alarm (convict the innocent). Type II = missed it (free the guilty)." — or the catchphrase: "Type I cries wolf; Type II misses the wolf."
Segment 8 — Technology Workflow + AI-Critique, Callback & Hand-off (12 min) · Session 2 close (or fold into Tue)
Technology workflow — compare a p-value to α automatically in a spreadsheet (exact steps):
1. In cell A1 type a label p-value; in B1 put the number, e.g. 0.03.
2. In A2 type alpha; in B2 put 0.05.
3. In A3 type decision; in B3 type:
=IF(B1<=B2, "Reject H0", "Fail to reject H0")
4. Change B1 to 0.20, then 0.048, then 0.051 (and try B2 = 0.01) and watch the verdict flip. (This is the decision rule from Segment 4, made into one formula — Google Sheets and Excel are identical.)
AI-critique moment (students verify, not consume — the signature habit):
Paste to an approved chatbot: "In a study, p = 0.03 with α = 0.05. Explain what the p-value means and what we conclude."
Then audit the answer against this week's rules. Chatbots frequently write the forbidden sentence — "there's a 3% chance the null hypothesis is true" (that's misread ①) — or slide from "significant" to "large/important" (misread ③). Make students catch and rewrite the bad sentence: "p = 0.03 means data this extreme would occur only ~3% of the time if H₀ were true; since 0.03 ≤ 0.05 we reject H₀ and conclude the effect is statistically significant — not that the effect is large, and not that there's a 3% chance H₀ is true." The tool drafts; you judge. This is exactly how the weekly Lecture Tutorial works.
Callback + tease:
- Callback: "Weeks 11–12 we built confidence intervals — an honest range for a parameter. A test asks the yes/no cousin question: is a specific claimed value too far from our data to believe? (In fact, a 95% CI that excludes the null value matches a two-sided test rejecting at α = 0.05 — same logic, two views.)"
- Tease next week: "This week was the logic. Week 14 turns the crank: actual one-sample t-tests for a mean, tests for a proportion, and two-sample comparisons — the same H₀/Hₐ/p-value/α machine, now with the formulas that produce the p-value."
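The callback's "same logic, two views" remark can be checked with a short sketch. It assumes a normal approximation and uses made-up sample means and a made-up standard error of 1.0, tested against H₀: μ = 75. None of these numbers come from the course data.

```python
from statistics import NormalDist


def ci_and_test(estimate: float, std_error: float, null_value: float, alpha: float = 0.05):
    """Return (confidence interval, two-sided p-value) from a normal approximation."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    interval = (estimate - z_crit * std_error, estimate + z_crit * std_error)
    z = (estimate - null_value) / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return interval, p_value


NULL_MEAN = 75.0
checks = []
for estimate in (77.0, 76.5):  # HYPOTHETICAL sample means
    (low, high), p = ci_and_test(estimate, std_error=1.0, null_value=NULL_MEAN)
    excludes_null = not (low <= NULL_MEAN <= high)
    checks.append((excludes_null, p <= 0.05))
    print(f"estimate {estimate}: 95% CI = ({low:.2f}, {high:.2f}), p = {p:.3f}, "
          f"CI excludes 75: {excludes_null}, reject at 0.05: {p <= 0.05}")
```

In both cases the two views agree: the interval excludes 75 exactly when the two-sided test rejects at α = 0.05.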
Hand-off (the week's graded work — note the post-Thanksgiving due date):
- Lecture Tutorial 13 (AI tutor, share-link submission) — H₀/Hₐ, p-value vs. α, Type I/II, the three misreads.
- Quiz 13, Discussion 13 ("a study found a significant effect" — interrogate a real headline), and Assignment 13 — all due Sun Nov 29 (extended past the holiday weekend).
Instructor FAQ — Common Stumbles
| Student says / does | Quick cure |
|---|---|
| Puts the claim they want to prove in H₀. | Flip it. H₀ is the "no effect / status quo" claim you try to knock down; the exciting claim is Hₐ. A test only ever builds evidence against H₀. |
| Writes hypotheses about x̄ or p̂ (the sample). | Hypotheses are about the population parameter (μ, p) — the unknown truth — never the sample statistic you measured. |
| "p = 0.03 means a 3% chance the null is true." | No — the p-value is computed assuming H₀ is true, so it can't be the probability H₀ is true. It's "how surprising are the data if H₀ holds." "The p-value assumes the null, so it can't measure the null." |
| "We failed to reject, so H₀ is proven true." | "Not guilty" ≠ "innocent." We never accept H₀ — we just didn't find enough evidence. Could be no effect, or a sample too small to catch one. |
| "It's statistically significant, so it's a big/important effect." | Significant means "probably real," not "big." A huge sample can make a trivial effect significant. Ask for the effect size. |
| Confuses Type I and Type II. | Type I = reject a true H₀ = false alarm = "convict the innocent." Type II = fail to reject a false H₀ = missed it = "free the guilty." α is the Type I rate. |
| "Why is α = 0.05? Can I change it?" | α is a choice made before the data — the reasonable-doubt bar. 0.05 is convention; use 0.01 when a false positive is costly. Lower α → fewer Type I but more Type II. |
| Reports only "reject H₀" with no context. | Always finish the sentence: "At the 0.05 level we have significant evidence that [the real-world claim]." The number is not the conclusion; the sentence about the world is. |
| "A one-sided vs. two-sided Hₐ — which do I use?" | Use two-sided (≠) unless the question only cares about one direction ("is it higher?" → >). When unsure, two-sided is the safe, honest default. |
Scope flag
This outline stays within Objective 7 at the logic-and-interpretation level (no formulas that produce a p-value — those are Week 14). The medical-test consequence example, the CI-excludes-null connection in Segment 8, and the α/Type-II trade-off line are added context (not strictly required by the objective) — kept because they make the two-errors idea and the "significant ≠ proven" cure stick. Trim them for a leaner 60-minute version; the must-keeps are the courtroom analogy, the worked p-vs-α pipeline, and the three misinterpretations.
~ Prof. Rivera's edition · Fall 2026 · built with thecoursemaker.com