Week 7 — Lecture Outline · Multimodal AI: Voice, Audio, Images & Documents

Using Artificial Intelligence · AI 101 · Fall 2026 · Prof. Quinn · Fictional sample

Course: Using Artificial Intelligence (AI 101) · Silver Oak University (fictional sample) · Prof. Quinn
Objective covered: Objective 3 — Use the full range of AI modalities and match the right tool to the right task.
SLOs touched: A (produce quality results by choosing the right modality and tool) · B (evaluate AI output critically, especially in multimodal workflows)
Meeting pattern: 2 sessions × 75 min = 150 min. Segment minutes below total ~150; scale to your own pattern.

Week at a Glance


The week's big question	"When AI can hear, see, read, and draw — what can I actually trust it to do, and what do I still have to check?"
By the end of the week, students can…	(1) describe voice prompting (Skill 8); (2) run the record → transcribe → analyze workflow (Skill 9); (3) name and match multimodal tasks (image-to-text, handwriting-to-text, image analysis, document/PDF analysis, image creation); (4) catch and fix transcription errors and summary fabrications; (5) weigh both sides of the AI image generation debate.
Key vocabulary	multimodal AI, voice prompting, audio transcription, image-to-text, handwriting-to-text, image analysis, document analysis, image creation/generation, transcription error, summary fabrication
Materials	slides (Deck 7), the week's readings + resource links, one approved assistant (ChatGPT / Claude / Gemini / Copilot) for live demos, a phone or built-in recorder for the transcription demo
Timing note	8 segments, 150 min total. Session 1 = Segments 1–4 (72 min). Session 2 = Segments 5–8 (78 min).

Segment 1 — Hook & the Big Reveal (8 min) · Session 1 opens

Hook. Without any announcement, hold up your phone and record yourself saying, slowly and clearly: "Prof. Quinn's AI 101. Week seven. Multimodal AI." Then run it through your phone's built-in voice-to-text (iOS Notes voice transcription, Android's live transcription, or whatever your device has). Project the transcript. Ask the room: "Is it right? What did it miss? What did it change?"

The reveal: "Almost everyone in this room has been using AI as a text box. This week we open it into something much bigger — AI that hears you, reads your handwriting, analyzes your PDFs, and draws from your description. Welcome to multimodal AI."

Why it matters: the same underlying AI infrastructure that powers text chat can now process voice, images, audio files, and documents. Knowing which modality to use — and where each one fails — is the workforce skill.

Memory hook: "Text is just one channel. AI now speaks every channel — but it still makes mistakes on all of them."

Segment 2 — Skill 8: Voice Prompting (20 min)

Plain language first. Many AI assistants let you speak your prompt instead of typing it. In most setups the assistant converts your speech to text (transcription), then processes that text as it would a typed prompt. The output comes back as text (or sometimes as audio, depending on the tool).

What voice prompting is good for:
- When your hands are busy (cooking, walking, commuting)
- When speaking is faster than typing for longer, conversational prompts
- When you want to practice thinking out loud before committing to a written request
- Accessibility: voice input removes the typing barrier for users with motor difficulties

What to watch for:
- The transcription step introduces errors — especially with names, technical terms, numbers, and accents. What the AI heard may not be what you said.
- Verify the transcribed text before you rely on the assistant's response. Most assistants show you the transcript.
- Tone and pace matter: speaking clearly and in full sentences gets better transcription than rushed fragments.

Live demo (do this now, projector visible):
1. Open ChatGPT or Claude on a phone or laptop with mic access. Find the microphone icon.
2. Say: "I'm a marketing student working on a campaign for a small local coffee shop. Suggest three social media post ideas for the week before their grand opening. Keep each idea to two sentences."
3. Read the transcript displayed. Point out any error.
4. Read the response. Ask: "Does the response reflect what you actually said, or the transcription error?"
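
For instructors who want to make the comparison in steps 3 and 4 concrete, the check can be sketched in Python. This is a minimal illustration using only the standard library, not part of the required demo: the function name `word_diffs` and the "bad transcript" are invented for the example, and the comparison is exact, so punctuation or capitalization differences also count as changes.

```python
from difflib import SequenceMatcher


def word_diffs(spoken: str, transcribed: str) -> list[tuple[str, str]]:
    """Return (what you said, what the tool wrote) pairs that differ."""
    said, heard = spoken.split(), transcribed.split()
    matcher = SequenceMatcher(a=said, b=heard, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # a replaced, dropped, or added run of words
            diffs.append((" ".join(said[i1:i2]), " ".join(heard[j1:j2])))
    return diffs


# Hypothetical example: part of the demo prompt and an invented bad transcript.
spoken = "Suggest three social media post ideas for the week before their grand opening"
transcribed = "Suggest three social media post ideas for the weak before there grand opening"

for said, heard in word_diffs(spoken, transcribed):
    print(f"said {said!r} -> transcript shows {heard!r}")
```

A dropped word shows up as a pair whose second element is empty, and an added word as a pair whose first element is empty. The sketch only works when you know exactly what was said, which is why the demo uses a scripted prompt.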

Misconception + cure:
❌ "Voice mode means AI hears and understands you like a human."
✅ Cure: it first transcribes your speech to text (introducing potential errors), then processes that text. It's two steps, not one. Errors in step 1 flow into step 2.

Segment 3 — Skill 9: Record → Transcribe → Analyze (22 min)

Plain language first. This is the workflow that makes AI genuinely useful for meetings, interviews, lectures, and voice memos — and it has three distinct steps, each of which can introduce its own errors.

The three-step workflow:
1. Record: capture audio of a meeting, interview, voice memo, or class using your phone, laptop, or a dedicated recorder. The goal is clean audio — background noise and crosstalk degrade every downstream step.
2. Transcribe: convert the audio file to text using a free transcription tool. The resulting text file is your raw material. This is where the first layer of errors appears: misheard words, dropped sentences, merged speakers, and occasionally words that were never spoken.
3. Analyze: paste the transcript into an AI assistant and ask for a summary, key points, or action items. This is where a second layer of errors appears: the AI may fill in gaps, smooth over contradictions, or invent plausible-sounding details that weren't in the transcript.
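
The three steps above can be sketched in Python for anyone who wants to see the hand-offs explicitly. This is an illustration under stated assumptions, not a required course tool: it assumes the open-source `openai-whisper` package and ffmpeg are installed for the transcription step, and the helper names (`transcribe_audio`, `build_analysis_prompt`, `run_workflow`) are hypothetical. The analysis step only builds the prompt text you would paste into an assistant.

```python
def transcribe_audio(path: str, model_name: str = "base") -> str:
    """Step 2: turn an audio file into raw transcript text.

    Assumes the open-source `openai-whisper` package and ffmpeg are
    installed. The import is inside the function so the rest of the
    sketch can be read and run without them.
    """
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return result["text"].strip()


def build_analysis_prompt(transcript: str) -> str:
    """Step 3: wrap a reviewed transcript in a prompt for an assistant."""
    return (
        "Below is a meeting transcript. It may contain transcription errors.\n"
        "Give a 3-bullet summary and list any action items.\n"
        "Use only what is in the transcript; mark anything unclear as unclear.\n\n"
        "TRANSCRIPT:\n" + transcript.strip()
    )


def run_workflow(path: str) -> str:
    """Chain the steps. Step 1 (recording) happens on your phone or laptop."""
    raw = transcribe_audio(path)
    # In practice, stop here: read `raw` and correct errors by hand
    # before building the prompt, because step 3 inherits every mistake.
    return build_analysis_prompt(raw)
```

Calling `run_workflow("meeting.m4a")` (a hypothetical file name) returns the prompt text. The deliberate pause between the two calls is the point of the sketch: the assistant never hears the audio, so the transcript review is the only place step 2 errors can be caught.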

Free transcription tools (links only — verified against official sites):
- Many phones have built-in transcription, though what is offered depends on the device and OS version. Two examples are Apple's Live Voicemail transcription on iPhone (https://support.apple.com/en-us/guide/iphone/iphdb3c9fe97/ios) and Google's Recorder app on supported Android phones (https://support.google.com/recorder/answer/11420396). Check which option your own phone provides.
- Whisper: OpenAI's open-source transcription model. It runs locally or through free web wrappers and is generally accurate on clear audio, but it is not designed for real-time use. (https://openai.com/research/whisper)
- For this week's Studio, any tool that converts a short voice memo to a text file counts.

Live demo:
- Play a pre-recorded 90-second audio clip (a simulated meeting snippet: two people discussing a project deadline and a budget question). Show the raw transcript. Have students call out two errors they notice. Then paste the transcript into an assistant and ask for "a 3-bullet summary and any action items." Read the result. Point out anything the AI added that wasn't in the audio.

The key discipline: the AI cannot hear the audio — it only reads the transcript. Garbage in, garbage out, plus hallucination on top. Check the transcript before you trust the summary.

Segment 4 — Misconceptions + Quick Interaction (22 min) · Session 1 closes (~75)

Name the misconceptions, then cure each:

❌ "Chatbots are text-only — you can't use them with audio or images."
✅ Cure: most major assistants now support multimodal input. ChatGPT, Claude, and Gemini all accept images and/or audio in at least some modes; transcription tools convert audio to text that any assistant can process.
❌ "Transcription is always accurate."
✅ Cure: transcription accuracy depends on audio quality, accent, background noise, and technical vocabulary. It is never perfectly accurate. Always review the transcript.
❌ "An AI 'reading' an image truly sees like a human."
✅ Cure: an AI analyzing an image performs pattern recognition on pixel data — it doesn't perceive depth, context, or meaning the way human vision does. It can make confident but wrong statements about an image's content.
❌ "If the summary sounds right, the AI captured the meeting accurately."
✅ Cure: a summary that sounds fluent and coherent can still omit critical decisions, merge two distinct points, or invent a conclusion that was never reached. Cross-check the summary against the transcript.
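
One narrow part of that cross-check can be automated. The sketch below is a first-pass aid under stated assumptions, not a substitute for reading: it pulls every number out of a summary and flags any that never appear in the transcript. The function names and the sample text are hypothetical, and numbers written as words ("three thousand") are not detected.

```python
import re

# Digits with optional $ sign, thousands commas, and decimal part.
NUMBER = re.compile(r"\$?\d[\d,]*(?:\.\d+)?")


def numbers_in(text: str) -> set[str]:
    """Return the numbers found in text, without $ signs or commas."""
    return {m.group().lstrip("$").replace(",", "") for m in NUMBER.finditer(text)}


def unsupported_numbers(summary: str, transcript: str) -> set[str]:
    """Numbers the summary states that the transcript never mentions."""
    return numbers_in(summary) - numbers_in(transcript)


# Hypothetical transcript and AI summary.
transcript = "Dana: the budget is $3,000 and the draft is due March 14."
summary = "Budget approved at $30,000. Draft due March 14."

print(sorted(unsupported_numbers(summary, transcript)))
```

An empty result does not mean the summary is right: a number can be present in both texts and still be attached to the wrong item, and a transcription error that the summary copies faithfully will pass this check.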

Interaction — Spot the Error (rapid-fire, ~10 min):
Project a short, fictional 5-line meeting transcript with two obvious transcription errors embedded (e.g., "budget of $3,000" transcribed as "budget of $30,000"; a name spelled wrong). Students identify both errors individually (30 sec), compare with a neighbor (1 min), report out. Then project the AI's summary of that flawed transcript. Ask: "Did the AI catch the errors, or did it incorporate them as facts?"

Segment 5 — Image-to-Text, Handwriting-to-Text, Image Analysis (20 min) · Session 2 opens

Hook back in: "Last session: what your AI can hear. This session: what it can see."

Image-to-text (OCR-class tasks): uploading a photo of a printed document, receipt, sign, or label and asking the AI to extract the text. Modern multimodal assistants (ChatGPT with vision, Claude, Gemini) can do this fairly reliably for clean, legible printed text.

Handwriting-to-text: a more demanding version — uploading a photo of handwritten notes, a whiteboard, a form, or a sticky note. Accuracy depends heavily on handwriting clarity and image quality. Works well for neat, clear handwriting; fails more often on cursive, cramped, or stylized script.

Image analysis: asking the AI to describe, interpret, or answer questions about an image — "What's in this photo?", "What does this chart show?", "Is this a rash I should be concerned about?" The AI generates plausible descriptions but may miss, misidentify, or confidently mis-state elements.

Live demo:
- Upload a clear photo of a printed receipt to a multimodal assistant. Ask it to extract the line items and total. Show the result. Note any errors.
- Upload a photo of a whiteboard with a few handwritten notes. Ask the AI to transcribe them. Compare.
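
For the receipt demo, a quick arithmetic check shows students one way to "note any errors" without rereading every line. The sketch below is illustrative only: `check_receipt` is a hypothetical helper, the prices are invented, and it assumes the assistant's output has already been typed or pasted into a dictionary of item names and price strings.

```python
from decimal import Decimal


def check_receipt(line_items: dict[str, str], stated_total: str) -> tuple[bool, Decimal]:
    """Add up the extracted prices and compare with the extracted total."""
    computed = sum((Decimal(price) for price in line_items.values()), Decimal("0"))
    return computed == Decimal(stated_total), computed


# Hypothetical values as an assistant might extract them from a receipt photo.
# Suppose the printed bagel price is 3.25 and the assistant misread it as 8.25.
extracted = {"Latte": "4.50", "Bagel": "8.25", "Tax": "0.62"}
ok, computed = check_receipt(extracted, "8.37")

print("items add up to total:", ok, "| computed:", computed)
```

A mismatch tells you something was misread, but not which line. A match is weaker evidence than it looks, since two errors can cancel out or the total itself can be misread, so the individual lines still need a glance against the photo.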

Critical limit: the AI cannot look up the image's context or history. It only sees the pixels you provide. Never rely on AI image analysis for medical, legal, or safety decisions without independent verification.

Segment 6 — Document/PDF Analysis, Image Creation, and Verify (20 min)

Document and PDF analysis: most major AI assistants now allow you to upload a PDF, spreadsheet, or document file and then ask questions about it — "What's the main argument?", "Summarize the budget table", "List every deadline mentioned." This is powerful for long documents you need to work through quickly.

What to check:
- The AI may not read every page of a very long document.
- It may summarize in ways that emphasize some parts and ignore others.
- Numbers and dates extracted from tables can be wrong — always spot-check.

Image creation: text-to-image AI systems (such as DALL·E at openai.com, Midjourney at midjourney.com, Adobe Firefly at firefly.adobe.com, and Google's Imagen integrated into Gemini) generate images from a text description. The quality and style vary; the tools are improving rapidly.

What image creation can and can't do:
- Can: generate concept art, illustrations, design variations, backgrounds, visual brainstorms quickly and cheaply.
- Can't: reliably reproduce a specific person's likeness (and you should not generate images of real people without their consent), guarantee output free of copyright-adjacent content, or produce legal or professional graphics that are ready to use without human review.

The verify-the-AI moment (technology workflow):

Ask ChatGPT or Claude to describe a specific, well-known image (e.g., a famous historical photograph) without uploading it — just ask by name. It will produce a confident, detailed description. Then look up the actual image. How accurate was the description? What did it miss or invent?

This is the core discipline for image analysis: the AI can only work with what's in front of it, and even then it may confabulate. When it's working from memory or general knowledge rather than an actual uploaded file, the risk of fabrication is even higher.

Memory hook: "Upload the document, not just the title — and still check the key numbers."

Segment 7 — Tool → Modality Matching + the Landscape (20 min)

The map students need to leave class with:

You want to…	The right modality / tool type
Speak a prompt instead of typing	Voice input in a chatbot (ChatGPT, Claude, Gemini)
Convert a recording to text	Audio transcription (Whisper-class; built-in phone tools)
Get a meeting summary and action items	Transcribe first, then analyze with a chatbot
Extract text from a printed photo	Multimodal chatbot with image upload (ChatGPT, Claude, Gemini)
Read messy handwriting	Multimodal chatbot (quality varies — test and verify)
Ask questions about a PDF or spreadsheet	Upload to a multimodal chatbot; spot-check numbers
Generate an image from a description	Image-generation tool (DALL·E, Midjourney, Adobe Firefly, Imagen in Gemini)

Key point: these are often separate tools or modes, not one button. A transcription tool doesn't analyze; a chatbot doesn't always transcribe; an image generator doesn't analyze images. Knowing the right tool for the right step is the skill.
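
The "separate tools, separate steps" idea can be written down as a small lookup. This is a sketch for discussion, assuming the table above is reduced to (input, output) pairs; the key wording and the `plan` helper are invented for the example.

```python
# (what you have, what you want) -> tool type, paraphrased from the table above.
TOOL_FOR = {
    ("spoken prompt", "answer"): "voice input in a chatbot",
    ("audio recording", "transcript"): "audio transcription tool",
    ("transcript", "summary"): "chatbot",
    ("printed photo", "text"): "multimodal chatbot with image upload",
    ("handwriting photo", "text"): "multimodal chatbot (test and verify)",
    ("PDF or spreadsheet", "answer"): "multimodal chatbot with file upload (spot-check numbers)",
    ("text description", "image"): "image-generation tool",
}


def plan(steps: list[tuple[str, str]]) -> list[str]:
    """Return the tool type for each (input, output) step, in order."""
    return [TOOL_FOR[step] for step in steps]


# A meeting summary is two steps with two different tool types.
meeting_summary = plan([("audio recording", "transcript"), ("transcript", "summary")])
print(meeting_summary)
```

A step that is not in the table raises a `KeyError`, which mirrors the classroom point: if you cannot name the input and the output of a step, you have not yet picked a tool for it.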

Name tools factually (link only to official sites):
- ChatGPT multimodal: https://chatgpt.com
- Claude (image/document upload): https://claude.com
- Gemini (multimodal): https://gemini.google.com
- DALL·E (image generation, available through ChatGPT): https://openai.com/dall-e-3
- Midjourney: https://midjourney.com
- Adobe Firefly: https://firefly.adobe.com
- OpenAI Whisper (transcription): https://openai.com/research/whisper

(No version or price claims — these tools evolve rapidly; always check the official site.)

Segment 8 — Callback + Tease + Hand-off (18 min) · Session 2 closes (~75)

AI-critique moment (students verify, not consume):

Project this scenario: "A student uploads their 12-page research paper to Claude, asks 'Are there any unsupported claims?', and posts the AI's list of concerns directly to their professor as their own review." Ask: "What are the failure modes here? What should the student actually do?"
Expected answers: the AI may miss unsupported claims it lacks the context to recognize, flag claims that are actually fine, or raise concerns that aren't relevant. The student should use the AI's output as a starting point for their own re-read, not as a final answer.

Callback:
- Weeks 5 and 6 had you building prompts and running simulations; those were text-in / text-out skills. Now the input can be a voice, an image, or a file. The same rules apply: clear input → better output; always verify the result.
- The modality changes; the verification habit doesn't.

Tease next week: "Week 8 is the midterm — Objectives 1–3 (Weeks 1–7). The study guide, exam-prep tutorial, and practice exam are all in the module. Use this week's quiz and the modality map above to lock in Objective 3."

Hand-off (the week's graded work):
- Lecture Tutorial 7 (AI tutor, share-link submission) — voice prompting, the transcription workflow, modality matching.
- Quiz 7 (no AI), Discussion 7 ("AI Image Generation: Tool or Threat?"), and Assignment 7 (multimodal workflow design + scenario analysis).
- AI Build Studio 7 — "Record → Transcribe → Summarize" — the three-step workflow, catch the errors, fix the summary.

Instructor FAQ — Common Stumbles

Student says / does	Quick cure
"My chatbot doesn't have a microphone button."	Voice input availability varies by device, browser, and account. If it's not showing, check the mobile app or a different browser. This week's Studio doesn't require voice input to the chatbot — just a recording app.
Assumes transcription is 100% accurate.	Play the demo audio again and count the errors together. "How many words did it get wrong?" usually lands the lesson.
Submits AI summary without checking the transcript.	Ask: "Did the AI catch the $30,000 vs $3,000 error we saw in class?" If not, the summary isn't trustworthy — and that's the lesson.
Thinks "image analysis" means the AI knows everything about the image.	It knows what's visible in the pixels you uploaded. It has no memory of where the image came from or who took it. Test it: upload a photo it should "know" (a landmark) and see what it misses.
Confused about image creation vs image analysis.	Creation = text in → image out. Analysis = image in → text out. Different direction, sometimes different tool.
Asks whether AI image generators are legal to use commercially.	"That's a live legal question — different tools have different terms of service, and the law is evolving. Check the official tool's ToS for your specific use. This course doesn't give legal advice."

Scope flag

This outline stays within Objective 3 at the workflow and tool-matching level. Deep technical explanations of how vision transformers or speech recognition systems work are not covered here (they'd require math prerequisites this course doesn't have). The legal and ethical dimensions of image generation are introduced here and deepened in Week 15 (ethics/privacy/IP). Real products are named factually per VERIFIED_FACTS.md §5; the instructor and institution are fictional; no pricing claims are made; tool links point to official sites only.

~ Prof. Quinn's edition · Fall 2026 · built with thecoursemaker.com