Week 7 — Readings & Resources · Multimodal AI: Voice, Audio, Images & Documents
Course: Using Artificial Intelligence (AI 101) · Silver Oak University (fictional sample) · Prof. Quinn
Objective covered: Objective 3 — Use the full range of AI modalities and match the right tool to the right task.
How to use this page
Everything here is a link to an external resource; open it in your browser. Nothing needs to be downloaded or installed for the readings. The only optional install is a mobile app if you want to try voice input (group ①).
This week's load is light: 2 short readings plus 6 reference pages (tool pages and support articles), grouped by the ideas from the lecture. Read one item per group and you're ready for the quiz; do all of them and you'll be very comfortable. Total time is roughly 35–45 minutes if you do everything, far less if you pick one per group.
Order that matches the lecture: ① voice prompting → ② audio transcription → ③ image and document analysis → ④ image creation → ⑤ free tools for Studio 7.
A habit to carry forward: before you trust any output from a multimodal workflow — transcription, image analysis, document summary — ask the questions from class: Did the AI hear me correctly? Did the transcription introduce errors? Did the summary add anything that wasn't in the source?
① Voice Prompting — Using AI Hands-Free
Maps to Lecture Segment 2. Voice input first transcribes your speech to text, then processes it. What the AI heard may not be what you said.
Reference — ChatGPT voice mode
🔗 https://help.openai.com/en/articles/8400625-voice-mode-faq
Why it's here: OpenAI's official FAQ explains how voice mode works in ChatGPT — what it can do, what it can't, and what languages it supports. Read the first two sections.
⏱ ~5 min
Reference — Using Claude on mobile (voice input)
🔗 https://claude.ai/download
Why it's here: Claude's mobile app supports voice input, and this page links to the iOS and Android downloads. Try it before class on Thursday.
⏱ ~3 min to download/explore
② Audio Transcription — Record, Then Convert
Maps to Lecture Segment 3. Transcription converts audio to text. Errors in the transcript flow into the AI's summary. Always review the transcript first.
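Optional, for students comfortable with Python: the "review the transcript first" habit can be sketched as a word-by-word comparison between what you meant to say and what the tool wrote down. This is a minimal illustration, not part of any tool listed here; the function name and the sample sentences are made up for the example.

```python
import difflib


def transcript_differences(spoken: str, transcript: str) -> list[tuple[str, str]]:
    """Return (spoken_words, transcribed_words) pairs where the two texts differ."""
    # Compare word lists, ignoring capitalization.
    a = spoken.lower().split()
    b = transcript.lower().split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


# Hypothetical example: one name was misheard.
spoken = "Send the report to Dana by Friday"
transcript = "Send the report to Donna by Friday"
print(transcript_differences(spoken, transcript))
```

A single misheard name like this would be carried straight into any summary built from the transcript, which is why the review step comes first.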
Reading — "Introducing Whisper" (OpenAI)
🔗 https://openai.com/research/whisper
Why it's assigned: a clear, non-technical explanation of what Whisper is (an open-source speech-recognition model from OpenAI), what languages it handles, and where it falls short. This is the same technology powering many free transcription tools. Read the intro section; you don't need the technical charts.
⏱ ~8 min
Reference — Android Recorder app (Google)
🔗 https://support.google.com/recorder/answer/11420396
Why it's here: Android's built-in Recorder app can transcribe recordings in real time (supported languages vary). Useful for Studio 7 if you have an Android device.
⏱ ~4 min
(iOS users: on recent iOS versions, both the Notes and Voice Memos apps can transcribe recordings. In Notes, start a New Audio Recording; for live on-screen captions, check Settings → Accessibility → Live Captions.)
③ Image and Document Analysis — What AI Can See
Maps to Lecture Segments 5–6. AI image analysis works on pixel data — it can be confident and wrong. Always upload the actual file, and spot-check key numbers from documents.
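Optional, for students comfortable with Python: "spot-check key numbers" can be sketched as pulling every number out of an AI summary and flagging the ones that never appear in the source text. This is a rough sketch under simple assumptions (numbers are written as digits, and the source text is available as plain text); the function name and sample text are invented for illustration.

```python
import re

# Matches digit runs with optional comma or decimal groups, e.g. 12, 4,200, 3.5
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def unsupported_numbers(source: str, summary: str) -> list[str]:
    """Return numbers that appear in the summary but nowhere in the source."""
    source_numbers = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(summary) if n not in source_numbers]


# Hypothetical example: the summary changed one figure.
source = "Revenue rose to 4,200 units in 2023, up 12% from the prior year."
summary = "Revenue reached 4,500 units in 2023, a 12% increase."
print(unsupported_numbers(source, summary))
```

An empty result does not prove the summary is right; it only means no new figures were introduced. You still need to read the flagged and unflagged numbers in context.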
Reading — "GPT-4 with vision" capabilities overview (OpenAI)
🔗 https://platform.openai.com/docs/guides/vision
Why it's assigned: OpenAI's official documentation describes what vision-capable models can do (describe images, read text, analyze charts) and clearly states what they can't (read URLs in images, process very small text reliably, handle certain image types). The "Limitations" section is the most important part for this course.
⏱ ~10 min (focus on the Overview and Limitations sections)
④ Image Creation — Text In, Image Out
Maps to Lecture Segment 6. Text-to-image tools generate new images from a text description. They are creative tools with real limitations around accuracy, likeness, and terms of use.
Reference — DALL·E (OpenAI)
🔗 https://openai.com/dall-e-3
Why it's here: the official landing page for OpenAI's image-generation model, accessible through ChatGPT. Read about what it does; the Discussion this week is about whether tools like this are a creative opportunity, a threat to artists, or both.
⏱ ~5 min
Reference — Adobe Firefly
🔗 https://firefly.adobe.com
Why it's here: Adobe's generative AI image tool, positioned as designed with creators in mind. Compare the framing here with the DALL·E page — each tool positions itself differently on the creator/artist question.
⏱ ~5 min
Reference — Midjourney
🔗 https://midjourney.com
Why it's here: one of the most-used image generation platforms; primarily Discord-based but with a growing web interface. It is named in the course for reference, not as an endorsement. Explore the gallery and the docs.
⏱ ~3 min
⑤ Free Tools for Studio 7 — Your Recorder and Transcription Options
Maps to AI Build Studio 7. You need (a) a way to record a short voice memo and (b) a way to get a text transcript. Free options are listed below. Pick whatever works on your device.
Recording:
- iPhone/iPad: use the built-in Voice Memos app. Tap the red button; tap again to stop. The file saves automatically.
- Android: use the built-in Recorder app (Google Pixel) or Voice Recorder (Samsung) or similar built-in app.
- Laptop: use QuickTime Player on Mac (File → New Audio Recording) or the Sound Recorder / Voice Recorder app on Windows.
Transcription:
- Android Recorder app: https://support.google.com/recorder/answer/11420396 — transcribes live (Pixel) or on playback.
- Whisper (free web wrappers): search "Whisper transcription free" for web apps that accept audio file upload and return a text transcript. No specific third-party site is endorsed here — evaluate the app's privacy policy before uploading personal audio.
- AI assistant with audio upload (ChatGPT Plus, Claude): if you have a paid account, you can upload an audio file directly. Free-tier accounts may not support direct audio upload — use a separate transcription step first.
- Your phone's built-in keyboard dictation works for very short memos: open a notes app, tap the microphone key, and speak. The keyboard types your words as you talk, so recording and transcription happen in one step.
Quick path (≈12 min total)
In a hurry? Do exactly these two before Thursday:
1. Read "Introducing Whisper" (group ②) — understand what transcription is and where it fails.
2. Read the GPT-4 vision "Limitations" section (group ③) — understand what AI image analysis can't do.
Then make sure you have a recording app on your device ready for Studio 7.
Heads-up (links rot): these point to outside sites that occasionally move or rename pages. If a link ever fails, tell Prof. Quinn; use the tool's main homepage in the meantime. Nothing here is downloaded or redistributed — all resources stay as links to their original sources. Official tool homepages (openai.com, claude.ai, gemini.google.com, firefly.adobe.com, midjourney.com) are stable canonical anchors.
~ Prof. Quinn's edition · Fall 2026 · built with thecoursemaker.com