14 - Project 4: Multimodal Agent

LangChain

2026-03-08

Project Goal

Build a multimodal agent that supports image understanding and voice interaction.

Features

✅ Image understanding
✅ Voice input/output
✅ Multimodal fusion

Core Code

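```python
import base64

import speech_recognition as sr
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from PIL import Image

# Image understanding
# Model name kept from the original lesson; swap in any vision-capable chat
# model available to your account if this one is not.
llm = ChatOpenAI(model="gpt-4-vision-preview")


def analyze_image(image_path):
    # Open the file with Pillow to confirm it is a valid image and get its MIME type.
    with Image.open(image_path) as image:
        mime_type = Image.MIME.get(image.format, "image/jpeg")
    # A local path is not a URL, so the file is sent inline as a base64 data URL.
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    message = HumanMessage(content=[
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
    ])
    response = llm.invoke([message])
    return response.content


# Speech recognition
def voice_to_text():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    # Language code kept from the original; use e.g. "en-US" for English speech.
    return recognizer.recognize_google(audio, language="zh-CN")


# Multimodal interaction
while True:
    mode = input("Select mode (1: text, 2: voice, 3: image): ")

    if mode == "1":
        text = input("You: ")
    elif mode == "2":
        text = voice_to_text()
    elif mode == "3":
        image_path = input("Image path: ")
        # The image description becomes the text passed to the chat model below.
        text = analyze_image(image_path)
    else:
        # Unrecognized choice: ask again instead of using an undefined `text`.
        print("Please enter 1, 2, or 3.")
        continue

    response = llm.invoke(text)
    print(f"AI: {response.content}")
```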
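The interaction loop mixes input collection with model calls, which makes it hard to try out without a microphone or an API key. One possible restructuring, offered as a sketch rather than as the lesson's own design, routes each mode through a handler table. The names `collect_input`, `Handler`, and `stub_handlers` are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical handler type: takes no arguments and returns the user's text.
Handler = Callable[[], str]


def collect_input(mode: str, handlers: Dict[str, Handler]) -> Optional[str]:
    """Return the text produced by the handler for `mode`, or None if the mode is unknown."""
    handler = handlers.get(mode.strip())
    if handler is None:
        return None
    return handler()


# Stub handlers stand in for input(), voice_to_text(), and analyze_image()
# so the routing can be exercised without a microphone or network access.
stub_handlers: Dict[str, Handler] = {
    "1": lambda: "typed text",
    "2": lambda: "transcribed speech",
    "3": lambda: "image description",
}

if __name__ == "__main__":
    for choice in ["1", "3", "9"]:
        print(choice, "->", collect_input(choice, stub_handlers))
```

In the real loop, the table would presumably map "1" to a function wrapping `input`, "2" to `voice_to_text`, and "3" to a function that asks for a path and calls `analyze_image`. A `None` result signals that the user should be prompted again.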

Lesson Summary

Image understanding with GPT-4 Vision
Speech recognition and synthesis
Multimodal fusion
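
The summary lists speech synthesis, while the core code only covers recognition. Below is a minimal sketch of the output side. It assumes the third-party `pyttsx3` package as an offline text-to-speech engine, which the original lesson does not use. The function name `speak` and its `engine` parameter are hypothetical.

```python
def speak(text, engine=None, rate=150):
    """Read `text` aloud and return the text that was actually spoken."""
    text = text.strip()
    if not text:
        return ""  # nothing to say
    if engine is None:
        # Assumption: pyttsx3 is installed. It is imported lazily so the rest
        # of the program still works on machines without it.
        import pyttsx3
        engine = pyttsx3.init()
    engine.setProperty("rate", rate)  # speaking rate in words per minute
    engine.say(text)
    engine.runAndWait()  # blocks until playback finishes
    return text
```

In the interaction loop, a call such as `speak(response.content)` could follow the `print` line so that each reply is both shown and spoken. Passing an `engine` object explicitly makes it possible to reuse one engine across turns or to substitute a test double.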

Next lesson: 15 - Deployment and Optimization