LiveKit Agents 构建实时、多模态的 AI 代理，GPT-4 Omni 也用了一样的 RTC LiveKit！

2024-05-18 02:55#1 标记1

LiveKit Agents[1] 是一个端到端的框架，用于构建实时、多模态的人工智能 “代理”，通过语音、视频和数据通道与最终用户进行交互。该框架允许您使用 Python 构建代理。

LiveKit Agents 的特点
LiveKit 音频/视频传输：使用相同的 LiveKit API[2] 将语音和视频从客户端设备实时传输到应用服务器。
简化常见任务：简化了语音转文本、文本转语音和使用 LLM 等任务，因此您可以专注于核心应用程序逻辑。
广泛且可扩展的插件：预置了与 OpenAI、DeepGram、Google 和 ElevenLabs 等的集成。您还可以自定义插件来集成其他提供商。
端到端的开发体验：与 LiveKit[3] 服务器和 LiveKit Cloud[4] 兼容。本地开发并部署到生产中，无需更改任何代码。
编排和扩展：内置 Worker 服务，用于代理编排和负载平衡。要扩展，只需添加更多 Worker。
开源：与 LiveKit 一样，Agent 也是 Apache 2.0。
边缘优化：使用 LiveKit Cloud 时，您的代理将利用 LiveKit 的全球边缘网络。代理会在靠近终端用户的地方运行，从而减少延迟，让您有更多时间进行推理。
LiveKit Agents 应用场景
Agents 的设计目的是为您在构建服务器端应用时提供极大的灵活性。您可以用它创建各种应用程序，包括：
使用 LLM 进行语音和视频聊天
实时语音转文字
通过实时视频进行物体检测/识别
生成人工智能驱动的头像
混合人工智能和人工座席的联络中心或服务台解决方案
实时翻译
实时视频过滤器和转换
LiveKit Agents 生命周期
Worker 注册：当你的代理程序运行时，它会首先连接到 LiveKit 服务器，并通过持久的 WebSocket 连接将自己注册为 “工作程序”。一旦注册成功，工作程序就会处于待命状态，等待 “工作 ”请求的到来。
代理调用：当创建一个房间时，LiveKit 服务器会逐一通知已注册的工人有关作业的信息。第一个接受任务的工人将实例化你的代理并让它加入房间。一个工人可以同时管理多个代理实例。
应用逻辑：这是您的应用程序接管的地方。您的代理可以通过 Python SDK 使用大多数 LiveKit 客户端功能。代理还可以利用插件生态系统来处理或合成音频和视频数据。
关闭房间：当最后一位参与者（不包括您的代理）离开房间时，您的代理实例也将从房间断开连接。

LiveKit Agents 快速上手
下面，我们将使用 LiveKit、Python 和 NextJS 构建一个可进行实时对话的人工智能语音助手。
本快速入门教程将引导您完成使用 Python 和 NextJS 构建对话式 AI 应用程序的步骤。它使用 LiveKit 的 Agents SDK 和 React Components Library 来创建一个可以与用户进行实时对话的人工智能语音助手。最后，您将拥有一个可以运行和交互的基本对话式人工智能应用程序。

前置条件
LiveKit Cloud Project[5] 或 open-source LiveKit server[6]
ElevenLabs API Key[7]
Deepgram API Key[8]
OpenAI API Key[9]
Python 3.10+[10]
1. 设置开发环境
设置以下的环境变量：
export LIVEKIT_URL=<your LiveKit server URL>export LIVEKIT_API_KEY=<your API Key>export LIVEKIT_API_SECRET=<your API Secret>export ELEVEN_API_KEY=<your ElevenLabs API key>export DEEPGRAM_API_KEY=<your Deepgram API key>export OPENAI_API_KEY=<your OpenAI API key>
配置一个 Python 虚拟环境：
python -m venv venvsource venv/bin/activate
安装项目所需的 Python 依赖包：
pip install \livekit \livekit-agents \livekit-plugins-deepgram \livekit-plugins-openai \livekit-plugins-elevenlabs \livekit-plugins-silero
2. 创建服务器代理
新建一个 main.py 文件，并输入以下代码：
import asyncioimport loggingfrom livekit.agents import JobContext, JobRequest, WorkerOptions, clifrom livekit.agents.llm import (    ChatContext,    ChatMessage,    ChatRole,)from livekit.agents.voice_assistant import VoiceAssistantfrom livekit.plugins import deepgram, elevenlabs, openai, silero# This function is the entrypoint for the agent.async def entrypoint(ctx: JobContext):    # Create an initial chat context with a system prompt     initial_ctx = ChatContext(        messages=[            ChatMessage(                role=ChatRole.SYSTEM,                text="You are a voice assistant created by LiveKit. Your interface with users will be voice. Pretend we're having a conversation, no special formatting or headings, just natural speech.",            )        ]    )    # VoiceAssistant is a class that creates a full conversational AI agent.    # See https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/voice_assistant/assistant.py    # for details on how it works.    assistant = VoiceAssistant(        vad=silero.VAD(), # Voice Activity Detection        stt=deepgram.STT(), # Speech-to-Text        llm=openai.LLM(), # Language Model        tts=elevenlabs.TTS(), # Text-to-Speech        chat_ctx=initial_ctx, # Chat history context    )    # Start the voice assistant with the LiveKit room    assistant.start(ctx.room)    await asyncio.sleep(3)    # Greets the user with an initial message    await assistant.say("Hey, how can I help you today?", allow_interruptions=True)# This function is called when the worker receives a job request# from a LiveKit server.async def request_fnc(req: JobRequest) -> None:    logging.info("received request %s", req)    # Accept the job tells the LiveKit server that this worker    # wants the job. After the LiveKit server acknowledges that job is accepted,    # the entrypoint function is called.    await req.accept(entrypoint)if __name__ == "__main__":    # Initialize the worker with the request function    cli.run_app(WorkerOptions(request_fnc))
有些插件需要先下载附加文件。例如，Silero 插件需要下载模型权重。运行以下命令后，所有导入的插件都将下载它们所需的文件：
python main.py download-files
3. 运行服务器代理
运行以下命令，启动服务器代理：
python main.py start
运行上述命令后，worker 将开始监听来自 LiveKit 服务器的作业请求。您可以以相同的方式运行许多工作人员来扩展代理，LiveKit 的服务器将平衡它们之间的请求。
4. 配置前端环境
默认情况下，创建房间时会创建代理任务请求。我们将创建一个 NextJS 应用程序，让人类参与者加入一个新房间，与人工智能代理对话。
首先构建一个 NextJS 项目：
npx create-next-app@latest
接着，安装 LiveKit 依赖：
npm install @livekit/components-react @livekit/components-styles livekit-client livekit-server-sdk
配置 .env.local 文件中的环境变量：
export LIVEKIT_URL=<your LiveKit server URL>export LIVEKIT_API_KEY=<your API Key>export LIVEKIT_API_SECRET=<your API Secret>
5. 创建一个 access token 的接口
新建一个 src/app/api/token/route.ts 文件并输入以下代码：
import { AccessToken } from 'livekit-server-sdk';export async function GET(request: Request) {  const roomName = Math.random().toString(36).substring(7);  const apiKey = process.env.LIVEKIT_API_KEY;  const apiSecret = process.env.LIVEKIT_API_SECRET;  const at = new AccessToken(apiKey, apiSecret, {identity: "human_user"});  at.addGrant({    room: roomName,    roomJoin: true,    canPublish: true,    canPublishData: true,    canSubscribe: true,  });  return Response.json({ accessToken: await at.toJwt(), url: process.env.LIVEKIT_URL });}
6. 创建 UI
新建 src/app/page.tsx 文件并输入以下代码：
'use client';import {  LiveKitRoom,  RoomAudioRenderer,  useLocalParticipant,} from '@livekit/components-react';import { useState } from "react";export default () => {  const [token, setToken] = useState<string | null>(null);  const [url, setUrl] = useState<string | null>(null);  return (    <>      <main>        {token === null ? (<button onClick={async () => {          const {accessToken, url} = await fetch('/api/token').then(res => res.json());          setToken(accessToken);          setUrl(url);        }}>Connect</button>) : (          <LiveKitRoom            token={token}            serverUrl={url}            connectOptions={{autoSubscribe: true}}          >            <ActiveRoom />          </LiveKitRoom>        )}      </main>    </>  );};const ActiveRoom = () => {  const { localParticipant, isMicrophoneEnabled } = useLocalParticipant();  return (    <>      <RoomAudioRenderer />      <button onClick={() => {        localParticipant?.setMicrophoneEnabled(!isMicrophoneEnabled)      }}>Toggle Microphone</button>      <div>Audio Enabled: { isMicrophoneEnabled ? 'Muted' : 'Unmuted' }</div>    </>  );};
7. 运行前端应用
在终端输入以下命令，启动前端应用：
npm run dev
之后，可以通过 http://localhost:3000 地址访问应用。
8. 与 AI 增强的应用交互
服务器代理和前端应用程序启动并运行后，您就可以开始与代理对话、提问或讨论任何您想讨论的话题。当您通过网络用户界面加入一个房间时，代理任务请求会自动发送给您的工作员。工作人员接受任务后，人工智能驱动的代理就会加入房间，随时准备进行对话。代理会聆听您的声音，使用 Deepgram 的语音到文本 (STT) 技术处理您的语音，使用 OpenAI 的高级语言模型生成回复，并使用 ElevenLabs 的文本到语音 (TTS) 服务进行口头回复。
往期文章
2024 年最完整的 AI Agents 清单来了，涉及 13 个领域，上百个 Agents！
当 GPT-4o 遇上 OpenGlass：将任何眼镜变成超级 AI 智能眼镜！
当 AI 遇上爬虫：让数据提取变得前所未有地简单！
告别传统客服：支持十几个平台的 AI “懒人客服” 来了！
卖货主播大模型开源了，让销冠触手可及！
Kimi+麦肯锡，5 秒摸透一个行业！
Kimi 10 秒生成流程图，别再手动绘图了！
万字长文秒变精华！Kimi 的超强提示词秘籍
阿里开源的自动化视频剪辑工具，太好用了！
开源，11K Star！一个主题或关键词就能自动生成一条高清的短视频，无侵权风险！
参考资料
[1]
LiveKit Agents: https://github.com/livekit/agents[2]
LiveKit API: https://docs.livekit.io/realtime/concepts/api-primitives/[3]
LiveKit: https://github.com/livekit/livekit[4]
LiveKit Cloud: https://cloud.livekit.io/[5]
LiveKit Cloud Project: https://cloud.livekit.io/[6]
open-source LiveKit server: https://docs.livekit.io/home/self-hosting/local/[7]
ElevenLabs API Key: https://elevenlabs.io/dashboard/api-keys[8]
Deepgram API Key: https://console.deepgram.com/[9]
OpenAI API Key: https://platform.openai.com/api-keys[10]
Python 3.10+: https://www.python.org/