Real-Time Voice Detection with Vector TTS in Gemini Live API


The convergence of Voice Activity Detection (VAD), real-time Text-to-Speech synthesis, and contextual vector retrieval is an important step forward in conversational AI.

Picture an AI tutor that waits until you finish speaking before replying in natural speech, or a support agent assistant that detects silence, fetches the right policy from a knowledge base, and answers instantly. Achieving this requires far more than simple API calls: it means streaming audio in real time, detecting pauses with millisecond precision, retrieving context vectors, generating responses, and returning speech within sub-second latency.

Platforms such as the Gemini Live API make this orchestration possible, enabling developers to build applications that feel less like tools and more like collaborators.

Architectural decision: why Gemini Live API

The selection of Gemini Live API as the core TTS engine stems from several critical technical advantages that directly address the fundamental challenges of real-time voice applications with contextual enhancement.

Native streaming architecture: Unlike traditional batch-processing TTS models, Gemini Live API implements true streaming synthesis that generates audio progressively as text becomes available. 

This streaming approach eliminates the typical latency bottleneck where the entire response must be synthesised before audio playback begins. The architecture enables audio chunk transmission to commence within 50-100 milliseconds of text generation, significantly reducing perceived response time in conversational scenarios.

Integrated VAD processing: The API's built-in automatic activity detection eliminates the need for external VAD implementation, reducing system complexity while providing optimised voice activity detection specifically tuned for the model's acoustic characteristics. 

The integrated approach ensures temporal synchronisation between voice detection and synthesis processes, preventing the timing misalignments that commonly plague multi-component VAD systems.

Context-aware synthesis: Gemini's multi-modal understanding enables sophisticated prosodic adaptation based on retrieved document context. The model can adjust speaking style, emphasis, and pacing based on the type of information being conveyed, whether reading technical documentation, providing casual explanations, or delivering urgent information. 

This contextual awareness proves particularly valuable when synthesising content from diverse document sources with varying formality levels.

WebSocket-native communication: The API's native WebSocket support enables full-duplex communication essential for real-time voice applications. 

This bidirectional capability allows simultaneous audio transmission, interruption handling, and metadata exchange without the overhead of HTTP request-response cycles that would introduce prohibitive latency in conversational scenarios.

Voice activity detection and real-time processing

The implementation of effective VAD in a vector-enhanced TTS system requires sophisticated coordination between multiple processing streams operating at different temporal scales.

VAD configuration and setup

The Gemini Live API exposes automatic activity detection through configuration parameters that allow voice detection behaviour to be fine-tuned:

const setupMessage = {
    setup: {
        realtime_input_config: {
            automatic_activity_detection: {
                disabled: false,
                prefix_padding_ms: 100,
                silence_duration_ms: 300
            }
        }
    }
};

This configuration establishes critical VAD parameters where prefix_padding_ms captures speech onset events that might otherwise be truncated, while silence_duration_ms determines the threshold for detecting speech completion. These parameters directly impact conversational flow by controlling when the system interprets voice input as complete and triggers response generation.

Temporal synchronisation challenges

Voice activity detection operates on millisecond-level time windows, typically processing 20-30 millisecond audio frames with 10-15 millisecond overlap. Vector retrieval operations, by contrast, operate on human-perceptible timescales of 50-200 milliseconds depending on corpus size and query complexity. The challenge lies in orchestrating these disparate temporal domains to maintain conversational flow.

The system implements a predictive processing approach where vector retrieval begins immediately upon voice activity detection, before speech recognition completes. This anticipatory processing reduces the cumulative latency of sequential operations by overlapping VAD, transcription, vector search, and response preparation phases.
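The overlap described above can be sketched with asyncio: retrieval is launched the moment VAD fires, in parallel with transcription, so total latency approaches the slower of the two rather than their sum. The coroutine names (transcribe, vector_search) and the simulated delays are illustrative stand-ins, not part of the Gemini Live API.

```python
# Sketch of anticipatory processing: vector retrieval starts on the VAD
# trigger instead of waiting for transcription to finish.
import asyncio
import time

async def transcribe(audio):
    await asyncio.sleep(0.15)          # simulated ASR latency
    return "what is the refund policy"

async def vector_search(partial_query):
    await asyncio.sleep(0.12)          # simulated ChromaDB lookup
    return ["refund policy chunk"]

async def handle_turn(audio, partial_query):
    # Launch retrieval immediately, overlapping it with transcription.
    retrieval = asyncio.create_task(vector_search(partial_query))
    transcript = await transcribe(audio)
    chunks = await retrieval           # usually already resolved by now
    return transcript, chunks

start = time.perf_counter()
transcript, chunks = asyncio.run(handle_turn(b"...", "refund"))
# Overlapped latency is close to max(0.15, 0.12), not 0.15 + 0.12.
print(f"turn handled in {time.perf_counter() - start:.2f}s")
```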

Interruption handling and context preservation

Real-time voice interactions require sophisticated interruption handling that goes beyond simple audio cutoff. The server-side interruption handling demonstrates the complexity of managing multiple concurrent processes:

if response.server_content.interrupted:
    print("🛑 [Interrupted by VAD] Generation stopped - Continue speaking...")
    await client_websocket.send(json.dumps({
        "interrupted": True
    }))
    continue

When a user interrupts an ongoing response, the system must preserve conversation context while gracefully terminating multiple concurrent processes: audio synthesis, vector retrieval operations, and any pending text generation.
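A minimal sketch of that cancellation pattern, assuming an asyncio server: in-flight tasks are cancelled on barge-in, but the conversation state object survives so the next turn can build on it. The task names and the context structure are illustrative, not the application's actual internals.

```python
# Graceful interruption: cancel work-in-progress, preserve conversation state.
import asyncio

conversation_context = {"history": ["user: summarise the policy"]}

async def synthesize_audio():
    await asyncio.sleep(10)            # stands in for streaming TTS

async def retrieve_vectors():
    await asyncio.sleep(10)            # stands in for a ChromaDB query

async def on_interrupt(tasks):
    for task in tasks:
        task.cancel()                  # stop synthesis/retrieval mid-flight
    await asyncio.gather(*tasks, return_exceptions=True)
    # Only the work-in-progress is discarded; the history is untouched.
    conversation_context["history"].append("system: response interrupted")

async def main():
    tasks = [asyncio.create_task(synthesize_audio()),
             asyncio.create_task(retrieve_vectors())]
    await asyncio.sleep(0.05)          # user barges in almost immediately
    await on_interrupt(tasks)
    return all(t.cancelled() for t in tasks)

print(asyncio.run(main()))             # → True
```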

The corresponding frontend interruption handling ensures immediate audio cessation and UI feedback:

if (messageData.interrupted) {
    console.log("🛑 [Interrupted by VAD] Generation stopped - Continue speaking...");
    hideTypingIndicator();
    
    // Stop and clear audio playback immediately
    if (audioInputContext && audioInputContext.state !== "closed") {
        try {
            workletNode.port.postMessage(new Float32Array(0)); // Clear buffer
        } catch (e) {
            console.log("Error clearing audio buffer:", e);
        }
    }
    
    addMessage("system", "🛑 Response interrupted - Continue speaking");
    return;
}

WebSocket infrastructure and real-time communication

The WebSocket implementation serves as the critical communication backbone that enables the low-latency, bidirectional data exchange essential for real-time voice applications.

Audio stream processing and transmission

The frontend audio processing pipeline captures microphone input and converts it to the PCM format required for real-time transmission:

processor.onaudioprocess = (e) => {
    const inputData = e.inputBuffer.getChannelData(0);
    const pcm16 = new Int16Array(inputData.length);
    for (let i = 0; i < inputData.length; i++) {
        // Clamp before scaling so out-of-range samples don't wrap around
        const s = Math.max(-1, Math.min(1, inputData[i]));
        pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    pcmData.push(...pcm16);
};

This processing converts floating-point audio samples to 16-bit PCM format, which is then transmitted via WebSocket in base64-encoded chunks. The 4096-sample buffer size balances latency with processing efficiency, providing approximately 256 milliseconds of audio per chunk at a 16kHz sampling rate.
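The same conversion can be mirrored server-side for testing or tooling: clamp, scale to signed 16-bit, pack little-endian, and base64-encode, the way the frontend ships chunks over the WebSocket. A sketch; the helper names are illustrative.

```python
# Float samples -> 16-bit PCM -> base64, and back, matching the browser pipeline.
import base64
import struct

def floats_to_pcm16_b64(samples):
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))     # clamp to avoid int16 overflow
        ints.append(int(s * 0x8000) if s < 0 else int(s * 0x7FFF))
    return base64.b64encode(struct.pack(f"<{len(ints)}h", *ints)).decode()

def pcm16_b64_to_floats(b64):
    raw = base64.b64decode(b64)
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [i / 32768 for i in ints]

chunk = floats_to_pcm16_b64([0.0, 0.5, -0.5, 1.0, -1.0])
print(pcm16_b64_to_floats(chunk))
```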

Multiplexed data streams

The WebSocket connection carries multiple distinct data types simultaneously: compressed audio streams, vector retrieval requests, document uploads, system status messages, and control signals. The implementation utilises JSON-based message framing with MIME type identification:

for chunk in data["realtime_input"]["media_chunks"]:
    if chunk["mime_type"] == "audio/pcm":
        await session.send_realtime_input(
            audio=types.Blob(
                data=base64.b64decode(chunk["data"]),
                mime_type="audio/pcm;rate=16000"
            )
        )
    elif chunk["mime_type"] == "application/pdf":
        # Handle PDF upload processing
        pdf_data = base64.b64decode(chunk["data"])
        filename = chunk.get("filename", "uploaded.pdf")
        # Save and process document
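The envelope the handler above parses can be built with a small helper: every payload travels as a media chunk tagged with its MIME type, so one socket carries audio frames and document uploads alike. The field names follow the handler above; this is the application's own framing, not a fixed Gemini wire format.

```python
# Build the multiplexed JSON envelope consumed by the server-side handler.
import base64
import json

def make_realtime_input(payload: bytes, mime_type: str, **extra):
    chunk = {"mime_type": mime_type,
             "data": base64.b64encode(payload).decode(), **extra}
    return json.dumps({"realtime_input": {"media_chunks": [chunk]}})

audio_msg = make_realtime_input(b"\x00\x01" * 2048, "audio/pcm")
pdf_msg = make_realtime_input(b"%PDF-1.7 ...", "application/pdf",
                              filename="handbook.pdf")

data = json.loads(pdf_msg)
for chunk in data["realtime_input"]["media_chunks"]:
    print(chunk["mime_type"], chunk.get("filename"))
```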

Connection resilience and error recovery

Real-time voice applications demand robust connection management that can gracefully handle network instability without disrupting ongoing conversations. The implementation monitors connection state and surfaces failures to the user immediately:

webSocket.onclose = (event) => {
    console.log("WebSocket closed:", event);
    updateStatus("disconnected", "Disconnected");
    addMessage("system", "❌ Connection lost. Please refresh the page.");
};

webSocket.onerror = (event) => {
    console.log("WebSocket error:", event);
    updateStatus("error", "Connection Error");
    addMessage("error", "Connection error occurred.");
};
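The handlers above surface the failure; a production client would also retry. A sketch of capped exponential backoff with jitter follows — backoffDelay, scheduleReconnect, and the attempt cap are illustrative choices, not part of the Gemini Live API.

```javascript
// Capped exponential backoff with jitter for WebSocket reconnection.
function backoffDelay(attempt, baseMs = 250, capMs = 8000) {
    const exp = Math.min(capMs, baseMs * 2 ** attempt);
    return exp / 2 + Math.random() * (exp / 2); // jitter avoids thundering herd
}

function scheduleReconnect(attempt, maxAttempts, connect) {
    if (attempt >= maxAttempts) return;         // give up, ask user to refresh
    setTimeout(connect, backoffDelay(attempt));
}

// Wired into the close handler:
// webSocket.onclose = () => scheduleReconnect(attempt++, 6, openWebSocket);
```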

Vector database integration architecture

The integration of ChromaDB for contextual document retrieval introduces complex orchestration challenges that must be resolved within strict real-time constraints.

Tool function integration

The system implements function-calling capabilities that enable the TTS model to access vector database functionality through structured tool definitions:

tool_query_docs = {
    "function_declarations": [
        {
            "name": "query_docs",
            "description": "Query the document content with a specific query string.",
            "parameters": {
                "type": "OBJECT",
                "properties": {
                    "query": {
                        "type": "STRING",
                        "description": "The query string to search the document index."
                    }
                },
                "required": ["query"]
            }
        }
    ]
}

Real-time vector retrieval

The document querying function implements efficient similarity search with configurable result limits to balance accuracy with response speed:

def query_docs(query):
    try:
        index = build_index()
        if index is None:
            return "No documents available to search."
        
        # Use retriever instead of query engine to get raw text chunks
        retriever = index.as_retriever(similarity_top_k=3)
        nodes = retriever.retrieve(query)
        
        if not nodes:
            return "No relevant information found in the documents."
        
        # Return the raw text chunks for the Live API model to process
        context_chunks = []
        for i, node in enumerate(nodes, 1):
            context_chunks.append(f"Document chunk {i}:\n{node.text}")
        
        result = "\n\n".join(context_chunks)
        return result
        
    except Exception as e:
        return f"Error searching documents: {str(e)}"

The retrieval process implements a similarity-based approach that returns raw document chunks for processing by the TTS model, enabling contextually aware response generation while maintaining processing efficiency.

Function call execution pipeline

The server-side function call handling demonstrates the integration between vector retrieval and real-time TTS generation:

if response.tool_call is not None:
    function_calls = response.tool_call.function_calls
    function_responses = []

    for function_call in function_calls:
        name = function_call.name
        args = function_call.args
        call_id = function_call.id

        if name == "query_docs":
            try:
                result = query_docs(args["query"])
                function_responses.append(
                    types.FunctionResponse(
                        name=name,
                        response={"result": result},
                        id=call_id  
                    )
                ) 
            except Exception as e:
                print(f"Error executing function: {e}")

    # Send function response back to Gemini
    await session.send_tool_response(function_responses=function_responses)

WebRTC integration considerations

While the current implementation utilises WebSocket connections for simplicity and broad compatibility, the architecture readily accommodates WebRTC integration for enhanced real-time performance.

Audio processing worklet

The frontend implements Audio Worklet technology for low-latency audio processing that approaches WebRTC-level performance:

await audioInputContext.audioWorklet.addModule("pcm-processor.js");
workletNode = new AudioWorkletNode(audioInputContext, "pcm-processor");
workletNode.connect(audioInputContext.destination);

This approach enables dedicated audio processing threads that minimise latency while maintaining compatibility with WebSocket transport mechanisms.
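The pcm-processor.js module referenced above is not shown; the heart of such a worklet is a FIFO of Float32 samples fed by port.onmessage and drained once per render quantum in process(). The queue below is a standalone sketch of that logic (makeSampleQueue is an illustrative name); in the browser it would live inside an AudioWorkletProcessor subclass.

```javascript
// FIFO sample queue: push from port.onmessage, pull 128 frames per quantum.
function makeSampleQueue() {
    let buffer = new Float32Array(0);
    return {
        push(chunk) {
            // An empty chunk is the interrupt signal: flush pending audio.
            if (chunk.length === 0) { buffer = new Float32Array(0); return; }
            const merged = new Float32Array(buffer.length + chunk.length);
            merged.set(buffer);
            merged.set(chunk, buffer.length);
            buffer = merged;
        },
        pull(n) {
            const out = new Float32Array(n);   // zero-padded on underrun
            const take = Math.min(n, buffer.length);
            out.set(buffer.subarray(0, take));
            buffer = buffer.subarray(take);
            return out;
        },
        size() { return buffer.length; },
    };
}
```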

Real-time audio playback

The audio playback system converts received base64 audio data to a playable format through efficient buffer management:

function convertPCM16LEToFloat32(pcmData) {
    const inputArray = new Int16Array(pcmData);
    const float32Array = new Float32Array(inputArray.length);

    for (let i = 0; i < inputArray.length; i++) {
        float32Array[i] = inputArray[i] / 32768;
    }

    return float32Array;
}

async function injectAudioChunkToPlay(base64AudioChunk) {
    try {
        if (audioInputContext.state === "suspended") {
            await audioInputContext.resume();
        }
        
        const arrayBuffer = base64ToArrayBuffer(base64AudioChunk);
        const float32Data = convertPCM16LEToFloat32(arrayBuffer);
        workletNode.port.postMessage(float32Data);
        
        showTypingIndicator();
    } catch (error) {
        console.error("Error processing audio chunk:", error);
    }
}

Document processing and contextual enhancement

The integration of PDF document processing with real-time voice interactions requires sophisticated document handling capabilities that maintain conversational flow while enabling rich contextual retrieval.

Asynchronous document processing

PDF upload handling demonstrates the system's ability to process large documents without blocking real-time voice interactions:

elif chunk["mime_type"] == "application/pdf":
    # Save PDF file to downloads directory
    pdf_data = base64.b64decode(chunk["data"])
    filename = chunk.get("filename", "uploaded.pdf")
    
    # Create downloads directory if it doesn't exist
    os.makedirs("./downloads", exist_ok=True)
    
    # Save the PDF file
    file_path = os.path.join("./downloads", filename)
    with open(file_path, "wb") as f:
        f.write(pdf_data)
    
    # Rebuild the index with the new PDF
    if os.path.exists("./storage"):
        import shutil
        shutil.rmtree("./storage")
    build_index()
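As written, the rebuild above runs inline and would stall the event loop while the index is recomputed. A sketch of pushing that blocking work onto a thread so voice traffic keeps flowing, assuming an asyncio server — rebuild_index and heartbeat are stand-ins for the shutil/build_index sequence and the audio loop.

```python
# Offload blocking index rebuilds so the event loop keeps servicing audio.
import asyncio
import time

def rebuild_index():
    time.sleep(0.2)                    # stands in for CPU/disk-bound indexing
    return "index ready"

async def heartbeat(ticks):
    for _ in range(4):                 # keeps firing while the rebuild runs
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.04)

async def main():
    ticks = []
    result, _ = await asyncio.gather(
        asyncio.to_thread(rebuild_index),   # blocking work off the loop
        heartbeat(ticks),
    )
    return result, len(ticks)

print(asyncio.run(main()))
```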

Frontend document upload

The client-side document handling provides drag-and-drop functionality with real-time feedback:

function uploadFile(file) {
    const reader = new FileReader();

    reader.onload = (e) => {
        try {
            // Strip the "data:application/pdf;base64," prefix
            const base64PDF = e.target.result.split(',')[1];
            sendPDFMessage(base64PDF, file.name);
            addMessage("system", `📄 Uploading ${file.name}...`);
        } catch (err) {
            console.error("Error processing PDF file:", err);
            addMessage("error", `❌ Error processing ${file.name}`);
        }
    };

    reader.readAsDataURL(file);
}

Performance monitoring and analytics

Comprehensive monitoring systems track not only traditional application metrics but also conversation-specific performance indicators that directly impact user experience.

Response time decomposition enables identification of bottlenecks across the complex processing pipeline. The system tracks individual component latencies: VAD detection time, vector retrieval duration, TTS synthesis speed, and end-to-end response latency to enable targeted optimisation efforts.
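A minimal way to get that decomposition is to wrap each pipeline stage in a timing context and report per-component latencies alongside the end-to-end figure. The stage names below are illustrative, not the system's actual instrumentation.

```python
# Per-stage latency tracking via a context manager.
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000  # milliseconds

with stage("end_to_end"):
    with stage("vad"):
        time.sleep(0.01)               # stands in for VAD detection
    with stage("vector_retrieval"):
        time.sleep(0.03)               # stands in for the ChromaDB query
    with stage("tts_first_chunk"):
        time.sleep(0.05)               # stands in for time-to-first-audio

for name, ms in timings.items():
    print(f"{name}: {ms:.1f} ms")
```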

User experience metrics include conversation flow smoothness, interruption handling effectiveness, and contextual response accuracy. These qualitative measures complement traditional performance metrics to provide a holistic system health assessment.

The implementation of real-time VAD with vector-enhanced TTS represents a significant advancement in conversational AI architectures. The integration of Gemini Live API provides a robust foundation that addresses the fundamental challenges of latency, quality, and contextual awareness while maintaining the architectural flexibility necessary for production deployment. The technical challenges of orchestrating multiple concurrent processing streams require sophisticated engineering approaches that balance performance with reliability, demonstrating the feasibility of creating conversational AI applications that provide both immediate responsiveness and rich contextual understanding.

Conclusion

The convergence of VAD, real-time TTS, and contextual vector retrieval is more than a technical achievement. With Gemini Live API, businesses can deliver conversational systems that respond quickly, speak naturally, and adapt intelligently to context. This directly strengthens customer interactions by reducing response times, improving engagement, and enabling scalable, high-quality support.

For organisations, the impact goes beyond operational efficiency. The ability to create real-time, voice-driven experiences that feel personal and informed sets a new standard for digital engagement. In my view, this is where AI becomes a true collaborator for customers and teams alike, opening new opportunities for differentiation and long-term business value.

Written by Ananya Rakhecha, Tech Advocate