Archive for Deepgram

Deepgram and AWS to Present Live Webinar: “Building AI Voice Agents with Deepgram + AWS Bedrock” 

Posted in Commentary with tags on September 8, 2025 by itnerd

Deepgram is partnering with Amazon Web Services (AWS) to present a live webinar titled “Building AI Voice Agents with Deepgram + AWS Bedrock” tomorrow, Tuesday, September 9, from 10:30 AM to 11:30 AM PDT.

Deepgram’s Voice Agent API brings lightning-fast speech-to-text and lifelike text-to-speech together with event hooks and speaker diarization, all in real time. Amazon Bedrock gives you instant access to leading foundation models like Claude and Titan, with built-in safety, compliance, and flexibility, perfect for powering voice agents with real intelligence.

Attendees will learn how to build scalable, responsive AI voice agents that actually work in production.

What You’ll See & Learn:

  • Build & Deploy in Minutes – See how Deepgram’s streaming API and Bedrock’s managed LLMs make real-time, voice-driven agents possible without stitching together brittle services.
  • Smarts + Speed in Action – Watch a live demo that showcases accurate transcription, rapid LLM responses, and the power of few-shot or RAG-based responses, all with sub-second latency.
  • Enterprise-Ready Architecture – Learn how to deploy with VPC, IAM, encryption, and autoscaling, all while controlling cost and optimizing performance.

Learn more and register here: https://luma.com/d3qf3t8s, or I would be happy to get you signed up to attend.

Deepgram’s Unfiltered Views on The Announcement From OpenAI

Posted in Commentary with tags on September 2, 2025 by itnerd

OpenAI just made an announcement titled, “Introducing gpt-realtime and Realtime API updates for production voice agents” found here: https://openai.com/index/introducing-gpt-realtime/

Scott Stephenson, CEO and Founder of Deepgram, would like to respectfully offer the following thoughts on this news:

“OpenAI’s new model shows progress, but the benchmarks make it clear: latency, turn-taking, and lack of control remain its Achilles’ heel in real conversations,” said Scott Stephenson, CEO and Founder, Deepgram. “When you measure what makes conversations actually work — speed, politeness, and turn-taking — Deepgram still leads the pack. The benchmarks confirm what users feel: conversations with Deepgram just flow more naturally.”

Stephenson continued, “Why does this matter? In real-world deployments, people don’t judge a voice agent by its feature set — they judge it by how the conversation feels. Latency and turn-taking aren’t technical footnotes; they’re the difference between a helpful interaction and a frustrating one. That’s why benchmarks that measure conversational flow, not just functionality, are the true indicator of readiness for production.”

Benchmarks That Back It Up 

  • #1 across all tests: Deepgram ranked highest under every VAQI weighting — equal, politeness-heavy, and latency-heavy.
  • Politer conversations: Fewest interruptions, meaning agents don’t talk over users. 
  • Faster responses: Sub-second average latency (0.85s) vs. OpenAI’s 2.55s. 
  • Smarter timing: Strong turn-taking with a competitive miss rate (0.427). 
  • Consistent edge: Even when benchmarks shifted priorities, results held — Deepgram stayed on top.
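To make the weighting idea concrete, here is a rough sketch of how a composite score like VAQI could combine the three metrics under different weightings. The normalization constants and the interruption-rate placeholders are assumptions for illustration; the post does not publish the exact VAQI formula.

```python
# Illustrative sketch: how a composite score like VAQI might weight
# latency, interruptions, and missed turns. The normalization and the
# weights below are hypothetical, not the published VAQI formula.

def composite_score(latency_s, interruption_rate, miss_rate, weights):
    """Lower raw metrics are better; convert each to a 0-1 'goodness'
    score and take the weighted average (higher composite is better)."""
    latency_score = max(0.0, 1.0 - latency_s / 3.0)  # assume 3 s is unusable
    polite_score = 1.0 - interruption_rate
    timing_score = 1.0 - miss_rate
    w_lat, w_pol, w_tim = weights
    total = w_lat + w_pol + w_tim
    return (w_lat * latency_score + w_pol * polite_score + w_tim * timing_score) / total

# Figures quoted in the post: 0.85 s vs 2.55 s average latency and a
# 0.427 miss rate. Interruption rates are not quoted, so 0.10 / 0.20
# below are placeholders.
for name, weights in [("equal", (1, 1, 1)),
                      ("latency-heavy", (2, 1, 1)),
                      ("politeness-heavy", (1, 2, 1))]:
    dg = composite_score(0.85, 0.10, 0.427, weights)
    oa = composite_score(2.55, 0.20, 0.427, weights)
    print(f"{name}: deepgram={dg:.3f} other={oa:.3f}")
```

Whatever the exact weighting, a provider that leads on every raw metric stays on top under every re-weighting, which is the "consistent edge" point above.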

Source: VAQI Benchmark, August 2025

Deepgram published a blog today with further details: https://deepgram.com/learn/vaqi-openai-gpt-realtime-test-with-sensitivity-analysis

Deepgram Signs Strategic Collaboration Agreement with AWS

Posted in Commentary with tags on August 19, 2025 by itnerd

Deepgram today announced that it has signed a strategic collaboration agreement (SCA) with Amazon Web Services (AWS). The multi-year agreement deepens Deepgram’s relationship with AWS and reflects a shared commitment to accelerating the development and adoption of generative voice AI. As part of the collaboration, Deepgram will expand co-selling and go-to-market efforts, integrate more deeply with AWS services, and empower enterprises to build scalable, high-accuracy voice applications across a wide range of use cases. 

Innovative startups and Fortune 100 enterprises alike are already transforming customer experiences using Deepgram and AWS. One Fortune 20 healthcare company uses Deepgram’s speech models on secure, scalable AWS infrastructure to modernize its contact center operations and deliver faster, more personalized customer support.

As a Generative AI Competency Partner and long-standing AWS Partner Network (APN) member, Deepgram offers a full-featured voice AI platform that includes speech-to-text (STT), text-to-speech (TTS), and speech-to-speech (STS) capabilities. Additionally, Deepgram’s Dedicated deployment and EU endpoints run entirely on AWS infrastructure, enabling enterprise customers to meet global requirements for data residency, security, and compliance.

Deepgram’s infrastructure is deeply integrated with AWS, enabling customers to deploy its platform on Amazon EKS for scalable container orchestration, store data securely with Amazon S3, and manage containers using Amazon ECR. Customers can also use Amazon API Gateway and AWS Lambda to securely orchestrate interactions between Deepgram’s voice AI APIs and other services, including Amazon Bedrock hosted models and enterprise systems. Whether deployed in a customer-owned VPC or as a fully managed SaaS environment, Deepgram offers the flexibility required to maintain compliance, ensure data control, and operate efficiently at scale. Looking ahead, Deepgram plans to expand availability through AWS services like Amazon SageMaker and Amazon Bedrock to further streamline AI model deployment and orchestration.
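As a rough illustration of the API Gateway / Lambda orchestration pattern described above, the sketch below builds the request payloads for a Deepgram transcription call followed by a Bedrock summarization step. The payload shapes and handler structure are illustrative assumptions rather than an exact API contract, and the network calls are stubbed out; a real deployment would send them with boto3 or urllib.

```python
# Sketch of the orchestration step: a Lambda-style handler that (1) sends
# hosted audio to Deepgram for transcription and (2) forwards the transcript
# to a Bedrock-hosted model. Payload shapes are illustrative assumptions.
import json

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"  # public STT endpoint

def build_deepgram_request(audio_url, api_key):
    # Deepgram's prerecorded API accepts a JSON body pointing at hosted audio.
    return {
        "url": DEEPGRAM_URL,
        "headers": {"Authorization": f"Token {api_key}",
                    "Content-Type": "application/json"},
        "body": json.dumps({"url": audio_url}),
    }

def build_bedrock_body(transcript):
    # Body format varies per Bedrock model family; this wraps the transcript
    # into a generic summarization instruction for illustration.
    return json.dumps({
        "prompt": f"Summarize this support call transcript:\n{transcript}",
        "max_tokens": 300,
    })

def handler(event, _context=None):
    # In production, the two build_* results would be dispatched with
    # boto3/urllib; here we return them so the flow is visible end to end.
    stt_request = build_deepgram_request(event["audio_url"], event["api_key"])
    transcript = "(transcript from the Deepgram response would go here)"
    return {"stt_request": stt_request, "llm_body": build_bedrock_body(transcript)}
```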

Deepgram’s speech-to-text API can also be integrated into Amazon Connect, enabling best-in-class STT speed and accuracy for real-time transcription and voice automation within contact center environments. This helps enterprises improve agent productivity, automate call summaries, and enhance customer experiences.

As part of the SCA, Deepgram will invest in building GenAI-enabled capabilities on AWS, deliver new case studies and proof-of-concepts for enterprise customers, and continue optimizing its models and services for the AWS ecosystem.

Deepgram’s availability in AWS Marketplace also simplifies procurement for engineering and infrastructure teams by enabling usage-based pricing, unified billing, and rapid deployment within existing AWS environments.

Learn more about the partnership by visiting deepgram.com/aws, or start building with Deepgram on AWS today by exploring their listing on the AWS Marketplace.

Deepgram Expands Internationally, Launches Managed Single-Tenant Deployment Option

Posted in Commentary with tags on July 30, 2025 by itnerd

Voice AI is rapidly becoming foundational infrastructure across industries, powering real-time agents, compliance-sensitive workflows, and multilingual applications at scale. As global adoption accelerates, so does the demand for flexible deployment models, regional hosting, and production-grade reliability.

To meet that demand, Deepgram is announcing two major infrastructure expansions:

  • The general availability of Deepgram Dedicated, a fully managed, single-tenant runtime
  • The early access launch of our EU-hosted API endpoint, enabling in-region inference for European workloads

These launches reflect a broader shift in how voice AI is being deployed, and they come at a time of growing industry validation. This month, Deepgram Nova-3 was named a 2025 Voice AI Technology Excellence Award winner by TMC’s CUSTOMER magazine, recognizing its leadership in accuracy, real-time multilingual transcription, and self-serve customization.

Together, these milestones reinforce Deepgram’s commitment to providing voice AI infrastructure that supports enterprise-scale performance, compliance, and geographic flexibility.

What It Means to Go Global with Voice AI

Going global starts with supporting the world’s languages. Deepgram already supports over 36 languages for customers worldwide and will continue expanding language coverage throughout 2025. 

But language support is only the beginning.

For engineering teams building production-grade systems, global voice AI also requires solving for infrastructure and compliance demands as workloads expand across regions. As enterprises scale voice workloads globally, we continue to hear two common friction points: the growing complexity of managing infrastructure across regions and tightening data policies, particularly in the EU, that require stricter control over where and how voice data is processed.

These demands include:

  • Ultra low-latency inference paths. Real-time applications require models to run as close to the end user as possible to minimize round-trip time and meet interaction thresholds.
  • Data residency and legal jurisdiction. Voice data often must be processed and stored within specific geographic boundaries to meet regulatory requirements such as GDPR.
  • Single-tenant isolation for sensitive workloads. Some environments require dedicated infrastructure to enforce data segregation, meet compliance standards, or satisfy internal security policies.
  • Scalable operations without added DevOps burden. Expanding voice workloads across regions should not require a proportional increase in infrastructure engineering.

Deepgram’s platform was designed with these requirements in mind, providing the foundation needed to operationalize voice AI reliably and securely across global environments.

Introducing Deepgram Dedicated: A Managed, Single-Tenant Runtime

Enterprises adopting voice AI at scale often face a difficult tradeoff: maintain control over infrastructure and data by self-hosting, or prioritize ease of use through shared, multi-tenant cloud APIs. Self-hosting offers isolation and regional control, but introduces significant ongoing operational complexity. Managed service providers can help bridge the gap, but they often lack product-level expertise and introduce dependency overhead that slows down feature adoption.

Now generally available, Deepgram Dedicated closes this gap. It is a fully managed, single-tenant deployment of Deepgram’s voice AI platform that offers the control and flexibility of self-hosted infrastructure without the burden of operating it. Over the past six months, it has been deployed with a select group of enterprise customers in early production across a range of use cases, from real-time contact center platforms to globally distributed voice agents.

Teams gain regional isolation, performance control, and compliance alignment while offloading infrastructure management to Deepgram. Deepgram Dedicated currently runs on AWS, with support for additional cloud providers on the roadmap.

Key Highlights:

  • Single-tenant architecture: Each deployment runs on isolated compute, avoiding noisy neighbor effects and supporting strict data segregation.
  • Unified voice AI stack: Run speech-to-text, text-to-speech, and speech-to-speech workloads in a single runtime with consistent API behavior.
  • Multi-cluster design: Separates real-time, pre-recorded, and agent workloads onto specialized clusters to maximize performance, ensure high availability, and enable strict workload isolation.
  • Region-specific infrastructure: Deploy in your preferred cloud region to meet compliance requirements, enable ultra-low latency, and align with internal policies, including support for country-level deployments.
  • SLA-backed performance: Optional SLAs ensure predictable uptime and latency with defined targets monitored and enforced by Deepgram.

In one modeled scenario, a customer supporting 1,000 concurrent real-time streams would spend approximately $467K USD annually if self-hosting. This includes $250K in DevOps headcount and $98K in infrastructure costs.

Running the same workload on Deepgram Dedicated lowers total OPEX by approximately $98K USD per year. It also reduces engineering overhead and improves deployment reliability through platform-managed SLAs and regional isolation, giving teams more time to focus on higher-impact work.
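Laying out the quoted figures as simple arithmetic makes the model easier to sanity-check; note that the $467K total implies roughly $119K of costs beyond the DevOps and infrastructure line items.

```python
# The modeled scenario above, as arithmetic. All figures are taken from
# the post: $467K total self-hosted OPEX, of which $250K is DevOps
# headcount and $98K infrastructure; Dedicated saves ~$98K per year.
SELF_HOSTED_TOTAL = 467_000
DEVOPS = 250_000
INFRA = 98_000
OTHER = SELF_HOSTED_TOTAL - DEVOPS - INFRA  # remaining costs (tooling, etc.)

SAVINGS = 98_000
DEDICATED_TOTAL = SELF_HOSTED_TOTAL - SAVINGS

print(f"self-hosted: ${SELF_HOSTED_TOTAL:,}/yr (other costs: ${OTHER:,})")
print(f"dedicated:   ${DEDICATED_TOTAL:,}/yr "
      f"(~{SAVINGS / SELF_HOSTED_TOTAL:.0%} lower)")
```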

EU-Hosted API Endpoint: In-Region Inference for European Voice Workloads

Voice AI adoption is accelerating across Europe, driven by demand for real-time applications in finance, public services, retail, and telecommunications. To date, more than two dozen customers and prospects have expressed interest in EU-based infrastructure, highlighting growing demand for in-region processing that meets local performance expectations and regulatory requirements without compromising model quality or flexibility.

To support this, Deepgram is launching early access to api.eu.deepgram.com, a new EU-hosted speech-to-text API endpoint that delivers in-region inference with full feature parity and consistent performance. The EU endpoint is hosted in AWS EU regions, with additional hosting options under consideration.

Key Highlights:

  • Voice data stays within the EU. All processing occurs inside EU-based AWS regions, ensuring no cross-border data transfer.
  • Latency improvements for EU-based users: Localized inference reduces round-trip time for applications serving users in or near the EU.
  • No code changes required: Existing integrations can migrate by updating the base URL, with no other changes needed.
  • Supports GDPR compliance and auditability: The deployment is fully isolated within the EU legal boundary and aligned with regional data protection standards.
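A minimal sketch of what that base-URL migration looks like in practice, assuming the standard /v1/listen path; everything else (auth headers, query parameters) stays the same.

```python
# Migrating to the EU endpoint only requires changing the base URL,
# per the post. The /v1/listen path and model parameter below follow
# Deepgram's standard API shape.
US_BASE = "https://api.deepgram.com"
EU_BASE = "https://api.eu.deepgram.com"

def listen_url(base, model="nova-3"):
    return f"{base}/v1/listen?model={model}"

print(listen_url(US_BASE))  # existing integration
print(listen_url(EU_BASE))  # after migration: only the base changes
```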

This endpoint is well-suited for European ISVs, compliance-focused enterprises, and global teams looking to reduce latency and streamline deployment in the EU.

Why This Matters: A Global-Ready Voice AI Platform

With these additions, Deepgram now supports a range of deployment options, including multi-tenant hosted APIs, fully managed single-tenant deployments, and customer-operated self-hosted infrastructure. This flexibility allows engineering teams to choose the right model based on their application requirements, compliance obligations, and operational preferences. For some, the hosted API provides a fast path to integration. Others may require the regional data residency of the EU endpoint or the isolation and control of a Dedicated deployment. Teams with existing DevOps capacity may opt for self-hosting to align with internal security policies or infrastructure standards.

What differentiates Deepgram is the ability to deliver true flexibility across deployment models. Teams can build and scale voice AI systems using consistent APIs and model performance, while choosing the infrastructure that fits their environment. Looking ahead, the roadmap includes customer VPC deployments, BYOC support, and expanded region availability across Asia-Pacific, EMEA, and LATAM.

Start Building for Your Environment

If you’re building voice applications that require global reach, regulatory alignment, or low-latency performance, now is the time to explore your deployment options. Demand is high, and we’re expanding access selectively.

Deepgram now runs where your business runs. No trade-offs. No overhead. Just voice AI on your terms.

Deepgram Launches Saga: The Voice OS for Developers

Posted in Commentary with tags on July 8, 2025 by itnerd

Deepgram, the leading voice AI platform for enterprise use cases, today announced the launch of Deepgram Saga, a Voice Operating System (OS) designed specifically for developers. Saga is a universal voice interface that embeds directly into developer workflows, allowing users to control their tech stack through natural speech. Unlike traditional voice assistants that pull developers out of their flow, Saga sits on top of existing tools, transforming rough ideas into precise AI coding prompts, executing multi-step workflows across platforms via Model Context Protocol (MCP), and eliminating the constant context switching that fragments modern development.

In today’s development environment, engineers routinely juggle 8+ tools across multiple monitors, constantly translating thoughts into clicks, rough ideas into overly specific prompts, and context into commands. This fragmentation creates a “quiet tax” on productivity — time lost to alt-tabbing, window hunting, and manual navigation between coding, testing, and deployment tools. Saga eliminates this friction by providing a voice-native AI interface that interprets developer intent and executes actions across the entire tech stack, enabling developers to stay in flow while building software.

Voice-First Workflow Control

Saga addresses the core challenges facing AI-native developers and early-stage builders who need to move fast without getting bogged down in tool complexity. 

Key capabilities include:

  • Developer Ecosystem Friendly: Whether vibe coding with Cursor or Windsurf, maintaining status updates in Linear, Asana, Jira or Slack, extracting CSS from Figma designs, or just executing operational day-to-day tasks within Google Docs, Gmail or Google Sheets, Saga lives alongside the tools developers already know, love, and use every day.
  • Intelligent Prompt Generation: Developers can speak vague ideas like “Build a Slack bot that reacts to emoji,” and Saga transforms these into crystal-clear, one-shot prompts for tools like Cursor, eliminating the trial-and-error cycle of “vibe coding.”
  • End-to-End Workflow Execution: A single voice command like “Run tests, commit changes, deploy, and update the team” triggers coordinated actions across the entire development stack — no tabs, manual commands, or context switching required.
  • Real-Time Documentation: Saga captures stream-of-consciousness thinking and transforms it into structured documentation, tickets, or PR descriptions, allowing developers to rubber-duck their way to clean documentation without breaking their train of thought.
  • Contextual Tool Integration: Rather than requiring developers to switch to separate AI chat windows, Saga surfaces answers and executes actions inline, layered over existing development tools.
  • Natural Code Generation: Developers can speak requests like “Get me the top 10 users who signed up in the last week” and receive instant SQL or JavaScript snippets without needing to Google syntax or write boilerplate.
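To make the last bullet concrete, this is the kind of snippet such a spoken request might produce; the table and column names are invented for illustration.

```python
# An illustrative result for the spoken request "Get me the top 10 users
# who signed up in the last week." Schema names here are hypothetical.
GENERATED_SQL = """
SELECT id, name, signed_up_at
FROM users
WHERE signed_up_at >= NOW() - INTERVAL '7 days'
ORDER BY signed_up_at DESC
LIMIT 10;
""".strip()
print(GENERATED_SQL)
```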

Built for AI-Native Development with MCP

Saga is specifically designed for the new generation of technical users who rely on AI agents, use tools like Cursor and Windsurf daily, and treat their workflow like a programmable operating system. The platform integrates seamlessly with existing developer tools through MCP (Model Context Protocol) and other standard interfaces, ensuring teams can adopt Saga without disrupting their current setup.

Enterprise-Grade Voice Intelligence

Built on Deepgram’s world-class speech-to-text, text-to-speech, and voice agent APIs, Saga delivers the accuracy and responsiveness required for mission-critical development workflows. The platform understands technical context, domain-specific terminology, and the nuanced language developers use when thinking through complex problems.

Unlike consumer voice assistants that require rigid command structures, Saga interprets natural, conversational speech and translates it into precise technical actions. This approach eliminates the cognitive overhead of remembering specific voice commands while maintaining the reliability enterprises need for production development environments.

Start Building with Saga

Experience how voice can transform your development workflow with Deepgram Saga. The platform is designed for developers who want fewer clicks and more execution, enabling faster iteration cycles and reduced context switching. 

Additional Resources

  • Get started with Deepgram’s quickstart guides

Deepgram Expands Aura-2 Text-to-Speech Platform with High-Fidelity Spanish Voice Models 

Posted in Commentary with tags on June 30, 2025 by itnerd

Deepgram has officially expanded its Aura-2 text-to-speech (TTS) API with a new suite of high-quality Spanish voice models, bringing realistic, expressive, and business-ready voice synthesis to Spanish-speaking markets.

This launch marks a major step in Deepgram’s mission to enable real-time, natural-sounding voice experiences across global industries. The new Spanish voices are optimized for enterprise use cases, from customer support and IVR systems to healthcare and education, featuring precise pronunciation for currencies, timestamps, acronyms, emails, and more.

HIGHLIGHTS:

  • 10 new Spanish Aura-2 voice models tailored for professional use
  • Support for Mexican, Peninsular, Colombian, and Latin American accents
  • Models designed for diverse applications including advertising, IVR, storytelling, and customer service
  • Support for code-switching in select models (English ↔ Spanish)
  • Available now via REST and WebSocket APIs

Voices like “Celeste” (Colombian, energetic and friendly) and “Nestor” (Peninsular, calm and confident) are just a couple of the expressive voices now available.

It is available now for use via Deepgram’s hosted TTS API platform. 

Here is a blog with details: https://deepgram.com/changelog/aura-2-spanish-tts 

Developers and product teams can find implementation examples and model specifications in the Deepgram Developer Documentation, here: https://developers.deepgram.com/docs/tts-models#aura-2-all-available-voices

Deepgram CEO Scott Stephenson Launches “The Scott Stephenson AI Show” — A No-Hype, Deep-Dive Podcast on the AI Revolution

Posted in Commentary with tags on June 23, 2025 by itnerd

Deepgram today announced the launch of The Scott Stephenson AI Show, a new podcast hosted by Scott Stephenson, CEO and Co-Founder of Deepgram. In each episode, Stephenson explores the fast-changing world of artificial intelligence (AI), cutting through the hype and digging into what’s actually happening under the hood of today’s most powerful AI technologies. Stephenson brings his signature candor, industry insight, and curiosity to every topic, offering unfiltered perspectives on what’s working, what’s hype, and what’s next.

Episode 1

In the first episode, Scott unpacks the concept of vibe coding, the rising trend where developers interact with AI in a product manager-like mindset, using natural language and feedback instead of conventional code. He also explores the emerging era of AI agents, A2A (agent-to-agent) communication, MCP (Model Context Protocol), and how these breakthroughs will reshape engineering and business workflows.

Episodes will be released bi-weekly. Future episodes will feature conversations around evaluating GenAI models, what and who to trust, and constraints and accelerators for the pace of innovation.  

Where to Watch and Subscribe:

Deepgram Launches Voice Agent API

Posted in Commentary with tags on June 16, 2025 by itnerd

Deepgram today announced the general availability (GA) of its Voice Agent API, a single, unified voice-to-voice interface that gives developers full control to build context-aware voice agents that power natural, responsive conversations. Combining speech-to-text, text-to-speech, and large language model (LLM) orchestration with contextualized conversational logic into a unified architecture, the Voice Agent API gives developers the choice of using Deepgram’s fully integrated stack (leveraging industry-leading Nova-3 STT and Aura-2 TTS models) or bringing their own LLM and TTS models. It delivers the simplicity developers love and the controllability enterprises need to deploy real-time, intelligent voice agents at scale. Today, companies like Aircall, Jack in the Box, StreamIt, and OpenPhone are building voice agents with Deepgram to save costs, reduce wait times, and increase customer loyalty.

In today’s market, teams building voice agents are often forced to choose between two extremes: rigid, low-code platforms that lack customization, or DIY toolchains that require stitching together STT, TTS, and LLMs with significant engineering effort. Deepgram’s Voice Agent API eliminates this tradeoff by providing a unified API that simplifies development without sacrificing control. Developers can build faster with less complexity, while enterprises retain full control over orchestration, deployment, and model behavior, without compromising on performance or reliability.

Developer Simplicity and Faster Time to Market

For teams taking the DIY route, the challenge isn’t just connecting models but also building and operating the entire runtime layer that makes real-time conversations work. Teams must manage live audio streaming, accurately detect when a user has finished speaking, coordinate model responses, handle mid-sentence interruptions, and maintain a natural conversational cadence. While some platforms offer partial orchestration features, most APIs do not provide a fully integrated runtime. As a result, developers are often left to manage streaming, session state, and coordination logic across fragmented services, which adds complexity and delays time to production.

Deepgram’s Voice Agent API removes this burden by providing a single, unified API that integrates speech-to-text, LLM reasoning, and text-to-speech with built-in support for real-time conversational dynamics. Capabilities such as barge-in handling and turn-taking prediction are model-driven and managed natively within the platform. This eliminates the need to stitch together multiple vendors or maintain custom orchestration, enabling faster prototyping, reduced complexity, and more time focused on building high-quality experiences.

In addition to the Voice Agent API, organizations seeking broader integrations can leverage Deepgram’s extensive partner ecosystem, including Kore.ai, OneReach.ai, Twilio and others, to access comprehensive conversational AI solutions and services powered by Deepgram APIs.  

Maximum Control and Flexibility

While the Voice Agent API streamlines development, it also gives teams deep control over performance, behavior, and scalability in production. Built on Deepgram’s Enterprise Runtime and full model ownership across the entire voice AI stack, the platform enables model-level optimization at every layer of the interaction loop. This allows for precise tuning of latency, barge-in handling, turn-taking, and domain-specific behavior in ways not possible with disconnected components.

Key capabilities include:

  • Flexible Deployment: Run the complete voice stack in cloud, VPC, or on-prem environments to meet enterprise requirements for security, compliance, and performance.
  • Runtime-Level Orchestration: Deepgram’s runtime supports mid-session control, real-time prompt updates, model switching, and event-driven signaling to adapt agent behavior dynamically.
  • Bring-Your-Own Models: Teams can integrate their own LLMs or TTS systems while retaining Deepgram’s orchestration, streaming pipeline, and real-time responsiveness.
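To illustrate the bring-your-own-models capability, here is a hypothetical settings payload that defaults to Deepgram’s integrated stack but can point the reasoning step at a customer-hosted LLM. The field names are illustrative, not Deepgram’s documented schema.

```python
# Hypothetical agent configuration: Deepgram STT/TTS by default, with the
# option to swap in a customer-hosted LLM while keeping Deepgram's
# orchestration. Field names are illustrative only.
import json

def agent_settings(own_llm_url=None):
    settings = {
        "listen": {"model": "nova-3"},      # Deepgram STT
        "speak": {"model": "aura-2"},       # Deepgram TTS
        "think": {"provider": "deepgram"},  # default integrated stack
    }
    if own_llm_url:
        # Bring-your-own LLM: only the reasoning step changes.
        settings["think"] = {"provider": "custom", "url": own_llm_url}
    return json.dumps(settings)

print(agent_settings())
print(agent_settings("https://llm.example.internal/v1/chat"))
```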

This tightly coordinated design translates directly into measurable performance gains. In recent benchmark testing using the Voice Agent Quality Index (VAQI), Deepgram achieved the highest overall score among all evaluated providers (see Figure 1). VAQI is a composite benchmark that measures the core elements of voice agent quality: latency (how quickly the agent responds), interruption rate (how often it cuts users off), and response coverage (how often it misses valid input).

Deepgram outperformed OpenAI by 6.4% and ElevenLabs by 29.3%, reflecting the advantage of its integrated architecture and model-driven turn-taking. The result is smooth, responsive conversations without missed inputs, premature responses, or unnatural delays.

Cost-Effectiveness at Scale

In addition to control and performance, the Voice Agent API is built for cost efficiency across large-scale deployments. When teams run entirely on Deepgram’s vertically integrated stack, pricing is fully consolidated at a flat rate of $4.50 per hour (see Figure 2). This provides predictable, all-in-one billing that simplifies planning and scales with usage. Deepgram’s vertically integrated runtime also delivers unmatched compute efficiency, optimizing every stage of the speech pipeline to minimize infrastructure costs while maintaining real-time responsiveness.

For teams that bring their own LLM or TTS models, Deepgram offers built-in rate reductions, enabling even lower total cost of ownership for production-scale deployments.

Start Building with the Voice Agent API

Experience how fast and flexible voice agents can be with Deepgram’s unified voice-to-voice API. Explore the API in our interactive playground, review documentation, or integrate in minutes using our SDK. New users receive $200 in free credits, enough to process over 40 hours of real-time voice agent usage. Start building natural, responsive conversations with infrastructure built for real-time performance and enterprise-scale.
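The free-credit claim is easy to verify: at the flat $4.50/hour rate, $200 covers a bit more than 44 hours of real-time usage.

```python
# Quick check of the "over 40 hours" claim using the flat rate quoted
# in the post.
CREDITS = 200.00
RATE_PER_HOUR = 4.50  # flat voice agent rate on the integrated stack

hours = CREDITS / RATE_PER_HOUR
print(f"${CREDITS:.0f} in credits covers {hours:.1f} hours")
```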

Additional Resources:

  • Explore the blog for an in-depth breakdown of Voice Agent API’s capabilities
  • Watch a fun demo of Deepgram’s Voice Agent API
  • Try Deepgram’s interactive demo
  • Get $200 in free credits and try Deepgram for yourself

Introducing Nova-3 Medical: The Most Accurate Medical Transcription Model in the World 

Posted in Commentary with tags on March 3, 2025 by itnerd

Deepgram today announced the launch of Nova‑3 Medical, its next‑generation AI-powered speech‑to‑text (STT) model specifically engineered for the healthcare industry. Designed to meet the rigorous demands of clinical environments, Nova‑3 Medical enables developers to build highly accurate, customizable, and secure voice AI products and solutions tailored for healthcare settings. It seamlessly integrates with Deepgram’s enterprise runtime platform—including advanced text-to-speech (TTS) and speech-to-speech (STS) capabilities—providing a comprehensive suite of AI-driven tools that deliver enterprise-grade performance, adaptability, and cost efficiency. From streamlining clinical documentation to revolutionizing therapeutic scribing, Deepgram powers transformative medical transcription applications for industry leaders, driving exceptional outcomes across the healthcare spectrum.

Meeting the Growing Demand for AI-Powered Healthcare Transcription

As healthcare rapidly digitizes—with the widespread adoption of electronic health records, telemedicine, and digital health platforms—the demand for AI-powered transcription has never been greater. Traditional off-the-shelf speech-to-text models often struggle with the complexities of clinical terminology, leading to transcription errors and “hallucinations” that can compromise patient care. With the medical transcription market projected to grow from USD 85.3 billion in 2023 to USD 190.2 billion by 2032, developers building voice-AI applications for healthcare need infrastructure that not only delivers exceptional accuracy and speed but also provides the flexibility to meet diverse regulatory and operational requirements.

Built to meet these demands, Nova-3 Medical leverages advanced machine learning and specialized medical vocabulary training to set a new standard in healthcare transcription. Engineered for real-world clinical environments, the model accurately captures specialized medical terms, acronyms, and clinical jargon—even in challenging far-field audio conditions where providers step away from recording devices such as desktops and tablets. Moreover, it delivers structured transcriptions that seamlessly integrate with clinical workflows and EHR systems, ensuring vital patient data is accurately organized and readily accessible. Its flexible, self‑service customization—featuring Keyterm Prompting for up to 100 key terms—allows developers to tailor the solution to the unique needs of various medical specialties while versatile deployment options, including on‑premises and VPC configurations, ensure enterprise‑grade security and HIPAA compliance.
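As a sketch of how Keyterm Prompting might be wired into a request, the snippet below repeats a keyterm query parameter on the listen URL; the parameter name and model identifier should be verified against Deepgram’s current documentation.

```python
# Sketch of Keyterm Prompting: boost up to 100 domain terms by repeating
# a keyterm query parameter. Parameter and model names follow Deepgram's
# docs as best understood; verify against current documentation.
from urllib.parse import urlencode

BASE = "https://api.deepgram.com/v1/listen"

def medical_stt_url(keyterms):
    assert len(keyterms) <= 100, "Keyterm Prompting supports up to 100 terms"
    params = [("model", "nova-3-medical")] + [("keyterm", t) for t in keyterms]
    return f"{BASE}?{urlencode(params)}"

print(medical_stt_url(["metoprolol", "tachycardia", "HIPAA"]))
```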

Benchmarking Nova-3 Medical: Accuracy, Speed, and Efficiency

Nova-3 Medical delivers industry-leading transcription accuracy, optimizing both overall word recognition and critical medical term accuracy for voice-driven healthcare applications.

WER Comparison (see figure 1)

With a median Word Error Rate (WER) of 3.45%, Nova-3 Medical outperforms competing models, achieving a 63.6% reduction in errors compared to the next best competitor. This improvement enhances documentation precision, minimizes manual corrections, and streamlines workflows for healthcare providers.
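For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal sketch of the standard computation (the example strings are invented for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four: WER = 0.25
print(wer("patient denies chest pain", "patient denies chess pain"))
```

A median WER of 3.45% therefore means roughly 3 to 4 word-level errors per 100 reference words across the benchmark's test files.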

KER Comparison (see figure 2)

However, medical transcription accuracy isn’t limited to WER—correctly capturing critical medical terms is essential for minimizing patient care risks. Nova-3 Medical achieves a Keyword Error Rate (KER) of 6.79%, marking a 40.35% reduction in errors compared to the next best competitor. This ensures that fewer critical drug names, conditions, and procedures are misrecognized, reducing the chances of transcription errors that could lead to miscommunication, improper documentation, or even patient safety risks.
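KER restricts the error count to a designated list of critical terms. One common formulation, sketched below, is the fraction of keyword occurrences in the reference that are missing from the hypothesis; the benchmark's exact methodology may differ, and the drug names here are invented examples.

```python
from collections import Counter

def ker(reference: str, hypothesis: str, keywords: set[str]) -> float:
    """Keyword Error Rate: share of reference keyword occurrences
    that do not survive into the hypothesis (multiset comparison)."""
    ref_kw = Counter(w for w in reference.lower().split() if w in keywords)
    hyp_kw = Counter(w for w in hypothesis.lower().split() if w in keywords)
    total = sum(ref_kw.values())
    if total == 0:
        return 0.0
    missed = sum((ref_kw - hyp_kw).values())  # occurrences lost or garbled
    return missed / total

keywords = {"metoprolol", "hypertension"}
ref = "administer metoprolol 50 mg for hypertension"
hyp = "administer metropole 50 mg for hypertension"  # drug name misheard
print(ker(ref, hyp, keywords))  # 1 of 2 keywords missed -> 0.5
```

The point of tracking KER separately from WER is visible in the example: a single misrecognized word is a 16.7% WER but a 50% KER, because the one error landed on a drug name.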

In addition to transcription accuracy, Nova-3 Medical excels in real-time applications, where speed and scalability are crucial. Optimized for real-time use, Nova‑3 Medical transcribes speech 5 to 40 times faster than most alternative speech recognition vendors, making it ideal for telemedicine and digital health platforms. Its scalable architecture ensures that as transcription volumes grow, healthcare tech companies can maintain high performance without incurring excessive costs. Starting at $0.0077 per minute of streaming audio, Nova‑3 Medical costs less than half as much as leading cloud providers, reducing operational expenses and enabling companies to reinvest in innovation, accelerate product development, and offer competitive pricing to drive market adoption.

Visit Deepgram at Booth #136 in the AI Pavilion at HIMSS25, March 3-6, 2025, to see Nova-3 Medical in action, and don't miss these sessions:

Session: From AI Scribes to EHR Automation: How Deepgram Enables Healthtech with Voice AI and Amazon Bedrock

When: Tuesday, March 4, 3:40 PM to 4:00 PM

Where: AI Pavilion, Venetian, Level 2, Hall A

Session: Voice AI Mixer with Deepgram & OneReach.ai

When: Wednesday, March 5, 6:00 PM to 7:30 PM

Where: Venetian, Palazzo Ballroom, Palazzo A

For more information about Nova‑3 Medical and how it is revolutionizing healthcare transcription, please visit www.deepgram.com.

Deepgram Achieves Key Milestone on Path to Delivering Next-Gen, Enterprise-Grade Speech-to-Speech Architecture

Posted in Commentary with tags on February 19, 2025 by itnerd

Deepgram has announced a significant technical achievement in speech-to-speech (STS) technology for enterprise use cases. The company has successfully developed a speech-to-speech model that operates without relying on text conversion at any stage, marking a pivotal step toward the development of contextualized end-to-end speech AI systems. This milestone will enable fully natural and responsive voice interactions that preserve nuances, intonation, and emotional tone throughout real-time communication. When fully operationalized, this architecture will be delivered to customers via a simple upgrade from our existing industry-leading architecture. By adopting this technology alongside Deepgram’s full-featured voice AI platform, companies will gain a strategic advantage, positioning themselves to deliver cutting-edge, scalable voice AI solutions that evolve with the market and outpace competitors.

Advancements Over Existing Architectures

Existing speech-to-speech (STS) systems are based on architectures that process speech through sequential stages, such as speech-to-text, text-to-text, and text-to-speech. These architectures have become the standard for production deployments for their modularity and maturity, but eliminating text as an intermediary offers opportunities to improve latency and better preserve emotional and contextual nuances.
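The cascade described above can be sketched in a few lines. Everything here is a stand-in: the stage functions are stubs and the per-stage latencies are illustrative numbers, not measurements of any real system. The sketch only shows the structural point that latency accumulates stage by stage and that audio is flattened to text after the first hop.

```python
# Conceptual sketch of a cascaded speech-to-speech pipeline.
# All stages are stubs; latencies are illustrative, not measured.

def speech_to_text(audio: bytes) -> str:
    return "what are my refill options"          # stand-in for an STT model

def llm_reply(text: str) -> str:
    return "You have two refills remaining."     # stand-in for an LLM

def text_to_speech(text: str) -> bytes:
    return text.encode()                         # stand-in for a TTS model

STAGE_LATENCY_MS = {"stt": 300, "llm": 500, "tts": 200}  # illustrative

def cascaded_agent(audio: bytes) -> tuple[bytes, int]:
    text = speech_to_text(audio)       # prosody and tone are lost here
    reply = llm_reply(text)
    speech = text_to_speech(reply)
    # End-to-end latency is the sum of every stage in the chain.
    total_ms = sum(STAGE_LATENCY_MS.values())
    return speech, total_ms

speech, latency_ms = cascaded_agent(b"...")
print(latency_ms)  # 1000
```

Removing the text handoffs attacks both weaknesses at once: there is no serialization boundary to add latency, and no point at which intonation and emotion are discarded.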

Meanwhile, multimodal LLMs like Gemini, GPT-4o, and Llama have evolved beyond text-only capabilities to accept additional inputs such as images, videos, and audio. Despite these advancements, however, they struggle to capture the fluidity and nuance of human-like conversation. These models still rely on a turn-based framework, where audio input is tokenized and processed within a textual domain, restricting real-time interactivity and expressiveness.

To advance the frontier of speech AI, Deepgram is setting the stage for end-to-end STS models, which offer a more direct approach by converting speech to speech without relying on text. Recent research on speech-to-speech models, such as Hertz and Moshi, has highlighted the significant challenges in developing models that are robust and reliable enough for enterprise use cases. These difficulties stem from the inherent complexities of modeling conversational speech and the substantial computational resources required. Overcoming these hurdles demands innovations in data collection, model architecture, and training methodologies.

Delivering Speech-to-Speech with Latent Space Embeddings

Deepgram is transforming speech-to-speech modeling with a new architecture that fuses the latent spaces of specialized components, eliminating the need for text conversion between them. By embedding speech directly into a latent space, Deepgram ensures that important characteristics such as intonation, pacing, and situational and emotional context are preserved throughout the entire processing pipeline. What sets Deepgram apart is its approach to fusing the hidden states—the internal representations that capture meaning, context, and structure—of each individual function: Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS). This fusion is the first step toward training a controllable single, true end-to-end speech model, enabling seamless processing while retaining the strengths of each best-in-class component. This breakthrough has significant implications for enterprise applications, facilitating more natural conversations while maintaining the control and reliability businesses require.
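The fusion idea can be made concrete with a toy sketch. This is purely conceptual: the dimensions, projection matrices, and activation are invented for illustration, and nothing here reflects Deepgram's actual architecture beyond the structural point that components exchange hidden-state vectors rather than text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hidden-state sizes only; real model dimensions are not public.
D_STT, D_LLM, D_TTS = 256, 512, 256

# In a trained system these bridges are learned; random matrices stand in here.
W_stt_to_llm = rng.normal(size=(D_STT, D_LLM)) / np.sqrt(D_STT)
W_llm_to_tts = rng.normal(size=(D_LLM, D_TTS)) / np.sqrt(D_LLM)

def fused_pipeline(stt_hidden: np.ndarray) -> np.ndarray:
    """Hand hidden states directly between components -- no text in between,
    so intonation, pacing, and emotional context encoded in the latent
    vector are never flattened to a transcript."""
    llm_hidden = np.tanh(stt_hidden @ W_stt_to_llm)   # STT latent -> LLM latent
    tts_hidden = np.tanh(llm_hidden @ W_llm_to_tts)   # LLM latent -> TTS latent
    return tts_hidden

# One "frame" of speech as embedded by the STT encoder (random stand-in).
frame = rng.normal(size=(1, D_STT))
out = fused_pipeline(frame)
print(out.shape)  # (1, 256)
```

The contrast with the cascade is that the interface between components is a dense vector that can carry paralinguistic information, rather than a lossy string of words.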

This technical advancement builds on Deepgram’s expertise in enterprise speech AI, with over 200,000 developers using its platform, more than 50,000 years of audio processed, and over 1 trillion words transcribed. Key benefits of the new architecture include:

  • Optimized latency design for faster, more responsive interactions
  • Enhanced naturalness, preserving emotional context and conversational nuances
  • Native ability to handle complex, multi-turn conversations
  • Unified, end-to-end training across the entire model, creating a more cohesive and inherently adaptive system that fine-tunes its understanding and response generation directly in the audio space

Utilizing Transfer Learning for Cost-Efficient, High-Accuracy Speech-to-Speech

Deepgram’s research in the space is accelerated by its use of transfer learning and best-in-class pre-trained models, allowing it to achieve high accuracy with significantly less training data than traditional methods. Without latent techniques, training a model at the scale needed for speech-to-speech would require over 80 billion hours of audio—more than humanity has ever recorded. However, Deepgram’s latent space embeddings and transfer learning approach achieve superior comprehension while significantly reducing costs, maintaining interpretability, and accelerating enterprise deployment. This efficiency enables Deepgram to deliver scalable, end-to-end speech AI that meets the demands of real-world voice applications.

Empowering Developers with Full Debuggability

One of the requirements in enterprise speech-to-speech modeling is the ability to understand and troubleshoot each step of the process. This is particularly challenging when text conversion between steps isn’t involved, as verifying both the accuracy of the initial perception and the alignment of the spoken output with the intended response is not straightforward. Deepgram recognized this need and addressed it by designing a new architecture that enables debuggability throughout the entire process.

This architecture allows developers to inspect and understand how the system processes spoken dialogue. The design incorporates speech modeling of perception, natural language understanding/generation, and speech production, preserving distinct capabilities during training. Because intermediate representations can be decoded back to text at specific points, developers can see what the model perceives, thinks, and generates, verify that its internal representation aligns with the model's output, and confirm that it stays true to the intent of the business user, addressing hallucination concerns in scaled business use cases. This visibility into every step of generation helps teams refine models, improve performance, and deliver more accurate, lifelike, and reliable speech-to-speech solutions.

Beyond Speech-to-Speech – A Complete, Enterprise-Ready Voice AI Stack

While building an advanced speech-to-speech (STS) model is a major technical achievement, enterprises need more than just a model—they need a complete, scalable platform that ensures seamless deployment, adaptability, and cost efficiency. Deepgram delivers not just cutting-edge STS technology, but an enterprise-ready infrastructure designed for real-world applications.

Seamless Integration & Continuous Improvement – Once Deepgram’s end-to-end STS model moves to production, businesses will be able to adopt this breakthrough directly through our developer-friendly voice agent API from within the current Deepgram platform. Through continued innovation, enterprises will benefit from the latest advancements, ensuring seamless integration and a future-proof platform for their voice AI applications.

Enterprise-Grade Performance & Cost Efficiency – Built for low customer COGS, our platform enables enterprises to deploy high-performance voice AI without excessive costs. This ensures scalability, whether for customer service automation, real-time voice agents, or multilingual applications.

Full-Featured Platform and High-Performance Runtime – Deepgram’s platform includes powerful capabilities such as:

  • Adaptability – Dynamically fine-tune models for specific industry language, ensuring high accuracy across diverse applications without needing constant retraining.
  • Automation – Streamline transcription, model updates, and data processing, reducing overhead and accelerating deployment.
  • Synthetic data generation – Generate synthetic voice data to improve model training, even with limited real-world data, enhancing accuracy for niche use cases.
  • Data curation – Clean, manage, and organize training data to ensure high-quality, relevant input, improving model performance.
  • Model hot-swapping – Seamlessly switch between different models to optimize performance for specific tasks.
  • Integrations – Effortlessly integrate Deepgram’s voice AI with cloud platforms, enterprise systems, and third-party applications, embedding it within existing workflows.

With Deepgram, enterprises don’t just get speech-to-speech—they get the most advanced, enterprise-ready voice AI platform, designed for real-world deployment and long-term innovation.

For more information about Deepgram’s novel approach for speech-to-speech, read the technical brief. To learn more about Deepgram’s suite of voice AI infrastructure, visit www.deepgram.com.