Brando Koch

ML Engineer & CEO privatesynapse.ai

Releasing PrivWhisper

We are releasing PrivWhisper: a state-of-the-art speech and speaker recognition solution deployable privately in one click on AWS.

Enterprise-grade Speech Recognition

PrivWhisper is not just a speech recognition model; it is a complete speech recognition solution deployable within your own AWS environment. Companies are no longer forced to send their sensitive speech data to third-party APIs to obtain amazing speech and speaker recognition performance.

PrivWhisper is built on top of Whisper, a speech recognition architecture developed by OpenAI. We optimized it for state-of-the-art speaker recognition and faster inference, which lowered the cost of hosting on AWS to roughly $0.05 per transcribed audio hour. It includes:

  • Fully private AWS deployment
  • State-of-the-art speech and speaker recognition
  • Lowest cost per audio hour on the market

Speech recognition market analysis

True democratization of AI isn't only about technology; it's about accessibility. Before developing PrivWhisper, we performed a market analysis to identify where companies struggle when developing internal speech recognition solutions or using third-party ones. We identified three key areas, summarized by the following three questions:

How expensive is speech recognition?

The expense of implementing speech recognition technology depends on whether a company opts for a third-party speech recognition API or decides to develop its own system. Using a third-party service eliminates development costs but can compromise privacy and carries a relatively high usage cost, averaging around $0.60 per transcribed hour.

Building a fully private speech recognition solution requires a different approach. First, a company developing its own internal speech recognition solution has to incur the development cost along with the risk of not meeting the required quality standard. Second, the company needs to reach a low enough inference cost per audio hour to make the solution viable. Costs for developing an enterprise-grade speech recognition solution start at around $500,000.

There are very few options on the market for a fully private solution without the assumed development cost and with low usage cost. PrivWhisper is a scalable and cost-efficient solution that meets that need. The table below showcases the inference cost of PrivWhisper compared to other popular solutions.

PrivWhisper offers a 100% reduction in development cost and up to a 95% reduction in inference cost.

Speech Recognition Provider Comparison Based on Cost (Feb. 2024)

Provider                                Cost per audio hour
AWS Transcribe Standard Batch           $1.44
Gladia                                  $0.61
AssemblyAI                              $0.37
OpenAI Whisper                          $0.36
Deepgram Pre-Recorded Whisper Large     $0.29
Deepgram Pre-Recorded Nova-2            $0.26
PrivWhisper                             $0.05

For a more concrete example, if you are on AWS and use the native AWS transcription service (e.g. AWS Transcribe Standard Batch) to process 4,000 hours of audio a month, switching to PrivWhisper would decrease your cost from $5,760 to $200 a month.
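The arithmetic behind this example can be reproduced for any provider and monthly volume. The snippet below is a minimal sketch that uses only the per-hour prices from the comparison table; the function names (`monthly_cost`, `monthly_savings`) are our own illustrative helpers, not part of any PrivWhisper tooling.

```python
# Prices per audio hour in USD, copied from the comparison table (Feb. 2024).
COST_PER_AUDIO_HOUR = {
    "AWS Transcribe Standard Batch": 1.44,
    "Gladia": 0.61,
    "AssemblyAI": 0.37,
    "OpenAI Whisper": 0.36,
    "Deepgram Pre-Recorded Whisper Large": 0.29,
    "Deepgram Pre-Recorded Nova-2": 0.26,
    "PrivWhisper": 0.05,
}


def monthly_cost(provider: str, audio_hours: float) -> float:
    """Monthly inference cost in USD, rounded to cents."""
    return round(COST_PER_AUDIO_HOUR[provider] * audio_hours, 2)


def monthly_savings(current: str, audio_hours: float,
                    target: str = "PrivWhisper") -> float:
    """Monthly difference in USD between the current provider and the target."""
    return round(monthly_cost(current, audio_hours)
                 - monthly_cost(target, audio_hours), 2)


if __name__ == "__main__":
    hours = 4_000  # monthly audio volume from the example above
    for provider in COST_PER_AUDIO_HOUR:
        print(f"{provider:<38} ${monthly_cost(provider, hours):>9,.2f}")
    print("Savings vs. AWS Transcribe Standard Batch:",
          f"${monthly_savings('AWS Transcribe Standard Batch', hours):,.2f}")
```

Only inference cost is modeled here; development, deployment, and support costs are outside the scope of this sketch.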

How hard is it to deploy speech recognition?

Deploying and managing a speech recognition model involves more than developing it; it requires dedicated DevOps teams as well as AI experts. Companies, especially large enterprises, have specific needs at this stage, including high bandwidth, scalability, and strong encryption. However, we haven't found any automatic speech recognition solutions that fully meet these requirements.

PrivWhisper fills this gap with a complete deployment infrastructure package designed for scalability, powered by Terraform. Our engineers take care of the deployment, so companies face no cost to deploy the solution.

How to get better speaker recognition?

When it comes to processing speech involving multiple speakers, the primary goal of speaker recognition is to accurately assign text to the correct speakers. The clarity of a speech transcript from a dialogue significantly diminishes without precise identification of who said what. This challenge affects not only human readers of the transcript but also the performance of Generative AI models. Our research has identified poor speaker recognition as a major obstacle in AI-driven question-answering and text summarization tasks that rely on speech data.

Our evaluation of models from leading speech recognition providers revealed a common issue: all struggled with accurate speaker recognition. To address this challenge, we trained a cutting-edge multi-scale diarization module and equipped PrivWhisper with it, significantly enhancing the accuracy of speaker identification and recognition.
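To make the task of assigning text to the correct speakers more concrete, the sketch below labels each transcribed segment with the speaker whose turn overlaps it the most in time. This is a generic illustration, not PrivWhisper's multi-scale diarization module; the `Segment` and `SpeakerTurn` types, the `assign_speakers` function, and the speaker labels are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A transcribed piece of audio with start/end times in seconds."""
    start: float
    end: float
    text: str


@dataclass
class SpeakerTurn:
    """A span of audio attributed to one speaker by a diarization step."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each segment with the speaker whose turn overlaps it the most.

    Segments that overlap no turn at all are labeled with `unknown`.
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg.text))
    return labeled


# Made-up example data for illustration only.
segments = [
    Segment(0.0, 2.0, "Hello, how can I help?"),
    Segment(2.1, 4.0, "I have a billing question."),
    Segment(10.0, 11.0, "Thanks."),
]
turns = [
    SpeakerTurn(0.0, 2.05, "SPEAKER_00"),
    SpeakerTurn(2.05, 4.5, "SPEAKER_01"),
]
result = assign_speakers(segments, turns)
```

When the speaker turns are inaccurate, every segment that straddles a wrong boundary inherits the wrong label, which is why diarization quality directly affects downstream summarization and question answering.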

PrivWhisper Availability

PrivWhisper is now available for businesses seeking to revolutionize their speech and speaker recognition capabilities on AWS. We offer native support for AWS deployment with an average turnaround time of 2 weeks. On-prem or hybrid deployment is available upon request.