@maddisinhogarth

thank you so much, this will really help with my passion project. (i am so happy right now, i've been looking for something like this for days) love you <3 <3

@geopopos

I built a voice bot with this exact setup, streaming and all, quite a few months ago, and the biggest issue was the latency! So excited to see this is no longer a problem!

@matten_zero

Deepgram is the most slept on AI player.

@StuartJ

I purchased a ReSpeaker Mic Array v2.0 for this purpose. It captures speech with great clarity and works out of the box on Linux, so it should be possible to build a standalone voice assistant with it.

@theflipbit01

I experimented with integrating Groq with a Siri Shortcut, and it was quite interesting. The response time was pretty impressive.

@avi7278

This is exactly what I wanted to do this weekend. Great timing.

@stevecoxiscool

Great heads up on new STT/LLM/TTS technology. I had been working on an Unreal MetaHuman demo that got pretty close to real time using Google STT/ChatGPT/TTS. Another thing to think about, if one wants to get into 2D/3D talking-head chat apps, is streaming back viseme data as well as TTS audio, plus maybe emotion tokens of some kind. I can't wait for all of this to be offered on one platform/API service.

@balubalaji9956

this is exactly what i am looking for.
thank you youtube.
love you lots

@clumsymoe

Thank you for creating this tutorial, it's exactly what I was looking for. Great content!

@Slimshady68356

Thanks Greg, you do a lot for the community. I have respect for the work you did on semantic chunking in the LangChain repo.

@IvarDaigon

FYI, you don't need to use an API for speech-to-text or TTS: both can be run locally, using Faster Whisper for speech-to-text and Coqui for TTS, even if you don't have the world's most expensive GPU, because both of those only use a couple of GB of video RAM. Going forward, on-device will be the way to go for TTS and STT because they simply do not require that much processing power.
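
For example, a minimal local pipeline along those lines, assuming the faster-whisper and Coqui TTS Python packages (the model names here are just examples):

    # pip install faster-whisper TTS
    from faster_whisper import WhisperModel
    from TTS.api import TTS

    # Speech-to-text: a small Whisper model runs in a couple of GB of VRAM
    stt = WhisperModel("small", device="cuda", compute_type="float16")
    segments, _info = stt.transcribe("question.wav")
    text = " ".join(seg.text.strip() for seg in segments)
    print("Heard:", text)

    # Text-to-speech with a Coqui model (example model name)
    tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text=f"You said: {text}", file_path="reply.wav")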

@xXWillyxWonkaXx

So in a nutshell, streaming is basically using a websocket - you get chunks of the text to analyze as they arrive, rather than waiting for the whole response to be sent over?
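
i.e., something like this sketch, where processing starts as soon as chunks arrive (using the websockets package; the URL and message format are placeholders):

    # pip install websockets
    import asyncio
    import websockets

    async def stream():
        # wss://example.com/stream is a placeholder endpoint; real
        # services define their own message protocol
        async with websockets.connect("wss://example.com/stream") as ws:
            await ws.send("start")
            # chunks arrive as soon as the server produces them, so you
            # can start processing before the full response exists
            async for chunk in ws:
                print("got chunk:", chunk)

    asyncio.run(stream())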

@abhijoy.sarkar

I think you can get even faster using audio-to-audio models like Ultravox.

@ajaykumarreddy8841

Hi Greg. Great video! Thanks for sharing.

But I have some issues when running the code:

Firstly, the speech-to-text performance is not very good. I literally have to shout into my mic for it to hear me. I thought it was a microphone issue, but when I tested the mic with a simple voice recorder it worked as expected.

Secondly, the text-to-speech output keeps breaking up. Not sure if that is expected because of ffplay, but it definitely wasn't as smooth as what you showed in the video.

Thirdly, my voice input is not recognized immediately after a response. There is a small but noticeable delay between hearing the response and being able to speak again, even though the console says "Listening". I have to wait 5-10 seconds before speaking or the program doesn't pick up my voice.

Is anyone else facing the same issue?

@106rutvik

Hi Greg, Rutvik here. We have created something similar to this, but using GPT as the LLM and ElevenLabs as the TTS. We are facing issues with silence detection with Deepgram. I know you mentioned in your video at 3:53 that we need to make sure we don't talk too slowly, and unfortunately Deepgram only has a maximum value of 500 ms for endpointing (silence detection). Can you confirm whether we are using the proper configuration with Deepgram? Here is what we use:

    options = {
        'punctuate': True,
        'interim_results': True,
        'language': 'en-US',
        'channels': 1,
        'sample_rate': 16000,
        'model': 'nova-2-conversationalai',
        'endpointing': 500,
    }
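
And roughly how we wire it up, assuming the v3 Python SDK (the exact calls are an assumption and may differ on other SDK versions):

    # Sketch: deepgram-sdk v3 equivalent of the options dict above
    from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

    deepgram = DeepgramClient("YOUR_DEEPGRAM_API_KEY")
    conn = deepgram.listen.live.v("1")

    def on_transcript(client, result, **kwargs):
        # fires on every interim/final result
        print(result.channel.alternatives[0].transcript)

    conn.on(LiveTranscriptionEvents.Transcript, on_transcript)
    conn.start(LiveOptions(
        model="nova-2-conversationalai",
        language="en-US",
        punctuate=True,
        interim_results=True,
        channels=1,
        sample_rate=16000,
        endpointing=500,  # ms of silence before an utterance is finalized
    ))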

@HideousSlots

conversational endpointing is a great idea, but I'd like to see it combined with a small model agent that only looks for breaks in the conversation and an appropriate time to interject, maybe with a crude scale for the length of the response. So if the user pauses mid-point, we don't want them interrupted and the conversation moved on - a simple acknowledgement would be more appropriate. But once the point is complete, we pass back that we want a longer response.
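
Something like this sketch of the loop I have in mind (classify_pause is a crude heuristic standing in for the small model, and speak/generate_reply are hypothetical app hooks):

    from enum import Enum

    class Action(Enum):
        WAIT = 1          # user is mid-point: say nothing
        ACKNOWLEDGE = 2   # short "mm-hmm" style interjection
        RESPOND = 3       # point is complete: full reply

    # stand-in for the small model: pause length + trailing punctuation
    def classify_pause(transcript: str, pause_ms: int) -> Action:
        if pause_ms < 400:
            return Action.WAIT
        if not transcript.rstrip().endswith((".", "?", "!")):
            return Action.ACKNOWLEDGE  # trailing clause, likely mid-point
        return Action.RESPOND

    def on_pause(transcript: str, pause_ms: int) -> None:
        action = classify_pause(transcript, pause_ms)
        if action is Action.ACKNOWLEDGE:
            speak("mm-hmm")                    # speak() = the app's TTS hook
        elif action is Action.RESPOND:
            speak(generate_reply(transcript))  # generate_reply() = the LLM call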

@michielsmissaert

Great stuff! Did you cut the video to shorten the wait for the LLM response? If you didn't, the speed is impressive! Thank you so much!

@hjoseph777

Can that be installed locally? Can you provide more detail on how to do the first step? The transcription is extremely fast.

@andrewtschesnok5582

Nice. But in reality your demo takes 3,500-4,000 ms from when you stop speaking to when you get a response. It doesn't match the numbers you are printing...

@damien2198

I suppose the Whisper tiny model would be faster than Deepgram? Have you tried it?