This is the 20th project of Wes Bos's JS30 series. To see the whole 30 part series, click here. Today we learn how to use the Speech Recognition built into the browser: speech to text in real time.
Video -
The starter files -
What we have to do is transcribe the speech in real time using the speech recognition API and, once the user has finished speaking a sentence (or rather, has paused), append a new <p> tag to the div.words and start listening for the next time the user speaks.
The JS we have
window.SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition
The SpeechRecognition interface of the Web Speech API is the controller interface for the recognition service; it also handles the SpeechRecognitionEvent sent from the recognition service. This is still an experimental technology. In Chrome you need the prefixed window.webkitSpeechRecognition, which is why the line above falls back to it when window.SpeechRecognition is not defined.
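Since the API is experimental, it may be worth checking that the constructor exists before using it. The following is a minimal sketch of that check; the helper name getSpeechRecognition is hypothetical and not part of the original project.

```javascript
// Hypothetical helper: returns the SpeechRecognition constructor found on the
// given global object (prefixed or unprefixed), or null if neither exists.
function getSpeechRecognition(globalObj) {
  return globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition || null;
}

// In a browser, pass window. Outside a browser there is nothing to detect.
if (typeof window !== 'undefined') {
  const Recognition = getSpeechRecognition(window);
  if (!Recognition) {
    console.warn('Speech recognition is not supported in this browser.');
  }
}
```

With a check like this the page can show a fallback message instead of throwing when new SpeechRecognition() is called on an unsupported browser.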
Create the speech recognition object and set the language. The interimResults property controls whether interim results should be returned (true) or not (false). Interim results are results that are not yet final. Setting it to true gives the whole app a real-time feel (i.e. text appears as you speak).
const recognition = new SpeechRecognition();
recognition.interimResults = true;
recognition.lang = 'en-US';
Append a paragraph to the div.words; this is where the converted text goes initially.
let p = document.createElement('p');
const words = document.querySelector('.words');
words.appendChild(p);
Now we add the event handler to get notified of results. The onresult property of the SpeechRecognition interface represents an event handler that runs when the speech recognition service returns a result: a word or phrase has been positively recognized and this has been communicated back to the app (this is when the result event fires).
Since we've set recognition.interimResults = true, we'll get a bunch of intermediate values as well, and we'll update the screen on each of them to give a real-time feel. Each result carries an isFinal property: it is false for interim results and true once the service has settled on the text, which is the final result we see on screen.
recognition.addEventListener('result', e => {
console.log(e)
});
The event e is of type SpeechRecognitionEvent. The e.results property is a SpeechRecognitionResultList object, which contains SpeechRecognitionResult objects. e.results can be accessed like an array, so e.results[0] returns the SpeechRecognitionResult at position 0.
Each SpeechRecognitionResult object contains a list of SpeechRecognitionAlternative objects that contain individual results. A SpeechRecognitionResult can also be accessed like an array. So e.results[0][0] returns the SpeechRecognitionAlternative at position 0 of the SpeechRecognitionResult at position 0 of the SpeechRecognitionResultList (viz. e.results). We then read the transcript property of the SpeechRecognitionAlternative object. What we want is e.results[0][0].transcript.
I know it's a bit confusing, but this is how it is! You can learn more about speech recognition @ MDN Docs.
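To make the nesting concrete, here is a small sketch that uses a hand-built stand-in for the event. The mockEvent object, its transcripts and confidence values, and the two helper functions are all made up for illustration; a real SpeechRecognitionEvent comes from the browser and its lists are array-like rather than true arrays.

```javascript
// Hypothetical stand-in for a SpeechRecognitionEvent: plain arrays and objects
// shaped like the nesting described above (list -> result -> alternative).
const mockEvent = {
  results: [
    // One result holding two alternatives, plus the isFinal flag
    Object.assign(
      [
        { transcript: 'hello world', confidence: 0.92 },
        { transcript: 'hollow world', confidence: 0.41 },
      ],
      { isFinal: true }
    ),
  ],
};

// First alternative of the first result, i.e. e.results[0][0].transcript
function firstTranscript(e) {
  return e.results[0][0].transcript;
}

// Hypothetical variant: join the first alternative of every result in the list.
// Array.from is used because the real list is array-like, not an Array.
function fullTranscript(e) {
  return Array.from(e.results)
    .map(result => result[0].transcript)
    .join('');
}

console.log(firstTranscript(mockEvent));
console.log(fullTranscript(mockEvent));
```

The second helper is only one possible design: it is useful if the list ever holds more than one result, whereas the post itself reads just the first one.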
So now filling in the event listener correctly -
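recognition.addEventListener('result', e => {
  // First alternative of the first result
  const transcript = e.results[0][0].transcript;
  // Show the text recognized so far, interim results included
  p.textContent = transcript;
  if (e.results[0].isFinal) {
    // The user paused: later speech goes into a new paragraph
    p = document.createElement('p');
    words.appendChild(p);
  }
});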
If the first SpeechRecognitionResult has its isFinal property set to true, we know that the user has stopped speaking and the API is done transcribing. We create a new <p> element and add it to the div.words. Anything spoken from now on goes into the new paragraph.
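The paragraph-switching logic can also be looked at without the DOM. In this sketch an array of strings stands in for the <p> elements inside div.words; the function applyResult and the two sample events are hypothetical and exist only to show how interim and final results are treated differently.

```javascript
// DOM-free model of the handler: the last entry of `paragraphs` plays the
// role of the current <p> element.
function applyResult(paragraphs, e) {
  const result = e.results[0];
  // Overwrite the current paragraph with the latest transcript
  paragraphs[paragraphs.length - 1] = result[0].transcript;
  // A final result closes the paragraph; later speech goes into a fresh one
  if (result.isFinal) {
    paragraphs.push('');
  }
  return paragraphs;
}

// Hypothetical events shaped like the ones described above
const interimEvent = {
  results: [Object.assign([{ transcript: 'hello wor' }], { isFinal: false })],
};
const finalEvent = {
  results: [Object.assign([{ transcript: 'hello world' }], { isFinal: true })],
};

const paragraphs = [''];
applyResult(paragraphs, interimEvent); // current paragraph shows the interim text
applyResult(paragraphs, finalEvent);   // text is finalized and a new paragraph opens
console.log(paragraphs);
```

Interim results keep overwriting the same entry, which is what produces the text-appears-as-you-speak effect, and only a final result moves on to the next paragraph.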
Finally we start the speech recognition on page load
recognition.start();
recognition.addEventListener('end', recognition.start);
recognition.start() starts the speech recognition service listening to incoming audio with intent to recognize grammars associated with the current SpeechRecognition. Once the user pauses and the API finishes converting speech to text, an end event is triggered on the recognition object. We pass recognition.start as the handler for that event to ask the browser to start listening again (otherwise the browser stops listening, and speech to text happens only once).
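The restart-on-end pattern can be wrapped in a small function. This is a sketch only: keepListening is a hypothetical name, and it takes the recognition object as a parameter so the same logic can be tried against any object that has start() and addEventListener().

```javascript
// Hypothetical helper: start recognition and restart it every time it ends.
function keepListening(recognition) {
  // The arrow function calls start() on the recognition object itself and
  // does not forward the event object as an argument.
  recognition.addEventListener('end', () => recognition.start());
  recognition.start();
}

// In the page this would be called once with the object created earlier:
// keepListening(recognition);
```

Passing recognition.start directly, as the post does, has the same effect here; the wrapper just makes it explicit which object start() is called on.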
That is all for this small experiment. Final code -
See the Pen JS30-20-speech-b by Deepak Karki (@deepakkarki) on CodePen.