Ashley Sheridan​

Teaching Children with the Speech API

My eldest son is in the process of learning the alphabet, so I thought I could put my skills to use by helping him learn using two of the things he loves most: mobile devices and Mario! Like all 3-year-olds, he loves technology, and we use it as an incentive for many things, from general good behaviour to potty training. I've been incredibly impressed with how well he's taken to using a Switch and a Raspberry Pi running some retro Nintendo games, completing game levels and bosses with an ease that surprised me. As parents brought up in Nintendo households, my partner and I are thrilled that he loves Nintendo as much as we do, and you can't have that without a love of Mario. He has the Mario room decor, the toys, and every night-time story has to feature Mario, Bowser, & Daisy (he does tell me I'm doing bedtime stories wrong if I don't!)

I wanted to find a way to help him learn that he could find more fun, by making something he could connect with on a more personal level. Obviously, this led me to create a web app. I was looking to use the Web Speech Synthesis API, which lets you use JavaScript to read out text using a text-to-speech engine supplied by the operating system. While support for this is actually pretty good, I was only concerned with mobile phone support, as that was my target. I was quite impressed with how well it worked, and how nice it sounded, although some phonetic nudging was required in places (if you've ever used a screen reader, you will understand), as some of the words I was passing in were unusual proper nouns. I'm quite happy with the work-in-progress [letter learning app]() and my son is impressed too. It's helped him make a fun connection between learning and something he really enjoys.

The initial screen from the learn letters app

The completed letter learning app is available here, although it's only been tested so far across a few phones, and isn't yet fully responsive, so it won't look great on tablets and desktops.

Building the Slides

I built each screen like a very basic slide in a presentation. Each slide contains a button which references the next slide to show (which makes it easy for the end slide to loop back to the start) using a `data-next-slide=""` attribute. This makes it a lot easier to assign a single event handler to all such buttons, rather than set up something custom for each slide. Throughout, I wanted to keep things as generic as possible while still seeming individual and tailored, so details like the background colour loop through a series of colours:

```scss
.screen.letter-screen {
    align-items: baseline;
    justify-content: left;

    &:nth-child(5n+1) {
        background-color: lighten($letter-red, 20%);
    }

    &:nth-child(5n+2) {
        background-color: lighten($letter-green, 20%);
    }

    &:nth-child(5n+3) {
        background-color: lighten($letter-cyan, 20%);
    }

    &:nth-child(5n+4) {
        background-color: lighten($letter-yellow, 20%);
    }

    &:nth-child(5n+5) {
        background-color: lighten($letter-pink, 20%);
    }
}
```

The Sass here references the same set of colours I used for the individual letters on the initial introduction slide, to mirror the uniquely coloured letter effect seen in various Mario games.
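That single shared button handler can be sketched roughly as follows. This is my own hypothetical sketch rather than the app's actual code: it assumes each slide element has an `id` matching the name in the button's `data-next-slide` attribute, and an `active` class controlling which slide is visible.

```javascript
// Hypothetical sketch of one shared "next slide" handler, assuming each
// slide element's id matches a data-next-slide value and visibility is
// toggled with an "active" class.
function nextSlideName(button) {
    // Pure helper: read the target slide's name from the button.
    return button.dataset.nextSlide;
}

if (typeof document !== 'undefined') {
    document.querySelectorAll('button[data-next-slide]').forEach(function (button) {
        button.addEventListener('click', function () {
            var current = button.closest('.screen');
            var next = document.getElementById(nextSlideName(button));

            current.classList.remove('active');
            next.classList.add('active');
        });
    });
}
```

Because the last slide's button simply names the first slide, looping back to the start falls out of the same handler for free.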

Each letter slide is very simple, and just contains the letter (in upper and lowercase), a learning phrase (like M is for Mario), and an image to help as a visual aid. The image part was probably the most difficult, just because of how time-consuming it was. In order to ensure I wasn't encroaching on anybody's copyright (even though this isn't something I'm trying to sell or claim is in any way endorsed by Nintendo), I needed to draw out each image myself. Each graphic was painstakingly drawn by hand in Inkscape (no automated tracing) using other images as a reference. This gave me great (albeit quite flawed, though less noticeable on phone screens) SVG images to use on each screen, ones that I could manipulate more easily however I needed in the future. Drawing these actually took far longer than writing any of the code involved in this little web app!

I used PHP in the backend to pull in the SVG code for each letter slide (where the file existed), which helped keep my real file system logical and easy to manage, rather than manually embedding all 26 SVG images. As I finished each SVG, it would automatically be added to the app and shown on its slide:

```php
$this->pwd = getcwd();

if(!file_exists("$this->pwd/$filename"))
    return '';

return file_get_contents("$this->pwd/$filename");
```

I wasn't overly concerned here about memory issues reading in the files, as I knew the SVGs would all be fairly small.


The Speech Synthesis API is surprisingly simple to use and has very few parts. At its core is the voice, which is typically one of many offered by the operating system. For my purposes, knowing which devices this would be used on, and that they were all set to British English, I could assume at least one compatible voice would always be available; failing that, I fell back to the first voice in the list, which is typically a good match for the current locale:

```javascript
var voices = L.synth.getVoices();

for(var i = 0; i < voices.length; i++) {
    if(voices[i].lang === 'en-GB') {
        L.voice = voices[i];
        break;
    }
}

if(!L.voice) {
    L.voice = voices[0];
}
```

The exact voice this offers differs from machine to machine, but it sounded brilliant on my Android phone and in Firefox on Windows, where I did my initial development. I haven't tested beyond this, so other device and browser/operating system combinations may give quite different mileage.

After selecting the voice, it was just a matter of picking the right pitch and rate of speech. For the individual letters, a slow voice worked best, as it helped focus attention on the letter and wasn't spoken so quickly that it was easily missed. However, when reading the description associated with a letter, a quicker rate was needed, as it put the letter into the context of more natural speech. I found a rate of `0.5` worked well for speaking the letters, sped up slightly to `0.75` for speaking the letter's usage in a sentence.

```javascript
L.saySomething = function(textToSpeak, rate) {
    var utterance = new SpeechSynthesisUtterance(textToSpeak);
    utterance.voice = L.voice;
    utterance.pitch = L.pitch;
    utterance.rate = rate;
    L.synth.speak(utterance);
};
```

There were some problems with the pronunciation of certain words, specifically proper nouns which differ slightly from traditional English pronunciation. To get around this, I added phonetic override sentences that aren't displayed, but that pronounce things as I intended. Unfortunately, this didn't quite stretch to the phonics system now taught in schools and nurseries, and I couldn't find a way to get the Speech API to make those sounds. For example, the speech synthesis pronounced "Koopa" as if it had an "s" before the "p". Where I noticed these anomalies, I added a more phonetic override:

```php
// public function __construct($description, $phoneticDescription, $nextSlideName) {...}
'k' => new LetterSlide('K is for Koopa', 'K is for Koo-pa', 'letter-l'),
```

I did run into a slight race condition where the first attempt to speak anything didn't work. This seemed to occur when the API was asked to speak before it had finished setting the voice. I worked around it by immediately speaking a single space (i.e. nothing), which seemed to fix the problem.
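A minimal sketch of that warm-up, with the speech engine and utterance constructor passed in so the idea stands alone (the function name here is my own, not from the app):

```javascript
// Hypothetical warm-up sketch: speaking a single space (effectively
// silence) nudges the engine to initialise before the first real use.
function warmUpSpeech(synth, Utterance) {
    var utterance = new Utterance(' ');
    synth.speak(utterance);
}

// In the browser this would be called once on load, e.g.:
// warmUpSpeech(window.speechSynthesis, SpeechSynthesisUtterance);
```

Another option would be to listen for the engine's `voiceschanged` event before picking a voice, since `getVoices()` can return an empty list until the voices have finished loading.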

Triggering the Speech

Obviously, all of this does nothing unless something actually triggers the app to read text out. I added event handlers to the letter and the description:

```javascript
var letters = document.querySelectorAll('.screen.letter-screen .letter');

letters.forEach((letter) => {
    // cast to lowercase to fix an issue in iOS which reads it out as "Capital A"
    var letterToSpeak = letter.textContent.substring(0, 1).toLowerCase();

    letter.addEventListener('click', function(event) {
        L.speakLetter(letterToSpeak);
    });
});

var descriptions = document.querySelectorAll('.screen .letter-description');

descriptions.forEach((description) => {
    var textToSpeak = description.getAttribute('data-speech-content');

    description.addEventListener('click', function(event) {
        L.speakLetterDescription(textToSpeak);
    });
});
```

These call slightly different methods, because a letter needs a different rate of speech from a sentence; both ultimately call a single common method that does the speaking.

```javascript
L.speakLetter = function(letterToSpeak) {
    L.saySomething(letterToSpeak, L.letterRate);
};

L.speakLetterDescription = function(descriptionToSpeak) {
    L.saySomething(descriptionToSpeak, L.sentenceRate);
};
```

The Reception

My son loves it, and I can already see him understanding the letters, and recognising them before the app reads them out. Seeing how well it works, and how happy he is to learn like this makes the weeks of effort absolutely worth it!
