After getting the display and worker up and running, I started down the path of training my model for keyword recognition. Right now I've settled on the wake words Hi Smalltalk. After the wake word is detected the model will then detect
My starting point for training the model was the speech_commands tutorials that are part of the TensorFlow project. One of the first things I noticed while planning out this step was the lack of good wake words in the speech commands dataset. There are many voice datasets available online, but many are unlabeled or conversational. Since digging didn't turn up much in the way of open labeled word datasets, I decided to use off from the speech commands dataset, since that gave me a baseline for comparison with my custom words. After recording myself saying smalltalk fewer than ten times, I knew I did not want to generate my own samples at the scale of the other labeled keywords.
Instead of giving up on my wake word combination, I started digging around for options and found an interesting project where somebody had started down the path of generating labeled words with text-to-speech. After reading through the repo, I ended up using espeak and sox to generate my labeled dataset.
The first step was to generate the phonemes for the wake words:
$ espeak -v en -X smalltalk
sm'O:ltO:k
I then stored the phonemes in a words file that will be used by generate.sh:
$ cat words
hi 001 [[h'aI]]
busy 002 [[b'Izi]]
free 003 [[fr'i:]]
smalltalk 004 [[sm'O:ltO:k]]
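Typing each line of the words file by hand works for four words, but the layout is regular enough to script. The following is only a sketch: format_entry is a hypothetical helper that is not part of the project, and the commented espeak call assumes that -q suppresses audio and -x prints just the phoneme mnemonics.

```shell
#!/bin/bash
# Hypothetical helper (not from the original project): build one line of
# the words file from a word, an ID, and raw phoneme text.
format_entry() {
    # Strip newlines and surrounding spaces from the phoneme text.
    phoneme=$(printf '%s' "$3" | tr -d '\n' | sed 's/^ *//; s/ *$//')
    printf '%s %s [[%s]]\n' "$1" "$2" "$phoneme"
}

# Example using the phoneme string from the post, with a leading space
# added to show the trimming.
format_entry smalltalk 004 " sm'O:ltO:k"

# With espeak installed, the phoneme could come straight from the tool
# (assumption: -q silences audio and -x prints only the mnemonics):
#   format_entry smalltalk 004 "$(espeak -q -v en -x smalltalk)"
```

Redirecting the output of several such calls into the words file would reproduce the layout shown above.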
After adapting generate.sh from the spoken command repo (eliminating some extra commands and extending the loop to generate more samples), I had everything I needed to synthetically generate a new labeled word dataset.
#!/bin/bash
# The loop variables below (voice, pitch, speed) are used to vary
# the voices being created from espeak.
lastword=""
cat words | while read -r word wordid phoneme
do
    echo "$word"
    mkdir -p "db/$word"
    # Restart the sample counter whenever a new word begins
    if [[ "$word" != "$lastword" ]]; then
        versionid=0
    fi
    lastword=$word
    # Generate voices with various dialects
    for i in english english-north en-scottish english_rp english_wmids english-us en-westindies
    do
        # Loop changing the pitch in each iteration
        for k in $(seq 1 99)
        do
            # Change the speed in words per minute
            for j in 80 100 120 140 160; do
                echo $versionid "$phoneme" $i $j $k
                echo "$phoneme" | espeak -p $k -s $j -v $i -w "db/$word/$versionid.wav"
                # Set sox options for Tensorflow
                sox "db/$word/$versionid.wav" -b 16 --endian little "db/$word/tf_$versionid.wav" rate 16k
                ((versionid++))
            done
        done
    done
done
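The outer loop of the script depends on how read splits each line of the words file into three fields, with the last variable taking whatever is left on the line. A small sketch of just that step, using a hypothetical parse_words function and sample lines piped in instead of the real file:

```shell
#!/bin/bash
# Hypothetical demo: split "word id phoneme" lines the same way the
# generation script's read loop does, printing the fields separated by "|".
parse_words() {
    while read -r word wordid phoneme; do
        # Skip blank lines so a trailing newline in the file is harmless.
        if [ -z "$word" ]; then continue; fi
        printf '%s|%s|%s\n' "$word" "$wordid" "$phoneme"
    done
}

printf '%s\n' "hi 001 [[h'aI]]" "smalltalk 004 [[sm'O:ltO:k]]" | parse_words
```

Running it against the real file would be `parse_words < words`, which makes for a quick check that every line has all three fields before kicking off a long generation run.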
After the run I have samples and labels with a volume comparable to the other words provided by Google. The pitch, speed, and tone of voice change with each loop, which will hopefully provide enough variety to make this dataset useful in training. Even if this doesn't work out, learning about espeak and sox was interesting. I've already got some future ideas on how to use those. If it does work, the ability to generate training data on demand seems incredibly useful.
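The sample volume follows directly from the loop bounds in the script: voices times pitch steps times speeds. Here is a rough sketch of that arithmetic plus a per-word file count; count_items and count_tf_wavs are hypothetical helpers, and the db/ layout is assumed to match what the generation script writes.

```shell
#!/bin/bash
# Loop values copied from the generation script.
voices="english english-north en-scottish english_rp english_wmids english-us en-westindies"
speeds="80 100 120 140 160"
pitch_min=1
pitch_max=99

# Count whitespace-separated items (hypothetical helper).
count_items() { echo $#; }

n_voices=$(count_items $voices)
n_speeds=$(count_items $speeds)
n_pitches=$((pitch_max - pitch_min + 1))
expected=$((n_voices * n_pitches * n_speeds))
echo "expected samples per word: $expected"

# Count the converted tf_*.wav files in one word directory.
count_tf_wavs() {
    n=0
    for f in "$1"/tf_*.wav; do
        if [ -e "$f" ]; then n=$((n + 1)); fi
    done
    echo "$n"
}

# Report per-word counts if the db/ directory from the script exists.
if [ -d db ]; then
    for dir in db/*/; do
        echo "${dir%/}: $(count_tf_wavs "${dir%/}") of $expected"
    done
fi
```

With 7 voices, 99 pitch values, and 5 speeds, this works out to 3465 samples per word. A directory that reports fewer would point to espeak or sox failing on some combination.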
Next up: training the model and loading it onto the ESP-EYE. The code, docs, images, etc. for the project can be found here, and I'll be posting updates as I continue along to HackadayIO and this blog. If you have any questions or ideas, reach out.