AIY Vision Kit and Mozilla DeepSpeech

Hey everyone!

I was just browsing through The MagPi magazine and found this.

Apparently the great minds at Google have figured out how to fit a vision processing chip onto a Raspberry Pi Zero bonnet (the Zero family's equivalent of a HAT). The chip works as a neural network accelerator, similar to the Movidius Neural Compute Stick, but it's easier to use for vision processing since Google has already released tutorials showing how. One of the networks they showed off can distinguish between human facial emotions. Imagine that, InMoov could tell when you're sad or happy! :)

Another thing I found is that Mozilla has released the first version of DeepSpeech, their local speech recognition system. From what I've seen, accuracy is pretty good (much better than Sphinx), and it runs locally as a TensorFlow neural network instead of calling out to Google's speech API. The only downside is that DeepSpeech can't yet be used directly with a microphone, because it can't tell when a sentence has ended. It might be possible to work around this with a secondary system that detects when somebody is talking (i.e. it just notices speech, without trying to transcribe it) and records audio from the moment speech starts until the speaker stops. The recording could then be sent to DeepSpeech for transcription, and the resulting text sent back to MRL.
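
To make the "record from when speech starts until the speaker stops" idea a bit more concrete, here's a rough sketch in Python. It's only an illustration under my own assumptions: a plain loudness threshold stands in for a real voice activity detector, the audio is assumed to arrive as frames of 16-bit mono samples, and the names `frame_rms` and `extract_utterance`, plus the threshold and silence values, are made up for the example. Nothing here is part of DeepSpeech or MRL.

```python
import math
from array import array


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of 16-bit samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def extract_utterance(frames, threshold=500.0, max_silence_frames=10):
    """Collect samples from the first loud frame until the speaker stops.

    "Stops" is assumed to mean max_silence_frames quiet frames in a row.
    The trailing quiet frames are kept in the result.
    Returns an empty array if no frame ever reaches the threshold.
    """
    recorded = array("h")
    silence_run = 0
    recording = False
    for frame in frames:
        loud = frame_rms(frame) >= threshold
        if not recording:
            if loud:
                # Speech detected: start recording with this frame.
                recording = True
                recorded.extend(frame)
            continue
        recorded.extend(frame)
        # A loud frame resets the silence counter; a quiet one extends it.
        silence_run = 0 if loud else silence_run + 1
        if silence_run >= max_silence_frames:
            break
    return recorded
```

The returned samples would then be what gets handed to DeepSpeech for transcription. A fixed threshold like this would probably struggle with background noise, so a proper voice activity detector would likely be needed in practice.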

The secondary system could start recording when a wakeword engine fires, like the ones Amazon's Alexa and Google Assistant use, and perhaps use a speaker identification engine to detect the end of speech. The speaker ID engine could also let MRL pick out certain people by their voices.
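
The wakeword-then-record flow could be sketched as a tiny state machine. Again, this is only an illustration: `UtteranceRecorder` and the two detector callables are hypothetical placeholders that I made up, standing in for whatever wakeword engine and end-of-speech detector (speaker ID based or otherwise) actually gets used.

```python
from enum import Enum


class State(Enum):
    IDLE = 1
    RECORDING = 2


class UtteranceRecorder:
    """Collects frames between a wakeword hit and an end-of-speech signal."""

    def __init__(self, is_wake_word, is_end_of_speech):
        # Both detectors are hypothetical callables taking a frame and
        # returning True or False.
        self.is_wake_word = is_wake_word
        self.is_end_of_speech = is_end_of_speech
        self.state = State.IDLE
        self.buffer = []

    def feed(self, frame):
        """Feed one frame; return the finished utterance (list of frames) or None."""
        if self.state is State.IDLE:
            if self.is_wake_word(frame):
                # The wakeword frame itself is not recorded.
                self.state = State.RECORDING
                self.buffer = []
            return None
        if self.is_end_of_speech(frame):
            # Hand back what was recorded and go back to waiting.
            utterance, self.buffer = self.buffer, []
            self.state = State.IDLE
            return utterance
        self.buffer.append(frame)
        return None
```

Each finished utterance returned by `feed` would be the chunk of audio to pass on for transcription.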

This is just me spitballing ideas here. We'd have to wait until after Manticore, when GroG makes the changes to the messaging system so that mrlpy can be used reliably, since all of this requires native access.
