Cheap and Easy DIY Speech Recognition For Your Home Made Starship

So You're Making a Starship?

You're gonna need a computer...

(What a computer might look like)

So You're Making a Starship?

You're gonna need a computer

(What a cheaper, easier computer might look like)

Disclaimer: If you haven't figured it out already, this is all more of a very cheesy hack than real AI or NLP.

Three parts to a Starship Computer:
  • Speech Synthesis

  • Speech Recognition

  • "Natural Language Processing"

  • Speech Synthesis - Cheap and easy:
  • espeak

  • espeak 'I'\''m afraid I can'\''t do that dave'

  • pico2wave

  • pico2wave -w /tmp/zzz.wav 'I'\''m afraid I can'\''t do that dave'; aplay /tmp/zzz.wav

  • pico2wave -l en-GB -w /tmp/zzz.wav 'I'\''m afraid I can'\''t do that dave'; aplay /tmp/zzz.wav

  • Speech Synthesis - Cheap and easy:
  • To serialize speech, so the computer doesn't talk over itself by saying several things at the same time, a simple "mkdir"-based lock is used:

  • text_to_speech.sh
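
A minimal sketch of a "mkdir"-based lock, not the actual contents of text_to_speech.sh. The names LOCK_DIR, TTS_CMD and say_serialized are made up for illustration, and the lock path is an assumption. It relies on mkdir either creating the directory or failing, so only one caller at a time gets past the loop:

```shell
#!/bin/sh
# Hypothetical names and paths, chosen for this sketch only.
LOCK_DIR="${LOCK_DIR:-/tmp/tts-lock.d}"
TTS_CMD="${TTS_CMD:-espeak}"

say_serialized() {
    # mkdir fails if the directory already exists, so whoever creates it
    # holds the lock; everyone else polls until it disappears.
    until mkdir "$LOCK_DIR" 2>/dev/null; do
        sleep 0.1
    done
    "$TTS_CMD" "$1"
    status=$?
    # Release the lock so the next phrase can be spoken.
    rmdir "$LOCK_DIR"
    return "$status"
}
```

Calling say_serialized from several background jobs should speak the phrases one after another instead of on top of each other. One weakness of this sketch: if a caller is killed while speaking, the lock directory is left behind and has to be removed by hand.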

  • Speech Recognition - Cheap and easy:
  • Pocketsphinx

  • Has hotword support

  • Trainable with custom vocabulary for (hopefully good enough) accuracy

  • Speech Recognition - Cheap and easy:

    Pocketsphinx - How to use it

    language_model="yourlanguagemodel"
    keyphrase="computer"
    stdbuf -o 0 pocketsphinx_continuous -inmic yes -lm "$language_model".lm \
            -dict "$language_model".dic -keyphrase "$keyphrase" \
            -kws_threshold 1e-20 2>/dev/null |\
        stdbuf -o 0 egrep -v 'READY...|Listening...|Stopped listening' |\
        stdbuf -o 0 sed -e 's/^[0-9][0-9]*[:] //' |\
        stdbuf -o 0 grep COMPUTER |\
        stdbuf -o 0 sed -e 's/^.*COMPUTER //' |\
        stdbuf -o 0 tee recog.txt |\
        stdbuf -o 0 cat > /tmp/snis-natural-language-fifo
    

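  • "stdbuf -o 0" turns off output buffering (on stdout) for the command it runs.

  • We need this because we want "real time" responsiveness. Without it, a command writing to a pipe may hold its output in a buffer until the buffer fills, so a recognized command could get stuck partway down the pipeline.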

  • pocketsphinx_continuous - run the recognizer continuously

  • -inmic yes - use the microphone as input

  • -lm, -dict - specify the language model and dictionary files (more on this later)

  • -keyphrase - specify the "key phrase" or "hot word"

  • -kws_threshold - adjust the sensitivity of key phrase recognition

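  • The "egrep -v" stage ignores status lines in the output from pocketsphinx (the patterns were determined empirically), and the first sed strips the leading number and colon from each remaining line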

  • "grep COMPUTER" keeps only lines containing the hot word, and the final sed transforms "blah blah COMPUTER DO SOMETHING" into "DO SOMETHING"
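
To see what the filtering stages do without a microphone, the same grep and sed expressions can be run over canned input. This is only a sketch: extract_command is a made-up name, the sample lines are my guess at the recognizer's output format (inferred from the sed pattern that strips a leading number and colon), "grep -E" stands in for egrep, and stdbuf is left out because the input here is finite.

```shell
#!/bin/sh
# Hypothetical helper wrapping the filter stages from the pipeline above.
extract_command() {
    grep -E -v 'READY...|Listening...|Stopped listening' |
        sed -e 's/^[0-9][0-9]*[:] //' |
        grep COMPUTER |
        sed -e 's/^.*COMPUTER //'
}

# Assumed sample output; real pocketsphinx output may differ.
printf '%s\n' \
    'READY....' \
    'Listening...' \
    '000000001: BLAH BLAH COMPUTER SET A COURSE FOR EARTH' \
    '000000002: HELLO THERE' |
    extract_command
# prints: SET A COURSE FOR EARTH
```

The status lines and the utterance without the hot word are dropped. Because ".*" is greedy, if COMPUTER is recognized twice in one utterance only the text after the last occurrence survives.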

  • "tee recog.txt" captures the (filtered) recognizer output in a file for debugging.

  • The last stage sends the text of the recognized speech into a FIFO, /tmp/snis-natural-language-fifo, for "natural language processing"
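
Something has to read the other end of that FIFO. Below is a minimal sketch of a reader, assuming one recognized command per line; consume_fifo and handle_command are hypothetical names, and handle_command is only a placeholder for the real natural language step.

```shell
#!/bin/sh
# Same path the recognizer pipeline writes to.
fifo="${fifo:-/tmp/snis-natural-language-fifo}"

# Placeholder for the real processing; this sketch just echoes the text.
handle_command() {
    printf 'heard: %s\n' "$1"
}

# Read newline-terminated commands from the FIFO until the writer closes it.
consume_fifo() {
    # Create the FIFO if it does not exist yet.
    [ -p "$1" ] || mkfifo "$1" || return 1
    while IFS= read -r line; do
        handle_command "$line"
    done < "$1"
}
```

Running consume_fifo "$fifo" in one terminal and the recognizer pipeline in another should print each command as it is spoken. Opening a FIFO blocks until both a reader and a writer are present, so it does not matter which side starts first.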

  • Pocketsphinx - Language Model

  • Web Service for building language model

  • Instructions

  • Service: http://www.speech.cs.cmu.edu/tools/lmtool-new.html

  • General procedure is:

  • Upload a corpus of typical commands you want to recognize

  • Web service processes this corpus

  • Download a tarball containing 5 files: xxxx.dic, xxxx.lm, xxxx.log_pronounce, xxxx.sent, xxxx.vocab (where "xxxx" is some number)

  • xxxx.dic and xxxx.lm are the dictionary and language model files passed to pocketsphinx_continuous with -dict and -lm
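
A corpus can be as simple as a text file of example commands. The sketch below writes one, using commands that appear elsewhere in these slides. The one-sentence-per-line layout and the choice to include the hot word in every line are my assumptions; check the web service's instructions. write_corpus is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: write a small corpus of typical commands to file $1.
write_corpus() {
    cat > "$1" <<'EOF'
computer set warp drive power to fifty percent
computer set a course for the planet
computer set a course for the nearest planet
computer set a course for starbase one
computer turn on the lights
EOF
}

corpus_file="$(mktemp)"
write_corpus "$corpus_file"
echo "corpus written to $corpus_file"

# After uploading the corpus and unpacking the downloaded tarball, point the
# recognizer at the numbered files, e.g. language_model="xxxx" so that the
# pipeline picks up xxxx.lm and xxxx.dic.
```

Since accuracy depends on the vocabulary being small and relevant, it is probably worth listing every phrasing you expect people to use rather than only one per command.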

  • "Natural Language Processing"

  • For this I wrote my own C code

  • Works much the way the parsers in old games like Zork did

  • snis_nl.h, snis_nl.c

  • What it does:

    • Parses text using user supplied dictionary and verb syntax

    • Calls user supplied verb functions, passing text and part of speech information

  • What it does not do:

    • Associate any meaning with any words (apart from mapping verbs to callbacks)

    • Understand anything.

  • All "understanding" happens in user supplied verb callbacks

  • "Natural Language Processing"

    The basic process is:
    • At start up, call functions to add words to dictionary

      • Define your nouns, pronouns, articles, adjectives, adverbs

      • Define your verbs, with "syntax", and associated function pointers

    • Parse text with snis_nl_parse_natural_language_request(), which will call back the functions you have associated with verbs

    "Natural Language Processing"

    Example: adding non-verbs to the dictionary
    snis_nl_add_dictionary_word("coolant",		"coolant",	POS_NOUN);
    snis_nl_add_dictionary_word("power",		"power",	POS_NOUN);
    snis_nl_add_dictionary_word("off",		"off",		POS_PREPOSITION);
    snis_nl_add_dictionary_word("on",		"on",		POS_PREPOSITION);
    snis_nl_add_dictionary_word("up",		"up",		POS_ADJECTIVE);
    snis_nl_add_dictionary_word("down",		"down",		POS_ADJECTIVE);
    snis_nl_add_dictionary_word("port",		"port",		POS_ADJECTIVE);
    snis_nl_add_dictionary_word("left",		"port",		POS_ADJECTIVE);
    

    "Natural Language Processing"

    Example: adding verbs to the dictionary
    snis_nl_add_dictionary_verb("set",		"set",		"npq", nl_set_npq); /* set warp drive power to 50 percent */
    snis_nl_add_dictionary_verb("set",		"set",		"npn", nl_set_npn); /* set a course for the planet */ 
    snis_nl_add_dictionary_verb("set",		"set",		"npan", nl_set_npn); /* set a course for the nearest planet */
    snis_nl_add_dictionary_verb("set",		"set",		"npnq", nl_set_npnq); /* set a course for starbase one */
    snis_nl_add_dictionary_verb("plot",		"plot",		"npn", nl_set_npn);
    snis_nl_add_dictionary_verb("plot",		"plot",		"npan", nl_set_npn);
    snis_nl_add_dictionary_verb("plot",		"plot",		"npnq", nl_set_npnq);
    snis_nl_add_dictionary_verb("lay in",		"lay in",	"npn", nl_set_npn);
    snis_nl_add_dictionary_verb("lay in",		"lay in",	"npan", nl_set_npn);
    snis_nl_add_dictionary_verb("lay in",		"lay in",	"npnq", nl_set_npnq);
    

    "Natural Language Processing"

    Example: adding verbs to the dictionary
    snis_nl_add_dictionary_verb("set",		"set",		"npn", nl_set_npn); /* set a course for the planet */ 
    

    First parameter is the verb you want to define.

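    Second parameter is the "canonical" word, which doesn't have to be the same as the first, but often is. This lets synonyms map to a single word, as "left" maps to "port" in the non-verb example.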

    Third parameter is the verb "syntax"

    In this case, "npn" means the verb expects a noun, a preposition, and another noun. (more on this later)

    The fourth parameter is a pointer to the function to be called when this verb is encountered.

    "Natural Language Processing"

    The verb syntax is defined by a string with each character representing an expected part of speech.

    Syntax of verb, denoted by characters.
    'n' - single noun (or pronoun)
    'l' - one or more nouns (or pronouns)
    'p' - preposition
    'P' - pronoun (unsubstituted; in most cases you probably want 'n' for noun.)
          This is for cases like "it" in "how far is it to earth?" The word
          "it" doesn't have or need any explicit antecedent.
    'q' - quantity, that is to say, a number.
    'a' - adjective
    'x' - auxiliary verb (be, do, have, will, shall, would, should,
                           can, could, may, might, must, ought, etc. )
    
  • For example: "put", as in "put the butter on the bread with the knife" has a syntax of "npnpn", while "put" as in "put the coat on", has a syntax of "np".

  • Verbs with the same "word" but different syntaxes are considered different verbs, and typically have different associated function pointers.

  • Think of verb syntaxes as a loose kind of overloaded function prototype.

  • "Natural Language Processing"

    Verb Callback functions:
    static void my_verb_callback_fn(void *context, int argc, char *argv[], int pos[],
                    union snis_nl_extra_data extra_data[])
    
  • context : an opaque pointer passed through unchanged from the call to the parser, so your callback can receive whatever information it needs.

  • argc : the number of elements in each of the array parameters that follow.

  • argv[] : array of the canonical words parsed, e.g. { "turn", "on", "the", "lights" }

  • pos[] : parallel array of parts of speech of argv[] elements, eg: { VERB, PREP, ART, NOUN }

  • extra_data[] : parallel array of "extra data" corresponding to argv[] array. (more on this later)

  • "Natural Language Processing"

    External noun lookup function:
    static uint32_t my_lookup_function(void *context, char *word);
    snis_nl_add_external_lookup(my_lookup_function);
  • An external noun lookup function can be provided to handle, e.g., named objects.

  • For example, if you have a game with procedurally named planets, the parser can call back a lookup function to identify them, so you can parse "set a course for name-of-planet".

  • Your lookup function can associate a uint32_t handle with any words it chooses. In the verb callback functions, these handles will be available in the extra_data[] parameter for any words identified as POS_EXTERNAL_NOUN.

  • uint32_t noun_id = extra_data[noun].external_noun.handle;

    "Natural Language Processing"

    Numbers

  • 'q' in a verb syntax means a number is expected.

  • Numbers come from speech recognition as text, eg: "forty two", not "42".

  • Custom code handles percentages, common fractions, the word "and" in the middle of numbers, etc.

  • The value of parsed numbers is passed back to verb functions in the "extra_data[]" parameter.

  • float value = extra_data[number].number.value;
  • Not completely bug free. (Fails to parse "one hundred and ten thousand" correctly, for instance.)

  • spelled_numbers.h, spelled_numbers.c

  • Some things it can parse.
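
To give a feel for what turning "forty two" into 42 involves, here is a small sketch of one common approach: keep a running group, multiply it on "hundred", and bank it on "thousand". It is written in shell for brevity and is not how spelled_numbers.c works. words_to_number and index_of are made-up names, and percent, fractions and anything above thousands are not handled.

```shell
#!/bin/sh
units="zero one two three four five six seven eight nine ten eleven twelve \
thirteen fourteen fifteen sixteen seventeen eighteen nineteen"
tens="twenty thirty forty fifty sixty seventy eighty ninety"

# Print the 0-based position of $1 among the remaining arguments, or fail.
index_of() {
    needle=$1; shift
    i=0
    for candidate in "$@"; do
        if [ "$candidate" = "$needle" ]; then echo "$i"; return 0; fi
        i=$((i + 1))
    done
    return 1
}

# Convert spelled-out number words in $1 to digits on stdout.
words_to_number() {
    total=0      # finished groups (thousands so far)
    current=0    # group still being accumulated
    for word in $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]'); do
        if [ "$word" = "and" ]; then
            continue                      # "one hundred and ten"
        elif n=$(index_of "$word" $units); then
            current=$((current + n))
        elif n=$(index_of "$word" $tens); then
            current=$((current + (n + 2) * 10))
        elif [ "$word" = "hundred" ]; then
            if [ "$current" -eq 0 ]; then current=1; fi
            current=$((current * 100))
        elif [ "$word" = "thousand" ]; then
            if [ "$current" -eq 0 ]; then current=1; fi
            total=$((total + current * 1000))
            current=0
        else
            echo "unknown number word: $word" >&2
            return 1
        fi
    done
    echo $((total + current))
}

words_to_number "forty two"   # prints 42
```

Input is lowercased first because the recognizer pipeline above appears to emit upper case text.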

  • "Natural Language Processing"

    Parsing

    Assuming the dictionary is set up, parsing is as simple as passing it a string to chew on:

    snis_nl_parse_natural_language_request(NULL, "set the warp drive power to fifty percent");
