Speech recognition helloworld in Python

As shown in this video, here is how to try out a “hello world” speech recognition example using CMU Sphinx from Python on Ubuntu. First, install the packages:

$ sudo apt-get install python-pocketsphinx pocketsphinx-hmm-wsj1 pocketsphinx-lm-wsj

The code (a script called speech_recognition.py, which you can also download as a gist) goes as follows (you may need to adjust the paths to your language model and hidden Markov model files):

import sys

try:
    import pocketsphinx as ps
    import sphinxbase
except ImportError:
    print """pocketsphinx and sphinxbase are not installed
    on your system. Please install them with your package manager.
    """
    sys.exit(1)

def decodeSpeech(hmmd, lmdir, dictp, wavfile):
    """
    Decodes a speech file
    """
    speechRec = ps.Decoder(hmm=hmmd, lm=lmdir, dict=dictp)
    wavFile = open(wavfile, 'rb')
    wavFile.seek(44)  # skip the 44-byte WAV header
    speechRec.decode_raw(wavFile)
    result = speechRec.get_hyp()

    return result[0]

if __name__ == "__main__":
    hmdir = "/usr/share/pocketsphinx/model/hmm/wsj1"
    lmd = "/usr/share/pocketsphinx/model/lm/wsj/wlist5o.3e-7.vp.tg.lm.DMP"
    dictd = "/usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic"
    wavfile = sys.argv[1]
    recognised = decodeSpeech(hmdir,lmd,dictd,wavfile)

    print "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
    print recognised
    print "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"

You can then run it on a WAV file of your recorded speech (my_recorded_speech.wav, assumed to be in the same folder) by issuing

$ python speech_recognition.py my_recorded_speech.wav
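One gotcha worth checking before you blame the recognizer: pocketsphinx acoustic models like the WSJ one above typically expect 16 kHz, 16-bit mono audio (treat those exact values as an assumption to verify for your model). You can inspect your recording with the standard wave module — a quick sketch:

```python
import wave

def wav_format(path):
    """Return (sample_rate_hz, channels, sample_width_bytes) of a WAV file."""
    w = wave.open(path, 'rb')
    try:
        return (w.getframerate(), w.getnchannels(), w.getsampwidth())
    finally:
        w.close()

# For a typical pocketsphinx model, wav_format('my_recorded_speech.wav')
# should give (16000, 1, 2); anything else is worth converting first.
```

If the format doesn’t match, re-export the recording from Audacity with the right project rate before decoding.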

Published by

metakermit


23 thoughts on “Speech recognition helloworld in Python”

  1. what about this:
    import speech
    speech.say("hello, world")
    pyspeech is for Python 2.5 or 2.4 and requires Windows 7, Vista or XP plus the speech package; pywin32 requirements:
    Windows 32-bit: regular
    64-bit: 32-bit everything

    1. Thanks for the suggestion, james. Haven’t tried pyspeech myself, although I think Windows is a strong dependency :). I actually got into Sphinx because I was building my own language model and it offers good tools to do that, only later did I find its Python bindings.

  2. All the above mentioned steps completed successfully. All the packages were installed on Ubuntu.

    After that I tried running the above code, but whatever .wav file I pass, the output is always:

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    THEY
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    So what’s the fix for this? Or can you just point out the error that I am committing here?
    Thanks

    1. Hi aaditya, are you sure your file is a proper .wav file and that you’re passing it to the script as an argument? I created my recording using the open source program Audacity (available for installation in the Software Center).

      1. I am also facing the same problem, but instead of THEY I am getting None. When I went into detail I found that wavFile.seek(44) is returning None. I have used a proper wav file. Please help if possible.

  3. I get an error…
    $ python2 speech.py test.wav
    Traceback (most recent call last):
    File "speech.py", line 24, in <module>
    recognised = decodeSpeech(hmdir,lmd,dictd,wavfile)
    File "speech.py", line 8, in decodeSpeech
    import pocketsphinx as ps
    File "sphinxbase.pxd", line 150, in init pocketsphinx (pocketsphinx.c:7935)
    ValueError: PyCapsule_GetPointer called with invalid PyCapsule object

    1. The error seems to be originating in pocketsphinx (“line 150, in init pocketsphinx (pocketsphinx.c:7935)”). I suggest that you report it to the developers of pocketsphinx – they seem to host their issue tracker here. If you find out that the code snippet I suggested is doing something in a wrong way, I would be thankful if you reported it to me too. Cheers!

          1. Importing the module twice seems to work. So as a workaround I did this:

            try:
                import pocketsphinx as ps
            except:
                pass
            import pocketsphinx as ps

  4. On mine it gives this error:
    root@jbktomate-P430-K-BE46P1:/home/jbk-tomate# python speech_recognition.py test.wav
    File "speech_recognition.py", line 4
    try:
    ^
    IndentationError: expected an indented block
    root@jbktomate-P430-K-BE46P1:/home/jbk-tomate#

  5. I am having trouble getting the WSJ HMM installed on Ubuntu.

    I see only a "/usr/share/pocketsphinx/model/hmm/en_US/" dir after installation instead.

    May I know on which version of Ubuntu you have this working? Is there any way to install the HMM and LM from source?

    1. hi,

      did you sort this one out? Because I am facing this problem now. Any solutions or suggestions?

      Thanks & regards,
      cooldharma06 .. 🙂

  6. Although this runs without error, I’m finding the accuracy to be around 0%. It even fails for simple recordings.

    e.g. For one recording of me saying “yes” that I made with Audacity, it thinks I said “THEY ADDED AHEAD D. THAT IN AHEAD AT AND HOW THEY”.

    I don’t even…

  7. How do I run and achieve the same on Windows? I basically have a .wav file and want a text file as output transcribing the audio. Any help is appreciated! Thanks!

    -asmi

    1. Well it’s probably possible, although you’ll have to manually find and install the components, as there’s no package manager. I’d start by reading through the Sphinx documentation to see if they have some instructions for Windows users:

      http://cmusphinx.sourceforge.net/

      Can’t help you more than that, unfortunately, as I’m not a Windows user.

  8. This is what appeared on my terminal:

    File "speech2.py", line 30, in <module>
    recognised = decodeSpeech(hmdir,lmd,dictd,wavfile)
    File "speech2.py", line 17, in decodeSpeech
    speechRec = ps.Decoder(hmm = hmmd, lm = lmdir, dict = dictp)
    UnboundLocalError: local variable 'ps' referenced before assignment

    Can you help me?

    1. Try creating the exact file as in my post. Don’t use the interactive terminal; run it like python speech_recognition.py my_recorded_speech.wav

    2. In my speech.wav I have the words "HI HELLO", but I’m not getting the correct answer. Please do help:

      nagesh@nagesh:~$ python sp.py speech.wav
      INFO: cmd_ln.c(506): Parsing command line:
      \
      -hmm /usr/share/pocketsphinx/model/hmm/wsj1 \
      -lm /usr/share/pocketsphinx/model/lm/wsj/wlist5o.3e-7.vp.tg.lm.DMP \
      -dict /usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic

      Current configuration:
      [NAME] [DEFLT] [VALUE]
      -agc none none
      -agcthresh 2.0 2.000000e+00
      -alpha 0.97 9.700000e-01
      -ascale 20.0 2.000000e+01
      -backtrace no no
      -beam 1e-48 1.000000e-48
      -bestpath yes yes
      -bestpathlw 9.5 9.500000e+00
      -cep2spec no no
      -ceplen 13 13
      -cmn current current
      -cmninit 8.0 8.0
      -compallsen no no
      -dict /usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic
      -dictcase no no
      -dither no no
      -doublebw no no
      -ds 1 1
      -fdict
      -feat 1s_c_d_dd 1s_c_d_dd
      -featparams
      -fillprob 1e-8 1.000000e-08
      -frate 100 100
      -fsg
      -fsgusealtpron yes yes
      -fsgusefiller yes yes
      -fwdflat yes yes
      -fwdflatbeam 1e-64 1.000000e-64
      -fwdflatefwid 4 4
      -fwdflatlw 8.5 8.500000e+00
      -fwdflatsfwin 25 25
      -fwdflatwbeam 7e-29 7.000000e-29
      -fwdtree yes yes
      -hmm /usr/share/pocketsphinx/model/hmm/wsj1
      -input_endian little little
      -jsgf
      -kdmaxbbi -1 -1
      -kdmaxdepth 0 0
      -kdtree
      -latsize 5000 5000
      -lda
      -ldadim 0 0
      -lifter 0 0
      -lm /usr/share/pocketsphinx/model/lm/wsj/wlist5o.3e-7.vp.tg.lm.DMP
      -lmctl
      -lmname default default
      -logbase 1.0001 1.000100e+00
      -logfn
      -logspec no no
      -lowerf 133.33334 1.333333e+02
      -lpbeam 1e-40 1.000000e-40
      -lponlybeam 7e-29 7.000000e-29
      -lw 6.5 6.500000e+00
      -maxhistpf 100 100
      -maxhmmpf -1 -1
      -maxnewoov 20 20
      -maxwpf -1 -1
      -mdef
      -mean
      -mfclogdir
      -mixw
      -mixwfloor 0.0000001 1.000000e-07
      -mmap yes yes
      -ncep 13 13
      -nfft 512 512
      -nfilt 40 40
      -nwpen 1.0 1.000000e+00
      -pbeam 1e-48 1.000000e-48
      -pip 1.0 1.000000e+00
      -rawlogdir
      -remove_dc no no
      -round_filters yes yes
      -samprate 16000 1.600000e+04
      -sdmap
      -seed -1 -1
      -sendump
      -silprob 0.005 5.000000e-03
      -smoothspec no no
      -spec2cep no no
      -svspec
      -tmat
      -tmatfloor 0.0001 1.000000e-04
      -topn 4 4
      -toprule
      -transform legacy legacy
      -unit_area yes yes
      -upperf 6855.4976 6.855498e+03
      -usewdphones no no
      -uw 1.0 1.000000e+00
      -var
      -varfloor 0.0001 1.000000e-04
      -varnorm no no
      -verbose no no
      -warp_params
      -warp_type inverse_linear inverse_linear
      -wbeam 7e-29 7.000000e-29
      -wip 0.65 6.500000e-01
      -wlen 0.025625 2.562500e-02

      INFO: cmd_ln.c(506): Parsing command line:
      \
      -lowerf 1 \
      -upperf 4000 \
      -nfilt 20 \
      -transform dct \
      -round_filters no \
      -remove_dc yes \
      -feat s2_4x

      Current configuration:
      [NAME] [DEFLT] [VALUE]
      -agc none none
      -agcthresh 2.0 2.000000e+00
      -alpha 0.97 9.700000e-01
      -cep2spec no no
      -ceplen 13 13
      -cmn current current
      -cmninit 8.0 8.0
      -dither no no
      -doublebw no no
      -feat 1s_c_d_dd s2_4x
      -frate 100 100
      -input_endian little little
      -lda
      -ldadim 0 0
      -lifter 0 0
      -logfn
      -logspec no no
      -lowerf 133.33334 1.000000e+00
      -mfclogdir
      -ncep 13 13
      -nfft 512 512
      -nfilt 40 20
      -rawlogdir
      -remove_dc no yes
      -round_filters yes no
      -samprate 16000 1.600000e+04
      -seed -1 -1
      -smoothspec no no
      -spec2cep no no
      -svspec
      -transform legacy dct
      -unit_area yes yes
      -upperf 6855.4976 4.000000e+03
      -varnorm no no
      -verbose no no
      -warp_params
      -warp_type inverse_linear inverse_linear
      -wlen 0.025625 2.562500e-02

      INFO: acmod.c(82): Parsed model-specific feature parameters from /usr/share/pocketsphinx/model/hmm/wsj1/feat.params
      INFO: mdef.c(520): Reading model definition: /usr/share/pocketsphinx/model/hmm/wsj1/mdef
      INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef file
      INFO: bin_mdef.c(301): Reading binary model definition: /usr/share/pocketsphinx/model/hmm/wsj1/mdef
      INFO: bin_mdef.c(480): 44 CI-phone, 66516 CD-phone, 5 emitstate/phone, 220 CI-sen, 5220 Sen, 18660 Sen-Seq
      INFO: tmat.c(204): Reading HMM transition probability matrices: /usr/share/pocketsphinx/model/hmm/wsj1/transition_matrices
      INFO: acmod.c(114): Attempting to use SCGMM computation module
      INFO: s2_semi_mgau.c(981): Reading S3 mixture gaussian file ‘/usr/share/pocketsphinx/model/hmm/wsj1/means’
      INFO: s2_semi_mgau.c(1080): 1 mixture Gaussians, 256 components, 4 feature streams, veclen 51
      INFO: s2_semi_mgau.c(981): Reading S3 mixture gaussian file ‘/usr/share/pocketsphinx/model/hmm/wsj1/variances’
      INFO: s2_semi_mgau.c(1080): 1 mixture Gaussians, 256 components, 4 feature streams, veclen 51
      INFO: s2_semi_mgau.c(748): Loading senones from dump file /usr/share/pocketsphinx/model/hmm/wsj1/sendump
      INFO: s2_semi_mgau.c(764): BEGIN FILE FORMAT DESCRIPTION
      INFO: s2_semi_mgau.c(793): Rows: 256, Columns: 5220
      INFO: s2_semi_mgau.c(801): Using memory-mapped I/O for senones
      INFO: kdtree.c(231): Reading tree for feature 0
      INFO: kdtree.c(249): n_density 256 n_comp 12 n_level 8 threshold 0.200000
      INFO: kdtree.c(186): Read 255 nodes
      INFO: kdtree.c(231): Reading tree for feature 1
      INFO: kdtree.c(249): n_density 256 n_comp 24 n_level 8 threshold 0.200000
      INFO: kdtree.c(186): Read 255 nodes
      INFO: kdtree.c(231): Reading tree for feature 2
      INFO: kdtree.c(249): n_density 256 n_comp 3 n_level 8 threshold 0.200000
      INFO: kdtree.c(186): Read 255 nodes
      INFO: kdtree.c(231): Reading tree for feature 3
      INFO: kdtree.c(249): n_density 256 n_comp 12 n_level 8 threshold 0.200000
      INFO: kdtree.c(186): Read 255 nodes
      INFO: feat.c(849): Initializing feature stream to type: ‘s2_4x’, ceplen=13, CMN=’current’, VARNORM=’no’, AGC=’none’
      INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
      INFO: dict.c(232): Allocating 20 placeholders for new OOVs
      INFO: dict.c(494): 6270 = words in file [/usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic]
      WARNING: “dict.c”, line 435: Skipping duplicate definition of
      WARNING: “dict.c”, line 435: Skipping duplicate definition of

      WARNING: “dict.c”, line 435: Skipping duplicate definition of
      INFO: dict.c(494): 3 = words in file [/usr/share/pocketsphinx/model/hmm/wsj1/noisedict]
      INFO: dict.c(349): LEFT CONTEXT TABLES
      INFO: dict.c(1013): Entry Context table contains
      450 entries
      INFO: dict.c(1014): 19800 possible cross word triphones.
      INFO: dict.c(1052): 17920 triphones
      1792 pseudo diphones
      88 uniphones
      INFO: dict.c(1099): Exit Context table contains
      450 entries
      INFO: dict.c(1100): 19800 possible cross word triphones.
      INFO: dict.c(1166): 17920 triphones
      1792 pseudo diphones
      88 uniphones
      INFO: dict.c(1168): 7653 right context entries
      INFO: dict.c(1169): 17 ave entries per exit context
      INFO: dict.c(355): RIGHT CONTEXT TABLES
      INFO: dict.c(1013): Entry Context table contains
      416 entries
      INFO: dict.c(1014): 18304 possible cross word triphones.
      INFO: dict.c(1052): 17388 triphones
      828 pseudo diphones
      88 uniphones
      INFO: dict.c(1099): Exit Context table contains
      416 entries
      INFO: dict.c(1100): 18304 possible cross word triphones.
      INFO: dict.c(1166): 17388 triphones
      828 pseudo diphones
      88 uniphones
      INFO: dict.c(1168): 8753 right context entries
      INFO: dict.c(1169): 21 ave entries per exit context
      ERROR: “ngram_model_arpa.c”, line 155: No \data\ mark in LM file
      INFO: ngram_model_dmp.c(141): Will use memory-mapped I/O for LM file
      INFO: ngram_model_dmp.c(190): ngrams 1=5002, 2=338656, 3=291318
      INFO: ngram_model_dmp.c(236): 5002 = LM.unigrams(+trailer) read
      INFO: ngram_model_dmp.c(286): 338656 = LM.bigrams(+trailer) read
      INFO: ngram_model_dmp.c(313): 291318 = LM.trigrams read
      INFO: ngram_model_dmp.c(338): 32470 = LM.prob2 entries read
      INFO: ngram_model_dmp.c(358): 13795 = LM.bo_wt2 entries read
      INFO: ngram_model_dmp.c(379): 31136 = LM.prob3 entries read
      INFO: ngram_model_dmp.c(408): 662 = LM.tseg_base entries read
      INFO: ngram_model_dmp.c(467): 5002 = ascii word strings read
      INFO: ngram_search_fwdtree.c(156): 0 root, 0 non-root channels, 37 single-phone words
      INFO: ngram_search_fwdtree.c(195): Creating search tree
      INFO: ngram_search_fwdtree.c(203): 0 root, 0 non-root channels, 37 single-phone words
      INFO: ngram_search_fwdtree.c(325): max nonroot chan increased to 13871
      INFO: ngram_search_fwdtree.c(334): 443 root, 13743 non-root channels, 17 single-phone words
      INFO: ngram_search_fwdflat.c(95): fwdflat: min_ef_width = 4, max_sf_win = 25
      INFO: cmn.c(175): CMN: 45.71 2.22 2.01 -1.55 1.03 -0.86 -0.23 -0.60 -1.67 -0.76 -1.19 -0.65 -0.52
      INFO: ngram_search_fwdtree.c(1450): 1255 words recognized (11/fr)
      INFO: ngram_search_fwdtree.c(1452): 284041 senones evaluated (2387/fr)
      INFO: ngram_search_fwdtree.c(1454): 218618 channels searched (1837/fr), 43016 1st, 63470 last
      INFO: ngram_search_fwdtree.c(1458): 3806 words for which last channels evaluated (31/fr)
      INFO: ngram_search_fwdtree.c(1461): 9712 candidate words for entering last phone (81/fr)
      INFO: ngram_search_fwdflat.c(840): 857 words recognized (7/fr)
      INFO: ngram_search_fwdflat.c(842): 88790 senones evaluated (746/fr)
      INFO: ngram_search_fwdflat.c(844): 64428 channels searched (541/fr)
      INFO: ngram_search_fwdflat.c(846): 4597 words searched (38/fr)
      INFO: ngram_search_fwdflat.c(848): 3028 word transitions (25/fr)
      WARNING: “ngram_search.c”, line 1000: not found in last frame, using LAND instead
      INFO: ngram_search.c(1046): lattice start node .0 end node LAND.55
      INFO: ps_lattice.c(1225): Normalizer P(O) = alpha(LAND:55:117) = -1210044
      INFO: ps_lattice.c(1263): Joint P(O,S) = -1210380 P(S|O) = -336
      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      CHINA LAND
      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
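The near-zero accuracy reported in several comments above (everything decoding to THEY, or "CHINA LAND" here) is consistent with a sample-rate mismatch: the decoder’s configuration shows -samprate 16000, while desktop recordings are often 44.1 kHz stereo. That diagnosis is an assumption, not something confirmed in the thread. For real conversions use Audacity or sox; as a stdlib-only illustration (Python 3 array methods), a crude nearest-neighbour converter might look like:

```python
import wave
import array

def to_16k_mono(src_path, dst_path):
    """Crudely convert a 16-bit WAV file to 16 kHz mono by averaging
    channels and nearest-neighbour resampling (no anti-alias filtering)."""
    src = wave.open(src_path, 'rb')
    rate = src.getframerate()
    nchan = src.getnchannels()
    assert src.getsampwidth() == 2, "expected 16-bit samples"
    samples = array.array('h')
    samples.frombytes(src.readframes(src.getnframes()))
    src.close()
    # down-mix interleaved channels to mono by averaging
    if nchan > 1:
        samples = array.array('h', [sum(samples[i:i + nchan]) // nchan
                                    for i in range(0, len(samples), nchan)])
    # pick the nearest source sample for each 16 kHz output sample
    n_out = len(samples) * 16000 // rate
    out = array.array('h', [samples[i * rate // 16000] for i in range(n_out)])
    dst = wave.open(dst_path, 'wb')
    dst.setnchannels(1)
    dst.setsampwidth(2)
    dst.setframerate(16000)
    dst.writeframes(out.tobytes())
    dst.close()
```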
