Day: December 6, 2012

Running Julius with the Coruja language model for Portuguese on the i.MX51 processor

I’m trying to run the Julius voice recognition engine on an i.MX51 with support for the Portuguese language:

# export ALSADEV="plughw:0,0"
# julius -C julius.jconf 
STAT: include config: julius.jconf
Stat: para: parsing HTK Config file: edaz.conf
Warning: para: "USESILDET" ignored (not supported, or irrelevant)
Stat: para: ENORMALISE=T
Warning: para: NUMCEPS skipped (will be determined by AM header)
Stat: para: CEPLIFTER=22
Stat: para: NUMCHANS=26
Stat: para: PREEMCOEF=0.97
Stat: para: USEHAMMING=T
Stat: para: WINDOWSIZE=250000.0
Warning: para: "SAVEWITHCRC" ignored (not supported, or irrelevant)
Warning: para: "SAVECOMPRESSED" ignored (not supported, or irrelevant)
Stat: para: TARGETRATE=100000.0
Warning: para: TARGETKIND skipped (will be determined by AM header)
Stat: para: ZMEANSOURCE=T
Warning: para: "SOURCEFORMAT" ignored (not supported, or irrelevant)
Warning: no SOURCERATE found
Warning: assume source waveform sample rate to 625 (16kHz)
STAT: jconf successfully finalized
STAT: *** loading AM01 lapsam
Stat: init_phmm: Reading in HMM definition
Stat: binhmm-header: variance inversed
Stat: read_binhmm: has inversed variances
Stat: read_binhmm: binary format HMM definition
Stat: check_hmm_restriction: an HMM with several arcs from initial state found: "sp"
Stat: read_binhmm: this HMM requires multipath handling at decoding
Stat: init_phmm: defined HMMs: 15685
Stat: init_phmm: loading ascii hmmlist
Stat: init_phmm: logical names: 59321 in HMMList
Stat: init_phmm: base phones:    40 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: making pseudo bi/mono-phone for IW-triphone
Stat: hmm_lookup: 3080 pseudo phones are added to logical HMM list
STAT: *** AM01 lapsam loaded
STAT: *** loading LM01 dicSr
Stat: init_voca: read 65783 words
Stat: init_ngram: reading in binary n-gram from LaPSLM1.7.1.lm.bin
Stat: ngram_read_bin: file version: 5
Stat: ngram_read_bin_v5: this is forward 3-gram file
stat: ngram_read_bin_v5: reading 1-gram
stat: ngram_read_bin_v5: reading 2-gram
stat: ngram_read_bin_v5: reading 3-gram
Stat: ngram_read_bin: making entry name index
Stat: init_ngram: found unknown word entry "<unk>"
Stat: init_ngram: finished reading n-gram
Stat: init_ngram: mapping dictonary words to n-gram entries
Stat: init_ngram: finished word-to-ngram mapping
Warning: BOS word "<s>" has unigram prob of "-99"
Warning: assigining value of EOS word "</s>": -1.934294
STAT: *** LM01 dicSr loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 1 lapsam: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR01 dicSr (AM01 lapsam, LM01 dicSr)
STAT: Building HMM lexicon tree
STAT: lexicon size: 710945 nodes
STAT: coordination check passed
STAT: make successor lists for unigram factoring
STAT: done
STAT:  1-gram factoring values has been pre-computed
STAT: SR01 dicSr composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
STAT: All init successfully done

STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.2.2 (fast)

Engine specification:
 -  Base setup   : fast
 -  Supported LM : DFA, N-gram, Word
 -  Extension    : WordsInt
 -  Compiled by  : gcc -fPIC

------------------------------------------------------------
Configuration of Modules

 Number of defined modules: AM=1, LM=1, SR=1

 Acoustic Model (with input parameter spec.):
 - AM01 "lapsam"
        hmmfilename=LaPSAM1.7.1.am.bin
        hmmmapfilename=LaPSAM1.7.1.tiedlist

 Language Model:
 - LM01 "dicSr"
        vocabulary filename=dictionary_ssp.dic
        n-gram  filename=LaPSLM1.7.1.lm.bin (binary format)

 Recognizer:
 - SR01 "dicSr" (AM01, LM01)

------------------------------------------------------------
Speech Analysis Module(s)

[MFCC01]  for [AM01 lapsam]

 Acoustic analysis condition:
               parameter = MFCC_E_D_A_Z (39 dim. from 12 cepstrum + energy with CMN)
        sample frequency = 16000 Hz
           sample period =  625  (1 = 100ns)
             window size =  400 samples (25.0 ms)
             frame shift =  160 samples (10.0 ms)
            pre-emphasis = 0.97
            # filterbank = 26
           cepst. lifter = 22
              raw energy = True
        energy normalize = True (scale = 0.1, silence floor = 50.0 dB)
            delta window = 2 frames (20.0 ms) around
              acc window = 2 frames (20.0 ms) around
             hi freq cut = OFF
             lo freq cut = OFF
         zero mean frame = ON
               use power = OFF
                     CVN = OFF
                    VTLN = OFF
    spectral subtraction = off
  cepstral normalization = sentence CMN
         base setup from = HTK Config (and HTK defaults)

------------------------------------------------------------
Acoustic Model(s)

[AM01 "lapsam"]

 HMM Info:
    15685 models, 11951 states, 11951 mpdfs, 215113 Gaussians are defined
              model type = context dependency handling ON
      training parameter = MFCC_E_D_A_Z
           vector length = 39
        number of stream = 1
             stream info = [0-38]
        cov. matrix type = DIAGC
           duration type = NULLD
        max mixture size = 18 Gaussians
     max length of model = 5 states
     logical base phones = 40
       model skip trans. = exist, require multi-path handling
      skippable models = sp (1 model(s))

 AM Parameters:
        Gaussian pruning = safe  (-gprune)
  top N mixtures to calc = 2 / 0  (-tmix)
    short pause HMM name = "sp" specified, "sp" applied (physical)  (-sp)
  cross-word CD on pass1 = handle by approx. (use 5-best of same LC)
   sp transition penalty = -1.0

------------------------------------------------------------
Language Model(s)

[LM01 "dicSr"] type=n-gram

 N-gram info:
                    spec = 3-gram, forward (left-to-right)
                OOV word = <unk>(id=2)
            wordset size = 65784
          1-gram entries =      65784  (  0.5 MB)
          2-gram entries =    4500045  ( 64.7 MB) (98% are valid contexts)
          3-gram entries =   16232140  (153.5 MB)
                   pass1 = 2-gram in the forward n-gram

 Vocabulary Info:
        vocabulary size  = 65783 words, 492911 models
        average word len = 7.5 models, 22.5 states
       maximum state num = 66 nodes per word
       transparent words = not exist
       words under class = not exist

 Parameters:
        (-silhead)head sil word = 1: "<s> @0.000000 [] sil(sil)"
        (-siltail)tail sil word = 0: "</s> @0.000000 [] sil(sil)"

------------------------------------------------------------
Recognizer(s)

[SR01 "dicSr"]  AM01 "lapsam"  +  LM01 "dicSr"

 Lexicon tree:
         total node num = 710945
          root node num =    906
        (148 hi-freq. words are separated from tree lexicon)
          leaf node num =  65783
         fact. node num =  65783

 Inter-word N-gram cache: 
        root node to be cached = 264 / 906 (isolated only)
        word ends to be cached = 65784 (all)
          max. allocation size = 69MB
        (-lmp)  pass1 LM weight = 15.0  ins. penalty = +10.0
        (-lmp2) pass2 LM weight = 15.0  ins. penalty = +10.0
        (-transp)trans. penalty = +0.0 per word
        (-cmalpha)CM alpha coef = 0.050000

 Search parameters: 
            multi-path handling = yes, multi-path mode enabled
        (-b) trellis beam width = 2000
        (-bs)score pruning thres= disabled
        (-n)search candidate num= 3
        (-s)  search stack size = 500
        (-m)    search overflow = after 2000 hypothesis poped
                2nd pass method = searching sentence, generating N-best
        (-b2)  pass2 beam width = 200
        (-lookuprange)lookup range= 5  (tm-5 <= t <tm+5)
        (-sb)2nd scan beamthres = 300.0 (in logscore)
        (-n)        search till = 3 candidates found
        (-output)    and output = 3 candidates out of above
         IWCD handling:
           1st pass: approximation (use 5-best of same LC)
           2nd pass: loose (apply when hypo. is popped and scanned)
         factoring score: 1-gram prob. (statically assigned beforehand)
        short pause segmentation = off
        fall back on search fail = off, returns search failure

------------------------------------------------------------
Decoding algorithm:

        1st pass input processing = (forced) buffered, batch
        1st pass method = 1-best approx. generating indexed trellis
        output word confidence measure based on search-time scores

------------------------------------------------------------
FrontEnd:

 Input stream:
                     input type = waveform
                   input source = microphone
            device API          = default
                  sampling freq. = 16000 Hz
                 threaded A/D-in = supported, on
           zero frames stripping = on
                 silence cutting = on
                     level thres = 3000 / 32767
                 zerocross thres = 150 / sec.
                     head margin = 600 msec.
                     tail margin = 1000 msec.
                      chunk size = 1000 samples
            long-term DC removal = on (will compute from first 3.0 sec)
              reject short input = < 50 msec

----------------------- System Information end -----------------------

------
### read waveform input
Stat: capture audio at 16000Hz
Stat: adin_alsa: current latency time: 34 msec
Stat: adin_alsa: latency set to 34 msec (chunk = 557 bytes)
Stat: "default": imx3stack [imx-3stack] device SGTL5000 SGTL5000-0 [] subdevice #0
STAT: AD-in thread created
<<< please speak >>>Warning: strip: sample 0-556 is invalid, stripped
Warning: strip: sample 0-556 is invalid, stripped
Warning: strip: sample 0-418 has zero value, stripped
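
A note on the numbers in the log above: the HTK-style values (sample period 625, WINDOWSIZE=250000.0, TARGETRATE=100000.0) are in units of 100 ns, as the "(1 = 100ns)" hint in the system information indicates. The small Python sketch below only re-derives the 16000 Hz, 400-sample and 160-sample figures that Julius reports. The function and constant names are my own, chosen for illustration; they are not part of Julius or HTK.

```python
# Sketch: convert HTK-style time values (units of 100 ns) into Hz, samples and ms.
# All input values are copied from the Julius log; the helper names are hypothetical.

HTK_UNITS_PER_SECOND = 10_000_000  # 1 unit = 100 ns
HTK_UNITS_PER_MS = 10_000


def sample_rate_hz(sample_period_htk):
    """Sample rate in Hz for a sample period given in 100 ns units."""
    return round(HTK_UNITS_PER_SECOND / sample_period_htk)


def htk_to_samples(duration_htk, sample_period_htk):
    """Number of samples covered by a duration given in 100 ns units."""
    return round(duration_htk / sample_period_htk)


def htk_to_ms(duration_htk):
    """Duration in milliseconds for a value given in 100 ns units."""
    return duration_htk / HTK_UNITS_PER_MS


SAMPLE_PERIOD = 625      # "assume source waveform sample rate to 625 (16kHz)"
WINDOWSIZE = 250000.0    # from edaz.conf
TARGETRATE = 100000.0    # from edaz.conf

if __name__ == "__main__":
    print("sample rate:", sample_rate_hz(SAMPLE_PERIOD), "Hz")
    print("window size:", htk_to_samples(WINDOWSIZE, SAMPLE_PERIOD),
          "samples,", htk_to_ms(WINDOWSIZE), "ms")
    print("frame shift:", htk_to_samples(TARGETRATE, SAMPLE_PERIOD),
          "samples,", htk_to_ms(TARGETRATE), "ms")
```

The results match the "Acoustic analysis condition" block in the log (16000 Hz, 400 samples / 25.0 ms, 160 samples / 10.0 ms), so the front end is at least configured consistently with the acoustic model.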

Source: http://aonsquared.co.uk/raspi_voice_control