I’m trying to run the Julius voice recognition engine on an i.MX51 with support for the Portuguese language:
# export ALSADEV="plughw:0,0"
# julius -C julius.jconf
STAT: include config: julius.jconf
Stat: para: parsing HTK Config file: edaz.conf
Warning: para: "USESILDET" ignored (not supported, or irrelevant)
Stat: para: ENORMALISE=T
Warning: para: NUMCEPS skipped (will be determined by AM header)
Stat: para: CEPLIFTER=22
Stat: para: NUMCHANS=26
Stat: para: PREEMCOEF=0.97
Stat: para: USEHAMMING=T
Stat: para: WINDOWSIZE=250000.0
Warning: para: "SAVEWITHCRC" ignored (not supported, or irrelevant)
Warning: para: "SAVECOMPRESSED" ignored (not supported, or irrelevant)
Stat: para: TARGETRATE=100000.0
Warning: para: TARGETKIND skipped (will be determined by AM header)
Stat: para: ZMEANSOURCE=T
Warning: para: "SOURCEFORMAT" ignored (not supported, or irrelevant)
Warning: no SOURCERATE found
Warning: assume source waveform sample rate to 625 (16kHz)
STAT: jconf successfully finalized
STAT: *** loading AM01 lapsam
Stat: init_phmm: Reading in HMM definition
Stat: binhmm-header: variance inversed
Stat: read_binhmm: has inversed variances
Stat: read_binhmm: binary format HMM definition
Stat: check_hmm_restriction: an HMM with several arcs from initial state found: "sp"
Stat: read_binhmm: this HMM requires multipath handling at decoding
Stat: init_phmm: defined HMMs: 15685
Stat: init_phmm: loading ascii hmmlist
Stat: init_phmm: logical names: 59321 in HMMList
Stat: init_phmm: base phones: 40 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: making pseudo bi/mono-phone for IW-triphone
Stat: hmm_lookup: 3080 pseudo phones are added to logical HMM list
STAT: *** AM01 lapsam loaded
STAT: *** loading LM01 dicSr
Stat: init_voca: read 65783 words
Stat: init_ngram: reading in binary n-gram from LaPSLM1.7.1.lm.bin
Stat: ngram_read_bin: file version: 5
Stat: ngram_read_bin_v5: this is forward 3-gram file
stat: ngram_read_bin_v5: reading 1-gram
stat: ngram_read_bin_v5: reading 2-gram
stat: ngram_read_bin_v5: reading 3-gram
Stat: ngram_read_bin: making entry name index
Stat: init_ngram: found unknown word entry ""
Stat: init_ngram: finished reading n-gram
Stat: init_ngram: mapping dictonary words to n-gram entries
Stat: init_ngram: finished word-to-ngram mapping
Warning: BOS word "" has unigram prob of "-99"
Warning: assigining value of EOS word "": -1.934294
STAT: *** LM01 dicSr loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 1 lapsam: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR01 dicSr (AM01 lapsam, LM01 dicSr)
STAT: Building HMM lexicon tree
STAT: lexicon size: 710945 nodes
STAT: coordination check passed
STAT: make successor lists for unigram factoring
STAT: done
STAT: 1-gram factoring values has been pre-computed
STAT: SR01 dicSr composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
STAT: All init successfully done
STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.2.2 (fast)
Engine specification:
  - Base setup : fast
  - Supported LM : DFA, N-gram, Word
  - Extension : WordsInt
  - Compiled by : gcc -fPIC
------------------------------------------------------------
Configuration of Modules
Number of defined modules: AM=1, LM=1, SR=1
Acoustic Model (with input parameter spec.):
  - AM01 "lapsam"
    hmmfilename=LaPSAM1.7.1.am.bin
    hmmmapfilename=LaPSAM1.7.1.tiedlist
Language Model:
  - LM01 "dicSr"
    vocabulary filename=dictionary_ssp.dic
    n-gram filename=LaPSLM1.7.1.lm.bin (binary format)
Recognizer:
  - SR01 "dicSr" (AM01, LM01)
------------------------------------------------------------
Speech Analysis Module(s)
[MFCC01] for [AM01 lapsam]
Acoustic analysis condition:
  parameter = MFCC_E_D_A_Z (39 dim. from 12 cepstrum + energy with CMN)
  sample frequency = 16000 Hz
  sample period = 625 (1 = 100ns)
  window size = 400 samples (25.0 ms)
  frame shift = 160 samples (10.0 ms)
  pre-emphasis = 0.97
  # filterbank = 26
  cepst. lifter = 22
  raw energy = True
  energy normalize = True (scale = 0.1, silence floor = 50.0 dB)
  delta window = 2 frames (20.0 ms) around
  acc window = 2 frames (20.0 ms) around
  hi freq cut = OFF
  lo freq cut = OFF
  zero mean frame = ON
  use power = OFF
  CVN = OFF
  VTLN = OFF
  spectral subtraction = off
  cepstral normalization = sentence CMN
  base setup from = HTK Config (and HTK defaults)
------------------------------------------------------------
Acoustic Model(s)
[AM01 "lapsam"]
HMM Info:
  15685 models, 11951 states, 11951 mpdfs, 215113 Gaussians are defined
  model type = context dependency handling ON
  training parameter = MFCC_E_D_A_Z
  vector length = 39
  number of stream = 1
  stream info = [0-38]
  cov. matrix type = DIAGC
  duration type = NULLD
  max mixture size = 18 Gaussians
  max length of model = 5 states
  logical base phones = 40
  model skip trans. = exist, require multi-path handling
  skippable models = sp (1 model(s))
AM Parameters:
  Gaussian pruning = safe (-gprune)
  top N mixtures to calc = 2 / 0 (-tmix)
  short pause HMM name = "sp" specified, "sp" applied (physical) (-sp)
  cross-word CD on pass1 = handle by approx. (use 5-best of same LC)
  sp transition penalty = -1.0
------------------------------------------------------------
Language Model(s)
[LM01 "dicSr"] type=n-gram
N-gram info:
  spec = 3-gram, forward (left-to-right)
  OOV word = (id=2)
  wordset size = 65784
  1-gram entries = 65784 ( 0.5 MB)
  2-gram entries = 4500045 ( 64.7 MB) (98% are valid contexts)
  3-gram entries = 16232140 (153.5 MB)
  pass1 = 2-gram in the forward n-gram
Vocabulary Info:
  vocabulary size = 65783 words, 492911 models
  average word len = 7.5 models, 22.5 states
  maximum state num = 66 nodes per word
  transparent words = not exist
  words under class = not exist
Parameters:
  (-silhead)head sil word = 1: "@0.000000 [] sil(sil)"
  (-siltail)tail sil word = 0: "@0.000000 [] sil(sil)"
------------------------------------------------------------
Recognizer(s)
[SR01 "dicSr"] AM01 "lapsam" + LM01 "dicSr"
Lexicon tree:
  total node num = 710945
  root node num = 906 (148 hi-freq. words are separated from tree lexicon)
  leaf node num = 65783
  fact. node num = 65783
Inter-word N-gram cache:
  root node to be cached = 264 / 906 (isolated only)
  word ends to be cached = 65784 (all)
  max. allocation size = 69MB
  (-lmp) pass1 LM weight = 15.0 ins. penalty = +10.0
  (-lmp2) pass2 LM weight = 15.0 ins. penalty = +10.0
  (-transp)trans. penalty = +0.0 per word
  (-cmalpha)CM alpha coef = 0.050000
Search parameters:
  multi-path handling = yes, multi-path mode enabled
  (-b) trellis beam width = 2000
  (-bs)score pruning thres= disabled
  (-n)search candidate num= 3
  (-s) search stack size = 500
  (-m) search overflow = after 2000 hypothesis poped
  2nd pass method = searching sentence, generating N-best
  (-b2) pass2 beam width = 200
  (-lookuprange)lookup range= 5 (tm-5 <= t <tm+5)
  (-sb)2nd scan beamthres = 300.0 (in logscore)
  (-n) search till = 3 candidates found
  (-output) and output = 3 candidates out of above
IWCD handling:
  1st pass: approximation (use 5-best of same LC)
  2nd pass: loose (apply when hypo. is popped and scanned)
  factoring score: 1-gram prob. (statically assigned beforehand)
  short pause segmentation = off
  fall back on search fail = off, returns search failure
------------------------------------------------------------
Decoding algorithm:
  1st pass input processing = (forced) buffered, batch
  1st pass method = 1-best approx. generating indexed trellis
  output word confidence measure based on search-time scores
------------------------------------------------------------
FrontEnd:
Input stream:
  input type = waveform
  input source = microphone
  device API = default
  sampling freq. = 16000 Hz
  threaded A/D-in = supported, on
  zero frames stripping = on
  silence cutting = on
  level thres = 3000 / 32767
  zerocross thres = 150 / sec.
  head margin = 600 msec.
  tail margin = 1000 msec.
  chunk size = 1000 samples
  long-term DC removal = on (will compute from first 3.0 sec)
  reject short input = < 50 msec
----------------------- System Information end -----------------------
------
### read waveform input
Stat: capture audio at 16000Hz
Stat: adin_alsa: current latency time: 34 msec
Stat: adin_alsa: latency set to 34 msec (chunk = 557 bytes)
Stat: "default": imx3stack [imx-3stack] device SGTL5000 SGTL5000-0 [] subdevice #0
STAT: AD-in thread created
<<>>Warning: strip: sample 0-556 is invalid, stripped
Warning: strip: sample 0-556 is invalid, stripped
Warning: strip: sample 0-418 has zero value, stripped
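
As a sanity check on the front-end settings in the log above, the reported sample rate, window size and frame shift can be recomputed from the HTK parameters. This is only a minimal sketch: it assumes the time unit is 100 ns, as the log states ("1 = 100ns"), and the variable names are illustrative rather than anything Julius defines.

```shell
#!/bin/sh
# HTK-style times are in units of 100 ns, so one second is 10,000,000 units.
HTK_UNITS_PER_SEC=10000000

# Values copied from the log above.
SAMPLE_PERIOD=625     # "sample period = 625 (1 = 100ns)"
WINDOWSIZE=250000     # "para: WINDOWSIZE=250000.0"
TARGETRATE=100000     # "para: TARGETRATE=100000.0"

# Sample rate in Hz, and window / frame shift expressed in samples.
SAMPLE_RATE=$((HTK_UNITS_PER_SEC / SAMPLE_PERIOD))
WINDOW_SAMPLES=$((WINDOWSIZE / SAMPLE_PERIOD))
SHIFT_SAMPLES=$((TARGETRATE / SAMPLE_PERIOD))

echo "sample rate : ${SAMPLE_RATE} Hz"
echo "window size : ${WINDOW_SAMPLES} samples"
echo "frame shift : ${SHIFT_SAMPLES} samples"
```

The three results agree with the "sample frequency", "window size" and "frame shift" lines in the log, so the acoustic analysis setup looks internally consistent. The "strip" warnings at the end come from the audio capture stage.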