Remote dictation to a blog post

It's been part of my fractured workflow for some time. I enjoy making audio recordings of my thoughts, and previously used SpinVox to great effect from my mobile phone to an inbound email to me. SpinVox was not what the business declared it was, and proved unscalable, so disappeared.

At the time I used GTD to manage my projects and it was a highly effective system when combined with SpinVox.

Manually transcribing my audio notes brings incredibly insightful retrospective thoughts, especially as I am an intuitive, holistic, visual-spatial thinker rather than an audio-linguistic, sequential thinker. I see more of my thoughts when I engage my audio-linguistic thinking space but, and here is a large one, it takes more than double the time1.

With the advancements in AI and ML, I wondered if there was a free, open-source, self-hosted voice-to-text (VTT) platform which can be scripted to work from an audio file.

In my searches I discovered a YouTube video by Kris Occhipinti describing a method using wget and a flac file. I'm going to investigate this now.

Proof of concept / testing

I'm using Kubuntu 18.10, because it's what I installed the other day. I'm likely to host this solution somewhere, probably on a low-power Linux VPS server later, but for now I will use what I have to hand.

  $ lsb_release -a
  No LSB modules are available.
  Distributor ID: Ubuntu
  Description:    Ubuntu 18.10
  Release:        18.10
  Codename:       cosmic
  $ cat /etc/*-release
  DISTRIB_ID=Ubuntu
  DISTRIB_RELEASE=18.10
  DISTRIB_CODENAME=cosmic
  DISTRIB_DESCRIPTION="Ubuntu 18.10"
  NAME="Ubuntu"
  VERSION="18.10 (Cosmic Cuttlefish)"
  ID=ubuntu
  ID_LIKE=debian
  PRETTY_NAME="Ubuntu 18.10"
  VERSION_ID="18.10"
  HOME_URL="https://www.ubuntu.com/"
  SUPPORT_URL="https://help.ubuntu.com/"
  BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
  PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
  VERSION_CODENAME=cosmic
  UBUNTU_CODENAME=cosmic

The beginning

I will take this project as best I can, one step at a time, starting with the actual VTT element of the project, as without this there is no reason to create any other component. Before I can test any VTT ideas, I need some audio samples to work with.

I already have many audio recordings which have yet to be processed. They were recorded on my little second-hand digital voice recorder, an Olympus WS-832. I also often use my Sony Xperia Z5 Android mobile phone to record audio notes to myself. Both devices record in many formats, but the recordings I made will be long and may contain other background sounds. Although I have manually transferred the Olympus recordings to this computer, I will refrain from doing so with the Android phone for now, as I will want to automate this process later.

It would be best to work with the 'perfect' audio sample, so I will record some new test files to process.

Collecting audio samples for testing

I need to capture a flac file with my voice in it. Kris Occhipinti seems to go by the name sairon in the YouTube video, and he shows a screen capture with the command he is using in the opening seconds. Perhaps if I watched the entire video I would understand, but for now I'm just going to use the screenshot he provides.

I've chosen an audio recorder which looks to be quite full-featured and lightweight, called 'audio-recorder'; I found this from an article written by Abhishek Prakash on It's FOSS. I would have gone the Audacity route, but I always end up becoming distracted with its many features.

  sudo apt-add-repository ppa:audio-recorder/ppa
  sudo apt-get update
  sudo apt-get install audio-recorder

The interface is simple. I selected the correct source device, 'Built-in Audio Analogue Stereo (Microphone)', as the software initially selected the 'Built-in Audio Analogue Stereo (Audio output)' device. Secondly, I altered the format to FLAC (CD quality, lossless, 44kHz).

Unfortunately, 'Audio recorder' does not offer a 16000Hz sampling rate by default, only 44100Hz for flac. It does offer the .SPX or 'speex' format, which could prove useful later. For now, I will record a few .flac files at 44kHz and run them through SoX on the command line to reduce the sampling rate. Rather than dig into Audio recorder's settings and waste time here, I will instead put my mind to finding a solution to an issue I will need to solve later anyway.

I have recorded a few samples of me chatting, reading and speaking clearly and saved them somewhere I can access later.

Converting audio samples

I am using SoX to convert the sample files, as it will also be able to convert my other source files to .flac. Kris (or sairon) indicated it works with his process, so I'll try his method first.

After a little research, I discovered that it is unlikely I would be able to hear the difference between dithering and not dithering, but the generally accepted advice is to dither whenever converting to 16-bit or less. I don't know where this generally accepted advice has its source, but it seems sound, so I'll take it. Thanks, KozmoNaut ('Should I dither? (Converting 24bit to 16bit FLAC)', Reply #1, 2016-12-06).

$ sudo apt-get install sox

At this time, apt installed SoX v14.4.2, so that's what I'll be using here, "For noise shaped dither at the default recommended settings."

sox <infile> -b 16 <outfile> rate 44100 dither -s

This created a new file which was slightly smaller, but when I tested it with flac, the same sample rate was reported. More work needs to be done.

Verifying the output of the audio conversion

Firstly, I installed flac, as it is not currently on this new build. Using apt-get is quite reliable, and the following command resulted in the installation of 'flac 1.3.2'.

$ sudo apt-get install flac

To check and compare the infile and the outfile, I used the following command, first with the infile as <filename>, then a second time with the outfile. A manual eyeballing will suffice for now. If this is used regularly, I may script the output and run differencing on the outputs to identify if the file has been altered.

$ flac -ac "<filename>" | grep -E '(sample|channels)'

The output indicated that the sample rate was sample_rate=44100 on both files. So KozmoNaut's suggestion to someone else's question did not deliver the required results. That's cool; I just need to dig a little deeper and see what I actually need to do. Instinctively, I tried the following.

$ sox "<infile>" -b 16 "<outfile>" rate 16100 dither -s
sox WARN rate: rate clipped 4 samples; decrease volume?
sox WARN dither: dither clipped 2 samples; decrease volume?

These messages don't look good, but I checked the outfile with the flac command and the result is fewer lines, reporting that sample_rate=16100. Playing the file back sounds fine to me, so let's leave this as finished, but remember that I may need to come back to this stage and alter the settings and conversion process.
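Since I mentioned scripting this check: a minimal sketch of the differencing I have in mind, in Python, wrapping the same flac command (the file names are placeholders, and this is untested):

~~~~
#!/usr/bin/env python3
# Minimal sketch: compare the sample/channel fields reported by
# `flac -ac` for two files. The file names below are placeholders.
import subprocess

def stream_info(filename):
    """Return the unique sample/channel lines from flac's analysis output."""
    result = subprocess.run(
        ["flac", "-ac", filename],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return sorted(set(
        line.strip() for line in result.stdout.splitlines()
        if "sample" in line or "channels" in line
    ))

infile = stream_info("infile.flac")
outfile = stream_info("outfile.flac")
for before, after in zip(infile, outfile):
    marker = " " if before == after else "!"  # flag fields that differ
    print(marker, before, "->", after)
~~~~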

Ok, so my initial sample rates were set to 16100Hz; for some reason I copied KozmoNaut's '100', thinking it was there for a reason. So I resampled at the correct 16000Hz rate and tried again. This didn't work in the first attempt at using Google's speech-api either, but I may not need to downsample if I use an alternative service. I've decided to leave this note here for my future self.

The actual speech to text tests

I would prefer a quick win here, but I always like to have a little control if possible. If I can run the software offline or in-house, it's better for my conscience, and I won't need to worry that the system I use is susceptible to outside influences. Acquisitions and mergers come to mind for free services which are hosted and accessible only by an API, but such services are always going to provide more resources than I can afford to put in place in the short term.

On my initial searches, I discovered a Google option, so decided to explore it.

Option 1: Using Google's speech-api (failure)

You can skip this section if you are in a hurry, as this doesn't work.

The solution initially proposed by Kris in his video uses Google's speech-api and wget to send the flac file, resulting in a JSON string, which is handy, but I've yet to test this part of the process. Initially, I hoped to find a more in-house solution, and I will certainly revisit this section of the project to avoid using Google's services at all. For now, let's press on and see what happens.

GNU Wget 1.19.5 is already installed on my machine, so let's try the command Kris proposed.

$ wget -q -U "Mozilla/5.0" --post-file <outfile> --header="Content-Type: audio/x-flac; rate=16000" -O - "http://www.google.com/speech-api/v1/recognise?lang=en-gb&client=chromium"

You may or may not have noted that I altered the en-us to en-gb on this first attempt; let's see what happens. Also note that I would replace <outfile> with the full path to the output file we created earlier.
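As an aside, the same request expressed in Python with requests, which makes the moving parts clearer to me; this is purely illustrative (and, spoiler, the endpoint is no more):

~~~~
#!/usr/bin/env python3
# Illustrative sketch of the request the wget one-liner makes: POST the
# raw flac bytes, declaring the sample rate in the Content-Type header.
# The v1 endpoint is long dead, so don't expect a useful response.
import requests

URL = "http://www.google.com/speech-api/v1/recognise"  # as typed above
params = {"lang": "en-gb", "client": "chromium"}
headers = {
    "User-Agent": "Mozilla/5.0",
    "Content-Type": "audio/x-flac; rate=16000",
}

with open("outfile.flac", "rb") as f:  # the <outfile> from the sox step
    response = requests.post(URL, params=params, headers=headers, data=f)

print(response.status_code)
print(response.text)  # would be the JSON string, back in the day
~~~~

Anyway, back to the one-liner.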

It didn't work. I've typed something incorrectly here, I'm sure of it, as I'm copying from a YouTube video. I never understand why people make YouTube videos for this sort of thing. I'll watch the video properly.

Ok, so my initial sample rates were set to 16100Hz for some reason, so I resampled at the correct 16000Hz rate and tried again. This didn't work either. I've updated the notes above to reflect this, but it may be that the actual solution doesn't require this step.

Also, Kris refers to sairon as the original writer of this script, so I did a little more digging and found a page on CommandlineFu.com which references the exact same code. Although Kris does mention sairon a few times in the video, it wasn't written anywhere in the details of the video until someone commented on the fact. Besides, this is the point where I realise the video is 8 years old, and that the script written by sairon is from March 2011. I'll quickly double-check the Google API version, and whether I need an API key if they still offer this service, but it may be that I'm already too late to the bandwagon here and will need to train up my own, or borrow someone else's, ML/AI software. I'll check out Google one more time before passing them over, as this will be the quickest win if it works.

Option 1b: Using Google's speech-api again

It appears that the API referred to here has been replaced with their Cloud Speech-to-Text. So let's give that a go now, shall we?

Access to all Cloud Platform products

Get everything that you need to build and run your apps, websites and services, including Firebase and the Google Maps API.
$300 credit for free

Sign up and get $300 to spend on Google Cloud Platform over the next 12 months.
No autocharge after free trial ends

We ask you for your credit card to make sure that you are not a robot. You won't be charged unless you manually upgrade to a paid account.

That ended quickly. I shan't be following this rabbit hole yet. Let's look elsewhere and see what the current state of open-source speech to text is. I read recently in a newsletter that Google and Apple are doing great things, like an entire speech recognition platform on Android in something like 50MB, but it's closed source, so let's see what I can find elsewhere.

Option 2: Open-source options for speech to text

There are quite a few offerings out there. According to Wikipedia's List of speech recognition software, there is a selection of acoustic models and speech corpora. This list is by no means exhaustive, but it's somewhere to start. Another place which caught my eye, as I am loving Python at the moment, is PyPI's (Python Software Foundation) SpeechRecognition 3.8.1 offering, which seems to integrate with many of the options on the Wikipedia page. So rather than a bunch of bash scripts, perhaps I'll start down the Python avenue again.

I digress.

Let's take a look at the first offering my brain picks out. I like the idea of the recognition platform being offline, not because the location the system will eventually run in will not be connected to the internet (it will), but more because I like the idea of not being reliant on too many external providers. Money talks, and so often a great product is snapped up by a large corporation who make promises, and then, well, it's gone.

Running down the rabbit hole: CMUSphinx works offline, but its Sphinx4 incarnation is written in Java, and for some reason that puts me off; I don't have the patience to meditate on that strange feeling right now. But as CMUSphinx integrates with Python, I'm ready to read their articles and tutorials to see where they take me.

Option 2a: CMUSphinx

Option two, selection 'a': after Google's failed attempts, I've decided to have an in-depth look at the CMUSphinx Tutorial For Developers. Now, I wouldn't call myself a developer, but let's read this anyway and get up to speed. I know there will be a lot of things I don't know in this arena, and I'm excited to learn about them.

Setting up PocketSphinx

I've decided to take the plunge on what seems to be what I'm looking for. So in the terminal, I installed python3, virtualenv and pip. Within a new and now active virtualenv, I grabbed a copy of audio_transcribe.py from a GitHub user called Anthony Zhang, who seems to be hosting the Python speech_recognition module under his own repository, as far as I can tell. Well, regardless, I grabbed the whole file and dumped it into my new virtualenv to work from. Maybe this workflow is a little arse backwards; maybe jumping in at the deep end is what I'm all about.

audio_transcribe.py
#!/usr/bin/env python3

import speech_recognition as sr

# obtain path to "english.wav" in the same folder as this script
from os import path
AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "english.wav")
# AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "french.aiff")
# AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "chinese.flac")

# use the audio file as the audio source
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

# recognize speech using Sphinx
try:
    print("Sphinx thinks you said " + r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))

# recognize speech using Google Speech Recognition
try:
    # for testing purposes, we're just using the default API key
    # to use another API key, use `r.recognize_google(audio, key="GOOGLE_SPEECH_RECOGNITION_API_KEY")`
    # instead of `r.recognize_google(audio)`
    print("Google Speech Recognition thinks you said " + r.recognize_google(audio))
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition service; {0}".format(e))

# recognize speech using Google Cloud Speech
GOOGLE_CLOUD_SPEECH_CREDENTIALS = r"""INSERT THE CONTENTS OF THE GOOGLE CLOUD SPEECH JSON CREDENTIALS FILE HERE"""
try:
    print("Google Cloud Speech thinks you said " + r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS))
except sr.UnknownValueError:
    print("Google Cloud Speech could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Cloud Speech service; {0}".format(e))

# recognize speech using Wit.ai
WIT_AI_KEY = "INSERT WIT.AI API KEY HERE"  # Wit.ai keys are 32-character uppercase alphanumeric strings
try:
    print("Wit.ai thinks you said " + r.recognize_wit(audio, key=WIT_AI_KEY))
except sr.UnknownValueError:
    print("Wit.ai could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Wit.ai service; {0}".format(e))

# recognize speech using Microsoft Bing Voice Recognition
BING_KEY = "INSERT BING API KEY HERE"  # Microsoft Bing Voice Recognition API keys are 32-character lowercase hexadecimal strings
try:
    print("Microsoft Bing Voice Recognition thinks you said " + r.recognize_bing(audio, key=BING_KEY))
except sr.UnknownValueError:
    print("Microsoft Bing Voice Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Microsoft Bing Voice Recognition service; {0}".format(e))

# recognize speech using Houndify
HOUNDIFY_CLIENT_ID = "INSERT HOUNDIFY CLIENT ID HERE"  # Houndify client IDs are Base64-encoded strings
HOUNDIFY_CLIENT_KEY = "INSERT HOUNDIFY CLIENT KEY HERE"  # Houndify client keys are Base64-encoded strings
try:
    print("Houndify thinks you said " + r.recognize_houndify(audio, client_id=HOUNDIFY_CLIENT_ID, client_key=HOUNDIFY_CLIENT_KEY))
except sr.UnknownValueError:
    print("Houndify could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Houndify service; {0}".format(e))

# recognize speech using IBM Speech to Text
IBM_USERNAME = "INSERT IBM SPEECH TO TEXT USERNAME HERE"  # IBM Speech to Text usernames are strings of the form XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
IBM_PASSWORD = "INSERT IBM SPEECH TO TEXT PASSWORD HERE"  # IBM Speech to Text passwords are mixed-case alphanumeric strings
try:
    print("IBM Speech to Text thinks you said " + r.recognize_ibm(audio, username=IBM_USERNAME, password=IBM_PASSWORD))
except sr.UnknownValueError:
    print("IBM Speech to Text could not understand audio")
except sr.RequestError as e:
    print("Could not request results from IBM Speech to Text service; {0}".format(e))

This should be a working Python script, which will reveal any and all of the modules I will need to install to get PocketSphinx installed and working. I will look at getting a whole Java IDE set up and try Sphinx4 another time, or if I fail at this. I'm not a fan of heavyweight, complex IDEs; having started mucking about with computers back in the 90s, I like to use a notepad app and the command line / terminal session as much as possible. For now, I'll progress with this method and see where it takes me.

I've customised the script and removed all the other providers.

#!/usr/bin/env python3

import speech_recognition as sr

# obtain path to "english.wav" in the same folder as this script
from os import path
AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "english.wav")

# use the audio file as the audio source
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

# recognize speech using Sphinx
try:
    print("Sphinx thinks you said " + r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))

The import at the top of this script tells me I need to install speech_recognition as a Python package using pip. I then tried installing PocketSphinx, again using pip, and received a number of errors and warnings.

$ pip install speech_recognition
$ pip install PocketSphinx

I realised then that there were a number of dependent modules required for each, and probably more, so I looked a little further. I expect an IDE would do all this for me, but I like to fettle.

Referenced from GitHub, cmusphinx/pocketsphinx-python:

$ python -m pip install --upgrade pip setuptools wheel
$ pip install --upgrade pocketsphinx
$ sudo apt-get install -y python python-dev python-pip build-essential swig git libpulse-dev
$ sudo apt-get install gcc

This is like 'old Linux'; I love it. I'm here trying something, seeing errors, interpreting the errors and working out how to install the missing bits or fix the problem. Onwards!

Failed building wheel for pocketsphinx
Running setup.py clean for pocketsphinx
Failed to build pocketsphinx

Brilliant! And there is more:

    deps/sphinxbase/src/libsphinxad/ad_alsa.c:76:10: fatal error: alsa/asoundlib.h: No such file or directory
     #include <alsa/asoundlib.h>
              ^~~~~~~~~~~~~~~~~~
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

    ----------------------------------------
Command "/home/user/vtt/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-31sr27xe/pocketsphinx/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-jpu8ofhr/install-record.txt --single-version-externally-managed --compile --install-headers /home/user/vtt/include/site/python3.6/pocketsphinx" failed with error code 1 in /tmp/pip-install-31sr27xe/pocketsphinx/

I'm guessing I need to look at getting the alsa/asoundlib.h header installed for gcc? Let's look into that. At this point, I've installed and downloaded so much crap into this virtualenv that I decide to delete it and start over, setting this machine up as the development platform with no virtualenv for the moment. Let's deactivate, delete and then run the commands again and see what happens, whilst I have a look at this alsa issue.

After a little investigation, I'm not sure I need the alsa modules for this project, but I found a fix from ogiewon in a discussion on 'fatal error: alsa/asoundlib.h: No such file or directory' regarding a project called assistant-relay, where chriswalker01 was having the same error message.

Every time I have installed the v1 of assistant-relay on Raspberry Pi Stretch, I have had to install "libasound2-dev" as a pre-requisite to get past the error posted above.

So let's try installing libasound2-dev and see what happens when we try to get pocketsphinx on the system again.

$ sudo apt-get install libasound2-dev
....
$ pip install --upgrade pocketsphinx
Collecting pocketsphinx
  Cache entry deserialization failed, entry ignored
  Cache entry deserialization failed, entry ignored
  Downloading https://files.pythonhosted.org/packages/cd/4a/adea55f189a81aed88efa0b0e1d25628e5ed22622ab9174bf696dd4f9474/pocketsphinx-0.1.15.tar.gz (29.1MB)
    100% |████████████████████████████████| 29.1MB 46kB/s
Building wheels for collected packages: pocketsphinx
  Running setup.py bdist_wheel for pocketsphinx ... done
  Stored in directory: /home/<user>/.cache/pip/wheels/52/fd/52/2f62c9a0036940cc0c89e58ee0b9d00fcf78243aeaf416265f
Successfully built pocketsphinx
Installing collected packages: pocketsphinx
Successfully installed pocketsphinx-0.1.15

Let's try this script I grabbed earlier and see if it can run now.

$ ./audio_transcribe.py
Traceback (most recent call last):
  File "./audio_transcribe.py", line 3, in <module>
    import speech_recognition as sr
ModuleNotFoundError: No module named 'speech_recognition'

Um, ok, so we've got speech_recognition installed under a different pip? Let's try to install it under pip3.

$ pip3 install speech_recognition
Collecting speech_recognition
  Could not find a version that satisfies the requirement speech_recognition (from versions: )
No matching distribution found for speech_recognition

Oh, I'm such a wally.

$ pip3 install SpeechRecognition
Collecting SpeechRecognition
  Using cached https://files.pythonhosted.org/packages/26/e1/7f5678cd94ec1234269d23756dbdaa4c8cfaed973412f88ae8adf7893a50/SpeechRecognition-3.8.1-py2.py3-none-any.whl
Installing collected packages: SpeechRecognition
Successfully installed SpeechRecognition-3.8.1

Now then, let's make sure we've got pocketsphinx on the right Python version too.

$ pip2 uninstall pocketsphinx
  DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
  Uninstalling pocketsphinx-0.1.15:
    Would remove:
      /home/<user>/.local/lib/python2.7/site-packages/pocketsphinx-0.1.15.dist-info/*
      /home/<user>/.local/lib/python2.7/site-packages/pocketsphinx/*
      /home/<user>/.local/lib/python2.7/site-packages/sphinxbase/*
  Proceed (y/n)? y
    Successfully uninstalled pocketsphinx-0.1.15

$ pip3 install --upgrade pocketsphinx                    
Collecting pocketsphinx
  Using cached https://files.pythonhosted.org/packages/cd/4a/adea55f189a81aed88efa0b0e1d25628e5ed22622ab9174bf696dd4f9474/pocketsphinx-0.1.15.tar.gz
Building wheels for collected packages: pocketsphinx
  Building wheel for pocketsphinx (setup.py) ... done
  Stored in directory: /home/<user>/.cache/pip/wheels/52/fd/52/2f62c9a0036940cc0c89e58ee0b9d00fcf78243aeaf416265f
Successfully built pocketsphinx
Installing collected packages: pocketsphinx
Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/usr/local/lib/python3.6/dist-packages/pocketsphinx-0.1.15.dist-info'
Consider using the `--user` option or check the permissions.

Oh, wonderful, so what's the permissions issue here then?

$ ls -a '/usr/local/lib/python3.6/dist-packages/'
.  ..
$ ls -a -l '/usr/local/lib/python3.6/'
total 12
drwxrwsr-x 3 root staff 4096 okt.  18 00:26 .
drwxr-xr-x 4 root root  4096 okt.  18 00:30 ..
drwxrwsr-x 2 root staff 4096 okt.  18 00:26 dist-packages
$ ls -a -l '/usr/local/lib/'
total 16
drwxr-xr-x  4 root root  4096 okt.  18 00:30 .
drwxr-xr-x 10 root root  4096 okt.  18 00:26 ..
drwxrwsr-x  4 root staff 4096 okt.  18 00:38 python2.7
drwxrwsr-x  3 root staff 4096 okt.  18 00:26 python3.6
$ ls -a -l '/usr/local/'
total 40
drwxr-xr-x 10 root root 4096 okt.  18 00:26 .
drwxr-xr-x 10 root root 4096 okt.  18 00:26 ..
drwxr-xr-x  2 root root 4096 mars  22 15:54 bin
drwxr-xr-x  2 root root 4096 okt.  18 00:26 etc
drwxr-xr-x  2 root root 4096 okt.  18 00:26 games
drwxr-xr-x  2 root root 4096 okt.  18 00:26 include
drwxr-xr-x  4 root root 4096 okt.  18 00:30 lib
lrwxrwxrwx  1 root root    9 mars  18 11:07 man -> share/man
drwxr-xr-x  2 root root 4096 okt.  18 00:26 sbin
drwxr-xr-x 10 root root 4096 mars  22 15:54 share
drwxr-xr-x  2 root root 4096 okt.  18 00:26 src

I wonder if we need to add ourselves to the staff group then, so we can actually write into the dist-packages folder and therefore install pocketsphinx with pip3.

$ sudo adduser <user> staff
Adding user `<user>' to group `staff' ...
Adding user <user> to group staff
Done.
$ pip3 install pocketsphinx
Collecting pocketsphinx
Installing collected packages: pocketsphinx
Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/usr/local/lib/python3.6/dist-packages/pocketsphinx-0.1.15.dist-info'
Consider using the `--user` option or check the permissions.

Oh, so we have something else going on here then? Or perhaps we need to log off and back on again to allow group changes to propagate. Back in a moment.

Back.

$ pip3 install pocketsphinx
Collecting pocketsphinx
Installing collected packages: pocketsphinx
Successfully installed pocketsphinx-0.1.15

Woohoo, let's try the script again then?

$ python3 ./audio_transcribe.py
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/speech_recognition/__init__.py", line 203, in __enter__
    self.audio_reader = wave.open(self.filename_or_fileobject, "rb")
  File "/usr/lib/python3.6/wave.py", line 499, in open
    return Wave_read(f)
  File "/usr/lib/python3.6/wave.py", line 163, in __init__
    self.initfp(f)
  File "/usr/lib/python3.6/wave.py", line 130, in initfp
    raise Error('file does not start with RIFF id')
wave.Error: file does not start with RIFF id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/speech_recognition/__init__.py", line 208, in __enter__
    self.audio_reader = aifc.open(self.filename_or_fileobject, "rb")
  File "/usr/lib/python3.6/aifc.py", line 913, in open
    return Aifc_read(f)
  File "/usr/lib/python3.6/aifc.py", line 352, in __init__
    self.initfp(file_object)
  File "/usr/lib/python3.6/aifc.py", line 316, in initfp
    raise Error('file does not start with FORM id')
aifc.Error: file does not start with FORM id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/speech_recognition/__init__.py", line 234, in __enter__
    self.audio_reader = aifc.open(aiff_file, "rb")
  File "/usr/lib/python3.6/aifc.py", line 913, in open
    return Aifc_read(f)
  File "/usr/lib/python3.6/aifc.py", line 358, in __init__
    self.initfp(f)
  File "/usr/lib/python3.6/aifc.py", line 323, in initfp
    raise Error('not an AIFF or AIFF-C file')
aifc.Error: not an AIFF or AIFF-C file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./audio_transcribe.py", line 14, in <module>
    with sr.AudioFile(AUDIO_FILE) as source:
  File "/usr/local/lib/python3.6/dist-packages/speech_recognition/__init__.py", line 236, in __enter__
    raise ValueError("Audio file could not be read as PCM WAV, AIFF/AIFF-C, or Native FLAC; check if file is corrupted or in another format")
ValueError: Audio file could not be read as PCM WAV, AIFF/AIFF-C, or Native FLAC; check if file is corrupted or in another format

Hmmm.

$ flac -t testing.flac

flac 1.3.2
Copyright (C) 2000-2009  Josh Coalson, 2011-2016  Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY.  This is free software, and you are
welcome to redistribute it under certain conditions.  Type `flac' for details.

testing.flac: WARNING, cannot check MD5 signature since it was unset in the STREAMINFO
ok                    

So let's re-encode the .flac as a wav and try again. Obviously, I've edited the source to reflect the filename change.

$ sox testing.flac english.wav
$ python3 ./audio_transcribe.py
Sphinx thinks you said this is the second test of the voting record and i hope that his recordings of them

Not 100% perfect, but it's free. I wonder what happens if I give it a larger file to work with.

$ python3 ./audio_transcribe.py
Sphinx thinks you said had the stove ultimately the court of rome simply his route to it g. eight to kids judge it easy i'm thankful to that stuff wonders of the present level of smuggling of the low pickup maybe some good quality you'll get the summoning distortions and the reason the mildew artifacts somewhat like to elect a fashion to what route you don't have half of what some solace for the purpose of contempt and

Let's just say, this was nothing like what I said. We need to train.
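Before training, a note to self: editing AUDIO_FILE in the source for every new recording will get old fast. A sketch of where I'd like the script to end up, taking the file as an argument and letting SoX normalise the format first (untested, and it assumes sox is on the PATH):

~~~~
#!/usr/bin/env python3
# Sketch: transcribe any audio file passed as an argument, converting it
# to WAV with SoX first instead of hard-coding "english.wav". Untested.
import subprocess
import sys
import tempfile

import speech_recognition as sr

src = sys.argv[1]  # e.g. ./transcribe.py testing.flac

with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
    # Let SoX deal with whatever format the recorder produced.
    subprocess.run(["sox", src, tmp.name], check=True)

    r = sr.Recognizer()
    with sr.AudioFile(tmp.name) as source:
        audio = r.record(source)

    try:
        print("Sphinx thinks you said " + r.recognize_sphinx(audio))
    except sr.UnknownValueError:
        print("Sphinx could not understand audio")
    except sr.RequestError as e:
        print("Sphinx error; {0}".format(e))
~~~~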

Adapting the default acoustic model

I've read up on Adapting the default acoustic model, specifically creating an adaptation corpus, and directly quoting the page:

The first thing you need to do is to create a corpus of adaptation data. The corpus will consist of

- a list of sentences
- a dictionary describing the pronunciation of all the words in that list of sentences
- a recording of you speaking each of those sentences

The sections below will refer to these files, so, if you want to follow along we recommend downloading these files now.

$ wget http://cmusphinx.github.io/data/arctic20.fileids
--2019-03-24 21:03:55--  http://cmusphinx.github.io/data/arctic20.fileids
Resolving cmusphinx.github.io (cmusphinx.github.io)... 185.199.108.153, 185.199.111.153, 185.199.109.153, ...
Connecting to cmusphinx.github.io (cmusphinx.github.io)|185.199.108.153|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cmusphinx.github.io/data/arctic20.fileids [following]
--2019-03-24 21:03:55--  https://cmusphinx.github.io/data/arctic20.fileids
Connecting to cmusphinx.github.io (cmusphinx.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 240 [application/octet-stream]
Saving to: ‘arctic20.fileids’

arctic20.fileids                               100%[==================================================================================================>]     240  --.-KB/s    in 0s      

2019-03-24 21:03:55 (12,4 MB/s) - ‘arctic20.fileids’ saved [240/240]

$ wget http://cmusphinx.github.io/data/arctic20.transcription
--2019-03-24 21:04:07--  http://cmusphinx.github.io/data/arctic20.transcription
Resolving cmusphinx.github.io (cmusphinx.github.io)... 185.199.108.153, 185.199.111.153, 185.199.109.153, ...
Connecting to cmusphinx.github.io (cmusphinx.github.io)|185.199.108.153|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cmusphinx.github.io/data/arctic20.transcription [following]
--2019-03-24 21:04:07--  https://cmusphinx.github.io/data/arctic20.transcription
Connecting to cmusphinx.github.io (cmusphinx.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1458 (1,4K) [application/octet-stream]
Saving to: ‘arctic20.transcription’

arctic20.transcription                         100%[==================================================================================================>]   1,42K  --.-KB/s    in 0s      

2019-03-24 21:04:07 (115 MB/s) - ‘arctic20.transcription’ saved [1458/1458]
Sphinxbase

You should also make sure that you have downloaded and compiled sphinxbase and sphinxtrain.

$ git clone https://github.com/cmusphinx/sphinxbase.git
$ sudo apt-get install autoconf libtool automake bison
$ cd ./sphinxbase
$ ./autogen.sh
$ make check
$ make

Sphinxtrain

$ git clone https://github.com/cmusphinx/sphinxtrain.git
$ pip3 install NumPy SciPy
$ cd ./sphinxtrain
$ ./autogen.sh
$ make check
$ make

Recording your adaptation data

I'm not sure if the previous setup has completed correctly or not, but let's press on. I now need to create a number of audio files to adapt the model to my speech. I'll record the audio on the device I'll most likely be using: the Android mobile. Then I'll sync the files to the development machine using Syncthing. When a new file arrives in that folder, a little bit of Python will convert the file to the correct format, move the original into a second folder, and put the converted file into a third folder, which will eventually be used to trigger the conversion. For now, that third folder will hold the training files. Confused? I'm not. A rough sketch of that little bit of Python follows.
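The folder names and the polling approach here are placeholder assumptions, not final:

~~~~
#!/usr/bin/env python3
# Rough sketch of the watcher: poll the Syncthing folder, convert new
# recordings with SoX, archive the originals. Folder names are
# placeholders, and this is untested.
import subprocess
import time
from pathlib import Path

INCOMING = Path("sync/incoming")    # where Syncthing drops recordings
ORIGINALS = Path("sync/originals")  # originals are moved here
CONVERTED = Path("sync/converted")  # 16kHz mono output lands here

for folder in (INCOMING, ORIGINALS, CONVERTED):
    folder.mkdir(parents=True, exist_ok=True)

while True:
    for wav in INCOMING.glob("*.wav"):
        out = CONVERTED / wav.name
        # Same SoX settings as my manual conversions below.
        subprocess.run(
            ["sox", str(wav), "-b", "16", "-D", "-c", "1",
             str(out), "rate", "16000"],
            check=True,
        )
        wav.rename(ORIGINALS / wav.name)  # take the original out of the way
    time.sleep(10)  # simple polling; inotify would be neater
~~~~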

The files I downloaded earlier contain the following. The format is comprehensible, and I can see how it all comes together. I may add more sentences later, but for now I'll stick with the options presented to me.

$ cat arctic20.fileids
arctic_0001
arctic_0002
arctic_0003
arctic_0004
arctic_0005
arctic_0006
arctic_0007
arctic_0008
arctic_0009
arctic_0010
arctic_0011
arctic_0012
arctic_0013
arctic_0014
arctic_0015
arctic_0016
arctic_0017
arctic_0018
arctic_0019
arctic_0020


$ cat arctic20.transcription
author of the danger trail philip steels etc (arctic_0001)
not at this particular case tom apologized whittemore (arctic_0002)
for the twentieth time that evening the two men shook hands (arctic_0003)
lord but i'm glad to see you again phil (arctic_0004)
will we ever forget it (arctic_0005)
god bless 'em i hope i'll go on seeing them forever (arctic_0006)
and you always want to see it in the superlative degree (arctic_0007)
gad your letter came just in time (arctic_0008)
he turned sharply and faced gregson across the table (arctic_0009)
i'm playing a single hand in what looks like a losing game (arctic_0010)
if i ever needed a fighter in my life i need one now (arctic_0011)
gregson shoved back his chair and rose to his feet (arctic_0012)
he was a head shorter than his companion of almost delicate physique (arctic_0013)
now you're coming down to business phil he exclaimed (arctic_0014)
it's the aurora borealis (arctic_0015)
there's fort churchill a rifleshot beyond the ridge asleep (arctic_0016)
from that moment his friendship for belize turns to hatred and jealousy (arctic_0017)
there was a change now (arctic_0018)
i followed the line of the proposed railroad looking for chances (arctic_0019)
clubs and balls and cities grew to be only memories (arctic_0020)

_Hussh, I'm reading..._

So I recorded the above on my phone, then set up and configured Syncthing with the phone as a send-only device. The files appeared, and for now I moved them manually. Now I'll listen back to them.

$ for i in *.wav; do play "$i"; done

Cool, they all worked. Now let's sort out the sampling rate and the file names. Rather than writing a script for this, I'll manually rename and convert these; it's getting late now, and I reckon I'd get distracted by all the possibilities, so I'll be typing this out 20 times.

$ play 'Audio recording 2019-03-24 21-38-26.wav'
play WARN alsa: can't encode 0-bit Unknown or not applicable

Audio recording 2019-03-24 21-38-26.wav:

File Size: 732k Bit Rate: 1.41M
Encoding: Signed PCM
Channels: 2 @ 16-bit
Samplerate: 44100Hz
Replaygain: off
Duration: 00:00:04.15

In:100% 00:00:04.15 [00:00:00.00] Out:183k [ | ] Clip:0
Done.
$ mv 'Audio recording 2019-03-24 21-38-26.wav' arctic_0003_44Hz.wav
$ sox arctic_0003_44Hz.wav -b 16 arctic_0003.wav rate 16000 dither -s
$ play arctic_0003.wav
play WARN alsa: can't encode 0-bit Unknown or not applicable

arctic_0003.wav:

File Size: 266k Bit Rate: 512k
Encoding: Signed PCM
Channels: 2 @ 16-bit
Samplerate: 16000Hz
Replaygain: off
Duration: 00:00:04.15

In:100% 00:00:04.15 [00:00:00.00] Out:66.4k [ | ] Clip:0
Done.

You can see here the samplerate is now 16000Hz, and the files have been renamed in accordance with the file-ids.

The next step is to find where the default acoustic model is installed by locating a dictionary file. Then copy the lot to our modelling directory.

$ sudo find / -name en-us.lm.bin
/usr/local/lib/python3.6/dist-packages/pocketsphinx/model/en-us.lm.bin
$ sudo cp -r /usr/local/lib/python3.6/dist-packages/pocketsphinx/model/en-us/ ./en-us/
$ ls ./en-us/
feat.params  mdef  means  noisedict  README  sendump  transition_matrices  variances
$ ls -l ./en-us/
total 6472
-rw-rw-r-- 1 nllewellyn nllewellyn     230 mars 24 22:36 feat.params
-rw-r--r-- 1 root       root       2959176 mars 24 22:36 mdef
-rw-r--r-- 1 root       root        838732 mars 24 22:36 means
-rw-r--r-- 1 root       root            56 mars 24 22:36 noisedict
-rw-r--r-- 1 root       root          1617 mars 24 22:36 README
-rw-r--r-- 1 root       root       1969024 mars 24 22:36 sendump
-rw-r--r-- 1 root       root          2080 mars 24 22:36 transition_matrices
-rw-r--r-- 1 root       root        838732 mars 24 22:36 variances
$ sudo cp -r /usr/local/lib/python3.6/dist-packages/pocketsphinx/model/ ./


$ sphinx_fe -argfile en-us/feat.params -samprate 16000 -c arctic20.fileids -di . -do . -ei wav -eo mfc -mswav yes
....
Current configuration:
[NAME]    [DEFLT]  [VALUE]
-dither   no       no
....
INFO: sphinx_fe.c(787): Converting ./arctic_0001.wav to ./arctic_0001.mfc
ERROR: "sphinx_fe.c", line 132: Number of channels 2 does not match configured value in file './arctic_0001.wav'

It appears I created the files with SoX's automatic dither function, and it decided not to dither; with other files it may kick in, and I do not want a mismatched config later, so I will add the -D switch to disable dithering explicitly. There was also a complaint about the number of channels: two channels when we need one. So the new commands to run are:

sox arctic_0001_44Hz.wav -b 16 -D -c 1 arctic_0001.wav rate 16000
sox arctic_0002_44Hz.wav -b 16 -D -c 1 arctic_0002.wav rate 16000
sox arctic_0003_44Hz.wav -b 16 -D -c 1 arctic_0003.wav rate 16000
sox arctic_0004_44Hz.wav -b 16 -D -c 1 arctic_0004.wav rate 16000
sox arctic_0005_44Hz.wav -b 16 -D -c 1 arctic_0005.wav rate 16000
sox arctic_0006_44Hz.wav -b 16 -D -c 1 arctic_0006.wav rate 16000
sox arctic_0007_44Hz.wav -b 16 -D -c 1 arctic_0007.wav rate 16000
sox arctic_0008_44Hz.wav -b 16 -D -c 1 arctic_0008.wav rate 16000
sox arctic_0009_44Hz.wav -b 16 -D -c 1 arctic_0009.wav rate 16000
sox arctic_0010_44Hz.wav -b 16 -D -c 1 arctic_0010.wav rate 16000
sox arctic_0011_44Hz.wav -b 16 -D -c 1 arctic_0011.wav rate 16000
sox arctic_0012_44Hz.wav -b 16 -D -c 1 arctic_0012.wav rate 16000
sox arctic_0013_44Hz.wav -b 16 -D -c 1 arctic_0013.wav rate 16000
sox arctic_0014_44Hz.wav -b 16 -D -c 1 arctic_0014.wav rate 16000
sox arctic_0015_44Hz.wav -b 16 -D -c 1 arctic_0015.wav rate 16000
sox arctic_0016_44Hz.wav -b 16 -D -c 1 arctic_0016.wav rate 16000
sox arctic_0017_44Hz.wav -b 16 -D -c 1 arctic_0017.wav rate 16000
sox arctic_0018_44Hz.wav -b 16 -D -c 1 arctic_0018.wav rate 16000
sox arctic_0019_44Hz.wav -b 16 -D -c 1 arctic_0019.wav rate 16000
sox arctic_0020_44Hz.wav -b 16 -D -c 1 arctic_0020.wav rate 16000
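Typing that out twenty times was a choice. For next time, the same thing as a loop, sketched in Python on the assumption that the numbered naming above holds:

~~~~
#!/usr/bin/env python3
# Sketch: run the same SoX conversion over all twenty arctic files,
# assuming the arctic_NNNN_44Hz.wav naming used above.
import subprocess

for n in range(1, 21):
    src = "arctic_{:04d}_44Hz.wav".format(n)
    dst = "arctic_{:04d}.wav".format(n)
    subprocess.run(
        ["sox", src, "-b", "16", "-D", "-c", "1", dst, "rate", "16000"],
        check=True,
    )
~~~~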

$ sphinx_fe -argfile en-us/feat.params -samprate 16000 -c arctic20.fileids -di . -do . -ei wav -eo mfc -mswav yes
Current configuration:
[NAME]            [DEFLT]         [VALUE]
-alpha            0.97            9.700000e-01
-argfile                          en-us/feat.params
-blocksize        2048            2048
-build_outdirs    yes             yes
-c                                arctic20.fileids
-cep2spec         no              no
-di                               .
-dither           no              no
-do                               .
-doublebw         no              no
-ei                               wav
-eo                               mfc
-example          no              no
-frate            100             100
-help             no              no
-i
-input_endian     little          little
-lifter           0               0
-logspec          no              no
-lowerf           133.33334       1.333333e+02
-mach_endian      little          little
-mswav            no              yes
-ncep             13              13
-nchans           1               1
-nfft             512             512
-nfilt            40              40
-nist             no              no
-npart            0               0
-nskip            0               0
-o
-ofmt             sphinx          sphinx
-part             0               0
-raw              no              no
-remove_dc        no              no
-remove_noise     yes             yes
-remove_silence   yes             yes
-round_filters    yes             yes
-runlen           -1              -1
-samprate         16000           1.600000e+04
-seed             -1              -1
-smoothspec       no              no
-spec2cep         no              no
-sph2pipe         no              no
-transform        legacy          legacy
-unit_area        yes             yes
-upperf           6855.4976       6.855498e+03
-vad_postspeech   50              50
-vad_prespeech    20              20
-vad_startspeech  10              10
-vad_threshold    2.0             2.000000e+00
-verbose          no              no
-warp_params
-warp_type        inverse_linear  inverse_linear
-whichchan        0               0
-wlen             0.025625        2.562500e-02

Current configuration:
[NAME]            [DEFLT]         [VALUE]
-alpha            0.97            9.700000e-01
-argfile                          en-us/feat.params
-blocksize        2048            2048
-build_outdirs    yes             yes
-c                                arctic20.fileids
-cep2spec         no              no
-di                               .
-dither           no              no
-do                               .
-doublebw         no              no
-ei                               wav
-eo                               mfc
-example          no              no
-frate            100             100
-help             no              no
-i
-input_endian     little          little
-lifter           0               22
-logspec          no              no
-lowerf           133.33334       1.300000e+02
-mach_endian      little          little
-mswav            no              yes
-ncep             13              13
-nchans           1               1
-nfft             512             512
-nfilt            40              25
-nist             no              no
-npart            0               0
-nskip            0               0
-o
-ofmt             sphinx          sphinx
-part             0               0
-raw              no              no
-remove_dc        no              no
-remove_noise     yes             yes
-remove_silence   yes             yes
-round_filters    yes             yes
-runlen           -1              -1
-samprate         16000           1.600000e+04
-seed             -1              -1
-smoothspec       no              no
-spec2cep         no              no
-sph2pipe         no              no
-transform        legacy          dct
-unit_area        yes             yes
-upperf           6855.4976       6.800000e+03
-vad_postspeech   50              50
-vad_prespeech    20              20
-vad_startspeech  10              10
-vad_threshold    2.0             2.000000e+00
-verbose          no              no
-warp_params
-warp_type        inverse_linear  inverse_linear
-whichchan        0               0
-wlen             0.025625        2.562500e-02

INFO: sphinx_fe.c(967): Processing all remaining utterances at position 0
INFO: sphinx_fe.c(787): Converting ./arctic_0001.wav to ./arctic_0001.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0002.wav to ./arctic_0002.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0003.wav to ./arctic_0003.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0004.wav to ./arctic_0004.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0005.wav to ./arctic_0005.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0006.wav to ./arctic_0006.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0007.wav to ./arctic_0007.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0008.wav to ./arctic_0008.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0009.wav to ./arctic_0009.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0010.wav to ./arctic_0010.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0011.wav to ./arctic_0011.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0012.wav to ./arctic_0012.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0013.wav to ./arctic_0013.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0014.wav to ./arctic_0014.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0015.wav to ./arctic_0015.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0016.wav to ./arctic_0016.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0017.wav to ./arctic_0017.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0018.wav to ./arctic_0018.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0019.wav to ./arctic_0019.mfc
INFO: sphinx_fe.c(787): Converting ./arctic_0020.wav to ./arctic_0020.mfc

The bw program is found in /usr/lib/sphinxtrain/bw.

./bw \
  -hmmdir en-us \
  -moddeffn en-us/mdef.txt \
  -ts2cbfn .ptm. \
  -feat 1s_c_d_dd \
  -svspec 0-12/13-25/26-38 \
  -cmn current \
  -agc none \
  -dictfn cmudict-en-us.dict \
  -ctlfn arctic20.fileids \
  -lsnfn arctic20.transcription \
  -accumdir .

$ ./bw -hmmdir en-us -moddeffn en-us/mdef.txt -ts2cbfn .ptm. -feat 1s_c_d_dd -svspec 0-12/13-25/26-38 -cmn current -agc none -dictfn cmudict-en-us.dict -ctlfn arctic20.fileids -lsnfn arctic20.transcription -accumdir .
Current configuration:
[NAME]           [DEFLT]    [VALUE]
-2passvar        no         no
-abeam           1e-100     1.000000e-100
-accumdir                   .
-agc             none       none
-agcthresh       2.0        2.000000e+00
-bbeam           1e-100     1.000000e-100
-cb2mllrfn       .1cls.     .1cls.
-cepdir
-cepext          mfc        mfc
-ceplen          13         13
-ckptintv        0
-cmn             live       current
-cmninit         40,3,-1    40,3,-1
-ctlfn                      arctic20.fileids
-diagfull        no         no
-dictfn                     cmudict-en-us.dict
-example         no         no
-fdictfn
-feat            1s_c_d_dd  1s_c_d_dd
-fullvar         no         no
-help            no         no
-hmmdir                     en-us
-latdir
-latext
-lda
-ldadim          0          0
-lsnfn                      arctic20.transcription
-lw              11.5       1.150000e+01
-maxuttlen       0          0
-meanfn
-meanreest       yes        yes
-mixwfn
-mixwreest       yes        yes
-mllrmat
-mmie            no         no
-mmie_type       rand       rand
-moddeffn                   en-us/mdef.txt
-mwfloor         0.00001    1.000000e-05
-npart           0
-nskip           0
-outphsegdir
-outputfullpath  no         no
-part            0
-pdumpdir
-phsegdir
-phsegext        phseg      phseg
-runlen          -1         -1
-sentdir
-sentext         sent       sent
-spthresh        0.0        0.000000e+00
-svspec                     0-12/13-25/26-38
-timing          yes        yes
-tmatfn
-tmatreest       yes        yes
-topn            4          4
-tpfloor         0.0001     1.000000e-04
-ts2cbfn                    .ptm.
-varfloor        0.00001    1.000000e-05
-varfn
-varnorm         no         no
-varreest        yes        yes
-viterbi         no         no

INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
INFO: main.c(253): Using subvector specification 0-12/13-25/26-38
INFO: main.c(316): Reading en-us/mdef.txt
INFO: model_def_io.c(573): Model definition info:
INFO: model_def_io.c(574): 137095 total models defined (42 base, 137053 tri)
INFO: model_def_io.c(575): 548380 total states
INFO: model_def_io.c(576): 5126 total tied states
INFO: model_def_io.c(577): 126 total tied CI states
INFO: model_def_io.c(578): 42 total tied transition matrices
INFO: model_def_io.c(579): 4 max state/model
INFO: model_def_io.c(580): 4 min state/model
INFO: s3mixw_io.c(117): Read en-us/mixture_weights [5126x3x128 array]
INFO: s3tmat_io.c(118): Read en-us/transition_matrices [42x3x4 array]
INFO: mod_inv.c(301): inserting tprob floor 1.000000e-04 and renormalizing
INFO: s3gau_io.c(169): Read en-us/means [42x3x128 array]
INFO: s3gau_io.c(169): Read en-us/variances [42x3x128 array]
INFO: gauden.c(176): 42 total mgau
INFO: gauden.c(150): 3 feature streams (|0|=13 |1|=13 |2|=13 )
INFO: gauden.c(187): 128 total densities
INFO: gauden.c(90): min_var=1.000000e-05
INFO: gauden.c(165): compute 4 densities/frame
INFO: main.c(429): Will reestimate mixing weights.
INFO: main.c(431): Will reestimate means.
INFO: main.c(433): Will reestimate variances.
INFO: main.c(441): Will reestimate transition matrices
INFO: main.c(454): Reading main dictionary: cmudict-en-us.dict
INFO: lexicon.c(221): 134723 entries added from cmudict-en-us.dict
INFO: main.c(464): Reading filler dictionary: en-us/noisedict
INFO: lexicon.c(221): 5 entries added from en-us/noisedict
INFO: corpus.c(1062): Will process all remaining utts starting at 0
INFO: main.c(663): Reestimation: Baum-Welch
INFO: main.c(667): Generating profiling information consumes significant CPU resources.
INFO: main.c(668): If you are not interested in profiling, use -timing no
column defns ...
timing info ...
INFO: cmn.c(133): CMN: 44.38 -0.76 0.66 3.42 9.58 1.72 -7.99 -6.75 -6.13 -8.45 -1.71 -6.53 -3.11 utt> 0 arctic_0001 408 0 140 22 8 8 2.435248e-102 -1.504397e+02 -6.137940e+04 utt 0.016x 1.000e upd 0.016x 1.000e fwd 0.008x 1.000e bwd 0.008x 1.000e gau 0.033x 1.001e rsts 0.001x 1.007e rstf 0.001x 1.011e rstu 0.000x 1.001e

INFO: cmn.c(133): CMN: 41.70 -3.65 7.36 7.46 6.50 6.12 -10.61 -6.67 -4.46 -12.72 1.66 -9.13 4.38 utt> 1 arctic_0002 433 0 160 25 9 8 2.960247e-102 -1.534516e+02 -6.644456e+04 utt 0.019x 1.000e upd 0.019x 1.000e fwd 0.010x 1.000e bwd 0.008x 1.000e gau 0.038x 1.003e rsts 0.001x 1.018e rstf 0.001x 1.016e rstu 0.000x 1.000e

INFO: cmn.c(133): CMN: 40.89 -5.55 4.59 7.05 9.86 8.54 -15.13 -2.22 -6.55 -10.94 3.69 -8.79 3.79 utt> 2 arctic_0003 412 0 164 22 8 9 2.050755e-102 -1.497654e+02 -6.170334e+04 utt 0.017x 1.000e upd 0.017x 1.000e fwd 0.009x 1.000e bwd 0.008x 1.000e gau 0.035x 1.002e rsts 0.001x 1.001e rstf 0.001x 1.020e rstu 0.000x 1.001e

INFO: cmn.c(133): CMN: 38.32 2.67 1.56 5.20 6.72 5.88 -10.27 -5.51 -7.55 -14.54 6.63 -6.30 1.53 utt> 3 arctic_0004 339 0 112 24 8 8 3.246291e-102 -1.485753e+02 -5.036702e+04 utt 0.016x 1.000e upd 0.016x 1.000e fwd 0.009x 1.000e bwd 0.007x 1.000e gau 0.033x 1.002e rsts 0.001x 1.020e rstf 0.000x 1.044e rstu 0.000x 1.000e

INFO: cmn.c(133): CMN: 33.92 0.32 1.90 6.49 8.08 2.06 -16.68 -4.06 -4.96 -10.96 2.38 -5.02 3.93 utt> 4 arctic_0005 203 0 68 21 9 8 2.067359e-102 -1.459429e+02 -2.962642e+04 utt 0.016x 1.000e upd 0.016x 1.000e fwd 0.006x 1.000e bwd 0.010x 1.000e gau 0.041x 1.003e rsts 0.002x 0.989e rstf 0.001x 1.014e rstu 0.000x 1.001e

INFO: cmn.c(133): CMN: 40.16 6.49 4.33 2.24 2.66 10.32 -12.85 -6.87 -6.96 -13.68 8.23 -4.65 3.60 utt> 5 arctic_0006 360 0 132 27 10 9 3.379427e-102 -1.482487e+02 -5.336953e+04 utt 0.021x 1.000e upd 0.021x 1.000e fwd 0.011x 1.000e bwd 0.009x 1.000e gau 0.050x 1.001e rsts 0.002x 1.014e rstf 0.001x 1.024e rstu 0.000x 1.002e

INFO: cmn.c(133): CMN: 36.58 -1.52 4.36 9.36 8.41 8.62 -15.10 -7.09 -5.19 -13.28 -0.06 -7.99 -0.08 utt> 6 arctic_0007 406 0 160 24 8 7 3.344124e-102 -1.485250e+02 -6.030117e+04 utt 0.018x 1.000e upd 0.018x 1.000e fwd 0.010x 1.000e bwd 0.007x 1.000e gau 0.037x 1.000e rsts 0.001x 1.003e rstf 0.001x 1.008e rstu 0.000x 1.000e

INFO: cmn.c(133): CMN: 33.88 -1.92 1.45 5.53 4.53 6.12 -12.39 -3.63 -3.52 -12.03 2.65 -3.73 1.92 utt> 7 arctic_0008 333 0 96 21 7 6 1.940803e-102 -1.469178e+02 -4.892363e+04 utt 0.016x 1.000e upd 0.016x 1.000e fwd 0.010x 1.000e bwd 0.007x 1.000e gau 0.031x 1.003e rsts 0.001x 1.020e rstf 0.001x 1.018e rstu 0.000x 1.000e

INFO: cmn.c(133): CMN: 38.62 -4.38 0.77 10.60 7.42 4.59 -15.60 -7.00 -6.76 -10.43 2.87 -6.52 1.61 utt> 8 arctic_0009 438 0 160 22 8 8 1.839698e-102 -1.482632e+02 -6.493930e+04 utt 0.017x 1.000e upd 0.017x 1.000e fwd 0.009x 1.000e bwd 0.007x 1.000e gau 0.034x 1.002e rsts 0.001x 1.004e rstf 0.001x 1.028e rstu 0.000x 1.002e

INFO: cmn.c(133): CMN: 39.49 6.59 -0.44 9.56 7.32 8.40 -13.85 -4.80 -9.19 -12.76 5.15 -7.48 3.97 utt> 9 arctic_0010 385 0 164 24 9 9 2.912845e-102 -1.498354e+02 -5.768663e+04 utt 0.020x 1.000e upd 0.020x 1.000e fwd 0.010x 1.000e bwd 0.009x 1.000e gau 0.044x 1.003e rsts 0.002x 0.998e rstf 0.001x 1.015e rstu 0.000x 1.000e

INFO: cmn.c(133): CMN: 39.14 8.55 -4.96 7.23 8.32 6.39 -14.30 -5.67 -9.00 -8.85 5.09 -2.77 3.05 utt> 10 arctic_0011 386 0 136 22 8 7 2.588119e-102 -1.479282e+02 -5.710029e+04 utt 0.017x 1.000e upd 0.017x 1.000e fwd 0.009x 1.000e bwd 0.008x 1.000e gau 0.039x 1.002e rsts 0.001x 1.015e rstf 0.001x 1.025e rstu 0.000x 1.003e

INFO: cmn.c(133): CMN: 36.74 -4.05 -2.18 11.74 8.36 4.49 -16.11 -4.75 -5.38 -9.45 1.17 -3.01 2.33 utt> 11 arctic_0012 390 0 144 23 8 8 2.088885e-102 -1.462457e+02 -5.703583e+04 utt 0.018x 1.000e upd 0.018x 1.000e fwd 0.009x 1.000e bwd 0.009x 1.000e gau 0.041x 1.002e rsts 0.001x 1.002e rstf 0.001x 1.025e rstu 0.000x 1.001e

INFO: cmn.c(133): CMN: 39.39 1.08 2.54 13.97 8.06 7.83 -15.78 -8.92 -7.60 -10.51 3.49 -8.12 -0.33 utt> 12 arctic_0013 415 0 204 23 9 9 3.119338e-102 -1.504651e+02 -6.244302e+04 utt 0.019x 1.000e upd 0.018x 1.000e fwd 0.010x 1.000e bwd 0.008x 1.000e gau 0.040x 1.003e rsts 0.001x 1.044e rstf 0.001x 1.014e rstu 0.000x 1.002e

INFO: cmn.c(133): CMN: 37.94 4.18 -0.20 8.11 7.64 8.38 -18.19 -8.08 -9.05 -12.48 3.16 -6.40 1.38 utt> 13 arctic_0014 367 0 144 20 7 8 2.206165e-102 -1.480432e+02 -5.433186e+04 utt 0.018x 1.000e upd 0.017x 1.000e fwd 0.009x 1.000e bwd 0.008x 1.000e gau 0.035x 1.003e rsts 0.001x 0.997e rstf 0.001x 1.018e rstu 0.000x 1.002e

INFO: cmn.c(133): CMN: 35.43 0.57 2.97 3.72 3.00 1.97 -12.07 -3.18 -10.71 -11.20 -0.17 -1.52 1.66 utt> 14 arctic_0015 299 0 76 23 7 7 2.339618e-102 -1.474828e+02 -4.409736e+04 utt 0.018x 1.000e upd 0.018x 1.000e fwd 0.010x 1.000e bwd 0.007x 1.000e gau 0.034x 1.003e rsts 0.001x 1.007e rstf 0.001x 1.016e rstu 0.000x 1.005e

INFO: cmn.c(133): CMN: 37.74 0.87 -1.95 9.08 8.71 2.66 -14.70 -11.76 -9.30 -6.29 0.83 -2.70 3.67
WARN: "mk_phone_list.c", line 178: Unable to lookup word 'rifleshot' in the dictionary
WARN: "next_utt_states.c", line 83: Unable to produce phonetic transcription for the utterance ' there's fort churchill a rifleshot beyond the ridge asleep '
WARN: "main.c", line 824: Skipped utterance ' there's fort churchill a rifleshot beyond the ridge asleep '
utt> 15 arctic_0016 439 0 76 utt 0.000x 1.006e upd 0.000x 0.996e fwd 0.000x 0.000e bwd 0.000x 0.000e gau 0.000x 0.000e rsts 0.000x 0.000e rstf 0.000x 0.000e rstu 0.000x 0.000e

INFO: cmn.c(133): CMN: 41.21 0.30 0.05 14.14 17.05 4.05 -18.53 -11.29 -8.17 -9.94 2.80 -3.93 2.75 utt> 16 arctic_0017 463 0 220 25 9 9 2.926126e-102 -1.503591e+02 -6.961625e+04 utt 0.018x 1.000e upd 0.018x 1.000e fwd 0.009x 1.000e bwd 0.009x 1.000e gau 0.043x 1.001e rsts 0.002x 1.000e rstf 0.001x 1.018e rstu 0.000x 1.002e

INFO: cmn.c(133): CMN: 30.77 3.35 -4.06 8.02 8.63 1.88 -11.18 -8.32 -5.84 -5.63 0.89 -1.05 -0.87 utt> 17 arctic_0018 225 0 60 19 7 6 1.104227e-102 -1.460659e+02 -3.286483e+04 utt 0.014x 1.000e upd 0.014x 1.000e fwd 0.006x 1.000e bwd 0.008x 1.000e gau 0.033x 1.002e rsts 0.001x 1.033e rstf 0.001x 1.014e rstu 0.000x 1.005e

INFO: cmn.c(133): CMN: 38.06 8.23 0.02 6.92 12.76 3.92 -12.16 -16.28 -12.55 -8.96 5.02 1.00 -0.91 utt> 18 arctic_0019 431 0 176 26 9 8 2.742257e-102 -1.509374e+02 -6.505402e+04 utt 0.019x 1.000e upd 0.018x 1.000e fwd 0.010x 1.000e bwd 0.008x 1.000e gau 0.042x 1.002e rsts 0.001x 0.999e rstf 0.001x 1.018e rstu 0.000x 1.000e

INFO: cmn.c(133): CMN: 35.60 1.33 3.64 11.76 11.66 9.31 -14.65 -10.28 -7.39 -8.92 1.98 -1.23 -0.75 utt> 19 arctic_0020 412 0 156 23 9 8 1.825108e-102 -1.491215e+02 -6.143807e+04 utt 0.017x 1.000e upd 0.017x 1.000e fwd 0.009x 1.000e bwd 0.008x 1.000e gau 0.038x 1.002e rsts 0.001x 0.998e rstf 0.001x 1.002e rstu 0.000x 1.002e

overall> stats 7105 (-0) -1.490109e+02 -1.058723e+06 0.018x 1.000e
WARN: "accum.c", line 628: Over 500 senones never occur in the input data. This is normal for context-dependent untied senone training or for adaptation, but could indicate a serious problem otherwise.
INFO: s3mixw_io.c(233): Wrote ./mixw_counts [5126x3x128 array]
INFO: s3tmat_io.c(176): Wrote ./tmat_counts [42x3x4 array]
INFO: s3gau_io.c(485): Wrote ./gauden_counts with means with vars [42x3x128 vector arrays]
INFO: main.c(997): Counts saved to .

The error surrounding rifleshot is because the cmudict-en-us.dict dictionary has it listed as rifle-shot R AY F AH L SH AA T, not rifleshot. I'm not sure which end to correct, so I'll leave it for now.
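If I do later decide to fix it on the dictionary side, it should just be a one-line addition; a sketch, reusing the phones the dictionary already has for rifle-shot (untested):

~~~~
#!/usr/bin/env python3
# Sketch: append the missing compound word to the dictionary, reusing
# the phones cmudict-en-us.dict already lists for "rifle-shot".
with open("cmudict-en-us.dict", "a") as dictionary:
    dictionary.write("rifleshot R AY F AH L SH AA T\n")
~~~~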

Creating a transformation with MLLR

$ cp /usr/lib/sphinxtrain/mllr_solve .
$ ./mllr_solve \
    -meanfn en-us/means \
    -varfn en-us/variances \
    -outmllrfn mllr_matrix \
    -accumdir .

Current configuration:
[NAME]       [DEFLT]  [VALUE]
-accumdir             .,
-cb2mllrfn   .1cls.   .1cls.
-cdonly      no       no
-example     no       no
-fullvar     no       no
-help        no       no
-meanfn               en-us/means
-mllradd     yes      yes
-mllrmult    yes      yes
-moddeffn
-outmllrfn            mllr_matrix
-varfloor    1e-3     1.000000e-03
-varfn                en-us/variances

INFO: main.c(382): -- 1. Read input mean, (var) and accumulation.
INFO: s3gau_io.c(169): Read en-us/means [42x3x128 array]
INFO: main.c(397): Reading and accumulating counts from .
INFO: s3gau_io.c(386): Read ./gauden_counts with means with vars [42x3x128 vector arrays]

INFO: main.c(436): -- 2. Read cb2mllrfn
INFO: main.c(455): n_mllr_class = 1

INFO: main.c(475): -- 3. Calculate mllr matrices
INFO: main.c(127):
INFO: main.c(128): ---- mllr_solve(): Conventional MLLR method
INFO: s3gau_io.c(169): Read en-us/variances [42x3x128 array]

INFO: main.c(208): ---- A. Accum regl, regr
INFO: main.c(209): No classes 1, no. stream 3
INFO: main.c(281): ---- B. Compute MLLR matrices (A,B)
INFO: mllr.c(182): Computing both multiplicative and additive part of MLLR
INFO: mllr.c(182): Computing both multiplicative and additive part of MLLR
INFO: mllr.c(182): Computing both multiplicative and additive part of MLLR

INFO: main.c(497): -- 4. Store mllr matrices (A,B) to mllr_matrix

Updating the acoustic model files with MAP

$ cp -a en-us en-us-adapt
$ cp /usr/lib/sphinxtrain/map_adapt .
$ ./map_adapt \

-moddeffn en-us/mdef.txt \
-ts2cbfn .ptm. \
-meanfn en-us/means \
-varfn en-us/variances \
-mixwfn en-us/mixture_weights \
-tmatfn en-us/transition_matrices \
-accumdir . \
-mapmeanfn en-us-adapt/means \
-mapvarfn en-us-adapt/variances \
-mapmixwfn en-us-adapt/mixture_weights \
-maptmatfn en-us-adapt/transition_matrices

Current configuration:
[NAME]      [DEFLT]  [VALUE]
-accumdir            .,
-bayesmean  yes      yes
-example    no       no
-fixedtau   no       no
-help       no       no
-mapmeanfn           en-us-adapt/means
-mapmixwfn           en-us-adapt/mixture_weights
-maptmatfn           en-us-adapt/transition_matrices
-mapvarfn            en-us-adapt/variances
-meanfn              en-us/means
-mixwfn              en-us/mixture_weights
-moddeffn            en-us/mdef.txt
-mwfloor    0.00001  1.000000e-05
-tau        10.0     1.000000e+01
-tmatfn              en-us/transition_matrices
-tpfloor    0.0001   1.000000e-04
-ts2cbfn             .ptm.
-varfloor   0.00001  1.000000e-05
-varfn               en-us/variances

INFO: s3gau_io.c(169): Read en-us/means [42x3x128 array]
INFO: s3gau_io.c(169): Read en-us/variances [42x3x128 array]
INFO: s3mixw_io.c(117): Read en-us/mixture_weights [5126x3x128 array]
INFO: s3tmat_io.c(118): Read en-us/transition_matrices [42x3x4 array]
INFO: main.c(433): Reading and accumulating observation counts from .
INFO: s3gau_io.c(386): Read ./gauden_counts with means with vars [42x3x128 vector arrays]
INFO: s3mixw_io.c(117): Read ./mixw_counts [5126x3x128 array]
INFO: s3tmat_io.c(118): Read ./tmat_counts [42x3x4 array]
INFO: main.c(78): Estimating tau hyperparameter from variances and observations
INFO: main.c(496): Reading en-us/mdef.txt
INFO: model_def_io.c(573): Model definition info:
INFO: model_def_io.c(574): 137095 total models defined (42 base, 137053 tri)
INFO: model_def_io.c(575): 548380 total states
INFO: model_def_io.c(576): 5126 total tied states
INFO: model_def_io.c(577): 126 total tied CI states
INFO: model_def_io.c(578): 42 total tied transition matrices
INFO: model_def_io.c(579): 4 max state/model
INFO: model_def_io.c(580): 4 min state/model
INFO: main.c(132): Re-estimating mixture weights using MAP
INFO: main.c(201): Re-estimating transition probabilities using MAP
INFO: main.c(534): Re-estimating means using Bayesian interpolation
INFO: main.c(540): Interpolating tau hyperparameter for PTM models
INFO: main.c(542): Re-estimating variances using MAP
INFO: s3gau_io.c(228): Wrote en-us-adapt/means [42x3x128 array]
INFO: s3gau_io.c(228): Wrote en-us-adapt/variances [42x3x128 array]
INFO: s3mixw_io.c(233): Wrote en-us-adapt/mixture_weights [5126x3x128 array]
INFO: s3tmat_io.c(176): Wrote en-us-adapt/transition_matrices [42x3x4 array]

Recreating the adapted sendump file

$ cp /usr/lib/sphinxtrain/mk_s2sendump .
$ ./mk_s2sendump \

-pocketsphinx yes \
-moddeffn en-us-adapt/mdef.txt \
-mixwfn en-us-adapt/mixture_weights \
-sendumpfn en-us-adapt/sendump

Current configuration:
[NAME]         [DEFLT]  [VALUE]
-example       no       no
-help          no       no
-mixwfn                 en-us-adapt/mixture_weights
-moddeffn               en-us-adapt/mdef.txt
-mwfloor       0.00001  1.000000e-05
-pocketsphinx  no       yes
-sendumpfn              en-us-adapt/sendump

INFO: model_def_io.c(573): Model definition info:
INFO: model_def_io.c(574): 137095 total models defined (42 base, 137053 tri)
INFO: model_def_io.c(575): 548380 total states
INFO: model_def_io.c(576): 5126 total tied states
INFO: model_def_io.c(577): 126 total tied CI states
INFO: model_def_io.c(578): 42 total tied transition matrices
INFO: model_def_io.c(579): 4 max state/model
INFO: model_def_io.c(580): 4 min state/model
INFO: senone.c(210): Reading senone mixture weights: en-us-adapt/mixture_weights
INFO: senone.c(331): Read mixture weights for 5126 senones: 3 features x 128 codewords
INFO: mk_s2sendump.c(207): Writing PocketSphinx format sendump file: en-us-adapt/sendump

Beautiful. Let's see if we can understand one of our training files. Let's go with the first one.
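One thing my notes don't show is how the script finds the adapted model; speech_recognition uses its own bundled model by default. As a cross-check, the pocketsphinx package can be pointed at the adapted files directly; a sketch, with paths assumed from the folders created above:

~~~~
#!/usr/bin/env python3
# Sketch: decode a training recording against the adapted acoustic model,
# bypassing speech_recognition's bundled model. Paths are assumptions
# based on the folders created above, and this is untested.
from pocketsphinx import Pocketsphinx

ps = Pocketsphinx(
    hmm="en-us-adapt",           # the adapted acoustic model
    lm="model/en-us.lm.bin",     # the stock language model copied earlier
    dict="cmudict-en-us.dict",   # the dictionary used during adaptation
)
# decode() reads raw PCM frames; a 16kHz mono WAV's small header is
# usually tolerated, but converting to raw first would be cleaner.
ps.decode(audio_file="arctic_0001.wav")
print("Sphinx thinks you said", ps.hypothesis())
~~~~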

$ ./audio_transcribe.py
Sphinx thinks you said both evolved the dangers trial for that steals etc

Nope, not very good to be honest. Let's try a few new test files, I'll try 'The quick brown fox jumps over the lazy dog.' and I'll record it four times to see what comes out.

Sphinx thinks you said the quick from fox jumped to both of them they seized over
Sphinx thinks you said the equipment from fox jumped to the ladies over
Sphinx thinks you said the quicker from folks jumped over the seat of

Something completely different? Let's try: 'The weather is rather nice today, I wonder what it's going to be like tomorrow?'

Sphinx thinks you said the weather's often must say wondered what is going to have to learn

That's not so good, is it? I need to find an English UK set.


  1. Time is estimated here; initially, recording the audio only takes as long as it does to record it! But then processing the audio memo can take some time, as I talk faster than I write or type, and I have to struggle hard not to go down the rabbit hole of idea exploration and planning when I should simply be recording the initial idea, not subsequent ideas triggered by the first.
