TSP1 ASR Application Notes
WER benchmarking
The tsp-cli stream command can stream pre-recorded audio files
through ASR models and write the transcription output to files.
This enables word error rate (WER) benchmarking to evaluate how
microphone selection and audio processing parameters affect
transcription accuracy.
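The transcripts written by tsp-cli stream can be scored against reference text with any WER tooling you prefer; as a minimal self-contained sketch, word-level WER is the Levenshtein distance between the word sequences divided by the number of reference words:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

For example, one substituted word in a six-word reference yields a WER of 1/6. Real benchmarks usually also normalize case and punctuation before scoring; that step is omitted here.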
Prerequisites
Before running benchmarks, ensure you have the following from the developer portal:
tsp-cli v1.2.0 and STM32 firmware v1.2.0: See STM32 reflash for firmware update instructions.
A host audio ASR model: The default ASR model uses the onboard PDM microphones. For streaming audio from the host, you need a model marked “Host (tsp-cli stream)” on the developer portal. Flash it with:
tsp-cli flash asr-8m-v1-en-8b-hostaudio.img
See tsp-cli flash for details on flashing models.
Note
After flashing a host audio model, the tsp-cli asr command will no
longer work. Use tsp-cli stream as described below to process audio.
Preparing audio files
TSP1 ASR models expect audio input as a headerless raw stream of signed 16-bit little-endian (S16_LE) samples, single channel, at 16 kHz.
To convert a WAV file to the required format using ffmpeg:
ffmpeg -i input.wav -f s16le -ar 16000 -ac 1 output.raw
Or using sox:
sox input.wav -t raw -r 16000 -e signed-integer -b 16 -c 1 output.raw
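Because the raw format is headerless, a wrong sample rate or channel count produces no error, only a silently broken benchmark. One cheap sanity check is whether the file size implies the expected duration (2 bytes per sample, 16 000 samples per second, one channel). A sketch:

```python
import os

BYTES_PER_SECOND = 2 * 16000  # S16_LE mono at 16 kHz

def raw_duration_seconds(path: str) -> float:
    """Duration implied by the file size for S16_LE/16 kHz/mono raw audio."""
    size = os.path.getsize(path)
    if size % 2:
        raise ValueError(f"{path}: odd byte count; not 16-bit samples")
    return size / BYTES_PER_SECOND
```

If the reported duration is double or half the length of the source recording, the sample rate or channel count of the conversion is likely wrong.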
Warning
If you are capturing the non-causal transcription stream (see below), your audio file must end with at least 1 second of silence. The non-causal stream uses approximately 1 second of future audio context, so without trailing silence the final words will not be transcribed.
If your audio files do not already end with silence, you can append it
with sox:
sox -t raw -r 16000 -e signed-integer -b 16 -c 1 input.raw \
-t raw output.raw pad 0 1.5
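If you are not sure whether a file already ends in silence, you can inspect the final samples directly. The sketch below tests the last second against an amplitude threshold (the value 327, roughly -40 dBFS, is an arbitrary choice for illustration) and appends zero samples when padding is needed:

```python
import struct

RATE = 16000  # samples per second, S16_LE mono

def ends_in_silence(path: str, seconds: float = 1.0, threshold: int = 327) -> bool:
    """True if every sample in the final `seconds` is below `threshold`."""
    with open(path, "rb") as f:
        data = f.read()
    n = int(seconds * RATE)
    tail = data[-2 * n:]
    samples = struct.unpack(f"<{len(tail) // 2}h", tail)
    return all(abs(s) < threshold for s in samples)

def pad_silence(path: str, seconds: float = 1.5) -> None:
    """Append `seconds` of digital silence (zero samples) to a raw file."""
    with open(path, "ab") as f:
        f.write(b"\x00\x00" * int(seconds * RATE))
```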
Running a benchmark
Use tsp-cli stream to feed a raw audio file into the ASR model and
write the transcription to text files:
tsp-cli stream --nn=0 --nn-reset \
--bind="buf7::audio.raw" \
--bind="buf9::noncausal.txt" \
--bind="buf8::causal.txt"
| Binding | Direction | Description |
|---|---|---|
| buf7 | Input | Raw audio data (1 ch, 16 kHz, S16_LE) |
| buf8 | Output | Causal transcription (lower latency, lower accuracy) |
| buf9 | Output | Non-causal transcription (higher latency, higher accuracy) |
Warning
Both output buffers (buf8 and buf9) must be consumed. If either
buffer is left unbound, it will fill up and cause the neural network to
stall. If you only need one transcription stream, bind the other to
/dev/null to discard it.
For example, to capture only the non-causal transcription:
tsp-cli stream --nn=0 --nn-reset \
--bind="buf7::audio.raw" \
--bind="buf9::noncausal.txt" \
--bind="buf8::/dev/null"
Run tsp-cli stream --help for full documentation of all options.
Live microphone streaming
With a host audio model, you can also stream live audio from an external
microphone by piping audio capture to tsp-cli stream. Use - in
place of a file path to read from stdin or write to stdout.
For non-causal output:
arecord -t raw -r 16000 -f S16_LE -c 1 --device=plughw:CARD=Generic,DEV=0 \
| tsp-cli stream --nn=0 --nn-reset \
--bind="buf7::-" --bind="buf9::-" --bind="buf8::/dev/null"
For causal output (lower latency):
arecord -t raw -r 16000 -f S16_LE -c 1 --device=plughw:CARD=Generic,DEV=0 \
| tsp-cli stream --nn=0 --nn-reset \
--bind="buf7::-" --bind="buf8::-" --bind="buf9::/dev/null"
Replace the --device value with your audio capture device.
Use arecord -L to see a list of devices.
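Before piping a capture device into tsp-cli stream, it can help to confirm that the device is actually delivering signal. As a sketch, capture a few seconds with arecord to a file first, then compute peak and RMS over the raw S16_LE samples:

```python
import struct

def level_stats(data: bytes) -> tuple[int, float]:
    """Peak and RMS amplitude of raw S16_LE mono samples."""
    n = len(data) // 2
    samples = struct.unpack(f"<{n}h", data[: 2 * n])
    peak = max((abs(s) for s in samples), default=0)
    rms = (sum(s * s for s in samples) / n) ** 0.5 if n else 0.0
    return peak, rms
```

A peak near zero across a few seconds of speech usually means the wrong --device was selected.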