TSP1 ASR Application Notes#

WER benchmarking#

The tsp-cli stream command can stream pre-recorded audio files through ASR models and write the transcription output to files. This enables word error rate (WER) benchmarking to evaluate how microphone selection and audio processing parameters affect transcription accuracy.

Prerequisites#

Before running benchmarks, ensure you have the following from the developer portal:

  1. tsp-cli v1.2.0 and STM32 firmware v1.2.0: See STM32 reflash for firmware update instructions.

  2. A host audio ASR model: The default ASR model uses the onboard PDM microphones. For streaming audio from the host, you need a model marked “Host (tsp-cli stream)” on the developer portal. Flash it with:

    tsp-cli flash asr-8m-v1-en-8b-hostaudio.img
    

    See tsp-cli flash for details on flashing models.

Note

After flashing a host audio model, the tsp-cli asr command will no longer work. Use tsp-cli stream as described below to process audio.

Preparing audio files#

TSP1 ASR models expect audio input as a headerless raw stream of signed 16-bit little-endian (S16_LE) samples, single channel, at 16 kHz.

To convert a WAV file to the required format using ffmpeg:

ffmpeg -i input.wav -f s16le -ar 16000 -ac 1 output.raw

Or using sox:

sox input.wav -t raw -r 16000 -e signed-integer -b 16 -c 1 output.raw
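Because the raw files are headerless, common tools cannot report their duration directly. As a quick sanity check, the duration can be derived from the file size: at 16 kHz mono S16_LE, one second of audio is 16,000 samples × 2 bytes = 32,000 bytes. A minimal sketch, assuming the output.raw produced by the commands above:

# 1 s of 16 kHz mono S16_LE audio = 16000 samples * 2 bytes = 32000 bytes
bytes=$(wc -c < output.raw)
echo "duration: $(echo "scale=2; $bytes / 32000" | bc) s"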

Warning

If you are capturing the non-causal transcription stream (see below), your audio file must end with at least 1 second of silence. The non-causal stream uses approximately 1 second of future audio context, so without trailing silence the final words will not be transcribed.

If your audio files do not already end with silence, you can append it with sox:

sox -t raw -r 16000 -e signed-integer -b 16 -c 1 input.raw \
    -t raw output.raw pad 0 1.5

Running a benchmark#

Use tsp-cli stream to feed a raw audio file into the ASR model and write the transcription to text files:

tsp-cli stream --nn=0 --nn-reset \
    --bind="buf7::audio.raw" \
    --bind="buf9::noncausal.txt" \
    --bind="buf8::causal.txt"

Binding   Direction   Description
buf7      Input       Raw audio data (1 ch, 16 kHz, S16_LE)
buf8      Output      Causal transcription (lower latency, lower accuracy)
buf9      Output      Non-causal transcription (higher latency, higher accuracy)

Warning

Both output buffers (buf8 and buf9) must be consumed. If either buffer is left unbound, it will fill up and cause the neural network to stall. If you only need one transcription stream, bind the other to /dev/null to discard it.

For example, to capture only the non-causal transcription:

tsp-cli stream --nn=0 --nn-reset \
    --bind="buf7::audio.raw" \
    --bind="buf9::noncausal.txt" \
    --bind="buf8::/dev/null"

Run tsp-cli stream --help for full documentation of all options.
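To benchmark a whole corpus, you can loop over multiple raw files and collect one transcription per clip. The following is a minimal sketch, assuming a hypothetical layout with converted audio in a clips/ directory and capturing only the non-causal output; the tsp-cli flags are the same as in the single-file example above:

mkdir -p transcripts
for f in clips/*.raw; do
    name=$(basename "$f" .raw)
    tsp-cli stream --nn=0 --nn-reset \
        --bind="buf7::$f" \
        --bind="buf9::transcripts/$name.txt" \
        --bind="buf8::/dev/null"
done
# Score transcripts/*.txt against your reference transcripts with the WER
# tool of your choice (for example sclite from NIST SCTK, or Python's jiwer).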

Live microphone streaming#

With a host audio model, you can also stream live audio from an external microphone by piping audio capture to tsp-cli stream. Use - in place of a file path to read from stdin or write to stdout.

For non-causal output:

arecord -t raw -r 16000 -f S16_LE -c 1 --device=plughw:CARD=Generic,DEV=0 \
    | tsp-cli stream --nn=0 --nn-reset \
        --bind="buf7::-" --bind="buf9::-" --bind="buf8::/dev/null"

For causal output (lower latency):

arecord -t raw -r 16000 -f S16_LE -c 1 --device=plughw:CARD=Generic,DEV=0 \
    | tsp-cli stream --nn=0 --nn-reset \
        --bind="buf7::-" --bind="buf8::-" --bind="buf9::/dev/null"

Replace the --device value with your audio capture device. Use arecord -L to see a list of devices.
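To watch the live transcription in the terminal while also keeping a copy on disk, the output can be piped through tee. A small sketch based on the non-causal example above (live.txt is just an illustrative file name):

arecord -t raw -r 16000 -f S16_LE -c 1 --device=plughw:CARD=Generic,DEV=0 \
    | tsp-cli stream --nn=0 --nn-reset \
        --bind="buf7::-" --bind="buf9::-" --bind="buf8::/dev/null" \
    | tee live.txt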