whispertranscriber
Speech to Text element using whisper_rs.
The element uses a chunking strategy to make it compatible with live use cases. It will run inference on each chunk, possibly prepended with the previous chunk to avoid misdetection near the chunk boundaries.
In live mode, its latency is a factor of:
- The chunk-duration property
- The live-edge-offset property, which represents the duration by which the sliding window of output tokens trails the live edge, thus causing it to overlap with the previous chunk when non-zero
- The latency property, which must be configured to be greater than the actual observed processing latency for one inference run, and cannot be greater than the chunk duration.
The element will log in order to assist with the process of tuning the latency property.
In order to identify tokens the element needs to use DTW token-level timestamps, and currently does not support custom aheads, which means it can only be used with one of the models supported by the model-preset property. That property must be set to match the actual model specified through the model-path property, no check is performed to enforce this.
The element re-exports the features exposed by the whisper-rs crate to select backends, this is an example for building the element with CUDA support enabled:
cargo build --features=cuda
You can download models using the whisper.cpp download script, this is an example for downloading the large-v3 model:
./download-ggml-model.sh large-v3
Equipped with this, this is an example for running live inference with the element introducing a 6 seconds latency:
gst-launch-1.0 filesrc location=/home/meh/Music/chaplin.wav ! \
wavparse ! audioconvert ! audioresample ! clocksync ! \
queue max-size-time=5000000000 max-size-buffers=0 max-size-bytes=0 ! \
whispertranscriber model-path=/home/meh/devel/whisper.cpp/models/ggml-large-v3.bin model-preset=large-v3 chunk-duration=4000 live-edge-offset=1000 latency=1000 ! \
queue ! fakesink dump=true
The above is known to work fine using a RTX 5080 GPU.
You can remove the clocksync element to test offline performance, the above pipeline is known to yield a 10x real time processing rate using a RTX 5080 GPU, and 7x real time for Vulkan on a Radeon RX9070 XT.
Hierarchy
GObject ╰──GInitiallyUnowned ╰──GstObject ╰──GstElement ╰──whispertranscriber
Factory details
Authors: – Mathieu Duponchelle
Classification: – Text/Audio/Filter
Rank – none
Plugin – whisper
Package – gst-plugin-whisper
Pad Templates
sink
audio/x-raw:
rate: 16000
channels: 1
layout: interleaved
format: F32LE
Properties
beam-search-size
“beam-search-size” gint
Set the beam_size value for sampling-strategy=beam-search
Flags : Read / Write
Default value : 5
chunk-duration
“chunk-duration” guint
The duration of chunks to accumulate for inference, in milliseconds. Will count towards total latency.
Flags : Read / Write
Default value : 4000
debug-mode
“debug-mode” gboolean
Enables debug mode, such as dumping the log mel spectrogram.
Flags : Read / Write
Default value : false
detect-language
“detect-language” gboolean
Auto-detect the source language when translate is true
Flags : Read / Write
Default value : false
entropy-thold
“entropy-thold” gfloat
If the gzip compression ratio is higher than this value, treat the decoding as failed
Flags : Read / Write
Default value : 2.4
greedy-best-of
“greedy-best-of” gint
Set the best_of value for sampling-strategy=greedy
Flags : Read / Write
Default value : 5
language
“language” gchararray
The source language to translate from when translate is true
Flags : Read / Write
Default value : NULL
latency
“latency” guint
The expected processing latency. Will count towards total latency.
Flags : Read / Write
Default value : 1000
length-penalty
“length-penalty” gfloat
optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144, uses simple length normalization by default
Flags : Read / Write
Default value : -1
live-edge-offset
“live-edge-offset” guint
The element will feed in the previous chunk when running inference, and output tokens that are contained within a sliding window that may overlap both chunks. This controls the duration (in milliseconds) of the overlap, and will leave time for tokens near the end of the current chunk to stabilize. Will count towards total latency.
Flags : Read / Write
Default value : 1000
logprob-thold
“logprob-thold” gfloat
if the average log probability is lower than this value, treat the decoding as failed
Flags : Read / Write
Default value : -1
model-path
“model-path” gchararray
Path to ggml-formatted whisper model (https://github.com/ggml-org/whisper.cpp?tab=readme-ov-file#ggml-format)
Flags : Read / Write
Default value : NULL
model-preset
“model-preset” GstWhisperTranscriberModelPreset *
Defines how DTW token-level timestamps are gathered, MUST MATCH THE SPECIFIED MODEL
Flags : Read / Write
Default value : tiny (1)
n-threads
“n-threads” gint
Set the number of threads to use for decoding.
Flags : Read / Write
Default value : 1
sampling-strategy
“sampling-strategy” GstWhisperTranscriberSamplingStrategy *
The sampling strategy to use to pick tokens from a list of likely possibilities
Flags : Read / Write
Default value : greedy (0)
suppress-blank
“suppress-blank” gboolean
This will suppress blank outputs
Flags : Read / Write
Default value : true
suppress-nst
“suppress-nst” gboolean
This will suppress non-speech tokens
Flags : Read / Write
Default value : false
translate
“translate” gboolean
Whether to translate to English for multilingual models
Flags : Read / Write
Default value : false
Named constants
GstWhisperTranscriberModelPreset
Members
tiny-en (0) – TinyEn
tiny (1) – Tiny
base-en (2) – BaseEn
base (3) – Base
small-en (4) – SmallEn
small (5) – Small
medium-en (6) – MediumEn
medium (7) – Medium
large-v1 (8) – LargeV1
large-v2 (9) – LargeV2
large-v3 (10) – LargeV3
large-v3-turbo (11) – LargeV3Turbo
GstWhisperTranscriberSamplingStrategy
Members
greedy (0) – Greedy
beam-search (1) – BeamSearch
The results of the search are