whispertranscriber

Speech-to-text element using the whisper-rs crate.

The element uses a chunking strategy to make it compatible with live use cases. It will run inference on each chunk, possibly prepended with the previous chunk to avoid misdetection near the chunk boundaries.

In live mode, its total latency is the sum of:

  • The chunk-duration property
  • The live-edge-offset property, which represents the duration by which the sliding window of output tokens trails the live edge, thus causing it to overlap with the previous chunk when non-zero
  • The latency property, which must be configured to be greater than the actual observed processing latency for one inference run, and cannot be greater than the chunk duration.
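
As a rough sanity check before launching a pipeline, the latency budget can be computed by hand. The sketch below assumes the three properties simply add up, which is what the example pipeline further down suggests (4000 + 1000 + 1000 ms for 6 seconds). `total_latency_ms` is a hypothetical helper, not something the element provides.

```shell
# Hypothetical helper: estimate the total live latency in milliseconds,
# assuming chunk-duration, live-edge-offset and latency add up.
total_latency_ms() (
  chunk=$1
  offset=$2
  latency=$3
  # The latency property cannot be greater than the chunk duration.
  if [ "$latency" -gt "$chunk" ]; then
    echo "latency ($latency ms) exceeds chunk-duration ($chunk ms)" >&2
    exit 1
  fi
  echo $((chunk + offset + latency))
)

# chunk-duration=4000 live-edge-offset=1000 latency=1000
total_latency_ms 4000 1000 1000   # prints 6000
```

The helper only checks the documented upper bound. Whether the latency value is also larger than the observed processing time still has to be verified from the element's logs.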

The element logs information to assist with tuning the latency property.

To identify tokens, the element needs to use DTW token-level timestamps. It currently does not support custom alignment heads (aheads), which means it can only be used with one of the models supported by the model-preset property. That property must be set to match the actual model specified through the model-path property. No check is performed to enforce this.
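
Since the match is not enforced, it can help to derive the preset from the model file name instead of typing both by hand. The sketch below assumes the file follows the whisper.cpp naming scheme (`ggml-<name>.bin`, with English-only models named `<size>.en`). `preset_from_model_path` is a hypothetical helper and the paths are placeholders.

```shell
# Hypothetical helper: guess the model-preset nick from a model file name.
# Assumes whisper.cpp naming: ggml-<name>.bin, English-only models as <size>.en.
preset_from_model_path() (
  name=$(basename "$1" .bin)   # ggml-large-v3.bin -> ggml-large-v3
  name=${name#ggml-}           # ggml-large-v3 -> large-v3
  case "$name" in
    *.en) name="${name%.en}-en" ;;   # tiny.en -> tiny-en
  esac
  # Only accept nicks listed under GstWhisperTranscriberModelPreset.
  case "$name" in
    tiny-en|tiny|base-en|base|small-en|small|medium-en|medium|large-v1|large-v2|large-v3|large-v3-turbo)
      echo "$name" ;;
    *)
      echo "no known preset for '$1'" >&2
      exit 1 ;;
  esac
)

preset_from_model_path /path/to/models/ggml-large-v3.bin   # prints large-v3
preset_from_model_path /path/to/models/ggml-tiny.en.bin    # prints tiny-en
```

The result can then be passed as `model-preset=$(preset_from_model_path "$MODEL")`. File names that do not map to a listed preset (renamed or quantized variants, for instance) are rejected rather than guessed.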

The element re-exports the features exposed by the whisper-rs crate to select backends. For example, to build the element with CUDA support enabled:

cargo build --features=cuda

You can download models using the whisper.cpp download script. For example, to download the large-v3 model:

./download-ggml-model.sh large-v3

With the element built and a model downloaded, the following example runs live inference, introducing 6 seconds of latency:

gst-launch-1.0 filesrc location=/home/meh/Music/chaplin.wav ! \
  wavparse ! audioconvert ! audioresample ! clocksync ! \
  queue max-size-time=5000000000 max-size-buffers=0 max-size-bytes=0 ! \
  whispertranscriber model-path=/home/meh/devel/whisper.cpp/models/ggml-large-v3.bin model-preset=large-v3 chunk-duration=4000 live-edge-offset=1000 latency=1000 ! \
  queue ! fakesink dump=true

The above is known to work fine using an RTX 5080 GPU.

You can remove the clocksync element to test offline performance. Without it, the above pipeline is known to yield a 10x real-time processing rate using an RTX 5080 GPU, and 7x real time with Vulkan on a Radeon RX 9070 XT.
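
To compare your own hardware against these figures, divide the duration of the input audio by the wall-clock time the offline pipeline took. The sketch below shows only that arithmetic. `realtime_factor` is a hypothetical helper, and both durations (in seconds) are assumed to have been measured separately, for instance with `time`.

```shell
# Hypothetical helper: real-time factor = audio duration / processing time.
# Both arguments are in seconds.
realtime_factor() {
  awk -v audio="$1" -v elapsed="$2" 'BEGIN { printf "%.1f\n", audio / elapsed }'
}

# A 600 s file processed in 60 s runs at 10x real time.
realtime_factor 600 60   # prints 10.0
```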

Hierarchy

GObject
    ╰──GInitiallyUnowned
        ╰──GstObject
            ╰──GstElement
                ╰──whispertranscriber

Factory details

Authors – Mathieu Duponchelle

Classification – Text/Audio/Filter

Rank – none

Plugin – whisper

Package – gst-plugin-whisper

Pad Templates

sink

audio/x-raw:
           rate: 16000
       channels: 1
         layout: interleaved
         format: F32LE

Presence – always

Direction – sink

Object type – GstPad


src

text/x-raw:
         format: utf8

Presence – always

Direction – src

Object type – GstPad


Properties

beam-search-size

“beam-search-size” gint

Set the beam_size value for sampling-strategy=beam-search

Flags : Read / Write

Default value : 5


chunk-duration

“chunk-duration” guint

The duration of chunks to accumulate for inference, in milliseconds. Will count towards total latency.

Flags : Read / Write

Default value : 4000


debug-mode

“debug-mode” gboolean

Enables debug features, such as dumping the log mel spectrogram.

Flags : Read / Write

Default value : false


detect-language

“detect-language” gboolean

Auto-detect the source language when translate is true

Flags : Read / Write

Default value : false


entropy-thold

“entropy-thold” gfloat

If the gzip compression ratio is higher than this value, treat the decoding as failed

Flags : Read / Write

Default value : 2.4


gpu-device-id

“gpu-device-id” gint

GPU device id

Flags : Read / Write

Default value : 0


greedy-best-of

“greedy-best-of” gint

Set the best_of value for sampling-strategy=greedy

Flags : Read / Write

Default value : 5


language

“language” gchararray

The source language to translate from when translate is true

Flags : Read / Write

Default value : NULL


latency

“latency” guint

The expected processing latency for one inference run, in milliseconds. Will count towards total latency.

Flags : Read / Write

Default value : 1000


length-penalty

“length-penalty” gfloat

Optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144. Uses simple length normalization by default.

Flags : Read / Write

Default value : -1


live-edge-offset

“live-edge-offset” guint

The element will feed in the previous chunk when running inference, and output tokens that are contained within a sliding window that may overlap both chunks. This controls the duration (in milliseconds) of the overlap, and will leave time for tokens near the end of the current chunk to stabilize. Will count towards total latency.

Flags : Read / Write

Default value : 1000
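
To make the overlap concrete, the sketch below computes the output window for a given chunk. It rests on assumptions that the description above does not spell out: chunks are numbered from 0, chunk n covers n × chunk-duration to (n + 1) × chunk-duration, and the output window has the same length as a chunk but is shifted back by live-edge-offset. `output_window_ms` is a hypothetical helper used only for illustration.

```shell
# Hypothetical helper: print "start end" (ms) of the assumed output window
# for chunk number $1, given chunk-duration $2 and live-edge-offset $3.
output_window_ms() (
  index=$1
  chunk=$2
  offset=$3
  start=$(( index * chunk - offset ))
  end=$(( (index + 1) * chunk - offset ))
  # The first window cannot start before the beginning of the stream.
  if [ "$start" -lt 0 ]; then
    start=0
  fi
  echo "$start $end"
)

# Second chunk (4000..8000 ms) with the default 1000 ms offset:
output_window_ms 1 4000 1000   # prints "3000 7000"
```

Under these assumptions the window for the second chunk reaches 1000 ms back into the first chunk and stops 1000 ms short of the live edge.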


logprob-thold

“logprob-thold” gfloat

If the average log probability is lower than this value, treat the decoding as failed

Flags : Read / Write

Default value : -1


model-path

“model-path” gchararray

Path to ggml-formatted whisper model (https://github.com/ggml-org/whisper.cpp?tab=readme-ov-file#ggml-format)

Flags : Read / Write

Default value : NULL


model-preset

“model-preset” GstWhisperTranscriberModelPreset *

Defines how DTW token-level timestamps are gathered, MUST MATCH THE SPECIFIED MODEL

Flags : Read / Write

Default value : tiny (1)


n-threads

“n-threads” gint

Set the number of threads to use for decoding.

Flags : Read / Write

Default value : 1


sampling-strategy

“sampling-strategy” GstWhisperTranscriberSamplingStrategy *

The sampling strategy to use to pick tokens from a list of likely possibilities

Flags : Read / Write

Default value : greedy (0)


suppress-blank

“suppress-blank” gboolean

This will suppress blank outputs

Flags : Read / Write

Default value : true


suppress-nst

“suppress-nst” gboolean

This will suppress non-speech tokens

Flags : Read / Write

Default value : false


translate

“translate” gboolean

Whether to translate to English for multilingual models

Flags : Read / Write

Default value : false


use-gpu

“use-gpu” gboolean

Use GPU if available.

Flags : Read / Write

Default value : true


Named constants

GstWhisperTranscriberModelPreset

Members

tiny-en (0) – TinyEn
tiny (1) – Tiny
base-en (2) – BaseEn
base (3) – Base
small-en (4) – SmallEn
small (5) – Small
medium-en (6) – MediumEn
medium (7) – Medium
large-v1 (8) – LargeV1
large-v2 (9) – LargeV2
large-v3 (10) – LargeV3
large-v3-turbo (11) – LargeV3Turbo

GstWhisperTranscriberSamplingStrategy

Members

greedy (0) – Greedy
beam-search (1) – BeamSearch
