whispertranscriber

Speech-to-text element using whisper-rs.

The element uses a chunking strategy to make it compatible with live use cases. It will run inference on each chunk, possibly prepended with the previous chunk to avoid misdetection near the chunk boundaries.

In live mode, its latency is determined by:

The chunk-duration property.
The live-edge-offset property, which represents the duration by which the sliding window of output tokens trails the live edge, causing it to overlap with the previous chunk when non-zero.
The latency property, which must be configured to be greater than the actual observed processing latency for one inference run, and cannot be greater than the chunk duration.

The element logs information to help with tuning the latency property.
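
As a rough sketch of how these three settings combine, the snippet below assumes their contributions simply add up, which matches the 6 seconds quoted for the example pipeline further down (4000 + 1000 + 1000 ms). The total_latency helper is hypothetical and not part of the element. It also rejects a latency value larger than the chunk duration, as required above.

```shell
# total_latency CHUNK_MS OFFSET_MS LATENCY_MS
# Hypothetical helper: prints the expected end-to-end latency in milliseconds,
# assuming the three contributions are additive.
total_latency() {
    # The latency property cannot be greater than the chunk duration.
    if [ "$3" -gt "$1" ]; then
        echo "latency ($3 ms) must not exceed chunk-duration ($1 ms)" >&2
        return 1
    fi
    echo $(( $1 + $2 + $3 ))
}

# Element defaults: chunk-duration=4000 live-edge-offset=1000 latency=1000
total_latency 4000 1000 1000
```

Whether a given latency value is actually large enough still has to be confirmed against the element's logs on the target hardware.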

To identify tokens, the element needs DTW token-level timestamps. It currently does not support custom aheads, which means it can only be used with one of the models supported by the model-preset property. That property must be set to match the actual model specified through the model-path property; no check is performed to enforce this.
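
Since nothing enforces the match, a small wrapper can derive model-preset from the model file name. This is only a sketch: the preset_for_model helper is hypothetical, and it assumes the model file keeps the ggml-<name>.bin naming produced by the whisper.cpp download script (for example ggml-tiny.en.bin). Files with any other name are rejected rather than guessed.

```shell
# preset_for_model PATH
# Hypothetical helper: prints the model-preset nick for a file named
# ggml-<name>.bin, or fails if the name matches no known preset.
preset_for_model() {
    name=$(basename "$1" .bin)                   # e.g. ggml-tiny.en
    name=${name#ggml-}                           # e.g. tiny.en
    preset=$(printf '%s' "$name" | tr '.' '-')   # e.g. tiny-en
    case "$preset" in
        tiny-en|tiny|base-en|base|small-en|small|medium-en|medium)
            echo "$preset" ;;
        large-v1|large-v2|large-v3|large-v3-turbo)
            echo "$preset" ;;
        *)
            echo "no model-preset matches '$1'" >&2
            return 1 ;;
    esac
}

# Prints "large-v3"; pass the result as model-preset=... on the element.
preset_for_model /path/to/models/ggml-large-v3.bin
```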

The element re-exports the features exposed by the whisper-rs crate to select backends. For example, to build the element with CUDA support enabled:

cargo build --features=cuda

You can download models using the whisper.cpp download script. For example, to download the large-v3 model:

./download-ggml-model.sh large-v3

Equipped with this, the following example runs live inference with the element, introducing a latency of 6 seconds (4000 ms chunk-duration + 1000 ms live-edge-offset + 1000 ms latency):

gst-launch-1.0 filesrc location=/home/meh/Music/chaplin.wav ! \
  wavparse ! audioconvert ! audioresample ! clocksync ! \
  queue max-size-time=5000000000 max-size-buffers=0 max-size-bytes=0 ! \
  whispertranscriber model-path=/home/meh/devel/whisper.cpp/models/ggml-large-v3.bin model-preset=large-v3 chunk-duration=4000 live-edge-offset=1000 latency=1000 ! \
  queue ! fakesink dump=true

The above is known to work fine using an RTX 5080 GPU.

You can remove the clocksync element to test offline performance. The above pipeline is known to yield a 10x real-time processing rate using an RTX 5080 GPU, and 7x real time for Vulkan on a Radeon RX 9070 XT.
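
To reproduce such a figure on other hardware, one option is to time the offline run and divide the audio duration by the elapsed wall-clock time. The snippet below is a minimal sketch of that arithmetic; the realtime_factor helper is hypothetical, and the audio duration is assumed to be known in advance.

```shell
# realtime_factor AUDIO_SECONDS WALL_SECONDS
# Hypothetical helper: how many seconds of audio are processed per second
# of wall-clock time. WALL_SECONDS must be non-zero.
realtime_factor() {
    awk -v audio="$1" -v wall="$2" 'BEGIN { printf "%.1f\n", audio / wall }'
}

# Example: a 600-second file transcribed in 60 seconds of wall-clock time.
realtime_factor 600 60

# Timing an actual run could look like this (pipeline elided on purpose):
#   start=$(date +%s)
#   gst-launch-1.0 ...        # the pipeline above, without clocksync
#   end=$(date +%s)
#   realtime_factor "$AUDIO_SECONDS" "$((end - start))"
```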

Hierarchy

GObject
    ╰──GInitiallyUnowned
        ╰──GstObject
            ╰──GstElement
                ╰──whispertranscriber

Factory details

Authors – Mathieu Duponchelle

Classification – Text/Audio/Filter

Rank – none

Plugin – whisper

Package – gst-plugin-whisper

Pad Templates

`sink`

audio/x-raw:
           rate: 16000
       channels: 1
         layout: interleaved
         format: F32LE

Presence – always

Direction – sink

Object type – GstPad

`src`

text/x-raw:
         format: utf8

Presence – always

Direction – src

Object type – GstPad

Properties

beam-search-size

“beam-search-size” gint

Set the beam_size value for sampling-strategy=beam-search

Flags : Read / Write

Default value : 5

chunk-duration

“chunk-duration” guint

The duration of chunks to accumulate for inference, in milliseconds. Will count towards total latency.

Flags : Read / Write

Default value : 4000

debug-mode

“debug-mode” gboolean

Enables debug mode, such as dumping the log mel spectrogram.

Flags : Read / Write

Default value : false

detect-language

“detect-language” gboolean

Auto-detect the source language when translate is true

Flags : Read / Write

Default value : false

entropy-thold

“entropy-thold” gfloat

If the gzip compression ratio is higher than this value, treat the decoding as failed

Flags : Read / Write

Default value : 2.4

gpu-device-id

“gpu-device-id” gint

GPU device id

Flags : Read / Write

Default value : 0

greedy-best-of

“greedy-best-of” gint

Set the best_of value for sampling-strategy=greedy

Flags : Read / Write

Default value : 5

language

“language” gchararray

The source language to translate from when translate is true

Flags : Read / Write

Default value : NULL

latency

“latency” guint

The expected processing latency. Will count towards total latency.

Flags : Read / Write

Default value : 1000

length-penalty

“length-penalty” gfloat

Optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144. Uses simple length normalization by default.

Flags : Read / Write

Default value : -1

live-edge-offset

“live-edge-offset” guint

The element will feed in the previous chunk when running inference, and output tokens that are contained within a sliding window that may overlap both chunks. This controls the duration (in milliseconds) of the overlap, and will leave time for tokens near the end of the current chunk to stabilize. Will count towards total latency.

Flags : Read / Write

Default value : 1000

logprob-thold

“logprob-thold” gfloat

If the average log probability is lower than this value, treat the decoding as failed

Flags : Read / Write

Default value : -1

model-path

“model-path” gchararray

Path to ggml-formatted whisper model (https://github.com/ggml-org/whisper.cpp?tab=readme-ov-file#ggml-format)

Flags : Read / Write

Default value : NULL

model-preset

“model-preset” GstWhisperTranscriberModelPreset *

Defines how DTW token-level timestamps are gathered. Must match the model specified through the model-path property.

Flags : Read / Write

Default value : tiny (1)

n-threads

“n-threads” gint

Set the number of threads to use for decoding.

Flags : Read / Write

Default value : 1

sampling-strategy

“sampling-strategy” GstWhisperTranscriberSamplingStrategy *

The sampling strategy to use to pick tokens from a list of likely possibilities

Flags : Read / Write

Default value : greedy (0)
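
The sampling properties work together: greedy-best-of only applies with sampling-strategy=greedy, and beam-search-size only applies with sampling-strategy=beam-search. The sketch below builds one pipeline description per strategy using the documented default values. The whisper_pipeline helper and the MODEL and INPUT paths are hypothetical placeholders, and the snippet only prints the description instead of launching it.

```shell
MODEL="${MODEL:-models/ggml-large-v3.bin}"   # hypothetical path
INPUT="${INPUT:-speech.wav}"                 # hypothetical path

# whisper_pipeline EXTRA_PROPERTIES
# Hypothetical helper: prints a pipeline description with the given extra
# properties appended to the whispertranscriber element.
whisper_pipeline() {
    printf '%s\n' "filesrc location=$INPUT ! wavparse ! audioconvert ! audioresample ! \
whispertranscriber model-path=$MODEL model-preset=large-v3 $1 ! \
fakesink dump=true"
}

GREEDY=$(whisper_pipeline "sampling-strategy=greedy greedy-best-of=5")
BEAM=$(whisper_pipeline "sampling-strategy=beam-search beam-search-size=5")

echo "$GREEDY"
echo "$BEAM"

# To run one of them (unquoted on purpose, so the shell splits the description
# into arguments; this requires paths without spaces):
#   gst-launch-1.0 $BEAM
```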

suppress-blank

“suppress-blank” gboolean

This will suppress blank outputs

Flags : Read / Write

Default value : true

suppress-nst

“suppress-nst” gboolean

This will suppress non-speech tokens

Flags : Read / Write

Default value : false

translate

“translate” gboolean

Whether to translate to English for multilingual models

Flags : Read / Write

Default value : false
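
The translate, language and detect-language properties are meant to be combined. The sketch below shows the two variants the property descriptions suggest: naming the source language explicitly, or letting the element detect it. The MODEL and INPUT paths are hypothetical placeholders, and the "fr" language code is an assumption about the expected value format. A multilingual model is required, so the English-only presets do not apply here.

```shell
MODEL="${MODEL:-models/ggml-large-v3.bin}"   # hypothetical path
INPUT="${INPUT:-speech.wav}"                 # hypothetical path

BASE="whispertranscriber model-path=$MODEL model-preset=large-v3 translate=true"

# Variant 1: the source language is known ("fr" is an assumed code format).
FIXED="$BASE language=fr"

# Variant 2: let the element detect the source language.
AUTO="$BASE detect-language=true"

PIPELINE="filesrc location=$INPUT ! wavparse ! audioconvert ! audioresample ! $AUTO ! fakesink dump=true"
echo "$PIPELINE"

# To run it (unquoted on purpose, so the shell splits the description into
# arguments; this requires paths without spaces):
#   gst-launch-1.0 $PIPELINE
```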

use-gpu

“use-gpu” gboolean

Use GPU if available.

Flags : Read / Write

Default value : true

Named constants

GstWhisperTranscriberModelPreset

Members

tiny-en (0) – TinyEn

tiny (1) – Tiny

base-en (2) – BaseEn

base (3) – Base

small-en (4) – SmallEn

small (5) – Small

medium-en (6) – MediumEn

medium (7) – Medium

large-v1 (8) – LargeV1

large-v2 (9) – LargeV2

large-v3 (10) – LargeV3

large-v3-turbo (11) – LargeV3Turbo

GstWhisperTranscriberSamplingStrategy

Members

greedy (0) – Greedy

beam-search (1) – BeamSearch
