Start a connection

You can reach our API service by using the WebSocket Secure (WSS) protocol. The endpoint is:

wss://realtime.scriptix.io/v2/realtime

Authentication and configuration are provided via query parameters (see below).

Query parameters

| Parameter | Value | Description | Required |
|-----------|-------|-------------|----------|
| token | string | Scriptix Realtime API token used for authentication (recommended for browser-based WebSocket clients) | Yes |
| language | string | ISO 639-1 language code for the speech-to-text session | No |
| type | string | Model type: fast (default) or quality | No |

Example connection URL:

wss://realtime.scriptix.io/v2/realtime?token=your-api-token&language=en&type=fast
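
As a minimal sketch, the connection URL can be assembled from the parameters in the table above. The helper name `build_connection_url` and its argument names are illustrative and not part of the Scriptix API; only the endpoint and the three query parameters come from this page.

```python
from urllib.parse import urlencode

REALTIME_ENDPOINT = "wss://realtime.scriptix.io/v2/realtime"
MODEL_TYPES = {"fast", "quality"}  # values listed for the `type` parameter


def build_connection_url(token, language=None, model_type=None):
    """Return the WSS URL with authentication and configuration as query parameters."""
    if not token:
        raise ValueError("token is required")
    if model_type is not None and model_type not in MODEL_TYPES:
        raise ValueError(f"type must be one of {sorted(MODEL_TYPES)}")
    params = {"token": token}
    # Optional parameters are only added when given, so the server defaults apply otherwise.
    if language is not None:
        params["language"] = language
    if model_type is not None:
        params["type"] = model_type
    # urlencode percent-escapes any characters in the token that are unsafe in a URL.
    return f"{REALTIME_ENDPOINT}?{urlencode(params)}"


print(build_connection_url("your-api-token", language="en", model_type="fast"))
```

Building the query string with `urlencode` rather than string concatenation avoids malformed URLs when a token contains reserved characters.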

Request headers (Alternative for server-side clients)

Server-side WebSocket clients can optionally use headers instead of query parameters:

| Header | Value | Description |
|--------|-------|-------------|
| x-zoom-s2t-key | Scriptix Realtime API Token | API key of type real-time, needed for authorization |
| x-api-key | Scriptix Realtime API Token | Alternative header name for the API key |
| api-key | Scriptix Realtime API Token | Alternative header name for the API key |

Note: Browser-based WebSocket connections cannot use custom headers, so the token query parameter is required for web applications.
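
For server-side clients, the header-based alternative might be prepared as in the sketch below. The function `build_auth_headers` is a hypothetical helper; only the three header names come from this page. How the resulting dictionary is passed to the handshake depends on the WebSocket client library in use and is not shown here.

```python
# Header names accepted for the API key, as listed in the table above.
ACCEPTED_HEADER_NAMES = ("x-zoom-s2t-key", "x-api-key", "api-key")


def build_auth_headers(token, header_name="x-api-key"):
    """Return a handshake header dictionary carrying the API token."""
    if not token:
        raise ValueError("token is required")
    if header_name not in ACCEPTED_HEADER_NAMES:
        raise ValueError(f"header_name must be one of {ACCEPTED_HEADER_NAMES}")
    return {header_name: token}


print(build_auth_headers("your-api-token"))
```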


Transcription results

Partial results

The first results sent after the service receives audio data are partials. A partial contains the text detected so far and may change as more audio is processed. Each update carries the text of the current segment up to that point and replaces the previous partial.

{
  "text": "hi how are",
  "is_final": false,
  "offset_ms": 1234,
  "stability": 0.8
}

| Field | Type | Description |
|-------|------|-------------|
| text | string | Text detected so far; grows incrementally and replaces the previous partial |
| is_final | boolean | Always false for partial results |
| offset_ms | integer | Position in the audio stream (milliseconds), for synchronization |
| stability | float | Confidence score between 0 and 1 |

Note: Partials are only sent when speech is detected. No results are sent during silence.

Final results

When the realtime engine is confident about a transcription segment, it sends a final result. Finals are emitted after approximately 15 words or after 30 seconds of accumulated audio.

{
  "text": "hi how are you doing today",
  "is_final": true,
  "offset_ms": 1234,
  "words": [
    [" hi", 0, 200, 0.95],
    [" how", 200, 400, 0.92],
    [" are", 400, 600, 0.94],
    [" you", 600, 800, 0.91],
    [" doing", 800, 1000, 0.93],
    [" today", 1000, 1300, 0.96]
  ]
}

| Field | Type | Description |
|-------|------|-------------|
| text | string | Finalized transcription text that will not change |
| is_final | boolean | Always true for final results |
| offset_ms | integer | Position in the audio stream (milliseconds), for synchronization |
| words | array | Word-level timestamps, each entry in the form [word, start_ms, end_ms, confidence] |
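
The `words` entries are positional arrays, so a client may find it convenient to unpack them into named fields. The sketch below assumes each message arrives as a JSON string. The `Word` class and `parse_words` function are illustrative names. The timestamps are kept exactly as delivered, since this page does not state whether they are relative to the segment or to the stream.

```python
import json
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int
    confidence: float


def parse_words(message):
    """Return the word-level entries of a final result, or an empty list for partials."""
    result = json.loads(message)
    if not result.get("is_final"):
        return []
    # Each entry is [word, start_ms, end_ms, confidence]; words carry a leading space.
    return [
        Word(text.strip(), start_ms, end_ms, confidence)
        for text, start_ms, end_ms, confidence in result.get("words", [])
    ]


sample = json.dumps({
    "text": "hi how",
    "is_final": True,
    "offset_ms": 1234,
    "words": [[" hi", 0, 200, 0.95], [" how", 200, 400, 0.92]],
})
for word in parse_words(sample):
    print(word.text, word.start_ms, word.end_ms, word.confidence)
```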

Example flow

Partial: {"text": "hi", "is_final": false, "offset_ms": 100, "stability": 0.6}
Partial: {"text": "hi how", "is_final": false, "offset_ms": 100, "stability": 0.7}
Partial: {"text": "hi how are", "is_final": false, "offset_ms": 100, "stability": 0.8}
Partial: {"text": "hi how are you", "is_final": false, "offset_ms": 100, "stability": 0.85}
... more partials ...
Final:   {"text": "hi how are you doing today", "is_final": true, "offset_ms": 100, "words": [...]}
Partial: {"text": "I'm", "is_final": false, "offset_ms": 2500, "stability": 0.6}  ← New segment starts
Partial: {"text": "I'm doing", "is_final": false, "offset_ms": 2500, "stability": 0.7}
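
One way a client could consume this flow is to keep finalized segments and overwrite a single pending partial, as sketched below. `TranscriptAssembler` is a hypothetical class, not part of any Scriptix SDK, and joining segments with a single space is an assumption about how the text should be displayed.

```python
import json


class TranscriptAssembler:
    """Keeps finalized segments and the latest partial of the current segment."""

    def __init__(self):
        self.finals = []   # text of segments that will no longer change
        self.partial = ""  # latest partial for the segment in progress

    def handle(self, message):
        result = json.loads(message)
        if result["is_final"]:
            # A final closes the current segment, so the pending partial is dropped.
            self.finals.append(result["text"])
            self.partial = ""
        else:
            # Each partial replaces the previous one rather than appending to it.
            self.partial = result["text"]
        return self.display()

    def display(self):
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)


flow = [
    {"text": "hi how are", "is_final": False, "offset_ms": 100, "stability": 0.8},
    {"text": "hi how are you doing today", "is_final": True, "offset_ms": 100, "words": []},
    {"text": "I'm", "is_final": False, "offset_ms": 2500, "stability": 0.6},
    {"text": "I'm doing", "is_final": False, "offset_ms": 2500, "stability": 0.7},
]
assembler = TranscriptAssembler()
for item in flow:
    print(assembler.handle(json.dumps(item)))
```

Because partials overwrite one another, only finalized text accumulates, which keeps the displayed transcript from repeating words while a segment is still being revised.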