Start a connection

The API is available over the WebSocket Secure (WSS) protocol at the following endpoint:

wss://realtime.scriptix.io/v2/realtime

Authentication and configuration are provided via query parameters (see below).

Query parameters

Parameter	Value	Description	Required
token	string	Scriptix Realtime API token used for authentication (recommended for browser/WebSocket clients)	Yes, unless the token is sent in a request header
language	string	ISO 639-1 language code for the speech-to-text session (for example, `en`)	No
type	string	Model type: `fast` (default) or `quality`	No

Example connection URL:

wss://realtime.scriptix.io/v2/realtime?token=your-api-token&language=en&type=fast
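The connection URL can be assembled programmatically instead of by hand. The following is a minimal sketch using only the Python standard library; the function name `build_connection_url` and its argument names are illustrative and not part of the API.

```python
from urllib.parse import urlencode

BASE_URL = "wss://realtime.scriptix.io/v2/realtime"


def build_connection_url(token, language=None, model_type=None):
    """Return the WSS URL with the documented query parameters.

    Hypothetical helper: only `token`, `language` and `type` come from
    the parameter table above.
    """
    params = {"token": token}
    if language is not None:
        params["language"] = language
    if model_type is not None:
        # The table lists `fast` (default) and `quality` as model types.
        if model_type not in ("fast", "quality"):
            raise ValueError("type must be 'fast' or 'quality'")
        params["type"] = model_type
    # urlencode percent-escapes any characters that are unsafe in a URL.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_connection_url("your-api-token", language="en", model_type="fast"))
```

Leaving out `language` and `model_type` produces a URL that carries only the token, which relies on the service defaults.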

Request headers (alternative for server-side clients)

Server-side WebSocket clients can optionally send the API token in a request header instead of the `token` query parameter. Any one of the following header names can be used:

Header	Value	Description
x-zoom-s2t-key	Scriptix Realtime API Token	Real-time API key used for authorization
x-api-key	Scriptix Realtime API Token	Alternative header name for API key
api-key	Scriptix Realtime API Token	Alternative header name for API key

Note: The browser WebSocket API does not allow custom request headers, so web applications must pass the token in the `token` query parameter.
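For server-side clients, the authentication header can be prepared as a plain dictionary. This is a sketch under assumptions: `build_auth_headers` is a hypothetical helper, and how the resulting dictionary is handed to the connection call depends on the WebSocket client library in use, which is not shown here.

```python
# Header names taken from the table above.
ACCEPTED_HEADERS = ("x-zoom-s2t-key", "x-api-key", "api-key")


def build_auth_headers(token, header_name="x-api-key"):
    """Return a one-entry header dict carrying the API token.

    Hypothetical helper; the default header name is an arbitrary choice
    among the three documented alternatives.
    """
    if header_name not in ACCEPTED_HEADERS:
        raise ValueError(f"unsupported header name: {header_name}")
    return {header_name: token}


headers = build_auth_headers("your-api-token")
print(headers)
```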

Transcription results

Partial results

The first results sent after the service receives audio data are partials. A partial contains the text detected so far in the current segment and may change as more audio is processed. Each partial carries the full text of the segment so far, so it replaces the previous partial rather than being appended to it.

{
  "text": "hi how are",
  "is_final": false,
  "offset_ms": 1234,
  "stability": 0.8
}

Field	Type	Description
text	string	Text of the current segment so far. Grows with each update and replaces the previous partial.
is_final	boolean	Always `false` for partial results
offset_ms	integer	Position in audio stream (milliseconds) for synchronization
stability	float	Confidence score between 0 and 1

Note: Partials are only sent when speech is detected. No results are sent during silence.
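Because each partial replaces the previous one, a client only needs to keep the most recent partial text. The sketch below illustrates this with the standard `json` module; `update_partial` and its `min_stability` threshold are assumptions made for illustration, not part of the API.

```python
import json


def update_partial(current_text, raw_message, min_stability=0.0):
    """Return the text to display after one incoming message.

    Hypothetical helper: finals are ignored here, and partials below
    `min_stability` are skipped so the display flickers less.
    """
    result = json.loads(raw_message)
    if result.get("is_final"):
        return current_text  # final results are handled separately
    if result.get("stability", 0.0) < min_stability:
        return current_text  # keep the previous, more stable partial
    # Replace, do not append: the partial already holds the full text.
    return result["text"]


shown = ""
shown = update_partial(shown, '{"text": "hi", "is_final": false, "offset_ms": 100, "stability": 0.6}')
shown = update_partial(shown, '{"text": "hi how", "is_final": false, "offset_ms": 100, "stability": 0.7}')
print(shown)
```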

Final results

When the realtime engine is confident about a transcription segment, it sends a final result. Finals are emitted after approximately 15 words or after 30 seconds of accumulated audio.

{
  "text": "hi how are you doing today",
  "is_final": true,
  "offset_ms": 1234,
  "words": [
    [" hi", 0, 200, 0.95],
    [" how", 200, 400, 0.92],
    [" are", 400, 600, 0.94],
    [" you", 600, 800, 0.91],
    [" doing", 800, 1000, 0.93],
    [" today", 1000, 1300, 0.96]
  ]
}

Field	Type	Description
text	string	Finalized transcription text that won't change
is_final	boolean	Always `true` for final results
offset_ms	integer	Position in audio stream (milliseconds) for synchronization
words	array	Word-level timestamps: `[word, start_ms, end_ms, confidence]`
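The positional entries in `words` are easier to work with once they are unpacked into named fields. The following sketch assumes the four-element layout shown in the table; the `Word` type and `parse_words` function are illustrative names. The words in the sample carry a leading space, which the sketch strips.

```python
import json
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    start_ms: int
    end_ms: int
    confidence: float


def parse_words(final_result):
    """Convert [word, start_ms, end_ms, confidence] lists into Word tuples."""
    return [
        Word(word.strip(), start_ms, end_ms, confidence)
        for word, start_ms, end_ms, confidence in final_result["words"]
    ]


raw = """{
  "text": "hi how are you doing today",
  "is_final": true,
  "offset_ms": 1234,
  "words": [[" hi", 0, 200, 0.95], [" how", 200, 400, 0.92],
            [" are", 400, 600, 0.94], [" you", 600, 800, 0.91],
            [" doing", 800, 1000, 0.93], [" today", 1000, 1300, 0.96]]
}"""

words = parse_words(json.loads(raw))
for w in words:
    print(f"{w.start_ms:>5}-{w.end_ms:<5} {w.text} ({w.confidence})")
```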

Example flow

Partial: {"text": "hi", "is_final": false, "offset_ms": 100, "stability": 0.6}
Partial: {"text": "hi how", "is_final": false, "offset_ms": 100, "stability": 0.7}
Partial: {"text": "hi how are", "is_final": false, "offset_ms": 100, "stability": 0.8}
Partial: {"text": "hi how are you", "is_final": false, "offset_ms": 100, "stability": 0.85}
... more partials ...
Final:   {"text": "hi how are you doing today", "is_final": true, "offset_ms": 100, "words": [...]}
Partial: {"text": "I'm", "is_final": false, "offset_ms": 2500, "stability": 0.6}  ← New segment starts
Partial: {"text": "I'm doing", "is_final": false, "offset_ms": 2500, "stability": 0.7}
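The flow above suggests a simple client-side model: keep a list of finalized segments plus one current partial that is cleared whenever a final arrives. The sketch below is one possible way to implement it; the `Transcript` class is hypothetical, and joining segments with a single space is an assumption about how the text should be displayed.

```python
class Transcript:
    """Accumulates finals and tracks the current partial (illustrative)."""

    def __init__(self):
        self.finals = []   # finalized segment texts, in arrival order
        self.partial = ""  # latest partial of the segment in progress

    def feed(self, result):
        if result["is_final"]:
            self.finals.append(result["text"])
            self.partial = ""  # the final supersedes the pending partial
        else:
            self.partial = result["text"]  # replace the previous partial

    def text(self):
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)


transcript = Transcript()
for message in [
    {"text": "hi", "is_final": False, "offset_ms": 100, "stability": 0.6},
    {"text": "hi how are you", "is_final": False, "offset_ms": 100, "stability": 0.85},
    {"text": "hi how are you doing today", "is_final": True, "offset_ms": 100, "words": []},
    {"text": "I'm", "is_final": False, "offset_ms": 2500, "stability": 0.6},
    {"text": "I'm doing", "is_final": False, "offset_ms": 2500, "stability": 0.7},
]:
    transcript.feed(message)

print(transcript.text())
```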