Protocol Messages

All control messages are exchanged as JSON. The only exception is the audio stream, which is sent as binary frames.

Client → Server Messages

1. Start Session

Request:

{
  "action": "start"
}

Description: Initializes the Speech-to-Text engine. Authentication must be provided during connection via query parameters or headers (see Connecting).

Response: The server responds with a Listening message on success, or an Error message on failure.

2. Stop Session

Request:

{
  "action": "stop"
}

Description: Stops the Speech-to-Text engine. The service processes any remaining buffered audio and sends final results before responding with a Stopped message.

Note: It is not possible to start a new session after stopping. You must disconnect and create a new WebSocket connection.
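The two control messages above can be built with a small helper. This is a minimal sketch: `control_message` and `VALID_ACTIONS` are hypothetical names chosen for illustration, not part of the service or any SDK.

```python
import json

# The only two actions documented for client control messages.
VALID_ACTIONS = {"start", "stop"}


def control_message(action: str) -> str:
    """Serialize a control message as a JSON string for a text frame."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return json.dumps({"action": action})
```

Rejecting unknown actions on the client side avoids a round trip that would presumably end in an "Invalid message format" error.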

3. Send Audio Data

Request: Binary data (audio stream)

Format: PCM WAVE, Mono, 16kHz, 16-bit signed integer

Description: Send audio data as binary frames. The server will process the audio and send back Partial and Final transcription results.

Note: The server only processes audio data sent after the client has received a successful Listening response.
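As an illustration of the audio format, the sketch below splits a PCM buffer into fixed-duration binary frames. At 16 kHz, 16-bit mono, one millisecond of audio is 32 bytes. The 100 ms frame length and the name `pcm_frames` are assumptions (this section does not specify a frame size), and the sketch assumes raw samples with no WAV header.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # 16 kHz, from the documented format
BYTES_PER_SAMPLE = 2      # 16-bit signed integer, mono
BYTES_PER_MS = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE // 1000  # 32 bytes per millisecond


def pcm_frames(pcm: bytes, frame_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive chunks of `pcm`, each holding `frame_ms` of audio.

    The last chunk may be shorter if the buffer length is not a multiple
    of the frame size.
    """
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    size = frame_ms * BYTES_PER_MS
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

Each yielded chunk would be sent as one binary WebSocket frame once the Listening message has arrived.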

Server → Client Messages

Status Messages

Listening

Response:

{
  "state": "listening",
  "session_id": "abc123-def456-789"
}

Fields:
- state (string): Always "listening"
- session_id (string): Unique identifier for this transcription session

Description: Indicates the Speech-to-Text engine is ready to receive audio data. You can now start sending binary audio frames.

Stopped

Response:

{
  "state": "stopped"
}

Fields:
- state (string): Always "stopped"

Description: Indicates the session has been stopped and all audio has been processed. The WebSocket connection should be closed after receiving this message.

Shutting Down

Response:

{
  "state": "shutting_down",
  "at": 1674567890
}

Fields:
- state (string): Always "shutting_down"
- at (integer): Unix timestamp (seconds) at which the service will shut down

Description: Warning sent approximately one hour before scheduled maintenance/shutdown. You should finish your session or reconnect to a different instance before the specified time.
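A client can turn the `at` field into a remaining-time budget. The sketch below is one way to do that; `seconds_until_shutdown` is a hypothetical helper, and the `now` parameter exists only to make the function easy to test.

```python
import json
import time
from typing import Optional


def seconds_until_shutdown(raw: str, now: Optional[float] = None) -> Optional[float]:
    """Return the seconds left before shutdown, or None for any other message."""
    msg = json.loads(raw)
    if msg.get("state") != "shutting_down":
        return None
    current = time.time() if now is None else now
    # Clamp at zero in case the warning is read after the deadline has passed.
    return max(0.0, float(msg["at"] - current))
```

A caller might use the returned value to schedule a stop and a reconnect before the deadline.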

Transcription Results

Partial Results

Response:

{
  "partial": "Hello this is a test"
}

Fields:
- partial (string): Current transcription text (may change as more audio is processed)

Description: Interim transcription results sent while audio is being processed. These are preliminary and may be updated or replaced by subsequent partial results or final results.

Note: Partial results are only sent when speech is detected. Silence will not generate partial results.

Full Results

Response:

{
  "result": [
    ["Hello", 0, 480, 1.0],
    ["this", 480, 720, 0.99],
    ["is", 720, 880, 1.0],
    ["a", 880, 960, 0.98],
    ["test", 960, 1280, 0.995]
  ],
  "text": "Hello this is a test"
}

Fields:
- result (array): Array of word-level transcription results
  - Each element: [word, start_ms, stop_ms, confidence]
    - word (string): Transcribed word
    - start_ms (integer): Start time in milliseconds (relative to audio processed)
    - stop_ms (integer): End time in milliseconds (relative to audio processed)
    - confidence (float): Confidence score (0.0 to 1.0)
- text (string): Complete transcribed sentence
- speaker (string, optional): Speaker identifier (if diarization enabled)
- sconf (float, optional): Sentence confidence score
- channel (integer, optional): Audio channel (if multi-channel)

Description: Final transcription results sent when the engine is confident about the result. These results are stable and will not change.

Note: Timestamps (start_ms, stop_ms) are relative to the amount of audio processed (including silence), not the actual session duration. This allows sending audio up to 2x real-time speed.
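The positional word arrays are easier to work with once unpacked into named fields. A minimal sketch, assuming every entry has exactly the four documented elements; `Word` and `parse_words` are illustrative names.

```python
import json
from typing import List, NamedTuple


class Word(NamedTuple):
    word: str
    start_ms: int      # relative to the audio processed, not wall-clock time
    stop_ms: int
    confidence: float


def parse_words(raw: str) -> List[Word]:
    """Unpack the `result` array of a final message into Word tuples."""
    msg = json.loads(raw)
    return [Word(*entry) for entry in msg["result"]]
```

With the example message above, the last entry unpacks to the word "test" spanning 960 ms to 1280 ms.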

Streaming Format (Alternative)

The service may also return streaming-style messages:

Streaming Partial

Response:

{
  "text": "Hello this is",
  "is_final": false,
  "offset_ms": 0,
  "stability": 0.87
}

Fields:
- text (string): Current transcription text
- is_final (boolean): Always false for partial results
- offset_ms (integer): Session offset for video/audio alignment
- stability (float): Confidence/stability score (0.0 to 1.0)

Streaming Final

Response:

{
  "text": "Hello this is a test",
  "is_final": true,
  "offset_ms": 0,
  "words": [
    ["Hello", 0, 480, 1.0],
    ["this", 480, 720, 0.99],
    ["is", 720, 880, 1.0],
    ["a", 880, 960, 0.98],
    ["test", 960, 1280, 0.995]
  ]
}

Fields:
- text (string): Complete transcribed text
- is_final (boolean): Always true for final results
- offset_ms (integer): Session offset for video/audio alignment
- words (array): Word-level timestamps (same format as standard result)
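Because the service may use either format, a client may want to fold both into one shape. The sketch below assumes the presence of `is_final` is enough to recognise a streaming-style message, which this section implies but does not state outright. `normalize` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional, Tuple

# (text, is_final, word entries)
Transcript = Tuple[str, bool, List[list]]


def normalize(msg: Dict[str, Any]) -> Optional[Transcript]:
    """Map standard and streaming transcription messages to one tuple shape.

    Returns None for messages that carry no transcription (status, error).
    """
    if "is_final" in msg:                      # streaming-style message
        return msg["text"], bool(msg["is_final"]), msg.get("words", [])
    if "partial" in msg:                       # standard partial result
        return msg["partial"], False, []
    if "result" in msg:                        # standard final result
        return msg["text"], True, msg["result"]
    return None
```

The `is_final` check comes first because streaming messages also carry a `text` key, as standard final results do.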

Error Messages

Response:

{
  "error": "error_description"
}

Common Error Messages:

| Error Message | Description |
| --- | --- |
| `"Session not started"` | Audio data was sent before starting a session with `{"action": "start"}` |
| `"Invalid message format"` | The JSON message sent by the client is malformed or invalid |
| `"engine already listening"` | Attempted to start a session while one is already active |
| `"restarting of sessions is not supported"` | Attempted to start a new session after stopping. Must disconnect and reconnect. |
| `"model not loaded, contact administrator!"` | The requested language model is not available |
| `"unable to load model"` | Failed to load the transcription model |
| `"session is already initializing, please wait"` | Start request sent while session is still initializing |
| `"An error occurred processing the message"` | General server error during message processing |
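One way to surface these errors in client code is to raise an exception whenever a message carries an `error` key. This is a sketch only; `TranscriptionError` and `raise_for_error` are hypothetical names.

```python
import json


class TranscriptionError(Exception):
    """Raised when the server sends a message of the form {"error": ...}."""


def raise_for_error(raw: str) -> dict:
    """Decode a JSON text frame and raise if it is an error message."""
    msg = json.loads(raw)
    if "error" in msg:
        raise TranscriptionError(msg["error"])
    return msg
```

Non-error messages pass through unchanged, so the function can sit at the front of a receive loop.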

Connection Authentication Errors

If authentication fails during connection, the WebSocket will be closed with specific close codes:

| Close Code | Reason | Description |
| --- | --- | --- |
| 4400 | `invalid_language` | The requested language is not supported or invalid |
| 4402 | `no_subscription_found` | No active subscription found for the organization |
| 4403 | `invalid_s2t_token` | Invalid or missing API token |
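A client can map the close code it receives to the documented reason. A minimal sketch: the table contents come from this section, while `describe_close` and the wording of the returned strings are illustrative.

```python
# Close codes documented for authentication failures at connect time.
AUTH_CLOSE_REASONS = {
    4400: "invalid_language",
    4402: "no_subscription_found",
    4403: "invalid_s2t_token",
}


def describe_close(code: int) -> str:
    """Return a readable description for a WebSocket close code."""
    reason = AUTH_CLOSE_REASONS.get(code)
    if reason is None:
        return f"closed with code {code}"
    return f"authentication failed: {reason} ({code})"
```

Codes outside the table fall through to a generic description, since this section documents no others.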

Example Message Flow

1. Client connects to wss://realtime.scriptix.io/v2/realtime?token=xxx&language=en
2. Client → Server: {"action": "start"}
3. Server → Client: {"state": "listening", "session_id": "abc123"}
4. Client → Server: <binary audio data>
5. Server → Client: {"partial": "Hello"}
6. Client → Server: <binary audio data>
7. Server → Client: {"partial": "Hello world"}
8. Server → Client: {"result": [["Hello", 0, 480, 1.0], ["world", 480, 960, 0.99]], "text": "Hello world"}
9. Client → Server: {"action": "stop"}
10. Server → Client: {"state": "stopped"}
11. Client closes WebSocket connection
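
The flow above can be modelled as a small client-side state tracker that decides when audio may be sent. This sketch performs no network I/O and uses no WebSocket library. The class, the state names, and the rule that "stop" is only sent while listening are assumptions made for illustration.

```python
import json
from enum import Enum
from typing import List, Optional


class State(Enum):
    CONNECTED = "connected"   # socket open, no session yet
    STARTING = "starting"     # "start" sent, waiting for Listening
    LISTENING = "listening"   # audio may be sent
    STOPPING = "stopping"     # "stop" sent, waiting for Stopped
    STOPPED = "stopped"       # session over, close the socket


class SessionTracker:
    def __init__(self) -> None:
        self.state = State.CONNECTED
        self.session_id: Optional[str] = None
        self.finals: List[str] = []

    def on_sent(self, action: str) -> None:
        """Record a control message the client has just sent."""
        if action == "start" and self.state is State.CONNECTED:
            self.state = State.STARTING
        elif action == "stop" and self.state is State.LISTENING:
            self.state = State.STOPPING
        else:
            raise RuntimeError(f"cannot send {action!r} while {self.state.value}")

    def can_send_audio(self) -> bool:
        return self.state is State.LISTENING

    def on_message(self, raw: str) -> None:
        """Update the tracker from a JSON text frame sent by the server."""
        msg = json.loads(raw)
        if msg.get("state") == "listening":
            self.state = State.LISTENING
            self.session_id = msg.get("session_id")
        elif msg.get("state") == "stopped":
            self.state = State.STOPPED
        elif "result" in msg:
            # Final results may still arrive after "stop" has been sent.
            self.finals.append(msg["text"])
```

A real client would call `on_sent` after each control frame it writes and `on_message` for each text frame it reads, and would check `can_send_audio()` before writing binary frames.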