Protocol Messages

All control messages are exchanged as JSON. The only exception is the audio stream, which is sent as binary frames.


Client → Server Messages

1. Start Session

Request:

{
  "action": "start"
}

Description: Initializes the Speech-to-Text engine. Authentication must be provided during connection via query parameters or headers (see Connecting).

Response: The server responds with a Listening message on success, or an Error message on failure.


2. Stop Session

Request:

{
  "action": "stop"
}

Description: Stops the Speech-to-Text engine. The service will process any remaining audio buffer and send final results before responding with a Stopped message.

Note: It is not possible to start a new session after stopping. You must disconnect and create a new WebSocket connection.
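The start and stop rules above can be enforced on the client with a small state tracker. This is a minimal sketch, not part of the service API: `SessionState` and `SessionTracker` are hypothetical names, and the intermediate `STARTING` and `STOPPING` states are an assumption added so the client never sends a second start or a restart.

```python
import json
from enum import Enum


class SessionState(Enum):
    IDLE = "idle"            # connected, no start sent yet
    STARTING = "starting"    # start sent, waiting for Listening
    LISTENING = "listening"  # server is ready for audio
    STOPPING = "stopping"    # stop sent, waiting for Stopped
    STOPPED = "stopped"      # terminal: a new connection is required


class SessionTracker:
    """Hypothetical client-side guard for the start/stop rules."""

    def __init__(self) -> None:
        self.state = SessionState.IDLE

    def start_message(self) -> str:
        # A session can be started only once per WebSocket connection.
        if self.state is not SessionState.IDLE:
            raise RuntimeError(f"cannot start from state {self.state.value}")
        self.state = SessionState.STARTING
        return json.dumps({"action": "start"})

    def stop_message(self) -> str:
        if self.state is not SessionState.LISTENING:
            raise RuntimeError(f"cannot stop from state {self.state.value}")
        self.state = SessionState.STOPPING
        return json.dumps({"action": "stop"})

    def on_status(self, message: dict) -> None:
        # Apply a server status message ("listening" or "stopped").
        if message.get("state") == "listening":
            self.state = SessionState.LISTENING
        elif message.get("state") == "stopped":
            self.state = SessionState.STOPPED
```

Once the tracker reaches `STOPPED`, `start_message()` raises instead of producing a request the server would reject.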


3. Send Audio Data

Request: Binary data (audio stream)

Format: PCM WAVE, Mono, 16kHz, 16-bit signed integer

Description: Send audio data as binary frames. The server will process the audio and send back Partial and Final transcription results.

Note: Audio data will only be processed after receiving a successful Listening response from the server.
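For the format above (mono, 16 kHz, 16-bit samples), one second of audio is 32,000 bytes. The sketch below splits a raw PCM buffer into binary frames. The helper names and the 100 ms default chunk size are assumptions for illustration; the protocol text does not prescribe a frame size.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # from the format line above
BYTES_PER_SAMPLE = 2      # 16-bit signed integer
CHANNELS = 1              # mono


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield successive frames of chunk_ms of audio (the last may be shorter)."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    bytes_per_ms = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS // 1000
    size = bytes_per_ms * chunk_ms
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


def audio_ms(byte_count: int) -> int:
    """Duration in milliseconds represented by byte_count bytes of PCM."""
    return byte_count * 1000 // (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
```

Each yielded chunk would be sent as one binary WebSocket frame, and only after the Listening message has arrived.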


Server → Client Messages

Status Messages

Listening

Response:

{
  "state": "listening",
  "session_id": "abc123-def456-789"
}

Fields:
- state (string): Always "listening"
- session_id (string): Unique session identifier for this transcription session

Description: Indicates the Speech-to-Text engine is ready to receive audio data. You can now start sending binary audio frames.


Stopped

Response:

{
  "state": "stopped"
}

Fields:
- state (string): Always "stopped"

Description: Indicates the session has been stopped and all audio has been processed. The WebSocket connection should be closed after receiving this message.


Shutting Down

Response:

{
  "state": "shutting_down",
  "at": 1674567890
}

Fields:
- state (string): Always "shutting_down"
- at (integer): Unix timestamp (seconds) at which the service will shut down

Description: Warning sent approximately one hour before scheduled maintenance/shutdown. You should finish your session or reconnect to a different instance before the specified time.
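A client can turn the `at` field into a countdown to decide when to finish or reconnect. This is a small sketch; the function name and the optional `now` parameter (used to make the calculation testable) are assumptions.

```python
import time
from typing import Optional


def seconds_until_shutdown(message: dict, now: Optional[float] = None) -> float:
    """Return seconds left before the announced shutdown (0.0 if already past)."""
    if message.get("state") != "shutting_down":
        raise ValueError("not a shutting_down message")
    current = time.time() if now is None else now
    # "at" is a Unix timestamp in seconds.
    return max(0.0, float(message["at"]) - current)
```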


Transcription Results

Partial Results

Response:

{
  "partial": "Hello this is a test"
}

Fields:
- partial (string): Current transcription text (may change as more audio is processed)

Description: Interim transcription results sent while audio is being processed. These are preliminary and may be updated or replaced by subsequent partial results or final results.

Note: Partial results are only sent when speech is detected. Silence will not generate partial results.


Full Results

Response:

{
  "result": [
    ["Hello", 0, 480, 1.0],
    ["this", 480, 720, 0.99],
    ["is", 720, 880, 1.0],
    ["a", 880, 960, 0.98],
    ["test", 960, 1280, 0.995]
  ],
  "text": "Hello this is a test"
}

Fields:
- result (array): Array of word-level transcription results
  - Each element: [word, start_ms, stop_ms, confidence]
    - word (string): Transcribed word
    - start_ms (integer): Start time in milliseconds (relative to audio processed)
    - stop_ms (integer): End time in milliseconds (relative to audio processed)
    - confidence (float): Confidence score (0.0 to 1.0)
- text (string): Complete transcribed sentence
- speaker (string, optional): Speaker identifier (if diarization enabled)
- sconf (float, optional): Sentence confidence score
- channel (integer, optional): Audio channel (if multi-channel)

Description: Final transcription results sent when the engine is confident about the result. These results are stable and will not change.

Note: Timestamps (start_ms, stop_ms) are relative to the amount of audio processed (including silence), not to the elapsed session time. This allows audio to be sent at up to 2x real-time speed.
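The positional word arrays are easier to work with once unpacked into named fields. The following sketch parses a final result message; the `Word` and `FinalResult` classes and the `parse_final_result` function are hypothetical names, not types provided by the service.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start_ms: int
    stop_ms: int
    confidence: float


@dataclass
class FinalResult:
    text: str
    words: List[Word]
    speaker: Optional[str] = None    # present only if diarization is enabled
    sconf: Optional[float] = None    # optional sentence confidence
    channel: Optional[int] = None    # present only for multi-channel audio


def parse_final_result(message: dict) -> FinalResult:
    """Convert a final result message into typed objects."""
    words = [
        Word(word, start, stop, conf)
        for word, start, stop, conf in message["result"]
    ]
    return FinalResult(
        text=message["text"],
        words=words,
        speaker=message.get("speaker"),
        sconf=message.get("sconf"),
        channel=message.get("channel"),
    )
```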


Streaming Format (Alternative)

The service may also return streaming-style messages:

Streaming Partial

Response:

{
  "text": "Hello this is",
  "is_final": false,
  "offset_ms": 0,
  "stability": 0.87
}

Fields:
- text (string): Current transcription text
- is_final (boolean): Always false for partial results
- offset_ms (integer): Session offset for video/audio alignment
- stability (float): Confidence/stability score (0.0 to 1.0)


Streaming Final

Response:

{
  "text": "Hello this is a test",
  "is_final": true,
  "offset_ms": 0,
  "words": [
    ["Hello", 0, 480, 1.0],
    ["this", 480, 720, 0.99],
    ["is", 720, 880, 1.0],
    ["a", 880, 960, 0.98],
    ["test", 960, 1280, 0.995]
  ]
}

Fields:
- text (string): Complete transcribed text
- is_final (boolean): Always true for final results
- offset_ms (integer): Session offset for video/audio alignment
- words (array): Word-level timestamps (same format as standard result)
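Because the service may use either message style, a client may want to map both onto one shape before further processing. This is a sketch under the assumption that the presence of `is_final` distinguishes streaming-style messages; `normalize_transcript` and the returned dictionary layout are hypothetical.

```python
from typing import Any, Dict, Optional


def normalize_transcript(message: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map either message style to {"text", "is_final", "words"}.

    Returns None for messages that carry no transcription
    (status or error messages).
    """
    if "is_final" in message:  # streaming style
        return {
            "text": message["text"],
            "is_final": bool(message["is_final"]),
            "words": message.get("words", []),
        }
    if "partial" in message:  # standard partial result
        return {"text": message["partial"], "is_final": False, "words": []}
    if "result" in message:  # standard final result
        return {
            "text": message["text"],
            "is_final": True,
            "words": message["result"],
        }
    return None
```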


Error Messages

Response:

{
  "error": "error_description"
}

Common Error Messages:

| Error Message | Description |
| --- | --- |
| "Session not started" | Audio data was sent before starting a session with {"action": "start"} |
| "Invalid message format" | The JSON message sent by the client is malformed or invalid |
| "engine already listening" | Attempted to start a session while one is already active |
| "restarting of sessions is not supported" | Attempted to start a new session after stopping. Must disconnect and reconnect. |
| "model not loaded, contact administrator!" | The requested language model is not available |
| "unable to load model" | Failed to load the transcription model |
| "session is already initializing, please wait" | Start request sent while session is still initializing |
| "An error occurred processing the message" | General server error during message processing |

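Error messages can be surfaced as exceptions so that calling code does not need to inspect every message. The sketch below is illustrative only: `ServerError` and `raise_for_error` are hypothetical names, and only the two cases whose descriptions spell out a client action (reconnect, or wait) are classified.

```python
# Error strings copied from the list of common error messages.
RECONNECT_ERRORS = {"restarting of sessions is not supported"}
WAIT_ERRORS = {"session is already initializing, please wait"}


class ServerError(Exception):
    """Raised when the server sends {"error": ...}."""

    def __init__(self, error: str) -> None:
        super().__init__(error)
        self.error = error
        # Everything not listed above is left for the caller to decide.
        self.needs_reconnect = error in RECONNECT_ERRORS
        self.should_wait = error in WAIT_ERRORS


def raise_for_error(message: dict) -> None:
    """Raise ServerError if the message is an error message, else return."""
    if "error" in message:
        raise ServerError(str(message["error"]))
```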
Connection Authentication Errors

If authentication fails during connection, the WebSocket will be closed with specific close codes:

| Close Code | Reason | Description |
| --- | --- | --- |
| 4400 | invalid_language | The requested language is not supported or invalid |
| 4402 | no_subscription_found | No active subscription found for the organization |
| 4403 | invalid_s2t_token | Invalid or missing API token |
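A lookup keyed on the close code makes it straightforward to report these failures. This is a minimal sketch; `CLOSE_REASONS` and `describe_close` are hypothetical names, and the fallback text for other codes is an assumption.

```python
from typing import Dict, Tuple

# Close codes and reasons as documented above.
CLOSE_REASONS: Dict[int, Tuple[str, str]] = {
    4400: ("invalid_language", "The requested language is not supported or invalid"),
    4402: ("no_subscription_found", "No active subscription found for the organization"),
    4403: ("invalid_s2t_token", "Invalid or missing API token"),
}


def describe_close(code: int) -> str:
    """Return a readable description for a WebSocket close code."""
    entry = CLOSE_REASONS.get(code)
    if entry is None:
        # Not one of the documented connection errors.
        return f"connection closed with code {code}"
    reason, description = entry
    return f"{reason} ({code}): {description}"
```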

Example Message Flow

1. Client connects to wss://realtime.scriptix.io/v2/realtime?token=xxx&language=en
2. Client → Server: {"action": "start"}
3. Server → Client: {"state": "listening", "session_id": "abc123"}
4. Client → Server: <binary audio data>
5. Server → Client: {"partial": "Hello"}
6. Client → Server: <binary audio data>
7. Server → Client: {"partial": "Hello world"}
8. Server → Client: {"result": [["Hello", 0, 480, 1.0], ["world", 480, 960, 0.99]], "text": "Hello world"}
9. Client → Server: {"action": "stop"}
10. Server → Client: {"state": "stopped"}
11. Client closes WebSocket connection
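
The flow above could be driven by code along the following lines. This is a sketch, not an official client: `run_session` is a hypothetical function, and `send` and `recv` are assumed callables that wrap whichever WebSocket library is in use on an already-open connection. For simplicity it sends all audio before reading results; a real client would normally read messages concurrently while streaming.

```python
import json
from typing import Callable, Iterable, List, Union

Frame = Union[str, bytes]


def run_session(
    send: Callable[[Frame], None],
    recv: Callable[[], str],
    audio_chunks: Iterable[bytes],
) -> List[str]:
    """Drive one session and return the text of every final result."""
    send(json.dumps({"action": "start"}))
    reply = json.loads(recv())
    if reply.get("state") != "listening":
        raise RuntimeError(reply.get("error", "unexpected reply to start"))

    for chunk in audio_chunks:
        send(chunk)  # binary frame

    send(json.dumps({"action": "stop"}))

    finals: List[str] = []
    while True:
        message = json.loads(recv())
        if "error" in message:
            raise RuntimeError(message["error"])
        if "result" in message:
            finals.append(message["text"])
        if message.get("state") == "stopped":
            # The caller should now close the WebSocket connection.
            return finals
```

Passing the transport in as plain callables keeps the protocol logic independent of any particular WebSocket library and lets it be exercised with canned messages.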