Protocol Messages
All control messages are exchanged in JSON format, with the exception of the audio stream, which is sent as binary data.
Client → Server Messages
1. Start Session
Request:
{
"action": "start"
}
Description: Initializes the Speech-to-Text engine. Authentication must be provided during connection via query parameters or headers (see Connecting).
Response: The server responds with a Listening message on success, or an Error message on failure.
2. Stop Session
Request:
{
"action": "stop"
}
Description: Stops the Speech-to-Text engine. The service will process any remaining audio buffer and send final results before responding with a Stopped message.
Note: It is not possible to start a new session after stopping. You must disconnect and create a new WebSocket connection.
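The start/stop rules above can be sketched as a small client-side state tracker. This is only an illustration: the `Session` class, the `SessionState` names, and the choice to allow `stop` only while listening are assumptions made for the example, not part of the service API.

```python
import json
from enum import Enum


class SessionState(Enum):
    IDLE = "idle"            # connected, no start sent yet
    STARTING = "starting"    # start sent, waiting for Listening
    LISTENING = "listening"  # audio may be sent
    STOPPING = "stopping"    # stop sent, waiting for Stopped
    STOPPED = "stopped"      # terminal: reconnect to start again


class Session:
    """Hypothetical client-side tracker for one WebSocket connection."""

    def __init__(self) -> None:
        self.state = SessionState.IDLE

    def start_message(self) -> str:
        # A session can be started only once per connection.
        if self.state is not SessionState.IDLE:
            raise RuntimeError(f"cannot start from state {self.state.value}")
        self.state = SessionState.STARTING
        return json.dumps({"action": "start"})

    def stop_message(self) -> str:
        # Conservative client-side policy (an assumption): stop only while listening.
        if self.state is not SessionState.LISTENING:
            raise RuntimeError(f"cannot stop from state {self.state.value}")
        self.state = SessionState.STOPPING
        return json.dumps({"action": "stop"})

    def on_server_state(self, message: dict) -> None:
        # Update the tracker from a Listening or Stopped status message.
        state = message.get("state")
        if state == "listening":
            self.state = SessionState.LISTENING
        elif state == "stopped":
            self.state = SessionState.STOPPED

    @property
    def can_send_audio(self) -> bool:
        return self.state is SessionState.LISTENING
```

Once the tracker reaches `STOPPED`, calling `start_message()` raises, which mirrors the rule that a new session requires a new WebSocket connection.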
3. Send Audio Data
Request: Binary data (audio stream)
Format: PCM WAVE, Mono, 16kHz, 16-bit signed integer
Description: Send audio data as binary frames. The server will process the audio and send back Partial and Final transcription results.
Note: Audio data is processed only after the server has sent a successful Listening response.
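As a rough sketch of preparing binary frames, the helper below splits raw PCM bytes into fixed-duration chunks based on the format above (16 kHz, 16-bit, mono, so 32 bytes per millisecond). The function name `chunk_pcm` and the 100 ms default chunk length are assumptions for the example; the protocol description does not prescribe a frame size.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # 16 kHz
BYTES_PER_SAMPLE = 2      # 16-bit signed integer
CHANNELS = 1              # mono


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive chunks of raw PCM, each covering chunk_ms of audio.

    The last chunk may be shorter if the input is not a multiple of the
    chunk size. The 100 ms default is an illustrative choice only.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    bytes_per_ms = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS // 1000
    size = bytes_per_ms * chunk_ms
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


# One second of silence is 32,000 bytes in this format.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
frames = list(chunk_pcm(one_second, chunk_ms=100))
```

Each element of `frames` would then be sent as one binary WebSocket frame, after the Listening message has arrived.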
Server → Client Messages
Status Messages
Listening
Response:
{
"state": "listening",
"session_id": "abc123-def456-789"
}
Fields:
- state (string): Always "listening"
- session_id (string): Unique session identifier for this transcription session
Description: Indicates the Speech-to-Text engine is ready to receive audio data. You can now start sending binary audio frames.
Stopped
Response:
{
"state": "stopped"
}
Fields:
- state (string): Always "stopped"
Description: Indicates the session has been stopped and all audio has been processed. The WebSocket connection should be closed after receiving this message.
Shutting Down
Response:
{
"state": "shutting_down",
"at": 1674567890
}
Fields:
- state (string): Always "shutting_down"
- at (integer): Unix timestamp (seconds) when the service will shut down
Description: Warning sent approximately one hour before scheduled maintenance/shutdown. You should finish your session or reconnect to a different instance before the specified time.
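A client might turn the `at` field into a remaining-time budget to decide when to wrap up or reconnect. The sketch below assumes a helper named `seconds_until_shutdown`, which is hypothetical; the optional `now` parameter exists only to make the calculation testable.

```python
import time
from typing import Optional


def seconds_until_shutdown(message: dict, now: Optional[float] = None) -> float:
    """Return the seconds left before the announced shutdown (never negative)."""
    if message.get("state") != "shutting_down":
        raise ValueError("not a shutting_down message")
    current = time.time() if now is None else now
    return max(0.0, float(message["at"]) - current)


# Using the sample message above, evaluated one hour before the deadline.
remaining = seconds_until_shutdown(
    {"state": "shutting_down", "at": 1674567890},
    now=1674564290,
)
```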
Transcription Results
Partial Results
Response:
{
"partial": "Hello this is a test"
}
Fields:
- partial (string): Current transcription text (may change as more audio is processed)
Description: Interim transcription results sent while audio is being processed. These are preliminary and may be updated or replaced by subsequent partial results or final results.
Note: Partial results are only sent when speech is detected. Silence will not generate partial results.
Full Results
Response:
{
"result": [
["Hello", 0, 480, 1.0],
["this", 480, 720, 0.99],
["is", 720, 880, 1.0],
["a", 880, 960, 0.98],
["test", 960, 1280, 0.995]
],
"text": "Hello this is a test"
}
Fields:
- result (array): Array of word-level transcription results
  - Each element: [word, start_ms, stop_ms, confidence]
  - word (string): Transcribed word
  - start_ms (integer): Start time in milliseconds (relative to audio processed)
  - stop_ms (integer): End time in milliseconds (relative to audio processed)
  - confidence (float): Confidence score (0.0 to 1.0)
- text (string): Complete transcribed sentence
- speaker (string, optional): Speaker identifier (if diarization enabled)
- sconf (float, optional): Sentence confidence score
- channel (integer, optional): Audio channel (if multi-channel)
Description: Final transcription results sent when the engine is confident about the result. These results are stable and will not change.
Note: Timestamps (start_ms, stop_ms) are relative to the amount of audio processed (including silence), not the actual session duration. This allows sending audio up to 2x real-time speed.
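To make the positional word arrays easier to work with, a client could unpack them into named fields. The following is a minimal sketch; the `Word` and `FinalResult` classes and the `parse_final` function are names invented for this example, and the optional fields default to `None` when absent.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start_ms: int
    stop_ms: int
    confidence: float


@dataclass
class FinalResult:
    text: str
    words: List[Word]
    speaker: Optional[str] = None   # present only if diarization is enabled
    sconf: Optional[float] = None   # optional sentence confidence
    channel: Optional[int] = None   # present only for multi-channel audio


def parse_final(message: dict) -> FinalResult:
    """Convert a final result message into typed objects."""
    words = [Word(*entry) for entry in message["result"]]
    return FinalResult(
        text=message["text"],
        words=words,
        speaker=message.get("speaker"),
        sconf=message.get("sconf"),
        channel=message.get("channel"),
    )


raw = (
    '{"result": [["Hello", 0, 480, 1.0], ["this", 480, 720, 0.99], '
    '["is", 720, 880, 1.0], ["a", 880, 960, 0.98], ["test", 960, 1280, 0.995]], '
    '"text": "Hello this is a test"}'
)
final = parse_final(json.loads(raw))
```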
Streaming Format (Alternative)
The service may also return streaming-style messages:
Streaming Partial
Response:
{
"text": "Hello this is",
"is_final": false,
"offset_ms": 0,
"stability": 0.87
}
Fields:
- text (string): Current transcription text
- is_final (boolean): Always false for partial results
- offset_ms (integer): Session offset for video/audio alignment
- stability (float): Confidence/stability score (0.0 to 1.0)
Streaming Final
Response:
{
"text": "Hello this is a test",
"is_final": true,
"offset_ms": 0,
"words": [
["Hello", 0, 480, 1.0],
["this", 480, 720, 0.99],
["is", 720, 880, 1.0],
["a", 880, 960, 0.98],
["test", 960, 1280, 0.995]
]
}
Fields:
- text (string): Complete transcribed text
- is_final (boolean): Always true for final results
- offset_ms (integer): Session offset for video/audio alignment
- words (array): Word-level timestamps (same format as standard result)
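Because the service may send either message style, a client may want to map both onto one shape before further processing. The sketch below does that with a hypothetical `normalize` function; the target shape (`text`, `is_final`, `words`) is an assumption chosen for the example and drops fields such as `offset_ms` and `stability`.

```python
def normalize(message: dict) -> dict:
    """Map a standard or streaming transcription message to a common shape."""
    if "is_final" in message:  # streaming style
        return {
            "text": message["text"],
            "is_final": message["is_final"],
            "words": message.get("words", []),
        }
    if "partial" in message:   # standard partial result
        return {"text": message["partial"], "is_final": False, "words": []}
    if "result" in message:    # standard final result
        return {
            "text": message["text"],
            "is_final": True,
            "words": message["result"],
        }
    raise ValueError("not a transcription message")


words = [["Hello", 0, 480, 1.0], ["world", 480, 960, 0.99]]
standard = normalize({"result": words, "text": "Hello world"})
streaming = normalize(
    {"text": "Hello world", "is_final": True, "offset_ms": 0, "words": words}
)
```

With this mapping, `standard` and `streaming` compare equal, so downstream code does not need to know which style the server used.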
Error Messages
Response:
{
"error": "error_description"
}
Common Error Messages:
| Error Message | Description |
|---|---|
| "Session not started" | Audio data was sent before starting a session with {"action": "start"} |
| "Invalid message format" | The JSON message sent by the client is malformed or invalid |
| "engine already listening" | Attempted to start a session while one is already active |
| "restarting of sessions is not supported" | Attempted to start a new session after stopping. Must disconnect and reconnect. |
| "model not loaded, contact administrator!" | The requested language model is not available |
| "unable to load model" | Failed to load the transcription model |
| "session is already initializing, please wait" | Start request sent while session is still initializing |
| "An error occurred processing the message" | General server error during message processing |
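One possible way to surface these messages in client code is to convert them into exceptions. The sketch below is illustrative only: `SpeechServiceError`, `raise_for_error`, and the `reconnect_required` flag are hypothetical names, and the flag is set only for the one error the table explicitly says requires a reconnect.

```python
# Error strings that, per the table above, require a new connection.
RECONNECT_REQUIRED = {"restarting of sessions is not supported"}


class SpeechServiceError(Exception):
    """Hypothetical exception wrapping an Error message from the server."""

    def __init__(self, description: str) -> None:
        super().__init__(description)
        self.description = description
        self.reconnect_required = description in RECONNECT_REQUIRED


def raise_for_error(message: dict) -> None:
    """Raise SpeechServiceError if the message carries an "error" field."""
    if "error" in message:
        raise SpeechServiceError(message["error"])
```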
Connection Authentication Errors
If authentication fails during connection, the WebSocket will be closed with specific close codes:
| Close Code | Reason | Description |
|---|---|---|
| 4400 | invalid_language | The requested language is not supported or invalid |
| 4402 | no_subscription_found | No active subscription found for the organization |
| 4403 | invalid_s2t_token | Invalid or missing API token |
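A client can translate these close codes into readable diagnostics with a simple lookup. The mapping below copies the table above; the `describe_close` helper and its fallback text for unlisted codes are assumptions for the example.

```python
# Close codes and reasons as listed in the table above.
CLOSE_REASONS = {
    4400: ("invalid_language", "The requested language is not supported or invalid"),
    4402: ("no_subscription_found", "No active subscription found for the organization"),
    4403: ("invalid_s2t_token", "Invalid or missing API token"),
}


def describe_close(code: int) -> str:
    """Return a human-readable explanation for a WebSocket close code."""
    if code in CLOSE_REASONS:
        reason, description = CLOSE_REASONS[code]
        return f"{code} {reason}: {description}"
    # Codes outside the table are not documented here.
    return f"{code}: not a documented authentication close code"
```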
Example Message Flow
1. Client connects to wss://realtime.scriptix.io/v2/realtime?token=xxx&language=en
2. Client → Server: {"action": "start"}
3. Server → Client: {"state": "listening", "session_id": "abc123"}
4. Client → Server: <binary audio data>
5. Server → Client: {"partial": "Hello"}
6. Client → Server: <binary audio data>
7. Server → Client: {"partial": "Hello world"}
8. Server → Client: {"result": [["Hello", 0, 480, 1.0], ["world", 480, 960, 0.99]], "text": "Hello world"}
9. Client → Server: {"action": "stop"}
10. Server → Client: {"state": "stopped"}
11. Client closes WebSocket connection
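The server side of the flow above can be replayed offline to show how a client might dispatch on message type and build up a transcript. This is a sketch under assumptions: `classify` and `Transcript` are hypothetical names, only the standard (non-streaming) message format is handled, and no network connection is made.

```python
import json
from typing import List


def classify(message: dict) -> str:
    """Return a label for a server message based on the fields it carries."""
    if "error" in message:
        return "error"
    if "state" in message:
        return message["state"]  # "listening", "stopped", "shutting_down"
    if "partial" in message:
        return "partial"
    if "result" in message:
        return "final"
    return "unknown"


class Transcript:
    """Keeps committed final sentences plus the latest partial text."""

    def __init__(self) -> None:
        self.committed: List[str] = []
        self.pending = ""

    def feed(self, raw: str) -> str:
        message = json.loads(raw)
        kind = classify(message)
        if kind == "partial":
            # Each partial replaces the previous one.
            self.pending = message["partial"]
        elif kind == "final":
            # A final result is stable, so commit it and clear the partial.
            self.committed.append(message["text"])
            self.pending = ""
        return kind

    def text(self) -> str:
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)


# Server messages from steps 3, 5, 7, 8, and 10 of the flow above.
server_messages = [
    '{"state": "listening", "session_id": "abc123"}',
    '{"partial": "Hello"}',
    '{"partial": "Hello world"}',
    '{"result": [["Hello", 0, 480, 1.0], ["world", 480, 960, 0.99]], "text": "Hello world"}',
    '{"state": "stopped"}',
]
transcript = Transcript()
kinds = [transcript.feed(raw) for raw in server_messages]
```

After the replay, `kinds` lists the message types in arrival order and `transcript.text()` holds the committed sentence.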