Speech to Text - Hamsa API

Hamsa Speech to Text (STT) transcribes Arabic speech across multiple dialects into text with word-level timestamps and speaker identification. Whether you’re transcribing media content, building voice applications, or documenting conversations, Hamsa STT delivers high-accuracy Arabic speech recognition.

Overview

API Reference

Technical API documentation for developers

Quickstart

Get started with STT in minutes

Key features

Arabic dialect recognition

Hamsa STT is optimized for Arabic speech:

Automatic dialect detection: Set language to ar and the model detects the dialect automatically
Code-switching: Natural handling of mixed Arabic-English speech
Colloquial expressions: Recognition of dialect-specific idioms and expressions

Advanced transcription features

Word-level timestamps: Precise timing for each transcribed word — each segment includes word text plus start/end times
Word highlight during playback: In the Media Platform, the current word highlights in sync with playback; click any word to seek
Speaker diarization: Identification of different speakers in multi-speaker audio
Automatic punctuation: Natural punctuation and formatting
SRT subtitle export: Generate formatted subtitles with configurable line/duration options
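
As an illustration of what word-level timestamps make possible, the sketch below groups timed words into SRT cues on the client side. It is not the API's built-in SRT export. The field names text, start, and end (in seconds) are assumptions about the response shape, and the max_words option is hypothetical.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7) -> str:
    """Group timed words into numbered SRT cues of at most max_words words.

    Each word is assumed to look like {"text": str, "start": float, "end": float}.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["text"] for w in chunk)
        start = format_srt_time(chunk[0]["start"])
        end = format_srt_time(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + ("\n" if cues else "")
```

Each cue spans from the start of its first word to the end of its last word, so cue timing follows the audio rather than a fixed duration.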

Flexible integration

Batch API (/v1/jobs/transcribe) — async transcription from media URLs with webhook delivery
Realtime API (/v1/realtime/stt) — synchronous transcription from base64-encoded audio
WebSocket (/v1/realtime/ws) — streaming transcription for real-time applications
Media Platform — web interface for upload, transcribe, and review

API endpoints

Batch API

Async: /v1/jobs/transcribe

Submit a media URL for transcription. Results are delivered via webhook.

Parameters: mediaUrl, model, language, webhookUrl
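
A minimal sketch of a Batch API request, built with the Python standard library. The four body fields come from the parameter list above. The base URL is a placeholder and the Authorization header scheme is an assumption, so check the API Reference for the real values. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Placeholder host: replace with the base URL from the API Reference.
BASE_URL = "https://api.example.com"


def build_batch_request(api_key, media_url, webhook_url,
                        model="Hamsa-General-V2.0", language="ar"):
    """Build (but do not send) a POST request for /v1/jobs/transcribe."""
    payload = {
        "mediaUrl": media_url,
        "model": model,
        "language": language,
        "webhookUrl": webhook_url,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/jobs/transcribe",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; the real header format may differ.
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )
```

To submit the job, pass the returned object to urllib.request.urlopen. Because the endpoint is async, the transcription itself arrives later at the webhook URL.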

Realtime API

Sync: /v1/realtime/stt

Send base64-encoded audio and get the transcription back directly.

Parameters: audioBase64, language, isEosEnabled
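
A minimal sketch of preparing a Realtime API request body. It base64-encodes raw audio bytes and fills in the three parameters listed above. This page does not define isEosEnabled, so the sketch passes it through as a boolean without interpreting it, and the False default is an assumption.

```python
import base64
import json


def build_realtime_payload(audio_bytes: bytes, language: str = "ar",
                           is_eos_enabled: bool = False) -> str:
    """Return a JSON request body for /v1/realtime/stt."""
    payload = {
        # Base64 output is ASCII, so decoding to str is safe for JSON.
        "audioBase64": base64.b64encode(audio_bytes).decode("ascii"),
        "language": language,
        "isEosEnabled": is_eos_enabled,
    }
    return json.dumps(payload)
```

A typical call reads the file first, for example build_realtime_payload(open("clip.wav", "rb").read()), where clip.wav is a hypothetical local file.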

Models

Model ID	Best for
`Hamsa-General-V2.0`	General-purpose — media, podcasts, pre-recorded content
`Hamsa-Conversational-V1.0`	Conversational audio — meetings, calls, dialogues

Supported languages

The API accepts two language codes:

Code	Language
`ar`	Arabic (all dialects — auto-detected)
`en`	English

Arabic dialect detection is automatic, so you do not need to specify a dialect. Set language to ar and the model handles Egyptian, Gulf, Levantine, Iraqi, and other dialects.
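
Since only two codes are accepted, a client can validate the language value before sending a request. The helper below is a hypothetical client-side check, not part of the API. It rejects dialect-style codes such as ar-EG, because the dialect is detected automatically.

```python
SUPPORTED_LANGUAGES = {
    "ar": "Arabic (all dialects, auto-detected)",
    "en": "English",
}


def normalize_language(code: str) -> str:
    """Return a lowercase supported language code, or raise ValueError."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {code!r}; use one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    return normalized
```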

Use cases

Media transcription

Transcribe Arabic podcasts, videos, and media content:

Generate subtitles for videos (with SRT export)
Create searchable transcripts
Content analysis and indexing

Voice agents

Power real-time conversational AI:

Customer service voice agents
Live call transcription
Conversation analytics

Meeting documentation

Document Arabic meetings and interviews:

Automatic meeting minutes with speaker identification
Searchable archives
Compliance and record-keeping

Content accessibility

Make Arabic audio content accessible:

Closed captions for videos
Transcripts for audio content
Translation preparation

Getting started

Choose your integration

Use the Batch API for pre-recorded media, the Realtime API for synchronous transcription of base64-encoded audio, or the WebSocket API for streaming.

Select a model

Use Hamsa-General-V2.0 for general transcription or Hamsa-Conversational-V1.0 for conversational audio.
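
The choice can be captured in a small helper. The model IDs are the two listed in the Models table. The function name and its boolean flag are illustrative only.

```python
GENERAL_MODEL = "Hamsa-General-V2.0"
CONVERSATIONAL_MODEL = "Hamsa-Conversational-V1.0"


def select_model(conversational: bool) -> str:
    """Pick the conversational model for calls and meetings, else the general one."""
    return CONVERSATIONAL_MODEL if conversational else GENERAL_MODEL
```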

Submit your audio

Provide a media URL (batch) or base64-encoded audio (realtime), and get your transcription with timestamps and speaker information.

Next steps

Quickstart Guide

Build your first STT integration

WebSocket API

Real-time streaming transcription

Improving Accuracy

Tips for better transcription accuracy

Media Platform

Use STT via web interface

FAQ

What's the difference between the Batch API and Realtime API?

The Batch API (/v1/jobs/transcribe) is async — submit a media URL and receive results via webhook. Use it for pre-recorded files. The Realtime API (/v1/realtime/stt) accepts base64-encoded audio and returns the transcription directly. For streaming, use the WebSocket API.
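
Because batch results arrive via webhook, your service needs an HTTP endpoint to receive them. The sketch below uses only the Python standard library. The webhook payload schema is not described on this page, so the handler assumes only that the body is a JSON object, and it just logs the top-level keys. The port and the 204 response are also assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_webhook_body(raw: bytes) -> dict:
    """Decode a webhook body, assuming it is a UTF-8 JSON object."""
    data = json.loads(raw.decode("utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = parse_webhook_body(self.rfile.read(length))
        except ValueError:  # also covers JSON and Unicode decode errors
            self.send_response(400)
            self.end_headers()
            return
        # Replace with real handling once the payload fields are known.
        print("received keys:", sorted(result))
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Call serve() to start listening, and pass the public URL of this endpoint as webhookUrl when submitting a batch job.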

Do I need to specify the Arabic dialect?

No. Set language to ar and the model automatically detects the specific dialect (Egyptian, Gulf, Levantine, etc.) and transcribes accordingly.

Can the model handle Arabic-English code-switching?

Yes, the models handle speech that switches between Arabic and English, which is common in many Arabic-speaking regions.

Which model should I use?

Use Hamsa-General-V2.0 for general-purpose transcription of media and pre-recorded content. Use Hamsa-Conversational-V1.0 for conversational audio like calls and meetings.

Can I get SRT subtitles?

Yes. Set returnSrtFormat to true in the Batch API request. You can customize subtitle formatting with srtOptions. See the Quickstart for details.
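
A sketch of a batch request body with SRT output enabled. It assumes returnSrtFormat and srtOptions sit at the top level of the body alongside the other Batch API parameters. The fields inside srtOptions are not listed on this page, so the sketch forwards whatever dictionary you supply without inventing any keys.

```python
import json
from typing import Any, Dict, Optional


def build_srt_job_payload(media_url: str, webhook_url: str,
                          model: str = "Hamsa-General-V2.0",
                          language: str = "ar",
                          srt_options: Optional[Dict[str, Any]] = None) -> str:
    """Return a JSON body for /v1/jobs/transcribe that requests SRT output."""
    payload: Dict[str, Any] = {
        "mediaUrl": media_url,
        "model": model,
        "language": language,
        "webhookUrl": webhook_url,
        "returnSrtFormat": True,
    }
    if srt_options:
        # Passed through unchanged; see the Quickstart for the valid fields.
        payload["srtOptions"] = srt_options
    return json.dumps(payload)
```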
