Hamsa Speech to Text (STT) accurately transcribes Arabic speech across multiple dialects into text with word-level timestamps and speaker identification. Whether you’re transcribing media content, building voice applications, or documenting conversations, Hamsa STT delivers high-accuracy Arabic speech recognition.

Overview

Key features

Arabic dialect recognition

Our STT models are specifically optimized for Arabic speech:
  • Automatic dialect detection: No need to specify the dialect - the model detects it automatically
  • Multiple dialects: Egyptian, Gulf, Levantine, North African, Iraqi, Yemeni, and Modern Standard Arabic
  • Code-switching: Natural handling of mixed Arabic-English speech
  • Colloquial expressions: Recognition of dialect-specific idioms and expressions

Advanced transcription features

  • Word-level timestamps: Precise timing for each transcribed word
  • Speaker diarization: Identification of different speakers in multi-speaker audio
  • Automatic punctuation: Natural punctuation and formatting
  • High accuracy: Optimized models for Arabic speech patterns

Flexible integration

  • REST API for batch transcription
  • WebSocket API for real-time streaming
  • Web interface via Media Platform
  • Multiple audio format support (MP3, WAV, M4A, FLAC, OGG, PCM)

Use cases

Media transcription

Transcribe Arabic podcasts, videos, and media content:
  • Generate subtitles for videos
  • Create searchable transcripts
  • Content analysis and indexing
  • Accessibility features
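
As an illustration of the subtitle use case, word-level timestamps can be grouped into SRT cues. The sketch below is a minimal example under stated assumptions: the word list with `word`, `start`, and `end` keys (times in seconds) is a hypothetical shape, not the documented Hamsa response format, and the seven-words-per-cue limit is an arbitrary choice.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7) -> str:
    """Group consecutive words into numbered cues of at most max_words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_srt_time(chunk[0]["start"])
        end = format_srt_time(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical word-level output; the real field names may differ.
sample_words = [
    {"word": "مرحبا", "start": 0.0, "end": 0.48},
    {"word": "بكم", "start": 0.52, "end": 0.9},
    {"word": "في", "start": 0.95, "end": 1.1},
]
print(words_to_srt(sample_words, max_words=2))
```

A production subtitle generator would usually also break cues at pauses and punctuation rather than at a fixed word count.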

Voice agents

Power real-time conversational AI:
  • Customer service voice agents
  • Live call transcription
  • Real-time language understanding
  • Conversation analytics

Meeting documentation

Document Arabic meetings and interviews:
  • Automatic meeting minutes
  • Speaker identification
  • Searchable archives
  • Compliance and record-keeping
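
For meeting minutes, diarized output is often easier to read when consecutive words from the same speaker are merged into turns. The following is a minimal sketch, assuming each word carries a `speaker` label alongside its `word`, `start`, and `end` fields; these field names are hypothetical and not taken from the Hamsa API.

```python
from typing import Dict, List


def group_by_speaker(words: List[Dict]) -> List[Dict]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: List[Dict] = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + w["word"]
            turns[-1]["end"] = w["end"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": w["word"],
            })
    return turns


# Hypothetical diarized words; the labels and field names are assumptions.
diarized = [
    {"word": "hello", "start": 0.0, "end": 0.4, "speaker": "A"},
    {"word": "everyone", "start": 0.5, "end": 1.0, "speaker": "A"},
    {"word": "thanks", "start": 1.3, "end": 1.7, "speaker": "B"},
]
for turn in group_by_speaker(diarized):
    print(f'[{turn["start"]:.1f}-{turn["end"]:.1f}] {turn["speaker"]}: {turn["text"]}')
```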

Content accessibility

Make Arabic audio content accessible:
  • Closed captions for videos
  • Transcripts for audio content
  • Search and discovery features
  • Translation preparation

Models

Hamsa offers two STT models optimized for different use cases:

STT Standard

Best for accuracy

High-accuracy Arabic speech recognition with speaker diarization and detailed timestamps.
  • Up to 60 minutes per file
  • Word-level timestamps
  • Speaker diarization
  • Highest accuracy
  • Batch processing

STT Realtime

Best for speed

Ultra-fast streaming transcription for real-time voice agents and live conversations.
  • Real-time streaming
  • ~150-250 ms latency
  • Word-level timestamps
  • Continuous transcription
  • WebSocket support
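
Streaming clients typically send audio in small fixed-duration chunks. The sketch below shows only the chunking arithmetic for raw 16-bit mono PCM at 16 kHz. The 100 ms chunk duration is an assumption chosen for illustration, and the WebSocket message framing is deliberately left out because it is not described on this page.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # matches the recommended minimum sample rate
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # mono


def chunk_size_bytes(chunk_ms: int) -> int:
    """Number of bytes of PCM audio covering chunk_ms milliseconds."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def iter_pcm_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield successive chunks of raw PCM; the last chunk may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
chunks = list(iter_pcm_chunks(one_second_of_silence))
print(len(chunks), len(chunks[0]))
```

Smaller chunks reduce the delay before audio reaches the server but increase the number of messages, so the chunk duration is a trade-off to tune against the latency you need.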

Supported languages

Arabic Dialects:
  • Egyptian Arabic (arz)
  • Gulf Arabic (afb) - Saudi, UAE, Kuwait, Bahrain, Qatar, Oman
  • Levantine Arabic (apc) - Syrian, Lebanese, Jordanian, Palestinian
  • North African Arabic (arq/ary) - Moroccan, Algerian, Tunisian, Libyan
  • Iraqi Arabic (acm)
  • Yemeni Arabic (ayn)
  • Modern Standard Arabic (arb)

Other Languages:
  • English (US) (eng)

The models detect the dialect automatically; no manual selection is required.

Audio formats

Hamsa STT supports multiple input formats:
  • MP3: Standard compressed audio
  • WAV: Uncompressed audio (recommended)
  • M4A: MPEG-4 audio files
  • FLAC: Lossless compression
  • OGG: Ogg Vorbis audio
  • PCM: Raw audio data (16-bit)

Recommended specifications:
  • Sample rate: 16 kHz or higher
  • Bit depth: 16-bit minimum
  • Channels: Mono (recommended for best results)
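
The recommended specifications above can be checked locally before upload. The following is a minimal sketch for WAV files using Python's standard `wave` module; the function name `check_wav` and the warning messages are illustrative, and other formats such as MP3 or FLAC would need a different decoder.

```python
import io
import wave
from typing import BinaryIO, List, Union

MIN_SAMPLE_RATE_HZ = 16_000    # recommended: 16 kHz or higher
MIN_SAMPLE_WIDTH_BYTES = 2     # recommended: 16-bit minimum


def check_wav(source: Union[str, BinaryIO]) -> List[str]:
    """Return a list of issues; an empty list means the file meets the recommendations."""
    issues: List[str] = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() < MIN_SAMPLE_RATE_HZ:
            issues.append(f"sample rate {wav.getframerate()} Hz is below 16 kHz")
        if wav.getsampwidth() < MIN_SAMPLE_WIDTH_BYTES:
            issues.append(f"bit depth {wav.getsampwidth() * 8}-bit is below 16-bit")
        if wav.getnchannels() != 1:
            issues.append(f"{wav.getnchannels()} channels; mono is recommended")
    return issues


def make_test_wav(rate: int, width: int, channels: int) -> io.BytesIO:
    """Build a short silent WAV in memory, only to exercise check_wav."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (width * channels * 100))
    buf.seek(0)
    return buf


print(check_wav(make_test_wav(16_000, 2, 1)))   # meets all recommendations
print(check_wav(make_test_wav(8_000, 2, 2)))    # low sample rate, stereo
```

`check_wav` also accepts a file path, for example `check_wav("interview.wav")`.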

Getting started

  1. Choose your integration: Use the Media Platform for a web interface, or the API for programmatic access.
  2. Prepare your audio: Make sure the audio is in a supported format and of good quality (16 kHz or higher, minimal noise).
  3. Select a model: Use STT Standard for batch transcription or STT Realtime for live transcription.
  4. Upload or stream: Upload pre-recorded files or stream live audio, depending on your use case.
  5. Review transcription: Get your transcription with timestamps and speaker information.
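
The format check and model choice from these steps can be expressed as a small pre-flight helper. This is a sketch only: the returned strings are the model display names used on this page, not API identifiers, and the extension set is an assumption derived from the listed formats (raw PCM is omitted because it has no standard file extension).

```python
from pathlib import Path

# Extensions assumed from the supported formats listed on this page.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def is_supported_format(filename: str) -> bool:
    """Check the file extension against the supported container formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def choose_model(live: bool) -> str:
    """Pick a model display name: Realtime for live audio, Standard for files."""
    return "STT Realtime" if live else "STT Standard"


print(is_supported_format("episode_12.MP3"), choose_model(live=False))
print(is_supported_format("notes.txt"), choose_model(live=True))
```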

FAQ

Use STT Standard for batch transcription of pre-recorded audio when accuracy is paramount. Use STT Realtime for live transcription in voice agents, real-time calls, or streaming audio applications.
You do not need to specify the dialect. The models detect it automatically, identifying whether the speech is Egyptian, Gulf, Levantine, or another dialect, and transcribe accordingly.
Mixed Arabic-English speech is supported. The models handle speech that switches between Arabic and English, which is common in many Arabic-speaking regions, and the transcription captures both languages.
Speaker diarization works well for up to 4-5 distinct speakers with clear audio. Use STT Standard for best diarization results. Accuracy is highest with good audio quality and minimal speaker overlap.
STT Standard supports audio files up to 60 minutes long. For longer audio, split the file into segments of 60 minutes or less. STT Realtime supports continuous streaming with no duration limit.
Use high-quality audio (16 kHz or higher), minimize background noise, ensure clear speech, and avoid speaker overlap. See our improving accuracy guide for detailed tips.
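
For recordings longer than the 60-minute STT Standard limit, the segment boundaries can be planned before cutting the file. The sketch below only computes (start, end) offsets in seconds; the function name `plan_segments` is illustrative, and the actual cutting would be done with an audio tool of your choice.

```python
from typing import List, Tuple

MAX_SEGMENT_SECONDS = 60 * 60   # STT Standard limit: 60 minutes per file


def plan_segments(
    duration_s: float, max_s: float = MAX_SEGMENT_SECONDS
) -> List[Tuple[float, float]]:
    """Split [0, duration_s] into consecutive segments no longer than max_s."""
    segments: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


# A 2.5-hour recording (9000 seconds) needs three segments.
for start, end in plan_segments(9000.0):
    print(f"{start:.0f}s -> {end:.0f}s")
```

Cutting at a pause rather than exactly on the boundary avoids splitting a word in two. When merging results, timestamps from each segment will presumably be relative to that segment, so adding the segment's start offset restores positions in the original recording.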
