This is the leaderboard for MSEB, the Massive Speech Embedding Benchmark.
For more information, see the MSEB GitHub repository.

| Statistic | Value |
|---|---|
| Tasks | 882 |
| Task Types | 12 |
| Languages | 37 |

| Encoder Name | classification (Transcript Truth) | classification (Cascaded ASR) | classification (Audio) | reasoning (Transcript Truth) | reasoning (Cascaded ASR) | reasoning (Audio) | reranking (Transcript Truth) | reranking (Cascaded ASR) | reranking (Audio) | retrieval (Transcript Truth) | retrieval (Cascaded ASR) | retrieval (Audio) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gecko | 0.44 | 0.42 (Δ-0.02, -4.74%) | N/A | 0.08 | 0.06 (Δ-0.02, -27.21%) | N/A | 1.00 | 0.51 (Δ-0.49, -49.10%) | N/A | 0.51 | 0.34 (Δ-0.17, -33.29%) | N/A |
| gemini | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.72 | 0.58 (Δ-0.15, -20.46%) | 0.57 |
| gemini_embedding_001 | 0.83 | 0.80 (Δ-0.03, -3.79%) | N/A | 0.04 | 0.03 (Δ-0.01, -25.00%) | N/A | 1.00 | 0.60 (Δ-0.40, -39.86%) | N/A | 0.67 | 0.51 (Δ-0.17, -24.92%) | 0.56 |
| gemma | N/A | N/A | N/A | 0.53 | 0.47 (Δ-0.07, -12.36%) | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gpt | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.23 | 0.65 | 0.40 (Δ-0.26, -39.51%) | 0.52 |
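The Cascaded ASR cells above report the absolute and relative drop from the Transcript Truth baseline. A minimal sketch of that formatting (the function name is mine, not from the MSEB codebase, and the table's percentages appear to be computed from unrounded scores, so this toy example can differ slightly from the cells above):

```python
def degradation(baseline: float, score: float) -> str:
    """Format a score alongside its absolute and relative drop vs. a baseline."""
    delta = score - baseline
    pct = 100.0 * delta / baseline  # relative change in percent
    return f"{score:.2f} (\u0394{delta:+.2f}, {pct:+.2f}%)"

# e.g. gecko classification: 0.44 (Transcript Truth) vs. 0.42 (Cascaded ASR)
print(degradation(0.44, 0.42))  # → 0.42 (Δ-0.02, -4.55%)
```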

| Rank | Encoder Name | classification (mean) | clustering (mean) | reranking (mean) | retrieval (mean) | segmentation (mean) | transcription (mean) |
|---|---|---|---|---|---|---|---|
| 1 | perch | 0.58 ± 0.00 | 0.39 ± 0.00 | N/A | N/A | N/A | N/A |
| 2 | clap | 0.43 ± 0.00 | 0.50 ± 0.25 | N/A | N/A | N/A | N/A |
| 3 | elevenlabs | N/A | N/A | N/A | N/A | N/A | 0.29 ± 0.05 |
| 4 | gemini | N/A | N/A | N/A | 0.55 ± 0.10 | N/A | 0.36 ± 0.05 |
| 5 | gemini_3_flash_preview_with_title_and_context | N/A | N/A | 0.98 ± 0.00 | 0.57 ± 0.06 | N/A | N/A |
| 6 | gemini_3_flash_preview_with_title_and_context_gpt_4o_transcribe | N/A | N/A | N/A | 0.57 ± 0.06 | N/A | N/A |
| 7 | gemini_3_flash_preview_with_title_and_context_transcript_truth | N/A | N/A | N/A | 0.69 ± 0.05 | N/A | N/A |
| 8 | gemini_embedding_001 | N/A | N/A | N/A | 0.57 ± 0.08 | N/A | N/A |
| 9 | gpt | N/A | N/A | 0.23 ± 0.01 | 0.52 ± 0.10 | N/A | 0.28 ± 0.03 |
| 10 | hubert | N/A | 0.48 ± 0.38 | N/A | N/A | N/A | N/A |
| 11 | litellm | N/A | N/A | 0.69 ± 0.01 | 0.50 ± 0.04 | N/A | N/A |
| 12 | raw_spectrogram_25ms_10ms_mean | N/A | 0.48 ± 0.37 | N/A | N/A | N/A | N/A |
| 13 | wav2vec2 | N/A | 0.60 ± 0.37 | N/A | N/A | N/A | N/A |
| 14 | whisper | N/A | N/A | N/A | N/A | 0.40 ± 0.03 | 0.33 ± 0.03 |
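Each cell in the ranking table above aggregates an encoder's per-task scores into mean ± standard deviation. A minimal sketch of that aggregation (whether MSEB uses population or sample standard deviation is an assumption; this uses population):

```python
import math

def mean_std(scores: list[float]) -> str:
    """Aggregate per-task scores into the 'mean ± std' cell format."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)  # population variance
    return f"{mean:.2f} \u00b1 {math.sqrt(var):.2f}"

# hypothetical per-task clustering scores for one encoder
print(mean_std([0.76, 0.24, 0.85]))  # → 0.62 ± 0.27
```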

| Model | BirdSet/clustering/clustering (VMeasure) [?] | FSD50K/clustering/sound_event (VMeasure) [?] | SVQ/clustering/speaker_age (VMeasure) [?] | SVQ/clustering/speaker_gender (VMeasure) [?] | SVQ/clustering/speaker_id (VMeasure) [?] |
|---|---|---|---|---|---|
| clap | 0.32 | 0.68 | N/A | N/A | N/A |
| hubert | 0.08 | N/A | 0.76 | 0.24 | 0.85 |
| perch | 0.39 | N/A | N/A | N/A | N/A |
| raw_spectrogram_25ms_10ms_mean | 0.09 | N/A | 0.74 | 0.24 | 0.85 |
| wav2vec2 | N/A | N/A | 0.76 | 0.18 | 0.87 |

| Model | None/reranking/query_reranking (MAP) [?] | None/reranking/query_reranking:background_speech (MAP) [?] | None/reranking/query_reranking:clean (MAP) [?] | None/reranking/query_reranking:media_noise (MAP) [?] | None/reranking/query_reranking:traffic_noise (MAP) [?] | SVQ/reranking/query_reranking (MAP) [?] | SVQ/reranking/query_reranking:background_speech (MAP) [?] | SVQ/reranking/query_reranking:clean (MAP) [?] | SVQ/reranking/query_reranking:media_noise (MAP) [?] | SVQ/reranking/query_reranking:traffic_noise (MAP) [?] |
|---|---|---|---|---|---|---|---|---|---|---|
| gemini_3_flash_preview_with_title_and_context | 0.98 | 0.99 | 0.99 | 0.98 | 0.98 | N/A | N/A | N/A | N/A | N/A |
| gpt | N/A | N/A | N/A | N/A | N/A | 0.24 | 0.23 | 0.24 | 0.24 | 0.24 |
| litellm | N/A | N/A | N/A | N/A | N/A | 0.69 | 0.68 | 0.70 | 0.69 | 0.69 |

| Model | None/retrieval/document_retrieval_cross_lang (MRR) [?] | None/retrieval/document_retrieval_cross_lang:background_speech (MRR) [?] | None/retrieval/document_retrieval_cross_lang:clean (MRR) [?] | None/retrieval/document_retrieval_cross_lang:media_noise (MRR) [?] | None/retrieval/document_retrieval_cross_lang:traffic_noise (MRR) [?] | None/retrieval/document_retrieval_in_lang (MRR) [?] | None/retrieval/document_retrieval_in_lang:background_speech (MRR) [?] | None/retrieval/document_retrieval_in_lang:clean (MRR) [?] | None/retrieval/document_retrieval_in_lang:media_noise (MRR) [?] | None/retrieval/document_retrieval_in_lang:traffic_noise (MRR) [?] | None/retrieval/passage_retrieval_cross_lang (MRR) [?] | None/retrieval/passage_retrieval_cross_lang:background_speech (MRR) [?] | None/retrieval/passage_retrieval_cross_lang:clean (MRR) [?] | None/retrieval/passage_retrieval_cross_lang:media_noise (MRR) [?] | None/retrieval/passage_retrieval_cross_lang:traffic_noise (MRR) [?] | SVQ/retrieval/document_retrieval_cross_lang (MRR) [?] | SVQ/retrieval/document_retrieval_cross_lang:background_speech (MRR) [?] | SVQ/retrieval/document_retrieval_cross_lang:clean (MRR) [?] | SVQ/retrieval/document_retrieval_cross_lang:media_noise (MRR) [?] | SVQ/retrieval/document_retrieval_cross_lang:traffic_noise (MRR) [?] | SVQ/retrieval/document_retrieval_in_lang (MRR) [?] | SVQ/retrieval/document_retrieval_in_lang:background_speech (MRR) [?] | SVQ/retrieval/document_retrieval_in_lang:clean (MRR) [?] | SVQ/retrieval/document_retrieval_in_lang:media_noise (MRR) [?] | SVQ/retrieval/document_retrieval_in_lang:traffic_noise (MRR) [?] | SVQ/retrieval/passage_retrieval_cross_lang (MRR) [?] | SVQ/retrieval/passage_retrieval_cross_lang:background_speech (MRR) [?] | SVQ/retrieval/passage_retrieval_cross_lang:clean (MRR) [?] | SVQ/retrieval/passage_retrieval_cross_lang:media_noise (MRR) [?] | SVQ/retrieval/passage_retrieval_cross_lang:traffic_noise (MRR) [?] | SVQ/retrieval/passage_retrieval_in_lang (MRR) [?] | SVQ/retrieval/passage_retrieval_in_lang:background_speech (MRR) [?] | SVQ/retrieval/passage_retrieval_in_lang:clean (MRR) [?] | SVQ/retrieval/passage_retrieval_in_lang:media_noise (MRR) [?] | SVQ/retrieval/passage_retrieval_in_lang:traffic_noise (MRR) [?] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gemini | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.44 | 0.43 | 0.48 | 0.44 | 0.39 | 0.54 | 0.53 | 0.58 | 0.54 | 0.50 | 0.52 | 0.52 | 0.56 | 0.53 | 0.47 | 0.69 | 0.69 | 0.73 | 0.70 | 0.64 |
| gemini_3_flash_preview_with_title_and_context | 0.51 | 0.52 | 0.55 | 0.52 | 0.45 | 0.56 | 0.54 | 0.60 | 0.56 | 0.51 | 0.64 | 0.64 | 0.68 | 0.64 | 0.59 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gemini_3_flash_preview_with_title_and_context_gpt_4o_transcribe | 0.51 | 0.52 | 0.54 | 0.50 | 0.47 | 0.58 | 0.57 | 0.62 | 0.59 | 0.54 | 0.63 | 0.63 | 0.67 | 0.63 | 0.57 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gemini_3_flash_preview_with_title_and_context_transcript_truth | 0.63 | 0.63 | 0.63 | 0.63 | 0.63 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gemini_embedding_001 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.45 | 0.45 | 0.49 | 0.45 | 0.41 | 0.61 | 0.61 | 0.66 | 0.62 | 0.56 | 0.57 | 0.57 | 0.61 | 0.57 | 0.51 | 0.64 | 0.64 | 0.69 | 0.65 | 0.59 |
| gpt | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.42 | 0.42 | 0.45 | 0.42 | 0.38 | 0.43 | 0.42 | 0.46 | 0.43 | 0.39 | 0.60 | 0.60 | 0.65 | 0.60 | 0.55 | 0.63 | 0.63 | 0.68 | 0.64 | 0.59 |
| litellm | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.48 | 0.47 | 0.53 | 0.47 | 0.46 | 0.52 | 0.53 | 0.56 | 0.53 | 0.48 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |

| Model | SVQ/transcription/speech_transcription (WER) [?] | SVQ/transcription/speech_transcription:background_speech (WER) [?] | SVQ/transcription/speech_transcription:clean (WER) [?] | SVQ/transcription/speech_transcription:media_noise (WER) [?] | SVQ/transcription/speech_transcription:traffic_noise (WER) [?] |
|---|---|---|---|---|---|
| elevenlabs | 0.29 | 0.35 | 0.22 | 0.28 | 0.30 |
| gemini | 0.36 | 0.43 | 0.29 | 0.35 | 0.37 |
| gpt | 0.28 | 0.28 | 0.24 | 0.28 | 0.34 |
| whisper | 0.33 | 0.35 | 0.28 | 0.32 | 0.37 |

Classification tasks involve assigning one or more predefined labels to an audio segment. These tasks evaluate the model's ability to extract specific features or categories from sound representations, such as intent in speech or species in bioacoustic recordings.
Evaluates the model's ability to identify bird species in soundscapes.
Tasks (See BIRDSET):
Evaluates the model's ability to classify general sound events.
Tasks (See FSD50K):
Evaluates the model's ability to categorize speech into predefined intents (e.g., datetime_query, play_music).
Tasks (See SPEECH_MASSIVE):
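Embedding-based classification like this is typically evaluated by fitting a lightweight classifier on top of frozen embeddings. A toy nearest-centroid sketch (the 2-D embeddings and labels are illustrative, and this is not necessarily MSEB's actual evaluation protocol):

```python
def nearest_centroid(train, test_emb):
    """train: list of (embedding, label) pairs; returns the predicted label."""
    # average the training embeddings of each label into a centroid
    sums, counts = {}, {}
    for emb, label in train:
        counts[label] = counts.get(label, 0) + 1
        sums[label] = [a + b for a, b in zip(sums.get(label, [0.0] * len(emb)), emb)]
    centroids = {lbl: [v / counts[lbl] for v in vec] for lbl, vec in sums.items()}
    # predict the label whose centroid is closest (squared Euclidean distance)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: dist(centroids[lbl], test_emb))

train = [([0.9, 0.1], "play_music"), ([1.0, 0.0], "play_music"),
         ([0.0, 1.0], "datetime_query"), ([0.1, 0.9], "datetime_query")]
print(nearest_centroid(train, [0.8, 0.2]))  # → play_music
```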
Clustering tasks evaluate how well sound embeddings group together based on semantic or acoustic similarity without explicit labels during the grouping process. This tests the inherent structure and separability of the embedding space.
Evaluates the semantic grouping of bird vocalizations in the embedding space.
Tasks (See BIRDSET):
Evaluates how well embeddings group by sound event categories.
Tasks (See FSD50K):
Evaluates how well embeddings group by speaker-related attributes: speaker_id, speaker_gender, and speaker_age.
Tasks (See SVQ):
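The clustering tables above report V-measure. A self-contained sketch of that metric (this matches, to my understanding, scikit-learn's `v_measure_score` with beta = 1 — the harmonic mean of homogeneity and completeness):

```python
from collections import Counter
from math import log

def v_measure(labels_true, labels_pred):
    """V-measure: harmonic mean of homogeneity and completeness."""
    n = len(labels_true)

    def entropy(labels):
        return -sum(c / n * log(c / n) for c in Counter(labels).values())

    def conditional(target, given):  # H(target | given)
        given_sizes = Counter(given)
        return -sum(c / n * log(c / given_sizes[g])
                    for (t, g), c in Counter(zip(target, given)).items())

    h_true, h_pred = entropy(labels_true), entropy(labels_pred)
    homogeneity = 1.0 if h_true == 0 else 1 - conditional(labels_true, labels_pred) / h_true
    completeness = 1.0 if h_pred == 0 else 1 - conditional(labels_pred, labels_true) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# a perfect clustering scores 1.0 regardless of the cluster ids used
print(v_measure(["a", "a", "b", "b"], [1, 1, 0, 0]))  # → 1.0
```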
Reasoning tasks (often implemented as Span Retrieval) require the model to identify specific segments within a text document that directly answer a voice query. This tests deeper semantic understanding and the ability to align fine-grained concepts between speech and text.
The reasoning task requires identifying the exact span of text within a Wikipedia article that answers a voice query.
Tasks (See SVQ):
Given a set of candidate answers, reranking tasks evaluate a model's ability to re-order them such that the most relevant results appear at the top. This is often used to refine the output of a primary retrieval system.
The reranking task assesses a model's ability to refine a list of candidate answers.
Tasks (See SVQ):
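The reranking tables report MAP (mean average precision). A minimal sketch with binary relevance judgments (function names are mine, not from the MSEB codebase):

```python
def average_precision(relevances: list[int]) -> float:
    """AP for one query: `relevances` holds the binary relevance of each
    candidate, listed in the order the model ranked them."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries: list[list[int]]) -> float:
    """MAP: average of per-query AP values."""
    return sum(average_precision(q) for q in queries) / len(queries)

# two toy queries: relevant item ranked 1st, and relevant item ranked 2nd
print(mean_average_precision([[1, 0, 0], [0, 1, 0]]))  # → 0.75
```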
Retrieval tasks evaluate a model's ability to find relevant documents or passages from a large corpus given a voice query. This involves mapping speech and text into a shared embedding space where semantic similarity can be measured, often across different languages.
The SVQ retrieval task evaluates the model's ability to find relevant Wikipedia content given a voice query.
Tasks (See SVQ):
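The retrieval tables report MRR (mean reciprocal rank). A minimal sketch, assuming every query has at least one relevant document in the corpus:

```python
def mean_reciprocal_rank(first_relevant_ranks: list[int]) -> float:
    """MRR over queries: each entry is the 1-based rank at which the first
    relevant document was retrieved for that query."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# first relevant document found at ranks 1, 2, and 4 for three queries
print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 ≈ 0.583
```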
Segmentation tasks involve identifying specific parts of an audio stream, such as salient terms or keywords, and marking their boundaries. This evaluates the model's temporal precision and its ability to distinguish meaningful units of sound.
This task focuses on identifying salient terms within a continuous audio stream.
Tasks (See SVQ):
Transcription tasks (Automatic Speech Recognition) evaluate the model's ability to convert spoken language into written text. This tests the phonological and linguistic coverage of the sound representations.
A standard speech-to-text (ASR) task evaluating the linguistic accuracy of representations.
Tasks (See SVQ):
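The transcription tables report WER (word error rate). A self-contained sketch computing it as word-level edit distance divided by reference length (normalization details such as casing and punctuation handling are omitted here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming (Levenshtein) edit distance over words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution (0 if equal)
        prev = cur
    return prev[-1] / len(ref)

# one substitution ("the" → "a") over six reference words
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ≈ 0.167
```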
BirdSet is a large-scale dataset for bioacoustic monitoring, focusing on bird species classification from audio recordings.
The primary task is Multi-label Classification. Given a 5-second audio segment, the model must predict all bird species present in the recording.
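A common way to turn per-class scores into such a multi-label prediction is independent per-class thresholding; a toy sketch (the species names, scores, and 0.5 threshold are all illustrative, not from BirdSet or MSEB):

```python
def predict_species(scores: dict[str, float], threshold: float = 0.5) -> set[str]:
    """Multi-label prediction: keep every class whose score clears the threshold."""
    return {species for species, score in scores.items() if score >= threshold}

# hypothetical per-species scores for one 5-second segment
scores = {"song_sparrow": 0.91, "american_robin": 0.62, "blue_jay": 0.08}
print(sorted(predict_species(scores)))  # → ['american_robin', 'song_sparrow']
```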
BirdSet includes data from various recording sites and setups, referred to as configurations:
- HSN: High Sierra Nevada
- NBP: NIPS4BPlus
- POW: Powdermill Nature Reserve
- SSW: Sapsucker Woods
- SNE: Sierra Nevada
- PER: Peru
- NES: Northeast
- UHH: Hawaii
- XCM: Xeno-Canto Medium
- XCL: Xeno-Canto Large
FSD50K is an open dataset of human-labeled sound events containing 51,197 audio clips totaling 108.3 hours.
The task is Multi-label Sound Event Classification. Audio clips are labeled using 200 classes from the AudioSet ontology.
Speech MASSIVE is a multilingual dataset for Speech Intent Classification, derived from the MASSIVE text dataset.
The task is Intent Classification. Given an utterance, the model must
categorize it into one of 60 predefined intents (e.g., datetime_query,
iot_hue_lightchange, play_music).
It covers 12 languages and provides a challenging testbed for multilingual speech understanding.
Simple Voice Questions (SVQ) is a multilingual dataset designed for evaluating sound representations. It consists of voice queries in multiple languages based on Wikipedia content.
SVQ recordings are provided in four acoustic conditions:
- clean: High-quality recording.
- media_noise: With background music or other media.
- traffic_noise: With street and vehicle noise.
- background_speech: With other people talking in the background.