MSEB Leaderboard

This is the leaderboard for MSEB, the Massive Speech Embedding Benchmark.

For more information, see the MSEB GitHub repository.

Tasks: 882
Task Types: 12
Languages: 37

Leaderboard Results

Headroom and Audio Comparison

This table compares three input conditions per task type: Transcript Truth (embedding the ground-truth transcript), Cascaded ASR (embedding an ASR transcript of the audio), and Audio (embedding the audio directly). Each Δ gives the absolute and relative drop of Cascaded ASR relative to Transcript Truth.

| Encoder Name | classification (Transcript Truth) | classification (Cascaded ASR) | classification (Audio) | reasoning (Transcript Truth) | reasoning (Cascaded ASR) | reasoning (Audio) | reranking (Transcript Truth) | reranking (Cascaded ASR) | reranking (Audio) | retrieval (Transcript Truth) | retrieval (Cascaded ASR) | retrieval (Audio) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gecko | 0.44 | 0.42 (Δ-0.02, -4.74%) | N/A | 0.08 | 0.06 (Δ-0.02, -27.21%) | N/A | 1.00 | 0.51 (Δ-0.49, -49.10%) | N/A | 0.51 | 0.34 (Δ-0.17, -33.29%) | N/A |
| gemini | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.72 | 0.58 (Δ-0.15, -20.46%) | 0.57 |
| gemini_embedding_001 | 0.83 | 0.80 (Δ-0.03, -3.79%) | N/A | 0.04 | 0.03 (Δ-0.01, -25.00%) | N/A | 1.00 | 0.60 (Δ-0.40, -39.86%) | N/A | 0.67 | 0.51 (Δ-0.17, -24.92%) | 0.56 |
| gemma | N/A | N/A | N/A | 0.53 | 0.47 (Δ-0.07, -12.36%) | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gpt | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.23 | 0.65 | 0.40 (Δ-0.26, -39.51%) | 0.52 |

Audio Encoders

| Rank | Encoder Name | classification (mean) | clustering (mean) | reranking (mean) | retrieval (mean) | segmentation (mean) | transcription (mean) |
|---|---|---|---|---|---|---|---|
| 1 | perch | 0.58 ± 0.00 | 0.39 ± 0.00 | N/A | N/A | N/A | N/A |
| 2 | clap | 0.43 ± 0.00 | 0.50 ± 0.25 | N/A | N/A | N/A | N/A |
| 3 | elevenlabs | N/A | N/A | N/A | N/A | N/A | 0.29 ± 0.05 |
| 4 | gemini | N/A | N/A | N/A | 0.55 ± 0.10 | N/A | 0.36 ± 0.05 |
| 5 | gemini_3_flash_preview_with_title_and_context | N/A | N/A | 0.98 ± 0.00 | 0.57 ± 0.06 | N/A | N/A |
| 6 | gemini_3_flash_preview_with_title_and_context_gpt_4o_transcribe | N/A | N/A | N/A | 0.57 ± 0.06 | N/A | N/A |
| 7 | gemini_3_flash_preview_with_title_and_context_transcript_truth | N/A | N/A | N/A | 0.69 ± 0.05 | N/A | N/A |
| 8 | gemini_embedding_001 | N/A | N/A | N/A | 0.57 ± 0.08 | N/A | N/A |
| 9 | gpt | N/A | N/A | 0.23 ± 0.01 | 0.52 ± 0.10 | N/A | 0.28 ± 0.03 |
| 10 | hubert | N/A | 0.48 ± 0.38 | N/A | N/A | N/A | N/A |
| 11 | litellm | N/A | N/A | 0.69 ± 0.01 | 0.50 ± 0.04 | N/A | N/A |
| 12 | raw_spectrogram_25ms_10ms_mean | N/A | 0.48 ± 0.37 | N/A | N/A | N/A | N/A |
| 13 | wav2vec2 | N/A | 0.60 ± 0.37 | N/A | N/A | N/A | N/A |
| 14 | whisper | N/A | N/A | N/A | N/A | 0.40 ± 0.03 | 0.33 ± 0.03 |

Audio Encoder Details

classification

| Model | BirdSet/classification/ebird_classification (mAP) | FSD50K/classification/classification (mAP) |
|---|---|---|
| clap | N/A | 0.43 |
| perch | 0.58 | N/A |

clustering

| Model | BirdSet/clustering/clustering (VMeasure) | FSD50K/clustering/sound_event (VMeasure) | SVQ/clustering/speaker_age (VMeasure) | SVQ/clustering/speaker_gender (VMeasure) | SVQ/clustering/speaker_id (VMeasure) |
|---|---|---|---|---|---|
| clap | 0.32 | 0.68 | N/A | N/A | N/A |
| hubert | 0.08 | N/A | 0.76 | 0.24 | 0.85 |
| perch | 0.39 | N/A | N/A | N/A | N/A |
| raw_spectrogram_25ms_10ms_mean | 0.09 | N/A | 0.74 | 0.24 | 0.85 |
| wav2vec2 | N/A | N/A | 0.76 | 0.18 | 0.87 |

reranking

| Model | None/reranking/query_reranking (MAP) | None/reranking/query_reranking:background_speech (MAP) | None/reranking/query_reranking:clean (MAP) | None/reranking/query_reranking:media_noise (MAP) | None/reranking/query_reranking:traffic_noise (MAP) | SVQ/reranking/query_reranking (MAP) | SVQ/reranking/query_reranking:background_speech (MAP) | SVQ/reranking/query_reranking:clean (MAP) | SVQ/reranking/query_reranking:media_noise (MAP) | SVQ/reranking/query_reranking:traffic_noise (MAP) |
|---|---|---|---|---|---|---|---|---|---|---|
| gemini_3_flash_preview_with_title_and_context | 0.98 | 0.99 | 0.99 | 0.98 | 0.98 | N/A | N/A | N/A | N/A | N/A |
| gpt | N/A | N/A | N/A | N/A | N/A | 0.24 | 0.23 | 0.24 | 0.24 | 0.24 |
| litellm | N/A | N/A | N/A | N/A | N/A | 0.69 | 0.68 | 0.70 | 0.69 | 0.69 |

retrieval

| Model | None/retrieval/document_retrieval_cross_lang (MRR) | None/retrieval/document_retrieval_cross_lang:background_speech (MRR) | None/retrieval/document_retrieval_cross_lang:clean (MRR) | None/retrieval/document_retrieval_cross_lang:media_noise (MRR) | None/retrieval/document_retrieval_cross_lang:traffic_noise (MRR) | None/retrieval/document_retrieval_in_lang (MRR) | None/retrieval/document_retrieval_in_lang:background_speech (MRR) | None/retrieval/document_retrieval_in_lang:clean (MRR) | None/retrieval/document_retrieval_in_lang:media_noise (MRR) | None/retrieval/document_retrieval_in_lang:traffic_noise (MRR) | None/retrieval/passage_retrieval_cross_lang (MRR) | None/retrieval/passage_retrieval_cross_lang:background_speech (MRR) | None/retrieval/passage_retrieval_cross_lang:clean (MRR) | None/retrieval/passage_retrieval_cross_lang:media_noise (MRR) | None/retrieval/passage_retrieval_cross_lang:traffic_noise (MRR) | SVQ/retrieval/document_retrieval_cross_lang (MRR) | SVQ/retrieval/document_retrieval_cross_lang:background_speech (MRR) | SVQ/retrieval/document_retrieval_cross_lang:clean (MRR) | SVQ/retrieval/document_retrieval_cross_lang:media_noise (MRR) | SVQ/retrieval/document_retrieval_cross_lang:traffic_noise (MRR) | SVQ/retrieval/document_retrieval_in_lang (MRR) | SVQ/retrieval/document_retrieval_in_lang:background_speech (MRR) | SVQ/retrieval/document_retrieval_in_lang:clean (MRR) | SVQ/retrieval/document_retrieval_in_lang:media_noise (MRR) | SVQ/retrieval/document_retrieval_in_lang:traffic_noise (MRR) | SVQ/retrieval/passage_retrieval_cross_lang (MRR) | SVQ/retrieval/passage_retrieval_cross_lang:background_speech (MRR) | SVQ/retrieval/passage_retrieval_cross_lang:clean (MRR) | SVQ/retrieval/passage_retrieval_cross_lang:media_noise (MRR) | SVQ/retrieval/passage_retrieval_cross_lang:traffic_noise (MRR) | SVQ/retrieval/passage_retrieval_in_lang (MRR) | SVQ/retrieval/passage_retrieval_in_lang:background_speech (MRR) | SVQ/retrieval/passage_retrieval_in_lang:clean (MRR) | SVQ/retrieval/passage_retrieval_in_lang:media_noise (MRR) | SVQ/retrieval/passage_retrieval_in_lang:traffic_noise (MRR) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gemini | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.44 | 0.43 | 0.48 | 0.44 | 0.39 | 0.54 | 0.53 | 0.58 | 0.54 | 0.50 | 0.52 | 0.52 | 0.56 | 0.53 | 0.47 | 0.69 | 0.69 | 0.73 | 0.70 | 0.64 |
| gemini_3_flash_preview_with_title_and_context | 0.51 | 0.52 | 0.55 | 0.52 | 0.45 | 0.56 | 0.54 | 0.60 | 0.56 | 0.51 | 0.64 | 0.64 | 0.68 | 0.64 | 0.59 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gemini_3_flash_preview_with_title_and_context_gpt_4o_transcribe | 0.51 | 0.52 | 0.54 | 0.50 | 0.47 | 0.58 | 0.57 | 0.62 | 0.59 | 0.54 | 0.63 | 0.63 | 0.67 | 0.63 | 0.57 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gemini_3_flash_preview_with_title_and_context_transcript_truth | 0.63 | 0.63 | 0.63 | 0.63 | 0.63 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gemini_embedding_001 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.45 | 0.45 | 0.49 | 0.45 | 0.41 | 0.61 | 0.61 | 0.66 | 0.62 | 0.56 | 0.57 | 0.57 | 0.61 | 0.57 | 0.51 | 0.64 | 0.64 | 0.69 | 0.65 | 0.59 |
| gpt | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.42 | 0.42 | 0.45 | 0.42 | 0.38 | 0.43 | 0.42 | 0.46 | 0.43 | 0.39 | 0.60 | 0.60 | 0.65 | 0.60 | 0.55 | 0.63 | 0.63 | 0.68 | 0.64 | 0.59 |
| litellm | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.48 | 0.47 | 0.53 | 0.47 | 0.46 | 0.52 | 0.53 | 0.56 | 0.53 | 0.48 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |

segmentation

| Model | SVQ/segmentation/salient_term (NDCG) | SVQ/segmentation/salient_term:background_speech (NDCG) | SVQ/segmentation/salient_term:clean (NDCG) | SVQ/segmentation/salient_term:media_noise (NDCG) | SVQ/segmentation/salient_term:traffic_noise (NDCG) |
|---|---|---|---|---|---|
| whisper | 0.40 | 0.39 | 0.44 | 0.41 | 0.37 |

transcription

| Model | SVQ/transcription/speech_transcription (WER) | SVQ/transcription/speech_transcription:background_speech (WER) | SVQ/transcription/speech_transcription:clean (WER) | SVQ/transcription/speech_transcription:media_noise (WER) | SVQ/transcription/speech_transcription:traffic_noise (WER) |
|---|---|---|---|---|---|
| elevenlabs | 0.29 | 0.35 | 0.22 | 0.28 | 0.30 |
| gemini | 0.36 | 0.43 | 0.29 | 0.35 | 0.37 |
| gpt | 0.28 | 0.28 | 0.24 | 0.28 | 0.34 |
| whisper | 0.33 | 0.35 | 0.28 | 0.32 | 0.37 |

Task Definitions

Classification

Classification tasks involve assigning one or more predefined labels to an audio segment. These tasks evaluate the model's ability to extract specific features or categories from sound representations, such as intent in speech or species in bioacoustic recordings.

BirdSet Classification

Evaluates the model's ability to identify bird species in soundscapes.

  • Task: Multi-label Classification.
  • Input: 5-second audio segments.
  • Labels: Predict all bird species present in the segment based on eBird codes.

Tasks (See BIRDSET):

  • BirdSet/classification/ebird_classification

FSD50K Classification

Evaluates the model's ability to classify general sound events.

  • Task: Multi-label Classification across 200 diverse categories.
  • Evaluation: Mean Average Precision (mAP) is the primary metric.

Tasks (See FSD50K):

  • FSD50K/classification/classification
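The mAP metric used by both classification tables averages per-class average precision across all classes (macro averaging). A minimal from-scratch sketch on hypothetical toy labels and scores (not MSEB's evaluation code; scikit-learn's average_precision_score computes the same per-class quantity):

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP for one class: precision at each true positive, averaged,
    after ranking clips by descending score."""
    order = np.argsort(-y_score)
    y = y_true[order]
    hits, total = 0, 0.0
    for k, label in enumerate(y, start=1):
        if label:
            hits += 1
            total += hits / k
    return total / max(hits, 1)

# Toy multi-label setup: 4 clips, 3 classes (hypothetical labels/scores).
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.1], [0.1, 0.8, 0.7],
                    [0.7, 0.25, 0.2], [0.2, 0.3, 0.9]])

# mAP = mean of per-class average precision over the 3 classes.
mAP = np.mean([average_precision(y_true[:, c], y_score[:, c])
               for c in range(y_true.shape[1])])
```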

Speech MASSIVE Classification

Evaluates the model's ability to categorize speech into intents.

  • Task: Intent Classification.
  • Labels: 60 predefined intents (e.g., datetime_query, play_music).
  • Goal: Correctly identify the user's intent from the audio utterance.


Tasks (See SPEECH_MASSIVE):

  • SpeechMassive/classification/intent_classification



Clustering

Clustering tasks evaluate how well sound embeddings group together based on semantic or acoustic similarity without explicit labels during the grouping process. This tests the inherent structure and separability of the embedding space.

BirdSet Clustering

Evaluates the semantic grouping of bird vocalizations in the embedding space.

  • Goal: Group bird calls from the same species together without explicit supervised labels.
  • Evaluation: Primarily uses V-Measure to compare predicted clusters against ground-truth species labels.

Tasks (See BIRDSET):

  • BirdSet/clustering/clustering

FSD50K Clustering

Evaluates how well embeddings group by sound event categories.

  • Goal: Cluster audio clips such that clips belonging to the same AudioSet class are grouped together.

Tasks (See FSD50K):

  • FSD50K/clustering/sound_event

SVQ Clustering

Evaluates how well embeddings group by speaker-related attributes.

  • Attributes: The dataset provides labels for speaker_id, speaker_gender, and speaker_age.
  • Goal: Group audio segments based on these metadata labels without using the labels during embedding generation.

Tasks (See SVQ):

  • SVQ/clustering/speaker_age
  • SVQ/clustering/speaker_gender
  • SVQ/clustering/speaker_id
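V-Measure, the metric reported for all clustering tasks, is the harmonic mean of homogeneity (each cluster contains only one class) and completeness (all members of a class land in one cluster). A from-scratch sketch on hypothetical toy labels; sklearn.metrics.v_measure_score is the standard implementation:

```python
from collections import Counter
from math import log

def v_measure(labels_true, labels_pred):
    """Harmonic mean of homogeneity and completeness, both derived
    from conditional entropies of the class/cluster assignments."""
    n = len(labels_true)
    ct, cp = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    h_c, h_k = entropy(ct), entropy(cp)
    # Conditional entropies H(class|cluster) and H(cluster|class).
    h_c_given_k = -sum((c / n) * log(c / cp[k]) for (t, k), c in joint.items())
    h_k_given_c = -sum((c / n) * log(c / ct[t]) for (t, k), c in joint.items())
    homogeneity = 1.0 - h_c_given_k / h_c if h_c else 1.0
    completeness = 1.0 - h_k_given_c / h_k if h_k else 1.0
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

Note that V-Measure is invariant to cluster relabeling: a perfect clustering scores 1.0 even if the cluster ids are permuted.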



Reasoning

Reasoning tasks (often implemented as Span Retrieval) require the model to identify specific segments within a text document that directly answer a voice query. This tests deeper semantic understanding and the ability to align fine-grained concepts between speech and text.

SVQ Reasoning (Span Retrieval)

The reasoning task requires identifying the exact span of text within a Wikipedia article that answers a voice query.

  • Task Format: Given an audio query and a target document, the model must predict the start and end offsets of the answer span.
  • In-Lang Reasoning: Query and document share the same language.
  • Cross-Lang Reasoning: Query is in a non-English language; the document and target answer span are in English.

Tasks (See SVQ):

  • SVQ/reasoning/span_reasoning_cross_lang
  • SVQ/reasoning/span_reasoning_in_lang
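Span predictions are scored by comparing predicted (start, end) offsets against the gold span. One common scoring scheme for this setup (not necessarily MSEB's exact metric) is character-overlap F1 between the two offset pairs:

```python
def span_overlap_f1(pred, gold):
    """Character-offset F1 between a predicted and a gold (start, end)
    span. Precision = overlap / predicted length; recall = overlap /
    gold length; F1 is their harmonic mean."""
    ps, pe = pred
    gs, ge = gold
    overlap = max(0, min(pe, ge) - max(ps, gs))
    if overlap == 0:
        return 0.0
    precision = overlap / (pe - ps)
    recall = overlap / (ge - gs)
    return 2 * precision * recall / (precision + recall)
```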



Reranking

Given a set of candidate answers, reranking tasks evaluate a model's ability to re-order them such that the most relevant results appear at the top. This is often used to refine the output of a primary retrieval system.

SVQ Reranking

The reranking task assesses a model's ability to refine a list of candidate answers.

  • Input: A voice query and a set of candidate text answers (e.g., top-K results from a first-stage retrieval system).
  • Goal: Re-order the candidates so that the ground-truth answer appears at rank 1.
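An embedding-based reranker for this setup scores each candidate by cosine similarity to the query embedding and sorts descending; MAP is then computed over the reordered lists. A minimal sketch (the embeddings below are hypothetical placeholders, not outputs of any MSEB encoder):

```python
import numpy as np

def rerank(query_emb, candidate_embs):
    """Return candidate indices ordered by cosine similarity to the
    query, most similar first."""
    q = query_emb / np.linalg.norm(query_emb)
    C = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return np.argsort(-(C @ q))

# Toy example: the third candidate points in the same direction as the query.
query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [1.0, 0.1], [1.0, 0.0]])
order = rerank(query, candidates)
```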

Tasks (See SVQ):

  • None/reranking/query_reranking
  • None/reranking/query_reranking:background_speech
  • None/reranking/query_reranking:clean
  • None/reranking/query_reranking:media_noise
  • None/reranking/query_reranking:traffic_noise
  • SVQ/reranking/query_reranking
  • SVQ/reranking/query_reranking:background_speech
  • SVQ/reranking/query_reranking:clean
  • SVQ/reranking/query_reranking:media_noise
  • SVQ/reranking/query_reranking:traffic_noise



Retrieval

Retrieval tasks evaluate a model's ability to find relevant documents or passages from a large corpus given a voice query. This involves mapping speech and text into a shared embedding space where semantic similarity can be measured, often across different languages.

SVQ Retrieval

The SVQ retrieval task evaluates the model's ability to find relevant Wikipedia content given a voice query.

  • In-Lang Retrieval: The voice query and the Wikipedia content are in the same language. This tests native-language semantic alignment.
  • Cross-Lang Retrieval: The voice query is in a non-English language, while the Wikipedia content is in English. This evaluates the model's ability to map cross-lingual concepts into a shared embedding space.
  • Data: Both document-level and passage-level retrieval variants are supported.
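The retrieval tables report MRR: the reciprocal rank of the first relevant document, averaged over queries. A minimal sketch of the metric itself:

```python
def mean_reciprocal_rank(ranked_lists, gold_ids):
    """MRR over queries: 1/rank of the gold document in each ranked
    list (0.0 if it never appears), averaged."""
    rrs = []
    for ranking, gold in zip(ranked_lists, gold_ids):
        rr = 0.0
        for i, doc_id in enumerate(ranking, start=1):
            if doc_id == gold:
                rr = 1.0 / i
                break
        rrs.append(rr)
    return sum(rrs) / len(rrs)
```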

Tasks (See SVQ):

  • None/retrieval/document_retrieval_cross_lang
  • None/retrieval/document_retrieval_cross_lang:background_speech
  • None/retrieval/document_retrieval_cross_lang:clean
  • None/retrieval/document_retrieval_cross_lang:media_noise
  • None/retrieval/document_retrieval_cross_lang:traffic_noise
  • None/retrieval/document_retrieval_in_lang
  • None/retrieval/document_retrieval_in_lang:background_speech
  • None/retrieval/document_retrieval_in_lang:clean
  • None/retrieval/document_retrieval_in_lang:media_noise
  • None/retrieval/document_retrieval_in_lang:traffic_noise
  • None/retrieval/passage_retrieval_cross_lang
  • None/retrieval/passage_retrieval_cross_lang:background_speech
  • None/retrieval/passage_retrieval_cross_lang:clean
  • None/retrieval/passage_retrieval_cross_lang:media_noise
  • None/retrieval/passage_retrieval_cross_lang:traffic_noise
  • SVQ/retrieval/document_retrieval_cross_lang
  • SVQ/retrieval/document_retrieval_cross_lang:background_speech
  • SVQ/retrieval/document_retrieval_cross_lang:clean
  • SVQ/retrieval/document_retrieval_cross_lang:media_noise
  • SVQ/retrieval/document_retrieval_cross_lang:traffic_noise
  • SVQ/retrieval/document_retrieval_in_lang
  • SVQ/retrieval/document_retrieval_in_lang:background_speech
  • SVQ/retrieval/document_retrieval_in_lang:clean
  • SVQ/retrieval/document_retrieval_in_lang:media_noise
  • SVQ/retrieval/document_retrieval_in_lang:traffic_noise
  • SVQ/retrieval/passage_retrieval_cross_lang
  • SVQ/retrieval/passage_retrieval_cross_lang:background_speech
  • SVQ/retrieval/passage_retrieval_cross_lang:clean
  • SVQ/retrieval/passage_retrieval_cross_lang:media_noise
  • SVQ/retrieval/passage_retrieval_cross_lang:traffic_noise
  • SVQ/retrieval/passage_retrieval_in_lang
  • SVQ/retrieval/passage_retrieval_in_lang:background_speech
  • SVQ/retrieval/passage_retrieval_in_lang:clean
  • SVQ/retrieval/passage_retrieval_in_lang:media_noise
  • SVQ/retrieval/passage_retrieval_in_lang:traffic_noise



Segmentation

Segmentation tasks involve identifying and marking the boundaries of specific parts of an audio stream, such as salient terms or keywords. This evaluates the model's temporal precision and its ability to distinguish meaningful units of sound.

SVQ Segmentation

This task focuses on identifying salient terms within a continuous audio stream.

  • Target: The model must identify the timestamps for "salient terms" or keywords in the voice query.
  • Evaluation: Precision and recall of the predicted time boundaries compared to human-labeled keywords.
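Boundary precision/recall can be computed by greedily matching predicted timestamps to gold ones within a tolerance window. A sketch under assumed settings: the ±0.2 s tolerance is illustrative rather than MSEB's configuration, and note that the leaderboard table itself reports NDCG for this task:

```python
def boundary_prf(pred, gold, tol=0.2):
    """Precision/recall/F1 for predicted boundary timestamps (seconds)
    against gold boundaries, matched greedily within +/- tol seconds.
    Each gold boundary can be matched at most once."""
    matched = set()
    hits = 0
    for p in pred:
        for i, g in enumerate(gold):
            if i not in matched and abs(p - g) <= tol:
                matched.add(i)
                hits += 1
                break
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```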

Tasks (See SVQ):

  • SVQ/segmentation/salient_term
  • SVQ/segmentation/salient_term:background_speech
  • SVQ/segmentation/salient_term:clean
  • SVQ/segmentation/salient_term:media_noise
  • SVQ/segmentation/salient_term:traffic_noise



Transcription

Transcription tasks (Automatic Speech Recognition) evaluate the model's ability to convert spoken language into written text. This tests the phonological and linguistic coverage of the sound representations.

SVQ Transcription

A standard speech-to-text (ASR) task evaluating the linguistic accuracy of representations.

  • Goal: Transcribe the input voice query into the corresponding written text in the same language.
  • Metrics: Word Error Rate (WER) and Sentence Error Rate (SER).
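WER is the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the reference length. A minimal dynamic-programming sketch:

```python
def wer(ref, hyp):
    """Word Error Rate: Levenshtein distance over words, normalized by
    the number of reference words."""
    r, h = ref.split(), hyp.split()
    if not r:
        raise ValueError("empty reference")
    # d[i][j] = edit distance between r[:i] and h[:j].
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution/match
    return d[len(r)][len(h)] / len(r)
```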

Tasks (See SVQ):

  • SVQ/transcription/speech_transcription
  • SVQ/transcription/speech_transcription:background_speech
  • SVQ/transcription/speech_transcription:clean
  • SVQ/transcription/speech_transcription:media_noise
  • SVQ/transcription/speech_transcription:traffic_noise



Datasets

BirdSet

BirdSet is a large-scale dataset for bioacoustic monitoring, focusing on bird species classification from audio recordings.

Task Description

The primary task is Multi-label Classification. Given a 5-second audio segment, the model must predict all bird species present in the recording.

Configurations

BirdSet includes data from various recording sites and setups, referred to as configurations:

  • HSN: High Sierra Nevada
  • NBP: NIPS4BPlus
  • POW: Powdermill Nature Reserve
  • SSW: Sapsucker Woods
  • SNE: Sierra Nevada
  • PER: Peru
  • NES: Northeast
  • UHH: Hawaii
  • XCM: Xeno-Canto Mixed
  • XCL: Xeno-Canto Low-noise



FSD50K

FSD50K is an open dataset of human-labeled sound events containing 51,197 audio clips totaling 108.3 hours.

Task Description

The task is Multi-label Sound Event Classification. Audio clips are labeled using 200 classes from the AudioSet ontology.



Speech MASSIVE

Speech MASSIVE is a multilingual dataset for Speech Intent Classification, derived from the MASSIVE text dataset.

Task Description

The task is Intent Classification. Given an utterance, the model must categorize it into one of 60 predefined intents (e.g., datetime_query, iot_hue_lightchange, play_music).

Coverage

It covers 12 languages and provides a challenging testbed for multilingual speech understanding.



Simple Voice Questions (SVQ) Dataset

Simple Voice Questions (SVQ) is a multilingual dataset designed for evaluating sound representations. It consists of voice queries in multiple languages based on Wikipedia content.

Dataset Characteristics

  • Languages: Arabic, Bengali, English, Finnish, Gujarati, Hindi, Indonesian, Japanese, Kannada, Korean, Malayalam, Marathi, Russian, Swahili, Tamil, Telugu, and Urdu (17 in total).
  • Environments: To test robustness, queries are provided in four environments:
    • clean: High-quality recording.
    • media_noise: With background music or other media.
    • traffic_noise: With street and vehicle noise.
    • background_speech: With other people talking in the background.
