This is the leaderboard for MSEB, the Massive Speech Embedding Benchmark.
For more information, see the MSEB GitHub repository.

| Statistic | Value |
|---|---|
| Tasks | 882 |
| Task Types | 12 |
| Languages | 37 |

| Encoder Name | classification (Transcript Truth) | classification (Cascaded ASR) | classification (Audio) | reasoning (Transcript Truth) | reasoning (Cascaded ASR) | reasoning (Audio) | reranking (Transcript Truth) | reranking (Cascaded ASR) | reranking (Audio) | retrieval (Transcript Truth) | retrieval (Cascaded ASR) | retrieval (Audio) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gecko | 0.44 | 0.42 (Δ-0.02, -4.74%) | N/A | 0.08 | 0.06 (Δ-0.02, -27.21%) | N/A | 1.00 | 0.51 (Δ-0.49, -49.10%) | N/A | 0.51 | 0.34 (Δ-0.17, -33.29%) | N/A |
| gemini | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.72 | 0.58 (Δ-0.15, -20.46%) | 0.57 |
| gemini_embedding_001 | 0.83 | 0.80 (Δ-0.03, -3.79%) | N/A | 0.04 | 0.03 (Δ-0.01, -25.00%) | N/A | 1.00 | 0.60 (Δ-0.40, -39.86%) | N/A | 0.67 | 0.51 (Δ-0.17, -24.92%) | 0.56 |
| gemma | N/A | N/A | N/A | 0.53 | 0.47 (Δ-0.07, -12.36%) | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gpt | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.23 | 0.65 | 0.40 (Δ-0.26, -39.51%) | 0.52 |
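The Cascaded ASR cells above report the absolute and relative drop from the Transcript Truth baseline. A minimal sketch of that formatting (the function name is mine, not from the MSEB codebase, and the table's percentages appear to be computed from unrounded scores, so this toy example can differ slightly from the cells above):

```python
def degradation(baseline: float, score: float) -> str:
    """Format a score alongside its absolute and relative drop vs. a baseline."""
    delta = score - baseline
    pct = 100.0 * delta / baseline  # relative change in percent
    return f"{score:.2f} (\u0394{delta:+.2f}, {pct:+.2f}%)"

# e.g. gecko classification: 0.44 (Transcript Truth) vs. 0.42 (Cascaded ASR)
print(degradation(0.44, 0.42))  # → 0.42 (Δ-0.02, -4.55%)
```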

| Rank | Encoder Name | classification (mean) | clustering (mean) | reranking (mean) | retrieval (mean) | segmentation (mean) | transcription (mean) |
|---|---|---|---|---|---|---|---|
| 1 | perch | 0.58 ± 0.00 | 0.39 ± 0.00 | N/A | N/A | N/A | N/A |
| 2 | clap | 0.43 ± 0.00 | 0.50 ± 0.25 | N/A | N/A | N/A | N/A |
| 3 | elevenlabs | N/A | N/A | N/A | N/A | N/A | 0.29 ± 0.05 |
| 4 | gemini | N/A | N/A | N/A | 0.55 ± 0.10 | N/A | 0.36 ± 0.05 |
| 5 | gemini_3_flash_preview_with_title_and_context | N/A | N/A | 0.98 ± 0.00 | 0.57 ± 0.06 | N/A | N/A |
| 6 | gemini_3_flash_preview_with_title_and_context_gpt_4o_transcribe | N/A | N/A | N/A | 0.57 ± 0.06 | N/A | N/A |
| 7 | gemini_3_flash_preview_with_title_and_context_transcript_truth | N/A | N/A | N/A | 0.69 ± 0.05 | N/A | N/A |
| 8 | gemini_embedding_001 | N/A | N/A | N/A | 0.57 ± 0.08 | N/A | N/A |
| 9 | gpt | N/A | N/A | 0.23 ± 0.01 | 0.52 ± 0.10 | N/A | 0.28 ± 0.03 |
| 10 | hubert | N/A | 0.48 ± 0.38 | N/A | N/A | N/A | N/A |
| 11 | litellm | N/A | N/A | 0.69 ± 0.01 | 0.50 ± 0.04 | N/A | N/A |
| 12 | raw_spectrogram_25ms_10ms_mean | N/A | 0.48 ± 0.37 | N/A | N/A | N/A | N/A |
| 13 | wav2vec2 | N/A | 0.60 ± 0.37 | N/A | N/A | N/A | N/A |
| 14 | whisper | N/A | N/A | N/A | N/A | 0.40 ± 0.03 | 0.33 ± 0.03 |
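Each cell in the ranking table above aggregates an encoder's per-task scores into mean ± standard deviation. A minimal sketch of that aggregation (whether MSEB uses population or sample standard deviation is an assumption; this uses population):

```python
import math

def mean_std(scores: list[float]) -> str:
    """Aggregate per-task scores into the 'mean ± std' cell format."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)  # population variance
    return f"{mean:.2f} \u00b1 {math.sqrt(var):.2f}"

# hypothetical per-task clustering scores for one encoder
print(mean_std([0.76, 0.24, 0.85]))  # → 0.62 ± 0.27
```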

| Model | BirdSet/clustering/clustering (VMeasure) [?] | FSD50K/clustering/sound_event (VMeasure) [?] | SVQ/clustering/speaker_age (VMeasure) [?] | SVQ/clustering/speaker_gender (VMeasure) [?] | SVQ/clustering/speaker_id (VMeasure) [?] |
|---|---|---|---|---|---|
| clap | 0.32 | 0.68 | N/A | N/A | N/A |
| hubert | 0.08 | N/A | 0.76 | 0.24 | 0.85 |
| perch | 0.39 | N/A | N/A | N/A | N/A |
| raw_spectrogram_25ms_10ms_mean | 0.09 | N/A | 0.74 | 0.24 | 0.85 |
| wav2vec2 | N/A | N/A | 0.76 | 0.18 | 0.87 |

| Model | None/reranking/query_reranking (MAP) [?] | None/reranking/query_reranking:background_speech (MAP) [?] | None/reranking/query_reranking:clean (MAP) [?] | None/reranking/query_reranking:media_noise (MAP) [?] | None/reranking/query_reranking:traffic_noise (MAP) [?] | SVQ/reranking/query_reranking (MAP) [?] | SVQ/reranking/query_reranking:background_speech (MAP) [?] | SVQ/reranking/query_reranking:clean (MAP) [?] | SVQ/reranking/query_reranking:media_noise (MAP) [?] | SVQ/reranking/query_reranking:traffic_noise (MAP) [?] |
|---|---|---|---|---|---|---|---|---|---|---|
| gemini_3_flash_preview_with_title_and_context | 0.98 | 0.99 | 0.99 | 0.98 | 0.98 | N/A | N/A | N/A | N/A | N/A |
| gpt | N/A | N/A | N/A | N/A | N/A | 0.24 | 0.23 | 0.24 | 0.24 | 0.24 |
| litellm | N/A | N/A | N/A | N/A | N/A | 0.69 | 0.68 | 0.70 | 0.69 | 0.69 |

| Model | None/retrieval/document_retrieval_cross_lang (MRR) [?] | None/retrieval/document_retrieval_cross_lang:background_speech (MRR) [?] | None/retrieval/document_retrieval_cross_lang:clean (MRR) [?] | None/retrieval/document_retrieval_cross_lang:media_noise (MRR) [?] | None/retrieval/document_retrieval_cross_lang:traffic_noise (MRR) [?] | None/retrieval/document_retrieval_in_lang (MRR) [?] | None/retrieval/document_retrieval_in_lang:background_speech (MRR) [?] | None/retrieval/document_retrieval_in_lang:clean (MRR) [?] | None/retrieval/document_retrieval_in_lang:media_noise (MRR) [?] | None/retrieval/document_retrieval_in_lang:traffic_noise (MRR) [?] | None/retrieval/passage_retrieval_cross_lang (MRR) [?] | None/retrieval/passage_retrieval_cross_lang:background_speech (MRR) [?] | None/retrieval/passage_retrieval_cross_lang:clean (MRR) [?] | None/retrieval/passage_retrieval_cross_lang:media_noise (MRR) [?] | None/retrieval/passage_retrieval_cross_lang:traffic_noise (MRR) [?] | SVQ/retrieval/document_retrieval_cross_lang (MRR) [?] | SVQ/retrieval/document_retrieval_cross_lang:background_speech (MRR) [?] | SVQ/retrieval/document_retrieval_cross_lang:clean (MRR) [?] | SVQ/retrieval/document_retrieval_cross_lang:media_noise (MRR) [?] | SVQ/retrieval/document_retrieval_cross_lang:traffic_noise (MRR) [?] | SVQ/retrieval/document_retrieval_in_lang (MRR) [?] | SVQ/retrieval/document_retrieval_in_lang:background_speech (MRR) [?] | SVQ/retrieval/document_retrieval_in_lang:clean (MRR) [?] | SVQ/retrieval/document_retrieval_in_lang:media_noise (MRR) [?] | SVQ/retrieval/document_retrieval_in_lang:traffic_noise (MRR) [?] | SVQ/retrieval/passage_retrieval_cross_lang (MRR) [?] | SVQ/retrieval/passage_retrieval_cross_lang:background_speech (MRR) [?] | SVQ/retrieval/passage_retrieval_cross_lang:clean (MRR) [?] | SVQ/retrieval/passage_retrieval_cross_lang:media_noise (MRR) [?] | SVQ/retrieval/passage_retrieval_cross_lang:traffic_noise (MRR) [?] | SVQ/retrieval/passage_retrieval_in_lang (MRR) [?] | SVQ/retrieval/passage_retrieval_in_lang:background_speech (MRR) [?] | SVQ/retrieval/passage_retrieval_in_lang:clean (MRR) [?] | SVQ/retrieval/passage_retrieval_in_lang:media_noise (MRR) [?] | SVQ/retrieval/passage_retrieval_in_lang:traffic_noise (MRR) [?] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gemini | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.44 | 0.43 | 0.48 | 0.44 | 0.39 | 0.54 | 0.53 | 0.58 | 0.54 | 0.50 | 0.52 | 0.52 | 0.56 | 0.53 | 0.47 | 0.69 | 0.69 | 0.73 | 0.70 | 0.64 |
| gemini_3_flash_preview_with_title_and_context | 0.51 | 0.52 | 0.55 | 0.52 | 0.45 | 0.56 | 0.54 | 0.60 | 0.56 | 0.51 | 0.64 | 0.64 | 0.68 | 0.64 | 0.59 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gemini_3_flash_preview_with_title_and_context_gpt_4o_transcribe | 0.51 | 0.52 | 0.54 | 0.50 | 0.47 | 0.58 | 0.57 | 0.62 | 0.59 | 0.54 | 0.63 | 0.63 | 0.67 | 0.63 | 0.57 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gemini_3_flash_preview_with_title_and_context_transcript_truth | 0.63 | 0.63 | 0.63 | 0.63 | 0.63 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| gemini_embedding_001 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.45 | 0.45 | 0.49 | 0.45 | 0.41 | 0.61 | 0.61 | 0.66 | 0.62 | 0.56 | 0.57 | 0.57 | 0.61 | 0.57 | 0.51 | 0.64 | 0.64 | 0.69 | 0.65 | 0.59 |
| gpt | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.42 | 0.42 | 0.45 | 0.42 | 0.38 | 0.43 | 0.42 | 0.46 | 0.43 | 0.39 | 0.60 | 0.60 | 0.65 | 0.60 | 0.55 | 0.63 | 0.63 | 0.68 | 0.64 | 0.59 |
| litellm | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.48 | 0.47 | 0.53 | 0.47 | 0.46 | 0.52 | 0.53 | 0.56 | 0.53 | 0.48 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |

| Model | SVQ/transcription/speech_transcription (WER) [?] | SVQ/transcription/speech_transcription:background_speech (WER) [?] | SVQ/transcription/speech_transcription:clean (WER) [?] | SVQ/transcription/speech_transcription:media_noise (WER) [?] | SVQ/transcription/speech_transcription:traffic_noise (WER) [?] |
|---|---|---|---|---|---|
| elevenlabs | 0.29 | 0.35 | 0.22 | 0.28 | 0.30 |
| gemini | 0.36 | 0.43 | 0.29 | 0.35 | 0.37 |
| gpt | 0.28 | 0.28 | 0.24 | 0.28 | 0.34 |
| whisper | 0.33 | 0.35 | 0.28 | 0.32 | 0.37 |

Classification tasks involve assigning one or more predefined labels to an audio segment. These tasks evaluate the model's ability to extract specific features or categories from sound representations, such as intent in speech or species in bioacoustic recordings.
Evaluates the model's ability to identify bird species in soundscapes.
Tasks (See BIRDSET):
Evaluates the model's ability to classify general sound events.
Tasks (See FSD50K):
Evaluates the model's ability to categorize speech into predefined intents (e.g., datetime_query, play_music).
Tasks (See SPEECH_MASSIVE):
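Embedding-based classification like this is typically evaluated by fitting a lightweight classifier on top of frozen embeddings. A toy nearest-centroid sketch (the 2-D embeddings and labels are illustrative, and this is not necessarily MSEB's actual evaluation protocol):

```python
def nearest_centroid(train, test_emb):
    """train: list of (embedding, label) pairs; returns the predicted label."""
    # average the training embeddings of each label into a centroid
    sums, counts = {}, {}
    for emb, label in train:
        counts[label] = counts.get(label, 0) + 1
        sums[label] = [a + b for a, b in zip(sums.get(label, [0.0] * len(emb)), emb)]
    centroids = {lbl: [v / counts[lbl] for v in vec] for lbl, vec in sums.items()}
    # predict the label whose centroid is closest (squared Euclidean distance)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: dist(centroids[lbl], test_emb))

train = [([0.9, 0.1], "play_music"), ([1.0, 0.0], "play_music"),
         ([0.0, 1.0], "datetime_query"), ([0.1, 0.9], "datetime_query")]
print(nearest_centroid(train, [0.8, 0.2]))  # → play_music
```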
Clustering tasks evaluate how well sound embeddings group together based on semantic or acoustic similarity without explicit labels during the grouping process. This tests the inherent structure and separability of the embedding space.
Evaluates the semantic grouping of bird vocalizations in the embedding space.
Tasks (See BIRDSET):
Evaluates how well embeddings group by sound event categories.
Tasks (See FSD50K):
Evaluates how well embeddings group by speaker-related attributes: speaker_id, speaker_gender, and speaker_age.
Tasks (See SVQ):
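The clustering tables above report V-measure. A self-contained sketch of that metric (this matches, to my understanding, scikit-learn's `v_measure_score` with beta = 1 — the harmonic mean of homogeneity and completeness):

```python
from collections import Counter
from math import log

def v_measure(labels_true, labels_pred):
    """V-measure: harmonic mean of homogeneity and completeness."""
    n = len(labels_true)

    def entropy(labels):
        return -sum(c / n * log(c / n) for c in Counter(labels).values())

    def conditional(target, given):  # H(target | given)
        given_sizes = Counter(given)
        return -sum(c / n * log(c / given_sizes[g])
                    for (t, g), c in Counter(zip(target, given)).items())

    h_true, h_pred = entropy(labels_true), entropy(labels_pred)
    homogeneity = 1.0 if h_true == 0 else 1 - conditional(labels_true, labels_pred) / h_true
    completeness = 1.0 if h_pred == 0 else 1 - conditional(labels_pred, labels_true) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# a perfect clustering scores 1.0 regardless of the cluster ids used
print(v_measure(["a", "a", "b", "b"], [1, 1, 0, 0]))  # → 1.0
```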
Reasoning tasks (often implemented as Span Retrieval) require the model to identify specific segments within a text document that directly answer a voice query. This tests deeper semantic understanding and the ability to align fine-grained concepts between speech and text.
The reasoning task requires identifying the exact span of text within a Wikipedia article that answers a voice query.
Tasks (See SVQ):
Given a set of candidate answers, reranking tasks evaluate a model's ability to re-order them such that the most relevant results appear at the top. This is often used to refine the output of a primary retrieval system.
The reranking task assesses a model's ability to refine a list of candidate answers.
Tasks (See SVQ):
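The reranking tables report MAP (mean average precision). A minimal sketch with binary relevance judgments (function names are mine, not from the MSEB codebase):

```python
def average_precision(relevances: list[int]) -> float:
    """AP for one query: `relevances` holds the binary relevance of each
    candidate, listed in the order the model ranked them."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries: list[list[int]]) -> float:
    """MAP: average of per-query AP values."""
    return sum(average_precision(q) for q in queries) / len(queries)

# two toy queries: relevant item ranked 1st, and relevant item ranked 2nd
print(mean_average_precision([[1, 0, 0], [0, 1, 0]]))  # → 0.75
```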
Retrieval tasks evaluate a model's ability to find relevant documents or passages from a large corpus given a voice query. This involves mapping speech and text into a shared embedding space where semantic similarity can be measured, often across different languages.
The SVQ retrieval task evaluates the model's ability to find relevant Wikipedia content given a voice query.
Tasks (See SVQ):
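The retrieval tables report MRR (mean reciprocal rank). A minimal sketch, assuming every query has at least one relevant document in the corpus:

```python
def mean_reciprocal_rank(first_relevant_ranks: list[int]) -> float:
    """MRR over queries: each entry is the 1-based rank at which the first
    relevant document was retrieved for that query."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# first relevant document found at ranks 1, 2, and 4 for three queries
print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 ≈ 0.583
```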
Segmentation tasks involve identifying specific parts of an audio stream, such as salient terms or keywords, and marking their boundaries. This evaluates the model's temporal precision and its ability to distinguish meaningful units of sound.
This task focuses on identifying salient terms within a continuous audio stream.
Tasks (See SVQ):
Transcription tasks (Automatic Speech Recognition) evaluate the model's ability to convert spoken language into written text. This tests the phonological and linguistic coverage of the sound representations.
A standard speech-to-text (ASR) task evaluating the linguistic accuracy of representations.
Tasks (See SVQ):
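The transcription tables report WER (word error rate). A self-contained sketch computing it as word-level edit distance divided by reference length (normalization details such as casing and punctuation handling are omitted here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming (Levenshtein) edit distance over words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution (0 if equal)
        prev = cur
    return prev[-1] / len(ref)

# one substitution ("the" → "a") over six reference words
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ≈ 0.167
```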
BirdSet is a large-scale dataset for bioacoustic monitoring, focusing on bird species classification from audio recordings.
The primary task is Multi-label Classification. Given a 5-second audio segment, the model must predict all bird species present in the recording.
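A common way to turn per-class scores into such a multi-label prediction is independent per-class thresholding; a toy sketch (the species names, scores, and 0.5 threshold are all illustrative, not from BirdSet or MSEB):

```python
def predict_species(scores: dict[str, float], threshold: float = 0.5) -> set[str]:
    """Multi-label prediction: keep every class whose score clears the threshold."""
    return {species for species, score in scores.items() if score >= threshold}

# hypothetical per-species scores for one 5-second segment
scores = {"song_sparrow": 0.91, "american_robin": 0.62, "blue_jay": 0.08}
print(sorted(predict_species(scores)))  # → ['american_robin', 'song_sparrow']
```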
BirdSet includes data from various recording sites and setups, referred to as configurations:
- HSN: High Sierra Nevada
- NBP: NIPS4BPlus
- POW: Powdermill Nature Reserve
- SSW: Sapsucker Woods
- SNE: Sierra Nevada
- PER: Peru
- NES: Northeast
- UHH: Hawaii
- XCM: Xeno-Canto Medium
- XCL: Xeno-Canto Large
FSD50K is an open dataset of human-labeled sound events containing 51,197 audio clips totaling 108.3 hours.
The task is Multi-label Sound Event Classification. Audio clips are labeled using 200 classes from the AudioSet ontology.
Speech MASSIVE is a multilingual dataset for Speech Intent Classification, derived from the MASSIVE text dataset.
The task is Intent Classification. Given an utterance, the model must
categorize it into one of 60 predefined intents (e.g., datetime_query,
iot_hue_lightchange, play_music).
It covers 12 languages and provides a challenging testbed for multilingual speech understanding.
Simple Voice Questions (SVQ) is a multilingual dataset designed for evaluating sound representations. It consists of voice queries in multiple languages based on Wikipedia content.
SVQ recordings are provided in four acoustic conditions:
- clean: High-quality recording.
- media_noise: With background music or other media.
- traffic_noise: With street and vehicle noise.
- background_speech: With other people talking in the background.