OLAC Record: 2007 NIST Language Recognition Evaluation Test Set

OLAC Record
oai:www.ldc.upenn.edu:LDC2009S04

Metadata

Title: 2007 NIST Language Recognition Evaluation Test Set

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Martin, Alvin, and Audrey Le. 2007 NIST Language Recognition Evaluation Test Set LDC2009S04. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Martin, Alvin

Contributor: Le, Audrey

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-10-20

Description: *Introduction*

2007 NIST Language Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). It consists of 66 hours of conversational telephone speech segments in the following languages and dialects: Arabic, Bengali, Chinese (Cantonese), Mandarin Chinese (Mainland, Taiwan), Chinese (Min), English (American, Indian), Farsi, German, Hindustani (Hindi, Urdu), Korean, Russian, Spanish (Caribbean, non-Caribbean), Tamil, Thai, and Vietnamese.

The goal of NIST's Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted three previous language recognition evaluations, in 1996, 2003 and 2005. The most significant differences between those evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions. In 2007, given a segment of speech and a language of interest to be detected (i.e., a target language), the task was to decide whether that target language was in fact spoken in the given telephone speech segment (yes or no), based on an automated analysis of the data contained in the segment. Further information regarding this evaluation can be found in the evaluation plan, which is included in the documentation for this release.

The training data for LRE 2007 consists of the following:

* 2003 NIST Language Recognition Evaluation (LDC2006S31) - This material comprises: (1) approximately 46 hours of conversational telephone speech segments in the target languages and dialects and (2) the 1996 LRE test data (conversational telephone speech in Arabic (Egyptian colloquial), English (General American, Southern American), Farsi, French, German, Hindi, Japanese, Korean, Mandarin Chinese (Mainland, Taiwan), Spanish (Caribbean, non-Caribbean), Tamil, and Vietnamese).
* 2005 NIST Language Recognition Evaluation (LDC2008S05) - This release consists of approximately 44 hours of conversational telephone speech in English (American, Indian), Hindi, Japanese, Korean, Mandarin Chinese (Mainland, Taiwan), Spanish (Mexican), and Tamil.
* 2007 NIST Language Recognition Evaluation Supplemental Training Data (LDC2009S05) - This release consists of 118 hours of conversational telephone speech segments in Arabic (Egyptian colloquial), Bengali, Min Nan Chinese, Wu Chinese, Taiwan Mandarin, Cantonese, Russian, Mexican Spanish, Thai, Urdu, and Tamil.

LDC has also released the following LRE corpora:

* 2003 NIST Language Recognition Evaluation (LDC2006S31)
* 2005 NIST Language Recognition Evaluation (LDC2008S05)
* 2007 NIST Language Recognition Evaluation Supplemental Training Data (LDC2009S05)
* 2009 NIST Language Recognition Evaluation Test Set (LDC2014S06)
* 2011 NIST Language Recognition Evaluation Test Set (LDC2018S06)

*Data*

Each speech file in the test data is one side of a 4-wire telephone conversation, represented in 8-bit, 8-kHz mu-law format. There are 7530 speech files in SPHERE (.sph) format for a total of 66 hours of speech. The speech data was compiled from LDC's CALLFRIEND, Fisher Spanish, and Mixer 3 corpora and from data collected by Oregon Health and Science University (OHSU), Beaverton, Oregon.

The test segments contain three nominal durations of speech: 3 seconds, 10 seconds and 30 seconds. Actual speech durations vary, but were constrained to be within the ranges of 2-4 seconds, 7-13 seconds and 23-35 seconds, respectively. Non-speech portions were retained so that each segment is a continuous sample of the source recording. Therefore, a test segment may be significantly longer than its speech duration, depending on how much non-speech was included. Unlike previous evaluations, the nominal duration for each test segment was not identified.

*Samples*

For an example of the data in this corpus, please listen to this sample (WAV).

*Updates*

None at this time.
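Because the nominal duration of each test segment is not identified, a user may want to recover it from an estimate of the amount of speech in a segment. The following is a minimal sketch of that mapping, using only the ranges stated in the description above. The function name `nominal_duration`, the treatment of the range bounds as inclusive, and the existence of some external speech-duration estimate are all assumptions for illustration; none of them comes from the corpus documentation.

```python
# Speech-duration ranges (seconds) from the description, keyed by nominal duration.
# Treating both bounds as inclusive is an assumption.
NOMINAL_RANGES = {
    3: (2.0, 4.0),
    10: (7.0, 13.0),
    30: (23.0, 35.0),
}


def nominal_duration(speech_seconds):
    """Return the nominal duration (3, 10 or 30) whose range contains
    speech_seconds, or None if the value falls outside every range."""
    for nominal, (low, high) in NOMINAL_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None


if __name__ == "__main__":
    for seconds in (3.2, 12.9, 23.0, 5.0):
        print(seconds, "->", nominal_duration(seconds))
```

Note that the input is the amount of speech, not the file length: since non-speech is kept in each segment, the total length of a file can exceed the upper bound of its range.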

Extent: Corpus size: 1861208 KB

Format: Sampling Rate: 8000

Format: Sampling Format: u-law
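As an illustration of what the 8000 Hz, 8-bit u-law format implies, the sketch below expands u-law bytes to 16-bit linear samples using the standard G.711 expansion, and converts a byte count to seconds (one byte per sample at 8000 samples per second). It works on raw sample bytes only; locating the samples inside a SPHERE file (skipping its header) is not shown. The function names are hypothetical and are not part of any LDC or NIST tool.

```python
SAMPLE_RATE_HZ = 8000  # from the Format field of this record
MULAW_BIAS = 0x84      # bias constant (132) used by G.711 u-law


def mulaw_decode_byte(code):
    """Expand one 8-bit u-law code (0-255) to a signed 16-bit linear sample."""
    code = ~code & 0xFF               # u-law bytes are stored bit-inverted
    sign = code & 0x80                # top bit: sign
    exponent = (code >> 4) & 0x07     # next three bits: segment number
    mantissa = code & 0x0F            # low four bits: step within the segment
    magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS
    return -magnitude if sign else magnitude


def mulaw_decode(data):
    """Decode a bytes object of u-law samples into a list of ints."""
    return [mulaw_decode_byte(b) for b in data]


def duration_seconds(num_sample_bytes, rate=SAMPLE_RATE_HZ):
    """Duration of a single-channel 8-bit stream: one byte per sample."""
    return num_sample_bytes / rate


if __name__ == "__main__":
    print(mulaw_decode(bytes([0xFF, 0x80, 0x00])))
    print(duration_seconds(80000))
```

A lookup table with the 256 decoded values would be faster for bulk decoding; the per-byte function is kept here so that the bit layout stays visible.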

Identifier: LDC2009S04

https://catalog.ldc.upenn.edu/LDC2009S04

ISBN: 1-58563-529-4

ISLRN: 994-591-828-190-6

DOI: 10.35111/vjag-sh89

Language: Yue Chinese

Vietnamese

Thai

Tamil

Spanish

Russian

Korean

Japanese

Hindi

Persian

English

German

Mandarin Chinese

Bengali

Standard Arabic

Dari

Iranian Persian

Arabic

Language (ISO639): yue

vie

tha

tam

spa

rus

kor

jpn

hin

fas

eng

deu

cmn

ben

arb

prs

pes

ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2009S04

Rights Holder: Portions © 2005 Oregon Health and Science University, © 1996, 2006, 2009 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009S04

DateStamp: 2021-09-10

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Martin, Alvin; Le, Audrey. 2009. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_AF country_BD country_CN country_DE country_ES country_GB country_IN country_IR country_JP country_KR country_RU country_SA country_TH country_VN dcmi_Sound iso639_ara iso639_arb iso639_ben iso639_cmn iso639_deu iso639_eng iso639_fas iso639_hin iso639_jpn iso639_kor iso639_pes iso639_prs iso639_rus iso639_spa iso639_tam iso639_tha iso639_vie iso639_yue olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009S04
Up-to-date as of: Wed Oct 29 7:01:10 EDT 2025
