OLAC Record: 2011 NIST Language Recognition Evaluation Test Set

OLAC Record
oai:www.ldc.upenn.edu:LDC2018S06

Metadata

Title: 2011 NIST Language Recognition Evaluation Test Set

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Greenberg, Craig, et al. 2011 NIST Language Recognition Evaluation Test Set LDC2018S06. Web Download. Philadelphia: Linguistic Data Consortium, 2018

Contributor: Greenberg, Craig

Martin, Alvin

Graff, David

Walker, Kevin

Jones, Karen

Strassel, Stephanie

Date (W3CDTF): 2018

Date Issued (W3CDTF): 2018-08-15

Description: *Introduction*

2011 NIST Language Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). It contains selected training data and the evaluation test set for the 2011 NIST Language Recognition Evaluation, approximately 204 hours of conversational telephone speech and broadcast audio collected by LDC in the following 24 languages and dialects: Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Bengali, Czech, Dari, English (American), English (Indian), Farsi, Hindi, Lao, Mandarin, Punjabi, Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish, Ukrainian and Urdu.

The goal of NIST's Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted language recognition evaluations in 1996, 2003, 2005, 2007, and 2009. The 2011 evaluation emphasized the language pair condition and involved both conversational telephone speech (CTS) and broadcast narrow-band speech (BNBS). Further information regarding this evaluation can be found in the evaluation plan which is also included in the documentation for this release.

LDC released the prior LREs as:

* 2003 NIST Language Recognition Evaluation (LDC2006S31)
* 2005 NIST Language Recognition Evaluation (LDC2008S05)
* 2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
* 2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
* 2009 NIST Language Recognition Evaluation Test Set (LDC2014S06)

*Data*

This release includes training data for nine language varieties that had not been represented in prior LRE cycles -- Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Czech, Lao, Punjabi, Polish, and Slovak -- contained in 893 audited segments of roughly 30 seconds duration and in 400 full-length CTS recordings. The evaluation test set comprises a total of 29,511 audio files, all manually audited at LDC for language and divided equally into three different test conditions according to the nominal amount of speech content per segment. Data was collected between 2009 and 2011, and has been released by LDC as individual corpora grouped by language.

The CTS data was obtained using a "claque" collection model in which speakers (claques) called friends or relatives in their social network for a 10-minute conversation in the claque's native language, such that each call would involve a unique callee. Participants were free to speak on topics of their own choosing. All calls were routed through a telephone collection system at LDC which stored the raw mu-law sample stream into separate audio files for each call side. Auditing and selection were applied to the callee side of every call and to the caller (claque) side in at most one call made by each claque. Contiguous regions containing between 25 and 35 seconds of speech were identified by signal analysis and extracted for manual audit. In some cases, shorter segments were also selected for audit.

Broadcast audio was recorded via capture of satellite-receiver MPEG streams or analog audio receivers digitizing at 16 kHz. Platforms for data capture were located at LDC and in Tunisia and India. Recordings were analyzed to extract contiguous segments of narrow-band speech of at least 33 seconds duration; longer segments were trimmed to a maximum length of 35 seconds for audit.

All audited segments for training and test are presented as 8-kHz, 16-bit PCM, single-channel audio files with NIST SPHERE headers. The full-length CTS data is the same, except that it consists of two channels.

*Samples*

For examples of the data in this corpus, please listen to this Urdu sample (SPH), Pashto sample (SPH), and English sample (SPH).

*Updates*

None at this time.
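The audio format stated above (8-kHz, 16-bit PCM with NIST SPHERE headers, one channel for audited segments and two for full-length CTS) can be checked from a file's header. The following is a minimal sketch, not part of the LDC release: it assumes the common SPHERE layout of an ASCII header that begins with `NIST_1A`, gives its own size on the second line, lists `name -type value` fields, and ends with `end_head`. The function names and the synthetic header bytes are illustrative only; no real corpus file is read here.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header found at the start of a NIST SPHERE file."""
    first = raw.split(b"\n", 2)
    if len(first) < 3 or first[0] != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    # The second line holds the total header size in bytes (commonly 1024).
    header_size = int(first[1].decode("ascii").strip())
    text = raw[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":      # integer field
            fields[name] = int(value)
        elif ftype == "-r":    # real-valued field
            fields[name] = float(value)
        else:                  # string field, e.g. "-s3"
            fields[name] = value
    return fields


def matches_described_format(fields: dict) -> bool:
    """Check the properties stated in the description: 8 kHz, 16-bit, 1 or 2 channels."""
    return (
        fields.get("sample_rate") == 8000
        and fields.get("sample_n_bytes") == 2
        and fields.get("channel_count") in (1, 2)
    )


# Synthetic header for illustration only (not taken from the corpus).
example = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 8000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024)

header = parse_sphere_header(example)
print(header, matches_described_format(header))
```

To apply this to a real file, one would read the first 1024 bytes in binary mode and pass them to `parse_sphere_header`. Actual headers in the release may carry additional fields that this sketch simply stores as strings or numbers.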

Extent: Corpus size: 16115384 KB

Format: Sampling Rate: 8000

Sampling Format: pcm

Identifier: LDC2018S06

https://catalog.ldc.upenn.edu/LDC2018S06

ISBN: 1-58563-846-3

ISLRN: 766-428-831-656-3

DOI: 10.35111/8j8e-vy57

Language: Mesopotamian Arabic

Bengali

Czech

Dari

English

Persian

Hindi

Lao

Mandarin Chinese

Panjabi

Pushto

Polish

Russian

Slovak

Spanish

Tamil

Thai

Turkish

Ukrainian

Urdu

Standard Arabic

Levantine Arabic

Arabic

Language (ISO639): acm

ben

ces

prs

eng

fas

hin

lao

cmn

pan

pus

pol

rus

slk

spa

tam

tha

tur

ukr

urd

arb

ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2018S06

Rights Holder: Portions © 2011 ABP News, © 2010 Alkass Sports Channel, © 2010-2011 Aljazeera, © 2010 Al Mustakillah TV, © 2010-2011 Al-Shirkatul Islamiyyah, © 2010 Alsumaria TV, © 2010 American Broadcasting Company, © 2011 Amrit Bani TV, © 2010 Android Television Network, © 2011 Appadana International Broadcasting Corp., © 2010-2011 Ariamehr International TV, © 2010-2011 Ariana Afghanistan International Television Network, © 2010 Assyria Sat, © 2010-2011 Atimemedia Co., Ltd, © 2011 Bayyinah Productions LLC, © 2009-2011 BBC, © 2010 Cable News Network, LP, LLLP, © 2010-2011 Channel One TV, © 2009-2010 China Central TV, © 2011 Czech Television, © 2011 ET Now, © 2011 Frequency 1, © 2010 Impact Television Network, © 2011 Independent News Service, © 2010 Iran TVNetwork.com, © 2010 IRINN, © 2010 Jiangsu Radio and Television General Station, © 2010 National Broadcasting Company, Inc., © 2010-2011 NATTV.com, © 2011 Natural TV, © 2010 New Tang Dynasty TV, © 2010-2011 Persian Radio & Iranian Live TV, © 2010 Persian TV One, © 2010 Phoenix TV, © 2010-2011 Polskie Radio S.A., © 2010 Qatar Radio, © 2011 Radio National Television of Laos, © 2010-2011 Radio Sedaye Ashena, © 2010-2011 Radio Television of Afghanistan, © 2010 Radio Tunis, © 2010-2011 Rangarang, © 2010 RAZ-E-ZINDAGI, © 2011 Rajya Sahba TV, © 2011 RTVS - Rozhlas a televízia Slovenska, © 2010 SAT-7 International, © 2010 Sharjah Media Corporation, © 2009 Spanish Radio and Television Corporation, © 2010 Syria TV, © 2010-2011 Thai TV Global Network, © 2011 TOLOnews, © 2010-2011 TRT World, © 2010 UTR, © 2010 Yemen TV, © 2009, 2010, 2011, 2018 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2018S06

DateStamp: 2021-09-10

GetRecord: OAI-PMH request for simple DC format
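The GetRecord links above correspond to OAI-PMH requests for this record's identifier in two metadata formats. As a rough sketch of how such a request URL is formed, the snippet below builds the query string from the three GetRecord arguments defined by OAI-PMH (`verb`, `identifier`, `metadataPrefix`). The base URL is a placeholder, not the real endpoint, and the prefix `olac` for the OLAC format is an assumption; `oai_dc` is the standard prefix for simple Dublin Core.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Hypothetical endpoint; substitute the repository's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"
RECORD_ID = "oai:www.ldc.upenn.edu:LDC2018S06"

olac_url = get_record_url(BASE_URL, RECORD_ID, "olac")   # assumed OLAC prefix
dc_url = get_record_url(BASE_URL, RECORD_ID, "oai_dc")   # simple Dublin Core
print(olac_url)
print(dc_url)
```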

Search Info
Citation: Greenberg, Craig; Martin, Alvin; Graff, David; Walker, Kevin; Jones, Karen; Strassel, Stephanie. 2018. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_AF country_BD country_CN country_CZ country_ES country_GB country_IN country_IQ country_LA country_PK country_PL country_RU country_SA country_SK country_TH country_TR country_UA dcmi_Sound iso639_acm iso639_ara iso639_arb iso639_ben iso639_ces iso639_cmn iso639_eng iso639_fas iso639_hin iso639_lao iso639_pan iso639_pol iso639_prs iso639_pus iso639_rus iso639_slk iso639_spa iso639_tam iso639_tha iso639_tur iso639_ukr iso639_urd olac_primary_text
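The Terms line is a space-separated list in which each term appears to be a facet name followed by an underscore and a value. A small sketch of grouping such terms by facet is shown below; the function name is illustrative, the sample string is a shortened excerpt of the line above, and the split-on-first-underscore rule is an assumption inferred from the terms themselves.

```python
from collections import defaultdict


def group_terms(terms: str) -> dict:
    """Group space-separated search terms by the prefix before the first underscore."""
    groups = defaultdict(list)
    for term in terms.split():
        facet, _, value = term.partition("_")
        groups[facet].append(value)
    return dict(groups)


# Shortened excerpt of the Terms line, for illustration.
sample = "area_Asia area_Europe country_AF dcmi_Sound iso639_acm iso639_urd olac_primary_text"
facets = group_terms(sample)
print(facets)
```

Splitting on the first underscore only keeps multi-word values such as `primary_text` intact.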

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2018S06
Up-to-date as of: Wed Oct 29 7:01:48 EDT 2025
