OLAC Record: 2003 NIST Rich Transcription Evaluation Data

OLAC Record
oai:www.ldc.upenn.edu:LDC2007S10

Metadata

Title: 2003 NIST Rich Transcription Evaluation Data

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Fiscus, Jonathan G., et al. 2003 NIST Rich Transcription Evaluation Data LDC2007S10. Web Download. Philadelphia: Linguistic Data Consortium, 2007

Contributor: Fiscus, Jonathan G.

Doddington, George R.

Le, Audrey

Sanders, Greg

Przybocki, Mark

Pallett, David

Date (W3CDTF): 2007

Date Issued (W3CDTF): 2007-08-17

Description: *Introduction* 2003 NIST Rich Transcription Evaluation Data contains the test material used in the 2003 Rich Transcription Spring and Fall evaluations administered by the NIST (National Institute of Standards and Technology) Speech Group. The Spring evaluation (RT-03S), conducted in March-April 2003, focused on Speech-To-Text (STT) tasks for broadcast news (BN) speech and conversational telephone speech (CTS) in three languages: English, Mandarin Chinese and Arabic. That evaluation also included one Metadata Extraction (MDE) task: speaker diarization for broadcast news speech and conversational telephone speech in English. The Fall evaluation (RT-03F), conducted in October 2003, focused on MDE tasks, including speaker diarization, speaker-attributed STT, SU (sentence/semantic unit) detection and disfluency detection, for broadcast news speech and conversational telephone speech in English. For complete information about the evaluations, see the Rich Transcription Evaluation website.

*Data* The BN datasets were selected from TDT-4 sources collected in February 2001. The evaluation excerpts were transcribed to the nearest story boundary. The English BN dataset is approximately three hours long and is composed of 30-minute excerpts from six different broadcasts. The Mandarin Chinese BN dataset is approximately one hour long, consisting of 12-minute excerpts from five different broadcasts. The Arabic BN dataset is also approximately one hour long and contains 30-minute excerpts from two different broadcasts.

The CTS datasets consist of material from various LDC telephone speech collections. All evaluation excerpts were transcribed to the nearest turn. The English CTS set is approximately six hours long and is composed of 5-minute excerpts from 72 different conversations: 36 from the Switchboard Cellular collection and 36 from the Fisher collection. The Mandarin Chinese CTS dataset is approximately one hour long and consists of 5-minute excerpts from 12 different conversations from the CallFriend Mandarin Chinese data. The Arabic CTS set is also approximately one hour long and contains 5-minute excerpts from 12 different conversations from the CallHome Egyptian Arabic data.

No manual (human-annotated) segmentations were provided. Sites were required to generate their own segmentations automatically. Unlike the BN audio files, for which the full broadcasts were provided, the CTS audio files contain only the evaluation excerpts. Each audio excerpt is a SPHERE-headered, two-channel, interleaved 8-bit mu-law file.

*Samples* * English Broadcast News Audio * Indices * Transcriptions

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
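The audio format described above (a SPHERE header followed by two-channel, interleaved 8-bit mu-law samples) can be illustrated with a minimal Python sketch. It is not part of the corpus documentation. It assumes an uncompressed file whose header follows the usual NIST SPHERE layout (a `NIST_1A` line, a header-size line, `name -type value` fields, then `end_head`); the function names are hypothetical, and files stored with embedded compression would need to be decompressed first.

```python
def parse_sphere_header(data: bytes):
    """Return (header_size, fields) for an uncompressed NIST SPHERE byte string."""
    first, size_line = data[:16].split(b"\n")[:2]
    if first != b"NIST_1A":
        raise ValueError("not a SPHERE file")
    header_size = int(size_line.decode("ascii").strip())
    fields = {}
    body = data[16:header_size].decode("ascii", "replace")
    for line in body.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag, value
        if len(parts) == 3:
            name, type_flag, value = parts
            fields[name] = int(value) if type_flag == "-i" else value
    return header_size, fields


def mulaw_to_linear(byte: int) -> int:
    """Decode one 8-bit G.711 mu-law byte to a signed 16-bit-range sample."""
    u = ~byte & 0xFF             # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_channels(data: bytes):
    """Deinterleave a two-channel mu-law SPHERE file into two sample lists."""
    header_size, fields = parse_sphere_header(data)
    if fields.get("channel_count", 2) != 2:
        raise ValueError("this sketch assumes two interleaved channels")
    payload = data[header_size:]
    channel_a = [mulaw_to_linear(b) for b in payload[0::2]]
    channel_b = [mulaw_to_linear(b) for b in payload[1::2]]
    return channel_a, channel_b
```

Reading a file with `open(path, "rb").read()` and passing the bytes to `split_channels` yields one list of linear samples per side of the conversation.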

Extent: Corpus size: 2097152 KB

Identifier: LDC2007S10

https://catalog.ldc.upenn.edu/LDC2007S10

ISBN: 1-58563-446-8

ISLRN: 951-213-258-921-8

DOI: 10.35111/v8j8-m006

Language: English

Egyptian Arabic

Standard Arabic

Mandarin Chinese

Language (ISO639): eng

arz

arb

cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2007S10

Rights Holder: Portions © 2001 American Broadcasting Company, © 2001 Cable News Network, LP, LLLP, © 2001 China Broadcasting System (Taiwan), © 2001 China Central TV, © 2001 China National Radio, © 2001 China Television System (Taiwan), © 2001 National Broadcasting Company, © 2001 Nile TV, © 2001 Public Radio International, © 1996-2005, 2007 Trustees of the University of Pennsylvania

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2007S10

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
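
The GetRecord links in this record correspond to OAI-PMH requests. As a rough sketch, such a request URL can be assembled from the `verb`, `identifier` and `metadataPrefix` arguments defined by the protocol. The base URL below is a placeholder, not the archive's actual endpoint, and the helper name is hypothetical; `oai_dc` is the standard prefix for simple Dublin Core, and `olac` is assumed here for the OLAC format.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (base_url is supplied by the caller)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint: substitute the repository's real OAI-PMH base URL.
url = get_record_url(
    "https://example.org/oai",
    "oai:www.ldc.upenn.edu:LDC2007S10",
    "oai_dc",
)
print(url)
```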

Search Info
Citation: Fiscus, Jonathan G.; Doddington, George R.; Le, Audrey; Sanders, Greg; Przybocki, Mark; Pallett, David. 2007. Linguistic Data Consortium.
Terms: area_Africa area_Asia area_Europe country_CN country_EG country_GB country_SA dcmi_Sound iso639_arb iso639_arz iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2007S10
Up-to-date as of: Wed Oct 29 7:00:51 EDT 2025
