OLAC Record
oai:www.ldc.upenn.edu:LDC2004S11

Metadata
Title:2002 Rich Transcription Broadcast News and Conversational Telephone Speech
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Garofolo, John S., Jonathan Fiscus, and Audrey Le. 2002 Rich Transcription Broadcast News and Conversational Telephone Speech LDC2004S11. Web Download. Philadelphia: Linguistic Data Consortium, 2004
Contributor:Garofolo, John S.
Fiscus, Jonathan G.
Le, Audrey
Date (W3CDTF):2004
Date Issued (W3CDTF):2004-11-19
Description:*Introduction* 2002 Rich Transcription Broadcast News and Conversational Telephone Speech was produced by the Linguistic Data Consortium (LDC) and contains 10 hours of English broadcast news and conversational telephone speech audio, together with reference transcripts and annotation data. This corpus contains the test material used in the 2002 Rich Transcription (RT-02) Evaluation of Broadcast News and Conversational Telephone Speech, administered by the NIST Speech Group in the spring of 2002. The RT-02 Meeting Recognition Evaluation material is available in a separate distribution. For complete up-to-date information, see the RT-02 Evaluation Website.
The RT-02 Evaluation supported two main evaluation tasks:
* Speech-To-Text (STT) Tasks -- included three processing speeds (1x real time, 10x real time, and unlimited time) for both the Broadcast News (BN) and Conversational Telephone Speech (CTS) domains.
* Metadata Extraction (MDE) Task -- consisted of a speaker diarization task for the BN and CTS domains.
*Data* This distribution of the RT-02 Evaluation Data contains only Broadcast News and Conversational Telephone Speech data. Meeting data used in the RT-02 Evaluation is packaged in a separate distribution. All recordings are in English.
The BN data is composed of six approximately 10-minute excerpts from six different broadcasts. Each waveform is a SPHERE-headered, single-channel, 16-bit PCM file. The broadcasts were selected from programs from MNB, PRI, NBC, CNN, VOA, and ABC, all collected in 1998. Although each entire audio file is provided and is available for within-broadcast adaptation, only the excerpts listed in the system input UEM files are included in the evaluation. The evaluation excerpts were transcribed to the nearest story boundary.
The CTS data is composed of 60 approximately five-minute excerpts from 60 different conversations: 20 from Switchboard-1 data, 20 from Switchboard-2 data, and 20 from Switchboard Cellular-2 data. Evaluation excerpts were transcribed to the nearest turn. Unlike the BN audio files, which contain the full broadcasts, the CTS audio files contain only the evaluation excerpts. Each audio excerpt is a SPHERE-headered, two-channel, interleaved, 8-bit mu-law file.
The reference transcripts are also provided in this corpus. The official format for STT reference data is STM (files with the extension 'stm'), while the official format for MDE reference data is RTTM (files with the extension 'rttm'). Files with the extensions 'txt' or 'utf' are the original reference transcripts, before any format conversion or added annotation, and are included for completeness.
*Samples* Please examine this example to review a sample of this corpus.
*Updates* There are no updates available at this time.
The World is the co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
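The CTS audio layout described above (a SPHERE header followed by two-channel, interleaved, 8-bit mu-law samples) can be illustrated with a minimal Python sketch. It is not part of the corpus documentation. It assumes the common SPHERE layout (a first line "NIST_1A", a second line giving the header size in bytes, then "name -type value" fields ending with "end_head") and an uncompressed payload; files stored with embedded compression would need to be decompressed first. The function names are hypothetical, and the mu-law expansion follows the standard G.711 scheme.

```python
def parse_sphere_header(data):
    """Return (fields, header_size) for a NIST SPHERE header (assumed layout)."""
    # The first 16 bytes hold the magic line and the header size line.
    first = data[:16].decode("ascii").split("\n")
    if first[0] != "NIST_1A":
        raise ValueError("not a SPHERE file")
    header_size = int(first[1].strip())
    fields = {}
    body = data[16:header_size].decode("ascii", errors="replace")
    for line in body.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag, value
        if len(parts) == 3:
            name, type_flag, value = parts
            fields[name] = int(value) if type_flag == "-i" else value
    return fields, header_size


def mulaw_to_pcm16(byte):
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample (G.711)."""
    u = ~byte & 0xFF
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def decode_two_channel_mulaw(payload):
    """Split interleaved mu-law bytes into two lists of linear samples."""
    table = [mulaw_to_pcm16(b) for b in range(256)]
    channel_a = [table[b] for b in payload[0::2]]  # even-indexed bytes
    channel_b = [table[b] for b in payload[1::2]]  # odd-indexed bytes
    return channel_a, channel_b
```

For a file read fully into memory as `data`, `fields, size = parse_sphere_header(data)` followed by `decode_two_channel_mulaw(data[size:])` would yield one sample list per conversation side, assuming the header reports two channels and mu-law coding.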
Extent:Corpus size: 816364 KB
Identifier:LDC2004S11
https://catalog.ldc.upenn.edu/LDC2004S11
ISBN: 1-58563-311-9
ISLRN: 171-813-937-657-8
DOI: 10.35111/h4y1-2g85
Language:English
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2004S11
Rights Holder:Portions © 2004 Trustees of the University of Pennsylvania, © 1998 American Broadcasting Company, © 1998 National Broadcasting Company, Inc., © 1998 Cable News Network LP, LLP. All Rights Reserved, © 1998 Public Radio International.

Type (DCMI):Sound
Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2004S11
DateStamp:  2024-03-25
GetRecord:  OAI-PMH request for simple DC format
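The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request is an HTTP query with the `verb`, `identifier`, and `metadataPrefix` arguments defined by the OAI-PMH protocol. The base URL below is a placeholder assumption, not the archive's actual endpoint, and the helper name is hypothetical; `oai_dc` is the protocol's prefix for simple Dublin Core, and `olac` is assumed here as the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: replace with the real OAI-PMH base URL of the repository.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


dc_url = get_record_url("oai:www.ldc.upenn.edu:LDC2004S11", "oai_dc")
olac_url = get_record_url("oai:www.ldc.upenn.edu:LDC2004S11", "olac")
```

Fetching either URL from a real endpoint should return an XML document that wraps the record in a `GetRecord` element.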

Search Info

Citation: Garofolo, John S.; Fiscus, Jonathan G.; Le, Audrey. 2004. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004S11
Up-to-date as of: Fri Dec 6 7:46:52 EST 2024