OLAC Record
oai:www.ldc.upenn.edu:LDC2004T13

Metadata
Title:NIST Meeting Pilot Corpus Transcripts and Metadata
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Garofolo, John S., et al. NIST Meeting Pilot Corpus Transcripts and Metadata LDC2004T13. Web Download. Philadelphia: Linguistic Data Consortium, 2004
Contributor:Garofolo, John S.
Michel, Martial
Stanford, Vincent M.
Tabassi, Elham
Fiscus, Jonathan G.
Laprun, Christophe D.
Pratz, Nicolas
Lard, Jerome
Strassel, Stephanie
Date (W3CDTF):2004
Date Issued (W3CDTF):2004-07-12
Description:*Introduction*
NIST Meeting Pilot Corpus Transcripts and Metadata was produced by the Linguistic Data Consortium (LDC). It contains the full speech transcripts created by LDC from about 15 hours of speech, as well as a metadata database with information about the meeting forums, topics, participants, recording conditions, and equipment. The corresponding speech files are available as the NIST Meeting Pilot Corpus Speech (LDC2004S09). These recordings and transcripts were made for the NIST Automatic Meeting Recognition Project.
Considerable effort is being spent on mining information from newswire, news broadcasts, and conversational speech; however, little has been done to address such applications in the more challenging and equally important meeting domain. Meetings have several important properties not found in other domains: they are diverse in formality and vocabulary, are highly interactive across multiple participants, use distant microphones and overlapping camera views, and require multi-media information integration. The development of smart meeting room core technologies that can automatically recognize and extract important information from multi-media sensor inputs will provide an invaluable resource for a variety of business, academic, and governmental applications.
*Data*
The data for the NIST Automatic Meeting Recognition Project was collected at the NIST Meeting Data Collection Laboratory in Gaithersburg, MD, and includes 19 meetings recorded between November 2001 and December 2003. The Pilot Corpus contains a total of 15:09:24 (hours:minutes:seconds) of exploitable data. A total of 61 subjects were involved in these meetings.
The following is a breakdown by participant origin and sex:
Origin      # Male Instances  # Unique Males  # Female Instances  # Unique Females  Total Participant Instances  Total Unique Participants
Native      54                30              33                  15                87                           45
Non-Native  18                11              10                  5                 28                           16
Total       72                41              43                  20                115                          61
The full transcriptions included in this release were created using a "quick" transcription procedure and stored in TXT format. There are approximately 151 K-words (thousands of words) and 6K unique words. A variety of information about the subjects and the recording setup was manually recorded during the collection of the pilot corpus. This information was stored in a relational database. An HTML snapshot of the database, taken on June 15th, 2004, is included here under the "metadata" directory.
*Samples*
Please view the following sample: Transcript (txt)
*Updates*
There are no updates available at this time.
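The participant breakdown in the Description can be checked for internal consistency. The sketch below is illustrative only: the counts are transcribed from the table in this record, while the names rows, total, and check_breakdown are hypothetical and not part of the corpus or its metadata database.

```python
# Participant counts transcribed from the breakdown table in the Description.
# Each tuple: (male instances, unique males, female instances, unique females,
#              total participant instances, total unique participants)
rows = {
    "Native": (54, 30, 33, 15, 87, 45),
    "Non-Native": (18, 11, 10, 5, 28, 16),
}
total = (72, 41, 43, 20, 115, 61)


def check_breakdown(rows, total):
    """Return True if every row and every column of the table adds up."""
    # Row check: male + female must equal the combined figure,
    # for both the instance counts and the unique counts.
    for m_inst, m_uniq, f_inst, f_uniq, t_inst, t_uniq in list(rows.values()) + [total]:
        if m_inst + f_inst != t_inst or m_uniq + f_uniq != t_uniq:
            return False
    # Column check: Native + Non-Native must equal the Total row.
    column_sums = tuple(sum(column) for column in zip(*rows.values()))
    return column_sums == total


print(check_breakdown(rows, total))
```

With the figures as published, every row and column adds up, and the 61 unique participants in the Total row match the 61 subjects stated in the Data section.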
Identifier:LDC2004T13
https://catalog.ldc.upenn.edu/LDC2004T13
ISBN: 1-58563-303-8
ISLRN: 682-718-319-529-5
DOI: 10.35111/dahz-tn26
Language:English
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2004T13
Rights Holder:Portions © 2004 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2004T13
DateStamp:  2024-03-20
GetRecord:  OAI-PMH request for simple DC format
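The GetRecord entries in this record refer to OAI-PMH requests. As a minimal sketch, such a request URL can be assembled from the OaiIdentifier above using the standard OAI-PMH query parameters (verb, identifier, metadataPrefix). The base URL below is a placeholder, not the archive's actual endpoint, and get_record_url is a hypothetical helper; "oai_dc" is the simple Dublin Core prefix defined by OAI-PMH, and "olac" is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


oai_id = "oai:www.ldc.upenn.edu:LDC2004T13"
print(get_record_url(oai_id, "olac"))    # OLAC format (assumed prefix)
print(get_record_url(oai_id, "oai_dc"))  # simple DC format
```

The colons in the identifier are percent-encoded by urlencode, which keeps the query string valid regardless of which characters an identifier contains.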

Search Info

Citation: Garofolo, John S.; Michel, Martial; Stanford, Vincent M.; Tabassi, Elham; Fiscus, Jonathan G.; Laprun, Christophe D.; Pratz, Nicolas; Lard, Jerome; Strassel, Stephanie. 2004. NIST Meeting Pilot Corpus Transcripts and Metadata. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T13
Up-to-date as of: Fri Dec 6 7:46:55 EST 2024