OLAC Record
oai:www.ldc.upenn.edu:LDC2019S05

Metadata
Title:VAST Chinese Speech and Transcripts
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Tracey, Jennifer, Stephanie Strassel, and Neil Kuster. VAST Chinese Speech and Transcripts LDC2019S05. Web Download. Philadelphia: Linguistic Data Consortium, 2019
Contributor:Tracey, Jennifer
Strassel, Stephanie
Kuster, Neil
Date (W3CDTF):2019
Date Issued (W3CDTF):2019-03-15
Description:*Introduction* VAST Chinese Speech and Transcripts was developed by the Linguistic Data Consortium (LDC) for the VAST (Video Annotation for Speech Technologies) project. It comprises approximately 29 hours of Mandarin Chinese audio extracted from amateur video content harvested from the web, together with corresponding time-aligned transcripts. The aim of the VAST project was to collect and annotate data in several languages to support the development of speech technologies such as speech activity detection, language identification, speaker identification, and speech recognition. The collection was designed to ensure that the audio covered a wide range of speakers, communication domains, noise environments, and data sources. The data in this corpus is the subset of files selected for transcription from the larger pool of Chinese data collected during the project. *Data* The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC using XTrans, which supports manual transcription across multiple channels, languages, and platforms. Transcribers followed a Quick-Rich Transcription style; transcription guidelines are included in this release. A targeted second pass was made to check for various errors, to correct the use of transcription conventions, and to add marking for proper names. The audio data is presented as 16 kHz, 16-bit FLAC-compressed files (.flac). When uncompressed, the audio files are in PCM MS-WAV format. Transcripts are UTF-8 encoded plain text files in tdf format. *Samples* Please view this audio sample and transcript sample. *Updates* None at this time.
Extent:Corpus size: 4064560 KB
Format:Sampling Rate: 16000
Sampling Format: pcm
Identifier:LDC2019S05
https://catalog.ldc.upenn.edu/LDC2019S05
ISBN: 1-58563-879-X
ISLRN: 067-262-881-745-5
Language:Mandarin Chinese
Language (ISO639):cmn
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/LDC%20User%20Agreement%20for%20Non-Members.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2019S05
Rights Holder:Portions © 2011-2018 YouTube, LLC, © 2019 Trustees of the University of Pennsylvania
Type (DCMI):Sound
Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2019S05
DateStamp:  2020-01-06
GetRecord:  OAI-PMH request for simple DC format
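
The GetRecord entries above refer to OAI-PMH requests for this record in OLAC and simple DC formats. As a rough sketch of what such a request looks like, the snippet below builds a GetRecord URL from the OaiIdentifier. The verb, identifier, and metadataPrefix parameters are defined by the OAI-PMH protocol, and oai_dc is its standard prefix for simple Dublin Core. The base URL is a placeholder (the record does not give the endpoint), the olac prefix is assumed to be what the archive uses for OLAC format, and the function name get_record_url is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the actual OAI-PMH base URL is not stated in the record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix="olac", base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper).

    metadata_prefix: "olac" (assumed prefix for OLAC format) or
    "oai_dc" (the OAI-PMH standard prefix for simple Dublin Core).
    """
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Usage with the identifier from this record.
olac_url = get_record_url("oai:www.ldc.upenn.edu:LDC2019S05")
dc_url = get_record_url("oai:www.ldc.upenn.edu:LDC2019S05", "oai_dc")
```

Note that urlencode percent-encodes the colons in the identifier, which a conforming OAI-PMH server decodes before looking up the record.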

Search Info

Citation: Tracey, Jennifer; Strassel, Stephanie; Kuster, Neil. 2019. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Sound dcmi_Text iso639_cmn olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2019S05
Up-to-date as of: Sat Jan 18 13:58:39 EST 2020