OLAC Record oai:www.ldc.upenn.edu:LDC2019S05 |
Metadata | ||
Title: | VAST Chinese Speech and Transcripts | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Tracey, Jennifer, Stephanie Strassel, and Neil Kuster. VAST Chinese Speech and Transcripts LDC2019S05. Web Download. Philadelphia: Linguistic Data Consortium, 2019 | |
Contributor: | Tracey, Jennifer | |
Strassel, Stephanie | ||
Kuster, Neil | ||
Date (W3CDTF): | 2019 | |
Date Issued (W3CDTF): | 2019-03-15 | |
Description: | *Introduction* VAST Chinese Speech and Transcripts was developed by the Linguistic Data Consortium (LDC) for the VAST (Video Annotation for Speech Technologies) project. It consists of approximately 29 hours of Mandarin Chinese audio, extracted from amateur video content harvested from the web, and the corresponding time-aligned transcripts. The aim of the VAST project was to collect and annotate data in several languages to support the development of speech technologies such as speech activity detection, language identification, speaker identification, and speech recognition. The collection was designed to ensure that the audio covered a wide range of speakers, communication domains, noise environments, and data sources. The data in this corpus is the subset of files selected for transcription from the larger pool of Chinese data collected during the project. *Data* The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC using XTrans, which supports manual transcription across multiple channels, languages, and platforms. Transcribers followed a Quick-Rich Transcription style; transcription guidelines are included in this release. A targeted second pass was made to check for various errors, to correct the use of transcription conventions, and to add marking for proper names. The audio data is presented as 16 kHz, 16-bit FLAC-compressed files (.flac). When uncompressed, the audio files are in PCM MS-WAV format. Transcripts are UTF-8 encoded plain text files in tdf format. *Samples* Please view this audio sample and transcript sample. *Updates* None at this time. | |
Extent: | Corpus size: 4064560 KB | |
Format: | Sampling Rate: 16000 | |
Sampling Format: pcm | ||
Identifier: | LDC2019S05 | |
https://catalog.ldc.upenn.edu/LDC2019S05 | ||
ISBN: 1-58563-879-X | ||
ISLRN: 067-262-881-745-5 | ||
DOI: 10.35111/n1gk-as61 | ||
Language: | Mandarin Chinese | |
Language (ISO639): | cmn | |
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2019S05 | |
Rights Holder: | Portions © 2011-2018 YouTube, LLC, © 2019 Trustees of the University of Pennsylvania | |
Type (DCMI): | Sound | |
Text | ||
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2019S05 | |
DateStamp: | 2020-11-30 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Tracey, Jennifer; Strassel, Stephanie; Kuster, Neil. 2019. Linguistic Data Consortium. | |
Terms: | area_Asia country_CN dcmi_Sound dcmi_Text iso639_cmn olac_primary_text |
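
The GetRecord rows above refer to OAI-PMH requests for this record in OLAC and simple DC formats. As a minimal sketch of how such a request URL could be assembled, the Python below builds the query string from the OaiIdentifier listed in this record. The base URL is a hypothetical placeholder, since the record does not state the repository's endpoint, and the metadata prefixes `olac` and `oai_dc` are assumed to be the ones the repository exposes for those two formats.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the repository's actual OAI-PMH base URL,
# which is not given in this record.
BASE_URL = "https://example.org/oai"

# OaiIdentifier taken from the OAI Info section of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2019S05"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for OLAC format, "oai_dc" for simple Dublin Core.
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

`urlencode` percent-encodes the colons in the identifier, which keeps the URL valid when it is passed to an HTTP client.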