OLAC Record: VAST Chinese Speech and Transcripts

OLAC Record
oai:www.ldc.upenn.edu:LDC2019S05

Metadata

Title: VAST Chinese Speech and Transcripts

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Tracey, Jennifer, Stephanie Strassel, and Neil Kuster. VAST Chinese Speech and Transcripts LDC2019S05. Web Download. Philadelphia: Linguistic Data Consortium, 2019

Contributor: Tracey, Jennifer

Contributor: Strassel, Stephanie

Contributor: Kuster, Neil

Date (W3CDTF): 2019

Date Issued (W3CDTF): 2019-03-15

Description:

*Introduction*

VAST Chinese Speech and Transcripts was developed by the Linguistic Data Consortium (LDC) for the VAST (Video Annotation for Speech Technologies) project. It comprises approximately 29 hours of Mandarin Chinese audio extracted from amateur video content harvested from the web, together with corresponding time-aligned transcripts.

The aim of the VAST project was to collect and annotate data in several languages to support the development of speech technologies such as speech activity detection, language identification, speaker identification, and speech recognition. The collection was designed to ensure that the audio covered a wide range of speakers, communication domains, noise environments, and data sources. The data in this corpus is the subset of files selected for transcription from the larger pool of Chinese data collected during the project.

*Data*

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC using XTrans, which supports manual transcription across multiple channels, languages, and platforms. Transcribers followed a Quick-Rich Transcription style; transcription guidelines are included in this release. A targeted second pass was made to check for various errors, to correct the use of transcription conventions, and to add marking for proper names.

The audio data is presented as 16 kHz, 16-bit FLAC-compressed files (.flac). When uncompressed, the audio files are in PCM MS-WAV format. Transcripts are UTF-8 encoded plain text files in tdf format.

*Samples*

Please view this audio sample and transcript sample.

*Updates*

None at this time.

Extent: Corpus size: 4064560 KB

Format: Sampling Rate: 16000

Format: Sampling Format: pcm

Identifier: LDC2019S05

Identifier: https://catalog.ldc.upenn.edu/LDC2019S05

Identifier: ISBN: 1-58563-879-X

Identifier: ISLRN: 067-262-881-745-5

Identifier: DOI: 10.35111/n1gk-as61

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2019S05

Rights Holder: Portions © 2011-2018 YouTube, LLC, © 2019 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2019S05

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
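
The GetRecord entries above refer to OAI-PMH requests for this record in two metadata formats. As a minimal sketch of how such a request URL is typically formed, the following builds a GetRecord query from the OaiIdentifier. The base URL is a hypothetical placeholder, since this record does not state the archive's OAI-PMH endpoint, and the prefix names `olac` and `oai_dc` are assumed to correspond to the "OLAC format" and "simple DC format" mentioned here.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier copied from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2019S05"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",            # OAI-PMH verb for fetching a single record
        "identifier": identifier,       # the record's OAI identifier
        "metadataPrefix": metadata_prefix,  # requested metadata format
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for the OLAC format, "oai_dc" for simple Dublin Core.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

Substituting the archive's actual endpoint for `BASE_URL` would be required before these URLs could return the record.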

Search Info
Citation: Tracey, Jennifer; Strassel, Stephanie; Kuster, Neil. 2019. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Sound dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2019S05
Up-to-date as of: Wed Oct 29 7:01:52 EDT 2025
