OLAC Record: HUB5 Mandarin Telephone Speech and Transcripts Second Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2018S18

Metadata

Title: HUB5 Mandarin Telephone Speech and Transcripts Second Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Linguistic Data Consortium. HUB5 Mandarin Telephone Speech and Transcripts Second Edition LDC2018S18. Web Download. Philadelphia: Linguistic Data Consortium, 2018

Contributor: Linguistic Data Consortium

Date (W3CDTF): 2018

Date Issued (W3CDTF): 2018-12-17

Description: *Introduction* HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by the Linguistic Data Consortium (LDC) in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC as two data sets, HUB5 Mandarin Telephone Speech Corpus (LDC98S69) and HUB5 Mandarin Transcripts (LDC98T26). This second edition merges the speech and transcript releases, updates the audio format, and adds Pinyin transcripts, forced alignment, and updated documentation and metadata.
*Data* This release consists of (1) approximately 19 hours of Mandarin speech from 42 unscripted telephone conversations between native speakers of Mandarin, drawn from CALLFRIEND Mandarin Chinese-Mainland Dialect (LDC96S55), which has also been released in a second, updated edition (LDC2018S09), and (2) associated transcripts of contiguous 5- to 30-minute segments from those telephone conversations.
Audio data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations lasted up to 30 minutes. The audio data was recorded as 8 kHz u-law SPH-encoded stereo files with one end of the phone call on each channel. In this release, files were converted to WAV format, and information from the original SPH headers is included with the corpus. SPH files are not included in this second edition.
Completed calls passed through two human audits. The first audit verified that the participants spoke the target language and checked the quality of the recordings. The second audit was conducted by a native speaker familiar with Mainland and Taiwan Mandarin dialects, who classified each conversation under one of those two categories. Audit information is available in the corpus documentation.
Transcripts were created manually by native Mandarin speakers in the GB2312 encoding. This release adds Pinyin transliterations of the transcripts in UTF-8 and includes the original transcripts converted to UTF-8. For forced alignment, files were converted to linear-PCM encoding, and the speaker channels were split into separate files to avoid overlapping speech. The alignments are presented in tab-separated files and in TextGrid files. Alignment data is provided in UTF-8.
*Samples* Please view the following samples:
* Audio
* Transcript
* Pinyin Transcript
* Tab-Separated Alignment
* Phoneme TextGrid
* Word TextGrid
*Updates* None at this time.
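The conversions mentioned in the description (u-law to linear PCM, splitting the two speaker channels, and re-encoding GB2312 text as UTF-8) can be illustrated with a minimal Python sketch. This is not LDC's actual tooling, which the record does not describe. All function names are hypothetical, the u-law decoder follows the standard G.711 expansion, and the audio input is assumed to be raw interleaved 8-bit u-law sample bytes that have already been extracted from a file.

```python
import struct
import wave


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit G.711 u-law code to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                      # u-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def split_channels(interleaved: bytes) -> tuple[bytes, bytes]:
    """Separate interleaved stereo u-law bytes into one byte string per channel."""
    return interleaved[0::2], interleaved[1::2]


def write_mono_wav(target, ulaw_bytes: bytes, rate: int = 8000) -> None:
    """Decode u-law bytes and write them as a 16-bit mono PCM WAV.

    `target` may be a file path or a writable, seekable binary file object.
    """
    samples = [ulaw_to_linear(b) for b in ulaw_bytes]
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    with wave.open(target, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)               # 2 bytes = 16-bit linear PCM
        out.setframerate(rate)
        out.writeframes(pcm)


def gb2312_to_utf8(raw: bytes) -> bytes:
    """Re-encode GB2312 transcript bytes as UTF-8."""
    return raw.decode("gb2312").encode("utf-8")
```

Writing each channel to its own mono file mirrors the per-speaker split described above, so that overlapping speech from the two callers does not end up in the same signal.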

Extent: Corpus size: 1244136 KB

Format: Sampling Rate: 8000 Hz

Format: Sampling Format: ulaw

Identifier: LDC2018S18

https://catalog.ldc.upenn.edu/LDC2018S18

ISBN: 1-58563-867-6

ISLRN: 299-779-903-540-2

DOI: 10.35111/4js2-xd38

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2018S18

Rights Holder: Portions © 1996, 1998, 2018 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2018S18

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
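
The GetRecord entries above refer to OAI-PMH requests whose links did not survive in this text. As a rough sketch, such a request URL can be assembled from the record's OAI identifier and a metadata prefix. The base URL below is a placeholder, since this record does not state the repository endpoint, and the function name is hypothetical. `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core, while `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the repository's base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# One request per format listed in this record.
olac_url = get_record_url("oai:www.ldc.upenn.edu:LDC2018S18", "olac")
dc_url = get_record_url("oai:www.ldc.upenn.edu:LDC2018S18", "oai_dc")
```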

Search Info
Citation: Linguistic Data Consortium. 2018. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Sound dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2018S18
Up-to-date as of: Wed Oct 29 7:01:51 EDT 2025
