OLAC Record: Mandarin-English Code-Switching in South-East Asia

OLAC Record
oai:www.ldc.upenn.edu:LDC2015S04

Metadata

Title: Mandarin-English Code-Switching in South-East Asia

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Nanyang Technological University, and Universiti Sains Malaysia. Mandarin-English Code-Switching in South-East Asia LDC2015S04. Web Download. Philadelphia: Linguistic Data Consortium, 2015

Contributor: Nanyang Technological University

Contributor: Universiti Sains Malaysia

Date (W3CDTF): 2015

Date Issued (W3CDTF): 2015-04-15

Description: *Introduction* Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia in Singapore and Malaysia, respectively. It comprises approximately 192 hours of Mandarin-English code-switching speech from 156 speakers, with associated transcripts. Code-switching refers to the practice of shifting between languages or language varieties during conversation. This corpus focuses on the shift between Mandarin and English by Malaysian and Singaporean speakers. Speakers engaged in unscripted conversations and interviews. In the conversational speech segments, two speakers conversed freely with each other. The interviews consisted of questions from an interviewer and answers from an interviewee; only the interviewee's speech was recorded. Topics discussed include hobbies, friends, and daily activities.
*Data* The speakers were gender-balanced (49.7% female, 50.3% male) and between 19 and 33 years of age. Over 60% of the speakers were Singaporean; the rest were Malaysian. The speech was recorded in a quiet room using several microphones and recording devices. Details about the recording conditions are contained in the documentation provided with this release. The audio files in this corpus are 16 kHz, 16-bit recordings in FLAC-compressed WAV format, between 20 and 120 minutes in length. Selected segments of the audio recordings were transcribed. Most of those segments contain code-switching utterances. The transcription for each audio file is stored as a UTF-8 tab-separated text file. Development and training divisions are available as a separate download (SEAME_train_dev_division.zip) and on the provider's GitHub page.
*Samples* Please view this audio sample and transcript sample.
*Updates* As of 12/14/2015, an additional set of transcription files was added for all the audio. The new transcriptions are based on the original transcriptions, with the previously untranscribed utterances added. A language label was also added for each utterance in the transcription. File directories were changed to reflect the update; specifically, the change was made under /data/{recording_type}/transcript/{phase_number}/ where:
- {recording_type} is 'conversation' or 'interview'
- {phase_number} is 'phaseI' or 'phaseII'
  - 'phaseI' contains all the existing transcriptions from the first release
  - 'phaseII' contains the newly updated transcriptions, in which some typos and wrong boundary markers were corrected. Previously untranscribed segments, which are normally monolingual, and a language label for each segment were added.
The corpus documentation was also updated to describe the new release in detail in section 3) Transcription.
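The directory layout and transcript format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus root passed in is hypothetical, the function names are invented for this example, and because the record does not specify the transcript columns, each row is returned as a plain list of fields.

```python
import csv
from itertools import product
from pathlib import PurePosixPath

# Values taken from the description of the December 2015 update.
RECORDING_TYPES = ("conversation", "interview")
PHASES = ("phaseI", "phaseII")


def transcript_dirs(corpus_root):
    """Return every <root>/data/{recording_type}/transcript/{phase_number} path.

    corpus_root is a hypothetical location of the unpacked corpus.
    """
    root = PurePosixPath(corpus_root)
    return [
        root / "data" / recording_type / "transcript" / phase
        for recording_type, phase in product(RECORDING_TYPES, PHASES)
    ]


def parse_transcript_lines(lines):
    """Split tab-separated transcript lines into lists of fields.

    The column meanings are not given in this record, so no names are assigned.
    QUOTE_NONE keeps quotation marks in the transcribed text untouched.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]
```

When reading a real transcript, opening the file with `open(path, encoding="utf-8", newline="")` and passing the file object to `parse_transcript_lines` would match the UTF-8 encoding stated in the description.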

Extent: Corpus size: 8472528 KB

Format: Sampling Rate: 16000

Format: Sampling Format: flac

Identifier: LDC2015S04

https://catalog.ldc.upenn.edu/LDC2015S04

ISBN: 1-58563-709-2

ISLRN: 594-468-772-379-0

DOI: 10.35111/5gyy-zq54

Language: Mandarin Chinese

Language: English

Language (ISO639): cmn

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Provenance: Collected by Nanyang Technological University (NTU) in Singapore and Universiti Sains Malaysia (USM) in Malaysia.

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2015S04

Rights Holder: Portions © 2015 Nanyang Technological University, Universiti Sains Malaysia, Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2015S04

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
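The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request URL can be assembled from the OaiIdentifier and a metadata prefix. The `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol; the base URL below is a placeholder because this record does not state the repository endpoint, and the function name is invented for illustration. `oai_dc` is the standard prefix for simple Dublin Core; the prefix for the OLAC format is assumed to be `olac`.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the repository's base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Identifier copied from the OaiIdentifier field of this record.
url = get_record_url("oai:www.ldc.upenn.edu:LDC2015S04")
```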

Search Info
Citation: Nanyang Technological University; Universiti Sains Malaysia. 2015. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Sound dcmi_Text iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2015S04
Up-to-date as of: Wed Oct 29 7:01:30 EDT 2025
