OLAC Record: CSLU: Names Release 1.3

OLAC Record
oai:www.ldc.upenn.edu:LDC2006S39

Metadata

Title: CSLU: Names Release 1.3

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Muthusamy, Yeshwant, Ronald Cole, and Beatrice Oshika. CSLU: Names Release 1.3 LDC2006S39. Web Download. Philadelphia: Linguistic Data Consortium, 2006

Contributor: Muthusamy, Yeshwant

Cole, Ronald Allan

Oshika, Beatrice

Date (W3CDTF): 2006

Date Issued (W3CDTF): 2006-07-21

Description: *Introduction*

CSLU: Names Release 1.3 was developed by the Center for Spoken Language Understanding (CSLU) and contains 24,245 files totalling over 6 hours of name utterances, both first and last names, collected over the telephone from several thousand different speakers, along with transcripts.

A common problem in training and developing speech recognition systems is scarcity of data, especially for particular phonemic contexts. The CSLU is attempting to address this problem with the Names Corpus. Name utterances are "spontaneous" in that the subject is not reading from a word list. Another area of active research is the development of name recognition systems, for which the Names Corpus is also a useful resource.

*Data*

The utterances in this corpus were taken from many other telephone speech data collections completed at the CSLU. In most data collections, the callers were asked to leave their name at some point. Callers would also occasionally say their name in the midst of another utterance; in these cases the name was extracted from the host utterance and added to the Names Corpus.

Each file in the Names Corpus has an orthographic transcription following the CSLU Labeling Conventions. To take advantage of the phonemic variability, many of the utterances have also been phonetically transcribed. Files were selected for phonetic transcription by a process that favored those suspected to contain phonetic contexts not yet transcribed.

There are three file formats used in this corpus:

* The .wav file is a 16-bit, linearly encoded file in the standard RIFF format.
* The .txt file is an ASCII text file containing the orthographic transcription.
* The .phn file contains a time-aligned phonetic transcription.

Release 1.3 of this corpus contains 24,245 files, all of which have been phonetically labeled. Approximately 40% of the possible bigram phonemic contexts, without regard to language constraints, are represented.

*Samples*

For an example of the data in this publication, please listen to this audio sample (WAV) and view its transcription (TXT).

*Updates*

None at this time.
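
As a rough illustration of the file formats listed under *Data*, the following Python sketch reads the header of one .wav file and looks for its orthographic transcript. It rests on assumptions the record does not state: that each transcript shares the base name of its audio file and sits in the same directory, and that the audio is linear PCM as the description says. The function name `describe_utterance` is hypothetical. The .phn layout is not specified in this record, so it is not parsed here.

```python
import wave
from pathlib import Path


def describe_utterance(wav_path):
    """Return header facts and the orthographic transcript for one utterance."""
    wav_path = Path(wav_path)

    # The standard-library wave module reads linear PCM RIFF files only.
    with wave.open(str(wav_path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        info = {
            "sample_rate_hz": rate,
            "bits_per_sample": 8 * wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }

    # Assumption: the transcript has the same stem as the audio file.
    txt_path = wav_path.with_suffix(".txt")
    if txt_path.exists():
        info["transcript"] = txt_path.read_text(encoding="ascii").strip()
    else:
        info["transcript"] = None
    return info
```

Note that the Format field of this record lists the sampling format as ulaw. If a file turns out to be mu-law encoded rather than linear PCM, `wave.open` will raise `wave.Error`, and a different reader would be needed.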

Extent: Corpus size: 487424 KB

Format: Sampling Rate: 8000

Format: Sampling Format: ulaw

Identifier: LDC2006S39

https://catalog.ldc.upenn.edu/LDC2006S39

ISBN: 1-58563-394-1

ISLRN: 972-485-703-759-3

DOI: 10.35111/qyw6-w652

Language: English

Language (ISO639): eng

License: CSLU Agreement: https://catalog.ldc.upenn.edu/license/cslu-corpora-non-commercial-research-only.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2006S39

Rights Holder: Portions © 2001, 2003 Speech Technology Center Ltd., © 2006 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2006S39

DateStamp: 2021-06-14

GetRecord: OAI-PMH request for simple DC format
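
The GetRecord entries above refer to OAI-PMH requests. A minimal sketch of how such a request URL is assembled is shown below. The query parameters `verb`, `identifier`, and `metadataPrefix` come from the OAI-PMH protocol, and `oai_dc` is its standard prefix for simple Dublin Core. The base URL is a placeholder, since this record does not state the service endpoint, and the `olac` prefix is assumed to be the one used for the OLAC format. The helper name `get_record_url` is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Identifier taken from the OaiIdentifier field of this record.
olac_url = get_record_url("oai:www.ldc.upenn.edu:LDC2006S39", "olac")
dc_url = get_record_url("oai:www.ldc.upenn.edu:LDC2006S39", "oai_dc")
```

`urlencode` percent-encodes the colons in the identifier, which keeps the query string valid regardless of the characters an identifier contains.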

Search Info
Citation: Muthusamy, Yeshwant; Cole, Ronald Allan; Oshika, Beatrice. 2006. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006S39
Up-to-date as of: Wed Oct 29 7:00:20 EDT 2025
