OLAC Record: Voicemail Corpus Part II

OLAC Record
oai:www.ldc.upenn.edu:LDC2002S35

Metadata

Title: Voicemail Corpus Part II

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Padmanabhan, Mukund, et al. Voicemail Corpus Part II LDC2002S35. Web Download. Philadelphia: Linguistic Data Consortium, 2002

Contributor: Padmanabhan, Mukund

Kingsbury, Brian

Ramabhadran, Bhuvana

Huang, Jing

Chen, Stanley

Saon, George

Mangu, Lidia

Date (W3CDTF): 2002

Date Issued (W3CDTF): 2002-11-08

Description: *Introduction* Voicemail Corpus Part II was produced by the Linguistic Data Consortium (LDC) under catalog number LDC2002S35 and ISBN 1-58563-242-2. It is a continuation of Voicemail Corpus Part I, LDC98S77.

*Data* This publication comprises speech and script files, divided into training and evaluation data. The training data consists of 2,048 voicemail messages and the corresponding script files. The speech and script files are organized in 41 directories, each of which contains up to 50 messages. The evaluation data consists of 50 voicemail messages and 50 scripts. The speech data is provided in SPHERE format, sampled at 8 kHz and recorded in 8-bit ulaw, totalling approximately 14 hours (406 MB) for training and 23 minutes (11 MB) for evaluation.

In addition to the individual script files, there are three files that concatenate the individual scripts. train_scripts.all and eval_scripts.all concatenate the training and evaluation script files respectively, one file per line, each line beginning with the fileID. eval_scripts_filtered.all is a filtered version of eval_scripts.all, with the tagged elements () and the proper noun marker removed.

*Updates* A more recent version of the paper Automatic Speech Recognition Performance on a Voicemail Transcription Task (M. Padmanabhan, G. Saon, J. Huang, B. Kingsbury and L. Mangu, IEEE Transactions on Speech and Audio Processing, vol 10, number 7, pp 433-442, October 2002) is available in both PDF and PS format by email request.

*Samples* Please view the following samples:

* Audio Sample 1 (SPH)
* Transcript Sample 1 (TXT)
* Audio Sample 2 (SPH)
* Transcript Sample 2 (TXT)

*Additional Licensing Instructions* This members-only corpus is available to current members, who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.
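The concatenated script files are described as holding one script per line, with each line beginning with the fileID. A minimal Python sketch of reading such a file into a lookup table follows. It assumes that the fileID is separated from the transcript by whitespace, which the record does not state; the function name and the sample IDs are hypothetical.

```python
def parse_script_lines(lines):
    """Map each fileID to its transcript text.

    Assumption (not stated in the record): the fileID is the first
    whitespace-delimited token on the line and the rest is the transcript.
    """
    scripts = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split once on any run of whitespace
        file_id = parts[0]
        text = parts[1] if len(parts) > 1 else ""
        scripts[file_id] = text
    return scripts


# Hypothetical lines in the style of train_scripts.all
sample = ["msg0001 hi this is a test message\n", "msg0002 please call me back\n"]
scripts = parse_script_lines(sample)
```

With a real copy of the corpus, the same function could be applied to an open file handle, for example `parse_script_lines(open("train_scripts.all"))`, after checking the actual delimiter in the file.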

Extent: Corpus size: 450560 KB

Format: Sampling Rate: 8000

Sampling Format: ulaw
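The sampling rate and format above can be checked against the durations and sizes quoted in the description. The sketch below is a rough consistency check in Python, assuming one byte per sample (8-bit ulaw), a single channel, and ignoring SPHERE file headers; the function name is hypothetical.

```python
def audio_bytes(duration_seconds, sample_rate=8000, bytes_per_sample=1):
    """Raw audio payload size for single-channel audio, headers ignored."""
    return duration_seconds * sample_rate * bytes_per_sample


train_bytes = audio_bytes(14 * 3600)  # approximately 14 hours of training audio
eval_bytes = audio_bytes(23 * 60)     # approximately 23 minutes of evaluation audio

print(train_bytes / 1e6)  # 403.2 (decimal MB)
print(eval_bytes / 1e6)   # 11.04 (decimal MB)
```

Both figures are close to the 406 MB and 11 MB given in the description, which is what one would expect given that the durations are approximate and per-file headers are not counted.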

Identifier: LDC2002S35

https://catalog.ldc.upenn.edu/LDC2002S35

ISBN: 1-58563-242-2

ISLRN: 550-933-474-715-5

DOI: 10.35111/d86m-x152

Language: English

Language (ISO639): eng

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2002S35

Rights Holder: Portions © 2002 International Business Machines Corporation, © 2002 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2002S35

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Padmanabhan, Mukund; Kingsbury, Brian; Ramabhadran, Bhuvana; Huang, Jing; Chen, Stanley; Saon, George; Mangu, Lidia. 2002. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2002S35
Up-to-date as of: Wed Oct 29 7:00:12 EDT 2025
