OLAC Record: CSR-III Speech

OLAC Record
oai:www.ldc.upenn.edu:LDC95S23

Metadata

Title: CSR-III Speech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Linguistic Data Consortium, and NIST Multimodal Information Group. CSR-III Speech LDC95S23. Web Download. Philadelphia: Linguistic Data Consortium, 1995

Contributor: Linguistic Data Consortium

NIST Multimodal Information Group

Date (W3CDTF): 1995

Description: CSR-III Speech, the third ARPA Continuous Speech Recognition (CSR) Benchmark Speech Test Collection, is a three CD-ROM set that contains complete development test and evaluation test sets for speaker-independent, large-vocabulary speech recognition systems. The development and evaluation tests share a common structure, consisting of two core test components ("hubs") and seven specialized test components ("spokes"). The hub tests, which were mandatory for all ARPA CSR participants in the November 1994 evaluations, provide a baseline for ASR performance, while the spokes provide the means for assessing the impact of particular speaking conditions or processing strategies in relation to baseline performance. Participants were free to take any combination of spoke tests according to their research interests. Taken together, the collection encompasses 180 speakers, each producing 20-40 sentences. These are organized into two complete development test sets and one evaluation set. The collection also includes complete documentation on the test specifications, data collection procedures, transcriptions and scoring protocols, together with the latest available version of NIST software for scoring ASR results and managing SPHERE waveform files. All speech data is accompanied by both the prompting texts and the detailed orthographic transcriptions of the utterances. This was the first ARPA CSR benchmark test in which prompting texts were drawn from a variety of English news sources. Whereas earlier benchmarks were based on Wall Street Journal (WSJ) excerpts from the period 1987-89, CSR-III prompts are from a variety of North American Business News Services: Reuters News Service, New York Times, Washington Post and Los Angeles Times as well as WSJ; all texts are drawn from financial news articles written during the period of April through June 1994. (NAB stands for "North American Business," in contrast to earlier benchmarks and training collections labeled "WSJ.") A companion to the 1994 Benchmark Speech data collection is the four-disk CSR-III Text Collection (LDC95T6), which includes the ARPA CSR 1994 Standard Language Model. This corpus is also available from the LDC as a 1995 release.

Format: Sampling Rate: 16000

Sampling Format: 1-channel pcm compressed

Identifier: LDC95S23

https://catalog.ldc.upenn.edu/LDC95S23

ISBN: 1-58563-045-4

ISLRN: 388-101-290-949-9

DOI: 10.35111/pzfc-hd68

Language: English

Language (ISO639): eng

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC95S23

Rights Holder: Portions © 1994 Dow Jones & Company, Inc., Los Angeles Times-Washington Post News Service, Inc., New York Times, Reuters America, Inc., © 1994, 1995 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC95S23

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Linguistic Data Consortium; NIST Multimodal Information Group. 1995. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC95S23
Up-to-date as of: Wed Oct 29 7:00:33 EDT 2025
