OLAC Record: WSJCAM0 Cambridge Read News

OLAC Record
oai:www.ldc.upenn.edu:LDC95S24

Metadata

Title: WSJCAM0 Cambridge Read News

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Robinson, Tony, et al. WSJCAM0 Cambridge Read News LDC95S24. Web Download. Philadelphia: Linguistic Data Consortium, 1995

Contributor: Robinson, Tony

Fransen, Jeroen

Pye, David

Foote, Jonathan

Renals, Steve

Woodland, Phil

Young, Steve

Date (W3CDTF): 1995

Description: A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition (The Cambridge University Version of the ARPA CSR Corpus WSJ0).

This release of WSJCAM0 represents version 1.1 of the corpus, which was initially released on tape by Cambridge University as of August 31, 1994. This collection was modelled directly on the ARPA CSR Corpus released by LDC in 1993: it used the same dual-microphone recording paradigm and a subset of prompting texts drawn from the Wall Street Journal. There are two key differences between WSJ0 and WSJCAM0: (1) the subjects in WSJCAM0 were native speakers of British English and (2) in addition to standard orthographic transcripts, WSJCAM0 also has information on the time alignment between the sampled waveform and both the words and the phonetic segments.

The contents of the publication consist of the following:

* Training data from head-mounted microphone
* Development test data from head-mounted microphone, plus first set of evaluation test data
* Training data from desk-mounted microphone
* Development test data from desk-mounted microphone, plus second set of evaluation test data

There are 90 utterances from each of 92 speakers that are designated as training material for speech recognition algorithms. An additional 48 speakers each read 40 sentences containing only words from a fixed 5,000-word vocabulary and another 40 sentences using a 64,000-word vocabulary, to be used as testing material. Each of the total of 140 speakers also recorded a common set of 18 adaptation sentences. Recordings were made from two microphones: a far-field desk microphone and a head-mounted close-talking microphone.

Within the train and test sets, speech data are organized by speaker; prompting texts, detailed transcriptions and speaker information are included in each speaker directory. All waveform files have NIST SPHERE headers. Waveform data are compressed using the Shorten algorithm developed by Tony Robinson at Cambridge University, as adapted for use in the NIST SPHERE software package.

*Samples*

* Head Mounted Mic
* Desk Mounted Mic
* Phoneme Alignments
* Word Alignments

*Updates*

On October 1, 2015 the corpus was modified to be released as a web download. Documentation was modified to reflect this.
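The speaker and sentence figures in the description can be cross-checked with a little arithmetic. The sketch below is only illustrative: it assumes the stated numbers are exact for every speaker, that the 18 adaptation sentences are counted separately from the training and test sentences, and that the totals are per microphone. The constant and function names are made up for this example.

```python
# Figures as stated in the Description field; treated here as nominal counts.
TRAIN_SPEAKERS = 92
TRAIN_UTTERANCES_PER_SPEAKER = 90
TEST_SPEAKERS = 48
TEST_SENTENCES_5K = 40      # sentences from the fixed 5,000-word vocabulary
TEST_SENTENCES_64K = 40     # sentences from the 64,000-word vocabulary
ADAPTATION_SENTENCES = 18   # common set read by every speaker


def nominal_counts() -> dict:
    """Return nominal per-microphone sentence counts implied by the description."""
    speakers = TRAIN_SPEAKERS + TEST_SPEAKERS
    return {
        "speakers": speakers,
        "training": TRAIN_SPEAKERS * TRAIN_UTTERANCES_PER_SPEAKER,
        "test": TEST_SPEAKERS * (TEST_SENTENCES_5K + TEST_SENTENCES_64K),
        "adaptation": speakers * ADAPTATION_SENTENCES,
    }


if __name__ == "__main__":
    for name, value in nominal_counts().items():
        print(f"{name}: {value}")
```

The speaker total of 92 + 48 = 140 agrees with the figure given in the description. Actual file counts in the release may differ from these nominal values.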

Extent: Corpus size: 3670016 KB

Format: Sampling Rate: 16000

Sampling Format: 1-channel pcm compressed
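The sampling rate and channel count listed here are the kind of values a NIST SPHERE header records. As a rough sketch, assuming the usual SPHERE layout (a `NIST_1A` magic line, a header-size line, `name -type value` fields, and a closing `end_head`), a header can be read as below. The sample header is synthetic, not copied from the corpus, and its `sample_coding` value is an assumption about how Shorten compression is typically labelled.

```python
def parse_sphere_header(raw: bytes):
    """Parse the ASCII header of a NIST SPHERE file.

    Returns (header_size_in_bytes, fields). Integer (-i) and real (-r)
    fields are converted; string (-sN) fields are kept as text.
    """
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(lines[1].strip())

    fields = {}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore malformed lines in this sketch
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value
    return header_size, fields


# Synthetic header for illustration only (values are assumptions).
SAMPLE_HEADER = (
    b"NIST_1A\n"
    b"   1024\n"
    b"sample_rate -i 16000\n"
    b"channel_count -i 1\n"
    b"sample_coding -s26 pcm,embedded-shorten-v2.00\n"
    b"end_head\n"
)

if __name__ == "__main__":
    size, fields = parse_sphere_header(SAMPLE_HEADER)
    print(size, fields)
```

This reads the header only. Decoding Shorten-compressed sample data needs the NIST SPHERE tools or an equivalent decoder.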

Identifier: LDC95S24

https://catalog.ldc.upenn.edu/LDC95S24

ISBN: 1-58563-058-6

ISLRN: 500-945-172-283-3

DOI: 10.35111/8p7y-7b92

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC95S24

Rights Holder: Portions © 1987-1989 Dow Jones & Company, Inc., © 1995 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC95S24

DateStamp: 2024-01-08

GetRecord: OAI-PMH request for simple DC format
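The GetRecord entries refer to OAI-PMH requests, which are ordinary HTTP queries with a `verb`, an `identifier` and a `metadataPrefix`. The sketch below builds such a query string for this record. The base URL is a placeholder, since the repository endpoint is not given in this record; `oai_dc` is the prefix OAI-PMH defines for simple Dublin Core, and `olac` is assumed to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real repository base URL is not stated in this record.
BASE_URL = "https://example.org/oai"
RECORD_ID = "oai:www.ldc.upenn.edu:LDC95S24"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str = "olac") -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    print(get_record_url(BASE_URL, RECORD_ID))            # OLAC format
    print(get_record_url(BASE_URL, RECORD_ID, "oai_dc"))  # simple DC format
```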

Search Info
Citation: Robinson, Tony; Fransen, Jeroen; Pye, David; Foote, Jonathan; Renals, Steve; Woodland, Phil; Young, Steve. 1995. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC95S24
Up-to-date as of: Wed Oct 29 7:00:34 EDT 2025
