OLAC Record: CSR-II (WSJ1) Other

OLAC Record
oai:www.ldc.upenn.edu:LDC94S13C

Metadata

Title: CSR-II (WSJ1) Other

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Linguistic Data Consortium, NIST Multimodal Information Group, and Janet Baker. CSR-II (WSJ1) Other LDC94S13C. Web Download. Philadelphia: Linguistic Data Consortium, 1994

Contributor: Linguistic Data Consortium

Contributor: NIST Multimodal Information Group

Contributor: Baker, Janet M.

Date (W3CDTF): 1994

Description: LDC94S13A - Complete CSR-II corpus; LDC94S13B - CSR-II Sennheiser speech; LDC94S13C - CSR-II Other speech

Data: The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 "conventional" development test utterances (eight hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using two microphones, so the amount of speech in the entire corpus is about 162 hours. In early 1993, a "Hub and Spoke" test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or "hub" condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7,500 waveforms (eleven hours of speech). WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded "Shorten" compression algorithm developed at Cambridge University.

Updates: The CD-ROM labeled "Evaluation Test Data, Part 1" (NIST Speech Disk 13-32.1) contains the file wsj1/doc/lng_modl/base_lm/tcb20onp.z ("WSJ1/DOC/LNG_MODL/BASE_LM/TCB20ONP.Z" on a Windows OS). Although this file has the ".z" extension, it is not a compressed file. To use the file, simply ignore the ".z" extension.
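The note about tcb20onp.z can be handled with a small helper. The following is a minimal sketch, not part of the corpus tooling: the function name copy_without_z and the magic-byte sanity check are illustrative assumptions. It refuses files that start with the well-known compress, pack, or gzip signatures and otherwise copies the file under a name without the ".z" suffix.

```python
import shutil
from pathlib import Path

# Leading bytes of common Unix compressed formats, used only as a sanity check.
COMPRESSED_MAGIC = {
    b"\x1f\x9d": "compress (.Z)",
    b"\x1f\x1e": "pack (.z)",
    b"\x1f\x8b": "gzip (.gz)",
}


def copy_without_z(src, dest_dir):
    """Copy src into dest_dir, dropping a trailing .z/.Z suffix from the name.

    Hypothetical helper: raises ValueError if the file starts with a known
    compression signature, since it should then be decompressed instead.
    """
    src = Path(src)
    with src.open("rb") as handle:
        magic = handle.read(2)
    if magic in COMPRESSED_MAGIC:
        raise ValueError(f"{src} looks like {COMPRESSED_MAGIC[magic]} data")
    # Path.stem removes only the final suffix, e.g. "tcb20onp.z" -> "tcb20onp".
    name = src.stem if src.suffix.lower() == ".z" else src.name
    dest = Path(dest_dir) / name
    shutil.copyfile(src, dest)
    return dest
```

Called as copy_without_z("wsj1/doc/lng_modl/base_lm/tcb20onp.z", "some_output_dir"), where the output directory is a placeholder that must already exist, this would leave a copy named tcb20onp that can be read directly.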

Format: Sampling Rate: 16000 Hz

Format: Sampling Format: 1-channel pcm compressed
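As a rough illustration of what these format values imply, the sketch below reproduces the 162-hour figure from the Description (training plus development test speech, recorded on two microphones) and estimates storage for the complete WSJ1 corpus. The 16-bit sample width is an assumption, since the record does not state it, and the result covers all of LDC94S13A rather than this "Other" subset alone.

```python
SAMPLE_RATE_HZ = 16000   # from the Format field
CHANNELS = 1             # "1-channel pcm"
BYTES_PER_SAMPLE = 2     # assumption: 16-bit samples, not stated in the record
COMPRESSION_RATIO = 2    # "about 2:1" according to the Description


def pcm_bytes(hours):
    """Uncompressed PCM size in bytes for the given number of hours."""
    return hours * 3600 * SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE


# 73 hours of training speech plus 8 hours of development test speech,
# each recorded on two microphones.
single_microphone_hours = 73 + 8
total_hours = 2 * single_microphone_hours

raw_bytes = pcm_bytes(total_hours)
compressed_bytes = raw_bytes // COMPRESSION_RATIO

print(f"total hours: {total_hours}")
print(f"uncompressed: {raw_bytes / 1e9:.1f} GB")
print(f"compressed (approx.): {compressed_bytes / 1e9:.1f} GB")
```

The figures are order-of-magnitude estimates only; the actual Shorten compression ratio varies by waveform.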

Identifier: LDC94S13C

Identifier: https://catalog.ldc.upenn.edu/LDC94S13C

Identifier: ISBN: 1-58563-032-2

Identifier: ISLRN: 595-241-014-505-5

Identifier: DOI: 10.35111/yr3q-9j20

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC94S13C

Rights Holder: Portions © 1987-1989 Dow Jones & Company, Inc., © 1992, 1993, 1994 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC94S13C

DateStamp: 2024-10-07

GetRecord: OAI-PMH request for simple DC format
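The GetRecord entries refer to OAI-PMH requests whose link targets are not preserved in this record. As a hedged sketch, the request URLs can be built from the standard OAI-PMH query parameters (verb, identifier, metadataPrefix). BASE_URL below is a placeholder, not the repository's real endpoint, and the "olac" metadata prefix is assumed from OLAC convention; "oai_dc" is the simple Dublin Core prefix defined by the protocol.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not state the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC94S13C"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "olac" is assumed to be the prefix for the OLAC format.
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
# "oai_dc" is the simple Dublin Core format.
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

Fetching either URL from the real endpoint should return an XML response containing the metadata shown in this record.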

Search Info
Citation: Linguistic Data Consortium; NIST Multimodal Information Group; Baker, Janet M. 1994. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC94S13C
Up-to-date as of: Wed Oct 29 7:00:31 EDT 2025
