OLAC Record oai:www.ldc.upenn.edu:LDC94S13A |
Metadata | ||
Title: | CSR-II (WSJ1) Complete | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Linguistic Data Consortium, NIST Multimodal Information Group, and Janet Baker. CSR-II (WSJ1) Complete LDC94S13A. Web Download. Philadelphia: Linguistic Data Consortium, 1994 | |
Contributor: | Linguistic Data Consortium | |
NIST Multimodal Information Group | ||
Baker, Janet M. | ||
Date (W3CDTF): | 1994 | |
Description: | LDC94S13A: Complete CSR-II corpus; LDC94S13B: CSR-II Sennheiser speech; LDC94S13C: CSR-II Other speech. *Data* The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus also contains approximately 8,200 "conventional" development test utterances (eight hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using two microphones, so the total amount of recorded speech is about 162 hours (roughly 81 hours per microphone). In early 1993, a "Hub and Spoke" test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or "hub" condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7,500 waveforms (eleven hours of speech). WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded "Shorten" compression algorithm developed at Cambridge University. *Updates* The CD-ROM labeled "Evaluation Test Data, Part 1" (NIST Speech Disk 13-32.1) contains the file wsj1/doc/lng_modl/base_lm/tcb20onp.z ("WSJ1/DOC/LNG_MODL/BASE_LM/TCB20ONP.Z" on a Windows OS). Note that although this file has the ".z" extension, it is not a compressed file. To use the file, simply ignore the ".z" extension. | |
Extent: | Corpus size: 18874368 KB | |
Format: | Sampling Rate: 16000 Hz | |
Sampling Format: 1-channel pcm compressed | ||
Identifier: | LDC94S13A | |
https://catalog.ldc.upenn.edu/LDC94S13A | ||
ISBN: 1-58563-030-6 | ||
ISLRN: 819-269-127-206-2 | ||
DOI: 10.35111/q7sb-vv12 | ||
Language: | English | |
Language (ISO639): | eng | |
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC94S13A | |
Rights Holder: | Portions © 1987-1989 Dow Jones & Company, Inc., © 1992, 1993, 1994 Trustees of the University of Pennsylvania | |
Type (DCMI): | Sound | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC94S13A | |
DateStamp: | 2024-10-07 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Linguistic Data Consortium; NIST Multimodal Information Group; Baker, Janet M. 1994. Linguistic Data Consortium. | |
Terms: | area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text |
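
The GetRecord entries above refer to OAI-PMH requests for this record in OLAC and simple Dublin Core formats. As a minimal sketch of how such a request URL is formed under the OAI-PMH protocol (`verb`, `identifier`, and `metadataPrefix` query parameters), the following builds both URLs. The base URL is a hypothetical placeholder because this record does not state the repository endpoint, and the `olac` prefix is an assumption about how the repository names its OLAC format; `oai_dc` is the standard prefix for simple Dublin Core.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record above does not give the repository's
# OAI-PMH base URL, so substitute the real one before use.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC94S13A"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "olac" is assumed to be the repository's prefix for the OLAC format.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
# "oai_dc" is the simple Dublin Core prefix defined by OAI-PMH.
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

The colons in the identifier are percent-encoded by `urlencode`, which is what OAI-PMH expects for reserved characters in query parameter values.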