OLAC Record: CHiME2 WSJ0

OLAC Record
oai:www.ldc.upenn.edu:LDC2017S10

Metadata

Title: CHiME2 WSJ0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Vincent, Emmanuel, et al. CHiME2 WSJ0 LDC2017S10. Web Download. Philadelphia: Linguistic Data Consortium, 2017

Contributor: Vincent, Emmanuel

Barker, Jon

Watanabe, Shinji

Le Roux, Jonathan

Nesta, Francesco

Matassoni, Marco

Date (W3CDTF): 2017

Date Issued (W3CDTF): 2017-06-15

Description: *Introduction* CHiME2 WSJ0 was developed as part of the 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 166 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments. CHiME2 WSJ0 reflects the medium vocabulary track of the CHiME2 Challenge. The target utterances were taken from CSR-I (WSJ0) Complete (LDC93S6A), specifically the 5,000-word subset of read speech from Wall Street Journal news text. LDC also released CHiME2 Grid (LDC2017S07) and CHiME3 (LDC2017S24).

*Data* The data is divided into training, development and test sets. All data is provided as 16-bit WAV files sampled at 16 kHz. The noisy utterances are provided in both isolated and embedded form. The embedded form includes five seconds of background noise before and after the utterance. Seven hours of background noise that are not part of the training set are also included. Also included are baseline scoring, decoding and retraining tools based on Cambridge University's HTK (the Hidden Markov Model Toolkit) and related recipes. These tools include three baseline speaker-independent recognition systems trained on clean, reverberated and noisy data, respectively, and a number of scripts.

*Samples* Audio samples are provided for the following: Embedded, Isolated, Reverberated, Scaled.

*Updates* None at this time.
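The audio format stated in the description (16-bit WAV at 16 kHz, with five seconds of background noise on each side of an embedded utterance) can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes the padding is exactly five seconds on both sides; the record does not say whether that figure is exact, so treat the computed range as approximate.

```python
import wave

EXPECTED_RATE_HZ = 16_000   # from the record: "sampled at 16 kHz"
EXPECTED_SAMPLE_WIDTH = 2   # bytes per sample; from the record: "16 bit"
PADDING_SECONDS = 5         # from the record; assumed exact on both sides


def check_format(path):
    """Return True if the WAV file at `path` is 16-bit audio sampled at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH)


def utterance_frame_range(n_frames, rate_hz=EXPECTED_RATE_HZ,
                          padding_s=PADDING_SECONDS):
    """Return (start, end) frame indices of the utterance in an embedded file.

    Assumes `padding_s` seconds of background noise before and after
    the utterance; `end` is exclusive.
    """
    pad = rate_hz * padding_s
    if n_frames < 2 * pad:
        raise ValueError("file is shorter than the assumed noise padding")
    return pad, n_frames - pad
```

For a 12-second embedded file, `utterance_frame_range(16_000 * 12)` leaves the middle two seconds once the assumed padding is removed.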

Extent: Corpus size: 38385568 KB

Format: Sampling Rate: 16000

Sampling Format: pcm

Identifier: LDC2017S10

https://catalog.ldc.upenn.edu/LDC2017S10

ISBN: 1-58563-801-3

ISLRN: 071-714-384-459-0

DOI: 10.35111/cxwc-kb75

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2017S10

Rights Holder: Portions © 1987-1989 Dow Jones & Company, Inc., © 2017 Inria Nancy - Grand Est, University of Sheffield, Mitsubishi Electric Research Labs, Fondazione Bruno Kessler, © 1992, 1993, 1996, 2017 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2017S10

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
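The GetRecord entries above correspond to OAI-PMH requests, which take a `verb`, an `identifier` and a `metadataPrefix` as query parameters. The sketch below builds such a request URL for this record. The base URL is a placeholder, not the actual endpoint, and `olac` as the prefix for the OLAC format is an assumption; `oai_dc` is the prefix the OAI-PMH specification defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration only; substitute the repository's real base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple DC request for this record.
dc_url = get_record_url("oai:www.ldc.upenn.edu:LDC2017S10", "oai_dc")

# OLAC-format request; the "olac" prefix is assumed, not confirmed by this record.
olac_url = get_record_url("oai:www.ldc.upenn.edu:LDC2017S10", "olac")
```

`urlencode` percent-encodes the colons in the identifier, which keeps the query string valid.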

Search Info
Citation: Vincent, Emmanuel; Barker, Jon; Watanabe, Shinji; Le Roux, Jonathan; Nesta, Francesco; Matassoni, Marco. 2017. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2017S10
Up-to-date as of: Thu Sep 18 1:00:53 EDT 2025
