OLAC Record: WTIMIT 1.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2010S02

Metadata

Title: WTIMIT 1.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Bauer, Patrick, and Tim Fingscheidt. WTIMIT 1.0 LDC2010S02. Web Download. Philadelphia: Linguistic Data Consortium, 2010

Contributor: Bauer, Patrick

Contributor: Fingscheidt, Tim

Date (W3CDTF): 2010

Date Issued (W3CDTF): 2010-03-17

Description:

*Introduction*

WTIMIT 1.0 is a wideband mobile telephony derivative of the TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT, LDC93S1). TIMIT contains wideband speech recordings (i.e., sampled at 16 kHz) of 630 speakers of American English from eight major dialect regions, each reading ten phonetically rich sentences. The TIMIT speech corpus was completed in 1993 and was intended for acoustic-phonetic studies as well as for the development and evaluation of automatic speech recognition (ASR) systems.

Since then, five TIMIT derivatives have been developed: FFMTIMIT, NTIMIT, CTIMIT, HTIMIT, and STC-TIMIT. The FFMTIMIT (LDC96S32) corpus (Free-Field Microphone TIMIT) consists of the original TIMIT database recorded with a free-field microphone. NTIMIT (LDC93S2) (Network TIMIT) serves as a telephone bandwidth adjunct to TIMIT, containing its speech files transmitted over a telephone handset and the NYNEX telephone network, subject to a large variety of channel conditions. For the cellular bandwidth speech corpus CTIMIT (LDC96S30), the original TIMIT recordings were passed through cellular telephone circuits. The HTIMIT (LDC98S67) corpus (Handset TIMIT) offers a TIMIT subset of 192 male and 192 female speakers played through different telephone handsets for the study of telephone transducer effects on speech. For the single-channel telephone corpus STC-TIMIT (LDC2008S03), the TIMIT recordings were sent through a real and, in contrast to NTIMIT, single telephone channel.

While some of these derivative TIMIT corpora consist of wideband speech, others are telephony corpora representing narrowband speech, i.e., sampled at 8 kHz and containing frequency components from about 300 Hz to 3.4 kHz. Until now, no real-world wideband telephony speech corpus has been publicly available.

With the arrival of wideband speech codecs such as G.722, G.722.1, G.722.2 (i.e., Adaptive Multi-Rate Wideband, AMR-WB), and G.711.1, wideband telephony speech transmission is now feasible, even in a growing number of mobile networks. Hence, a wideband telephone bandwidth adjunct to TIMIT is desirable for a wide range of scientific investigations, as well as for the development and evaluation of systems such as Interactive Voice Response (IVR) systems.

WTIMIT 1.0 (Wideband Mobile TIMIT) contains the recordings of the original TIMIT speech files after transmission over a real 3G AMR-WB mobile network. WTIMIT 1.0 is organized in the same way as the original TIMIT corpus. The training subset consists of 4620 speech files, while the test subset contains 1680 speech files. The speech data in WTIMIT is stored as raw audio (i.e., without header information) in the following format:

* 16 kHz sampling rate
* 16 bit, 1-channel linear PCM sampling format
* little-endian byte order
* signed samples

*Data*

Data preparation was conducted by converting the original TIMIT speech files into raw data (i.e., dropping the first 1024 bytes of header information) and concatenating them into 11 signal chunks of at most 30 minutes each. To allow precise de-concatenation after transmission, and to make it possible to examine codec influence and channel distortion, each signal chunk was preceded by a 4 s calibration tone, comprising 2 s of a 1 kHz sine wave followed by 2 s of a linear sweep from 0 to 8 kHz.

The prepared speech chunks were stored on a laptop PC for transmission over T-Mobile's AMR-WB-capable 3G mobile network in The Hague, The Netherlands. At the sending end, the speech chunks were played back by the laptop PC. Via an IEEE 1394 link (FireWire), the data was transmitted digitally to an external DAC (digital-to-analog converter) of type RME Fireface 400.
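
The raw format and the calibration tone described above can be illustrated with a short Python sketch. It is not the authors' tooling: the function names, the 0.5 full-scale amplitude, and the zero starting phase are assumptions made for illustration, and only the standard library is used.

```python
import array
import math
import sys

SAMPLE_RATE = 16000   # Hz, as stated in the corpus description
HEADER_BYTES = 1024   # TIMIT header size, as stated in the corpus description


def strip_timit_header(data: bytes) -> bytes:
    """Drop the 1024-byte header of an original TIMIT file, leaving raw PCM."""
    return data[HEADER_BYTES:]


def decode_raw_pcm(data: bytes) -> array.array:
    """Decode headerless 16-bit signed little-endian mono PCM into samples."""
    samples = array.array("h")                       # signed 16-bit integers
    samples.frombytes(data[: len(data) - len(data) % 2])
    if sys.byteorder == "big":                       # file data is little-endian
        samples.byteswap()
    return samples


def calibration_tone(amplitude: float = 0.5) -> array.array:
    """2 s of a 1 kHz sine followed by 2 s of a linear sweep from 0 to 8 kHz.

    The amplitude (fraction of full scale) is an assumption; the corpus
    description does not state the level of the tone.
    """
    n = 2 * SAMPLE_RATE
    scale = amplitude * 32767
    out = array.array("h")
    for i in range(n):
        t = i / SAMPLE_RATE
        out.append(int(round(scale * math.sin(2 * math.pi * 1000 * t))))
    # Linear sweep: instantaneous frequency f(t) = 8000 * t / 2, so the
    # phase is the integral 2*pi * 8000 * t^2 / (2 * 2) = pi * 8000 * t^2 / 2.
    for i in range(n):
        t = i / SAMPLE_RATE
        out.append(int(round(scale * math.sin(math.pi * 8000 * t * t / 2))))
    return out
```

A WTIMIT file could then be loaded with `decode_raw_pcm(open(path, "rb").read())`, where `path` is a placeholder for any file in the corpus.
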
The analog signal was then fed electrically into the microphone input of the transmitting Nokia 6220 mobile phone, using an audio quality test cable for Nokia mobile phones. Prior to the actual transmission, the output attenuation of the DAC was adjusted so as to prevent analog saturation at the input circuit of the phone while making the best use of the dynamic range. A call to the phone at the receiving end, a second Nokia 6220 mobile phone, was established separately for each speech chunk. Using the field test monitoring software of the phones, we confirmed that they were located in different network cells at all times during transmission. We also verified that the intended speech codec, AMR-WB at a constant bit rate of 12.65 kbit/s, was in use; this is by far the most widely used AMR-WB bit rate. In addition, the internal microphone equalization of the transmitting mobile phone was switched off.

At the receiving end, the analog headphone output of the receiving mobile phone was connected electrically to an ADC (analog-to-digital converter) of type RME Fireface 400. The analog input gain of the ADC was adjusted once, at the start, to exploit its dynamic range. Sampling was performed at 48 kHz, the native sampling rate of the ADC, with 16 bit precision. The digital speech signals were transferred to a laptop PC, again via an IEEE 1394 link, and recorded onto a hard drive.

The transmitted speech chunks were decimated from 48 kHz to 16 kHz using a high-quality lowpass filter. Finally, they were de-concatenated by maximizing the cross-correlation between them and the original speech files. We followed the de-concatenation methodology of STC-TIMIT, as described in "STC-TIMIT: Generation of a Single-channel Telephone Corpus", in order to ensure precise sample alignment with the TIMIT speech files.
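
The two post-processing steps named above, decimation from 48 kHz to 16 kHz and alignment by maximizing cross-correlation, can be sketched as follows. The description does not say which filter or software the authors used, so the Hamming-windowed sinc filter, the tap count, the cutoff, and the function names below are all assumptions chosen for illustration, and the brute-force lag search is only practical for short excerpts.

```python
import math


def lowpass_fir(num_taps: int, cutoff: float) -> list:
    """Hamming-windowed sinc lowpass; cutoff is a fraction of the sampling rate."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]     # normalise to unity gain at 0 Hz


def decimate(signal, factor: int = 3, num_taps: int = 91) -> list:
    """Lowpass-filter, then keep every `factor`-th sample (48 kHz -> 16 kHz for 3)."""
    # Cutoff slightly below the new Nyquist frequency (assumed safety margin).
    taps = lowpass_fir(num_taps, 0.9 * 0.5 / factor)
    half = num_taps // 2
    out = []
    for centre in range(0, len(signal), factor):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = centre + k - half
            if 0 <= idx < len(signal):   # samples outside the signal count as zero
                acc += tap * signal[idx]
        out.append(acc)
    return out


def best_lag(received, reference, max_lag: int) -> int:
    """Lag in [-max_lag, max_lag] maximising sum(received[i + lag] * reference[i])."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(received):
                score += received[j] * r
        if score > best_score:
            best, best_score = lag, score
    return best
```

In this sketch, `best_lag` returns the number of samples by which the received signal trails the reference, which is the offset at which an utterance would be cut out of the transmitted chunk.
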
As a result of this alignment procedure, utterances in WTIMIT 1.0 can be considered time-aligned with those of TIMIT to an average precision of 0.0625 ms (one sample). TIMIT's original label files (*.TXT, *.WRD, *.PHN) are therefore essentially valid for WTIMIT as well. However, the channel was found to frequently introduce misalignments of about 10 to 20 ms, mainly during speech pauses, so parts of the affected speech files are slightly misaligned with the original label information. These channel effects may be related to the packet-switched domain of the UMTS core network: depending on the traffic load in the network, packets are buffered and queued, which results in a variable packet delay (jitter).

If you have any problems, questions or suggestions concerning WTIMIT, please send a brief email to Tim Fingscheidt (Technische Universität Braunschweig, Braunschweig, Germany): fingscheidt@ifn.ing.tu-bs.de.

*Samples*

Please examine the following samples for an example of the data in this corpus (raw audio has been converted to wav for purposes of demonstration):

* Audio File
* Text
* Words
* Phonemes

*Acknowledgement*

The authors would like to thank Mr. Dirk Kistowski-Cames, Deutsche Telekom AG, Bonn, Germany, for providing general project support and SIM cards, and Mr. Petri Lang, T-Mobile NL, The Hague, The Netherlands, for local support and SIM cards. Thanks also to Mr. Panu Nevala, Nokia, Oulu, Finland, for providing the prepared mobile phones, which are not available on the market in that form. This work was funded by the German Research Foundation (DFG) under grant no. FI 1494/2-1.
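
The alignment figures quoted in the Data section can be checked with a little arithmetic: at 16 kHz one sample lasts 1/16000 s = 0.0625 ms, and a 10 to 20 ms misalignment corresponds to 160 to 320 samples. The Python sketch below shows this conversion for TIMIT-style label lines. It assumes each label line has the form `<start sample> <end sample> <label>`; the function names are hypothetical and not part of the corpus.

```python
SAMPLE_RATE = 16000  # Hz, as stated in the corpus description


def samples_to_ms(n: int) -> float:
    """Convert a sample index at 16 kHz into milliseconds."""
    return n * 1000.0 / SAMPLE_RATE


def ms_to_samples(ms: float) -> int:
    """Convert a duration in milliseconds into the nearest sample count."""
    return round(ms * SAMPLE_RATE / 1000.0)


def parse_label_line(line: str):
    """Parse one label line, assumed to be '<start sample> <end sample> <label>'."""
    start, end, label = line.split(maxsplit=2)
    return samples_to_ms(int(start)), samples_to_ms(int(end)), label.strip()
```

For example, `parse_label_line("0 3050 h#")` yields a segment running from 0.0 ms to 190.625 ms, and `ms_to_samples(20.0)` gives the 320-sample margin one might allow around a label boundary near a speech pause.
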

Extent: Corpus size: 723968 KB

Format: Sampling Rate: 16000

Format: Sampling Format: 1-channel signed linear PCM (raw)

Identifier: LDC2010S02

https://catalog.ldc.upenn.edu/LDC2010S02

ISBN: 1-58563-540-5

ISLRN: 134-879-445-817-7

DOI: 10.35111/b2mr-ep81

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2010S02

Rights Holder: Portions © 2009, 2010 Tim Fingscheidt, © 1993, 2010 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2010S02

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Bauer, Patrick; Fingscheidt, Tim. 2010. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2010S02
Up-to-date as of: Wed Oct 29 7:01:11 EDT 2025
