OLAC Record: ICSI Meeting Speech

OLAC Record
oai:www.ldc.upenn.edu:LDC2004S02

Metadata

Title: ICSI Meeting Speech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Janin, Adam, et al. ICSI Meeting Speech LDC2004S02. Web Download. Philadelphia: Linguistic Data Consortium, 2004

Contributor: Janin, Adam

Edwards, Jane

Ellis, Dan

Gelbart, David

Morgan, Nelson

Peskin, Barbara

Pfau, Thilo

Shriberg, Elizabeth

Stolcke, Andreas

Wooters, Chuck

Date (W3CDTF): 2004

Date Issued (W3CDTF): 2004-01-30

Description: *Introduction* ICSI Meeting Speech was produced by the Linguistic Data Consortium (LDC) and contains approximately 72 hours of English meeting speech. The ICSI Meeting corpus is a collection of 75 meetings recorded at the International Computer Science Institute (ICSI) in Berkeley during the years 2000-2002. The meetings included are "natural" meetings in the sense that they would have occurred anyway: they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. In recording meetings of this type, we hoped to capture meeting dynamics and speaking styles that are as natural as possible given that speakers are wearing close-talking microphones and are fully cognizant of the recording process. The speech files range in length from 17 to 103 minutes, but generally run just under an hour each. Word-level orthographic transcriptions are available as ICSI Meeting Transcripts (LDC2004T04).

*Data* The collection includes 922 files, totaling 883 hours of audio representing 72 hours of speech. The speech is structured as one subdirectory per meeting, containing wave files for each channel (and possibly .blp files, specifying any censored intervals). The meetings were simultaneously recorded using close-talking microphones for each speaker (generally head-mounted, but early meetings contain some lapel microphones), as well as six table-top microphones: four high-quality omnidirectional PZM microphones arrayed down the center of the conference table, and two inexpensive microphone elements mounted on a mock PDA. All meetings were recorded in the same instrumented meeting room. The audio was collected at a 48 kHz sample rate, downsampled on the fly to 16 kHz. Audio files for each meeting are provided as separate time-synchronous recordings for each channel, encoded as 16-bit linear (big-endian) wave files, shorten-compressed in NIST SPHERE format.

In addition to recording the meetings themselves, the participants were also asked to read digit strings, similar to those found in TIDIGITS, at the start or end of the meeting. This small-vocabulary read-speech component of the recordings -- using the same meeting room, speakers, and microphones -- provides a valuable supplement to the natural conversational data, allowing a factorization of the speech challenges offered by the corpus. For all but a dozen of the meetings included in the corpus, at least some of the participants read digit strings; for the great majority of meetings, all participants did. The digit readings are included as part of the wave files for the meeting as a whole and are fully transcribed as part of the associated transcripts.

There are a total of 53 unique speakers in the corpus (40 male, 13 female). Meetings involved anywhere from three to ten participants, averaging six. The corpus contains a significant proportion of non-native English speakers, varying in fluency from nearly-native to challenging-to-transcribe.

*Sponsorship* The collection and preparation of this corpus were made possible in large part through funding from DARPA (both through the Communicator project and through a ROAR "seedling"), the Swiss IM2 project (National Centre of Competence in Research, sponsored by the Swiss National Science Foundation), and a supplementary award from IBM.

*Updates* There are no updates available at this time. More information is available at http://www.ICSI.Berkeley.EDU/Speech/mr.
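The Description states that the audio is stored in NIST SPHERE format. SPHERE files start with a plain-text header, so the sample rate, sample width, and byte order can be checked before any audio is decoded. The following is a minimal sketch, assuming the standard SPHERE header layout (a `NIST_1A` line, a header-size line, `name -type value` fields, and a closing `end_head`). The function name `parse_sphere_header` and the example header are hypothetical and are not taken from the corpus; the actual field values in the distributed files may differ.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header of a NIST SPHERE file into a dict."""
    lines = data.split(b"\n")
    if lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    # The second line gives the total header size in bytes (commonly 1024).
    header_size = int(lines[1].strip())
    text = data[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        name, type_tag, value = line.split(None, 2)
        if type_tag == "-i":        # integer field
            fields[name] = int(value)
        elif type_tag == "-r":      # real-valued field
            fields[name] = float(value)
        else:                       # "-sN": string field of N bytes
            fields[name] = value
    return fields


# Hypothetical header for illustration only (not copied from the corpus).
example = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_byte_format -s2 10\n"      # "10" = most significant byte first
    b"sample_coding -s11 pcm,shorten\n"
    b"end_head\n"
).ljust(1024, b" ")

header = parse_sphere_header(example)
```

The header only describes the audio. Shorten-compressed sample data still has to be decompressed with an external tool (LDC's sph2pipe is one commonly used option) before the samples can be read.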

Extent: Corpus size: 33554432 KB

Format: Sampling Rate: 16000

Sampling Format: pcm
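The Extent and Format fields can be cross-checked against the figures in the Description. The sketch below estimates the uncompressed size of 883 hours of 16 kHz, 16-bit audio and compares it with the listed corpus size. It assumes that 1 KB means 1024 bytes and that the 883-hour figure covers all channels; the listed size looks like a rounded value, so the resulting ratio is only a rough indication of what shorten compression achieves.

```python
HOURS_OF_AUDIO = 883          # total across all channels, from the Description
SAMPLE_RATE_HZ = 16_000       # from the Format field
BYTES_PER_SAMPLE = 2          # 16-bit linear PCM
CORPUS_SIZE_KB = 33_554_432   # from the Extent field

# Raw PCM size: seconds * samples per second * bytes per sample.
uncompressed_bytes = HOURS_OF_AUDIO * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE

# Assumption: the record's "KB" means 1024 bytes.
corpus_bytes = CORPUS_SIZE_KB * 1024

ratio = uncompressed_bytes / corpus_bytes

print(f"uncompressed: {uncompressed_bytes / 2**30:.1f} GiB")
print(f"listed size:  {corpus_bytes / 2**30:.1f} GiB")
print(f"approximate compression ratio: {ratio:.2f}")
```

Under these assumptions the raw audio is roughly three times the listed corpus size.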

Identifier: LDC2004S02

https://catalog.ldc.upenn.edu/LDC2004S02

ISBN: 1-58563-285-6

ISLRN: 723-437-529-684-1

DOI: 10.35111/dgej-5m98

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2004S02

Rights Holder: Portions © 2000-2003 International Computer Science Institute, © 2004 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2004S02

DateStamp: 2024-04-02

GetRecord: OAI-PMH request for simple DC format
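The GetRecord entries in this record refer to OAI-PMH requests. In OAI-PMH, a GetRecord request is an HTTP query with three arguments: `verb`, `identifier`, and `metadataPrefix`. The sketch below builds such URLs for this record. The base URL is a hypothetical placeholder, because the record does not state the repository endpoint; `olac` and `oai_dc` are the prefixes conventionally used for the OLAC and simple Dublin Core formats.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real repository base URL is not given here.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


OAI_ID = "oai:www.ldc.upenn.edu:LDC2004S02"

olac_url = get_record_url(OAI_ID, "olac")   # OLAC format
dc_url = get_record_url(OAI_ID, "oai_dc")   # simple DC format
```

`urlencode` percent-encodes the colons in the identifier, which keeps the query string valid.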

Search Info
Citation: Janin, Adam; Edwards, Jane; Ellis, Dan; Gelbart, David; Morgan, Nelson; Peskin, Barbara; Pfau, Thilo; Shriberg, Elizabeth; Stolcke, Andreas; Wooters, Chuck. 2004. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004S02
Up-to-date as of: Wed Oct 29 7:00:18 EDT 2025
