OLAC Record: Switchboard-1 Release 2

OLAC Record
oai:www.ldc.upenn.edu:LDC97S62

Metadata

Title: Switchboard-1 Release 2

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Godfrey, John J., and Edward Holliman. Switchboard-1 Release 2 LDC97S62. Web Download. Philadelphia: Linguistic Data Consortium, 1993

Contributor: Godfrey, John J.

Contributor: Holliman, Edward

Date (W3CDTF): 1993

Description: *Introduction* The Switchboard-1 Telephone Speech Corpus (LDC97S62) consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-3. Since that release, a number of corrections have been made to the data files as presented on the original CD-ROM set, and all copies of the first pressing have been distributed.
Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven robot operator system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion, and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.
*Data* In this release, assembled and published by the LDC, all known errors affecting the original publication of speech files were corrected. In addition, the contents of the NIST Sphere headers of all speech files were modified to identify each file as part of the new release and to make the usage of the sample_count header field consistent with standard Sphere usage. (In particular, the sample_count field should reflect the number of samples on each channel in the file. In the initial release, this field was improperly set to the total number of samples in both channels of the file; this has been corrected in the new release.)
Since the 1997 release, the Switchboard transcripts have been carefully revised at The Institute for Signal and Information Processing (ISIP), and additional problems have been discovered and patched. Three speech files that were part of the original release were inadvertently left off the 1997 revision. After corpus users noted some problems in the original speaker attribution table, LDC audited the problem calls and corrected the attributions. The latest version of the ISIP transcriptions, the ISIP update of the ICSI phonetic transcriptions, and corrected word alignments are all available at ISIP. The LDC makes the transcript summaries available in the online docs folder.
Researchers have used SWB-1 data for various annotation projects, including discourse annotation/speech acts, part-of-speech tagging and parsing, up-to-date orthographic transcriptions, and phonetic transcriptions. This summary documents which files have been used for the various annotations. In addition to the index of these file characteristics, there is also a table detailing speaker attributes.
*Samples* Please view this audio sample.
*Updates*
08/11/2015: The three files from the 03/26/2013 update were converted into unshortened sphere. File tables and documentation were updated to reflect the conversion of these files. The corpus is also now available as a web download. All copies of this corpus obtained after the above date include this update.
03/26/2013: Three previously missing files were added to this release (sw02289.sph, sw04361.sph, sw04379.sph). File tables and documentation were updated to reflect the addition of these files. Please contact ldc@ldc.upenn.edu to obtain this update. All copies of this corpus obtained after the above date already include this update.
09/29/2011: Added a file list, available through online docs, to reflect its release on DVD. Also, an updated readme reflects these changes.
11/12/2007: Updated and corrected speaker and call tables are now available online in the corpus documentation directory at https://catalog.ldc.upenn.edu/docs/LDC97S62/
09/2008: The Switchboard Dialog Act Corpus is a version of Switchboard-1 Release 2 tagged with a shallow discourse tag-set of approximately 60 basic dialog act tags and combinations. The discourse tag-set used is an augmentation of the Discourse Annotation and Markup System of Labeling (DAMSL) tag-set and is referred to as the SWBD-DAMSL labels. These annotations were created in 1997 at the University of Colorado at Boulder, with the goal of building better language models for automatic speech recognition of the Switchboard domain. To that end, the label-set incorporates both traditional sociolinguistic and discourse-theoretic rhetorical relations/adjacency-pairs as well as some more form-based models. This corpus contains labels for 1155 5-minute conversations comprising 205,000 utterances and 1.4 million words. The Switchboard Dialog Act Corpus is available as a free download via the online documentation folder.
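The sample_count correction described under *Data* can be illustrated with a minimal sketch. It assumes the usual NIST Sphere text header layout (a `NIST_1A` line, a header-size line, then `name -type value` lines ending with `end_head`). The header bytes below are synthetic and the function names are hypothetical; none of this is taken from the corpus or from LDC tooling.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the text fields of a NIST Sphere header into a dict."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST Sphere header")
    fields = {}
    # lines[1] holds the header size in bytes; field lines start at lines[2].
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN" string fields
            fields[name] = value
    return fields


def duration_seconds(fields: dict) -> float:
    """Duration under standard usage: sample_count is per channel."""
    return fields["sample_count"] / fields["sample_rate"]


# Synthetic header for a 5-minute, 2-channel, 8000 Hz file (illustrative only).
EXAMPLE_HEADER = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 2\n"
    b"sample_rate -i 8000\n"
    b"sample_count -i 2400000\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
)

header = parse_sphere_header(EXAMPLE_HEADER)
correct = duration_seconds(header)

# If sample_count had instead been the total over both channels, as in the
# initial release, the same formula would report twice the true duration.
legacy = dict(header, sample_count=header["sample_count"] * header["channel_count"])
overstated = duration_seconds(legacy)
```

Here `correct` is the duration of the conversation, while `overstated` shows how a tool that trusts the header would have misjudged a file from the initial release.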

Extent: Corpus size: 14610176 KB

Format: Sampling Rate: 8000

Format: Sampling Format: 2-channel ulaw
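As a rough consistency check, the Extent and Format fields can be compared with the "approximately 260 hours" stated in the Description. The sketch below assumes that 1 KB means 1024 bytes, that the audio is stored uncompressed at 8 bits per mu-law sample, and that header and documentation overhead is negligible. These are assumptions, not statements from the record.

```python
CORPUS_SIZE_KB = 14_610_176   # Extent field
SAMPLE_RATE_HZ = 8000         # Format: Sampling Rate
CHANNELS = 2                  # Format: 2-channel ulaw
BYTES_PER_SAMPLE = 1          # mu-law uses 8 bits per sample

# Assumption: the record's "KB" is 1024 bytes.
total_bytes = CORPUS_SIZE_KB * 1024

# Both channels are recorded simultaneously, so one second of
# conversation occupies rate * channels * bytes_per_sample bytes.
bytes_per_second = SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE

hours = total_bytes / bytes_per_second / 3600
print(f"{hours:.1f} hours of two-channel audio")  # about 259.7
```

Under these assumptions the size corresponds to roughly 260 hours of two-channel conversation, which agrees with the figure in the Description.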

Identifier: LDC97S62

https://catalog.ldc.upenn.edu/LDC97S62

ISBN: 1-58563-121-3

ISLRN: 988-076-156-109-5

DOI: 10.35111/sw3h-rw02

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC97S62

Rights Holder: Portions © 1992, 1993, 1997 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC97S62

DateStamp: 2022-03-21

GetRecord: OAI-PMH request for simple DC format
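The GetRecord entries above refer to OAI-PMH requests. A minimal sketch of how such a request URL is formed from the OaiIdentifier follows. The base URL is a placeholder, since the record does not state the repository endpoint, and the `olac` metadata prefix is an assumption; `oai_dc` is the standard prefix for simple Dublin Core.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository base URL.
BASE_URL = "https://example.org/oai"
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC97S62"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple Dublin Core, and (assumed prefix name) the OLAC format.
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
```

Fetching either URL from a real endpoint would return an XML response containing the record; that step is omitted here because the endpoint is not known.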

Search Info
Citation: Godfrey, John J.; Holliman, Edward. 1993. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC97S62
Up-to-date as of: Wed Oct 29 7:00:44 EDT 2025
