OLAC Record: Santa Barbara Corpus of Spoken American English Part III

OLAC Record
oai:www.ldc.upenn.edu:LDC2004S10

Metadata

Title: Santa Barbara Corpus of Spoken American English Part III

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Du Bois, John W., and Robert Englebretson. Santa Barbara Corpus of Spoken American English Part III LDC2004S10. Web Download. Philadelphia: Linguistic Data Consortium, 2004

Contributor: Du Bois, John W.

Contributor: Englebretson, Robert

Date (W3CDTF): 2004

Date Issued (W3CDTF): 2004-09-23

Description: *Introduction*

Santa Barbara Corpus of Spoken American English Part III was produced by the Linguistic Data Consortium (LDC) and contains 6 hours of conversational English audio as well as associated transcripts.

Santa Barbara Corpus of Spoken American English Part III is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.

The corpus was collected by: University of California, Santa Barbara Center for the Study of Discourse (Director: John W. Du Bois (UCSB); Authors: John W. Du Bois and Robert Englebretson; Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB)).

Santa Barbara Corpus of Spoken American English Part III is also part of the International Corpus of English (ICE) (Charles W. Meyer, Director), representing the American Component.

For software and additional data resources, please refer to the following sites: Talkbank, International Corpus of English.

The first two parts of this collection can be found as:
* Santa Barbara Corpus of Spoken American English Part I (LDC2000S85).
* Santa Barbara Corpus of Spoken American English Part II (LDC2003S06).

*Data*

The audio data consist of 16 wave format speech files, recorded in two-channel PCM at 22050 Hz. The speech files total 1.8 GB, representing over 116 K-words (thousands of words) and over 9K unique words in transcription. The gender breakdown of subjects is 40 female, 19 male.

The transcripts are in .trn format with the following structure:

    2.660 2.805 JOANNE: But,
    2.805 4.685 so these slides be real interesting.
    6.140 6.325 KEN: ... Yeah.
    6.325 7.710 I think it'll be real interesting

Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances, and the audio files have been filtered to make these portions of the recordings unrecognizable. Pitch information is still recoverable from these filtered portions of the recordings, but the amplitude levels in these regions have been reduced relative to the original signal. A separate filter list file (.flt) associated with each transcript/waveform file pair lists the beginning and ending times of the filtered regions. The file sbc040.flt is empty, indicating there was no personal information to filter out.

The filtering was done using a digital FIR low-pass filter, with the cut-off frequency set at 400 Hz. The effect of the filter was gradually faded in and out at the beginning and end of each region over a 1,000-sample span, roughly 45 milliseconds, to avoid abrupt transitions in the resulting waveform.

*Acknowledgements*

The completion and release of this corpus was facilitated by funding extended by the Talkbank project. Talkbank is an interdisciplinary research project funded by a five-year grant (BCS-998009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania. Produced at the LDC by Nii Martey.

*Samples*

Please view the following samples:
* Speech file (wav)
* Transcript

*Updates*

None at this time
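The .trn structure quoted in the Description (start time, end time, optional speaker label, text) can be read with a short parser. The following is a minimal sketch, not LDC tooling. It assumes that fields are separated by whitespace, that a speaker label is an upper-case token ending in a colon, and that a line without a label continues the previous speaker's turn. The names `Unit`, `LINE_RE`, and `parse_trn` are hypothetical, and the distributed files may use a different delimiter or richer annotation than the excerpt shows.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Unit(NamedTuple):
    """One timed transcript line (hypothetical helper type)."""
    start: float
    end: float
    speaker: Optional[str]
    text: str


# Assumed layout: <start> <end> [SPEAKER:] <text>
LINE_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(?:([A-Z][A-Z0-9_#]*):\s*)?(.*)$"
)


def parse_trn(lines: Iterable[str]) -> List[Unit]:
    """Parse .trn-style lines, carrying the speaker over to unlabeled lines."""
    units: List[Unit] = []
    speaker: Optional[str] = None
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not start with two timestamps
        start, end, label, text = match.groups()
        if label:
            speaker = label
        units.append(Unit(float(start), float(end), speaker, text.strip()))
    return units


# The four lines quoted in the Description field.
sample = [
    "2.660 2.805 JOANNE: But,",
    "2.805 4.685 so these slides be real interesting.",
    "6.140 6.325 KEN: ... Yeah.",
    "6.325 7.710 I think it'll be real interesting",
]
units = parse_trn(sample)
for unit in units:
    print(f"{unit.start:7.3f} {unit.end:7.3f} {unit.speaker}: {unit.text}")
```

Carrying the speaker forward is a design choice for convenience; a caller who needs to distinguish labeled from unlabeled lines could store the raw label instead.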

Extent: Corpus size: 1887436 KB

Format: Sampling Rate: 22050

Format: Sampling Format: pcm
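A downloaded speech file can be checked against the Format fields above using Python's standard `wave` module. The sketch below is illustrative only: it writes a one-second silent stand-in file rather than reading corpus data, and the 16-bit sample width is an assumption, since the record says only "pcm". The helper names `describe_wav` and `matches_record` are hypothetical. The last line also works out the fade length stated in the Description (1,000 samples at 22050 Hz).

```python
import os
import tempfile
import wave

EXPECTED_RATE = 22050      # from "Format: Sampling Rate: 22050"
EXPECTED_CHANNELS = 2      # from "two-channel" in the Description


def describe_wav(path):
    """Return basic header information for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_record(info):
    """True if channel count and sampling rate agree with the record."""
    return info["channels"] == EXPECTED_CHANNELS and info["rate"] == EXPECTED_RATE


# Stand-in file: one second of two-channel silence (not corpus audio).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(EXPECTED_CHANNELS)
    wav.setsampwidth(2)  # assumption: 16-bit samples
    wav.setframerate(EXPECTED_RATE)
    wav.writeframes(b"\x00" * (EXPECTED_RATE * EXPECTED_CHANNELS * 2))

info = describe_wav(demo_path)
print(info, matches_record(info))

# Fade length from the Description: 1,000 samples at 22050 Hz, in milliseconds.
fade_ms = 1000 / EXPECTED_RATE * 1000
print(f"{fade_ms:.2f} ms")
```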

Identifier: LDC2004S10

https://catalog.ldc.upenn.edu/LDC2004S10

ISBN: 1-58563-308-9

ISLRN: 801-946-303-326-6

DOI: 10.35111/3fke-7p97

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2004S10

Rights Holder: Portions © 2003 University of California, © 2003 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2004S10

DateStamp: 2024-03-26

GetRecord: OAI-PMH request for simple DC format
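The GetRecord entries above correspond to OAI-PMH requests that pass a `verb`, an `identifier`, and a `metadataPrefix`. The following sketch shows how such a request URL could be assembled. The base URL `https://example.org/oai` is a placeholder, not the archive's actual endpoint, and `get_record_url` is a hypothetical helper. `oai_dc` is the simple Dublin Core prefix defined by OAI-PMH; using `olac` as the prefix for the OLAC format is an assumption to confirm against the repository.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL (base_url is supplied by the caller)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder endpoint; substitute the repository's real OAI-PMH base URL.
url = get_record_url(
    "https://example.org/oai",
    "oai:www.ldc.upenn.edu:LDC2004S10",
    "oai_dc",
)
print(url)

# Round-trip check: the query string decodes back to the three arguments.
params = parse_qs(urlsplit(url).query)
```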

Search Info
Citation: Du Bois, John W.; Englebretson, Robert. 2004. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004S10
Up-to-date as of: Tue May 20 0:13:33 EDT 2025
