OLAC Record: Santa Barbara Corpus of Spoken American English Part IV

OLAC Record
oai:www.ldc.upenn.edu:LDC2005S25

Metadata

Title: Santa Barbara Corpus of Spoken American English Part IV

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Du Bois, John W., and Robert Englebretson. Santa Barbara Corpus of Spoken American English Part IV LDC2005S25. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Du Bois, John W.

Contributor: Englebretson, Robert

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-09-20

Description:

*Introduction*

Santa Barbara Corpus of Spoken American English Part IV was produced by the Linguistic Data Consortium (LDC) and contains approximately 5.5 hours of conversational and prepared English speech and associated transcripts. The corpus was collected by the University of California, Santa Barbara (UCSB) Center for the Study of Discourse (Director: John W. Du Bois (UCSB); Authors: John W. Du Bois and Robert Englebretson; Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB)).

The corpus is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.

For software and additional data resources, please refer to the following sites: TalkBank, International Corpus of English.

The first three parts of this collection are available here:

* Santa Barbara Corpus of Spoken American English Part I (LDC2000S85).
* Santa Barbara Corpus of Spoken American English Part II (LDC2003S06).
* Santa Barbara Corpus of Spoken American English Part III (LDC2003S10).

*Data*

The gender breakdown for speakers in this corpus is 33 male and 25 female. In addition, the following metadata is included: age, dialect of English, dialect state, current state, highest level of education, years of education, occupation, ethnicity.

The audio data consists of 14 WAV format speech files, recorded in two-channel PCM at 22050 Hz, representing over 58,000 words and over 6,000 unique words in the transcribed text. The corpus also includes transcript files in TXT format, as well as files specifying spans in each audio file that have been filtered to remove personal identifying information.

*Samples*

For an example of the data in this corpus, please examine this audio sample (WAV) and its transcript (TXT).

*Sponsorship*

The completion and release of this corpus were facilitated by funding extended by the TalkBank Project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-998009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.

*Updates*

None at this time.

Format: Sampling Rate: 22050

Format: Sampling Format: 2-channel pcm

Identifier: LDC2005S25

Identifier: https://catalog.ldc.upenn.edu/LDC2005S25

Identifier: ISBN: 158563-348-8

Identifier: ISLRN: 659-853-066-274-9

Identifier: DOI: 10.35111/c9nh-1v54

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005S25

Rights Holder: Portions © 2003 University of California, © 2003 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (DCMI): Text

Type (OLAC): primary_text
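
The Format fields above give a sampling rate of 22050 and a 2-channel PCM layout. As a minimal sketch, assuming the corpus files are ordinary RIFF WAV files readable by Python's standard `wave` module, a downloaded file could be checked against those values as follows. The function names and the idea of checking a local path are illustrative assumptions, not part of the record.

```python
import wave

# Values taken from the record's Format fields.
EXPECTED_RATE = 22050     # "Sampling Rate: 22050"
EXPECTED_CHANNELS = 2     # "Sampling Format: 2-channel pcm"


def matches_record(path):
    """Return True if the WAV header at `path` agrees with the record."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE
                and wav.getnchannels() == EXPECTED_CHANNELS)


def duration_seconds(path):
    """Return the playing time of the WAV file at `path`, in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Summing `duration_seconds` over all audio files would give a figure to compare with the approximately 5.5 hours stated in the description.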

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005S25

DateStamp: 2022-01-20

GetRecord: OAI-PMH request for simple DC format
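
The GetRecord entries above refer to OAI-PMH requests in two metadata formats. As a rough sketch, such a request is an HTTP GET whose query string carries `verb=GetRecord`, the record's `identifier`, and a `metadataPrefix` (`olac` for the OLAC format, `oai_dc` for simple DC). The base URL below is a placeholder, since this record does not state the repository endpoint, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the repository's base URL.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2005S25"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


olac_url = get_record_url(OAI_IDENTIFIER, "olac")      # OLAC format
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")      # simple DC format
```

Fetching either URL from a real endpoint would return an XML response; parsing it is left out here because the response layout depends on the metadata format requested.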

Search Info
Citation: Du Bois, John W.; Englebretson, Robert. 2005. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005S25
Up-to-date as of: Wed Oct 29 7:00:17 EDT 2025
