OLAC Record: Boston University Radio Speech Corpus

OLAC Record
oai:www.ldc.upenn.edu:LDC96S36

Metadata

Title: Boston University Radio Speech Corpus

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ostendorf, Mari, Patti Price, and Stefanie Shattuck-Hufnagel. Boston University Radio Speech Corpus LDC96S36. Web Download. Philadelphia: Linguistic Data Consortium, 1996

Contributor: Ostendorf, Mari

Price, Patti

Shattuck-Hufnagel, Stefanie

Date (W3CDTF): 1996

Description: The Boston University Radio Speech Corpus was collected primarily to support research in text-to-speech synthesis, particularly generation of prosodic patterns. The corpus consists of professionally read radio news data, including speech and accompanying annotations, suitable for speech and language research. The corpus includes speech from seven (four male, three female) FM radio news announcers associated with WBUR, a public radio station. The main radio news portion of the corpus consists of over seven hours of news stories recorded in the WBUR radio studio during broadcasts over a two-year period. In addition, the announcers were recorded in a laboratory at Boston University. In this lab news portion, the announcers read a total of 24 stories from the radio news portion. The announcers were first asked to read the stories in their non-radio style and then, 30 minutes later, to read the same stories in their radio style. Each story read by an announcer was digitized in paragraph-sized units, which typically include several sentences. The files were digitized at a 16 kHz sample rate using a 16-bit A/D converter. The paragraphs were annotated with the orthographic transcription, phonetic alignments, part-of-speech tags and prosodic markers. The orthographic transcripts were generated by hand and indicate where the speaker took a breath. The phonetic alignments and part-of-speech tags were generated automatically and hand-corrected. The prosodic labels were marked by hand and are available only for a subset of the corpus. A compressed sample file, example.zip, is available. Please be aware that this file is slightly larger than 1 MB (1,278,998 bytes). An additional sample file, LDC1996.tgz, and a WAV sample are also available.

Extent: Corpus size: 1992294 KB

Format: Sampling Rate: 16000

Sampling Format: 1-channel pcm

Identifier: LDC96S36

https://catalog.ldc.upenn.edu/LDC96S36

ISBN: 1-58563-060-8

ISLRN: 601-939-678-076-4

DOI: 10.35111/z7xk-z229

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC96S36

Rights Holder: Portions © 1996 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text
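
The Format fields above (16000 Hz sampling rate, 1-channel PCM), together with the 16-bit A/D mentioned in the Description, fix the raw data rate of the audio. The following is a minimal Python sketch of that arithmetic, not part of the record itself. It assumes a sample is stored as a standard RIFF WAV file (the Description mentions a WAV sample; the corpus files themselves may use a different container), and the function names are hypothetical helpers introduced here for illustration.

```python
import io
import wave

SAMPLE_RATE_HZ = 16000    # from the "Sampling Rate" field
CHANNELS = 1              # from the "Sampling Format" field (1-channel pcm)
SAMPLE_WIDTH_BYTES = 2    # 16-bit A/D, as stated in the Description


def pcm_bytes_per_second(rate=SAMPLE_RATE_HZ, channels=CHANNELS,
                         width=SAMPLE_WIDTH_BYTES):
    """Raw PCM data rate: samples/s * channels * bytes per sample."""
    return rate * channels * width


def wav_duration_seconds(fileobj):
    """Duration of a WAV file (path or file-like object) in seconds."""
    with wave.open(fileobj, "rb") as w:
        return w.getnframes() / w.getframerate()


def make_silent_wav(seconds):
    """Build an in-memory WAV of silence with the record's parameters.

    Only used to demonstrate wav_duration_seconds without corpus data;
    `seconds` must be a whole number.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH_BYTES)
        w.setframerate(SAMPLE_RATE_HZ)
        w.writeframes(b"\x00" * (pcm_bytes_per_second() * seconds))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(pcm_bytes_per_second())                    # bytes of audio per second
    print(wav_duration_seconds(make_silent_wav(2)))  # duration of the demo file
```

At this rate, seven hours of audio would occupy 7 × 3600 × 32000 = 806,400,000 bytes of raw samples. The Extent field is larger, which is consistent with the corpus also containing the lab news recordings and the annotation files.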

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC96S36

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Ostendorf, Mari; Price, Patti; Shattuck-Hufnagel, Stefanie. 1996. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC96S36
Up-to-date as of: Wed Oct 29 7:00:37 EDT 2025
