OLAC Record: Czech Broadcast News MDE Transcripts

OLAC Record
oai:www.ldc.upenn.edu:LDC2010T02

Metadata

Title: Czech Broadcast News MDE Transcripts

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Kolar, Jachym, and Jan Svec. Czech Broadcast News MDE Transcripts LDC2010T02. Web Download. Philadelphia: Linguistic Data Consortium, 2010

Contributor: Kolar, Jachym

Svec, Jan

Date (W3CDTF): 2010

Date Issued (W3CDTF): 2010-01-20

Description: *Introduction* Czech Broadcast News MDE Transcripts, Linguistic Data Consortium (LDC) catalog number LDC2010T02 and ISBN 1-58563-534-0, was prepared by researchers at the University of West Bohemia, Pilsen, Czech Republic. It consists of metadata extraction (MDE) annotations for the approximately 26 hours of transcribed broadcast news speech in Czech Broadcast News Transcripts (LDC2004T01). The audio files corresponding to the transcripts in this corpus are contained in Czech Broadcast News Speech (LDC2004S01).

Czech Broadcast News MDE Transcripts joins LDC's other holdings of Czech broadcast data: Czech Broadcast Conversation Speech (LDC2009S02), Czech Broadcast Conversation MDE Transcripts (LDC2009T20), Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89) and Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53).

The audio recordings were collected from February 1, 2000 through April 22, 2000 from three Czech radio stations (Cesky rozhlas 1 Radiozurnal - CRo1, Cesky rozhlas 2 Praha - CRo2 and Cesky rozhlas 3 Vltava - CRo3) and two television stations (Ceska televize - CTV and Prima TV). The broadcasts included both public and commercial subjects and were presented in various styles, ranging from a formal style to a colloquial style more typical of commercial broadcast companies that do not primarily focus on news.

The goal of MDE research is to take raw speech recognition output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: removing non-content words like filled pauses and discourse markers from the text; removing sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation, standardized spelling and sensible conventions for representing speaker turns and identity are further elements in the readable transcript.

The transcripts and annotations in this corpus are stored in two formats: QAn (Quick Annotator) and RTTM. Character encoding in all files is ISO-8859-2. More information can be found on the website Structural Metadata Annotation for Czech.

*Sponsorship* The completion of this corpus was facilitated by funding provided by the Ministry of Education of the Czech Republic under projects No. 2C06020 and ME909.

*Samples*
* Quick Annotator Transcript
* RTTM Annotation
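
The Description states that all files are encoded in ISO-8859-2 and that annotations are stored in RTTM. The following is a minimal sketch of how one line of such a file might be decoded and split into fields. It assumes the nine-column layout of the general NIST RTTM convention (type, file, channel, onset, duration, orthography, subtype, speaker, confidence), which has not been checked against this corpus. The names `RttmRecord` and `parse_rttm_line` and the sample line are hypothetical.

```python
from typing import NamedTuple, Optional


class RttmRecord(NamedTuple):
    # Column order assumed from the general NIST RTTM convention,
    # not verified against the files in this corpus.
    type: str
    file: str
    channel: str
    onset: Optional[float]
    duration: Optional[float]
    ortho: str
    subtype: str
    speaker: str
    conf: str


def _to_float(field: str) -> Optional[float]:
    # RTTM conventionally writes "<NA>" for a missing value.
    return None if field == "<NA>" else float(field)


def parse_rttm_line(raw: bytes) -> Optional[RttmRecord]:
    """Decode one ISO-8859-2 line and split it into RTTM fields.

    Returns None for blank lines and ";;" comment lines.
    """
    text = raw.decode("iso-8859-2").strip()
    if not text or text.startswith(";;"):
        return None
    fields = text.split()
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 fields, got {len(fields)}")
    return RttmRecord(
        fields[0], fields[1], fields[2],
        _to_float(fields[3]), _to_float(fields[4]),
        fields[5], fields[6], fields[7], fields[8],
    )


# Hypothetical sample line, encoded the way the corpus files are said to be.
sample = "LEXEME demo_file 1 12.34 0.56 český lex spk1 <NA>".encode("iso-8859-2")
record = parse_rttm_line(sample)
```

Decoding explicitly with `iso-8859-2` matters because Czech letters such as `č` and `ý` occupy single bytes in that encoding and would be misread under a UTF-8 default.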

Extent: Corpus size: 24576 KB

Identifier: LDC2010T02

https://catalog.ldc.upenn.edu/LDC2010T02

ISBN: 1-58563-534-0

ISLRN: 539-629-573-162-3

DOI: 10.35111/0maf-6v04

Language: Czech

Language (ISO639): ces

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2010T02

Rights Holder: Portions © 2000 Ceska televize, © 2000 Cesky rozhlas 1 Radiozurnal, © 2000 Cesky rozhlas 2 Praha, © 2000 Cesky rozhlas 3 Vltava, © 2000 FTV Premiera, © 2004, 2010 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2010T02

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
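
The GetRecord entries in this record refer to OAI-PMH requests. As a rough sketch, such a request is a URL carrying the `verb`, `identifier` and `metadataPrefix` arguments defined by OAI-PMH. The base URL below is a hypothetical placeholder, since this record does not state the actual endpoint. The prefix `oai_dc` is the standard one for simple DC, while `olac` is assumed here for the OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real OAI-PMH endpoint is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


oai_id = "oai:www.ldc.upenn.edu:LDC2010T02"
dc_url = get_record_url(oai_id, "oai_dc")   # simple DC format
olac_url = get_record_url(oai_id, "olac")   # OLAC format (prefix assumed)
```

`urlencode` percent-encodes the colons in the identifier, so the resulting URL is safe to pass to any HTTP client.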

Search Info
Citation: Kolar, Jachym; Svec, Jan. 2010. Linguistic Data Consortium.
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2010T02
Up-to-date as of: Thu Sep 18 0:59:39 EDT 2025
