OLAC Record: Voice of America (VOA) Czech Broadcast News Transcripts

OLAC Record
oai:www.ldc.upenn.edu:LDC2000T53

Metadata

Title: Voice of America (VOA) Czech Broadcast News Transcripts

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: J, Psutka, et al. Voice of America (VOA) Czech Broadcast News Transcripts LDC2000T53. Web Download. Philadelphia: Linguistic Data Consortium, 2000

Contributor: J, Psutka

V, Radova

L, Muller

J, Matousek

P, Ircing

Date (W3CDTF): 2000

Description: *Introduction* Voice of America (VOA) Czech Broadcast News Transcripts was developed by the University of West Bohemia. The transcripts in this release correspond to Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89). Support for this work was provided by the Ministry of Education of the Czech Republic (Grant No. VS97159); by the Ministry of Education of the Czech Republic (Project ME293); and by the NSF Language Engineering Workshop at the Johns Hopkins University, Baltimore, MD, USA (NSF Grant No. IIS-9820687).

*Data* Between February 9 and May 28, 1999, the Linguistic Data Consortium (LDC) collected approximately 30 hours of Czech broadcast audio from the Voice of America news service. The 62 data files in this corpus are the transcripts of the daily broadcasts of 30-minute news programs. The transcriptions were created by native Czech speakers Pavel Ircing, Jindrich Matousek, Ludek Muller, and Vlasta Radova, working at the Department of Cybernetics, University of West Bohemia in Pilsen, under the direction of Josef Psutka. They used transcription software provided by LDC (the "Transcriber" package), developed by Edouard Geoffrois and Claude Barras at DGA, France, with assistance from Zhibiao Wu at LDC.

The version of Transcriber used for this project produced a text file format that is no longer supported by the software and that does not resemble any transcription format previously published by LDC. The files in this release have therefore been converted into an SGML format that has been used for other broadcast news transcription corpora, specifically the "Universal Transcription Format" (UTF -- not to be confused with the "Unicode Transformation Formats") defined by the speech group at NIST (National Institute of Standards and Technology). A description of that format is provided in the "utf.ps" (PostScript) and "utf.pdf" (Adobe Acrobat) files, and the formal SGML definition is provided in "utf.dtd", all in the release "doc" directory.

The transcription text is rendered using the ISO 8859-2 character set. Information relating this character set to the Unicode standard is available on the LDC site and from the Unicode Consortium.

Due to technical limitations in the hardware at LDC that was used to receive the VOA broadcasts via a satellite downlink, a number of files contain brief portions where the audio signal was interrupted. These interruptions typically yielded regions of complete silence that lasted less than two seconds and were scattered sparsely throughout an affected audio file. Additional markup was added to the transcription texts to isolate the regions where these interruptions occurred. An example transcript is available as LDC2000T53.sample.

*Updates* There are no updates at this time.
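Because the transcription text is in ISO 8859-2 rather than a Unicode encoding, most current tools need it converted before use. The following is a minimal Python sketch of that conversion, not part of the LDC release. The sample sentence and the function name `latin2_to_utf8` are illustrative assumptions; in practice the input bytes would be read from a transcript file opened in binary mode.

```python
# Illustrative Czech sentence (an assumption, not taken from the corpus).
sample = "Hlas Ameriky vysílá zprávy v češtině."

# Simulate what is stored on disk: one byte per character in ISO 8859-2.
raw = sample.encode("iso-8859-2")


def latin2_to_utf8(data: bytes) -> bytes:
    """Decode ISO 8859-2 bytes to text, then re-encode the text as UTF-8."""
    return data.decode("iso-8859-2").encode("utf-8")


converted = latin2_to_utf8(raw)

# Decoding with the wrong codec does not fail; it silently yields wrong letters.
misread = raw.decode("iso-8859-1")
```

Decoding with the correct codec matters here: ISO 8859-1 accepts the same byte values without error but maps several of them to different letters, so a wrong choice corrupts the Czech text without raising an exception.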

Identifier: LDC2000T53

https://catalog.ldc.upenn.edu/LDC2000T53

ISBN: 1-58563-180-9

ISLRN: 152-783-757-211-5

DOI: 10.35111/zsbe-6d67

Language: Czech

Language (ISO639): ces

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2000T53

Rights Holder: Portions © 2000 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2000T53

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
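
The GetRecord entries above correspond to OAI-PMH requests that differ only in their metadata prefix. As a rough sketch, such a request URL can be assembled from the standard OAI-PMH query parameters (`verb`, `identifier`, `metadataPrefix`). The base URL below is a placeholder assumption, since this record does not state the repository endpoint, and `get_record_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real repository base URL is not given in this record.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2000T53"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record and format."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "olac" selects the OLAC format; "oai_dc" selects simple Dublin Core.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
```

The colons in the identifier are percent-encoded by `urlencode`, which keeps the query string valid regardless of the characters an identifier contains.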

Search Info
Citation: J, Psutka; V, Radova; L, Muller; J, Matousek; P, Ircing. 2000. Linguistic Data Consortium.
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2000T53
Up-to-date as of: Fri Aug 8 0:26:30 EDT 2025
