OLAC Record: TDT2 Multilanguage Text Version 4.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2001T57

Metadata

Title: TDT2 Multilanguage Text Version 4.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Liberman, Mark, et al. TDT2 Multilanguage Text Version 4.0 LDC2001T57. Web Download. Philadelphia: Linguistic Data Consortium, 2001

Contributor: Liberman, Mark

Alabiso, Jennifer

Graff, David

Cieri, Christopher

Wayne, Charles

Doddington, George R.

Fiscus, Jonathan G.

Date (W3CDTF): 2001

Description: *Introduction* Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection), and track the reoccurrence of old or new events (tracking).

*Data* TDT2 Multilanguage Text Corpus Version 4.0 contains news data collected daily from nine news sources in two languages (American English and Mandarin Chinese), over a period of six months (January-June 1998). Both manually created reference text and automatically generated text (ASR and/or machine translation) are provided for all broadcast and all Mandarin data. This version has been prepared to complement the first general release of the TDT3 Multilanguage Text Corpus, providing new enhancements to make the data content more accessible to a broader research community.

The news sources and approximate number of stories per source (in thousands) are as follows:

English sources (thousands of stories)
* New York Times Newswire Service: 11.8
* Associated Press Worldstream Service: 12.8
* Cable News Network, Headline News: 15.8
* American Broadcasting Co., World News Tonight: 2.1
* Public Radio International, The World: 2.9
* Voice of America (news programs): 8.2
Total English stories: 53.6 thousand

Mandarin sources (thousands of stories)
* Xinhua News Agency: 11.3
* Zaobao News Agency: 5.2
* Voice of America (news programs): 2.3
Total Mandarin stories: 18.8 thousand

This release consists of the English and Mandarin text components of the TDT2 corpus. The data was collected daily over a period of six months (January-June 1998) from the following sources.
* American Broadcasting Company (ABC)
* Associated Press
* Cable News Network, Inc. (CNN)
* New York Times
* Public Radio International (PRI)
* Voice of America (VOA)
* Xinhua News Agency
* ZaoBao News

The data is provided in the following formats.
* .sgm: Reference true-text, with markup providing story boundaries and descriptive information
* .tkn: Tokenized version of the SGML data, with all descriptive and boundary information removed
* .as0: Output of the Dragon ASR system in tokenized form, with information on timing, speaker clusters, and confidence
* .as1: Output of the BBN ASR system in tokenized form, with timing information (English only)
* .mttkn: SYSTRAN output from .tkn (Mandarin only)
* .mtas0: SYSTRAN output from .as0 (Mandarin only)

The corpus also includes topic relevance tables as well as tables for locating story boundaries.

*Updates* 7/21/16 - Topic tables were added to the release and the online documentation folder.

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
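
The per-source story counts in the description can be checked against the stated totals. The following is a minimal sketch: the figures are copied from the description, while the variable names are illustrative and not part of the corpus or its documentation.

```python
from decimal import Decimal

# Approximate story counts in thousands, copied from the description.
# The dictionary and variable names are illustrative assumptions.
english_sources = {
    "New York Times Newswire Service": Decimal("11.8"),
    "Associated Press Worldstream Service": Decimal("12.8"),
    "Cable News Network, Headline News": Decimal("15.8"),
    "American Broadcasting Co., World News Tonight": Decimal("2.1"),
    "Public Radio International, The World": Decimal("2.9"),
    "Voice of America (news programs)": Decimal("8.2"),
}
mandarin_sources = {
    "Xinhua News Agency": Decimal("11.3"),
    "Zaobao News Agency": Decimal("5.2"),
    "Voice of America (news programs)": Decimal("2.3"),
}

# Decimal avoids binary floating-point rounding when summing tenths.
english_total = sum(english_sources.values())
mandarin_total = sum(mandarin_sources.values())

print(f"English: {english_total} thousand stories")
print(f"Mandarin: {mandarin_total} thousand stories")
```

Both sums agree with the totals given in the description (53.6 thousand English stories and 18.8 thousand Mandarin stories).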

Extent: Corpus size: 626688 KB

Identifier: LDC2001T57

https://catalog.ldc.upenn.edu/LDC2001T57

ISBN: 1-58563-183-3

ISLRN: 662-457-089-041-7

DOI: 10.35111/zfj3-tp72

Language: English

Mandarin Chinese

Language (ISO639): eng

cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2001T57

Rights Holder: Portions © 1998 American Broadcasting Company, The Associated Press, Cable News Network, LP, LLLP, New York Times, Public Radio International, SPH AsiaOne Ltd, Xinhua News Agency, © 1998-2001 Trustees of the University of Pennsylvania

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2001T57

DateStamp: 2021-06-16

GetRecord: OAI-PMH request for simple DC format
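
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is formed, the snippet below builds the query string from the standard OAI-PMH parameters (`verb`, `identifier`, `metadataPrefix`). The base URL is a placeholder, since this record does not state the archive's OAI-PMH endpoint, and the `olac` prefix is assumed to be the one the repository uses for OLAC-format records; `oai_dc` is the prefix OAI-PMH defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): the record does not give the
# archive's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_ID = "oai:www.ldc.upenn.edu:LDC2001T57"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (no network access)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "olac" is assumed for the OLAC format; "oai_dc" is simple Dublin Core.
olac_url = get_record_url(OAI_ID, "olac")
dc_url = get_record_url(OAI_ID, "oai_dc")
print(olac_url)
print(dc_url)
```

The function only constructs the URL; fetching and parsing the XML response is left out because the real endpoint is not given here.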

Search Info
Citation: Liberman, Mark; Alabiso, Jennifer; Graff, David; Cieri, Christopher; Wayne, Charles; Doddington, George R.; Fiscus, Jonathan G. 2001. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Text iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2001T57
Up-to-date as of: Wed Oct 29 7:00:09 EDT 2025
