OLAC Record: HARD 2004 Topics and Annotations

OLAC Record
oai:www.ldc.upenn.edu:LDC2005T29

Metadata

Title: HARD 2004 Topics and Annotations

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Strassel, Stephanie, and Meghan Glenn. HARD 2004 Topics and Annotations LDC2005T29. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Strassel, Stephanie

Contributor: Glenn, Meghan

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-12-20

Description:

*Introduction*

The HARD 2004 Text Corpus was developed by the Linguistic Data Consortium (LDC) and contains approximately 225 million tokens of English text. This corpus contains source data for the 2004 TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD 2004 was a track within the NIST Text REtrieval Conference (TREC), with the objective of achieving high accuracy retrieval from documents by leveraging additional information about the searcher and/or the search context, through techniques like passage retrieval and the use of targeted interaction with the searcher.

The current corpus was previously distributed to HARD Participants as LDC2004E30. The topics and annotations that correspond to this release are distributed as HARD 2004 Topics and Annotations (LDC2005T29). This corpus was created with support from the DARPA TIDES Program and LDC.

*Data*

The corpus comprises eight English newswire and web text sources from January - December 2003. The sources and their volumes of data appear in the table below:

Source                                Code  Stories  Total Tokens  Avg. Tokens/Story
Agence France Presse - English        AFE   226,515    71,829,978                317
Associated Press Newswire             APE   237,067    93,294,584                393
Central News Agency Taiwan - English  CNE     3,674       797,194                217
Los Angeles Times/Washington Post     LAT    18,287    12,576,721                687
New York Times                        NYT    28,190    16,673,028                591
Salon.com                             SLN     3,321     4,710,500              1,418
Ummah Press - English                 UME     2,607       782,064                299
Xinhua News Agency - English          XIE   117,854    24,016,670                203
Totals                                      637,515   224,680,739

Files are organized by source on a daily basis. Each file contains multiple documents identified by unique document IDs, in the form "SRCyyyymmdd.nnnn", where 'nnnn' is a sequential number starting from "0001" for each source/day. In addition, each document has some or all of the following components:

* Keyword (optional), surrounded by tags
* Date/time (optional), surrounded by tags
* Headline, surrounded by tags
* Main part, surrounded by tags. Tags are used within this part to identify paragraph boundaries.

For more information please visit the HARD Project website.

*Samples*

For an example of the data in this corpus, please view this sample (TXT).

*Updates*

None at this time.
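
The document ID form given in the description ("SRCyyyymmdd.nnnn") can be split into source code, date, and sequence number. The following is a minimal Python sketch, not part of the LDC release: the names `DocId` and `parse_doc_id` are hypothetical, the three-letter uppercase source code is an assumption drawn from the codes in the source table, and the sample ID is invented for illustration.

```python
import re
from datetime import date
from typing import NamedTuple

# Follows the "SRCyyyymmdd.nnnn" form from the description. The assumption
# that SRC is exactly three uppercase letters comes from the codes listed
# in the source table (AFE, APE, CNE, ...).
DOC_ID_RE = re.compile(r"([A-Z]{3})(\d{4})(\d{2})(\d{2})\.(\d{4})")


class DocId(NamedTuple):
    """Hypothetical container for the parts of a document ID."""
    source: str   # three-letter source code, e.g. "NYT"
    day: date     # publication day encoded as yyyymmdd
    seq: int      # sequential number, starting from 1 per source/day


def parse_doc_id(doc_id: str) -> DocId:
    """Split a document ID into source, date, and sequence number."""
    match = DOC_ID_RE.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not in SRCyyyymmdd.nnnn form: {doc_id!r}")
    src, year, month, day, seq = match.groups()
    # date() raises ValueError for impossible dates such as month 13.
    return DocId(src, date(int(year), int(month), int(day)), int(seq))


# Invented example ID, used only to illustrate the format.
parsed = parse_doc_id("NYT20030115.0001")
print(parsed.source, parsed.day.isoformat(), parsed.seq)
```

Using `fullmatch` rather than `match` rejects IDs with trailing characters, which seems the safer default when IDs are read from tagged text.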

Extent: Corpus size: 20480 KB

Identifier: LDC2005T29

Identifier: https://catalog.ldc.upenn.edu/LDC2005T29

ISBN: 1-58563-373-9

ISLRN: 721-717-066-331-5

DOI: 10.35111/8sx8-1q92

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005T29

Rights Holder: © 2005 Trustees of the University of Pennsylvania.

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005T29

DateStamp: 2021-07-23

GetRecord: OAI-PMH request for simple DC format
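
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is formed under the OAI-PMH protocol, the Python below assembles the query string. The base URL is a placeholder, since this record does not state the endpoint, and the `olac` metadata prefix is an assumption about what the OLAC format request uses; `oai_dc` is the protocol's standard prefix for simple Dublin Core. The function name `get_record_url` is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not state the real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


oai_id = "oai:www.ldc.upenn.edu:LDC2005T29"
# Simple Dublin Core request ("oai_dc" is defined by the OAI-PMH protocol).
print(get_record_url(oai_id, "oai_dc"))
# OLAC-format request (the "olac" prefix is an assumption).
print(get_record_url(oai_id, "olac"))
```

`urlencode` percent-encodes the colons in the identifier, so the identifier can be passed in exactly as it appears in the OaiIdentifier field.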

Search Info
Citation: Strassel, Stephanie; Glenn, Meghan. 2005. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T29
Up-to-date as of: Wed Oct 29 7:00:26 EDT 2025
