OLAC Record
oai:www.ldc.upenn.edu:LDC2005T28

Metadata
Title:HARD 2004 Text
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Kong, Junbo, et al. HARD 2004 Text LDC2005T28. Web Download. Philadelphia: Linguistic Data Consortium, 2005
Contributor:Kong, Junbo
Graff, David
Maeda, Kazuaki
Strassel, Stephanie
Date (W3CDTF):2005
Date Issued (W3CDTF):2005-12-20
Description:*Introduction*
The HARD 2004 Text Corpus was produced by Linguistic Data Consortium (LDC), catalog number LDC2005T28 and ISBN 1-58563-372-0. This corpus contains source data for the 2004 TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD 2004 was a track within the NIST Text REtrieval Conference (TREC), with the objective of achieving high accuracy retrieval from documents by leveraging additional information about the searcher and/or the search context, through techniques like passage retrieval and the use of targeted interaction with the searcher.

The current corpus was previously distributed to HARD Participants as LDC2004E30. The topics and annotations that correspond to this release are distributed as LDC2005T29, HARD 2004 Topics and Annotations. This corpus was created with support from the DARPA TIDES Program and LDC.

*Data*
The corpus comprises eight English newswire and web text sources from January-December 2003. The sources are

AFE: Agence France Presse - English
APE: Associated Press Newswire
CNE: Central News Agency Taiwan - English
LAT: Los Angeles Times/Washington Post
NYT: New York Times
SLN: Salon.com
UME: Ummah Press - English
XIE: Xinhua News Agency - English

Volume of data for each source appears in the table below:

Source     Stories    Total Tokens    Avg. Token/Story
----------------------------------------------------------
AFE:       226,515      71,829,978                 317
APE:       237,067      93,294,584                 393
CNE:         3,674         797,194                 217
LAT:        18,287      12,576,721                 687
NYT:        28,190      16,673,028                 591
SLN:         3,321       4,710,500               1,418
UME:         2,607         782,064                 299
XIE:       117,854      24,016,670                 203
Total:     637,515     224,680,739

Files are organized by source on a daily basis. Each file contains multiple documents identified by unique document IDs, in the form "SRCyyyymmdd.nnnn", where 'nnnn' is a sequential number starting from "0001" for each source/day.

In addition, each document has some or all of the following components:
- Keyword (optional), surrounded by tags
- Date/time (optional), surrounded by tags
- Headline, surrounded by tags
- Main part, surrounded by tags. Tags are used within this part to identify paragraph boundaries.

For more information please visit the HARD Project website.

*Samples*
For an example of the data in this corpus, please review this sample.
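
The document ID form described above ("SRCyyyymmdd.nnnn") can be checked and split mechanically. The following is a minimal sketch, not part of the corpus distribution. It assumes that SRC is always one of the eight three-letter source codes listed above, that the sequence number is exactly four digits, and that the date is a valid calendar date. The names DocId and parse_doc_id and the example ID are hypothetical.

```python
import re
from datetime import date
from typing import NamedTuple

# Source codes as listed in the record's description.
SOURCES = {"AFE", "APE", "CNE", "LAT", "NYT", "SLN", "UME", "XIE"}

# Assumed layout: 3 uppercase letters, yyyymmdd, a dot, 4-digit sequence.
DOC_ID = re.compile(r"([A-Z]{3})(\d{4})(\d{2})(\d{2})\.(\d{4})")


class DocId(NamedTuple):
    """Hypothetical container for the parts of a document ID."""
    source: str
    day: date
    seq: int


def parse_doc_id(doc_id: str) -> DocId:
    """Split an ID such as 'NYT20030115.0042' into source, date, and sequence."""
    match = DOC_ID.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not in SRCyyyymmdd.nnnn form: {doc_id!r}")
    source, year, month, day, seq = match.groups()
    if source not in SOURCES:
        raise ValueError(f"unknown source code: {source!r}")
    seq_number = int(seq)
    if seq_number < 1:
        # Sequence numbers start from "0001" for each source/day.
        raise ValueError(f"sequence number must start at 0001: {doc_id!r}")
    # date() raises ValueError itself for impossible dates such as month 13.
    return DocId(source, date(int(year), int(month), int(day)), seq_number)


if __name__ == "__main__":
    # Made-up ID for illustration only; not taken from the corpus.
    print(parse_doc_id("NYT20030115.0042"))
```

Grouping parsed IDs by (source, day) would mirror the way the files are organized, one file per source per day.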
Extent:Corpus size: 1572864 KB
Identifier:LDC2005T28
https://catalog.ldc.upenn.edu/LDC2005T28
ISBN: 1-58563-372-0
ISLRN: 269-933-843-612-1
DOI: 10.35111/7h63-pd43
Language:English
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2005T28
Rights Holder: Portions © 2003 Agence France Presse, © 2003 The Associated Press, © 2003 Central News Agency Taiwan, © 2003 Los Angeles Times-Washington Post News Service, Inc., © 2003 The New York Times, © 2003 Salon.com, © 2003 Ummah Press Service, © 2003 Xinhua News Agency, © 2005 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2005T28
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Kong, Junbo; Graff, David; Maeda, Kazuaki; Strassel, Stephanie. 2005. HARD 2004 Text. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T28
Up-to-date as of: Mon Mar 25 7:20:08 EDT 2024