OLAC Record oai:www.ldc.upenn.edu:LDC2010T15 |
Metadata | ||
Title: | Message Understanding Conference 7 Timed (MUC7_T) | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Tomanek, Katrin, and Udo Hahn. Message Understanding Conference 7 Timed (MUC7_T) LDC2010T15. Web Download. Philadelphia: Linguistic Data Consortium, 2010 | |
Contributor: | Tomanek, Katrin | |
Hahn, Udo | ||
Date (W3CDTF): | 2010 | |
Date Issued (W3CDTF): | 2010-09-17 | |
Description: | *Introduction* Message Understanding Conference 7 Timed (MUC7_T), Linguistic Data Consortium (LDC) catalog number LDC2010T15 and ISBN 1-58563-560-X, was developed by researchers at Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany. It is a re-annotation of a portion of the MUC7 corpus (Linguistic Data Consortium, LDC2001T02), which consists of New York Times news stories annotated for use in the Message Understanding Conference 7 (MUC7) evaluation. The series of MUC evaluations in the 1990s focused on emerging information extraction technologies. Further information about NIST's MUC7 evaluation can be found on the MUC project website. MUC7_T consists of 100 articles from the MUC7 corpus training set reannotated for named entities (persons, locations and organizations), with a time stamp recording the time measured for the linguistic decision-making process. The corpus was developed for two principal purposes: for use in evaluations of selective sampling strategies, such as Active Learning, and to create predictive models for annotation costs. The annotation was performed by two advanced students of linguistics with good English language skills, who followed the original guidelines of the MUC7 named entity task (which can be found in the online documentation for the MUC7 corpus).
*Data* The data is stored in XML format. There is an element anno_example for each annotation example, which has the original MUC7 document as text context. The MUC7 document was tokenized using the Stanford Tokenizer, with white spaces marking token boundaries. The tokenizer is part of the Stanford Parser package, which can be obtained from The Stanford Natural Language Processing Group. The following attributes are used for the element anno_example:
Attribute: Explanation
anno_time: The time it took to annotate the annotation unit of this annotation example (time in milliseconds).
anno_unit_tokens: All tokens of the annotation unit.
anno_unit_offset: Offsets for the tokens of the annotation unit relative to all tokens in the annotation example.
anno_unit_labels: Labels for the tokens of the annotation unit (these labels are taken from MUC7).
doc_id: ID of the document of the annotation example.
sent_id: ID of the sentence of the annotation example.
anno_unit_id: ID of the unit of the annotation example.
muc7_org_filename: The name of the original MUC7 document from which this annotation example is taken.
*Directory Structure* The directory structure of the corpus is as follows:
data: This subdirectory contains the MUC7_T data; the data for annotators A and B are in separate folders. For each annotator, there is a version of MUC7_T with CNP-level and with sentence-level annotations.
docs: This subdirectory contains detailed documentation as well as publications describing applications of MUC7_T. There is also a small JavaDoc for the Java tools (see the tools subdirectory below).
dtd: This subdirectory contains the Document Type Definition (DTD) for the data files.
tools: This subdirectory contains a small Java API which allows users to read the MUC7_T XML data so that each annotation example is represented by a Java object. The API includes the source code and a jar package. The source code has been tested with Java 1.5 and Java 1.6.
*Updates* Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T15.
*Samples* The following XML excerpts are representative of the data in this corpus: * CNP * Sentence Level | |
Extent: | Corpus size: 142336 KB | |
Identifier: | LDC2010T15 | |
https://catalog.ldc.upenn.edu/LDC2010T15 | ||
ISBN: 1-58563-560-X | ||
ISLRN: 895-206-642-518-8 | ||
DOI: 10.35111/m7m6-db83 | ||
Language: | English | |
Language (ISO639): | eng | |
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2010T15 | |
Rights Holder: | Portions © 1996 New York Times, © 2001, 2010 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
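The Description above lists the attributes carried by each anno_example element. As a rough illustration of how such a file might be read outside the bundled Java API, the following Python sketch parses a small hand-written sample and computes the average annotation time per token. Only the element name anno_example and the attribute names come from the Description; the root element anno_examples, the whitespace-separated encoding of the token, offset and label attributes, the label strings, the file name and all sample values are assumptions made for illustration. The DTD in the dtd subdirectory is the authority on the real layout.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Hypothetical sample. The root element name, the attribute value formats,
# the label strings and all values are assumptions, not corpus data.
SAMPLE = """<anno_examples>
  <anno_example anno_time="2350" anno_unit_tokens="New York Times"
                anno_unit_offset="2 3 4" anno_unit_labels="ORG ORG ORG"
                doc_id="1" sent_id="3" anno_unit_id="0"
                muc7_org_filename="example.sgm">Yesterday the New York Times reported .</anno_example>
  <anno_example anno_time="1650" anno_unit_tokens="Yesterday"
                anno_unit_offset="0" anno_unit_labels="O"
                doc_id="1" sent_id="3" anno_unit_id="1"
                muc7_org_filename="example.sgm">Yesterday the New York Times reported .</anno_example>
</anno_examples>"""


@dataclass
class AnnoExample:
    """One annotation example, mirroring the attributes in the Description."""
    anno_time_ms: int            # anno_time, in milliseconds
    unit_tokens: List[str]       # anno_unit_tokens (assumed whitespace-separated)
    unit_offsets: List[int]      # anno_unit_offset (assumed whitespace-separated ints)
    unit_labels: List[str]       # anno_unit_labels (assumed whitespace-separated)
    doc_id: str
    sent_id: str
    anno_unit_id: str
    muc7_org_filename: str
    context_tokens: List[str]    # element text, split on white space


def read_examples(xml_text: str) -> List[AnnoExample]:
    """Parse every anno_example element found in the given XML string."""
    root = ET.fromstring(xml_text)
    examples = []
    for el in root.iter("anno_example"):
        examples.append(AnnoExample(
            anno_time_ms=int(el.get("anno_time")),
            unit_tokens=el.get("anno_unit_tokens", "").split(),
            unit_offsets=[int(o) for o in el.get("anno_unit_offset", "").split()],
            unit_labels=el.get("anno_unit_labels", "").split(),
            doc_id=el.get("doc_id", ""),
            sent_id=el.get("sent_id", ""),
            anno_unit_id=el.get("anno_unit_id", ""),
            muc7_org_filename=el.get("muc7_org_filename", ""),
            context_tokens=(el.text or "").split(),
        ))
    return examples


def seconds_per_token(examples: List[AnnoExample]) -> float:
    """Average annotation time per annotated token, in seconds."""
    total_ms = sum(e.anno_time_ms for e in examples)
    total_tokens = sum(len(e.unit_tokens) for e in examples)
    return total_ms / 1000 / total_tokens


if __name__ == "__main__":
    parsed = read_examples(SAMPLE)
    print(len(parsed), "examples,", seconds_per_token(parsed), "s per token")
```

A per-token time like this is one simple starting point for the annotation cost models mentioned in the Description; real files should be checked against the DTD before the attribute formats assumed here are relied on.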
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2010T15 | |
DateStamp: | 2021-02-17 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Tomanek, Katrin; Hahn, Udo. 2010. Linguistic Data Consortium. | |
Terms: | area_Europe country_GB dcmi_Text iso639_eng olac_primary_text |