OLAC Record
oai:www.ldc.upenn.edu:LDC2013T12

Metadata
Title:Manually Annotated Sub-Corpus Third Release
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Ide, Nancy, et al. Manually Annotated Sub-Corpus Third Release LDC2013T12. Web Download. Philadelphia: Linguistic Data Consortium, 2013
Contributor:Ide, Nancy
Suderman, Keith
Baker, Collin
Passonneau, Rebecca
Fellbaum, Christiane
Date (W3CDTF):2013
Date Issued (W3CDTF):2013-07-17
Description:*Introduction* Manually Annotated Sub-Corpus (MASC) Third Release was developed as part of The American National Corpus project and consists of approximately 500,000 words of contemporary American English written and spoken data annotated for a wide variety of linguistic phenomena. The MASC project was established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data. The project provides appropriate data and annotations to serve as the base for a community-wide annotation effort, together with an infrastructure that enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or transduced to any of a variety of other formats. The aim is to offset some of the high costs of producing high quality linguistic annotations via a distribution of effort and to solve some of the usability problems for annotations produced at different sites by harmonizing their representation formats. It also provides data from a much wider variety of genres than are often present in existing multiply-annotated corpora of English, and all of the data in the corpus are drawn from current American English so as to be most useful for natural language processing applications used in the web-based environment. Further information about the project is available at the MASC website. The source texts were drawn from the open portion of the American National Corpus Second Release, which includes written texts and spoken transcripts of American English from a broad range of genres produced since 1990 and from the Language Understanding Annotation Corpus, a collection of various genres including broadcast, newswire, email, and telephone speech annotated for committed belief, event and entity coreference, dialog acts and temporal relations.
MASC Third Release includes the contents of MASC First Release (LDC2010T22) (82,000 words), which is also available from LDC. There is no second release. *Data* All data in this release were annotated for logical structure (paragraph, headings, etc.), token and sentence boundaries, part of speech and lemma, shallow parse (noun and verb chunks) and named entities (person, organization, location and date). Portions of the corpus were also annotated for FrameNet frames (40k full text), Penn Treebank syntax (82k) and opinion (50k). All annotations were either manually produced or hand-validated and represented in ISO-GrAF standoff format. The original texts were derived from original electronic versions in a wide variety of formats, including but not limited to Quark Express, XML, Microsoft Word, Portable Document Format (PDF), HTML, and plain text. Transduction procedures varied depending on the original format. As little correction or other editorial modification as possible was applied to the text. Corrections to the text were either made in standoff documents containing the corrected version or were reflected in values of segmentation, token, sentence, or other segmental unit, and/or part of speech annotation. The data are segmented into minimal regions spanning the primary data. A minimal region is the smallest unit referenced by any of the tokenizations applied to the data. Token annotations reference these regions as appropriate. Sentences reference regions in primary data. *Samples* Please consult this email sample and telephone sample. *Updates* None at this time.
Extent:Corpus size: 358112 KB
Identifier:LDC2013T12
https://catalog.ldc.upenn.edu/LDC2013T12
ISBN: 1-58563-647-9
ISLRN: 021-129-973-518-8
DOI: 10.35111/ctg7-5698
Language:English
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2013T12
Rights Holder:Portions © 2003, 2005, 2013 American National Corpus Project, © 2000 The Associated Press, © 1987-1989 Dow Jones & Company, Inc., © 1999-2002 English Language Institute, the University of Michigan, © 2004 Ferd Eggan, © 2003 Indiana Center for Intercultural Communication, © 2003 Langenscheidt Publishers, © 1996-2000 Microsoft, Inc., © 2000, 2002 New York Times, © 1999, 2001, 2003 Oxford University Press, © 2003 Word, Inc., © 1998-2005 Orin Hargraves, © 1993, 1997-2003, 2005, 2010, 2013 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2013T12
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
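
For readers who want to reproduce the GetRecord requests listed above, the following is a minimal sketch of how an OAI-PMH GetRecord URL is typically assembled from this record's OaiIdentifier. The base URL is a hypothetical placeholder (the actual endpoint is not given in this record), and the function name is illustrative. `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core; `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real OAI-PMH endpoint is not stated in this record.
BASE_URL = "https://example.org/oai"


def build_getrecord_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Assemble an OAI-PMH GetRecord request URL.

    The three query arguments (verb, identifier, metadataPrefix) are the
    ones the OAI-PMH protocol requires for a GetRecord request.
    """
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    oai_id = "oai:www.ldc.upenn.edu:LDC2013T12"
    # Simple DC format (standard prefix) and OLAC format (assumed prefix).
    print(build_getrecord_url(BASE_URL, oai_id, "oai_dc"))
    print(build_getrecord_url(BASE_URL, oai_id, "olac"))
```

The colons in the identifier are percent-encoded by `urlencode`, which keeps the identifier intact as a single query value.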

Search Info

Citation: Ide, Nancy; Suderman, Keith; Baker, Collin; Passonneau, Rebecca; Fellbaum, Christiane. 2013. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2013T12
Up-to-date as of: Fri Dec 6 7:48:12 EST 2024