OLAC Record: Manually Annotated Sub-Corpus Third Release

OLAC Record
oai:www.ldc.upenn.edu:LDC2013T12

Metadata

Title: Manually Annotated Sub-Corpus Third Release

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ide, Nancy, et al. Manually Annotated Sub-Corpus Third Release LDC2013T12. Web Download. Philadelphia: Linguistic Data Consortium, 2013

Contributor: Ide, Nancy

Suderman, Keith

Baker, Collin

Passonneau, Rebecca

Fellbaum, Christiane

Date (W3CDTF): 2013

Date Issued (W3CDTF): 2013-07-17

Description: *Introduction* Manually Annotated Sub-Corpus (MASC) Third Release was developed as part of The American National Corpus project and consists of approximately 500,000 words of contemporary American English written and spoken data annotated for a wide variety of linguistic phenomena. The MASC project was established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data. The project provides appropriate data and annotations to serve as the base for a community-wide annotation effort, together with an infrastructure that enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or transduced to any of a variety of other formats. The aim is to offset some of the high costs of producing high-quality linguistic annotations via a distribution of effort and to solve some of the usability problems for annotations produced at different sites by harmonizing their representation formats. It also provides data from a much wider variety of genres than are often present in existing multiply-annotated corpora of English, and all of the data in the corpus are drawn from current American English so as to be most useful for natural language processing applications used in the web-based environment. Further information about the project is available at the MASC website. The source texts were drawn from the open portion of the American National Corpus Second Release, which includes written texts and spoken transcripts of American English from a broad range of genres produced since 1990, and from the Language Understanding Annotation Corpus, a collection of various genres including broadcast, newswire, email, and telephone speech annotated for committed belief, event and entity coreference, dialog acts and temporal relations.
MASC Third Release includes the contents of MASC First Release (LDC2010T22) (82,000 words), which is also available from LDC. There is no second release. *Data* All data in this release was annotated for logical structure (paragraph, headings, etc.), token and sentence boundaries, part of speech and lemma, shallow parse (noun and verb chunks) and named entities (person, organization, location and date). Portions of the corpus were also annotated for FrameNet frames (40k full text), Penn Treebank syntax (82k) and opinion (50k). All annotations were either manually produced or hand-validated and represented in ISO-GrAF standoff format. The original texts were derived from original electronic versions in a wide variety of formats, including but not limited to Quark Express, XML, Microsoft Word, Portable Document Format (PDF), HTML, and plain text. Transduction procedures varied depending on the original format. As little correction or other editorial modification as possible was applied to the text. Corrections to the text were either made in standoff documents containing the corrected version or were reflected in values of segmentation, token, sentence, or other segmental unit, and/or part of speech annotation. The data are segmented into minimal regions spanning the primary data. A minimal region is the smallest unit referenced by any of the tokenizations applied to the data. Token annotations reference these regions as appropriate. Sentences reference regions in primary data. *Samples* Please consult this email sample and telephone sample. *Updates* None at this time.
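The segmentation into minimal regions described above can be illustrated with a small sketch. It is not the ISO-GrAF format and not the corpus's own tooling: it uses plain character offsets, and the function names, the sample sentence, and the two tokenizations are all hypothetical. It assumes a minimal region is the smallest span delimited by the boundaries of any tokenization, and that a token then refers to one or more of those regions.

```python
# Illustrative only: plain (start, end) character offsets, not the GrAF XML used in the release.

def minimal_regions(tokenizations):
    """Return the smallest spans referenced by any tokenization, in text order."""
    all_spans = [span for spans in tokenizations for span in spans]
    # Every token start or end is a potential region boundary.
    cuts = sorted({offset for span in all_spans for offset in span})
    regions = []
    for start, end in zip(cuts, cuts[1:]):
        # Keep a candidate only if some token covers it; this skips
        # untokenized gaps such as whitespace.
        if any(s <= start and end <= e for s, e in all_spans):
            regions.append((start, end))
    return regions


def regions_for(span, regions):
    """Indices of the minimal regions that a token span references."""
    s, e = span
    return [i for i, (rs, re_) in enumerate(regions) if s <= rs and re_ <= e]


text = "I can't go"
# Two hypothetical tokenizations of the same primary data.
whitespace_tokens = [(0, 1), (2, 7), (8, 10)]     # I | can't | go
split_tokens = [(0, 1), (2, 4), (4, 7), (8, 10)]  # I | ca | n't | go

regions = minimal_regions([whitespace_tokens, split_tokens])
cant_refs = regions_for((2, 7), regions)  # the whitespace token "can't"
```

In this sketch the token "can't" from the first tokenization references two regions ("ca" and "n't"), while each token of the second tokenization references exactly one. The primary text itself is never modified.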

Extent: Corpus size: 358112 KB

Identifier: LDC2013T12

https://catalog.ldc.upenn.edu/LDC2013T12

ISBN: 1-58563-647-9

ISLRN: 021-129-973-518-8

DOI: 10.35111/ctg7-5698

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2013T12

Rights Holder: Portions © 2003, 2005, 2013 American National Corpus Project, © 2000 The Associated Press, © 1987-1989 Dow Jones & Company, Inc., © 1999-2002 English Language Institute, the University of Michigan, © 2004 Ferd Eggan, © 2003 Indiana Center for Intercultural Communication, © 2003 Langenscheidt Publishers, © 1996-2000 Microsoft, Inc., © 2000, 2002 New York Times, © 1999, 2001, 2003 Oxford University Press, © 2003 Word, Inc., © 1998-2005 Orin Hargraves, © 1993, 1997-2003, 2005, 2010, 2013 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2013T12

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
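The GetRecord entries above correspond to OAI-PMH requests. As a rough sketch, such a request is an HTTP GET with the `verb`, `identifier`, and `metadataPrefix` arguments defined by the OAI-PMH protocol. The base URL below is a placeholder, not the actual endpoint for this archive, and the helper name is hypothetical. The prefix `olac` for the OLAC format is an assumption about the archive's configuration; `oai_dc` is the simple Dublin Core prefix that OAI-PMH requires every repository to support.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# The OAI identifier is the one listed in this record.
olac_url = get_record_url("oai:www.ldc.upenn.edu:LDC2013T12", "olac")
dc_url = get_record_url("oai:www.ldc.upenn.edu:LDC2013T12", "oai_dc")
```

The colons in the identifier are percent-encoded by `urlencode`, which is what a conforming request needs.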

Search Info
Citation: Ide, Nancy; Suderman, Keith; Baker, Collin; Passonneau, Rebecca; Fellbaum, Christiane. 2013. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2013T12
Up-to-date as of: Wed Oct 29 7:01:24 EDT 2025
