OLAC Record: Manually Annotated Sub-Corpus First Release

OLAC Record
oai:www.ldc.upenn.edu:LDC2010T22

Metadata

Title: Manually Annotated Sub-Corpus First Release

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ide, Nancy, et al. Manually Annotated Sub-Corpus First Release LDC2010T22. Web Download. Philadelphia: Linguistic Data Consortium, 2010

Contributor: Ide, Nancy

Suderman, Keith

Baker, Collin

Passonneau, Rebecca

Fellbaum, Christiane

Date (W3CDTF): 2010

Date Issued (W3CDTF): 2010-12-20

Description: *Introduction* The Manually Annotated Sub-Corpus First Release (MASC I), Linguistic Data Consortium (LDC) catalog number LDC2010T22 and ISBN 1-58563-569-3, is the first of three releases of 500,000 words of MASC data developed as part of the American National Corpus (ANC) project. MASC I consists of approximately 80,000 words of contemporary spoken and written American English annotated for a variety of linguistic phenomena.

The MASC project is sponsored by the National Science Foundation and was established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data. Researchers from Vassar College, Columbia University and the International Computer Science Institute, University of California at Berkeley are the principal participants; the WordNet project provides consulting.

The source texts in MASC I are drawn from two corpora. The first is the open portion of the American National Corpus (ANC) Second Release LDC2005T35, which includes written texts and spoken transcripts of American English from a broad range of genres produced since 1990. The second is the Language Understanding Annotation Corpus (LU Corpus) LDC2009T10, a collection of various genres including broadcast, newswire, email and telephone speech annotated for committed belief, event and entity coreference, dialog acts and temporal relations.

All of the words of data in MASC I have validated annotations for token, part of speech, sentence boundary, noun chunks, verb chunks, named entities and Penn Treebank syntax. Full-text FrameNet annotations are available for seventeen texts, and WordNet word sense annotations are available for 1000 occurrences of each of fifty-three words. Annotations of all or portions of the sub-corpus for a wide variety of other linguistic phenomena have been contributed by other projects. Software and services available from the ANC project website enable transduction of MASC into a wide variety of physical formats.

*Data* The MASC directory contains two folders: masc-1.0.3 and masc_wordsense. masc-1.0.3 contains the actual MASC corpus and consists of two folders, spoken and written. The spoken folder contains data and annotations for spoken material, and the written folder contains the same for written texts. The files in each of these folders follow naming conventions that describe the contents of the file. masc_wordsense contains the MASC sentence samples with word sense annotations, using WordNet sense numbers as the annotation values.

*Updates* Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T22.

Extent: Corpus size: 183296 KB

Identifier: LDC2010T22

https://catalog.ldc.upenn.edu/LDC2010T22

ISBN: 1-58563-569-3

ISLRN: 461-028-050-892-8

DOI: 10.35111/5f8p-g428

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2010T22

Rights Holder: Portions © 2000 The Associated Press, © 1987-1989 Dow Jones & Company, Inc., © 2000 New York Times, © 1997-2002, 2010 Trustees of the University of Pennsylvania

Contact: ldc@ldc.upenn.edu © 2010 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2010T22

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
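
The GetRecord entries above refer to standard OAI-PMH requests, which take the parameters `verb`, `identifier` and `metadataPrefix`. As a minimal sketch, the request URLs for this record could be assembled as shown below. The base URL is a placeholder and an assumption, since this record does not state the archive's OAI-PMH endpoint. The prefixes `olac` and `oai_dc` are the conventional names for the OLAC and simple Dublin Core formats and are likewise assumed here.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the archive's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
RECORD_ID = "oai:www.ldc.upenn.edu:LDC2010T22"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# One URL per metadata format mentioned in the record.
olac_url = get_record_url(RECORD_ID, "olac")
dc_url = get_record_url(RECORD_ID, "oai_dc")

print(olac_url)
print(dc_url)
```

Fetching either URL from a real endpoint would return an XML document wrapping the record in the requested format.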

Search Info
Citation: Ide, Nancy; Suderman, Keith; Baker, Collin; Passonneau, Rebecca; Fellbaum, Christiane. 2010. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2010T22
Up-to-date as of: Wed Oct 29 7:01:14 EDT 2025
