OLAC Record: American National Corpus (ANC) Second Release

OLAC Record
oai:www.ldc.upenn.edu:LDC2005T35

Metadata

Title: American National Corpus (ANC) Second Release

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Reppen, Randi, Nancy Ide, and Keith Suderman. American National Corpus (ANC) Second Release LDC2005T35. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Reppen, Randi

Ide, Nancy

Suderman, Keith

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-12-15

Description: *Introduction* American National Corpus (ANC) Second Release was developed by various contributors and contains approximately 22 million words of American English text from multiple genres, with various annotations such as part-of-speech (POS) tagging.

The American National Corpus (ANC) project fosters the development of a corpus comparable to the British National Corpus (BNC), covering American English. Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English, due to the numerous differences in use of the language. The ANC is being developed with help from a consortium of American English dictionary publishers and companies interested in language processing that was formed in 1999. Consortium members are providing materials for inclusion in the corpus, and provided initial financial support for the project.

The availability of a corpus of American English will significantly contribute to language and linguistic research, the development of language understanding computer applications (e.g., language translation and search and retrieval software), and the compilation of reference works such as dictionaries and thesauri. It will also provide a rich national resource for use in education at all levels.

*Data* In addition to the more than 10 million words added in the Second Release, this corpus contains a new corrected and validated version of the 11-million-word ANC First Release and software for searching and retrieving multiple stand-off annotations.

ANC Second Release contains texts from the following sources (a trailing * denotes a new source in the Second Release):

    Transcribed telephone speech
    The New York Times
    Berlitz Travel Guides
    Slate Magazine
    ICIC Corpus of Fundraising Texts *
    The Michigan Corpus of Academic Spoken English (MICASE) *
    Various non-fiction
    Various fiction *
    Various medical research articles *
    Anonymized posts to the Phoenix Board/Buffistas.org *

The corpus includes the data as a UTF-16 encoded file plus annotations of the documents, such as automatic POS tagging with two different types of tagsets, automatic noun and verb phrase identification, and structural information at the paragraph and sentence level.

The goal of the ANC is to ultimately contain a core corpus of at least 100 million words, including both written and spoken data (transcripts), comparable across genres to the BNC.

ANC Second Release contains data governed under two types of licenses, an open license and a restricted license. Both the Open License Agreement and the Restricted License Agreement need to be signed in order to receive ANC Second Release, and the data must be used in accordance with the agreement by which it is governed.

Additional documentation and information are available at the ANC web site.

*Samples* For examples of the data in this corpus, see the plain text sample (TXT) and its POS annotation with the Penn tagset (XML).

*Updates* None at this time.

*Sponsorship* The publication of this corpus was facilitated by funding extended by the TalkBank project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-98009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.
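
The description mentions UTF-16 encoded text with stand-off annotations, that is, annotations stored apart from the text and tied to it by position. The following is only a minimal sketch of that general idea, not the ANC's actual file format or its bundled software: the sample sentence, the (start, end, tag) tuple layout, and the function names are all hypothetical, and the offsets here count Python string characters, which may differ from how the corpus counts them.

```python
def decode_utf16(raw: bytes) -> str:
    """Decode UTF-16 bytes; Python's 'utf-16' codec consumes a leading BOM."""
    return raw.decode("utf-16")


def apply_standoff(text, annotations):
    """Pair each (start, end, tag) span with the slice of text it covers.

    The tuple layout is an assumption made for illustration only.
    """
    return [(text[start:end], tag) for start, end, tag in annotations]


# Hypothetical sample data, not taken from the corpus.
raw = "The cat sleeps.".encode("utf-16")
text = decode_utf16(raw)

# Penn-style POS tags attached by character offsets into the primary text.
spans = [(0, 3, "DT"), (4, 7, "NN"), (8, 14, "VBZ"), (14, 15, ".")]
tagged = apply_standoff(text, spans)

for token, tag in tagged:
    print(f"{token}\t{tag}")
```

Because the primary text is never modified, several annotation layers (for example two POS tagsets, or phrase and sentence boundaries) can point into the same text independently.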

Extent: Corpus size: 5662310 KB

Identifier: LDC2005T35

https://catalog.ldc.upenn.edu/LDC2005T35

ISBN: 1-58563-369-0

ISLRN: 797-978-576-065-6

DOI: 10.35111/251h-g440

Language: English

Language (ISO639): eng

License: American National Corpus 2nd Release - Open: https://catalog.ldc.upenn.edu/license/anc-2nd-release-open.pdf

American National Corpus 2nd Release - Restricted: https://catalog.ldc.upenn.edu/license/anc-2nd-release-restricted.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005T35

Rights Holder: Portions © 2002 New York Times, © 2003 Langenscheidt Publishers, © 1996-2000 Microsoft, Inc., © 1999, 2001, 2003 Oxford University Press, © 2003 Word, Inc., © 1998-2005 Orin Hargraves, © 2004 Ferd Eggan, © 2003 Indiana Center for Intercultural Communication, © 1999-2002, English Language Institute, the University of Michigan, © 2003, 2005 American National Corpus Project, © 1993, 1997, 2003, 2005 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005T35

DateStamp: 2021-07-16

GetRecord: OAI-PMH request for simple DC format
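
The GetRecord entries above refer to OAI-PMH requests for this record in OLAC and simple Dublin Core formats. As a rough sketch of how such a request URL is formed under the OAI-PMH protocol: the base URL below is a placeholder, not the archive's real endpoint, and the `olac` metadata prefix is assumed to be the one the repository advertises (`oai_dc` is the prefix the protocol requires every repository to support).

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the repository's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


OAI_ID = "oai:www.ldc.upenn.edu:LDC2005T35"

olac_url = get_record_url(OAI_ID, "olac")   # OLAC format (assumed prefix)
dc_url = get_record_url(OAI_ID, "oai_dc")   # simple Dublin Core

print(olac_url)
print(dc_url)
```

The colons in the identifier are percent-encoded by `urlencode`, which is the safe form for a query string.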

Search Info
Citation: Reppen, Randi; Ide, Nancy; Suderman, Keith. 2005. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T35
Up-to-date as of: Wed Oct 29 7:00:53 EDT 2025
