OLAC Record oai:www.ldc.upenn.edu:LDC2005T35 |
Metadata | ||
Title: | American National Corpus (ANC) Second Release | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Reppen, Randi, Nancy Ide, and Keith Suderman. American National Corpus (ANC) Second Release LDC2005T35. Web Download. Philadelphia: Linguistic Data Consortium, 2005 | |
Contributor: | Reppen, Randi | |
Ide, Nancy | ||
Suderman, Keith | ||
Date (W3CDTF): | 2005 | |
Date Issued (W3CDTF): | 2005-12-15 | |
Description: | *Introduction* American National Corpus (ANC) Second Release was developed by various contributors and contains approximately 22 million words of American English text from multiple genres, with annotations such as part-of-speech (POS) tagging. The American National Corpus (ANC) project fosters the development of a corpus comparable to the British National Corpus (BNC), covering American English. Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English, due to the numerous differences in use of the language. The ANC is being developed with help from a consortium of American English dictionary publishers and companies interested in language processing that was formed in 1999. Consortium members are providing materials for inclusion in the corpus, and provided initial financial support for the project. The availability of a corpus of American English will significantly contribute to language and linguistic research, the development of language understanding computer applications (e.g., language translation and search and retrieval software), and the compilation of reference works such as dictionaries and thesauri. It will also provide a rich national resource for use in education at all levels.
*Data* In addition to the more than 10 million words added in the Second Release, this corpus contains a new, corrected and validated version of the 11 million word ANC First Release and software for searching and retrieving multiple stand-off annotations.
ANC Second Release contains texts from the following sources (a trailing * denotes a new source in the Second Release):
* Transcribed telephone speech
* The New York Times
* Berlitz Travel Guides
* Slate Magazine
* ICIC Corpus of Fundraising Texts *
* The Michigan Corpus of Academic Spoken English (MICASE) *
* Various non-fiction
* Various fiction *
* Various medical research articles *
* Anonymized posts to the Phoenix Board/Buffistas.org *
The corpus includes the data as a UTF-16 encoded file plus annotations of the documents such as automatic POS tagging with two different types of tagsets, automatic noun and verb phrase identification, and structural information at the paragraph and sentence level. The goal of the ANC is to ultimately contain a core corpus of at least 100 million words, including both written and spoken data (transcripts) comparable across genres to the BNC. ANC Second Release contains data governed under two types of licenses, an open license and a restricted license. Both the Open License Agreement and the Restricted License Agreement need to be signed in order to receive ANC Second Release, and the data must be used in accordance with the agreement by which it is governed. Additional documentation and information are available at the ANC web site.
*Samples* For examples of the data in this corpus, please review this plain text sample (TXT) and its POS annotation with Penn tagset (XML).
*Updates* None at this time.
*Sponsorship* The publication of this corpus was facilitated by funding extended by the TalkBank project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-98009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania. | |
Extent: | Corpus size: 5662310 KB | |
Identifier: | LDC2005T35 | |
https://catalog.ldc.upenn.edu/LDC2005T35 | ||
ISBN: 1-58563-369-0 | ||
ISLRN: 797-978-576-065-6 | ||
DOI: 10.35111/251h-g440 | ||
Language: | English | |
Language (ISO639): | eng | |
License: | American National Corpus 2nd Release - Open: https://catalog.ldc.upenn.edu/license/anc-2nd-release-open.pdf | |
American National Corpus 2nd Release - Restricted: https://catalog.ldc.upenn.edu/license/anc-2nd-release-restricted.pdf | ||
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2005T35 | |
Rights Holder: | Portions © 2002 New York Times, © 2003 Langenscheidt Publishers, © 1996-2000 Microsoft, Inc., © 1999, 2001, 2003 Oxford University Press, © 2003 Word, Inc., © 1998-2005 Orin Hargraves, © 2004 Ferd Eggan, © 2003 Indiana Center for Intercultural Communication, © 1999-2002, English Language Institute, the University of Michigan, © 2003, 2005 American National Corpus Project, © 1993, 1997, 2003, 2005 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2005T35 | |
DateStamp: | 2021-07-16 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Reppen, Randi; Ide, Nancy; Suderman, Keith. 2005. Linguistic Data Consortium. | |
Terms: | area_Europe country_GB dcmi_Text iso639_eng olac_primary_text |
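
The GetRecord entries in this record refer to OAI-PMH requests for the identifier oai:www.ldc.upenn.edu:LDC2005T35 in OLAC and simple DC formats. As a minimal sketch of what such a request URL might look like: the `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol, while the base URL below is a placeholder (this record does not state the endpoint), and the function name `build_get_record_url` is hypothetical.

```python
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint. The record above does not give the
# OAI-PMH base URL, so substitute the real one before use.
BASE_URL = "https://example.org/oai"

# Taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2005T35"


def build_get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "olac" requests the OLAC format; "oai_dc" requests simple Dublin Core.
olac_url = build_get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = build_get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

The identifier is percent-encoded by `urlencode`, so the colons appear as `%3A` in the resulting URLs.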