OLAC Record oai:www.ldc.upenn.edu:LDC2020T20 |
Metadata | ||
Title: | BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Agarwal, Nitin, et al. BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech LDC2020T20. Web Download. Philadelphia: Linguistic Data Consortium, 2020 | |
Contributor: | Agarwal, Nitin | |
Franchini, Michelle | ||
Kappler, Michelle | ||
Micciulla, Linnea | ||
Pradhan, Sameer | ||
Ramshaw, Lance | ||
Date (W3CDTF): | 2020 | |
Date Issued (W3CDTF): | 2020-12-15 | |
Description: | *Introduction* BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation on English discussion forum (DF), SMS/Chat and conversational telephone speech (CTS). The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. The Linguistic Data Consortium (LDC) supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference. *Data* DF data was collected from the web using a combination of manual and automatic processes. SMS/Chat material was donated or collected via live platforms. CTS data was taken from LDC's Arabic and Chinese CALLHOME and CALLFRIEND telephone collections; the audio files were transcribed and translated into English. Co-reference annotation aims to fill in all of the connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation. It covers noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers and verbs. Annotation files are presented in UTF-8 encoded XML format. *Acknowledgements* This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. *Samples* Please view these samples: * CTS Sample (TXT) * DF Sample (TXT) * SMS Sample (TXT) *Updates* None at this time. | |
Extent: | Corpus size: 13290 KB | |
Identifier: | LDC2020T20 | |
https://catalog.ldc.upenn.edu/LDC2020T20 | ||
ISBN: 1-58563-951-6 | ||
ISLRN: 494-155-932-422-8 | ||
DOI: 10.35111/8wq1-d250 | ||
Language: | English | |
Language (ISO639): | eng | |
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2020T20 | |
Rights Holder: | Portions © 1996, 1997, 2011-2020 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2020T20 | |
DateStamp: | 2021-03-17 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Agarwal, Nitin; Franchini, Michelle; Kappler, Michelle; Micciulla, Linnea; Pradhan, Sameer; Ramshaw, Lance. 2020. Linguistic Data Consortium. | |
Terms: | area_Europe country_GB dcmi_Text iso639_eng olac_primary_text |
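
The GetRecord entries above refer to OAI-PMH requests for this record in the OLAC and simple DC formats. As a rough sketch of how such a request URL is typically assembled, the snippet below combines the OaiIdentifier from this record with the standard OAI-PMH `verb`, `identifier` and `metadataPrefix` parameters. The base URL `https://example.org/oai` is a placeholder, since the record does not state the archive's actual endpoint, and the helper name `build_get_record_url` is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the archive's real OAI-PMH base URL.
EXAMPLE_BASE_URL = "https://example.org/oai"

# Identifier copied from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2020T20"


def build_get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "olac" selects the OLAC format; "oai_dc" selects simple Dublin Core.
olac_url = build_get_record_url(EXAMPLE_BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = build_get_record_url(EXAMPLE_BASE_URL, OAI_IDENTIFIER, "oai_dc")
print(olac_url)
print(dc_url)
```

Only the query parameters follow from the OAI-PMH protocol; the endpoint has to be replaced with the one the archive actually publishes before either URL will resolve.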