OLAC Record: BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

OLAC Record
oai:www.ldc.upenn.edu:LDC2021T07

Metadata

Title: BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Agarwal, Nitin, et al. BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech LDC2021T07. Web Download. Philadelphia: Linguistic Data Consortium, 2021

Contributor: Agarwal, Nitin

Francini, Michelle

Kappler, Michelle

Micciulla, Linnea

Pradhan, Sameer

Ramshaw, Lance

Date (W3CDTF): 2021

Date Issued (W3CDTF): 2021-03-15

Description: *Introduction* BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation on Chinese discussion forum (DF), SMS/Chat and conversational telephone speech (CTS). The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. The Linguistic Data Consortium (LDC) supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

*Data* DF data was collected from the web using a combination of manual and automatic processes. SMS/Chat material was donated or collected via live platforms. CTS data was taken from LDC's Chinese CALLHOME and CALLFRIEND telephone collections. Co-reference annotation aims to fill in all of the connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation. It covers noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers and verbs. Annotation files are presented in UTF-8 encoded XML format.

*Sponsorship* This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

*Samples* Please view these samples:
* SMS/Chat Sample (TXT)
* DF Sample (TXT)
* CTS Sample (TXT)

*Updates* None at this time.

Extent: Corpus size: 10881 KB

Identifier: LDC2021T07

https://catalog.ldc.upenn.edu/LDC2021T07

ISBN: 1-58563-958-3

ISLRN: 877-443-319-938-1

DOI: 10.35111/wncq-zy49

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2021T07

Rights Holder: Portions © 1996, 2012-2016, 2021 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2021T07

DateStamp: 2022-01-01

GetRecord: OAI-PMH request for simple DC format
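
The GetRecord entries above refer to standard OAI-PMH requests. As a minimal sketch of how such a request URL is typically assembled, the snippet below builds GetRecord URLs for this record's OAI identifier. The base URL is a placeholder, since the record does not state the archive's OAI-PMH endpoint; the `olac` and `oai_dc` metadata prefixes are the conventional names for the OLAC and simple DC formats and are assumed here rather than confirmed by the record.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the archive's OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# OAI identifier taken from the record itself.
OAI_ID = "oai:www.ldc.upenn.edu:LDC2021T07"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,  # assumed prefix names
    })
    return f"{base_url}?{query}"


# One URL per format listed in the record.
olac_url = get_record_url(OAI_ID, "olac")
dc_url = get_record_url(OAI_ID, "oai_dc")
```

Fetching either URL from a real endpoint would return an XML response wrapping the record in the requested metadata format.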

Search Info
Citation: Agarwal, Nitin; Francini, Michelle; Kappler, Michelle; Micciulla, Linnea; Pradhan, Sameer; Ramshaw, Lance. 2021. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2021T07
Up-to-date as of: Wed Oct 29 7:02:04 EDT 2025
