OLAC Record
oai:www.ldc.upenn.edu:LDC2010T05

Metadata
Title:NPS Internet Chatroom Conversations, Release 1.0
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Forsyth, Eric, Jane Lin, and Craig Martell. NPS Internet Chatroom Conversations, Release 1.0 LDC2010T05. Web Download. Philadelphia: Linguistic Data Consortium, 2010
Contributor:Forsyth, Eric
Lin, Jane
Martell, Craig
Date (W3CDTF):2010
Date Issued (W3CDTF):2010-03-17
Description:*Introduction* NPS Internet Chatroom Conversations, Release 1.0 consists of 10,567 English posts (45,068 tokens) gathered from age-specific chat rooms of various online chat services in October and November 2006. Each file is a text recording from one of these chat rooms for a short period on a particular day. Users should be aware that some of the conversations in this corpus feature subjects and language that some people may find offensive or objectionable, including discussions of a sexual nature.
This corpus was developed by researchers at the Department of Computer Science, Naval Postgraduate School, Monterey, California.
Although much work has been accomplished in Natural Language Processing (NLP) in traditional written and spoken language domains, relatively little has been performed in the newer computer-mediated communication (CMC) domains enabled by the Internet, such as text-based chat. One factor inhibiting research in this area has been the dearth of annotated CMC corpora available to the broader research community, despite the increasing use of CMC in a variety of areas and applications. NPS Internet Chatroom Conversations is one of the first text-based chat corpora tagged with lexical and discourse information. This corpus might be used to develop stochastic NLP applications that perform tasks such as conversation thread topic detection, author profiling, entity identification, and social network analysis.
Each post is annotated with a chat dialog-act tag, and individual tokens within each post are annotated with part-of-speech tags. 3,507 tokenized posts were automatically tagged using a part-of-speech tagger trained on the Penn Treebank corpora, combined with a regular expression that identified privacy-masked user names and emoticons. Similarly, simple regular expression matching was employed to assign an initial chat dialog-act to each of this subset of posts. This initial tagging was verified by hand (with corrections made where necessary). The remaining 7,060 posts were POS-tagged using a POS tagger that was trained on the newly hand-tagged chat data and the Penn Treebank corpora. Dialog-act tagging on the remaining posts was accomplished using a back-propagation neural network trained on 21 features of the initial dialog-act-labeled posts. The tagging of this second group of posts was also manually verified (and corrected where necessary). Ultimately, all of the 10,567 privacy-masked posts, representing 45,068 tokens, were annotated with manually verified part-of-speech and dialog act information.
Filenames consist of date, target age group, and number of posts. For example, the file 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on October 19, 2006.
The posts have been privacy-masked in two ways. First, all usernames have been changed to generic names of the form "UserN", where N is a unique integer consistently used for each respective poster across all files. The posts were then read by humans to remove other personally identifiable information. Within each file, usernames are prepended with the date and chat room portions of the filename. So in the above filename example, UserN becomes 10-19-20sUserN.
*Samples* Please examine this sample for an example of the data in this corpus.
*References*
[1] Eric N. Forsyth and Craig H. Martell, "Lexical and Discourse Analysis of Online Chat Dialog," Proceedings of the First IEEE International Conference on Semantic Computing (ICSC 2007), pp. 19-26, September 2007.
[2] T. Wu, F. M. Khan, T. A. Fisher, L. A. Shuler and W. M. Pottenger, "Posting act tagging using transformation-based learning," Proceedings of the Workshop on Foundations of Data Mining and Discovery, IEEE International Conference on Data Mining, December 2002.
[3] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema and M. Meteer, "Dialogue act modeling for automatic tagging and recognition of conversational speech," Computational Linguistics, vol. 26, no. 3, pp. 339-373, 2000.
[4] M. Zitzen and D. Stein, "Chat and conversation: a case of transmedial stability?" Linguistics, vol. 42, no. 5, pp. 983-1021, 2004.
Extent:Corpus size: 6553 KB
Identifier:LDC2010T05
https://catalog.ldc.upenn.edu/LDC2010T05
ISBN: 1-58563-538-3
ISLRN: 675-764-258-846-3
DOI: 10.35111/eqdj-ta72
Language:English
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2010T05
Rights Holder:Portions © 2010 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text
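
The filename and username conventions given in the Description field can be illustrated with a short Python sketch. It is not part of the corpus distribution. The regular expression, the names `ChatFile`, `parse_filename` and `prefixed_username`, and the default year of 2006 (taken from the stated collection period) are assumptions made for illustration; only the example filename and the `10-19-20sUserN` form come from the record itself.

```python
import re
from datetime import date
from typing import NamedTuple


class ChatFile(NamedTuple):
    """Fields recovered from a corpus filename (hypothetical helper type)."""
    collected: date
    room: str
    post_count: int


# Assumed layout: MM-DD-<room>_<N>posts.xml, e.g. 10-19-20s_706posts.xml
FILENAME_RE = re.compile(r"^(\d{2})-(\d{2})-(.+)_(\d+)posts\.xml$")


def parse_filename(name: str, year: int = 2006) -> ChatFile:
    """Split a filename into collection date, chat room, and post count."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized filename: {name!r}")
    month, day, room, count = match.groups()
    return ChatFile(date(year, int(month), int(day)), room, int(count))


def prefixed_username(name: str, user_number: int) -> str:
    """Build the per-file username: date and room portions, then UserN."""
    info = parse_filename(name)
    return f"{info.collected:%m-%d}-{info.room}User{user_number}"


info = parse_filename("10-19-20s_706posts.xml")
print(info.collected.isoformat(), info.room, info.post_count)
print(prefixed_username("10-19-20s_706posts.xml", 7))
```

The year is passed as a parameter because the filename carries only month and day.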

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2010T05
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
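
As a rough sketch of what the GetRecord entries above stand for, an OAI-PMH GetRecord request is an HTTP query carrying three arguments: `verb`, `identifier`, and `metadataPrefix`. The base URL below is a placeholder, not the real endpoint (this record does not state it), and the function name `get_record_url` is hypothetical. The prefix `oai_dc` is the simple Dublin Core format defined by OAI-PMH; `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the actual OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(base_url: str, identifier: str,
                   metadata_prefix: str = "oai_dc") -> str:
    """Compose a GetRecord request URL for one OAI identifier."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


oai_id = "oai:www.ldc.upenn.edu:LDC2010T05"
print(get_record_url(BASE_URL, oai_id))          # simple DC format
print(get_record_url(BASE_URL, oai_id, "olac"))  # OLAC format (assumed prefix)
```

`urlencode` percent-encodes the colons in the identifier, so the identifier can be passed exactly as it appears in the OaiIdentifier field.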

Search Info

Citation: Forsyth, Eric; Lin, Jane; Martell, Craig. 2010. NPS Internet Chatroom Conversations, Release 1.0. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2010T05
Up-to-date as of: Fri Dec 6 7:47:53 EST 2024