OLAC Record: Phrase Detectives Corpus

OLAC Record
oai:www.ldc.upenn.edu:LDC2017T08

Metadata

Title: Phrase Detectives Corpus

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Chamberlain, Jon, Massimo Poesio, and Udo Kruschwitz. Phrase Detectives Corpus LDC2017T08. Web Download. Philadelphia: Linguistic Data Consortium, 2017

Contributor: Chamberlain, Jon

Poesio, Massimo

Kruschwitz, Udo

Date (W3CDTF): 2017

Date Issued (W3CDTF): 2017-05-15

Description: *Introduction* Phrase Detectives Corpus was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 19,012 words across 40 documents anaphorically annotated by the Phrase Detectives game, an online interactive "game-with-a-purpose" (GWAP) designed to collect data about English anaphoric coreference. The use of GWAPs for creating language resources is growing. In general, they employ non-monetary incentives, such as entertainment, to motivate participation and can be successful for large-scale persistent annotation efforts. *Data* The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. Wikipedia articles and annotation files are presented as XML, and Project Gutenberg source files are presented as plain text. All text is encoded as UTF-8. The annotations comprise a gold standard version created by multiple experts and a set created by a large non-expert crowd (via the Phrase Detectives game). The data was annotated according to a prevalent linguistically oriented approach for anaphora used in several tasks, including OntoNotes Release 5.0 (LDC2013T19), SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages (LDC2011T01) and The ARRAU Corpus of Anaphoric Information (LDC2013T22). *Samples* Please view the following source sample and annotation sample. *Updates* None at this time.

Extent: Corpus size: 28024 KB

Identifier: LDC2017T08

https://catalog.ldc.upenn.edu/LDC2017T08

ISBN: 1-58563-798-X

ISLRN: 052-688-100-874-5

DOI: 10.35111/9890-p128

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2017T08

Rights Holder: Portions © 2017 University of Essex, © 2017 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2017T08

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
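
The two GetRecord requests listed for this record follow the general OAI-PMH request pattern: a base URL plus the `verb`, `identifier`, and `metadataPrefix` query parameters. The sketch below shows how such request URLs could be assembled for the identifier above. The base URL is a placeholder and not the archive's actual endpoint, the helper name `get_record_url` is hypothetical, and the `olac` prefix is assumed from common OLAC repository practice (`oai_dc` is the prefix OAI-PMH defines for simple Dublin Core).

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): replace with the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2017T08"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "olac" is assumed to be the prefix for the OLAC format;
# "oai_dc" is the standard prefix for simple Dublin Core.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

The colons in the identifier are percent-encoded by `urlencode`, which OAI-PMH repositories are expected to decode on receipt.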

Search Info
Citation: Chamberlain, Jon; Poesio, Massimo; Kruschwitz, Udo. 2017. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2017T08
Up-to-date as of: Wed Oct 29 7:01:42 EDT 2025
