OLAC Record: PennBioIE CYP 1.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2008T20

Metadata

Title: PennBioIE CYP 1.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Liberman, Mark, Mark Mandel, and GlaxoSmithKline Pharmaceuticals R&D. PennBioIE CYP 1.0 LDC2008T20. Web Download. Philadelphia: Linguistic Data Consortium, 2008

Contributor: Liberman, Mark

Mandel, Mark

GlaxoSmithKline Pharmaceuticals R&D

Date (W3CDTF): 2008

Date Issued (W3CDTF): 2008-11-18

Description: *Introduction*

The PennBioIE CYP Corpus consists of 1100 PubMed abstracts on the inhibition of cytochrome P450 enzymes, comprising approximately 274,000 words of biomedical text, tokenized and annotated for paragraph, sentence, part of speech, and five types of biomedical named entities in three categories of interest. 324 of the abstracts have also been syntactically annotated. All of the annotation was based on Penn Treebank II standards, with some modifications for special characteristics of the biomedical text. The entity definitions were developed and revised in an extensive process of interaction between domain experts and biomedically trained annotators.

The data was prepared by the Linguistic Data Consortium for the Institute for Research in Cognitive Science, with funding from the National Science Foundation under Grant No. ITR EIA-0205448, Information Technology Research (ITR) program, in collaboration with GlaxoSmithKline Pharmaceuticals R&D.

*Data Description*

The corpus contains 1100 PubMed abstracts comprising approximately 313,000 total words of text. Each file has been tokenized and its biomedical portions (274,000 words) exhaustively annotated for paragraph, sentence, and part of speech, and non-exhaustively annotated for five types of named entity. Each token has a part-of-speech tag.

Tokens and POS tags: Tokens in biomedical and chemical notation and terms, and spelled-out numbers, may contain whitespace and/or punctuation ("beta, 20 diol", "(Na+ + K+)ATPase", "two hundred seven"), and named entity mentions may comprise several tokens ("polychlorinated biphenyl preparations"). Tokens and entities do not span sentence boundaries.

Biomedical and non-biomedical text: The title and body of each abstract are considered to be biomedical text, and the automatic and manual annotations in them have been extensively curated. Everything else, such as citation information and author names, is considered non-biomedical; it has not been entity annotated, and its automated tokenization and part-of-speech tags have not been curated and are known to be unreliable. In non-biomedical text, the tag "section" is used instead of "sentence", allowing users to include or exclude these parts. There are approximately 274,000 words of biomedical text and 39,000 words of non-biomedical text.

*Principles and Methods*

Many annotation projects start with an already annotated corpus, such as the Penn Treebank or the Brown Corpus, which is treated as unchangeable. As a result, annotation practices have sometimes involved compromises which might not have been necessary if the earlier annotation had been able to integrate the requirements of the later work. Such integration is necessary here because of the scope of this project, involving highly technical biomedical texts, entity definitions driven by the needs of biomedical research, and the goal of making the annotation layers work together as much as possible, e.g., using entity information in the treebank annotation of prenominal modifiers. Such integration is also possible given the relatively long term of the grant (five years) and because researchers were starting with fresh text, applying all layers of annotation themselves.

The texts are annotated at the following layers:

* Paragraph
* Sentence
* Biomedical entity
* Token and part of speech
* Syntax (treebanking) (some texts only)
* Semantic relations

Paragraph, sentence, tokenization, POS, and syntactic annotation (treebanking) are applied by automatic taggers and manually corrected; entity annotation is manual. The authors originally used a POS tagger trained on Penn Treebank data, which made many errors on the very different text of these biomedical abstracts. When there was enough manually corrected data to train a tagger, overall accuracy rose from 88.53% to 97.33% (Kulick et al. 2004 (slides)).

Annotation at all layers except entity is based on the Penn Treebank II guidelines, with a number of modifications that have been found necessary, many of which were subsequently adopted by the Penn Treebank. Entity definitions came originally from domain experts and were developed and refined in dialogue with the annotators. All annotation is standoff: the source text is never modified, annotations being made in a separate file.

*Samples*

For an example of the data contained in this corpus, please examine this page containing examples of the source text, the standoff annotations, tokenization, treebank, and interactive HTML view.
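The standoff principle described above can be illustrated with a minimal sketch. It is not the corpus's actual file format, which this record does not specify: the `Span` class, the helper names, the sample sentence, and the entity label `substance` are all assumptions made for illustration. The sketch shows annotations stored apart from the text as character offsets, a single token that contains whitespace, an entity mention spanning several words, and the constraint that tokens and entities stay inside one sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One standoff annotation: a label over source[start:end]."""
    layer: str   # e.g. "sentence", "token", "entity"
    label: str   # e.g. a POS tag or an entity type (hypothetical values)
    start: int   # inclusive character offset into the source text
    end: int     # exclusive character offset into the source text


def make_span(source: str, layer: str, label: str, surface: str) -> Span:
    """Build a span by locating the first occurrence of `surface` in `source`."""
    start = source.index(surface)
    return Span(layer, label, start, start + len(surface))


def text_of(source: str, span: Span) -> str:
    """Recover the annotated text; the source itself is never modified."""
    return source[span.start:span.end]


def contains(outer: Span, inner: Span) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


# Illustrative sentence, not taken from the corpus.
SOURCE = "Two hundred seven samples contained polychlorinated biphenyl preparations."

SENTENCE = make_span(SOURCE, "sentence", "S", SOURCE)
ANNOTATIONS = [
    # A spelled-out number is a single token even though it contains spaces.
    make_span(SOURCE, "token", "CD", "Two hundred seven"),
    # An entity mention may cover several tokens; "substance" is a made-up label.
    make_span(SOURCE, "entity", "substance", "polychlorinated biphenyl preparations"),
]

for ann in ANNOTATIONS:
    # Tokens and entities do not cross sentence boundaries.
    assert contains(SENTENCE, ann)
    print(ann.layer, ann.label, repr(text_of(SOURCE, ann)))
```

Keeping only offsets in the annotation records means several layers can describe the same text without interfering with one another, which is the property the description attributes to the corpus.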

Extent: Corpus size: 167936 KB

Identifier: LDC2008T20

https://catalog.ldc.upenn.edu/LDC2008T20

ISBN: 1-58563-498-0

ISLRN: 379-986-207-358-6

DOI: 10.35111/b8n8-1j96

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2008T20

Rights Holder: Portions © 2002 - 2008 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2008T20

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
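As a rough sketch of what the GetRecord entries refer to, an OAI-PMH GetRecord request is an HTTP query carrying `verb`, `identifier`, and `metadataPrefix` parameters. The base URL below is a placeholder, since this record does not state the repository endpoint, and the function name is hypothetical. `oai_dc` is the prefix OAI-PMH defines for simple Dublin Core; `olac` is assumed here as the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the actual repository base URL is not given in this record.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2008T20"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record and one format."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# One request per metadata format listed in this record.
OLAC_URL = get_record_url(BASE_URL, IDENTIFIER, "olac")      # assumed prefix
DC_URL = get_record_url(BASE_URL, IDENTIFIER, "oai_dc")      # simple DC

print(OLAC_URL)
print(DC_URL)
```

The response to such a request is an XML document; fetching and parsing it is left out here because the endpoint is not known from this record.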

Search Info
Citation: Liberman, Mark; Mandel, Mark; GlaxoSmithKline Pharmaceuticals R&D. 2008. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2008T20
Up-to-date as of: Wed Oct 29 7:01:03 EDT 2025
