OLAC Record: PennBioIE Oncology 1.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2008T21

Metadata

Title: PennBioIE Oncology 1.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Liberman, Mark, Mark Mandel, and Peter White. PennBioIE Oncology 1.0 LDC2008T21. Web Download. Philadelphia: Linguistic Data Consortium, 2008

Contributor: Liberman, Mark
Contributor: Mandel, Mark
Contributor: White, Peter

Date (W3CDTF): 2008

Date Issued (W3CDTF): 2008-11-18

Description: *Introduction*
The PennBioIE Oncology Corpus consists of 1414 PubMed abstracts on cancer, concentrating on molecular genetics, and comprising approximately 327,000 words of biomedical text, tokenized and annotated for paragraph, sentence, part of speech, and 24 types of biomedical named entities in five categories of interest. 318 of the abstracts have also been syntactically annotated. All of the annotation was based on Penn Treebank II standards, with some modifications for special characteristics of the biomedical text. The entity definitions were developed and revised in an extensive process of interaction between domain experts and biomedically trained annotators.
The oncology data comprises two subcorpora:
* The Sanger subcorpus (san) consists of abstracts of 577 articles previously annotated by the Sanger Institute for global mention of oncological named entities. These annotations were metadata reflecting the presence or absence of such mentions anywhere in the text, without reference to specific strings. The articles concentrate on variations in a small set of human genes associated with many different types of cancer; they were not part of ongoing work at Sanger, and the annotations were never published. We did not refer to the Sanger annotations after selection of the abstracts.
* The neuroblastoma subcorpus (nb) consists of 837 abstracts of articles dealing with this particular type of cancer, selected by colleagues at Children's Hospital of Philadelphia. They do not all concentrate on genetics, but they mention a much larger number of genes than the Sanger files do.
The data was prepared by the Linguistic Data Consortium for the Institute for Research in Cognitive Science, with funding from the National Science Foundation under Grant No. ITR EIA-0205448, Information Technology Research (ITR) program, in collaboration with Dr. Peter White's group in Pediatric Oncology at the Children's Hospital of Philadelphia.
*Data Description*
The corpus contains 1414 PubMed abstracts comprising approximately 381,000 total words of text. Each file has been tokenized and its biomedical portions (327,000 words) exhaustively annotated for paragraph, sentence, and part of speech, and non-exhaustively annotated for 16 ("Level 1") or 23 ("Level 2") types of named entity. Each token has a part-of-speech tag.
Tokens and POS tags: Tokens in biomedical and chemical notation and terms, and spelled-out numbers, may contain whitespace and/or punctuation ("beta, 20 diol", "(Na+ + K+)ATPase", "two hundred seven"); and named entity mentions may comprise several tokens ("polychlorinated biphenyl preparations"). Tokens and entities do not span sentence boundaries.
Biomedical and non-biomedical text: The title and body of each abstract are considered to be biomedical text, and the automatic and manual annotations in them have been extensively curated. Everything else, such as citation information and author names, is considered non-biomedical; this has not been entity annotated, and its automated tokenization and part-of-speech tags have not been curated and are known to be unreliable. In non-biomedical text, the tag "section" is used instead of "sentence", allowing users to include or exclude these parts. There are approximately 327,000 words of biomedical text and 54,000 words of non-biomedical text. (Because of a problem with software maintenance, about 24,000 tokens in biomedical text, mostly in the nb2 subcorpus, are missing POS tags.)
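The tokenization rules above (a token may contain whitespace, an entity mention may cover several tokens, and neither crosses a sentence boundary) can be illustrated with a small sketch. It assumes annotations are held as character-offset spans over the raw text. The `Span` class, the `within` and `locate` helpers, the labels, and the sample sentence are all hypothetical and do not reflect the corpus's actual file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical standoff annotation: covers text[start:end]."""
    start: int
    end: int
    label: str


def within(inner: Span, outer: Span) -> bool:
    """Return True if `inner` lies entirely inside `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def locate(text: str, surfaces: list[str], label: str) -> list[Span]:
    """Find each surface string in order, left to right, and return its span."""
    spans, cursor = [], 0
    for surface in surfaces:
        start = text.index(surface, cursor)
        cursor = start + len(surface)
        spans.append(Span(start, cursor, label))
    return spans


# Made-up sentence that reuses two strings quoted in the description.
text = "We measured (Na+ + K+)ATPase in polychlorinated biphenyl preparations."
sentence = Span(0, len(text), "sentence")
tokens = locate(
    text,
    ["We", "measured", "(Na+ + K+)ATPase", "in",
     "polychlorinated", "biphenyl", "preparations", "."],
    "token",
)
entity = locate(text, ["polychlorinated biphenyl preparations"], "entity")[0]

# A single token may contain whitespace and punctuation.
atpase = tokens[2]
print(repr(text[atpase.start:atpase.end]))

# An entity mention may comprise several tokens.
covered = [text[t.start:t.end] for t in tokens if within(t, entity)]
print(covered)

# Tokens and entities do not span sentence boundaries.
print(all(within(s, sentence) for s in tokens + [entity]))
```

Keeping offsets instead of splitting on whitespace is what lets a string such as "(Na+ + K+)ATPase" remain one token in this sketch.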
Domains: The abstracts are divided across two domains:
* the molecular genetics of cancer, from a list selected by the Cancer Genome Project of the Sanger Institute (v0.9: 588 files; v1.0: 577 files)
* neuroblastoma, a type of cancer that develops from nerve tissue in infants and children (v0.9: 569 files; v1.0: 837 files = 392 from v0.9 + 445 new)
The difference between the domains is apparent in the ratio of distinct mentions (types) of tumor types and of genes, after normalization: 3.5 times as many tumor types in the Sanger files, but 5.8 times as many genes in the neuroblastoma files.
Other divisions of the corpus: The files are further subdivided by annotation level into three subcorpora, each with its own subdirectory on this CD and its own set of metadata files.
* nb1: neuroblastoma annotated to level 1 (407 files)
* nb2: neuroblastoma annotated to level 2 (430 files)
* san: Sanger annotated to level 2 (all 577 files)
Metadata is also provided for:
* onco: the entire v1.0 oncology corpus (1414 files)
* nb: nb1 + nb2, all the neuroblastoma data regardless of annotation level (837 files)
* o2: nb2 + san, all the level 2 data regardless of subcorpus (1007 files)
Version 0.9 is included in this release in a separate directory. It is similarly organized, though with only one level of annotation, less detailed than v1.0's level 1:
* onco09: all the v0.9 oncology corpus (1157 files)
* nb09: neuroblastoma (569 files)
* san09: Sanger (588 files)
A subset of the v0.9 data was also syntactically annotated (treebanked):
* onco09t: (318 files)
* nb09t: (115 files)
* san09t: (203 files)
*Principles and Methods*
Many annotation projects start with an already annotated corpus, such as the Penn Treebank or the Brown Corpus, which is treated as unchangeable. As a result, annotation practices have sometimes involved compromises which might not have been necessary if the earlier annotation had been able to integrate the requirements of the later work.
Such integration is necessary here because of the scope of this project, involving highly technical biomedical texts, entity definitions driven by the needs of biomedical research, and the goal of making the annotation layers work together as much as possible, e.g., using entity information in the treebank annotation of prenominal modifiers. Such integration is also possible given the relatively long term of the grant (five years) and because researchers were starting with fresh text, applying all layers of annotation themselves.
The texts are annotated at the following layers:
* Paragraph
* Sentence
* Biomedical entity
* Token and part of speech
* Syntax (treebanking) (some texts only)
* Semantic relations (some oncology texts only)
Paragraph, sentence, tokenization, POS, and syntactic annotation (treebanking) are applied by automatic taggers and manually corrected; entity annotation is manual. The authors originally used a POS tagger trained on Penn Treebank data, which made many errors on the very different text of these biomedical abstracts. When there was enough manually-corrected data to train a tagger, overall accuracy rose from 88.53% to 97.33% (Kulick et al. 2004 (slides)). Annotation at all layers except entity is based on the Penn Treebank II guidelines, with a number of modifications that have been found necessary, many of which were subsequently adopted by the Penn Treebank. Entity definitions came originally from domain experts and were developed and refined in dialogue with the annotators.
*Samples*
For an example of the annotations in this corpus, please consult this page containing examples of the source text, the standoff annotations, tokenization, treebank*, and interactive HTML view*.
* v0.9 annotation only
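Several figures quoted in this description can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the description; the variable names are illustrative, and the relative error reduction is a figure derived here from the two reported accuracies, not one reported by the source.

```python
# File counts per subcorpus in v1.0, as stated in the description.
nb1, nb2, san = 407, 430, 577

nb = nb1 + nb2      # all neuroblastoma files
o2 = nb2 + san      # all level 2 files
onco = nb + san     # entire v1.0 oncology corpus
print(nb, o2, onco)

# The v1.0 neuroblastoma files are 392 carried over from v0.9 plus 445 new.
print(392 + 445 == nb)

# v0.9 counts and the treebanked subset of v0.9.
nb09, san09 = 569, 588
nb09t, san09t = 115, 203
print(nb09 + san09, nb09t + san09t)

# POS tagger accuracy (percent) before and after retraining on corrected data.
before, after = 88.53, 97.33
error_before = 100 - before
error_after = 100 - after
reduction = (error_before - error_after) / error_before
print(f"relative error reduction: {reduction:.1%}")  # about 76.7%
```

The sums reproduce the stated totals of 837, 1007, 1414, 1157, and 318 files, and the accuracy gain corresponds to removing roughly three quarters of the original tagging errors.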

Extent: Corpus size: 391168 KB

Identifier: LDC2008T21

https://catalog.ldc.upenn.edu/LDC2008T21

ISBN: 1-58563-490-5

ISLRN: 206-787-441-605-3

DOI: 10.35111/bv8s-f634

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2008T21

Rights Holder: Portions © 2002-2008 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2008T21

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
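
The GetRecord entries under OLAC Info and OAI Info correspond to standard OAI-PMH requests, which take `verb`, `identifier`, and `metadataPrefix` query parameters. The following sketch builds such request URLs. The base URL is a placeholder, since this record does not state the archive's actual OAI-PMH endpoint, and `olac` is assumed to be the metadata prefix for the OLAC format; `oai_dc` is the prefix the OAI-PMH specification defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real base URL is not given in this record.
BASE_URL = "https://example.org/oai"
IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2008T21"


def get_record_url(metadata_prefix: str, base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for this record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": IDENTIFIER,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


print(get_record_url("olac"))    # OLAC format (prefix name assumed)
print(get_record_url("oai_dc"))  # simple Dublin Core
```

`urlencode` percent-encodes the colons in the identifier, so the URL is safe to pass to any HTTP client once the placeholder is replaced with the archive's real endpoint.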

Search Info
Citation: Liberman, Mark; Mandel, Mark; White, Peter. 2008. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2008T21
Up-to-date as of: Wed Oct 29 7:01:04 EDT 2025
