OLAC Record: BioProp Version 1.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2009T04

Metadata

Title: BioProp Version 1.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Hsu, Wen-Lian. BioProp Version 1.0 LDC2009T04. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Hsu, Wen-Lian

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-08-18

Description: *Introduction* BioProp Version 1.0 was developed by researchers at Academia Sinica, Taipei, Taiwan. It consists of proposition bank-style annotations for approximately 500 English biomedical journal abstracts. The source abstracts, annotated in accordance with Penn Treebank II guidelines, are contained in the GENIA Treebank (GTB).
The GTB was developed at the Tsujii Laboratory at the University of Tokyo. The purpose of the GENIA Project was to develop tools and resources for automatic extraction of biomedical information. One result of that work is the GENIA corpus, a collection of 2000 biomedical journal abstracts containing semantic class annotation for biomedical terms, part-of-speech (POS) tags and coreferences. The GTB is a subset of that corpus.
BioProp Version 1.0 adds a proposition bank to the GTB. Proposition Bank (PropBank) contains annotations of predicate-argument structures and semantic roles in a treebank schema in the newswire domain. To construct BioProp Version 1.0, a semantic role labeling (SRL) system trained on PropBank was used to annotate the GTB.
SRL, also called shallow semantic parsing, is a popular semantic analysis technique. In SRL, sentences are represented by one or more predicate-argument structures (PAS), also known as propositions. Each PAS is composed of a predicate (e.g., a verb) and several arguments (e.g., noun phrases) that have different semantic roles, including main arguments, such as agent and patient, and adjunct arguments, such as time, manner and location. The term "argument" refers to a syntactic constituent of the sentence related to the predicate, and the term "semantic role" refers to the semantic relationship between a sentence's predicate and argument.
To suit the needs of the biomedical domain, the PropBank annotation guidelines were modified to characterize semantic roles as components of biological events.
Specifically, thirty verbs were selected according to their frequency of use or importance in biomedical texts. Since targets in information extraction are relations of named entities, only sentences containing protein or gene names were used to count each verb's frequency. Verbs of general usage were filtered out in order to keep the focus on biomedical verbs. Some verbs that do not have a high frequency but play important roles in describing biomedical relations, such as "phosphorylate" and "transactivate," were also selected.
The BioProp annotation was based on Levin's verb classes as defined in the VerbNet lexicon. In VerbNet, the arguments of each verb are represented at the semantic level, and thus have associated semantic roles. However, since some verbs may have different usages in biomedical and newswire texts, it is necessary to customize the framesets of biomedical verbs.
After the predicate verbs were selected, a semi-automatic method was used to annotate BioProp. The annotation process consisted of the following steps:
* Identification of predicate candidates
* Automatic annotation of the biomedical semantic roles using a newswire SRL system
* Transformation of automatic tagging results into WordFreak format
* Review by human annotators
*Data* BioProp Version 1.0 consists of approximately 150,000 words. Each line in the corpus provides a PAS annotation that can be mapped to a sentence in the GTB.
*Samples*
91079577 4 74:82 induce 0:65-ARG0 74:82-rel 83:99-ARG1 100:113-ARGM-LOC
91094881 3 142:152 stimulate 0:46-ARG0 49:139-ARGM-TMP 142:152-rel 153:166-ARG1 167:217-ARGM-LOC
91094881 6 88:98 stimulate 0:55-ARGM-ADV 58:87-ARG0 88:98-rel 99:112-ARG1 113:168-ARGM-LOC
91094881 8 217:222 bind 160:183-ARG1 184:210-C-ARG1 211:216-R-ARG1 223:247-ARG2 217:222-rel 248:275-ARGM-ADV
91094881 9 45:53 suppress 0:13-ARGM-ADV 16:38-ARG0 54:78-ARG1 39:44-ARGM-MOD 45:53-rel 79:105-C-ARG1 106:135-ARGM-LOC
91094881 10 49:56 block 0:8-ARGM-DIS 11:44-ARG1 49:56-rel 57:82-ARG0 83:115-ARGM-LOC
91101115 2 99:108 increase 0:98-ARG1 99:108-rel 109:152-ARGM-CAU
91101115 3 159:163 bind 119:153-ARG1 164:191-ARG2 154:158-R-ARG1 159:163-rel
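The sample records above share a regular, whitespace-separated layout, so a small parser can help when inspecting them. The following Python sketch is illustrative only and is not part of the corpus distribution. This record does not define the columns, so the field names used here (`abstract_id`, `sentence_index`, `predicate_span`) are assumptions inferred from the samples, as is the reading of each `start:end` pair as an offset span within the sentence.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Argument:
    start: int   # span start (assumed to be an offset into the sentence)
    end: int     # span end (assumed to be an offset into the sentence)
    label: str   # role label as written in the record, e.g. "ARG0", "ARGM-LOC", "rel"


@dataclass
class Proposition:
    abstract_id: str                 # first column (assumed: source abstract identifier)
    sentence_index: int              # second column (assumed: sentence number in the abstract)
    predicate_span: Tuple[int, int]  # third column, e.g. (74, 82)
    verb: str                        # fourth column, the predicate verb
    arguments: List[Argument]        # remaining columns, one per labeled span


def parse_span(text: str) -> Tuple[int, int]:
    """Turn a 'start:end' string into a pair of integers."""
    start, end = text.split(":")
    return int(start), int(end)


def parse_line(line: str) -> Proposition:
    """Parse one whitespace-separated sample record into a Proposition."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    abstract_id, sentence_index, predicate_span, verb = fields[:4]
    arguments = []
    for item in fields[4:]:
        # Split on the first hyphen only, so labels such as
        # "ARGM-LOC" or "C-ARG1" are kept intact.
        span, label = item.split("-", 1)
        start, end = parse_span(span)
        arguments.append(Argument(start, end, label))
    return Proposition(
        abstract_id=abstract_id,
        sentence_index=int(sentence_index),
        predicate_span=parse_span(predicate_span),
        verb=verb,
        arguments=arguments,
    )


sample = "91079577 4 74:82 induce 0:65-ARG0 74:82-rel 83:99-ARG1 100:113-ARGM-LOC"
prop = parse_line(sample)
print(prop.verb, [a.label for a in prop.arguments])
# prints: induce ['ARG0', 'rel', 'ARG1', 'ARGM-LOC']
```

In every sample shown, the span labeled `rel` matches the predicate span in the third column, which gives a simple consistency check when reading records this way.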

Extent: Corpus size: 269 KB

Identifier: LDC2009T04

https://catalog.ldc.upenn.edu/LDC2009T04

ISBN: 1-58563-504-9

ISLRN: 969-572-383-651-0

DOI: 10.35111/y10h-d817

Language: English

Language (ISO639): eng

License: BioProp Version 1.0 Agreement: https://catalog.ldc.upenn.edu/license/bioprop-version-1-dot-0.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2009T04

Rights Holder: Portions © 2006-2008 Academia Sinica, © 2009 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009T04

DateStamp: 2025-04-17

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Hsu, Wen-Lian. 2009. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T04
Up-to-date as of: Wed Oct 29 7:01:06 EDT 2025
