OLAC Record
oai:catalogue.elra.info:ELRA-W0084

Metadata
Title:Arboretum treebank
Abstract:The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of about 425,000 tokens, with ca. 22,260 sentences/utterances containing 3 or more tokens. Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes. The final version of the treebank consists of two independent versions, constituent trees and dependency trees, and is distributed in the following formats: 1. Native dependency format (Constraint Grammar format); 2. Dependency annotation converted to MALT XML format; 3. Native constituent tree format (cross-language VISL standard); 4. Constituent format converted to TIGER XML.
Access Rights:Rights available for: Research Use, Commercial Use
Date Available (W3CDTF):2015-11-30
Date Issued (W3CDTF):2015-11-30
Date Modified (W3CDTF):2015-11-30
Description:Written Corpora
The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences, taken from Korpus 90 and Korpus 2000, both compiled by the Society for Danish Language and Literature (http://ordnet.dk/korpusdk/fakta), and containing samples of written Danish from the 1990s and from around the year 2000, respectively. The treebank consists of about 425,000 tokens, with ca. 22,260 sentences/utterances containing 3 or more tokens. In a first pass, all material was tokenized and tagged with the DanGram parser, using hand-written Constraint Grammar rules. In the next stage, the parser's dependency grammar and constituent conversion were applied to produce full syntactic tree structures. The automatic annotation was then revised at both the morphosyntactic and the structural levels, with iterative improvements made to the parser at the same time. Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes, facilitating conversion to different descriptive traditions. In addition, the dependency version contains structural markers concerning coordination and clause boundaries, as well as some morphological information concerning compounding. The final version of the treebank consists of two independent versions, constituent trees and dependency trees, and is distributed in the following formats: 1. Native dependency format (Constraint Grammar format); 2. Dependency annotation converted to MALT XML format; 3. Native constituent tree format (cross-language VISL standard); 4. Constituent format converted to TIGER XML.
Identifier:ELRA-W0084
http://catalog.elra.info/product_info.php?products_id=1248
Language:Danish
Language (ISO639):dan
Publisher:ELRA (European Language Resources Association)
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-W0084
DateStamp:  2015-11-30
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: n.a. 2015. ELRA (European Language Resources Association).
Terms: area_Europe country_DK dcmi_Text iso639_dan olac_primary_text


http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0084
Up-to-date as of: Wed Nov 6 9:17:43 EST 2019