OLAC Record
oai:www.clarin.si:11356/1213

Metadata
Title:Training corpus jos1M 1.2
Bibliographic Citation:http://hdl.handle.net/11356/1213
Creator:Erjavec, Tomaž
Krek, Simon
Dobrovoljc, Kaja
Date (W3CDTF):2019-02-13T17:13:39Z
Date Available:2019-02-13T17:13:39Z
Description:The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (MSDs) and lemmas, with about one fourth of the more problematic annotations hand-validated. The morphosyntactic descriptions are given both in the JOS/MULTEXT-East framework (http://nl.ijs.si/ME/V6/msd/) and in the framework of Universal Dependencies for Slovene (https://universaldependencies.org/treebanks/sl_ssj/index.html). The corpus is available in source TEI XML, with the MSDs in English or Slovene; in the derived vertical format, used by the CQP and (no)Sketch Engine concordancers; and in CONLL-U, used by Universal Dependencies. Note that the corpus does not contain syntactic dependencies. The texts or paragraphs of the jos1M corpus overlap with those of the ssj500k annotated corpus (http://hdl.handle.net/11356/1210), but the latter has been fully manually annotated and has had its tokenisation and sentence segmentation corrected. The texts and paragraphs in the jos1M corpus are marked if they are also included in ssj500k, and the CONLL-U is split into the part that is included in ssj500k and the part that is not. The latter can serve as a training set for morphosyntactic tagging and lemmatisation in addition to ssj500k.
Identifier (URI):http://hdl.handle.net/11356/1213
Language:Slovenian
Language (ISO639):slv
Publisher:Jožef Stefan Institute
Replaces (URI):http://hdl.handle.net/11356/1037
Rights:Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
Subject:tagging
lemmatisation
manual annotation
TEI
CONLL-U
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text
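
The Description above says the CONLL-U files carry lemmas and morphosyntactic tags but no syntactic dependencies. The following is a minimal sketch of reading such a file, assuming the standard ten tab-separated CoNLL-U columns with HEAD and DEPREL left as "_". The names Token and read_conllu are hypothetical helpers, and the two-word sample is an invented illustration, not text taken from the corpus; its XPOS and FEATS columns are left as "_" rather than guessing at actual tags.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """Word-level annotation kept from one CoNLL-U token line."""
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence, ignoring dependency columns."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines hold sentence-level metadata.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        sentence.append(Token(cols[1], cols[2], cols[3], cols[4], cols[5]))
    if sentence:
        yield sentence


# Invented sample for illustration only (not from jos1M).
SAMPLE = (
    "# text = Pes spi.\n"
    "1\tPes\tpes\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\tspi\tspati\tVERB\t_\t_\t_\t_\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for tok in sentences[0]:
    print(tok.form, tok.lemma, tok.upos)
```

Pairs of (form, lemma) and (form, tag) collected this way are the kind of input a lemmatiser or tagger would be trained on; the HEAD and DEPREL columns are deliberately never read.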

OLAC Info

Archive:  Slovenian language resource repository CLARIN.SI
Description:  http://www.language-archives.org/archive/clarin.si
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.clarin.si:11356/1213
DateStamp:  2019-02-13
GetRecord:  OAI-PMH request for simple DC format
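
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is formed under the OAI-PMH protocol (verb, identifier and metadataPrefix as query arguments), the snippet below builds one for the OaiIdentifier listed here. The function name get_record_url is hypothetical, the base URL is a placeholder rather than the repository's real endpoint, and the prefix "olac" for the OLAC format is an assumption; "oai_dc" is the protocol's standard prefix for simple Dublin Core.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder endpoint: substitute the archive's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

url = get_record_url(BASE_URL, "oai:www.clarin.si:11356/1213", "oai_dc")
print(url)
```

The identifier is percent-encoded because it contains ":" and "/", which are reserved characters in a query string.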

Search Info

Citation: Erjavec, Tomaž; Krek, Simon; Dobrovoljc, Kaja. 2019. Jožef Stefan Institute.
Terms: area_Europe country_SI dcmi_Text iso639_slv olac_primary_text


http://www.language-archives.org/item.php/oai:www.clarin.si:11356/1213
Up-to-date as of: Tue Aug 20 10:27:23 EDT 2019