OLAC Record
oai:www.clarin.si:11356/1123

Metadata
Title:CMC training corpus Janes-Tag 2.0
Bibliographic Citation:http://hdl.handle.net/11356/1123
Creator:Erjavec, Tomaž
Fišer, Darja
Čibej, Jaka
Arhar Holdt, Špela
Ljubešić, Nikola
Zupan, Katja
Date (W3CDTF):2017-05-15T15:30:07Z
Date Available:2017-05-15T15:30:07Z
Description:Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations. As an update to version 1.2, version 2.0 corrects some minor errors and includes named entity annotation. A slightly older version of this corpus is described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf. Note that a related corpus, Janes-Norm, is also available, cf. http://hdl.handle.net/11356/1084.
Identifier (URI):http://hdl.handle.net/11356/1123
Is Replaced By (URI):http://hdl.handle.net/11356/1238
Language:Slovenian
Language (ISO639):slv
Publisher:Jožef Stefan Institute
Replaces (URI):http://hdl.handle.net/11356/1085
Rights:Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
Subject:computer-mediated communication
tokenisation
word normalisation
tagging
lemmatisation
manual annotation
TEI
named entities
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  Slovenian language resource repository CLARIN.SI
Description:  http://www.language-archives.org/archive/clarin.si
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.clarin.si:11356/1123
DateStamp:  2019-09-12
GetRecord:  OAI-PMH request for simple DC format
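
The GetRecord entries above refer to OAI-PMH requests for this record. As a minimal sketch, such a request URL could be assembled as follows. The base URL is a hypothetical placeholder, since the record does not state the repository's OAI-PMH endpoint, and the `olac` metadata prefix is an assumption; `oai_dc` is the simple Dublin Core prefix defined by the OAI-PMH specification.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the record does not give the real OAI-PMH endpoint.
BASE_URL = "https://example.org/oai/request"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.clarin.si:11356/1123"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record and format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "olac" is an assumed prefix for the OLAC format; "oai_dc" is simple DC.
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
```

Only the URL is constructed here; fetching and parsing the XML response is left out because it depends on the actual endpoint.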

Search Info

Citation: Erjavec, Tomaž; Fišer, Darja; Čibej, Jaka; Arhar Holdt, Špela; Ljubešić, Nikola; Zupan, Katja. 2017. Jožef Stefan Institute.
Terms: area_Europe country_SI dcmi_Text iso639_slv olac_primary_text


http://www.language-archives.org/item.php/oai:www.clarin.si:11356/1123
Up-to-date as of: Thu Dec 5 9:50:15 EST 2019