Title:CMC training corpus Janes-Norm 1.1
Bibliographic Citation:http://hdl.handle.net/11356/1083
Creator:Erjavec, Tomaž
Fišer, Darja
Čibej, Jaka
Arhar Holdt, Špela
Date (W3CDTF):2016-12-28T11:41:07Z
Date Available:2016-12-28T11:41:07Z
Description:Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is intended as a gold-standard training and testing dataset for tokenisation, sentence segmentation and word normalisation of non-standard Slovene. The corpus is also automatically annotated with morphosyntactic descriptions and lemmas. Because the tokenisation, segmentation and normalisation layers were carefully annotated by hand, the corpus is also suitable for detailed linguistic studies that require highly accurate and reliable annotations. The corpus is further described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40. Available at https://nlp.fi.muni.cz/raslan/raslan16.pdf. Note that a related corpus, Janes-Tag, is also available; see http://hdl.handle.net/11356/1081.
Identifier (URI):http://hdl.handle.net/11356/1083
Is Replaced By (URI):http://hdl.handle.net/11356/1084
Language (ISO639):slv
Publisher:Jožef Stefan Institute
Rights:Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Subject:computer-mediated communication
word normalisation
manual annotation
Type (DCMI):Text
Type (OLAC):primary_text


Citation: Erjavec, Tomaž; Fišer, Darja; Čibej, Jaka; Arhar Holdt, Špela. 2016. CMC training corpus Janes-Norm 1.1. Jožef Stefan Institute. http://hdl.handle.net/11356/1083
