Title:CMC training corpus Janes-Syn 1.0
Bibliographic Citation:http://hdl.handle.net/11356/1086
Creator:Arhar Holdt, Špela
Erjavec, Tomaž
Fišer, Darja
Date (W3CDTF):2017-01-03T11:38:46Z
Date Available:2017-01-03T11:38:46Z
Description:Janes-Syn is a syntactically annotated corpus of Slovene tweets, intended as a gold-standard training and testing dataset for syntactic annotation of Slovene computer-mediated communication (CMC) and for detailed linguistic research that requires highly accurate and reliable annotations. Words in the dataset are normalised, lemmatised, PoS-tagged and syntactically annotated with the JOS dependency model (http://eng.slovenscina.eu/tehnologije/razclenjevalnik). The annotations on all levels were manually corrected. The corpus creation and structure are described in: ARHAR HOLDT, Špela, FIŠER, Darja, ERJAVEC, Tomaž, KREK, Simon. Syntactic annotation of Slovene CMC: first steps. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 27-28 September 2016, Ljubljana, Slovenia, 2016, pp. 3-6 (http://nl.ijs.si/janes/cmc-corpora2016/proceedings/). Janes-Syn was created from two larger corpora that are also available in the repository: Janes-Norm (http://hdl.handle.net/11356/1084) and Janes-Tag (http://hdl.handle.net/11356/1123).
Identifier (URI):http://hdl.handle.net/11356/1086
Language (ISO639):slv
Publisher:Jožef Stefan Institute
Rights:Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Subject:computer-mediated communication
dependency treebank
syntactic annotation
manual annotation
Type (DCMI):Text
Type (OLAC):primary_text


Citation: Arhar Holdt, Špela; Erjavec, Tomaž; Fišer, Darja. 2017. CMC training corpus Janes-Syn 1.0. Jožef Stefan Institute. http://hdl.handle.net/11356/1086
