OLAC Record
oai:www.clarin.si:11356/1068

Metadata
Title:Dataset of normalised Slovene text KonvNormSl 1.0
Bibliographic Citation:http://hdl.handle.net/11356/1068
Creator:Ljubešić, Nikola
Zupan, Katja
Fišer, Darja
Erjavec, Tomaž
Date (W3CDTF):2016-07-27T12:15:05Z
Date Available:2016-09-01T23:00:07Z
Description:Data used in the experiments described in: Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany. https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf (https://www.linguistics.rub.de/konvens16/) Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (*.orig.txt) and the data with hand-normalised words (*.norm.txt). The files are aligned by lines. There are four datasets: - goo300k-bohoric: historical Slovene, hard case (<1850) - goo300k-gaj: historical Slovene, easy case (1850 - 1900) - tweet-L3: Slovene tweets, hard case (non-standard language) - tweet-L1: Slovene tweets, easy case (mostly standard language) The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (http://nl.ijs.si/janes/english/). The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
Identifier (URI):http://hdl.handle.net/11356/1068
Language:Slovenian
Language (ISO639):slv
Publisher:Jožef Stefan Institute
Rights:Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
Subject:word normalisation
historical language
computer-mediated communication
experimental data
manual annotation
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  Slovenian language resource repository CLARIN.SI
Description:  http://www.language-archives.org/archive/clarin.si
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.clarin.si:11356/1068
DateStamp:  2017-01-20
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Ljubešić, Nikola; Zupan, Katja; Fišer, Darja; Erjavec, Tomaž. 2016. Jožef Stefan Institute.
Terms: area_Europe country_SI dcmi_Text iso639_slv olac_primary_text


http://www.language-archives.org/item.php/oai:www.clarin.si:11356/1068
Up-to-date as of: Thu Sep 26 21:22:24 EDT 2019