Title:CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings
Bibliographic Citation:http://hdl.handle.net/11234/1-1989
Creator:Ginter, Filip
Hajič, Jan
Luotolahti, Juhani
Straka, Milan
Zeman, Daniel
Date (W3CDTF):2017-03-16T11:57:32Z
Date Available:2017-03-16T11:57:32Z
Description:Automatic segmentation, tokenization, and morphological and syntactic annotation of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with 100-dimensional word embeddings computed from the lowercased texts by word2vec (https://code.google.com/archive/p/word2vec/). The automatic annotations are provided in CoNLL-U format, in a separate archive for each language. The word embeddings for all languages are distributed in a single archive. Note that the CC BY-NC-SA 4.0 license applies to the automatically generated annotations and word embeddings, not to the underlying data, which may be under a different license and impose additional restrictions. Update 2018-09-03: Added data for the four “surprise languages” of the 2017 shared task: Buryat, Kurmanji, North Sami and Upper Sorbian. This data had been promised earlier, and during the CoNLL 2018 shared task participants were given a link to this record as the place to find it. It was not here at the time, sorry; it is now.
Identifier (URI):http://hdl.handle.net/11234/1-1989
Language:Multiple languages
Language (ISO639):mul
Publisher:Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Rights:Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Subject:CoNLL 2017
word embeddings
automatic annotation
Multiple languages
Subject (ISO639):mul
Type (DCMI):Text
Type (OLAC):language_description
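
The description states that the annotations are distributed in CoNLL-U format. As a rough illustration of how such a file might be read, here is a minimal Python sketch. It assumes only the standard CoNLL-U layout (ten tab-separated columns per token, comment lines starting with `#`, a blank line after each sentence). The function name `read_conllu` and the sample sentence are hypothetical and hand-written for this example, not taken from the archives.

```python
from typing import Dict, Iterable, Iterator, List

# The ten CoNLL-U columns, in the order defined by the format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries.

    Hypothetical helper: comment lines are skipped and every token line
    must have exactly ten tab-separated columns.
    """
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment or metadata; ignored in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # Input ended without a trailing blank line.
        yield sentence


# Hand-written sample for illustration only (not from the dataset).
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
forms = [token["form"] for token in sentences[0]]
```

For a real file, an open text handle can be passed in place of `SAMPLE.splitlines()`; the per-language archives would first need to be decompressed, and their internal file layout is not described in this record.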


