OLAC Record: Slavic Forest, Norwegian Wood (scripts)

OLAC Record
oai:lindat.mff.cuni.cz:11234/1-1970

Metadata

Title: Slavic Forest, Norwegian Wood (scripts)

Bibliographic Citation: http://hdl.handle.net/11234/1-1970

Creator: Rosa, Rudolf

Zeman, Daniel

Mareček, David

Žabokrtský, Zdeněk

Date (W3CDTF): 2017-04-06T14:33:14Z

Date Available: 2017-04-06T14:33:14Z

Description: Tools and scripts used to create the cross-lingual parsing models submitted to the VarDial 2017 shared task (https://bitbucket.org/hy-crossNLP/vardial2017), as described in the linked paper. The trained UDPipe models themselves are published in a separate submission (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1971).

For each source (SS, e.g. sl) and target (TT, e.g. hr) language, you need to add the following into this directory:
- treebanks (Universal Dependencies v1.4):
    SS-ud-train.conllu
    TT-ud-predPoS-dev.conllu
- parallel data (OpenSubtitles from Opus):
    OpenSubtitles2016.SS-TT.SS
    OpenSubtitles2016.SS-TT.TT
  !!! If they are originally called ...TT-SS... instead of ...SS-TT..., you need to symlink them (or move, or copy) !!!
- target tagging model:
    TT.tagger.udpipe

All of these can be obtained from https://bitbucket.org/hy-crossNLP/vardial2017

You also need to have:
- Bash
- Perl 5
- Python 3
- word2vec (https://code.google.com/archive/p/word2vec/); we used rev 41 from 15th Sep 2014
- udpipe (https://github.com/ufal/udpipe); we used commit 3e65d69 from 3rd Jan 2017
- Treex (https://github.com/ufal/treex); we used commit d27ee8a from 21st Dec 2016

The most basic setup is the sl-hr one (train_sl-hr.sh):
- normalization of deprels
- 1:1 word-alignment of parallel data with Monolingual Greedy Aligner
- simple word-by-word translation of the source treebank
- pre-training of target word embeddings
- simplification of morphological features (only Case is kept)
- and finally, training and evaluating the parser

Both da+sv-no (train_ds-no.sh) and cs-sk (train_cs-sk.sh) add some cross-tagging, which seems to be useful only in specific cases (see the paper for details). Moreover, cs-sk also adds more morphological features, selecting those that seem to be shared very often in the parallel data.

The whole pipeline takes tens of hours to run and uses several GB of RAM, so make sure to use a powerful computer.
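Because the pipeline runs for tens of hours, it may be worth checking the input files listed in the description before starting. The following is a minimal sketch of such a check, not part of the published scripts: the function names `required_inputs` and `prepare_inputs` are hypothetical, and the sketch assumes a single source language per run (the da+sv-no setup would need one call per source) and Python 3.9 or later. It symlinks parallel files that arrive under the reversed TT-SS name and reports whatever is still missing.

```python
from pathlib import Path


def required_inputs(src: str, tgt: str) -> list[str]:
    """File names the description lists for one source/target pair."""
    return [
        f"{src}-ud-train.conllu",
        f"{tgt}-ud-predPoS-dev.conllu",
        f"OpenSubtitles2016.{src}-{tgt}.{src}",
        f"OpenSubtitles2016.{src}-{tgt}.{tgt}",
        f"{tgt}.tagger.udpipe",
    ]


def prepare_inputs(workdir: Path, src: str, tgt: str) -> list[str]:
    """Symlink reversed-order parallel files, then return the names still missing."""
    for lang in (src, tgt):
        expected = workdir / f"OpenSubtitles2016.{src}-{tgt}.{lang}"
        reversed_name = f"OpenSubtitles2016.{tgt}-{src}.{lang}"
        already_there = expected.exists() or expected.is_symlink()
        if not already_there and (workdir / reversed_name).exists():
            # Relative link target, so the directory can be moved as a whole.
            expected.symlink_to(reversed_name)
    return [
        name
        for name in required_inputs(src, tgt)
        if not (workdir / name).exists()
    ]


if __name__ == "__main__":
    # Example pair from the description: source sl, target hr.
    missing = prepare_inputs(Path("."), "sl", "hr")
    print("missing inputs:", missing or "none")
```

Symlinking is used here because the description names it first; moving or copying the files would satisfy the same naming requirement.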

Identifier (URI): http://hdl.handle.net/11234/1-1970

Language: Czech

Slovak

Slovenian

Croatian

Danish

Swedish

Norwegian

Language (ISO639): ces

slk

slv

hrv

dan

swe

nor

Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

Rights: GNU General Public License 2 or later (GPL-2.0)

http://opensource.org/licenses/GPL-2.0

Subject: parsing

dependency parser

universal dependencies

cross-lingual parsing

Type: toolService

Type (DCMI): Software

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11234/1-1970

DateStamp: 2021-06-29

GetRecord: OAI-PMH request for simple DC format
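
The GetRecord entries above were links in the original page, and their targets are not preserved here. As a rough illustration of what such a request looks like, the sketch below builds an OAI-PMH GetRecord URL from the three arguments the protocol requires (`verb`, `identifier`, `metadataPrefix`). The base URL is a hypothetical placeholder, since this record does not state the repository's OAI-PMH endpoint; `olac` and `oai_dc` are the metadata prefixes conventionally used for the OLAC and simple DC formats.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the record does not give the real OAI-PMH endpoint.
BASE_URL = "https://example.org/oai"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL from its three required arguments."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    oai_id = "oai:lindat.mff.cuni.cz:11234/1-1970"
    # One request per format named in this record: OLAC and simple DC.
    for prefix in ("olac", "oai_dc"):
        print(get_record_url(BASE_URL, oai_id, prefix))
```

`urlencode` percent-encodes the colons and slash in the identifier, which is the form the protocol expects for reserved characters in query arguments.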

Search Info
Citation: Rosa, Rudolf; Zeman, Daniel; Mareček, David; Žabokrtský, Zdeněk. 2017. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Europe country_CZ country_DK country_HR country_NO country_SE country_SI country_SK dcmi_Software iso639_ces iso639_dan iso639_hrv iso639_nor iso639_slk iso639_slv iso639_swe

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-1970
Up-to-date as of: Sun May 4 0:11:14 EDT 2025
