OLAC Record: Multilingual static embeddings for Verbal Multiword Expressions trained on PARSEME raw corpora

OLAC Record
oai:lindat.mff.cuni.cz:11234/1-5528

Metadata

Title: Multilingual static embeddings for Verbal Multiword Expressions trained on PARSEME raw corpora

Bibliographic Citation: http://hdl.handle.net/11234/1-5528

Creator: Estève, Louis Clément

Savary, Agata

Lavergne, Thomas

Date (W3CDTF): 2024-07-12T11:53:50Z

Date Available: 2024-07-12T11:53:50Z

Description: This resource is a set of 14 vector spaces for single words and Verbal Multiword Expressions (VMWEs) in different languages (German, Greek, Basque, French, Irish, Hebrew, Hindi, Italian, Polish, Brazilian Portuguese, Romanian, Swedish, Turkish, Chinese). They were trained with the Word2Vec algorithm, in its skip-gram version, on PARSEME raw corpora automatically annotated for morpho-syntax (http://hdl.handle.net/11234/1-3367). These corpora were annotated by Seen2Seen, a rule-based VMWE identifier, one of the leading tools of the PARSEME shared task version 1.2. VMWE tokens were merged into single tokens. The format of the vector space files is that of the original Word2Vec implementation by Mikolov et al. (2013), i.e. a binary format. For compression, bzip2 was used.
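The description above names a concrete file layout (the original Word2Vec binary format, compressed with bzip2), so a minimal reading sketch may be useful. It uses only the Python standard library. The function names are illustrative, and the little-endian float32 encoding is an assumption: the original tool writes floats in the byte order of the machine that produced the file. How merged VMWE tokens are spelled in the vocabulary is not stated in this record, so the sketch treats every token as an opaque UTF-8 string.

```python
import bz2
import struct
from typing import BinaryIO, Dict, List


def read_word2vec_binary(stream: BinaryIO) -> Dict[str, List[float]]:
    """Parse a stream laid out like the original word2vec binary output.

    Layout assumed here: a text header "<vocab_size> <dim>\n", then for each
    entry the token bytes, one space, and <dim> float32 values.
    """
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for _ in range(vocab_size):
        token = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of data while reading a token")
            if ch == b" ":
                break  # the space separates the token from its vector
            if ch != b"\n":  # newlines between entries are not part of a token
                token.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        # Assumption: little-endian float32 values.
        vectors[token.decode("utf-8")] = list(struct.unpack(f"<{dim}f", raw))
    return vectors


def load_compressed_vectors(path: str) -> Dict[str, List[float]]:
    """Open a bzip2-compressed vector file and parse it without unpacking to disk."""
    with bz2.open(path, "rb") as stream:
        return read_word2vec_binary(stream)
```

Calling `load_compressed_vectors` on one of the downloaded files (the file names are not listed in this record) would return a dictionary from token to vector. Libraries that support the Word2Vec binary format can read the decompressed files as well; the hand-written parser is shown only to make the layout explicit.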

Identifier (URI): http://hdl.handle.net/11234/1-5528

Language: German

Modern Greek (1453-)

Basque

French

Irish

Hebrew

Hindi

Italian

Polish

Portuguese

Romanian

Swedish

Turkish

Chinese

Language (ISO639): deu

ell

eus

fra

gle

heb

hin

ita

pol

por

ron

swe

tur

zho

Publisher: Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique

Rights: PARSEME Shared Task Raw Corpus Data (v. 1.2) Agreement

https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.2-raw

Subject: verbal multiword expressions

word embeddings

word2vec

German language

Modern Greek (1453-) language

Basque language

French language

Irish language

Hebrew language

Hindi language

Italian language

Polish language

Portuguese language

Romanian language

Swedish language

Turkish language

Chinese language

Subject (ISO639): deu

ell

eus

fra

gle

heb

hin

ita

pol

por

ron

swe

tur

zho

Type: lexicalConceptualResource

Type (DCMI): Text

Type (OLAC): lexicon

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11234/1-5528

DateStamp: 2024-07-12

GetRecord: OAI-PMH request for simple DC format
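The GetRecord entries above refer to OAI-PMH requests, which are ordinary HTTP queries carrying a `verb`, an `identifier`, and a `metadataPrefix`. The sketch below shows how such a query string could be assembled. The base URL of the repository's OAI endpoint does not appear in this record, so it is left as a parameter and the value in the usage line is a placeholder. `oai_dc` is the prefix the protocol defines for simple Dublin Core; treating `olac` as the prefix for the OLAC format is an assumption.

```python
from urllib.parse import urlencode


def build_getrecord_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder endpoint: substitute the archive's real OAI-PMH base URL.
url = build_getrecord_url(
    "https://example.org/oai",
    "oai:lindat.mff.cuni.cz:11234/1-5528",
    "oai_dc",
)
```

The identifier is percent-encoded by `urlencode`, so the colons and the slash in the OAI identifier do not interfere with the query string.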

Search Info
Citation: Estève, Louis Clément; Savary, Agata; Lavergne, Thomas. 2024. Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique.
Terms: area_Asia area_Europe country_DE country_ES country_FR country_GR country_IE country_IL country_IN country_IT country_PL country_PT country_RO country_SE country_TR dcmi_Text iso639_deu iso639_ell iso639_eus iso639_fra iso639_gle iso639_heb iso639_hin iso639_ita iso639_pol iso639_por iso639_ron iso639_swe iso639_tur iso639_zho olac_lexicon

Inferred Metadata
Country: Germany Spain France Greece Ireland Israel India Italy Poland Portugal Romania Sweden Turkey
Area: Asia Europe

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-5528
Up-to-date as of: Mon Jun 16 1:08:33 EDT 2025
