OLAC Record
oai:hughandbecky.us:0004

Metadata
Title:Building Multilingual Comparable Corpora
Abstract:Building on existing corpora and new documentary fieldwork in West Africa, we are creating a multilingual comparative corpus. We present a technology toolkit and three parallel workflows that can be used to mobilize language materials for a variety of purposes, particularly for the discovery of discourse patterns in legacy materials.
Access Rights:Open Access
Bibliographic Citation:Paterson, Rebecca Dow Smith, Abbie Hantgan & Ekaterina Aplonova. 2021. Building Multilingual Comparable Corpora. Presentation abstract presented at the 7th International Conference on Language Documentation & Conservation (ICLDC), 4–7 March, University of Hawai‘i at Mānoa. http://ling.lll.hawaii.edu/sites/icldc.
Contributor (researcher):Rebecca Paterson
Christian Chanard
Abbie Hantgan‑Sonko
Contributor (speaker):Rebecca Paterson
Christian Chanard
Abbie Hantgan‑Sonko
Description:Building on existing corpora and new audio/video documentary fieldwork from 12+ languages from across West Africa, we are creating a multilingual comparative corpus with input from 20+ collaborating researchers (Nikitina et al. 2020). We present a toolkit of technologies and three parallel workflows that can be used to mobilize language materials from diverse sources for a variety of purposes, particularly for the discovery of discourse patterns in legacy materials that could then be used in revitalization efforts. Our toolkit includes the following technologies: ELAN-CorpA (Chanard 2015; 2019), FieldWorks Language Explorer (FLEx, SIL International), Toolbox (SIL International), ELAN Tools (Chanard et al. 2020, under development), SpeechReporting Template (Nikitina et al. 2019) and Tsakorpus (Arkhangelskiy 2019). The three workflows differ with regard to the initial file format and the software platform that is used for parsing and glossing texts. All three workflows lead to a collection of annotated files that can be queried with ELAN-CorpA (Hantgan 2019). In the first workflow, (1) ELAN-CorpA is used for time-aligned transcription and translation of a recorded text; (2) FLEx is used to parse and gloss the text; (3) ELAN Tools converts the .flextext export into the project template for use in ELAN-CorpA; and (4) once in the project template, the text is annotated for project categories, and complex queries can be run across all texts in any language using the search features of ELAN-CorpA. In the second workflow, (1) transcription, translation, parsing and glossing are done in Toolbox; (2) ELAN Tools converts the Toolbox file to the project template; and (3) ELAN-CorpA is used to time-align the text and annotate it for the project. In the third workflow, transcription, translation, parsing and glossing are all done in ELAN-CorpA, using the project template from the initial stages.
Using these three workflows, the data from various source file types is processed into a shared format that will be displayed on an online platform via Tsakorpus. This methodology may be of interest to community members looking for ways to prepare already-collected language materials for display on the internet, to those interested in specific questions regarding discourse phenomena, to typologists, and to linguists in general.
Identifier (URI):https://hughandbecky.us/Becky-CV/talk/2021-building-multilingual-comparable-corpora/
Language:English
Language (ISO639):eng
Subject:language_documentation
text_and_corpus_linguistics
West Africa
ELAN
Corpora
Methodology
Subject (OLAC):language_documentation
text_and_corpus_linguistics
Type (DCMI):Event
Type (OLAC):language_description

OLAC Info

Archive:  Rebecca Paterson's Interactive Research Portfolio
Description:  http://www.language-archives.org/archive/hughandbecky.us
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:hughandbecky.us:0004
DateStamp:  2021-02-22
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Rebecca Paterson (speaker); Rebecca Paterson (researcher); Christian Chanard (speaker); Christian Chanard (researcher); Abbie Hantgan‑Sonko (researcher); Abbie Hantgan‑Sonko (speaker). n.d. Rebecca Paterson's Interactive Research Portfolio.
Terms: area_Europe country_GB dcmi_Event iso639_eng olac_language_description olac_language_documentation olac_text_and_corpus_linguistics


http://www.language-archives.org/item.php/oai:hughandbecky.us:0004
Up-to-date as of: Mon May 3 8:38:23 EDT 2021