A Survey of the State of the Art in Digital Language Documentation and Description

Steven Bird and Gary Simons
Draft: 5 December 2000


About this document.
This document has been prepared in conjunction with the workshop on Web-Based Language Documentation and Description, held in Philadelphia on 12-15 December 2000. It is a follow-up to the requirements document, helping to assess the extent to which the requirements are met by the present state of the art.
2004-03-30 NOTE: This document is no longer maintained, and contains many broken hyperlinks.



Whether one is collecting new language data, searching a corpus for an instance of some linguistic phenomenon, looking for dictionaries and texts from a particular language family, converting data to work with a favorite tool, cataloging language resources, or any of a host of similar tasks, one is immediately confronted with a series of questions:

  1. What data is available?
  2. What tools are available?
  3. How adequate are these resources?
  4. Who is creating and using these resources?
  5. Where can I go for advice?

A more extensive list of such questions (with answers) is available at the LTG Helpdesk FAQ.

1. What data is available?

In recent months, we have conducted a survey of language archives [http://www.ldc.upenn.edu/exploration/survey.html]. Respondents were asked to answer the following questions:

1. Name and Location
1.1 Please provide the archive name, URL, host institution, country, contact person and email address.
2. Catalog
2.1 If the archive has a catalog in a standardized format, what fields does it contain? If not, what contextual information about the resources is collected? What other information would you like to collect if you could?
2.2 If the electronic catalog conforms to some standard, please tell us the name of the standard.
2.3 To what extent have the archived materials been cataloged electronically?
2.4 If there is an online public access catalog, please give its URL.
3. Holdings
3.1 What geographical regions and languages are covered?
3.2 Please give impressionistic estimates of the archive holdings for each of the data types: Texts; Wordlists, Vocabularies, Lexicons, Dictionaries; Field Notes, Correspondence, Misc files; Descriptions (Grammars, Phonologies, etc); Audio Recordings; Video Recordings.
3.3 Please list any other data types which are not included above, or any other comments on the archive holdings.
3.4 What proportion of the holdings are unique to the archive and not available elsewhere?
4. Electronic Publication
4.1 To what extent are the archive holdings published electronically, where "published" means that there is a well-defined procedure such that anyone at all can get a standard copy of the data, either on digital media or over the internet?
4.2 To what extent are the archive holdings accessible over the web?
4.3 Is permission required before materials can be accessed?
4.4 Is there any fee for materials?
4.5 How are author and/or editor defined for the electronic publications? Is there a bibliographical citation method?
4.6 Do the electronic publications have ISBN numbers?
4.7 What plans are there to expand the electronic publication of archive holdings?
5. General Issues
5.1 Who is the legal owner of archived materials? The original collector or his/her estate? The language community? The archive or its host institution? Some combination of these?
5.2 Beyond legal ownership, are there any asserted or perceived moral rights concerning archived materials? Do the holders of the archive see the original speakers or their representatives as controlling publication?
5.3 In cases where no electronic publication is planned, why is this so? (e.g. funding, licensing, technical know-how, lack of interest).
5.4 Is any of the data in a proprietary format (e.g. MS Word)? If so, are there plans to transfer it to an open standard (e.g., XML)?
6. Do you have any other comments about digital archives of language material, or on this survey?

Responses were received from some twenty archives, and the completed survey forms are all available online [http://www.ldc.upenn.edu/exploration/survey/].

The full set of archives which have digital catalogs and holdings, or concrete plans for these, is listed below, with URLs and contact names.

  1. AILLA: Archive of Indigenous Languages of Latin America
    [http://uts.cc.utexas.edu/~ailla/introeng.html]
    Joel Sherzer, Anthony Woodbury, University of Texas, Austin
  2. ALMA: African Language Material Archive
    [http://polyglot.lss.wisc.edu/afrst/wara.html]
    Leigh Swigart, West African Research Association
  3. ANLC: Alaska Native Language Center Archives
    [http://www.uaf.edu/anlc]
    Gary Holton, University of Alaska
  4. APS: American Philosophical Society American Indian Manuscript Collections
    [http://www.amphilsoc.org/library/guides/indians/]
    Robert Cox, American Philosophical Society
  5. ASEDA: Aboriginal Studies Electronic Data Archive
    [http://coombs.anu.edu.au/SpecialProj/ASEDA/ASEDA.html]
    Patrick McConvell, Australian Institute of Aboriginal and Torres Strait Islander Studies
  6. BAS: Bavarian Archive of Speech Signals
    [http://www.phonetik.uni-muenchen.de/Bas/BasHomeeng.html]
    Florian Schiel, University of Munich
  7. CDEL: Center for the Documentation of Endangered Languages
    [http://php.indiana.edu/~aisri/lab/home.html]
    Douglas Parks, Wally Hooper, Indiana University
  8. CHILDES: Child Language Data Exchange System
    [http://childes.psy.cmu.edu]
    Brian MacWhinney, Carnegie Mellon University
  9. Corpus Documentale Latinum Portugaliae
    Antonio Emiliano, University of Lisbon
  10. CNNC: Charlotte Narrative and Conversation Collection
    [http://www.uncc.edu/english/cnnc/]
    Boyd Davis, Pat Ryckman, University of North Carolina, Charlotte
  11. Creolist Archives
    [http://www.ling.su.se/Creole/Text_Collection.shtml]
    Mikael Parkvall, University of Stockholm
  12. CDLI: Cuneiform Digital Library Initiative
    [http://cdli.ucla.edu/]
    Robert Englund, UCLA
  13. ELRA: European Language Resources Association
    [http://www.icp.inpg.fr/ELRA/catalog.html]
    Khalid Choukri, Paris
  14. LACITO Linguistic Data Archive
    [http://195.83.92.32/index.html.en]
    Boyd Michailovsky, CNRS, Paris
  15. Linguistic Data Consortium
    [http://www.ldc.upenn.edu/Catalog/]
    Mark Liberman, University of Pennsylvania
  16. LPCA: Language and Popular Culture in Africa Text Archives
    [http://www.pscw.uva.nl/lpca/textarchives/toc.html]
    Vincent De Rooij, University of Amsterdam
  17. Max Planck Institute Language Archive and DOBES Archive
    Peter Wittenburg, Max Planck Institute
  18. NAA: National Anthropological Archives
    [http://www.nmnh.si.edu/naa/]
    Robert Leopold, Smithsonian Institution
  19. OTA: Oxford Text Archive
    [http://ota.ahds.ac.uk/ota/]
    Michael Popham, Oxford University
  20. SIL Language and Culture Archive
    Joan Spanne, Summer Institute of Linguistics
  21. SIL-MEX: SIL Mexico Archive
    [http://www.sil.org/mexico/]
    Albert Bickford, Summer Institute of Linguistics
  22. Survey of California and Other Indian Languages
    [http://linguistics.berkeley.edu/Survey/]
    Leanne Hinton, University of California, Berkeley
  23. UHLCS: University of Helsinki Language Corpus Server
    [http://www.ling.helsinki.fi/uhlcs/]
    Pirkko Suihkonen, University of Helsinki

Most of these archives have a partial digital catalog, and about 25% have a complete digital catalog. A couple of them use MARC or TEI. The following is a list of catalog fields which are used or proposed by the above archives.

Archives use some subset of these elements, in a variety of formats. For certain elements an archive has evidently adopted a controlled vocabulary. At present there are no widely used standards for the storage format, or for the controlled vocabularies, that would make the catalog information from different archives comparable.
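The comparability problem can be illustrated with a small sketch. The Python below assumes two hypothetical archives that describe similar records using different field names and different language labels. The archive names, field names, mapping tables and vocabulary entries are all invented for illustration and are not taken from any surveyed catalog.

```python
# Hypothetical per-archive field names, mapped onto a shared set of elements.
FIELD_MAP = {
    "archive_a": {"Title": "title", "Lang": "language", "Kind": "type"},
    "archive_b": {"name": "title", "language_name": "language", "genre": "type"},
}

# Hypothetical controlled vocabulary: variant labels -> one agreed label.
LANGUAGE_VOCAB = {
    "quechua": "Quechua",
    "kechua": "Quechua",
    "runa simi": "Quechua",
}


def normalize(archive, record):
    """Rename fields to the shared elements and normalize the language label."""
    mapping = FIELD_MAP[archive]
    # Fields with no entry in the mapping are dropped from the shared view.
    out = {mapping[k]: v for k, v in record.items() if k in mapping}
    lang = out.get("language")
    if lang is not None:
        # Unknown labels are passed through unchanged rather than guessed.
        out["language"] = LANGUAGE_VOCAB.get(lang.strip().lower(), lang)
    return out


a = normalize("archive_a", {"Title": "Wordlist", "Lang": "Kechua", "Kind": "lexicon"})
b = normalize("archive_b", {"name": "Narratives", "language_name": "runa simi", "genre": "text"})
print(a["language"] == b["language"])
```

Without shared mapping tables of this kind, each pair of archives would need its own ad hoc conversion, which is the situation the survey responses describe.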

About half of these archives have some materials in digital form, and about 20% are completely digital. Digital materials are stored in a variety of formats, including: HTML, SGML, XML, PDF, TEI Lite, Filemaker, MS Access, MS Word, and project-internal formats.

To find out what is available, it is necessary to consult the catalogs of each archive independently, typically using different interfaces and vocabularies for each one.

There are links pages, e.g. Corpus Linguistics.

2. What tools are available?

Available tools are listed on several links pages, including the following:

For LinguistList and the CMU AI Repository, the categorization of the tools is by application domain (e.g. text analysis, morphology, fonts, ...). For the Linguistic Annotation and Linguistic Exploration pages, there is a key for the platform. In the other cases there is no categorization.

The ACL/DFKI Natural Language Software Registry

The Natural Language Software Registry is a key community resource initiated by the ACL and organized by DFKI in Saarbrücken.

The registry uses a taxonomy based on: State of the art in Language Technology

[http://registry.dfki.de/]
Hans Uszkoreit, Thierry Declerck

Categories:

  1. annotation tools
  2. evaluation tools
  3. resources: grammars, lexicons, multimodal corpora, spoken language corpora, terminology, written language corpora
  4. multimodality
  5. NLP development aid: tools, formalisms, machine learning methods, architectures, theories
  6. spoken language: signal analysis, signal editing, signal process, speaker recognition, speech analysis, speech editing, speech processing, speech production, speech recognition, speech synthesis, spoken dialog systems, spoken language generation, spoken language translation, spoken language understanding, text-to-speech synthesis, voice analysis, voice processing
  7. written language: alignment tools, corpus analysis, deep generation, deep syntactic analysis, document image analysis, grammar and style checkers, handling controlled languages, information extraction, information retrieval, language guesser, lemmatizer, lexicon management, morphological generation, morphological analysis, optical character recognition, part-of-speech tagging, partial parsing, processing mark-up languages, segmenter, semantic and pragmatic analysis, shallow generation, shallow parsing, speech checkers, stemmer, summarization, terminology extraction, terminology management, text classification, tokenization, translation memory, written dialog systems, written language translation, written language understanding

The registry provides a search form, permitting search on the following fields: name, abstract, description, license (free, to negotiate, commercial), kind of license (academic, multiple user, commercial), main section, operating system, supported language.
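A fielded search of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the registry's actual implementation: the record class, the `search` function, the substring-matching behaviour and the two sample entries are all assumptions, and only the field names come from the search form described above.

```python
from dataclasses import dataclass


@dataclass
class RegistryEntry:
    """One tool description, using the searchable fields of the form."""
    name: str
    abstract: str = ""
    description: str = ""
    license: str = ""             # "free", "to negotiate" or "commercial"
    kind_of_license: str = ""     # "academic", "multiple user" or "commercial"
    main_section: str = ""
    operating_system: str = ""
    supported_language: str = ""


def search(entries, **criteria):
    """Return the entries in which every given field contains the query text.

    Matching is a case-insensitive substring test (an assumption).
    """
    results = []
    for entry in entries:
        if all(str(query).lower() in getattr(entry, field_name).lower()
               for field_name, query in criteria.items()):
            results.append(entry)
    return results


# Hypothetical sample entries, invented for this sketch.
entries = [
    RegistryEntry(name="ExampleTagger", license="free",
                  main_section="written language",
                  operating_system="Unix", supported_language="English"),
    RegistryEntry(name="ExampleSynth", license="commercial",
                  main_section="spoken language",
                  operating_system="Windows", supported_language="German"),
]

hits = search(entries, license="free", main_section="written")
print([entry.name for entry in hits])
```

Combining criteria with a logical AND is a design choice made for the sketch; a real form might also offer OR queries or exact matching on the license fields.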

3. How adequate are these resources? (draft)

At present, users learn how adequate a resource is by trial and error.

No systematic evaluation is available.

The listings cover just tools; there is no support for interoperability, standard formats, etc.

Best practice recommendations exist (e.g. TEI, CES). What is the extent of their adoption?

4. Who is creating and using these resources?

The community is arranged into three main groups. The first group is engaged in the core activity of generating and using language resources. The second group provides the technical foundation for this core activity, while the third group constitutes the administrative umbrella.

1. CREATORS AND USERS OF LANGUAGE RESOURCES - THE CORE ACTIVITY
  Speakers: using and learning languages; providing primary materials and commentary; promoting language use and teaching.
  Descriptivists: linguists, sociolinguists, and linguistic anthropologists documenting language structure and use.
  Educators: teaching specific languages, and the linguistic structure of specific languages.
  Theorists: developing new models of the human language faculty.
  Technologists: developing new human language technologies.
2. IMMEDIATE INFRASTRUCTURE - THE TECHNICAL FOUNDATION
  Archivists: digital archivists and librarians providing storage and access for language resources.
  Developers: computer scientists developing models, formats, architectures and tools for creating and searching digital language data.
  Publishers: disseminating language resources in paper and digital form.
3. SPONSORS AND PROMOTERS - THE UMBRELLA
  Professional Associations: promoting language resources, and the adoption of best-practices for digital archives.
  Government Funding Agencies: establishing funding priorities, and evaluating and enabling language resources.
  Non-Governmental Organizations: promoting and funding language resources.

Table 1: The Language Resources Community

Some archives catalog/distribute the resources of others.

5. Where can I go for advice?

Creators, users and archivers of language resources are often faced with a bewildering array of technological options, with no obvious source for competent advice. The most popular method for obtaining advice is the large collection of electronic mailing lists. On many of the following lists there is significant exchange of information concerning best practices.

  anthro-list: anthro-l@listserv.acsu.buffalo.edu
  archives-list: archives@listserv.muohio.edu
  corpora-list: corpora@hd.uib.no
  diglib-list: diglib@infoserv.nlc-bnc.ca
  elsnet-list: elsnet-list@let.ruu.nl
  electronic-records-list: erecs-l@listserv.albany.edu
  empiricists-list: empiricists@unagi.cis.upenn.edu
  endangered-languages-list: endangered-languages-l@carmen.murdoch.edu.au
  exploration-list: linguistic-exploration@listserv.linguistlist.org
  language-culture-list: language-culture@cs.uchicago.edu
  linganth-list: linganth@cc.rochester.edu
  linguist-list: linguist@listserv.linguistlist.org
  nl-kr-list: nl-kr@cs.rpi.edu
  salt-request: salt-request@cstr.ed.ac.uk
  saltmil: saltmil@egroups.com

Another source of advice is the LTG Helpdesk. This site represents a vision for a repository / clearing house for best practice recommendations.

People needing advice typically resort to posting a query on one or more lists, sorting through the responses, and possibly posting a summary of responses back to the lists. However, it is often difficult to decide on a good course of action when the primary information is an uncoordinated set of suggestions originating from strangers on a mailing list. In a period of rapidly evolving technology, a wrong choice can lead to a dead end in which painstakingly collected data ends up being unusable. Numerous experiences of this community attest to this reality. So how can we make wise use of the new technological opportunities before us?