OLAC Record

Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Baayen, R H., R Piepenbrock, and L Gulikers. CELEX2 LDC96L14. Web Download. Philadelphia: Linguistic Data Consortium, 1995
Contributor:Baayen, R H.
Piepenbrock, R
Gulikers, L
Date (W3CDTF):1995
Description:*Introduction* This corpus contains ASCII versions of the CELEX lexical databases of English (Version 2.5), Dutch (Version 3.1) and German (Version 2.0). CELEX was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. Pre-mastering and production was done by the LDC. For each language, this data set contains detailed information on: * orthography (variations in spelling, hyphenation) * phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress) * morphology (derivational and compositional structure, inflectional paradigms) * syntax (word class, word class-specific subcategorizations, argument structures) * word frequency (summed word and lemma counts, based on recent and representative text corpora) The databases have not been tailored to fit any particular database management program. Instead, the information is in ASCII files in a UNIX directory tree that can be queried with tools, such as AWK or ICON. Unique identity numbers allow the linking of information from different files. Some kinds of information have to be computed online; wherever necessary, AWK functions have been provided to recover this information. README files specify the details of their use. A detailed User Guide describing the various kinds of lexical information available is supplied. All sections of this guide are POSTSCRIPT files, except for some additional notes on the German lexicon in plain ASCII. CELEX-2 The second release of CELEX contains an enhanced, expanded version of the German lexical database (2.5), featuring approximately 1,000 new lemma entries, revised morphological parses, verb argument structures, inflectional paradigm codes and a corpus type lexicon. A complete PostScript version of the Germanic Linguistic Guide is also included, in both European A-4 format and American Letter format. For German, the total number of lemmas included is now 51,728, while all their inflected forms number 365,530. Moreover, phonetic syllable frequencies have been added for (British) English and Dutch. Apart from this, and provision of frequency information alongside every lexical feature, no changes have been made to Dutch and English lexicons. Complete AWK-scripts are now provided to compute representations not found in the (plain ASCII) lexical data files, corresponding to the features described in CELEX User Guide, which is included as well. For each language, i.e. English, German and Dutch, the data contains detailed information on the orthography (variations in spelling, hyphenation), the phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress), the morphology (derivational and compositional structure, inflectional paradigms), the syntax (word class, word-class specific subcategorisation, argument structures) and word frequency (summed word and lemma counts, based on resent and representative text corpora) of both wordforms and lemmas. Unique identity numbers allow the linking of information from different files with the aid of an efficient, index-based C-program. Like its predecessor, this release is mastered using the ISO 9660 daa format, with the Rock Ridge extensions, allowing it to be used in VMS, MS-DOS, Macintosh and UNIX environments. As the new release does not omit any data from the first edition, the current release will replace the old one. *Updates* Petra Stiener has developed a number of scripts to modify and update CELEX2 to a modern format. They are available on her github page. LREC papers related to these updates are accessible at the following urls: http://aclweb.org/anthology/W17-7619 & http://www.lrec-conf.org/proceedings/lrec2016/summaries/761.html.
Extent:Corpus size: 287744 KB
ISBN: 1-58563-085-3
ISLRN: 204-698-863-053-1
DOI: 10.35111/gs6s-gm48
Language (ISO639):eng
License:CELEX Agreement: https://catalog.ldc.upenn.edu/license/celex-user-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC96L14
Subject:German language
English language
Dutch language
Subject (ISO639):deu
Type (DCMI):Text
Type (OLAC):lexicon


Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC96L14
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Baayen, R H.; Piepenbrock, R; Gulikers, L. 1995. Linguistic Data Consortium.
Terms: area_Europe country_DE country_GB country_NL dcmi_Text iso639_deu iso639_eng iso639_nld olac_lexicon

Inferred Metadata

Country: GermanyUnited KingdomNetherlands
Area: Europe

Up-to-date as of: Mon Mar 25 7:19:54 EDT 2024