OLAC Record
oai:www.ldc.upenn.edu:LDC2009T26

Metadata
Title:NXT Switchboard Annotations
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Calhoun, Sasha, et al. NXT Switchboard Annotations LDC2009T26. Web Download. Philadelphia: Linguistic Data Consortium, 2009
Contributor:Calhoun, Sasha
Carletta, Jean
Jurafsky, Daniel
Nissim, Malvina
Ostendorf, Mari
Zaenen, Annie
Date (W3CDTF):2009
Date Issued (W3CDTF):2009-11-20
Description:*Introduction* NXT Switchboard Annotations brings together in NITE XML, a single XML format, the multiple layers of annotation performed on a transcript subset of Switchboard-1 Release 2 (LDC97S62). NXT Switchboard Annotations was developed in a collaboration among researchers from Edinburgh University, Stanford University and the University of Washington. The original Switchboard corpus is a collection of spontaneous telephone conversations between previously unacquainted speakers of American English on a variety of topics chosen from a pre-determined list. A subset of one million words from those conversations was annotated for syntactic structure and disfluencies as part of the Penn Treebank project. Phonetic transcripts were generated by the International Computer Science Institute, University of California Berkeley, and later corrected by the Institute for Signal and Information Processing, Mississippi State University. The Penn Treebank transcripts provided the basis for the NXT Switchboard corpus, and the noun phrases from that subset were annotated for animacy. The Treebank transcript was then aligned with the corresponding subset from the corrected Mississippi State (MS-State) transcript in order to provide word timing information. Focus/contrast and prosodic annotations, as well as phone/syllable alignment, were next added to the annotations. The previous annotations of dialog acts and prosody were converted to NITE XML. Lastly, hand annotations for markables were added to provide information about their animacy and information structure, including coreferential links.
*NXT Annotation* NXT is an open source toolkit that enables multiple linguistic annotations to be assembled into a unified database. It uses a stand-off XML data format that consists of several XML files that point to each other. The NXT format provides a data model that describes how the various annotations for a corpus relate to one another.
For that reason, it does not impose any particular linguistic theory or any particular markup structure. Instead, users define their annotations in a "metadata" file that expresses their contents and how they relate to each other in terms of the graph structure for the corpus annotations overall. The relationships that can be defined in the data model draw annotations together into a set of intersecting trees, but also allow arbitrary links between annotations over the top of this structure, giving a representation that is highly expressive, easier to process than arbitrary graphs, and structured in a way that helps data users. NXT's other core component is a query language designed specifically for working with data conforming to this data model. Together, the data model and query language allow annotations to be treated as one coherent set containing both structural and timing information. The data in NXT Switchboard Annotations was converted from the Penn Treebank bracketed format, in which the Switchboard corpus was originally distributed, using an XML-based tool for syntactic query that comes with a ready-made Switchboard converter. Conversion was performed using a set of XSL stylesheets to extract each of the multiple XML files associated with one dialogue. The data was divided into separate XML files representing the orthographic transcription, syntax, turn structure, disfluencies, and movement (the relationship between traces and their sources). Transcription consists of a flat list of terminals: words, punctuation, traces, and so on. Syntax starts with a flat list of parses and works down through nonterminals, grounding in terminals (which are in the transcription file, but are referenced by pointers that indicate they are to be treated as if they were part of the tree itself). Turn structure is simply a flat list of turns that themselves contain parses as children, again via pointers into the syntax file.
Yet another file couples reparanda and repairs into disfluencies by pointing to the appropriate nonterminals using named roles. A movement file similarly links sources with their target traces. While this representation may seem awkward, it has advantages over the original arrangement. First, it places the information in a single tree structure, with co-indexing for the crossing links that are sometimes required for disfluency and movement. Second, it facilitates querying the crossing structures, since they are treated on a par with other structures within the data. Although this ease is not particularly important for the initial, syntactic data, it is crucial for a correct understanding of discourse phenomena such as coreference. Third, separating the tags into their various types makes it easier to add data using external processes (part-of-speech taggers, named entity recognizers, and the like). Fourth, different people can change different data files at the same time without conflict, as long as neither edits the files they point to and both are able to lock complete paths of files pointing to the data they are revising. Last, a data set can be loaded in whole or in part, speeding up some processing. The NITE XML Toolkit itself treats the data seamlessly no matter whether it is in one file or many.
*Licensing* This corpus is made available to LDC not-for-profit members and all nonmembers under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license. NXT Switchboard Annotations is available to LDC's for-profit members under the terms of their For-Profit Membership Agreements.
*Samples* For an example of the data in this corpus, please consult the Getting Started section of the provider's web site.
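*Illustration* The stand-off arrangement described under NXT Annotation, in which a syntax file grounds its nonterminals in a separate transcription file through pointers, can be sketched roughly as follows. This is a minimal illustration only: the element names, the attribute names, the simplified `file#id(...)` pointer syntax, the words and the timings are all assumptions invented for the example. They are not taken from the corpus files and should not be read as the actual NXT schema.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-off files. Tag names, attributes, words and
# timings are invented for illustration; they are NOT the real NXT schema.
TERMINALS_XML = """
<terminals>
  <word id="w1" start="0.00" end="0.21">i</word>
  <word id="w2" start="0.21" end="0.48">like</word>
  <word id="w3" start="0.48" end="0.90">jazz</word>
</terminals>
"""

SYNTAX_XML = """
<syntax>
  <parse id="p1">
    <nt id="nt1" cat="NP"><child href="terminals.xml#id(w1)"/></nt>
    <nt id="nt2" cat="VP">
      <child href="terminals.xml#id(w2)"/>
      <nt id="nt3" cat="NP"><child href="terminals.xml#id(w3)"/></nt>
    </nt>
  </parse>
</syntax>
"""

# Assumed pointer shape for this sketch: "<file>#id(<element id>)".
HREF_PATTERN = re.compile(r"^(?P<file>[^#]+)#id\((?P<id>[^)]+)\)$")


def index_by_id(root):
    """Map each element's id attribute to the element itself."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve(href, files):
    """Follow a stand-off pointer into the already-loaded target file."""
    match = HREF_PATTERN.match(href)
    if match is None:
        raise ValueError(f"unsupported pointer: {href!r}")
    return files[match["file"]][match["id"]]


def yield_words(node, files):
    """Collect terminals under a nonterminal as if they were in the tree."""
    words = []
    for child in node:
        if child.tag == "child":
            words.append(resolve(child.get("href"), files))
        else:
            words.extend(yield_words(child, files))
    return words


def time_span(node, files):
    """Derive a nonterminal's timing from the terminals it points to."""
    words = yield_words(node, files)
    return float(words[0].get("start")), float(words[-1].get("end"))


files = {"terminals.xml": index_by_id(ET.fromstring(TERMINALS_XML.strip()))}
syntax_index = index_by_id(ET.fromstring(SYNTAX_XML.strip()))
vp = syntax_index["nt2"]

print([w.text for w in yield_words(vp, files)])
print(time_span(vp, files))
```

In this sketch the syntax layer carries no text or timing of its own. Both are recovered by following pointers into the transcription layer, which is why the layers can be edited or loaded separately.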
Extent:Corpus size: 148480 KB
Identifier:LDC2009T26
https://catalog.ldc.upenn.edu/LDC2009T26
ISBN: 1-58563-526-X
ISLRN: 922-902-627-783-3
DOI: 10.35111/nn2p-v103
Language:English
Language (ISO639):eng
License:Creative Commons Attribution-NonCommercial-ShareAlike 3.0 (NFP, Non-Member): https://catalog.ldc.upenn.edu/license/creative-comons-attribution-noncommercial-sharealike-3-dot-0-unported.pdf
LDC For-Profit Membership Agreement: https://catalog.ldc.upenn.edu/license/ldc-for-profit-membership.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2009T26
Rights Holder:Portions © 1992, 1993, 1997, 1999, 2009 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2009T26
DateStamp:  2021-10-27
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Calhoun, Sasha; Carletta, Jean; Jurafsky, Daniel; Nissim, Malvina; Ostendorf, Mari; Zaenen, Annie. 2009. NXT Switchboard Annotations. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T26
Up-to-date as of: Mon Mar 25 7:20:23 EDT 2024