OLAC Record: NXT Switchboard Annotations

OLAC Record
oai:www.ldc.upenn.edu:LDC2009T26

Metadata

Title: NXT Switchboard Annotations

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Calhoun, Sasha, et al. NXT Switchboard Annotations LDC2009T26. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Calhoun, Sasha

Carletta, Jean

Jurafsky, Daniel

Nissim, Malvina

Ostendorf, Mari

Zaenen, Annie

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-11-20

Description: *Introduction* NXT Switchboard Annotations brings together in NITE XML, a single XML format, the multiple layers of annotation performed on a transcript subset from Switchboard-1 Release 2, LDC97S62. NXT Switchboard Annotations was developed in a collaboration among researchers from Edinburgh University, Stanford University and the University of Washington. The original Switchboard corpus is a collection of spontaneous telephone conversations between previously unacquainted speakers of American English on a variety of topics chosen from a pre-determined list. A subset of one million words from those conversations was annotated for syntactic structure and disfluencies as part of the Penn Treebank project. Phonetic transcripts were generated by the International Computer Science Institute, University of California, Berkeley, and later corrected by the Institute for Signal Information Processing, Mississippi State University. The Penn Treebank transcripts provided the basis for the NXT Switchboard corpus, and the noun phrases from that subset were annotated for animacy. The Treebank transcript was then aligned with the corresponding subset from the corrected Mississippi State (MS-State) transcript in order to provide word timing information. Focus/contrast and prosodic annotations, as well as phone/syllable alignment, were next added to the annotations. The previous annotations of dialog acts and prosody were converted to NITE XML. Lastly, hand annotations for markables were added to provide information about their animacy and information structure, including coreferential links. *NXT Annotation* NXT is an open source toolkit that enables multiple linguistic annotations to be assembled into a unified database. It uses a stand-off XML data format that consists of several XML files that point to each other. The NXT format provides a data model that describes how the various annotations for a corpus relate to one another.
For that reason, it does not impose any particular linguistic theory or any particular markup structure. Instead, users define their annotations in a "metadata" file that expresses their contents and how they relate to each other in terms of the graph structure for the corpus annotations overall. The relationships that can be defined in the data model draw annotations together into a set of intersecting trees, but also allow arbitrary links between annotations over the top of this structure, giving a representation that is highly expressive, easier to process than arbitrary graphs, and structured in a way that helps data users. NXT's other core component is a query language designed specifically for working with data conforming to this data model. Together, the data model and query language allow annotations to be treated as one coherent set containing both structural and timing information. The data in NXT Switchboard Annotations was converted from the Penn Treebank bracketed format, in which the Switchboard corpus was originally distributed, using an XML-based tool for syntactic query that comes with a ready-made Switchboard converter. Conversion was performed using a set of XSL stylesheets to extract each of the multiple XML files associated with one dialogue. The data was divided into separate XML files representing the orthographic transcription, syntax, turn structure, disfluencies, and movement (the relationship between traces and their sources). Transcription consists of a flat list of terminals: words, punctuation, traces, and so on. Syntax starts with a flat list of parses and works down through nonterminals, grounding in terminals (which are in the transcription file, but are referenced by pointers that indicate they are to be treated as if they were part of the tree itself). Turn structure is simply a flat list of turns that themselves contain parses as children, again via pointers into the syntax file.
Yet another file couples reparanda and repairs into disfluencies by pointing to the appropriate nonterminals using named roles. A movement file similarly links sources with their target traces. While this representation may seem awkward, it has advantages over the original arrangement. First, it places the information in a single tree structure, with co-indexing for the crossing links that are sometimes required for disfluency and movement. Second, it facilitates querying the crossing structures, since they are treated on a par with other structures within the data. Although this ease is not particularly important for the initial, syntactic data, it is crucial for a correct understanding of discourse phenomena such as coreference. Third, separating the tags into their various types makes it easier to add data using external processes (part-of-speech taggers, named entity recognizers, and the like). Fourth, different people can change different data files at the same time without conflict, as long as neither edits the files they point to and both are able to lock complete paths of files pointing to the data they are revising. Last, a data set can be loaded in whole or in part, speeding up some processing. The NITE XML Toolkit itself treats the data seamlessly no matter whether it is in one file or many. *Licensing* This corpus is made available to LDC not-for-profit members and all nonmembers under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license. NXT Switchboard Annotations is available to LDC's for-profit members under the terms of their For-Profit Membership Agreements. *Samples* For an example of the data in this corpus, please consult the Getting Started section of the provider's web site.
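
The stand-off arrangement described above, in which a syntax file grounds its nonterminals in a separate terminals file through pointers, can be illustrated with a minimal Python sketch. The element names (`word`, `nt`, `child`), the `ref` attribute, and the words and timings are all invented for illustration; they do not reproduce the actual NXT Switchboard schema or data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-off "files". Names, words and timings are
# invented and are NOT taken from the NXT Switchboard release.
TERMINALS_XML = """
<terminals>
  <word id="t1" start="0.00" end="0.21">i</word>
  <word id="t2" start="0.21" end="0.48">like</word>
  <word id="t3" start="0.48" end="0.90">jazz</word>
</terminals>
"""

SYNTAX_XML = """
<syntax>
  <parse id="p1">
    <nt id="n1" cat="NP"><child ref="t1"/></nt>
    <nt id="n2" cat="VP">
      <child ref="t2"/>
      <nt id="n3" cat="NP"><child ref="t3"/></nt>
    </nt>
  </parse>
</syntax>
"""


def index_by_id(root):
    """Map every element that carries an id attribute to itself."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def yield_of(node, terminals):
    """Collect the terminals a node dominates, following pointers in order."""
    words = []
    for child in node:
        if child.tag == "child":
            # A pointer: the referenced terminal is treated as part of the tree.
            words.append(terminals[child.get("ref")])
        else:
            words.extend(yield_of(child, terminals))
    return words


terminals = index_by_id(ET.fromstring(TERMINALS_XML.strip()))
syntax = index_by_id(ET.fromstring(SYNTAX_XML.strip()))


def span(node_id):
    """Return (text, start, end) for a syntax node, with timing from terminals."""
    words = yield_of(syntax[node_id], terminals)
    text = " ".join(w.text for w in words)
    return text, float(words[0].get("start")), float(words[-1].get("end"))


print(span("n2"))
print(span("p1"))
```

Because the timing lives only on the terminals, any structure that points into them (a parse, a turn, a disfluency) can derive its own start and end times in the same way, which is one reading of how structural and timing information are combined.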

Extent: Corpus size: 148480 KB

Identifier: LDC2009T26

https://catalog.ldc.upenn.edu/LDC2009T26

ISBN: 1-58563-526-X

ISLRN: 922-902-627-783-3

DOI: 10.35111/nn2p-v103

Language: English

Language (ISO639): eng

License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 (NFP, Non-Member): https://catalog.ldc.upenn.edu/license/creative-comons-attribution-noncommercial-sharealike-3-dot-0-unported.pdf

LDC For-Profit Membership Agreement: https://catalog.ldc.upenn.edu/license/ldc-for-profit-membership.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2009T26

Rights Holder: Portions © 1992, 1993, 1997, 1999, 2009 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009T26

DateStamp: 2021-10-27

GetRecord: OAI-PMH request for simple DC format
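
The two GetRecord entries above correspond to OAI-PMH requests that differ only in their `metadataPrefix`. A minimal sketch of how such request URLs are formed follows; the base URL is a placeholder assumption (the record does not state the archive's endpoint), and `olac` is assumed to be the prefix for the OLAC format, while `oai_dc` is the prefix the OAI-PMH specification defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint, not the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2009T26"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record and one format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


olac_url = get_record_url(IDENTIFIER, "olac")   # OLAC format (assumed prefix)
dc_url = get_record_url(IDENTIFIER, "oai_dc")   # simple Dublin Core format

print(olac_url)
print(dc_url)
```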

Search Info
Citation: Calhoun, Sasha; Carletta, Jean; Jurafsky, Daniel; Nissim, Malvina; Ostendorf, Mari; Zaenen, Annie. 2009. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T26
Up-to-date as of: Wed Oct 29 7:01:09 EDT 2025
