OLAC Record: HyTER Networks of Selected OpenMT08/09 Sentences

OLAC Record
oai:www.ldc.upenn.edu:LDC2014T09

Metadata

Title: HyTER Networks of Selected OpenMT08/09 Sentences

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Dreyer, Markus, and Daniel Marcu. HyTER Networks of Selected OpenMT08/09 Sentences LDC2014T09. Web Download. Philadelphia: Linguistic Data Consortium, 2014

Contributor: Dreyer, Markus

Marcu, Daniel

Date (W3CDTF): 2014

Date Issued (W3CDTF): 2014-05-15

Description: *Introduction* HyTER Networks of Selected OpenMT08/09 Progress Set Sentences was developed by SDL and contains HyTER (Hybrid Translation Edit Rate) networks for 102 selected source Arabic and Chinese sentences from OpenMT08 and OpenMT09 Progress Set data. HyTER is an evaluation metric based on large reference networks. The networks are created with an annotation tool that lets users compactly encode an exponential number of correct translations for a given sentence. Reference networks can be used as a foundation for developing improved machine translation evaluation metrics and for automating the evaluation of human translation efficiency.

*Data* The source material consists of Arabic and Chinese newswire and web data collected by LDC in 2007. Annotators created meaning-equivalent annotations under three annotation protocols. In the first protocol, foreign language native speakers built English networks starting from foreign language sentences. In the second, English native speakers built English networks from the best translation of a foreign language sentence as identified by NIST (National Institute of Standards and Technology). In the third, English native speakers built English networks starting from the best translation, but also had access to three additional, independently produced human translations. Networks created by different annotators for each sentence were combined and evaluated. This release includes the source sentences and four human reference translations produced by LDC in XML format, along with five machine translation system outputs representing a variety of system architectures and performance levels, and the human post-edited output of those systems, also in XML.

*Samples* Please view this FST sample and Reference XML sample.

*Updates* None at this time.
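To illustrate how a small reference network can stand for exponentially many translations, the following is a minimal sketch in Python. It is not the corpus's FST format or the HyTER tooling: the edge list, the state numbers, the example words, and the function `count_paths` are all hypothetical, chosen only to show that the number of encoded sentences multiplies with each slot that offers alternatives.

```python
from collections import defaultdict


def count_paths(edges, start, final):
    """Count distinct start-to-final paths in an acyclic word network.

    Each edge is (source_state, target_state, word). Parallel edges between
    the same two states represent alternative wordings for one slot.
    """
    successors = defaultdict(list)
    for src, dst, _word in edges:
        successors[src].append(dst)

    memo = {}

    def visit(state):
        # A path is complete once it reaches the final state.
        if state == final:
            return 1
        if state not in memo:
            memo[state] = sum(visit(nxt) for nxt in successors[state])
        return memo[state]

    return visit(start)


# Hypothetical toy network: three slots with two alternatives each.
toy_edges = [
    (0, 1, "the"), (0, 1, "these"),
    (1, 2, "talks"), (1, 2, "negotiations"),
    (2, 3, "resumed"), (2, 3, "restarted"),
]

print(count_paths(toy_edges, 0, 3))  # 2 * 2 * 2 = 8 encoded sentences
```

Six edges encode eight sentences here; with n such slots the count grows as 2^n while the network grows only linearly, which is the property the description attributes to reference networks.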

Extent: Corpus size: 336276 KB

Identifier: LDC2014T09

https://catalog.ldc.upenn.edu/LDC2014T09

ISBN: 1-58563-678-9

ISLRN: 811-846-772-709-6

DOI: 10.35111/ed7d-z579

Language: English

Mandarin Chinese

Arabic

Chinese

Language (ISO639): eng

cmn

ara

zho

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Rights Holder: Portions © 2007 Agence France Presse, Al-Ahram, Al Hayat, Al-Quds Al-Arabi, Asharq Al-Awsat, An Nahar, Assabah, China Military Online, Chinanews.com, Guangming Daily, Xinhua News Agency, © 2007, 2013, 2014 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2014T09

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
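
The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request is an HTTP GET with the query parameters `verb`, `identifier`, and `metadataPrefix`, which the OAI-PMH protocol defines. The base URL below is a placeholder and an assumption, since this record does not state the repository's endpoint, and the helper name `build_getrecord_url` is hypothetical. The prefix `oai_dc` is the simple Dublin Core format every OAI-PMH repository must support; the prefix used for the OLAC format is assumed to be `olac`.

```python
from urllib.parse import urlencode


def build_getrecord_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint (assumption): replace with the repository's real base URL.
BASE_URL = "https://example.org/oai"

url = build_getrecord_url(BASE_URL, "oai:www.ldc.upenn.edu:LDC2014T09", "oai_dc")
print(url)
```

Passing `"olac"` instead of `"oai_dc"` would request the OLAC rendering of the same record, assuming the repository advertises that prefix.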

Search Info
Citation: Dreyer, Markus; Marcu, Daniel. 2014. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Text iso639_ara iso639_cmn iso639_eng iso639_zho olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2014T09
Up-to-date as of: Wed Oct 29 7:01:26 EDT 2025
