OLAC Record: Semantic Textual Similarity (STS) 2013 Machine Translation

OLAC Record
oai:www.ldc.upenn.edu:LDC2013T18

Metadata

Title: Semantic Textual Similarity (STS) 2013 Machine Translation

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Agirre, Eneko, et al. Semantic Textual Similarity (STS) 2013 Machine Translation LDC2013T18. Web Download. Philadelphia: Linguistic Data Consortium, 2013

Contributor: Agirre, Eneko

Cer, Daniel

Diab, Mona

Gonzalez-Agirre, Aitor

Guo, Weiwei

Date (W3CDTF): 2013

Date Issued (W3CDTF): 2013-09-16

Description: *Introduction* Semantic Textual Similarity (STS) 2013 Machine Translation was developed as part of the STS 2013 Shared Task, which was held in conjunction with *SEM 2013, the second joint conference on lexical and computational semantics organized by the ACL (Association for Computational Linguistics) special interest groups SIGLEX and SIGSEM. It consists of one text file containing 750 English sentence pairs translated from Arabic and Chinese newswire and web data sources. STS measures the degree of semantic equivalence between two texts. The STS task was proposed as a unified framework for the extrinsic evaluation of multiple semantic components, which have historically tended to be evaluated independently and without characterizing their impact on natural language processing (NLP) applications. More information is available at the STS 2013 Shared Task homepage.
*Data* The source data is Arabic and Chinese newswire and web data collected by LDC that was translated and used in the DARPA GALE (Global Autonomous Language Exploitation) program and in several NIST Open Machine Translation evaluations. Of the 750 sentence pairs, 150 are from the GALE Phase 5 collection and 600 are from NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets (LDC2013T07). The data was built to identify semantic textual similarity between two short text passages. Each line of the corpus contains two tab-delimited sentences: the first is a translation and the second is a post-edited translation. Post-editing is a process to improve machine translation with a minimum of manual labor. The gold standard similarity values and other STS datasets can be obtained from the STS homepage mentioned above.
*Samples* Please view this text sample.
*Updates* None at this time.
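The line layout described under *Data* (two tab-delimited sentences per line, translation first, post-edited translation second) can be read with a few lines of Python. The following is a minimal sketch, not part of the LDC release: the names SentencePair, parse_pairs and read_pairs are hypothetical, and the UTF-8 encoding and the skipping of blank lines are assumptions, since the record does not state the file name or encoding.

```python
from typing import Iterable, Iterator, List, NamedTuple


class SentencePair(NamedTuple):
    translation: str  # first column: the translation
    post_edited: str  # second column: the post-edited translation


def parse_pairs(lines: Iterable[str]) -> Iterator[SentencePair]:
    """Yield one SentencePair per non-empty, tab-delimited line."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines, if any, carry no data
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 tab-delimited fields, got {len(fields)}"
            )
        yield SentencePair(fields[0].strip(), fields[1].strip())


def read_pairs(path: str, encoding: str = "utf-8") -> List[SentencePair]:
    """Read a whole file of sentence pairs; the encoding is an assumption."""
    with open(path, encoding=encoding, newline="") as handle:
        return list(parse_pairs(handle))
```

Calling read_pairs on the corpus text file should return 750 pairs if the file matches the description above. A line with more or fewer than two fields raises ValueError rather than being silently dropped, so format surprises show up early.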

Extent: Corpus size: 264 KB

Identifier: LDC2013T18

https://catalog.ldc.upenn.edu/LDC2013T18

ISBN: 1-58563-656-8

ISLRN: 857-492-590-583-5

DOI: 10.35111/cy4d-7c39

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2013T18

Rights Holder: Portions © 2007 Agence France Presse, Al-Ahram, Al Hayat, Al-Quds Al-Arabi, Asharq Al-Awsat, An Nahar, Assabah, China Military Online, Chinanews.com, Guangming Daily, Xinhua News Agency, © 2007, 2013 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2013T18

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
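
The GetRecord entries above refer to OAI-PMH requests, which are ordinary HTTP queries with the arguments verb, identifier and metadataPrefix. The sketch below only builds such a URL. The base URL is a placeholder, because this record does not state the repository endpoint, and the function name get_record_url is hypothetical. The prefixes oai_dc (simple DC) and olac (OLAC format) are the ones conventionally used for these two formats.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real repository base URL is not given in this record.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2013T18"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# One request per format listed in this record.
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
```

urlencode percent-encodes the colons in the identifier, so the resulting query string is safe to pass to any HTTP client once BASE_URL is replaced with the actual endpoint.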

Search Info
Citation: Agirre, Eneko; Cer, Daniel; Diab, Mona; Gonzalez-Agirre, Aitor; Guo, Weiwei. 2013. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2013T18
Up-to-date as of: Wed Oct 29 7:01:25 EDT 2025
