OLAC Record: 2001 Topic Annotated Enron Email Data Set

OLAC Record
oai:www.ldc.upenn.edu:LDC2007T22

Metadata

Title: 2001 Topic Annotated Enron Email Data Set

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Dr. Michael W. Berry, Murray Browne, and Ben Signer. 2001 Topic Annotated Enron Email Data Set LDC2007T22. Web Download. Philadelphia: Linguistic Data Consortium, 2007

Contributor: Dr. Michael W. Berry

Browne, Murray

Signer, Ben

Date (W3CDTF): 2007

Date Issued (W3CDTF): 2007-06-20

Description: *Introduction* The 2001 Topic Annotated Enron Email Data Set contains 4,936 emails (approximately 5,000) from Enron Corporation (Enron), manually indexed into 32 topics. It is a subset of the original Enron Email Data Set of 1.5 million emails, which was posted on the Federal Energy Regulatory Commission website as a matter of public record during the investigation of Enron. The original set suffered from document integrity problems; attempts were made to improve the quality of the data and to remove some sensitive and private information. Dr. William Cohen of Carnegie Mellon University took the lead in distributing the improved corpus, which consists of 517,431 Enron employee emails covering the period 1999-2002. This corpus is a subset of the Carnegie Mellon data set and covers January 2001 through December 2001. The email topics reflect the business activities and interests of Enron employees in that year: California energy problems and the subsequent state and Federal investigations, Enron's downfall (newsfeeds and interoffice communications), Enron's venture with the Dabhol India Power Company, EnronOnline (Enron's trading infrastructure), competitors (Dynegy, El Paso Pipeline), and even fantasy football and college football. Excluded from this data set are duplicates, emails that are too short, and emails that represent a message type rather than a topic (personnel memos and personal quips). The manual indexing was performed in the summer of 2006 by two people who worked closely together: a research associate familiar with the Enron saga and a junior in economics at the University of Tennessee.
The original Enron Email Data Set was the first large email set made available to researchers, but until now there has been no way to assess the performance of topic detection and tracking algorithms on it. An annotated subset such as this one should give text mining researchers a way to evaluate the accuracy of new algorithms for clustering and classification. This data set can also provide communication context for researchers using the Enron Email Data Set in social network analysis. Previous annotations, such as the one developed at UC Berkeley, have been based primarily on email type rather than the specific topic(s) of discussion. This annotation can be used to characterize the discussion topics between the individuals and groups that make up a social network of Enron employees.
*Updates* As of Aug 13, 2007, an update corrects a small error in the subject annotation file. Members and licensees who received this publication before Aug 13, 2007 should re-download the corpus. All copies issued since that date have been corrected.

Extent: Corpus size: 1677721 KB

Identifier: LDC2007T22

https://catalog.ldc.upenn.edu/LDC2007T22

ISBN: 1-58563-441-7

ISLRN: 171-422-435-824-5

DOI: 10.35111/sk40-2c88

Language: English

Language (ISO639): eng

License: 2001 Topic Annotated Enron Email Data Set Agreement: https://catalog.ldc.upenn.edu/license/2001-topic-annotated-enron-email-data-set.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2007T22

Rights Holder: Portions © 2006, 2007 Dr. Michael W. Berry, © 2007 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2007T22

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Dr. Michael W. Berry; Browne, Murray; Signer, Ben. 2007. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2007T22
Up-to-date as of: Wed Oct 29 7:00:59 EDT 2025
