The Open Roget’s Project:
A freely available NLP-friendly implementation
of the 1911 Roget's Thesaurus

The Open Roget's Project provides a fully functional lexical resource for Natural Language Processing, based on Roget's Thesaurus. The Java implementation, built on the 1911 data, now includes a significantly updated lexicon. The process of updating Roget's Thesaurus is documented in this paper:

Alistair Kennedy, Stan Szpakowicz (2014). Evaluation of Automatic Updates of Roget's Thesaurus. Journal of Language Modelling 2(1), 1-49.
(open access; download it at the JLM site)

To obtain Open Roget's, visit Alistair Kennedy's resource page, or download the thesaurus directly as a gzipped tar archive.

Project Gutenberg offers the unedited 1911 Roget's Thesaurus, which is not quite NLP-friendly.

Please direct questions and comments to
Alistair Kennedy or to Stan Szpakowicz.

Thanks to Mario Jarmasz, the author of the original system, which relied on limited-access data, and to Alyona Medelyan for retooling that system to work with the public-domain 1911 data.
Open Roget's stats