| ExtractReuters | Split the Reuters SGML documents into simple text files containing the title, date, dateline, and body. | code | html |
| ExtractWikipedia | Extract the downloaded Wikipedia dump into separate files for indexing. | code | html |
| ExtractWikipedia.Parser | | code | html |