The project uses the German Wikipedia as a source of documents for two purposes: as training data and as a source of data to be annotated. Each month, the Wikipedia maintainers provide a dump of all documents in the database in XML, BZ, and BZ2 form. The dump consists of a single XML file containing the whole encyclopedia, which can be used for various kinds of analysis, such as statistics, service lists, etc.
Wikipedia dumps are available from the Wikipedia database download page. The WikipediaDumpIndexer tool generates plain text from a Wikipedia database dump, discarding all other information and annotations present in Wikipedia pages, such as images, tables, references, and lists.
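To illustrate the kind of processing described above, here is a minimal Python sketch that streams pages out of a dump and reduces them to plain text. It is not the WikipediaDumpIndexer implementation: the function names (`open_dump`, `iter_pages`, `strip_markup`) are hypothetical, the dump is assumed to follow the MediaWiki export layout with `page`, `title`, and `text` elements, and the regular expressions are a deliberately crude approximation that ignores templates and nested markup.

```python
import bz2
import re
import xml.etree.ElementTree as ET


def open_dump(path):
    """Open a dump as a binary stream, whether plain XML or bz2-compressed."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    return open(path, "rb")


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(path):
    """Yield (title, wikitext) pairs one page at a time, without loading the whole file."""
    with open_dump(path) as stream:
        title, text = None, ""
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                title, text = None, ""
                elem.clear()  # release the finished page's children


def strip_markup(wikitext):
    """Crudely remove references, tables, list items, images, and link syntax."""
    text = re.sub(r"<ref[^>]*/>", "", wikitext)                       # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)  # refs with content
    text = re.sub(r"\{\|.*?\|\}", "", text, flags=re.DOTALL)          # tables
    text = re.sub(r"^[*#:;].*$", "", text, flags=re.MULTILINE)        # list items
    text = re.sub(r"\[\[(?:Datei|Bild|File|Image):[^\]]*\]\]", "", text)  # images
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)     # links -> label
    text = re.sub(r"'{2,}", "", text)                                 # bold/italic quotes
    return re.sub(r"\n{2,}", "\n", text).strip()
```

A caller would loop over `iter_pages(path)` and pass each page's wikitext through `strip_markup` before indexing it. Streaming with `iterparse` matters here because the full dump is a single very large XML file that should not be loaded into memory at once.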
- Indexing Wikipedia contents in a more meaningful manner.
- XML parsing of Wikipedia dumps
- BZ parsing of Wikipedia dumps
- BZ2 parsing of Wikipedia dumps
See CHANGES.txt
The code is licensed under the Apache License, Version 2.0 (ALv2).