Simple and fast .NET tool for extracting and indexing plain text from Wikipedia dumps

polytronicgr/WikipediaDumpIndexer

 
 

WikipediaDumpIndexer

The project uses the German Wikipedia as a source of documents for several purposes, including training data and data to be annotated. Each month, the Wikipedia maintainers provide a dump of all documents in the database in XML, BZ, and BZ2 form. The dump consists of a single XML file containing the whole encyclopedia, which can be used for various kinds of analysis, such as statistics and service lists.
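Because the dump is a single, very large XML file, it is usually read as a stream rather than loaded into memory. The following is a minimal sketch of that idea, not the tool's actual implementation. It assumes an uncompressed dump whose pages carry `title` and `text` elements, as in the MediaWiki export format; `DumpReader` and `ReadPages` are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper for illustration; not part of WikipediaDumpIndexer's API.
public static class DumpReader
{
    // Lazily yields (title, wikitext) pairs from an uncompressed dump.
    // Assumption: each page has one <title> followed by a <text> element.
    public static IEnumerable<(string Title, string Text)> ReadPages(TextReader input)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };
        using var reader = XmlReader.Create(input, settings);

        var title = string.Empty;
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "title")
            {
                // ReadElementContentAsString already advances past the end tag,
                // so no extra Read() call is needed in this branch.
                title = reader.ReadElementContentAsString();
            }
            else if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "text")
            {
                yield return (title, reader.ReadElementContentAsString());
            }
            else
            {
                reader.Read();
            }
        }
    }
}
```

A caller could pass a `StreamReader` opened on the dump file and enumerate the result with `foreach`; only one page is held in memory at a time. Matching on `LocalName` keeps the sketch independent of the XML namespace declared on the root element.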

Wikipedia dumps are available from the Wikipedia database download page. The WikipediaDumpIndexer tool generates plain text from a Wikipedia database dump, discarding any other information or annotation present in Wikipedia pages, such as images, tables, references, and lists.
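To illustrate what discarding images, tables, references, and lists can look like, here is a rough sketch based on regular expressions. It is an assumption about one possible approach, not the tool's own code: `WikiTextCleaner` and `ToPlainText` are hypothetical names, and the patterns deliberately ignore nested markup and templates, which a real extractor would have to handle.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of WikipediaDumpIndexer's API.
public static class WikiTextCleaner
{
    public static string ToPlainText(string wikitext)
    {
        var text = wikitext;

        // References: self-closing tags first, then <ref>...</ref> pairs.
        text = Regex.Replace(text, @"<ref[^>]*?/>", "", RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"<ref[^>]*>.*?</ref>", "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Tables: everything between {| and the next |} (no nesting support).
        text = Regex.Replace(text, @"\{\|.*?\|\}", "", RegexOptions.Singleline);

        // Images: [[File:...]] and its German/legacy aliases, without nested links.
        text = Regex.Replace(text, @"\[\[(?:File|Image|Datei|Bild):[^\[\]]*\]\]", "",
            RegexOptions.IgnoreCase);

        // Lists: whole lines that start with * or #.
        text = Regex.Replace(text, @"^[*#].*$\n?", "", RegexOptions.Multiline);

        // Internal links: keep the label of [[target|label]] or the target of [[target]].
        text = Regex.Replace(text, @"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", "$1");

        // Bold and italic markers ('' and ''').
        text = Regex.Replace(text, @"'{2,}", "");

        return text.Trim();
    }
}
```

The order of the replacements matters: self-closing `<ref/>` tags are removed before paired ones so that a self-closing tag is not mistaken for an opening tag, and image links are removed before ordinary links so that captions do not leak into the output.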

TODOs

  • Indexing Wikipedia contents in a more meaningful manner
  • XML parsing of Wikipedia dumps
  • BZ parsing of Wikipedia dumps
  • BZ2 parsing of Wikipedia dumps

Recent Changes

See CHANGES.txt

Committers

Licensing

The code is licensed under the Apache License, Version 2.0 (ALv2).
