jaensen/graph-crawler

A simple crawler that searches a downloaded text for occurrences of nodes from a graph.
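
To illustrate the idea, here is a minimal Python sketch of matching graph nodes against a text and grouping the hits by node type. It is not the service's actual code: the function name `find_graph_nodes`, the whole-word matching, and the case-insensitive comparison are all assumptions made for the example.

```python
import re
from collections import defaultdict


def find_graph_nodes(text, nodes):
    """Group node labels that occur in `text` by node type.

    `nodes` is an iterable of (label, type) pairs, e.g. ("Canada", "Instance").
    Assumption: labels are matched as whole words, ignoring case. The real
    service may match differently.
    """
    found = defaultdict(list)
    for label, node_type in nodes:
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found[node_type].append(label)
    return dict(found)


# Hypothetical sample data for illustration only.
nodes = [("Canada", "Instance"), ("Country", "Class"), ("Norway", "Instance")]
text = "Canada is a country in North America."
print(find_graph_nodes(text, nodes))
```

With the sample data above, "Canada" ends up under 'Instance' and "Country" under 'Class', while "Norway" is not reported because it does not occur in the text.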

  • crawl?url={url}
    downloads a web page and searches for words that appear both in the page and in the graph. Found words are sorted by classes and instances.

  • addNode?label={label}&type={type}
    adds a node to the graph. (In principle the type can be anything, but in practice it is currently limited to 'Class' and 'Instance'.)

  • addEdge?fromNode={fromNode}&toNode={toNode}&predicate={predicate}
    adds a new edge to the graph.

  • findNodes?labelStartsWith={labelStartsWith}
    finds nodes by matching the first characters of a node's label.

  • findEdges?labelStartsWith={labelStartsWith}
    finds edges by matching the first characters of an edge's label.

  • findEdgesFromNode?fromNode={fromNode}
    finds all edges which lead from the given node to another

  • findEdgesToNode?toNode={toNode}
    finds all edges which lead to the given node

  • loadNodes?nodes={nodes}
    loads all node data for all given node IDs

  • loadEdges?edges={edges}
    loads all edge data for all given edge IDs

  • loadClientCode/apps/{app}
    loads static files from the configured "client_code" directory
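
The operations above take their arguments as query parameters, so request URLs can be assembled with a small helper. The following Python sketch only builds the URLs and does not send anything. The base address `http://localhost:8080` and the helper name `build_url` are hypothetical; the real host, port, and path prefix depend on how the service is hosted. Passing node IDs to `fromNode` and `toNode` is also an assumption.

```python
from urllib.parse import urlencode

# Hypothetical base address; adjust to wherever the service is actually hosted.
BASE_URL = "http://localhost:8080"


def build_url(operation, **params):
    """Return the request URL for an operation, URL-encoding all parameters."""
    query = urlencode(params)
    if query:
        return f"{BASE_URL}/{operation}?{query}"
    return f"{BASE_URL}/{operation}"


# Parameter names are taken from the operation list above.
print(build_url("addNode", label="Canada", type="Instance"))
# Assumption: fromNode and toNode take node IDs.
print(build_url("addEdge", fromNode=85, toNode=99, predicate="lies-in"))
print(build_url("findNodes", labelStartsWith="Can"))
print(build_url("crawl", url="http://example.org/some page"))
```

Encoding matters most for `crawl`, because the page address passed in `url` contains characters such as `:` and `/` that would otherwise be misread as part of the request itself.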

Prerequisites

  • Mono/.NET 4.5

Configuration
All configuration is done in the Liv.io.GraphCrawler.ControlService's App.config file.

Parameters are:

  • DataDirectory: A string which points to the directory where the graph-files and downloaded resources should be stored
  • EdgesFilename: The name of the file in which to store the edges
  • NodesFilename: The name of the file in which to store the nodes
  • ResourcesTableFilename: The name of the file in which to store the resource entries (metadata for downloaded sites)
  • ResourcesFolder: The name of the folder in which the downloaded resources should be stored (must lie within the DataDirectory)
  • ClientCodeFolder: A folder in which static files can be stored. They can be delivered directly by the application.

File format
The tool uses CSV files for persistence. All columns are separated by pipes '|'.

nodes.csv Id|Label|Type

Example row:
694|Canada|Instance

edges.csv Id|Source|Target|Type|Label|Weight

Example row:
196|85|99|Directed|lies-in|1

resources.csv Uri|Title|FilesystemLocation

Example row:
http://de.wikipedia.org/wiki/Markdow|Markdown|/var/crawler/resources/5b7c8499-6c32-478c-abd6-cf33153d0967
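
Because the files are plain pipe-separated text, they can be read with any CSV library. Below is a minimal Python sketch using the standard `csv` module. The column names come from the descriptions above; that the files contain no header row and no quoted fields is an assumption, and `read_rows` is a hypothetical helper name.

```python
import csv
import io

# Column names as documented above. Assumption: the files have no header row.
NODE_FIELDS = ["Id", "Label", "Type"]
EDGE_FIELDS = ["Id", "Source", "Target", "Type", "Label", "Weight"]
RESOURCE_FIELDS = ["Uri", "Title", "FilesystemLocation"]


def read_rows(stream, fieldnames):
    """Parse pipe-separated lines from a text stream into a list of dicts."""
    reader = csv.DictReader(
        stream,
        fieldnames=fieldnames,
        delimiter="|",
        quoting=csv.QUOTE_NONE,  # assumption: fields are never quoted
    )
    return list(reader)


# The example rows from above, read from in-memory streams instead of files.
nodes = read_rows(io.StringIO("694|Canada|Instance\n"), NODE_FIELDS)
edges = read_rows(io.StringIO("196|85|99|Directed|lies-in|1\n"), EDGE_FIELDS)
resources = read_rows(
    io.StringIO(
        "http://de.wikipedia.org/wiki/Markdow|Markdown|"
        "/var/crawler/resources/5b7c8499-6c32-478c-abd6-cf33153d0967\n"
    ),
    RESOURCE_FIELDS,
)

print(nodes[0]["Label"], edges[0]["Source"], edges[0]["Target"])
print(resources[0]["Title"])
```

To read a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` stream. All values are returned as strings, so IDs and weights need an explicit `int()` or `float()` conversion before use.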

About

Small WCF service with a web frontend.
