
SimpleWebCrawler

This tool does two things:

  1. makes an HTTP request and retrieves the HTML response
  2. parses the HTML and searches it for URL links

These two steps are repeated for the found URL links until the crawl reaches its limits, which are defined as follows (a minimal sketch of the loop follows this list):

  1. URL processing depth - how deep to crawl; e.g. a depth of 2 means the initial link(s) are loaded/parsed, and then the links found on the initial link(s) are loaded/parsed
  2. maximum processed URLs - a limit on the total number of HTTP requests
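The following is a minimal sketch of that load/parse loop with both limits applied. It is hypothetical illustration code, not the engine's actual implementation; the regex-based link extraction in particular is only a stand-in for real HTML parsing.

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class CrawlLoopSketch
{
    static readonly HttpClient Http = new HttpClient();
    // naive href extraction, purely for illustration
    static readonly Regex LinkPattern =
        new Regex(@"href\s*=\s*""(https?://[^""]+)""", RegexOptions.IgnoreCase);

    public static async Task Crawl(IEnumerable<string> seedUrls, int maxDepth, int maxRequests)
    {
        var current = new List<string>(seedUrls);
        var visited = new HashSet<string>();
        var requests = 0;

        for (var depth = 0; depth < maxDepth && current.Count > 0; depth++)  // limit 1: processing depth
        {
            var next = new List<string>();
            foreach (var url in current)
            {
                if (requests >= maxRequests) return;        // limit 2: max number of HTTP requests
                if (!visited.Add(url)) continue;            // skip URLs that were already processed

                var html = await Http.GetStringAsync(url);  // step 1: load the HTML response
                requests++;

                foreach (Match match in LinkPattern.Matches(html))  // step 2: parse out the URL links
                    next.Add(match.Groups[1].Value);
            }
            current = next;  // the found links become the input of the next depth
        }
    }
}
```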

The solution is split into two projects:

  1. the engine that does the actual crawling
  2. the console application that serves as an example of how to consume the engine API

Engine features

The engine makes use of the .NET Task Parallel Library (TPL): a new task is created for each load/parse job. The engine exposes the following API (a sketch of this surface follows the list):

  1. events for:
  • entering another processing depth
  • completion of loading/parsing a URL (along with the HTML, the URL and the found URL links)
  • errors (along with the URL and the error message)
  2. the final result set of the parsed URLs and the found URL links
  3. the ability to cancel the processing via a cancellation token
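Below is a hypothetical sketch of what such an engine surface could look like: one task per load/parse job, notification events, and cooperative cancellation. The names (CrawlerEngineSketch, DepthEntered, UrlProcessed, ErrorOccurred, RunAsync) are illustrative assumptions, not the repository's actual API.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class CrawlerEngineSketch
{
    public event Action<int> DepthEntered;                                    // entering another processing depth
    public event Action<string, string, IReadOnlyList<string>> UrlProcessed;  // url, html, found links
    public event Action<string, string> ErrorOccurred;                        // url, error message

    readonly HttpClient http = new HttpClient();

    public async Task<IDictionary<string, IReadOnlyList<string>>> RunAsync(
        IEnumerable<string> seeds, int maxDepth, CancellationToken token)
    {
        var results = new ConcurrentDictionary<string, IReadOnlyList<string>>();
        var current = seeds.Distinct().ToList();

        for (var depth = 0; depth < maxDepth && current.Count > 0; depth++)
        {
            DepthEntered?.Invoke(depth);
            var found = new ConcurrentBag<string>();

            // one task per load/parse job, in the spirit of the TPL approach described above
            var jobs = current.Select(url => Task.Run(async () =>
            {
                try
                {
                    var html = await http.GetStringAsync(url);
                    var links = ExtractLinks(html);                  // HTML parsing is out of scope here
                    results[url] = links;
                    foreach (var link in links) found.Add(link);
                    UrlProcessed?.Invoke(url, html, links);
                }
                catch (Exception ex)
                {
                    ErrorOccurred?.Invoke(url, ex.Message);
                }
            }, token)).ToList();

            await Task.WhenAll(jobs);
            token.ThrowIfCancellationRequested();                    // honour cancellation between depths

            current = found.Distinct().Where(u => !results.ContainsKey(u)).ToList();
        }
        return results;
    }

    static IReadOnlyList<string> ExtractLinks(string html) => Array.Empty<string>();  // placeholder
}
```

A console application would then subscribe to these events, pass a CancellationTokenSource.Token into RunAsync, and call Cancel() on the source (e.g. on Ctrl+C) to stop the crawl early.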

Areas of use
