WebCrawler

WebCrawler extracts all accessible URLs from a website. It's built using .NET Core and .NET Standard 1.4, so you can host it anywhere (Windows, Linux, Mac).

The crawler does not use regular expressions to find links. Instead, web pages are parsed using AngleSharp, a parser built upon the official W3C specification. This lets the crawler parse pages the way a browser does and handle tricky tags such as base, which changes how relative URLs in the page are resolved.
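To illustrate why the base tag matters, here is a minimal sketch of base-aware link resolution. It is not the crawler's code: the project is written in C# on top of AngleSharp, while this sketch uses Python's standard-library `html.parser` purely for illustration. The names `BaseAwareLinkParser` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseAwareLinkParser(HTMLParser):
    """Collects <a href> values and resolves them against <base href> if present."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.base_url = page_url  # replaced by the first <base href> found
        self.base_seen = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is None:
            return
        if tag == "base" and not self.base_seen:
            # Only the first <base href> counts; it may itself be relative to the page URL.
            self.base_url = urljoin(self.page_url, href)
            self.base_seen = True
        elif tag == "a":
            self.hrefs.append(href)

    def links(self):
        # Resolve at the end so the base applies to every link in the page.
        return [urljoin(self.base_url, href) for href in self.hrefs]


def extract_links(html, page_url):
    """Hypothetical helper: return absolute URLs for every <a href> in the page."""
    parser = BaseAwareLinkParser(page_url)
    parser.feed(html)
    parser.close()
    return parser.links()
```

With a page at `https://example.com/blog/post.html` that declares `<base href="/docs/">`, a link to `intro.html` resolves under `/docs/` rather than `/blog/`. A pattern match on `href` values alone would miss that.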

For HTML files, URLs are extracted from:

<a href="...">
<area href="...">
<audio src="...">
<iframe src="...">
<img src="...">
<img srcset="...">
<link href="...">
<object data="...">
<script src="...">
<source src="...">
<source srcset="...">
<track src="...">
<video src="...">
<video poster="...">
<... style="..."> (see CSS section)
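The list above can be read as a lookup table of tag/attribute pairs. The sketch below shows one way to drive extraction from such a table. It uses Python's standard library for illustration only; the crawler itself relies on AngleSharp, and the names `URL_ATTRIBUTES`, `split_srcset`, `UrlCollector` and `collect_urls` are hypothetical. The `srcset` handling is simplified: it splits on commas, so it would mishandle URLs that themselves contain commas.

```python
from html.parser import HTMLParser

# Tag/attribute pairs taken from the list above.
URL_ATTRIBUTES = {
    ("a", "href"), ("area", "href"), ("audio", "src"), ("iframe", "src"),
    ("img", "src"), ("link", "href"), ("object", "data"), ("script", "src"),
    ("source", "src"), ("track", "src"), ("video", "src"), ("video", "poster"),
}
SRCSET_ATTRIBUTES = {("img", "srcset"), ("source", "srcset")}


def split_srcset(value):
    """Return the URL of each srcset candidate, dropping descriptors such as 2x or 480w."""
    urls = []
    for candidate in value.split(","):
        parts = candidate.split()
        if parts:
            urls.append(parts[0])
    return urls


class UrlCollector(HTMLParser):
    """Gathers raw URL values and inline style strings from start tags."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.inline_styles = []  # to be scanned for url(...) as CSS

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue  # attribute without a value, e.g. <a href>
            if (tag, name) in URL_ATTRIBUTES:
                self.urls.append(value)
            elif (tag, name) in SRCSET_ATTRIBUTES:
                self.urls.extend(split_srcset(value))
            elif name == "style":
                self.inline_styles.append(value)


def collect_urls(html):
    """Hypothetical helper: return (raw URLs, inline style strings) found in the HTML."""
    collector = UrlCollector()
    collector.feed(html)
    collector.close()
    return collector.urls, collector.inline_styles
```

The returned URLs are left unresolved; a real crawler would still need to make them absolute against the page or base URL.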

For CSS files, URLs are extracted from:

rule: url(...)
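As a rough illustration of what extracting `url(...)` values involves, the sketch below scans a CSS string by hand. It is an assumption-laden simplification written in Python, not the crawler's implementation: it ignores comments, escaped characters and quoted URLs that contain a closing parenthesis, all of which a real CSS parser handles. The function name `extract_css_urls` is hypothetical.

```python
def extract_css_urls(css):
    """Return the argument of every url(...) found in a CSS string (simplified scan)."""
    urls = []
    lowered = css.lower()  # CSS function names are case-insensitive
    position = 0
    while True:
        start = lowered.find("url(", position)
        if start == -1:
            break
        end = css.find(")", start)
        if end == -1:
            break  # unterminated url(; stop scanning
        raw = css[start + 4:end].strip()
        # Strip one pair of matching quotes, if present.
        if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
            raw = raw[1:-1]
        if raw:
            urls.append(raw)
        position = end + 1
    return urls
```

The same function can be applied to the value of an inline `style` attribute, since that value is itself a list of CSS declarations.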

How to deploy on Azure (free)

You can deploy the website on Azure for free:

Create a free Web App
Enable WebSockets in Application Settings (see "Introduction to WebSockets on Windows Azure Web Sites" and "Using Web Sockets with ASP.NET Core")
Deploy the website using WebDeploy or FTP

Blog posts

Some parts of the code are explained in blog posts:
