DEVBOX10/meziantou-WebCrawler

 
 

WebCrawler

WebCrawler extracts all accessible URLs from a website. It is built with .NET Core and .NET Standard 1.4, so you can host it anywhere (Windows, Linux, Mac).

The crawler does not use regular expressions to find links. Instead, web pages are parsed with AngleSharp, a parser built on the official W3C specification. This lets the crawler interpret pages as a browser would and correctly handle tricky tags such as base.
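
To illustrate why the base tag matters, here is a minimal sketch of browser-style link resolution. It is not the crawler's own code (which relies on AngleSharp in C#); the function name `resolveLink` and the example URLs are hypothetical, and the sketch only uses the standard `URL` class.

```typescript
// Resolve a link the way a browser would: a <base href> (if present)
// replaces the page URL as the base for relative references.
// `resolveLink` is a hypothetical helper, not part of WebCrawler.
function resolveLink(pageUrl: string, baseHref: string | null, link: string): string {
  // The <base href> value may itself be relative to the page URL.
  const base = baseHref !== null ? new URL(baseHref, pageUrl) : new URL(pageUrl);
  return new URL(link, base).href;
}

// Without <base>, the link is relative to the page's own directory.
console.log(resolveLink("https://example.com/docs/page.html", null, "img/logo.png"));
// With <base href="/assets/">, the same link points somewhere else.
console.log(resolveLink("https://example.com/docs/page.html", "/assets/", "img/logo.png"));
```

A crawler that joined the link with the page URL directly would report the first address even when the page declares a base, which is the kind of error a specification-based parser avoids.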

For HTML files, URLs are extracted from:

  • <a href="...">
  • <area href="...">
  • <audio src="...">
  • <iframe src="...">
  • <img src="...">
  • <img srcset="...">
  • <link href="...">
  • <object data="...">
  • <script src="...">
  • <source src="...">
  • <source srcset="...">
  • <track src="...">
  • <video src="...">
  • <video poster="...">
  • <... style="..."> (see CSS section)
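
Most attributes in the list above hold a single URL, but srcset holds a comma-separated list of candidates, each a URL optionally followed by a width or density descriptor. The sketch below shows one simplified way to pull the URLs out of such a value. It is an illustration only, not the crawler's implementation: the name `parseSrcset` is hypothetical, and the full HTML parsing algorithm handles more edge cases (for example, parentheses inside descriptors).

```typescript
// Extract the URLs from a srcset value such as "small.jpg 480w, large.jpg 2x".
// Simplified sketch; `parseSrcset` is a hypothetical helper.
function parseSrcset(srcset: string): string[] {
  const isSpace = (c: string): boolean => c !== "" && " \t\r\n\f".includes(c);
  const urls: string[] = [];
  let i = 0;
  while (i < srcset.length) {
    // Skip whitespace and commas that separate candidates.
    while (i < srcset.length && (isSpace(srcset.charAt(i)) || srcset.charAt(i) === ",")) i++;
    if (i >= srcset.length) break;
    // The URL is the next run of non-whitespace characters (it may contain commas).
    const start = i;
    while (i < srcset.length && !isSpace(srcset.charAt(i))) i++;
    let url = srcset.slice(start, i);
    if (url.endsWith(",")) {
      // No descriptor follows: the trailing comma ends this candidate.
      while (url.endsWith(",")) url = url.slice(0, -1);
    } else {
      // Skip the descriptor (for example "480w" or "2x") up to the next comma.
      while (i < srcset.length && srcset.charAt(i) !== ",") i++;
    }
    urls.push(url);
  }
  return urls;
}

console.log(parseSrcset("small.jpg 480w, large.jpg 1080w"));
```

Each returned URL would still need to be resolved against the document's base URL before being crawled.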

For CSS files, URLs are extracted from:

  • rule: url(...)
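
As a rough illustration of what extracting url(...) values involves, the sketch below scans a stylesheet (or the value of a style attribute) for url( tokens and returns their contents. It is an assumption-laden simplification, not the crawler's implementation: the name `extractCssUrls` is hypothetical, and a real CSS parser also handles comments, escape sequences, and identifiers that merely end in "url".

```typescript
// Collect the contents of url(...) tokens from CSS text.
// Simplified sketch; `extractCssUrls` is a hypothetical helper.
function extractCssUrls(css: string): string[] {
  const urls: string[] = [];
  let i = 0;
  while (i < css.length) {
    // Look for "url(" case-insensitively at the current position.
    if (css.slice(i, i + 4).toLowerCase() !== "url(") {
      i++;
      continue;
    }
    i += 4;
    // Skip whitespace after the opening parenthesis.
    while (i < css.length && " \t\r\n\f".includes(css.charAt(i))) i++;
    const quote = css.charAt(i);
    let end: number;
    let value: string;
    if (quote === '"' || quote === "'") {
      // Quoted form: the URL runs to the matching quote.
      end = css.indexOf(quote, i + 1);
      if (end === -1) break; // unterminated string, stop scanning
      value = css.slice(i + 1, end);
    } else {
      // Unquoted form: the URL runs to the closing parenthesis.
      end = css.indexOf(")", i);
      if (end === -1) break; // unterminated token, stop scanning
      value = css.slice(i, end).trim();
    }
    if (value !== "") urls.push(value);
    i = end + 1;
  }
  return urls;
}

console.log(extractCssUrls('body { background: url("img/bg.png"); } a { cursor: url(hand.cur), auto; }'));
```

The same function could be applied to the content of a style attribute, since inline styles use the same url(...) syntax.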


How to deploy on Azure (free)

You can deploy the website on Azure for free:

  1. Create a free Web App
  2. Enable WebSockets in Application Settings (see "Introduction to WebSockets on Windows Azure Web Sites" and "Using Web Sockets with ASP.NET Core")
  3. Deploy the website using WebDeploy or FTP

Blog posts

Some parts of the code are explained in blog posts:
