ASPDotNetCoreCrawl

Web-based crawler that finds broken links on a site, based on https://github.com/hmol/LinkCrawler. Besides crawling, it also displays the words with the highest frequency on the site.

The technology is Microsoft ASP.NET Core 3, with SignalR for communication between the browser and the local web server.

This is a web-based crawler that searches for broken links on a domain the user specifies from the browser. The results are written both to the client browser and to other outputs (e.g. CSV).
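
As a rough illustration of the SignalR piece, a minimal hub might look like the sketch below. The hub and method names (CrawlHub, ReportLink) are assumptions for illustration, not the repository's actual identifiers.

    using Microsoft.AspNetCore.SignalR;
    using System.Threading.Tasks;

    // Hypothetical hub, not the repo's actual API: broadcasts each checked
    // link to all connected browsers, which listen for the "ReportLink" event.
    public class CrawlHub : Hub
    {
        public Task ReportLink(string url, int statusCode, bool isBroken)
        {
            return Clients.All.SendAsync("ReportLink", url, statusCode, isBroken);
        }
    }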

Build and Run

Make sure you have Microsoft .NET Core 3 (currently in preview) installed on Mac, Windows, or Linux

Then run

git clone https://github.com/firecodergithub/ASPDotNetCoreCrawl

cd ASPDotNetCoreCrawl

dotnet run

Then simply point your browser to http://localhost:5000 (it might redirect to HTTPS on port 5001 and complain about the self-signed certificate)
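
If the certificate warning gets in the way, the standard .NET Core CLI can register the local development certificate as trusted (supported on Windows and macOS):

    dotnet dev-certs https --trust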

Sample Screenshot

Configuration settings

These are located in the file crawlSettings.json; an illustrative sample follows the list below.

SuccessHttpStatusCodes: HTTP status codes that are considered "successful". Example: "1xx,2xx,302,303"
CheckImages: If true, <img src=...> links are also checked
ValidUrlRegex: Regex that URLs must match to be considered valid
OnlyReportBrokenLinksToOutput: If true, only broken links are reported to the output
Csv.FilePath: File path for the CSV output file
Csv.Overwrite: Whether to overwrite the file or append to it (if the file exists)
Csv.Delimiter: Delimiter between columns in the CSV file (e.g. ',' or ';')
PrintSummary: If true, a summary is printed once all links have been checked
TimeMsBetweenRequests: How long, in milliseconds, the app waits before the next request to the crawled site
TopWordsCount: How many of the most frequent words to display (default: 20)
RemoveStopWords: Whether to exclude stop words from the top-words list (default: true)
languageStopWords: List of site extensions, each mapped to the name of its stop-words file
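
For orientation, a crawlSettings.json along these lines would exercise the keys above. The values are illustrative, and the exact shape (whether the Csv.* keys nest under a Csv object, and how languageStopWords entries are structured) is an assumption inferred from the key names, not taken from the repository:

    {
      "SuccessHttpStatusCodes": "1xx,2xx,302,303",
      "CheckImages": true,
      "ValidUrlRegex": "^https?://",
      "OnlyReportBrokenLinksToOutput": false,
      "Csv": {
        "FilePath": "./crawl-result.csv",
        "Overwrite": true,
        "Delimiter": ";"
      },
      "PrintSummary": true,
      "TimeMsBetweenRequests": 100,
      "TopWordsCount": 20,
      "RemoveStopWords": true,
      "languageStopWords": [
        { "extension": ".de", "file": "stopwords-de.txt" }
      ]
    }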

Basic workflow

Once the browser connects to the local web server, enter the root URL (full URL, including http(s)://), then press Enter or click Crawl

A live list of crawled links is displayed, and the broken ones are appended to a table in the browser (and also saved to the CSV output file at the web server's location)
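
A hedged sketch of what the CSV side could look like, using the Csv.Delimiter setting described above; the class and method are illustrative, not the repository's actual code:

    using System;
    using System.IO;

    // Hypothetical helper, not the repo's implementation: appends one
    // result row per broken link, joining columns with Csv.Delimiter.
    public static class CsvOutput
    {
        public static void AppendRow(string filePath, string delimiter,
                                     string url, int statusCode)
        {
            File.AppendAllText(filePath, url + delimiter + statusCode + Environment.NewLine);
        }
    }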

The web server used by dotnet is Kestrel. If you need to host the app on IIS or IIS Express, look up the relevant configuration steps on the web

The app is also deployable via Docker to Heroku through the included Dockerfile. A sample free deployment is at https://crawlwords.herokuapp.com (it might be offline due to Heroku's free-tier policy)