
BREAKING CHANGE: the NAMESPACE and the NuGet package have changed

Simple Spider


A simple-to-implement and modular web spider written in C#. Multi-targets:

  • .NET 5.0
  • .NET Core 3.1
  • .NET Standard 2.1

Why should I use a "simple" spider instead of a full-stack framework?

The main focus of this project is to create a library that is simple to implement and operate.

It's lightweight, uses as few resources and as few libraries as possible.

Ideal scenarios:

  • Personal bots; want to know when something goes on sale or becomes available?
  • Lots of small projects; being easy to implement helps when creating small bots
  • A good number of small bots can be achieved with a few lines of code
  • With the new .NET 5.0 top-level statements, entire applications can be written with very little effort


Some advantages

  • Very simple to use and operate, ideal for lots of small projects or personal ones
  • Easy HTML filtering with HObject (an HtmlNode wrapper used similarly to JObject)
  • Internal conversion from HTML to XElement, no external tools needed
  • Automatic JSON parsing to JObject
  • Automatic JSON deserialization
  • Modular Parser engine (you can add your own parsers!)
    • JSON and XML already included
  • Modular Storage engine to easily save what you collect (you can add your own!)
  • Modular Caching engine (you can add your own!)
    • Stand-alone Cache engine included, no external software needed
  • Modular Downloader engine (you can add your own!)
    • WebClient with cookies or HttpClient download engines included
  • Easy import with NuGet

Installation

Install the SimpleSpider NuGet package: Install-Package RafaelEstevam.Simple.Spider
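
If you prefer the .NET CLI, the same package can be added with: dotnet add package RafaelEstevam.Simple.Spider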

Getting started

  1. Start a new console project and add the NuGet reference
  2. PM> Install-Package RafaelEstevam.Simple.Spider
  3. Create a class for your spider (or leave it in Program)
  4. Create a new instance of SimpleSpider
    1. Give it a name; cache and log will be saved with that name
    2. Give it a domain (your spider will not stray from it)
  5. Add a FetchCompleted event handler to implement your stuff
  6. Optionally give a first page with AddPage. If omitted, the home page of the domain will be used
  7. Call Execute()

void run()
{
    var spider = new SimpleSpider("QuotesToScrape", new Uri("http://quotes.toscrape.com/"));
    // Set the completed event to implement your stuff
    spider.FetchCompleted += fetchCompleted_items;
    // execute
    spider.Execute();
}
void fetchCompleted_items(object Sender, FetchCompleteEventArgs args)
{
    // explore the fetched page here
    // TIP: inspect args to see what's available

    var hObj = args.GetHObject();
    string[] quotes = hObj["span > .text"];
}
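
Note: indexing an HObject (for example hObj["span > .text"]) returns another HObject that converts implicitly to string[], which is why quotes can be declared that way; see the HObject examples below.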

TIP: Use the Simple.Tests project to see examples and poke around

Samples

All samples are inside the Simple.Tests folder; these are some of them:

Use XPath to select content

Use XPath to select HTML elements and filter data.

void run()
{
    var spider = new SimpleSpider("BooksToScrape", new Uri("http://books.toscrape.com/"));
    // callback to gather items, new links are collected automatically
    spider.FetchCompleted += fetchCompleted_items;
    // Ignore (cancel) the pages containing "/reviews/" 
    spider.ShouldFetch += (s, a) => { a.Cancel = a.Link.Uri.ToString().Contains("/reviews/"); };
    
    // execute from first page
    spider.Execute();
}
void fetchCompleted_items(object Sender, FetchCompleteEventArgs args)
{
    // ignore all pages except the catalogue
    if (!args.Link.ToString().Contains("/catalogue/")) return;
    // HObject also processes XPath
    var hObj = args.GetHObject();
    // collect book data
    var articleProd = hObj.XPathSelect("//article[@class=\"product_page\"]");
    if (articleProd.IsEmpty()) return; // not a book
    // Book info
    string sTitle = articleProd.XPathSelect("//h1");
    string sPrice = articleProd.XPathSelect("//p[@class=\"price_color\"]");
    string sStock = articleProd.XPathSelect("//p[@class=\"instock availability\"]").GetValue().Trim();
    string sDesc = articleProd.XPathSelect("p")?.GetValue(); // books can be description less
}

Below is the same example, but using HObject to select HTML elements.

void run() ... /* Same run() method */
void fetchCompleted_items(object Sender, FetchCompleteEventArgs args)
{
    // ignore all pages except the catalogue
    if (!args.Link.ToString().Contains("/catalogue/")) return;

    var hObj = args.GetHObject();
    // collect book data
    var articleProd = hObj["article > .product_page"]; // XPath: "//article[@class=\"product_page\"]"
    if (articleProd.IsEmpty()) return; // not a book
    // Book info
    string sTitle = articleProd["h1"];                 // XPath: "//h1"
    string sPrice = articleProd["p > .price_color"];   // XPath: "//p[@class=\"price_color\"]"
    string sStock = articleProd["p > .instock"].GetValue().Trim();// XPath "//p[@class=\"instock\"]"
    string sDesc =  articleProd.Children("p");         // XPath "p"
}

see full source

Use our HObject implementation to select content

Use an indexing-style object representation of the HTML document, similar to Newtonsoft's JObject.

 void run()
{
    // Get Quotes.ToScrape.com as HObject
    HObject hObj = FetchHelper.FetchResourceHObject(new Uri("http://quotes.toscrape.com/"));
    ...
    // Example 2
    // Get all Spans and filter by Class='text'
    HObject ex2 = hObj["span"].OfClass("text");
    // Supports css selector style, dot for Class
    HObject ex2B = hObj["span"][".text"];
    // Also supports css '>' selector style
    HObject ex2C = hObj["span > .text"];
    ...
    // Example 4
    // Get all Spans and filter by some arbitrary attribute
    //  Original HTML: <span class="text" itemprop="text">
    HObject ex4 = hObj["span"].OfWhich("itemprop", "text");
    ...
    //Example 9
    // Export values as strings, either with a method or implicitly
    string[] ex9A = hObj["span"].OfClass("text").GetValues();
    string[] ex9B = hObj["span"].OfClass("text");
    ...
    //Example 13
    // Gets Attribute's value
    string ex13 = hObj["footer"].GetClassValue();

    //Example 14
    // Chain query to specify item and then get Attribute Values
    // Gets Next Page Url
    string ex14A = hObj["nav"]["ul"]["li"]["a"].GetAttributeValue("href"); // Specify one attribute
    string ex14B = hObj["nav"]["ul"]["li"]["a"].GetHrefValue(); // directly
    // Multiple parameters can be passed as an array
    string ex14C = hObj["nav", "ul", "li", "a"].GetHrefValue();
    // Multiple parameters can be chained with ' > '
    string ex14D = hObj["nav > ul > li > a"].GetHrefValue();
}

see full source

Easy storage with StorageEngines

Store your data with attached Storage Engines; some are included!

void run()
{
    var iP = new InitializationParams()
                // Defines a Storage Engine
                // All stored items will be in spider folder as JsonLines
                .SetStorage(new Storage.JsonLinesStorage()); 
    var spider = new SimpleSpider("BooksToScrape", new Uri("http://books.toscrape.com/"), iP);
    // callback to gather items
    spider.FetchCompleted += fetchCompleted_items;
    // execute
    spider.Execute();
}
static void fetchCompleted_items(object Sender, FetchCompleteEventArgs args)
{
    // ignore all pages except the catalogue
    if (!args.Link.ToString().Contains("/catalogue/")) return;

    var tag = new Tag(args.GetDocument());
    var books = tag.SelectTags<Article>("//article[@class=\"product_page\"]");

    foreach (var book in books)
    {
        // process prices
        var priceP = book.SelectTag<Paragraph>(".//p[@class=\"price_color\"]");
        var price = priceP.InnerText.Trim();
        // Store name and prices
        (Sender as SimpleSpider).Storage.AddItem(args.Link, new
        {
            name = book.SelectTag("//h1").InnerText,
            price
        });
    }
}

see full source

Easy initialization with chaining

Initialize your spider easily with chaining and a good variety of options.

void run()
{
    var init = new InitializationParams()
        .SetCacher(new ContentCacher()) // Easy cache engine change
        .SetDownloader(new WebClientDownloader()) // Easy download engine change
        .SetSpiderStartupDirectory(@"D:\spiders\") // Default directory
        // create a json parser for our QuotesObject class
        .AddParser(new JsonDeserializeParser<QuotesObject>(parsedResult_event))
        .SetConfig(c => c.Enable_Caching()  // Already enabled by default
                         .Disable_Cookies() // Already disabled by default
                         .Disable_AutoAnchorsLinks()
                         .Set_CachingNoLimit() // Already set by default
                         .Set_DownloadDelay(5000));

    var spider = new SimpleSpider("QuotesToScrape", new Uri("http://quotes.toscrape.com/"), init);

    // add first page
    spider.AddPage(buildPageUri(1), spider.BaseUri);
    // execute
    spider.Execute();
}

see full source

Easy single resource fetch

Easy API polling for updates with a single resource fetch.

void run()
{
    var uri = new Uri("http://quotes.toscrape.com/api/quotes?page=1");
    var quotes = FetchHelper.FetchResourceJson<QuotesObject>(uri);
    // show the quotes deserialized
    foreach (var quote in quotes.quotes)
    {
        Console.WriteLine($"Quote: {quote.text}");
        Console.WriteLine($"       - {quote.author.name}");
        Console.WriteLine();
    }
}

see full source

Use Json to deserialize Quotes

JSON response? Get an event with your data already deserialized.

(Yes, the few lines below are fully functional examples!)

void run()
{
    var spider = new SimpleSpider("QuotesToScrape", new Uri("http://quotes.toscrape.com/"));
    // create a json parser for our QuotesObject class
    spider.Parsers.Add(new JsonDeserializeParser<QuotesObject>(parsedResult_event));
    // add first page /api/quotes?page={pageNo}
    spider.AddPage(buildPageUri(1), spider.BaseUri);
    // execute
    spider.Execute();
}
void parsedResult_event(object sender, ParserEventArgs<QuotesObject> args)
{
    // add next
    if (args.ParsedData.has_next)
    {
        int next = args.ParsedData.page + 1;
        (sender as SimpleSpider).AddPage(buildPageUri(next), args.FetchInfo.Link);
    }
    // process data (show on console)
    foreach (var q in args.ParsedData.quotes)
    {
        Console.WriteLine($"{q.author.name }: { q.text }");
    }
}

see full source
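
The QuotesObject class and the buildPageUri helper used above (and in the chaining sample) live in the full sample source. As a rough sketch only, assuming property names that mirror the quotes.toscrape.com API fields actually read by the code above, they could look like this:

// Sketch (assumed shapes): a DTO matching the fields used above
// and a helper that builds the API page address
public class QuotesObject
{
    public bool has_next { get; set; }
    public int page { get; set; }
    public QuoteItem[] quotes { get; set; }
}
public class QuoteItem
{
    public AuthorItem author { get; set; }
    public string text { get; set; }
}
public class AuthorItem
{
    public string name { get; set; }
}
// Builds http://quotes.toscrape.com/api/quotes?page={pageNo}
static Uri buildPageUri(int pageNo)
    => new Uri($"http://quotes.toscrape.com/api/quotes?page={pageNo}");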

Complete spider application with SQLite storage in less than 50 lines

Using .NET 5.0 top-level statements, you can create a complete application that crawls a site, collects your data, stores it in SQLite, and displays it on the console in less than 50 lines (including comments).

using System;
using RafaelEstevam.Simple.Spider;
using RafaelEstevam.Simple.Spider.Extensions;
using RafaelEstevam.Simple.Spider.Storage;

// Creates a new instance (can be chained in init)
var storage = new SQLiteStorage<Quote>();

// Initialize with a good set of configs
var init = InitializationParams.Default002().SetStorage(storage);

var spider = new SimpleSpider("QuotesToScrape", new Uri("http://quotes.toscrape.com/"), init);

Console.WriteLine($"The sqlite database is at {storage.DatabaseFilePath}");
Console.WriteLine($"The quotes are being stored in the table '{storage.TableNameOfT}'");

spider.FetchCompleted += spider_FetchCompleted;
spider.Execute();

// process each page
static void spider_FetchCompleted(object Sender, FetchCompleteEventArgs args)
{
    var hObj = args.GetHObject();
    // get all quotes, divs with class "quote"
    foreach (var q in hObj["div > .quote"])
    {
        var quote = new Quote()
        {
            Text = q["span > .text"].GetValue().HtmlDecode(),
            Author = q["small > .author"].GetValue().HtmlDecode(),
            Tags = string.Join(';', q["a > .tag"].GetValues())
        };
        // store them
        ((SimpleSpider)Sender).Storage.AddItem(args.Link, quote);
    }
}
class Quote
{
    public string Author { get; set; }
    public string Text { get; set; }
    public string Tags { get; set; }
}

Based on the SQLite module example, which has fewer than 70 lines of code ;-)

Some Helpers

  • FetchHelper: Fast single-resource fetch with lots of parsers
  • RequestHelper: Make requests (GETs and POSTs) easily
  • XmlSerializerHelper: Generic class to serialize and deserialize stuff using XML, an easy way to save what you collect without any database
  • CSV Helper: Read CSV files (even compressed) without external libraries
  • UriHelper: Manipulates parts of the Uri
  • XElement to Stuff: Extracts tables from a page into a DataTable

Giants' shoulders
