FengYunWorkstation/DotnetSpider

DotnetSpider

Build Status | NuGet | Member project of .NET Core Community | GitHub license

DotnetSpider is a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling and scraping framework.

DESIGN

                                  +----------------------+  +----------------------+      
                                  | Download Center      |  | Statistics Center    |    
+----------------------+          +----------^-----------+  +----------^-----------+  
| Downloader Agent 1   +----+                |                         |                               
+----------------------+    |                |                         |                
                            |     +----------v-------Message Queue-----v-----------+    +------------- Scheduler-------------------+
+----------------------+    |     |  +-------+       +----------+       +-------+  |    |  +-------+    +-------+    +----------+  |
| Downloader Agent 2   +----+<---->  | Local |       | RabbitMq |       | Kafka |  |    |  | Local |    | Redis |    | Database |  |
+----------------------+    |     |  +-------+       +----------+       +-------+  |    |  +-------+    +-------+    +----------+  |
                            |     +-----------------------^------------------------+    +-------------------^----------------------+   
+----------------------+    |                             |                                                 |
| Downloader Agent 3   +----+                             |                                                 |
+----------------------+          +-------Spider----------v--------------------------+                      |
                                  |    +-----------------+  +--------------------+   |                      |
                                  |    | SpeedController |  | RequestSupply      |   |                      |
                                  |    +-----------------+  +--------------------+   <----------------------+             
                                  |    +----------------------------+  +----------+  |                      |
                                  |    | Configure Request delegate |  | DataFlow |  |                      |
                                  |    +----------------------------+  +----------+  |                      |       
                                  +--------------------------------------------------+          +-----------v--------------+
                                                                                                |  MySql, SqlServer, etc   |
                                                                                                +-----------+--------------+
                                                                                                            |
                                                                                                            |
                                                                                                +-----------v--------------+
                                                                                                |        ClickHouse        |
                                                                                                +--------------------------+     
                                                                                                                        

DEVELOPMENT ENVIRONMENT

OPTIONAL ENVIRONMENT

MORE DOCUMENTS

https://github.com/dotnetcore/DotnetSpider/wiki

SAMPLES

Please see the DotnetSpider.Sample project in the solution.

BASIC USAGE

Basic usage code

ADDITIONAL USAGE: Configurable Entity Spider

View the complete code

public class EntitySpider : Spider
{
    public static void Run()
    {
        var builder = new SpiderBuilder();
        builder.AddSerilog();
        builder.ConfigureAppConfiguration();
        builder.UseStandalone();
        builder.AddSpider<EntitySpider>();
        var provider = builder.Build();
        provider.Create<EntitySpider>().RunAsync();
    }

    protected override void Initialize()
    {
        NewGuidId();
        Scheduler = new QueueDistinctBfsScheduler();
        Speed = 1;
        Depth = 3;
        DownloaderSettings.Type = DownloaderType.HttpClient;
        AddDataFlow(new DataParser<BaiduSearchEntry>()).AddDataFlow(GetDefaultStorage());
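        // Each request carries a property "网站" (website) = "博客园" (cnblogs),
        // which the entity below reads back through SelectorType.Enviroment.
        AddRequests(
            new Request("https://news.cnblogs.com/n/page/1/", new Dictionary<string, string> {{"网站", "博客园"}}),
            new Request("https://news.cnblogs.com/n/page/2/", new Dictionary<string, string> {{"网站", "博客园"}}));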
    }

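    // The class-level ValueSelector is named "类别" (category); the Category
    // property below reads that name through SelectorType.Enviroment.
    [Schema("cnblogs", "cnblogs_entity_model")]
    [EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
    [ValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]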
    class BaiduSearchEntry : EntityBase<BaiduSearchEntry>
    {
        protected override void Configure()
        {
            HasIndex(x => x.Title);
            HasIndex(x => new {x.WebSite, x.Guid}, true);
        }

        public int Id { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "类别", Type = SelectorType.Enviroment)]
        public string Category { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "网站", Type = SelectorType.Enviroment)]
        public string WebSite { get; set; }

        // Strip the " - 博客园" (" - cnblogs") site suffix from the page title.
        [StringLength(200)]
        [ValueSelector(Expression = "//title")]
        [ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
        public string Title { get; set; }

        [StringLength(40)]
        [ValueSelector(Expression = "GUID", Type = SelectorType.Enviroment)]
        public string Guid { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
        public string News { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
        public string Url { get; set; }

        [ValueSelector(Expression = ".//div[@class='entry_summary']", ValueOption = ValueOption.InnerText)]
        public string PlainText { get; set; }

        [ValueSelector(Expression = "DATETIME", Type = SelectorType.Enviroment)]
        public DateTime CreationTime { get; set; }
    }

    public EntitySpider(
        IMessageQueue mq,
        IStatisticsService statisticsService,
        ISpiderOptions options,
        ILogger<Spider> logger,
        IServiceProvider services)
        : base(mq, statisticsService, options, logger, services)
    {
    }
}
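
The selectors in the sample follow a two-step pattern: the EntitySelector picks one node per record, and each ValueSelector expression starting with ".//" is evaluated relative to that node. The sketch below illustrates the pattern with the standard System.Xml.XPath extensions only. It is not DotnetSpider's implementation, and the XPathEntityDemo class and its SamplePage markup are hypothetical stand-ins invented for this example.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathEntityDemo
{
    // Hypothetical, well-formed stand-in for a crawled page (not real cnblogs markup).
    public const string SamplePage = @"<html><body>
  <div class='news_block'>
    <h2 class='news_entry'><a href='/n/1/'>First story</a></h2>
    <div class='entry_summary'>Summary <b>one</b></div>
  </div>
  <div class='news_block'>
    <h2 class='news_entry'><a href='/n/2/'>Second story</a></h2>
    <div class='entry_summary'>Summary two</div>
  </div>
</body></html>";

    public static List<(string News, string Url, string PlainText)> Extract(string xhtml)
    {
        var doc = XDocument.Parse(xhtml);
        var rows = new List<(string News, string Url, string PlainText)>();

        // Step 1: the entity expression yields one element per record.
        foreach (var block in doc.XPathSelectElements(".//div[@class='news_block']"))
        {
            // Step 2: value expressions are evaluated relative to that element.
            var link = block.XPathSelectElement(".//h2[@class='news_entry']/a");
            var summary = block.XPathSelectElement(".//div[@class='entry_summary']");

            // Attribute nodes are not elements, so they are read with XPathEvaluate.
            var href = ((IEnumerable)block.XPathEvaluate(".//h2[@class='news_entry']/a/@href"))
                .Cast<XAttribute>()
                .FirstOrDefault();

            // XElement.Value concatenates descendant text, similar in spirit to an inner-text option.
            rows.Add((link?.Value, href?.Value, summary?.Value));
        }

        return rows;
    }
}
```

Calling XPathEntityDemo.Extract(XPathEntityDemo.SamplePage) returns one tuple per news_block element. Real pages are rarely well-formed XML, so an HTML-tolerant parser would be needed in practice.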

Run via Startup

Command: -s [spider type name] -i [id] -a [arg1,arg2...] -d [true/false] -n [name] -c [configuration file]

1.  -s: Type name of the spider, for example: EntitySpider
2.  -i: Set the spider id
3.  -a: Pass arguments to the spider's Run method
4.  -n: Set the spider name
5.  -c: Set the config file path, for example to run with a custom config: -c app.my.config
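
To make the option format concrete, here is a minimal sketch of how "-flag value" pairs like those above could be parsed. It is an illustration only and not DotnetSpider's Startup code: the SpiderArgs class and its methods are hypothetical names, and the -d flag is accepted without interpreting it because its meaning is not described here.

```csharp
using System;
using System.Collections.Generic;

public static class SpiderArgs
{
    // Flags taken from the command line shown above.
    private static readonly HashSet<string> KnownFlags =
        new HashSet<string> { "-s", "-i", "-a", "-d", "-n", "-c" };

    // Parses alternating "-flag value" tokens into a dictionary keyed by flag.
    public static Dictionary<string, string> Parse(string[] args)
    {
        var result = new Dictionary<string, string>();
        for (var i = 0; i < args.Length; i += 2)
        {
            var flag = args[i];
            if (!KnownFlags.Contains(flag))
                throw new ArgumentException($"Unknown option: {flag}");
            if (i + 1 >= args.Length)
                throw new ArgumentException($"Missing value for option: {flag}");
            result[flag] = args[i + 1];
        }
        return result;
    }

    // Splits the comma-separated -a value into individual spider arguments.
    public static string[] SplitArguments(Dictionary<string, string> options)
    {
        return options.TryGetValue("-a", out var raw)
            ? raw.Split(',')
            : Array.Empty<string>();
    }
}
```

For example, parsing the tokens -s EntitySpider -a arg1,arg2 -c app.my.config yields three entries, and SplitArguments returns the two values arg1 and arg2.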

WebDriver Support

To collect a page whose content is loaded by JavaScript, you only need to set the downloader to WebDriverDownloader.

Downloader = new WebDriverDownloader(Browser.Chrome);

See a complete sample

NOTE:

  1. When using Chrome, make sure ChromeDriver.exe is in the bin folder; install it into your project from NuGet: Chromium.ChromeDriver
  2. When using Firefox, make sure you have already added a *.webdriver Firefox profile: https://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles
  3. When using PhantomJS, make sure PhantomJS.exe is in the bin folder; install it into your project from NuGet: PhantomJS

NOTICE

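When you use the Redis scheduler, please update your Redis config so that idle client connections are never closed and TCP keepalive probes are sent every 60 seconds: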

timeout 0
tcp-keepalive 60

Buy me a coffee

AREAS FOR IMPROVEMENT

QQ Group: 477731655
Email: zlzforever@163.com
