Skip to content

Nord001/htmlsharp

 
 

Repository files navigation

HtmlSharp

HtmlSharp is a C# library for parsing HTML that allows querying of the DOM with css selectors.

Usage

Let’s say you have the following html stored in a string named html.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title>Sample HTML</title>
  </head>
  <body>
    <p class="important">Make sure you read this part.</p>
    <p>Some info about stuff.</p>
    <p align="left" id="copyright">Copyright data.</p>
  </body>
</html>

To be able to query this document, we’ll need to parse it:

HtmlParser parser = new HtmlParser();
HtmlDocument doc = parser.Parse(html);

From there, we can use css selectors to find the information we want from the document.

Finding the tag with the id of copyright:

Tag t = doc.Find("#copyright");

Automatic casting to access attributes statically:

P paragraph = doc.Find<P>("#copyright");
string align = paragraph.Align; // "left"
string class = paragraph.Class; // null

Finding multiple tags:

IEnumerable<Tag> paragraphs = doc.FindAll("p"); // Returns the 3 paragraph objects it finds

Supported selectors

Pattern Meaning
* any element
E an element of type E
E[foo] an E element with a “foo” attribute
E[foo=“bar”] an E element whose “foo” attribute value is exactly equal to “bar”
E[foo~=“bar”] an E element whose “foo” attribute value is a list of space-separated values, one of which is exactly equal to “bar”
E[foo^=“bar”] an E element whose “foo” attribute value begins exactly with the string “bar”
E[foo$=“bar”] an E element whose “foo” attribute value ends exactly with the string “bar”
E[foo*=“bar”] an E element whose “foo” attribute value contains the substring “bar”
E[hreflang|=“en”] an E element whose “hreflang” attribute has a hyphen-separated list of values beginning (from the left) with “en”
E:root an E element, root of the document
E:nth-child(n) an E element, the n-th child of its parent
E:nth-last-child(n) an E element, the n-th child of its parent, counting from the last one
E:nth-of-type(n) an E element, the n-th sibling of its type
E:nth-last-of-type(n) an E element, the n-th sibling of its type, counting from the last one
E:first-child an E element, first child of its parent
E:last-child an E element, last child of its parent
E:first-of-type an E element, first sibling of its type
E:last-of-type an E element, last sibling of its type
E:only-child an E element, only child of its parent
E:only-of-type an E element, only sibling of its type
E:empty an E element that has no children (including text nodes)
E:lang(fr) an element of type E in language “fr” (the document language specifies how language is determined)
E:disabled a user interface element E which is disabled
E:enabled a user interface element E which is enabled
E:checked a user interface element E which is checked (for instance a radio-button or checkbox)
E.warning an E element whose class is “warning” (the document language specifies how class is determined)
E#myid an E element with ID equal to “myid”
E:not(s) an E element that does not match simple selector s
E F an F element descendant of an E element
E > F an F element child of an E element
E + F an F element immediately preceded by an E element
E ~ F an F element preceded by an E element

About

A C# HTML parsing library supporting CSS selectors for querying.

Resources

Stars

Watchers

Forks

Packages

No packages published