Using HtmlAgilityPack with XPath to parse HTML and scrape data
What's Html Agility Pack (HAP)?
It is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (no need to understand XPATH or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
HtmlAgilityPack is a free, open-source, and powerful C# library for parsing HTML. The latest release, 1.11.42, came out on 2022-02-04, so the project is still actively maintained.
Combined with XPath, it makes all kinds of data scraping straightforward.
HtmlAgilityPack official website: https://html-agility-pack.net/
Source code: https://github.com/zzzprojects/html-agility-pack/
A simple scraping tutorial:
https://www.cnblogs.com/asxinyu/p/CSharp_HtmlAgilityPack_XPath_Weather_Data.html
Introduction to XPath syntax:
https://www.w3school.com.cn/xpath/xpath_syntax.asp
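As a quick illustration of that syntax, here is a minimal sketch of a few common selector forms, using the same HtmlAgilityPack binding as the tests below (the HTML snippet is made up for illustration):

```aardio
import console;
import dotNet;

var dll = dotNet.load("\HtmlAgilityPack.dll");
var htmlDoc = dll.new("HtmlAgilityPack.HtmlDocument");
htmlDoc.LoadHtml(`<div class="list"><a href="/a">first</a><a href="/b">second</a></div>`);

var root = htmlDoc.DocumentNode;
//select every <a> anywhere in the document
console.log( root.SelectNodes(`//a`).Count );
//select by attribute value
console.log( root.SelectSingleNode(`//div[@class="list"]`).Name );
//select by position (XPath indexes start at 1)
console.log( root.SelectSingleNode(`//a[2]`).InnerText );
console.pause(true);
```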
The following simple test parses an HTML string directly:
import console;
import dotNet;

var dll = dotNet.load("\HtmlAgilityPack.dll");
var htmlDoc = dll.new("HtmlAgilityPack.HtmlDocument");

var html = /*****
<!DOCTYPE html>
<html>
<body>
<h1>This is <b>bold</b> heading</h1>
<p>This is <u>underlined</u> paragraph</p>
<h2>This is <i>italic</i> heading</h2>
</body>
</html>
*****/

htmlDoc.LoadHtml(html);
var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//h1");
console.log(htmlBody.OuterHtml);
console.pause(true);
Test project download:
A scraping example: fetch the article titles and links from this forum.
import console;
import dotNet;

var dll = dotNet.load("\HtmlAgilityPack.dll");
var HtmlWeb = dll.new("HtmlAgilityPack.HtmlWeb");
var htmlDoc = HtmlWeb.Load("https://www.chengxu.xyz");

//select the containing region
var htmlNode = htmlDoc.DocumentNode.SelectSingleNode(`//div[@class="main-box home-box-list"]`);

//select the region of each article
var Nodes = htmlNode.SelectNodes(`//div[@class="item-content"]`);

//get each title and link URL
for(i=1;Nodes.Count;1){
	var node = Nodes[i].SelectSingleNode(`h2`);
	//title
	console.log( string.trim(node.InnerText) );
	//link
	console.log( node.SelectSingleNode(`a`).Attributes.Item["href"].Value );
}
console.pause(true);
In fact, XPath can make this even more concise: a single query fetches every article title and link directly.
//select the region of each article directly
var Nodes = htmlDoc.DocumentNode.SelectNodes(`//div[@class="main-box home-box-list"]/*/div[@class="item-content"]/h2`);

//get each title and link URL
for(i=1;Nodes.Count;1){
	//title
	console.log( string.trim(Nodes[i].InnerText) );
	//link
	console.log( Nodes[i].SelectSingleNode(`a`).Attributes.Item["href"].Value );
}
The output is exactly the same.
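One caveat worth knowing about the node-scoped query used in the first version: in HtmlAgilityPack, an XPath expression that begins with `//` always searches from the document root, even when `SelectNodes` is called on a sub-node. Prefixing it with a dot (`.//`) restricts the search to that node's subtree. A minimal sketch, assuming `htmlNode` was selected as in the example above:

```aardio
//searches the whole document, despite being called on htmlNode:
var all = htmlNode.SelectNodes(`//div[@class="item-content"]`);

//searches only inside htmlNode's subtree:
var scoped = htmlNode.SelectNodes(`.//div[@class="item-content"]`);
```

Here both forms happen to return the same nodes because the page only has one such list, which is why the original code still works.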
Reading and modifying attributes:
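Attributes follow the same pattern as the examples above: read a value through `Attributes`, and write one with `HtmlNode.SetAttributeValue`, which also adds the attribute if it does not yet exist. A minimal sketch (the HTML snippet is made up for illustration):

```aardio
import console;
import dotNet;

var dll = dotNet.load("\HtmlAgilityPack.dll");
var htmlDoc = dll.new("HtmlAgilityPack.HtmlDocument");
htmlDoc.LoadHtml(`<a id="home" href="https://example.com">demo</a>`);

var node = htmlDoc.DocumentNode.SelectSingleNode(`//a[@id="home"]`);

//read an attribute value
console.log( node.Attributes.Item["href"].Value );

//modify (or add) an attribute, then print the updated tag
node.SetAttributeValue("href","https://www.chengxu.xyz");
console.log( node.OuterHtml );
console.pause(true);
```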