August 10, 2011

Code for screen scraping with Html Agility Pack

Html Agility Pack is a .NET code library that allows you to parse "out of the web" HTML files.
If you want to scrape some data from an HTML file over the web, this is one of the easiest solutions.
The only problem is that you are dependent on a third party: if they change the structure of the page, you have to rework your code to match the changes.

Below is the sample code I wrote for this.
1.) To use Html Agility Pack, download it from the path below

2.) Add a reference to HtmlAgilityPack.dll, found in the folder downloaded above, in the bin folder of your ASP.NET solution.


3.) See the code below to read the data using this library in your class.

using System;
using System.Data;
using System.Configuration;
using System.Collections;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;
using System.IO;
using System.Net;
using HtmlAgilityPack;
public partial class GetGoldPrice : System.Web.UI.Page
{
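    protected void Page_Load(object sender, EventArgs e)
    {
        FindDIVFromHTMLOnWeb();
    }

    // Downloads the page, locates one div by its id and writes its text to the response.
    private void FindDIVFromHTMLOnWeb()
    {
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.testdomain.com");

        // Replace the placeholder with the id of the div you want to fetch.
        HtmlNode dataNode = doc.DocumentNode.SelectSingleNode("//div[@id='[id of the div to be fetched]']");

        // SelectSingleNode returns null when nothing matches, for example after the site changes its markup.
        if (dataNode == null)
        {
            Response.Write("The requested div was not found.");
            return;
        }

        string data = dataNode.InnerText;
        Response.Write(data);
    }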
}
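
Since the scraped site can change its structure at any time, it may help to try more than one XPath expression and to handle the "not found" case explicitly. The following is only a minimal sketch of that idea, not a definitive implementation. The class name DivScraper, the method ExtractText, the sample markup and both div ids are hypothetical and chosen purely for illustration. It parses HTML from a string with LoadHtml so it can be run without network access.

```csharp
using System;
using HtmlAgilityPack;

public static class DivScraper
{
    // Hypothetical helper: returns the trimmed, entity-decoded text of the first
    // node matched by any of the candidate XPath expressions, or null when none match.
    public static string ExtractText(string html, params string[] candidateXPaths)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (string xpath in candidateXPaths)
        {
            // SelectSingleNode returns null when the expression matches nothing.
            HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
            if (node != null)
            {
                return HtmlEntity.DeEntitize(node.InnerText).Trim();
            }
        }

        return null;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        string html = "<html><body><div id='price'> 1,650 &amp; rising </div></body></html>";

        string text = DivScraper.ExtractText(
            html,
            "//div[@id='gold-price']",  // hypothetical old id, no longer in the page
            "//div[@id='price']");      // hypothetical fallback id

        Console.WriteLine(text ?? "(not found)");
    }
}
```

Listing the selectors in order of preference keeps the page-specific knowledge in one place, so a change on the third-party site usually means editing only the list of expressions.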

1 comment:

  1. Hello all,

    This document explains how to do HTML screen scraping. In effect it shows how to treat the Web as a resource by enabling you to retrieve and extract data from HTML Web pages. Thanks a lot....
