Friday, August 13, 2010

Scraping data from an ASP.NET MVC website using PHP cURL

Hello,

Last week I was working on a data scraping project in which I had to extract data from an ASP.NET MVC website using PHP cURL. I guess everyone knows about PHP cURL. It is built on libcurl, a library created by Daniel Stenberg that allows you to connect to and communicate with many different types of servers over many different protocols. You can get more information from this link.

So basically, the following code snippet shows how to use cURL with PHP for data scraping.

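$ch = curl_init("http://www.example.com");
// Return the response as a string instead of printing it directly.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$pageData = curl_exec($ch); // false on failure
if ($pageData === false) {
    echo "cURL error: " . curl_error($ch);
}
curl_close($ch);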

You will get the complete page (HTML plus data) in the $pageData variable. Now you can extract data from this variable using various PHP functions.
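
As one possible way to do that extraction, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The helper name extractText, the sample markup and the XPath query are all made up for illustration; in practice $pageData would come from curl_exec() and the query would depend on the target page's structure.

```php
<?php
// Hypothetical helper: returns the trimmed text of every node matching an XPath query.
function extractText(string $html, string $xpathQuery): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely well formed, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathQuery);
    if ($nodes === false) {
        return []; // invalid XPath expression
    }

    $results = [];
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// Sample markup standing in for the $pageData returned by curl_exec().
$pageData = '<html><body><table id="prices">'
    . '<tr><td class="name">Widget</td><td class="price">9.99</td></tr>'
    . '<tr><td class="name">Gadget</td><td class="price">19.50</td></tr>'
    . '</table></body></html>';

$names = extractText($pageData, '//td[@class="name"]');
print_r($names);
```

Regular expressions or plain string functions would also work for simple pages; a DOM parser just tends to cope better when the markup changes slightly.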

With cURL it is very easy to extract data from a static page. But what if we have a dynamic page, like an ASP.NET page? For example, suppose there is a combo box on the ASP.NET page, and when you select a different option in it, the page gets refreshed. The exact term is 'postback'. Some code in the page's code-behind file gets executed on the server, and the page is constructed again.

So how can we use PHP cURL for this? cURL supports various options for this, such as

curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);

Here you can pass all the POST fields, such as the option selected in the combo box, as an &-separated string in which each value is URL-encoded. For example:

postfield1=value1&postfield2=value2&postfield3=value3
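
Rather than concatenating that string by hand, it may be easier to let PHP build and encode it. The sketch below uses http_build_query() for that; the field names and values are placeholders, and postForm is a hypothetical helper that is defined here but not called, so nothing is actually sent.

```php
<?php
// Placeholder field names; use the ones actually posted by the target page.
$fields = [
    'postfield1' => 'value1',
    'postfield2' => 'value 2 & more',
    'postfield3' => 'value3',
];

// http_build_query() URL-encodes each name and value and joins the pairs with "&".
$post_string = http_build_query($fields, '', '&');
echo $post_string, "\n";

// Hypothetical helper that sends the string as the body of a POST request.
function postForm(string $url, string $postString)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postString);
    $response = curl_exec($ch); // string on success, false on failure
    curl_close($ch);
    return $response;
}
```

Encoding matters as soon as a value contains characters such as spaces, "&" or "=", which would otherwise be misread as separators.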

So you have to identify all the fields on the ASP.NET page that get posted. But what if you don't know anything about the ASP.NET page? How can you identify all the necessary fields that get posted on a postback?

Well, I used a tool called Fiddler, a web debugger. Once you install Fiddler, it tracks all of your web sessions and shows you the data for each one in views such as Headers, TextView, WebForms and XML.
After installing Fiddler, just browse through your ASP.NET website. Select various options from the combo box and analyze the data Fiddler captures. You will get all the information about the posted fields. Using this, you can build the post string for your cURL request and get all the data from the ASP.NET website.
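
On ASP.NET Web Forms pages, the posted fields you see in Fiddler will usually include hidden inputs such as __VIEWSTATE, and often __EVENTVALIDATION and __EVENTTARGET. Their values change from page to page, so it is generally safer to read them from the HTML of a fresh GET request than to hard-code them. The following is only a sketch of that idea: extractHiddenFields is a hypothetical helper, and the sample markup, the ddlCountry field and its value are invented for illustration.

```php
<?php
// Hypothetical helper: collects name => value for every hidden <input>,
// which on an ASP.NET Web Forms page includes __VIEWSTATE.
function extractHiddenFields(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"]') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Sample markup standing in for the HTML returned by the first GET request.
$pageData = '<form method="post" action="Default.aspx">'
    . '<input type="hidden" name="__VIEWSTATE" value="/wEPDwUKabc+def=" />'
    . '<input type="hidden" name="__EVENTVALIDATION" value="/wEWAgK/xyz=" />'
    . '<select name="ddlCountry"><option value="IN">India</option></select>'
    . '</form>';

// Start from the hidden fields, then add the values a browser would post
// when the combo box selection changes (assumed names and values).
$fields = extractHiddenFields($pageData);
$fields['__EVENTTARGET'] = 'ddlCountry';
$fields['ddlCountry'] = 'IN';

// Encode everything once; the result can be passed to CURLOPT_POSTFIELDS.
$post_string = http_build_query($fields, '', '&');
echo $post_string, "\n";
```

The viewstate value is Base64 text containing characters like "+", "/" and "=", so it has to be URL-encoded exactly once before posting. If the site uses sessions, the GET and the POST probably also need to share cookies, for example through the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options.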


2 comments:

  1. Have you ever tried logging into an ASP.NET website using php/curl and received a "Validation of viewstate MAC failed" error? It's an invalid viewstate error of some sort, but I'm capturing it and writing it using rawurlencode.

    I've been struggling with this for about a week now, with no luck. Please help, if you can.

  2. That's because view state is enabled on the ASP.NET page. You have to do some more work here: grab the viewstate value from the page and send it along with your PHP/cURL request.