Last week I was working on a data scraping project. I had to extract data from an ASP.NET MVC website using PHP cURL. I guess everyone knows about PHP cURL. It is built on libcurl, a library created by Daniel Stenberg that allows you to connect to and communicate with many different types of servers over many different protocols. You can get more information from this link.
So basically, the following code snippet shows how to use cURL with PHP for data scraping.
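// Start a cURL session for the page to scrape
$ch = curl_init("http://www.example.com");
// Return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$pageData = curl_exec($ch);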
You will get the complete page (HTML plus data) in the $pageData variable. Now you can extract data from this variable using various PHP functions.
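As one possible way to do that extraction, here is a minimal sketch that uses PHP's DOMDocument and DOMXPath classes to pull the text out of table cells. The helper name extractCellTexts and the sample HTML are made up for illustration; in real use you would pass in the $pageData string returned by curl_exec().

```php
<?php
// Hypothetical helper: collect the trimmed text of every <td> cell in an HTML string.
function extractCellTexts(string $html): array
{
    $doc = new DOMDocument();
    // Scraped HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query('//td') as $cell) {
        $texts[] = trim($cell->textContent);
    }
    return $texts;
}

// Stand-in for the $pageData string returned by curl_exec().
$pageData = '<html><body><table><tr><td>Alice</td><td> 42 </td></tr></table></body></html>';
$cells = extractCellTexts($pageData);
print_r($cells);
```

Changing the XPath expression (here '//td') is usually enough to target other elements, such as links or the options of a combo box.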
With cURL it is very easy to extract data from a static page. But what if the page is dynamic, like an ASP.NET page? For example, suppose the page has a combo box, and selecting a different option refreshes the page. The exact term is 'post back': some code runs in the code-behind file of the ASP.NET page, and the page is constructed again.
So how can we use PHP cURL for this? cURL supports various options for this, such as:
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
Here you can give all the post fields, such as the option selected in the combo box. The value is a string of name=value pairs separated by &, for example field1=value1&field2=value2 (placeholder names).
So you have to identify all the fields in the ASP.NET page that get posted. But what if you don't know anything about the ASP.NET page? How can you identify all the necessary fields that get posted on a post back?
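Rather than joining the pairs by hand, one option is to let PHP's http_build_query() build the string, since it also URL-encodes the values. This is only a sketch: the field names ddlCountry and txtSearch are hypothetical and should be replaced with whatever your target page actually posts.

```php
<?php
// Hypothetical field names; replace them with the ones your target page posts.
$fields = [
    'ddlCountry' => 'India',
    'txtSearch'  => 'php & curl',
];

// http_build_query() URL-encodes each value and joins the pairs with "&".
$post_string = http_build_query($fields);
echo $post_string, PHP_EOL; // prints: ddlCountry=India&txtSearch=php+%26+curl
```

The resulting $post_string can then be passed to curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string) as shown above.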
Well, I used a tool called the Fiddler web debugger. Once installed, Fiddler tracks all of your web sessions and shows you the data for each one in views such as Headers, TextView, WebForms, and XML.
After installing Fiddler, just browse through your ASP.NET website. Select various options in the combo box and analyze the data Fiddler captures. You will get all the information about the posted fields. Using this, you can build the post string for your cURL request and get all the data from the ASP.NET website.
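To tie the steps together, here is a rough sketch of how the post string could be assembled in PHP. It assumes the Fiddler capture shows hidden form fields being posted back (ASP.NET WebForms pages typically send ones such as __VIEWSTATE and __EVENTVALIDATION); if your capture shows different fields, use those instead. The helper names, the sample HTML, and the ddlCountry control are all hypothetical, and postBack() is defined but not called so the sketch runs without a network connection.

```php
<?php
// Collect name => value for every hidden <input> in the page.
function collectHiddenFields(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Merge the hidden fields with the values we want to change, then encode them.
function buildPostString(string $html, array $overrides): string
{
    return http_build_query(array_merge(collectHiddenFields($html), $overrides));
}

// Send the post string to the page and return the response (not called below).
function postBack(string $url, string $postString)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postString);
    return curl_exec($ch);
}

// Stand-in for the HTML returned by the first plain GET request.
$firstPage = '<form><input type="hidden" name="__VIEWSTATE" value="abc123" />'
    . '<input type="hidden" name="__EVENTVALIDATION" value="xyz789" />'
    . '<select name="ddlCountry"><option value="IN">India</option></select></form>';

$post_string = buildPostString($firstPage, [
    '__EVENTTARGET' => 'ddlCountry', // hypothetical control that triggers the post back
    'ddlCountry'    => 'IN',         // hypothetical option selected in the combo box
]);
echo $post_string, PHP_EOL;
```

In real use, $firstPage would be the $pageData from the first request, and the result of buildPostString() would be handed to postBack() along with the page URL. Compare the generated string with the request body shown in Fiddler to check that no field is missing.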