Saturday, August 28, 2010

Data Scraping Part-2

Here is the second article on data scraping. As I mentioned in my previous post, I was working on scraping data from an ASP.NET MVC website. While doing that, I faced one more problem.

There was a field on the page that is visible only to logged-in users. So first of all I had to create an ASP.NET session through PHP cURL, and only then could I request that particular page. How to do that? The procedure follows.

First of all, create a session through PHP cURL, and then use that same cURL handle to request the restricted page. Here is the code for that.

// Log in first so that ASP.NET creates a session for us.
$url = $this->login_url;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// Pretend to be a normal browser.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8');
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
// Post the login form fields exactly as the site expects them.
curl_setopt($ch, CURLOPT_POSTFIELDS, 'userName='.urlencode($this->user).'&password='.urlencode($this->pass).'&rememberMe=true&rememberMe=false');
// Save the session cookies so the next request is authenticated.
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_exec($ch); // perform the login request so the session cookies get written

In the above code, two options are new: CURLOPT_USERAGENT and CURLOPT_POSTFIELDS.

If you look carefully at CURLOPT_POSTFIELDS, you will notice some data that differs from normal post fields. Generally, every login page has a 'remember me' option, and it gets posted along with the credentials.

The other new option is CURLOPT_USERAGENT. This carries the information about the browser and system from which the request is being sent.

How can you build these two fields properly? The option I used is the Live HTTP Headers extension for Firefox. You can find it on this website: http://livehttpheaders.mozdev.org/

Install it in your Firefox and then simply browse the website for which you want to get the information. The extension will capture all the data of the pages you browse, and you can reuse it in your cURL request. Plug it into the code above, and then use the following code to get the restricted information on the page.

// Reuse the same handle (and the cookies saved above) to fetch the restricted page.
curl_setopt($ch, CURLOPT_URL, $page_url);
curl_setopt($ch, CURLOPT_POST, false);
$store = curl_exec($ch);
curl_close($ch);
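
In case the site checks more than just the user agent, you can also replay other request headers captured by Live HTTP Headers. A minimal sketch (set these before calling curl_exec(); the header values below are placeholders, not taken from the original requests):

// Replay additional captured browser headers (placeholder values).
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Accept: text/html,application/xhtml+xml',
    'Accept-Language: en-us,en;q=0.5',
    'Referer: http://www.example.com/login' // hypothetical referer
));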

Hope this helps.



Friday, August 13, 2010

Scrape data from an ASP.NET MVC website using PHP cURL

Hello,

Last week I was working on a data scraping project. I had to extract data from an ASP.NET MVC website using PHP cURL. I guess everyone knows about PHP cURL: it is a library created by Daniel Stenberg that allows you to connect and communicate with many different types of servers over many different protocols. You can get more information from this link.

So basically, the following code snippet shows how to use cURL with PHP for data scraping.

// Fetch the page and return the response as a string instead of printing it.
$ch = curl_init("http://www.example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$pageData = curl_exec($ch);
curl_close($ch);

You will get the complete page (HTML plus data) in the $pageData variable. Now you can extract the data you need from this variable using various PHP functions.
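
For instance, here is a minimal sketch of pulling values out of $pageData with PHP's DOMDocument and DOMXPath; the class name 'price' is just a made-up example, so adjust the query to the page you are scraping.

// Load the fetched HTML; the @ suppresses warnings from imperfect markup.
$doc = new DOMDocument();
@$doc->loadHTML($pageData);
$xpath = new DOMXPath($doc);
// 'price' is a hypothetical class name on the target page.
foreach ($xpath->query("//span[@class='price']") as $node) {
    echo trim($node->textContent), "\n";
}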

When you use cURL, it is very easy to extract data from a static page. But what if we have a dynamic page, like an ASP.NET page? For example, you have a combo box on an ASP.NET page, so when you select another option in the combo box, the page gets refreshed. The exact term is 'post back': some code gets executed in the code-behind file of the ASP.NET page, and the page is constructed again.

So how can we use PHP cURL for this? cURL supports options for this, such as

curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);

Here you can give all the post fields, such as the option selected in the combo box, as an &-separated string. For example:

postfield1=value1&postfield2=value2&postfield3=value3
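
In PHP you don't have to concatenate this string by hand; http_build_query() builds it from an array and URL-encodes the values for you. A small sketch:

// Build the &-separated post string from an array of fields.
$postString = http_build_query(array(
    'postfield1' => 'value1',
    'postfield2' => 'value2',
    'postfield3' => 'value3'
));
// $postString is now "postfield1=value1&postfield2=value2&postfield3=value3"
curl_setopt($ch, CURLOPT_POSTFIELDS, $postString);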

So you have to identify all the fields on the ASP.NET page that get posted. But what if you don't know anything about the page? How can you identify all the necessary fields that get posted on a post back?

Well, I used a tool called Fiddler web debugger. When you install Fiddler, it will track all of your web sessions and give you all the data, like headers, TextView, WebForms, XML, etc.
After installing Fiddler, just browse through your ASP.NET website. Select various options from the combo box and analyse the data Fiddler captures. You will get all the information about the posted fields. Using this, you can build your post string for the cURL request, and you can get all the data of the ASP.NET website.
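
Putting it together, here is a minimal sketch of replaying a post back with cURL. The field names below are placeholders; use exactly the names and values Fiddler shows for your page (WebForms-style pages typically also post hidden fields such as __VIEWSTATE).

// Replay a post back with the fields captured in Fiddler (placeholder names/values).
$postFields = array(
    '__VIEWSTATE'   => $viewState,    // hidden field copied from the previous response, if present
    '__EVENTTARGET' => 'ddlCategory', // hypothetical combo box control name
    'ddlCategory'   => 'value2'       // hypothetical selected option
);
$ch = curl_init('http://www.example.com/page');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$pageData = curl_exec($ch);
curl_close($ch);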


Saturday, August 7, 2010

Detect product page in Magento

Recently, we faced a problem in a Magento project. We were implementing search functionality for products: the user can enter a product name and choose a category and sub-category to search. Now, the problem was that this search box should be displayed only on category and sub-category pages; on product pages, it should not be visible. First we tried to hide the search box on the client side through JavaScript. However, there was an annoying flicker effect: the search box was displayed for a moment, and then it got hidden. So we decided to hide it from the server-side code. In Magento, this search box is generated from the following file:

/app/design/frontend/base/default/template/catalogsearch/form.mini.phtml

We can hide it from this file. Now the challenge was to detect the product page. We used the following code for that.

$isProductPage = false;
if (Mage::registry('current_category'))
{
    $cat = Mage::registry('current_category');
    $catLevel = $cat->getLevel();
    $catId = $cat->getId();
    $_id = Mage::app()->getRequest()->getParam('id', false);
    if ($_id != $catId) // special condition for product page
    {
        $isProductPage = true;
    }
}


In the case of a category or sub-category page, $catId and $_id give you the same id. For a product page, $catId gives you the id of the product's category, while $_id gives you the product id.
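
With that flag in hand, the template can simply skip rendering the search form. A minimal sketch of form.mini.phtml (the original markup is abbreviated here):

<?php if (!$isProductPage): ?>
    <!-- ... original search form markup of form.mini.phtml ... -->
<?php endif; ?>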

So this was the trick to detect the product page.
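
As a side note, another check commonly used in Magento (not the one we used here) is to look for the current product in the registry; a minimal sketch, assuming Magento 1.x:

// A product page normally registers 'current_product'; category pages do not.
$isProductPage = (bool) Mage::registry('current_product');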