Edmonds Commerce Blog

Freelance PHP Ecommerce and SEO Developer in the UK

Building Spiders: Grab Data, Post Forms and Interact with Web Sites Automatically

February 14th, 2008
Categories: curl, firefox, php, programming, spidering

One of the most useful and powerful things you can do with PHP is to create a programme which will simulate a web browser and can grab data, post data to forms and generally interact with other web sites - automatically.

For PHP to be able to work like this it must have the CURL library installed and active. It is the CURL library which actually handles all of the interaction and PHP is my scripting language of choice for interacting with CURL.
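Before relying on CURL it is worth confirming that the extension really is loaded. The snippet below is a minimal sketch of such a check; the helper name `curl_available()` is made up for illustration and is not part of PHP or of the original scripts.

```php
<?php
// Hypothetical helper (name invented for this example): reports whether
// the cURL extension is loaded and its core function is callable.
function curl_available(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (!curl_available()) {
    // How the extension is enabled varies between installations.
    echo "The cURL extension is not loaded; enable it before running a spider.\n";
}
```

Running this at the top of a spidering script gives a clear message instead of a fatal "undefined function" error later on.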

A simple CURL function looks like this:

PHP:
    function curl($url){
        $timeout = 300; // seconds before CURL gives up on this page
        $go = curl_init();
        curl_setopt($go, CURLOPT_URL, $url);
        curl_setopt($go, CURLOPT_RETURNTRANSFER, 1); // return the page as a string instead of printing it
        curl_setopt($go, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
        curl_setopt($go, CURLOPT_TIMEOUT, $timeout);
        $page = curl_exec($go); // the page HTML, or false on failure
        curl_close($go);
        return $page;
    }

Calling this function and echoing the result outputs the entire HTML of the page at the $url specified.

To grab data from this page for insertion into a database (for example, when spidering a supplier's web site for product information to add to your own site), we use regular expressions to find what we are looking for and then insert the matches into the database.

So, for example, if we wanted to grab the product title and we knew that it was wrapped in an h1 tag with the class "product_title", we could use this regexp to grab it:

PHP:
    $page = curl($url);

    // the pattern is kept on one line so that no stray newlines become part of it
    $pattern = '%<h1 class="product_title">(.+?)</h1>%i';

    preg_match($pattern, $page, $matches);

    print_r($matches); // shows the entire array of matches so we can choose which to insert into the database
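To make the contents of the matches array concrete, here is a small sketch that runs a single-line version of the pattern against a hard-coded HTML string rather than a live page. The sample markup and the helper name `extract_title()` are assumptions made up for this illustration.

```php
<?php
// Sample markup standing in for a fetched page (invented for illustration).
$page = '<html><body><h1 class="product_title"> Blue &amp; White Mug </h1></body></html>';

$pattern = '%<h1 class="product_title">(.+?)</h1>%i';

// Hypothetical helper: returns the cleaned title, or null when nothing matches.
function extract_title(string $page, string $pattern): ?string
{
    if (preg_match($pattern, $page, $matches) !== 1) {
        return null;
    }
    // $matches[0] is the whole match including the h1 tags;
    // $matches[1] is the text captured by the (.+?) group.
    return trim(html_entity_decode(strip_tags($matches[1])));
}

$title = extract_title($page, $pattern);
echo $title, "\n";
```

Checking the return value of preg_match before reading the array avoids inserting empty titles when a supplier changes their markup.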

We can also POST data to web sites using CURL. This allows us to do all kinds of things, including grabbing data that is only displayed after a form has been submitted. Here is an example CURL post function:

PHP:
    function curl_post($url, $post_data){
        $timeout = 300; // seconds before CURL gives up on this page
        $go = curl_init();
        curl_setopt($go, CURLOPT_URL, $url);
        curl_setopt($go, CURLOPT_RETURNTRANSFER, 1); // return the page as a string instead of printing it
        curl_setopt($go, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
        curl_setopt($go, CURLOPT_TIMEOUT, $timeout);
        // now for the post section
        curl_setopt($go, CURLOPT_POST, true);
        curl_setopt($go, CURLOPT_POSTFIELDS, $post_data);
        $page = curl_exec($go); // the response HTML, or false on failure
        curl_close($go);
        return $page;
    }

It can be tricky to work out exactly what data should go in the post string. An incredibly handy addon for Firefox helps here: Live HTTP Headers.

This addon lets you see exactly what is going on between your browser and the web site you are visiting. This can quickly and easily give you the information you need to replicate the same behaviour with your CURL script.
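Once a request body has been captured, it can be decoded, adjusted and re-encoded with PHP's built-in parse_str and http_build_query functions. The sketch below assumes a made-up captured string; the field names are purely illustrative and do not come from any real site.

```php
<?php
// Hypothetical request body, as it might be copied from a header-inspection tool.
$captured = 'username=demo&search=red+shoes&page=1';

// Decode into an array so that individual fields can be changed.
parse_str($captured, $fields);
$fields['page'] = '2';

// Re-encode as a URL-encoded string. The explicit '&' separator avoids
// depending on the arg_separator.output ini setting.
$post_data = http_build_query($fields, '', '&');

echo $post_data, "\n";
```

The resulting string can be passed as the second argument to the curl_post function. Note that giving CURLOPT_POSTFIELDS a string sends the data as application/x-www-form-urlencoded, whereas giving it an array sends multipart/form-data, which some forms will not accept.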

Edmonds Commerce specialise in working with PHP and CURL. If you have any spidering, screen scraping or other application that requires PHP to actively interact with other web sites - get in touch today to see how we can help you benefit from this incredibly powerful technique.

Related Resources

http://www.phpfour.com/blog/2008/01/20/php-http-class/

http://www.phpclasses.org/browse/package/1988.html

http://www.phpit.net/article/using-curl-php/

http://skeymedia.com/intro-to-curl-with-php/

3 Responses to “Building Spiders: Grab Data, Post Forms and Interact with Web Sites Automatically”

  1. Martin Says:
    October 5th, 2008 at 12:41 am

    Very interesting!

    Thanks for this post.

  2. Nathan Says:
    February 12th, 2009 at 4:44 pm

    Hi

    Thanks for this post, I think it is a really well written and informative introduction to curl! I will pass this link onto other people I find wanting curl advice.

    Nathan

  3. admin Says:
    February 12th, 2009 at 4:49 pm

    my pleasure :-)
