Edmonds Commerce Blog

Freelance PHP Ecommerce and SEO Developer in the UK

Building Spiders: Grab Data, Post Forms and Interact with Web Sites Automatically

February 14th, 2008
Tags: curl, firefox, php, programming, spidering

One of the most useful and powerful things you can do with PHP is to create a program which simulates a web browser and can grab data, post data to forms and generally interact with other web sites automatically.

For PHP to be able to work like this it must have the CURL library installed and active. It is the CURL library which actually handles all of the interaction and PHP is my scripting language of choice for interacting with CURL.

A simple CURL function looks like this:

```php
function curl($url) {
    $timeout = 300; // how long (in seconds) before CURL gives up on this page
    $go = curl_init();
    curl_setopt($go, CURLOPT_URL, $url);
    curl_setopt($go, CURLOPT_RETURNTRANSFER, 1); // return the page instead of printing it
    curl_setopt($go, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
    curl_setopt($go, CURLOPT_TIMEOUT, $timeout);
    $page = curl_exec($go); // false if the request failed
    curl_close($go);
    return $page;
}
```

When called and echoed, this function outputs the entire HTML of the $url specified.
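When a fetch fails, curl_exec() returns false and gives no reason on its own. As a minimal sketch, assuming you want the error message and HTTP status alongside the page, a variant of the function could look like this. The name curl_checked, the shorter default timeout and the shape of the returned array are illustrative assumptions, not part of the original function.

```php
<?php
// Hypothetical variant of curl() that also reports why a fetch failed.
function curl_checked($url, $timeout = 30) {
    $go = curl_init();
    curl_setopt($go, CURLOPT_URL, $url);
    curl_setopt($go, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($go, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($go, CURLOPT_TIMEOUT, $timeout);
    $page   = curl_exec($go);                        // false on failure
    $error  = curl_error($go);                       // empty string when nothing went wrong
    $status = curl_getinfo($go, CURLINFO_HTTP_CODE); // 0 if no HTTP response was received
    curl_close($go);
    return array('page' => $page, 'status' => $status, 'error' => $error);
}
```

A caller would compare $result['page'] against false with === before using it, and log $result['error'] for any URL that could not be fetched.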

To grab data from this page and insert it into a database (for example, when spidering a supplier's web site for product information to add to your own site), we use regular expressions to find what we are looking for and then insert the result into the database.

So, for example, if we wanted to grab the product title and we knew that it was wrapped in an h1 tag with the class "product_title", we could use this regexp to grab it:

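```php
$page = curl($url);

// Keep the pattern on one line: without the x modifier, line breaks inside it would have to match literally.
// i = case-insensitive, s = let the title span more than one line
$pattern = '%<h1 class="product_title">(.+?)</h1>%is';

preg_match($pattern, $page, $matches);

// $matches[0] is the whole tag, $matches[1] is the captured title
print_r($matches); // we can see the entire array of matches and choose which we want to insert into the database
```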
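A supplier page often lists more than one product, and the captured text may contain entities or nested tags. The following is a rough sketch, assuming the same product_title markup, of collecting every title with preg_match_all() and tidying each one before it goes near the database. The helper name extract_product_titles and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of every <h1 class="product_title"> in the page.
function extract_product_titles($page) {
    $pattern = '%<h1 class="product_title">(.+?)</h1>%is';
    preg_match_all($pattern, $page, $matches);
    $titles = array();
    foreach ($matches[1] as $raw) {
        // strip nested tags, decode entities such as &amp; and trim whitespace
        $titles[] = trim(html_entity_decode(strip_tags($raw), ENT_QUOTES, 'UTF-8'));
    }
    return $titles;
}

// Sample markup standing in for a page returned by curl()
$page = '<html><body><h1 class="product_title"> Blue &amp; White Widget </h1></body></html>';
print_r(extract_product_titles($page));
```

If the page contains no matching tag, the function returns an empty array, which is easy to check before running any insert.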

We can also post data to web sites using CURL. This allows us to do all kinds of things, including grabbing data that is only displayed after a POST form has been submitted. Here is an example CURL post function:

```php
function curl_post($url, $post_data) {
    $timeout = 300; // how long (in seconds) before CURL gives up on this page
    $go = curl_init();
    curl_setopt($go, CURLOPT_URL, $url);
    curl_setopt($go, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($go, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($go, CURLOPT_TIMEOUT, $timeout);
    // now for the post section
    curl_setopt($go, CURLOPT_POST, true);
    curl_setopt($go, CURLOPT_POSTFIELDS, $post_data); // the post string, e.g. "field1=value1&field2=value2"
    $page = curl_exec($go); // false if the request failed
    curl_close($go);
    return $page;
}
```

It can be tricky to figure out exactly what data should be in the post string. An incredibly handy Firefox addon can help you out here: Live HTTP Headers.

This addon lets you see exactly what is going on between your browser and the web site you are visiting. This can quickly and easily give you the information you need to replicate the same behaviour with your CURL script.
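Once you know which fields the form sends, the post string can be built with PHP's http_build_query() rather than by hand, so that special characters are URL-encoded correctly. This is only a sketch: the field names and values below are invented, and the real ones are whatever Live HTTP Headers shows for the form you are replicating.

```php
<?php
// Hypothetical field names and values; use the ones the real form submits.
$fields = array(
    'username' => 'demo',
    'search'   => 'blue widgets & co',
);

// http_build_query() URL-encodes each value and joins the pairs with &
$post_data = http_build_query($fields, '', '&');

echo $post_data . "\n"; // prints: username=demo&search=blue+widgets+%26+co
```

The resulting string can then be passed as the second argument to the curl_post function.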

Edmonds Commerce specialise in working with PHP and CURL. If you have any spidering, screen scraping or other application that requires PHP to actively interact with other web sites - get in touch today to see how we can help you benefit from this incredibly powerful technique.

Related Resources

http://www.phpfour.com/blog/2008/01/20/php-http-class/

http://www.phpclasses.org/browse/package/1988.html

http://www.phpit.net/article/using-curl-php/

http://skeymedia.com/intro-to-curl-with-php/


3 Responses to “Building Spiders: Grab Data, Post Forms and Interact with Web Sites Automatically”

  1. Martin Says:
    October 5th, 2008 at 12:41 am

    Very interesting!

    Thanks for this post.

  2. Nathan Says:
    February 12th, 2009 at 4:44 pm

    Hi

    Thanks for this post, I think it is a really well written and informative introduction to curl! I will pass this link onto other people I find wanting curl advice.

    Nathan

  3. admin Says:
    February 12th, 2009 at 4:49 pm

    my pleasure :-)
