home | contact us
» Archive by category "scraping"

category: scraping


If you need to do some HTML parsing, scraping etc then you generally have the choice of using the DOM approach to parse the HTML or using regex to pull bits out. I quite like both approaches, there are pros and cons to both so having both options available is the best to ensure you use the right tool for the job on a case by case basis.

Recently though I discovered Simple HTML Dom. This makes the DOM based approach almost ridiculously easy, especially if you are comfortable with jQuery like selectors for pulling out specific DOM nodes.

You can read all about it here:
http://simplehtmldom.sourceforge.net/manual.htm

First impressions are really good, its very easy and the lead in time from installation to using is really fast. I like that, never been much of a fan of having to RTFM for everything!


 

I just hit an intriguing situation where a page that was perfectly viewable in my browser was not visible via curl.

I scratched my head and messed around testing the page in variety of online proxy services and local web browsers. I even stared messing about with telnet and manually typing headers. My conclusion was that the simpler systems such as text based browsers were not able to see the page and were instead given a 404 message.

However better more modern browsers could see the page. Likewise the page was visible in the Google cache and aso Google Translate.

In the end I downloaded a neat little firefox add-on called Tamper Data. This allows you to tweak your request headers before they are submitted. 5 minutes later I realised that it was the Gzip compatibility which was causing the issue.

Curl (being the awesome tool that it is) can handle Gzip compression, but I wasn’t using it. I have now added the following line to my curl function and I am pulling pages fine.

 if(!empty($compression)){
    curl_setopt($go,CURLOPT_ENCODING , $compression);
 }

More…


 

Sometimes you want your script to pause for a short period of time before repeating a loop or proceeding to the next step. This may be to reduce server load or even to simulate the natural pauses that a person would make whilst browsing a site. This kind of thing is especially true if you are building a system to scrape content such as product information from a suppliers web site and you do not want to hammer their server to death and get your IP banned.

 function sleep_flush($chunks=10){

 	$c=0;

 	while($c < $chunks){

 		$rand = rand(2000000, 6000000);

 		echo '<br> . . . sleeping for ' . round(($rand / 1000000),2) . ' seconds . . . zzzzzzzzzzzzzz<br>';

 		flush();

 		usleep($rand);

 		$c++;

 	}

}

This function will do just that for you. Also it has a built in flush which will help in preventing the script from being regarded as timed out if you are running it from the web browser.


 

One of the most arduous task in maintaining and ecommerce web site is the constant task of keeping all of your products up to date. Suppliers like to release new price lists and product ranges on at least a yearly basis. Just when you have managed to get your web site completely up to date, the rug is pulled out from under you and you have to start all over again…

Its a never ending task and if not handled carefully can soak up ridiculous amounts of staff time. Staff time means expense as any business financial director will tell you on wages day.

There are some great gains to be made in automating the task of keeping a product catalogue up to date. The popular ecommerce platform osCommerce has a contribution called easypopulate which is an essential addition. CREloaded and other flavours of enhanced osCommerce tend to ship with easypopulate installed by default.

Its a great script, but for a very large site with a lot of demands the script does need a little extra work to make it the fully fledged backup, restore, update and insert products tool that a serious ecommerce business needs.

Most platforms either come with or have contributions or modules you can install to enable bulk importing and updating of products which definitely makes things faster. However if you are finding that you or your staff are creating these spreadsheets by hand in your spreadsheet programme then there are still amazing efficiences to be achieved.

In fact, once you get really serious, you start hitting the row limits of common spread sheet programmes which means you literally can not open the entire product feed for editing. Once you are at this kind of size (or a lot less) you will find that your computer is very sluggish and editing becomes a slow and tedious task.

A much better way is to use custom written scripts for processing your information into the correct format for your bulk updating system. This is something that Edmonds Commerce specialise in and we have many happy customers who are saving countless hours by having software for the hard work.

The ultimate solution is to recieve product information in a feed from your suppliers. This is generally a very fast and reliable way of grabbing all of the product information for a particular supplier. However not all suppliers supply a feed. Others do supply a feed, but it is missing crucial information.

In this kind of scenario, it is usually possible to actually develop a script which will grab all of the required information and images directly from your suppliers web site. The familiar copying and pasting scenario but done by a script rather than a member of staff. Thousands of products can be grabbed in an hour and all outputed into a file correctly formatted for immediate upload into your bulk updating system such as Easypopulate for oscommerce.

Again this is a real speciality of Edmonds Commerce, though you may find that the average web design company does not have this kind of service to offer as it is a bit unusual and involves a steep learning curve.

For more information on this subject please do get in touch


 

One of the big tasks that any ecommerce retail business must undertake is the continual updating and inserting of products into the catalogue. Done one by one, this task can take a ridiculous amount of time. In some instances there is no better option, but in the vast majority of cases there is!

Product Feed

The ideal scenario is that your supplier makes available an up to date product feed which is regularly updated and contains all of the information you need to insert those products into your catalogue. The challenge with this is that it is highly unlikely that you will literally be able to upload this data as is. The reason being that each ecommerce system has its own quirks and separate ways of doing things. Before you can upload this data into your feed, it is highly likely that it will need to be altered and prepared for insertion.

You could do updating by hand – but that brings us back to our first point. Doing things by hand can take a ridiculously large amount of time. Instead – we recommend that you have a script which does all this preparation for you.

In fact this task is something that Edmonds Commerce specialise in. Not least because it is something that we have done plenty of and so we have a good understanding of how to do the job. Furthermore – we understand how to do the job well.

Spidering and Scraping Products from Supplier Web Site

If your supplier does not provide a feed, or if the feed they supply does not have all of the information that you want, you might think you are stuck. You are not!

It is perfectly possible to build a system which will visit every product on your suppliers web site and grab all of the information and pictures and then save them into a format that you can insert into your catalogue system. It is even possible to extend the scraping system so that it goes all the way and inserts the products into your site for you.

Again this is something that Edmonds Commerce specialise in.

Conclusion

If you find that you or your staff are spending large amounts of time manually copying and pasting information from supplier web sites – you need to ask yourself if that is really cost effective. Whilst developing a script to process a feed or scrape a supplier web site might involve a significant initial outlay – the humongous saving in staff time ensures that you will quickly recoup this cost and then will be straight into a profitable scenario. Furthermore your catalogue will be absolutely up to date with the latest pictures, information and prices meaning that you have the best chance to sell those products.

If you want to discuss how Edmonds Commerce could help you achieve these great goals of cost reduction and totally up to date catalogue – please do get in touch.


 
rss icon