Category Archives: Booko

Posts about Booko, some technical, some explaining new features

So awesome.

There are times I really enjoy using Ruby on Rails. Recently, Fishpond started 403’ing HTTP requests for cover images if the referrer isn’t fishpond.com.au. Sites do this so that other sites don’t steal their bandwidth. Really, Booko should be downloading the images and serving them itself (it’s on the todo list, BTW). Since Booko had been using Fishpond image URLs to display covers, you may have noticed a bunch of missing cover images – some of them are caused by Fishpond’s new (completely reasonable) policy.

So I’ve updated the code so that Booko no longer links to Fishpond images, but now I need to go through every product Booko’s ever seen and update those with a Fishpond image URL. This is laughably easy with Ruby on Rails. Just fire up the console and run this:

Product.find_each do |p|
  if p.image_url =~ /fishpond/
    puts "updating details for #{p.gtin}"
    p.image_url = nil
    p.get_detail
    p.save
  end
end

The Rails console gives you access to all the data and models of your application – and this code, just pasted in, will find every product with a Fishpond image URL, clear it, and fetch a replacement image (or leave it nil if none is found). Point of interest – Booko has 396,456 products in its database. Iterating with Product.all.each would load every product into memory before hitting the each – that would probably never return. On the other hand, Product.find_each loads records in batches of 1000 by default. Pretty cool.
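If the default batch size doesn’t suit, find_each also takes a :batch_size option. A quick sketch (the 500 is just an example):

Product.find_each(:batch_size => 500) do |p|
  # Each batch of 500 products is loaded, yielded, then released,
  # so memory use stays flat no matter how many products there are.
  puts p.gtin
end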

* Thanks to http://ryandaigle.com/ for posting about this feature.

Fun with git post-checkout

While developing new features or bug fixes for Booko, I usually work in branches. This makes keeping things separate easy, and means I can easily keep the current production version clean and easy to find. But when changing branches I often have to restart the Rails server and the price grabber to pick up any changes. For example, if I’m adding a new shop in a branch, when I switch branches I want the price grabber to restart.

Turns out git makes this super easy. You just create a shell script: .git/hooks/post-checkout

That script gets called after checkout. So, mine is pretty simple:

#!/bin/sh
# Runs after every git checkout – restart the price grabber and the web server.
./bin/fetch_price.rb 0 restart;
thin restart

There’s probably a better way to get Thin to reload itself, but this works nicely.

You can check out all the hooks here: http://www.kernel.org/pub/software/scm/git/docs/v1.5.5.4/hooks.html

On being Google’d

[Image: Google crawl activity]

I love Google – they send me stacks of traffic and give sites like Booko a far greater audience than I could reach on my own. Recently, however, Google’s taken a bigger interest in Booko than usual. These kinds of numbers are no problem in general – the web server and database are easily capable of handling the load.

The problem for Booko, when Google comes calling, is that they request pages for specific books such as:

http://www.booko.com.au/books/isbn/9780140232929

When this request comes in, Booko will check to see how old the prices are – if they’re more than 24 hours old, Booko will attempt to update them. Booko used to load the prices into the browser via AJAX – so, as far as I can tell, Google wasn’t even seeing the prices. Further, Booko has a queuing system for price lookups, so each page Google requests adds a book to the queue of books to be looked up. Google views books faster than Booko can grab the prices, so we end up with hundreds of books scheduled for lookup, frustrating normal Booko users who see the problem as a page full of spinning wheels – wondering why Booko isn’t giving them prices. Meanwhile, the price grabbers are hammering through hundreds of requests from Google and, in turn, hammering all the sites Booko indexes. So, what to do?

Well, the first thing I did was drop Google’s traffic. I hate the idea of doing this – but really Booko needs to be available to people to use and being indexed by Google won’t help if you can’t actually use it. So to the iptables command we go:

iptables -I INPUT -s 66.249.71.108 -j DROP
iptables -I INPUT -s 66.249.71.130 -j DROP

These commands drop all traffic from those two Googlebot addresses.

The next step was to sign up for Google Webmaster Tools and reduce the page crawl rate.

[Screenshot: Google Webmaster Tools crawl rate setting]

Once I’d dialled back Google’s crawl rate, I dropped the iptables rules:

iptables -F

To make Booko more Google friendly, the first code change was to have book pages rendered immediately with the available pricing (provided it’s complete) and have updates to that pricing delivered via AJAX. Google now gets to see the entire page and should (hopefully) provide better indexing.
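Roughly speaking, the page action now renders whatever pricing it already has and only schedules a refresh when the prices are stale. Here’s a minimal sketch of the idea (BooksController, prices_updated_at and PriceQueue are illustrative names, not Booko’s actual code):

class BooksController < ApplicationController
  def show
    @product = Product.find_by_gtin(params[:isbn])
    @prices  = @product.prices    # render immediately with whatever prices we have
    # Only queue a lookup if the prices are more than 24 hours old;
    # the AJAX call picks up the refreshed prices once they arrive.
    if @product.prices_updated_at.nil? || @product.prices_updated_at < 24.hours.ago
      PriceQueue.push(@product)
    end
  end
end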

The second change was to create a second queue for price updates – the bulk queue. The price grabbers will first check for regular price update requests – meaning people will get their prices first. Requests by bulk users, such as Google, Yahoo & Bing, will be added to the bulk queue and looked up when there are no normal requests.  In addition, I can restrict the number of price grabbers which will service the bulk queue.
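In outline, the two-queue idea looks something like this sketch (the queue classes and the bot patterns are illustrative, not Booko’s real implementation):

BOT_AGENTS = /googlebot|slurp|msnbot/i   # illustrative patterns only

# Crawler requests go to the bulk queue; real users go to the normal queue.
def enqueue_price_lookup(product, user_agent)
  if user_agent =~ BOT_AGENTS
    BulkQueue.push(product)
  else
    PriceQueue.push(product)
  end
end

# Each price grabber drains the normal queue first, and only falls back to
# the bulk queue when no user-initiated requests are waiting.
def next_lookup
  PriceQueue.pop || BulkQueue.pop
end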

This work has now opened up a new idea I’ve been thinking about – pre-emptively grabbing the prices of the previous day’s or week’s most popular titles. The idea would be to add these popular titles to the bulk queue during the quiet time between 03:00 and 06:00. That would mean that when people viewed those titles later in the day, the prices would already be fresh.
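The nightly task could be as simple as this sketch (BulkQueue is the same hypothetical name as above; get_popular is the method described further down this page):

# Push yesterday's 50 most viewed titles onto the bulk queue so their
# prices are refreshed before the day's traffic arrives.
Product.get_popular(1.day.ago, 50).each do |product|
  BulkQueue.push(product)
end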

I’ve just pushed these changes to the Booko site and, with some luck, Google & Co will be happier, Booko users will be happier, and I should be able to build new features on this groundwork. Nice for a Sunday evening’s work.

A gaggle of updates

Just pushed out some more updates to Booko:

  • Moved Booko’s gem dependencies into environment.rb. Should have done this ages ago. (Read about it here)
  • Started moving from Hpricot to Nokogiri. Hpricot was awesome, but it’s been choking on some websites and crashing the price grabbers.
  • Started moving from timeout.rb to SystemTimer because I would like my timeouts to, you know, timeout. Sheesh. (There’s a quick sketch of the switch after this list.)
  • Updated the memcache-client to use SystemTimer for the same reason.
  • Added more testing
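For anyone wondering what the SystemTimer switch looks like in practice, here’s a minimal sketch (the URL and the 10 second limit are just examples). SystemTimer uses SIGALRM, so the timeout fires even while a blocking C-level call, like a stalled socket read, is in progress:

require 'open-uri'
require 'system_timer'

begin
  # Give up if the fetch hasn't completed within 10 seconds.
  SystemTimer.timeout_after(10) do
    open("http://example.com/slow-page").read
  end
rescue Timeout::Error
  puts "gave up after 10 seconds"
end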

Moving processing to the Database

It’s been great fun writing Booko in Ruby on Rails for lots of reasons, and the ORM layer, ActiveRecord, is a big part of what makes it enjoyable. I know that ORMs exist in plenty of other languages, but RoR was my first exposure to one, and it makes writing database-backed applications much less tedious. As with all abstractions, though, looking under the covers can help you solve problems and improve performance.

Booko has a “Most Viewed” section, which finds the products which have been viewed the most over the last 7 days. It does this by having a model called “View” which, as you might guess, records product views. The Product class asks the View class for the top 10 viewed products:

class Product < ActiveRecord::Base
  ...
  def self.get_popular(period = 7.days.ago, count = 10)
    View.popular(period).collect {|v| v.product }.reverse.slice(0...count)
  end
  ...
end

The View makes use of named scopes:

class View < ActiveRecord::Base
  ...
  named_scope :popular, lambda { |time_ago| { :group      => 'product_id',
                                              :conditions => ['created_on > ?', time_ago],
                                              :include    => :product,
                                              :order      => 'count(*)' } }
  ...
end

This all worked OK – but once the number of products being viewed grew into the thousands, generating the data for the view took longer and longer. At last count, it was running into the 50 second mark – way, way too long. The result of the calculation is cached for 30 minutes, but that means that every 30 minutes some poor user had to wait ~50 seconds for the “Most Viewed” section to render. Time for a rethink.

There’s one obvious problem with the above method – every product viewed in the last 7 days is returned from the named_scope and instantiated, and only then are count (by default, 10) Products sliced off the result for display. Time to update the named scope so that it returns only the required number of products and, as a bonus, returns them in the right order, removing the need for the reverse call.

class Product < ActiveRecord::Base
  ...
  def self.get_popular(period = 7.days.ago, count = 10)
    View.popular(period, count).collect {|v| v.product }
  end
  ...
end

class View < ActiveRecord::Base
  ...
  named_scope :popular, lambda { |time_ago, freq| { :group      => 'product_id',
                                                    :conditions => ['created_on > ?', time_ago],
                                                    :include    => :product,
                                                    :order      => 'count(*) desc',
                                                    :limit      => freq } }
  ...
end

Updating the named_scope to return the required number of Products (in the right order) reduced the time from 56 seconds to 6. In fact, subsequent calls returned in around 2 seconds, no doubt due to some caching on the database side. Below is the graph showing network traffic to the database host. You can see periodic spikes, every 30 minutes, as the query ran and thousands of Products were pulled back from the database to be instantiated. After the update, just after 15:00, the traffic becomes much steadier.

[Graph: database network traffic]

Moving that logic from the Ruby side to the database side resulted in a pretty substantial performance improvement.
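For the curious, the updated scope is roughly equivalent to asking the database for the grouped, ordered, limited result directly – something like this sketch using find_by_sql (the SQL is approximate, not exactly what ActiveRecord generates):

View.find_by_sql(["SELECT product_id, count(*) AS view_count
                   FROM views
                   WHERE created_on > ?
                   GROUP BY product_id
                   ORDER BY count(*) DESC
                   LIMIT 10", 7.days.ago])

The database does the grouping, sorting and limiting, and only ten rows ever come back to Ruby, which is exactly why the page got so much faster.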

To compress?

If Booko looks up 1000 book prices in 1 day, it will be making a request to each of 33 online stores for every one of those books, which is 33,000 requests in total. How much quota could be saved by using HTTP compression? Picking a page such as the one below from Book Depository, I did a quick test. First, I fired up the Booko console, set the URL, grabbed the page and checked how big it is:

>> url = "http://www.bookdepository.co.uk/browse/book/isbn/9780141014593"
>> open(url).length
=> 38122

That’s the length in characters, which is ~ 37 KB. Let’s turn on compression and see if it makes much difference:

>> open(url, "Accept-encoding" => "gzip;q=1.0,deflate;q=0.6,identity;q=0.3").length
=> 7472

That’s around 7 KB, which is about 20% of the non-compressed version.

So, 1000 books from 33 shops is 33,000 requests per day. If they were all 37 KB (of course they aren’t, but let’s play along) we get around 1,200 MB of data, or 1.2 GB. If they were all compressed down to 7 KB, that would come to around 235 MB. Using compression means there’s a trade-off – higher CPU utilisation. However, the price grabbers spend most of their time waiting for data to be transmitted – any reduction in this time should yield faster results with significantly lower bandwidth usage.
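One wrinkle worth noting: when you send the Accept-Encoding header yourself, open-uri hands back the raw gzipped bytes, so the grabber has to inflate them before parsing. A minimal sketch, assuming the server actually honours the header:

require 'open-uri'
require 'zlib'
require 'stringio'

url = "http://www.bookdepository.co.uk/browse/book/isbn/9780141014593"
compressed = open(url, "Accept-Encoding" => "gzip;q=1.0,deflate;q=0.6,identity;q=0.3").read

# Inflate the gzipped body before handing it to the parser.
page = Zlib::GzipReader.new(StringIO.new(compressed)).read
puts "compressed: #{compressed.length} bytes, inflated: #{page.length} bytes"

(If the server ignores the header and sends plain HTML, GzipReader will raise, so the real grabber should check the Content-Encoding of the response first.)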

No prizes for guessing the next feature I’m working on adding to Booko 😉

Update: Thanks to Anish for fixing my overly pessimistic calculation that 1,200 MB == 1.2 TB.

New hardwarez for Booko

After a false start on Thursday morning, Booko is now distributed across multiple servers: the Web server, the Database server and the Price server (well, price grabber). Having these components separated out will make expanding Booko much easier than when it was deployed to a single host.

Once the web server load grows higher than a single host can handle, the next step will be to load balance the web servers – with nginx or Varnish or HAProxy or maybe even Apache.

I took the opportunity of some downtime to upgrade to the latest versions of Ruby Enterprise Edition and Passenger (mod_rails) from the awesome guys at Phusion.

All this means that Booko should now be a touch snappier. Enjoy!