Category Archives: Ruby

Technical posts related to the programming language Booko is written in

On Daemons

August 7, 2010 | Booko, Ruby, Sysadmin | Dan Milne

As you might imagine (depending on just how nerdy and imaginative you are), Booko is a poster child for the concept of long running background tasks. Grabbing prices from 40 online stores isn’t a fast process and you certainly would not want your front end webservers making your users wait as long as the slowest of the 40 stores before responding to a user request.

Over the years, I’ve tried various approaches to running user level daemons. My first attempt was OK – I rolled my own and slowly improved it. It could handle HUP signals, write PID files, die gracefully and it knew if it hadn’t died properly and attempted to kill zombie versions of itself. It had stop / start / restart commands. But it wasn’t all sweetness and light. What happens when it dies? This is probably the trickiest part of running daemons (well, having to fork twice and make sure you have detached from the terminal is probably trickier, but still).

So, how do you make sure your daemon is running? Cron immediately springs to mind. So, part two of writing your own daemons is writing something to keep them going. You may have found yourself in this position and felt a little tickle in the back of your mind when you set up a cron job to solve this problem. My cron job looked at the daemon’s log file’s modified time and, if it was more than 5 minutes old, looked for the PID file and sent that process a KILL signal.

It’s an easy, stable solution to the problem at hand – albeit with a 5 minute lag to detect crashed daemons. It’s ok because I run multiple daemons which can take the load if one dies. But, what happens if you only have a single daemon? Increase the frequency of checking? Cron’s smallest resolution is 1 minute – that’s not really ok (depending on what your daemon does, it may be fine). But now you have to make sure that your daemon’s writing to the log at least every minute. Ugh.
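The cron check described above can be sketched roughly like this in Ruby. This is a minimal illustration, not Booko's actual script: the method names `stale?` and `kill_if_stale` and any paths you pass in are made up for the example.

```ruby
# Hypothetical watchdog helpers; the 5 minute threshold mirrors the prose.
STALE_AFTER = 5 * 60 # seconds

# True when the log file has not been modified for longer than max_age seconds.
def stale?(log_path, max_age = STALE_AFTER, now = Time.now)
  now - File.mtime(log_path) > max_age
end

# If the log looks stale, read the PID file and send that process a KILL.
# Returns the PID that was signalled, or nil if nothing was done.
def kill_if_stale(log_path, pid_path, max_age = STALE_AFTER)
  return nil unless File.exist?(log_path) && File.exist?(pid_path)
  return nil unless stale?(log_path, max_age)

  pid = Integer(File.read(pid_path).strip)
  Process.kill("KILL", pid)
  pid
rescue Errno::ESRCH
  nil # the process had already gone away
end
```

Cron would run something like this every few minutes and then start a fresh daemon. Note that it relies on the daemon writing to its log regularly, which is exactly the weakness discussed above.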

This solution is starting to smell. So, what does everyone else do? Well, I checked out God – but it just doesn’t feel like an elegant solution to this problem. It may solve the problem nicely, but there must be a better way? Hard core nerds would probably move on to daemontools but it’s too much work for me.

That tickle you may have had in the back of your mind earlier was your subconscious telling you the problem is already solved and you already use it for your webserver, mail server, DNS server, ssh server and more. Your operating system can provide this exact service for you. Since I’m using Ubuntu that service is provided by Upstart.

Running your service with Upstart has two very nice consequences. Firstly – you can remove all the code used to manage daemonising. You can now write your code to hang around in the foreground. Leaving your code in the foreground while you’re in development mode is good anyway – you can watch it more closely. If you really want to daemonise in your dev environment, bang up a tiny Ruby script with the Ruby daemons gem which calls your actual script and manages PIDs, signals and a stop/start interface for you.
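As a rough idea of what "hanging around in the foreground" can look like, here is a minimal sketch. The `ForegroundWorker` class is hypothetical and is not Booko's fetcher; it assumes the supervisor stops the process with TERM, which is Upstart's default behaviour.

```ruby
# Minimal foreground worker: no forking, no PID files.
# The supervisor handles restarts; the script only needs to exit cleanly.
class ForegroundWorker
  def initialize(interval = 1, &work)
    @interval = interval # seconds to sleep between units of work
    @work = work
    @running = true
  end

  def stop
    @running = false
  end

  def run
    # TERM arrives on "stop"; INT covers Ctrl-C during development.
    %w[TERM INT].each { |sig| Signal.trap(sig) { stop } }
    while @running
      @work.call(self)
      sleep @interval if @running
    end
  end
end

# Usage (commented out so the sketch does not block when loaded):
# ForegroundWorker.new(5) { |w| puts "fetching prices..." }.run
```

The signal handler only flips a flag, so the current unit of work always finishes before the loop exits.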

Setting up a service to run with Upstart requires just a config file – here’s one I prepared earlier:

description "Price Fetcher Upstart script"
author "Dan Milne"

start on startup
stop on shutdown

console output

respawn
instance $FID

script
# sudo usually resets the environment, so RAILS_ENV is passed on its command line
exec sudo -u booko RAILS_ENV=production /opt/ruby-enterprise/bin/ruby /var/www/booko.com.au/booko/bin/fetcher.rb $FID
end script

That file gets named “fetcher.conf” and goes in the /etc/init/ directory. This has some nice features; the first of which is that once it’s started, it will keep running. If it dies, it’ll respawn (you can see the option right there in the script). The fact that it died goes in /var/log/daemons – but what’s even awesomer, you can run multiple instances of the same script, by passing in FID=0 or FID=1 etc when you’re starting it. Finally, it gets the standard init features. You can start it with ‘service fetcher start FID=0’ for example.

The only missing feature that I can see is that, because I need to pass in FID=0 to the script, it doesn’t start at bootup. There appears to be no way of stating “Start up 2 of these at boot time”.

In summary, if you use your OS init services, you get to write simpler code, get respawning at an OS level and you get all the normal daemon control features.

Booko’s moved, features added.

August 1, 2010 | Booko, Ruby, Sysadmin | Dan Milne

I’ve been working on a beta version of Booko for, like, 7 months now. I finally upgraded it while moving from Slicehost to Linode. I’ve made a large number of changes to the way Booko performs long running tasks and how those tasks communicate. But I’ll leave the nerdy stuff for later.

The biggest change from a user point of view is some integration into Freebase.com. You’ll notice extra information appear in book listings now. For example, the Booko page for The Girl with the Dragon Tattoo now tells you:

that the book is part of the Millennium Trilogy
the other books in the series (The Girl who played with Fire & The Girl who kicked the Hornets’ Nest)
the other editions of The Girl with the Dragon Tattoo – hardcover and paperback.

The data at Freebase is a long way from complete, but it’s constantly growing. This should be a very useful feature. I’ll be doing more to integrate Freebase into the search results – for example, so a book shows up only once, listing the different editions in that single search result item.

List management also got an overhaul. That old list manager page was pretty bad. There’s still work to do, but it should be far more usable now. Log in and check it out!

There are still bugs to fix (soooo many missing images for cover art) and features to add (smarter list price calculation, used books).

Google / Yahoo user?

December 30, 2009 | Booko, Ruby | Dan Milne

Logging into Booko just got easier. If you have a Google or Yahoo account, just hit the appropriate button and you’re in, registration included.

Turns out, this was super easy to add to Booko since it already does vanilla OpenID logins. Basically, Booko just fills in the OpenID URL with “https://www.google.com/accounts/o8/id” or “http://yahoo.com/” and OpenID Directed Identity does the rest. You could also just type those URLs in yourself and it’ll work just the same. Sweet!

So awesome.

October 25, 2009 | Booko, Ruby | Dan Milne

There are times I really enjoy using Ruby on Rails. Recently, Fishpond started 403’ing HTTP requests for cover images if the referrer isn’t fishpond.com.au. Sites do this so that other sites don’t steal their bandwidth. Really, Booko should be downloading the images and serving them itself (it’s on the todo list BTW). Since Booko had been using Fishpond image URLs to display covers, you may have noticed a bunch of missing cover images – some of them are caused by Fishpond’s new (completely reasonable) policy.

So I’ve updated the code so I don’t link to Fishpond images, but now I need to go through every product Booko’s ever seen and update those with a Fishpond image URL. This is laughably easy with Ruby on Rails. Just fire up the console and run this:

Product.find_each do |p|
  if p.image_url =~ /fishpond/
    puts "updating details for #{p.gtin}"
    p.image_url=nil
    p.get_detail
    p.save
  end
end

The Rails console gives you access to all the data and models of your application – and this code, just pasted in, will find every product with a Fishpond image URL and either find a replacement image or leave it set to nil. Point of interest – Booko has 396,456 products in its database. Iterating with Product.all.each would load every product into memory before hitting the each – that would probably never return. On the other hand, Product.find_each loads records in batches of 1000 by default. Pretty cool.
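To see why batching keeps memory bounded, here is a toy, plain-Ruby imitation of the idea behind find_each: fetch a limited batch ordered by id, process it, then ask for the next batch after the last id seen. `Row`, `fetch_batch` and `each_in_batches` are invented for this sketch and are not ActiveRecord APIs.

```ruby
# Toy stand-in for a database table; not ActiveRecord.
Row = Struct.new(:id, :image_url)

# Emulates "WHERE id > last_id ORDER BY id LIMIT batch_size".
def fetch_batch(rows, last_id, batch_size)
  rows.select { |r| r.id > last_id }.sort_by(&:id).first(batch_size)
end

# Yields every row, but only ever holds one batch at a time.
def each_in_batches(rows, batch_size = 1000)
  last_id = 0
  loop do
    batch = fetch_batch(rows, last_id, batch_size)
    break if batch.empty?
    batch.each { |row| yield row }
    last_id = batch.last.id
  end
end
```

With a real database only the current batch is instantiated, so a table of roughly 400,000 products is walked 1000 records at a time rather than loaded all at once.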

* Thanks to http://ryandaigle.com/ for posting about this feature.

Fun with git post-commit

October 24, 2009 | Booko, Ruby | Dan Milne

While developing new features or bug fixes for Booko, I usually work in branches. This keeps things separate, and means I can easily keep the current production version clean and easy to find. But when changing branches I often have to restart the Rails server and the price grabber to pick up any changes. For example, if I’m adding a new shop in a branch, when I switch branches I want the price grabber to restart.

Turns out git makes this super easy. You just create a shell script: .git/hooks/post-checkout

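That script gets called after every checkout, and it needs to be executable. Mine is pretty simple:

#!/bin/sh
# Restart the price grabber and the web server so they pick up the new branch
./bin/fetch_price.rb 0 restart
thin restart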

There’s probably a better way to get Thin to reload itself, but this works nicely.

You can check out all the hooks here: http://www.kernel.org/pub/software/scm/git/docs/v1.5.5.4/hooks.html

Now with all new REE + Phusion.

October 11, 2009 | Booko, Ruby, Sysadmin | Dan Milne

The excellent people at Phusion have released a 1.8.7 based version of their fantastic Ruby Enterprise Edition. I’ve just updated to it and Booko sure feels snappier. I’ve also upgraded to the latest mod_rails (aka, phusion-passenger) so we’re all up-to-date on my medium-ticket sysadmin work.

On being Google’d

August 23, 2009 | Booko, Ruby, Sysadmin | Dan Milne

I love Google – they send me stacks of traffic and make sites like Booko reach a far greater audience than I could manage on my own. Recently, however, Google’s taken a bigger interest in Booko than usual. That kind of traffic is no problem in general – the webserver and database are easily capable of handling the load.

The problem for Booko, when Google comes calling, is that they request pages for specific books such as:

http://www.booko.com.au/books/isbn/9780140232929

When this request comes in, Booko will check to see how old the prices are – if they’re more than 24 hours old, Booko will attempt to update the prices. Booko used to load the prices into the browser via AJAX – so, as far as I can tell, Google wasn’t even seeing the prices. Further, Booko has a queuing system in place for requests to look up prices, so when Google requests pages, this adds a book to the queue of books to be looked up. Google views books faster than Booko can grab the prices, so we end up with hundreds of books scheduled for lookup, frustrating normal Booko users, who see the problem as a page full of spinning wheels and wonder why Booko isn’t giving them prices. Meanwhile, the price grabbers are hammering through hundreds of requests from Google and, in turn, hammering all the sites Booko indexes. So, what to do?

Well, the first thing I did was drop Google’s traffic. I hate the idea of doing this – but really Booko needs to be available to people to use and being indexed by Google won’t help if you can’t actually use it. So to the iptables command we go:

iptables -I INPUT -s 66.249.71.108 -j DROP
iptables -I INPUT -s 66.249.71.130 -j DROP

These commands drop all traffic from those two Google addresses.

The next step was to sign up for Google Webmaster Tools and reduce the page crawl rate.

Once I’d dialled back Google’s crawl rate, I dropped the iptables rules (note that -F flushes every rule in the filter table, not just these two):

iptables -F

To make Booko more Google friendly, the first code change was to have book pages rendered immediately with the available pricing (provided it’s complete) and have updates to that pricing delivered via AJAX. Google now gets to see the entire page and should (hopefully) provide better indexing.

The second change was to create a second queue for price updates – the bulk queue. The price grabbers will first check for regular price update requests – meaning people will get their prices first. Requests by bulk users, such as Google, Yahoo & Bing, will be added to the bulk queue and looked up when there are no normal requests. In addition, I can restrict the number of price grabbers which will service the bulk queue.
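The two-queue idea can be sketched in a few lines of Ruby. This is only an in-memory illustration under stated assumptions: `PriceQueues`, its method names and the user-agent patterns are hypothetical, and Booko's real queues will look different.

```ruby
# Hypothetical sketch: normal requests always win over bulk (crawler) requests.
class PriceQueues
  BULK_AGENTS = /googlebot|slurp|bingbot/i # assumed crawler user-agent patterns

  def initialize
    @normal = []
    @bulk = []
  end

  # Route a lookup request to the bulk queue when it comes from a known crawler.
  def enqueue(isbn, user_agent = "")
    queue = user_agent =~ BULK_AGENTS ? @bulk : @normal
    queue << isbn
  end

  # Price grabbers ask for work here. Passing serve_bulk = false keeps a
  # grabber reserved for real visitors.
  def next_job(serve_bulk = true)
    return @normal.shift unless @normal.empty?
    serve_bulk ? @bulk.shift : nil
  end
end
```

A grabber that calls `next_job(false)` never touches crawler requests, which is one way to restrict how many grabbers service the bulk queue.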

This work has now opened up a new idea I’ve been thinking about – pre-emptively grabbing the prices of the previous day or week’s most popular titles. The idea would be to add these popular titles to the bulk queue during the quiet time between 03:00 and 06:00. That would mean that when people viewed a title later that day, its prices would be fresh.

I’ve just pushed these changes into the Booko site and with some luck, Google & Co will be happier, Booko users will be happier and I should be able to build new features with this ground work laid. Nice for a Sunday evening’s work.

Moving processing to the Database

May 30, 2009 | Booko, Ruby | Dan Milne

It’s been great fun writing Booko in Ruby on Rails for lots of reasons, and the ORM module, ActiveRecord, is a big part of what makes it enjoyable. I know that ORMs exist in plenty of other languages, but RoR was my first exposure to one and it makes writing database backed applications much less tedious. As with all abstractions, though, looking under the covers can help you solve problems and improve performance.

Booko has a “Most Viewed” section, which finds the products which have been viewed the most over the last 7 days. It does this by having a model called “View” which, as you might guess, records product views. The Product class asks the View class for the top 10 viewed products:

class Product < ActiveRecord::Base
  ...
  def self.get_popular(period = 7.days.ago, count = 10)
    # The scope is ordered ascending by view count, so reverse and keep the top `count`
    View.popular(period).collect { |v| v.product }.reverse.slice(0...count)
  end
  ...
end

The View makes use of named scopes:

class View < ActiveRecord::Base
  ...
  # Views since time_ago, grouped per product, least viewed first
  named_scope :popular, lambda { |time_ago|
    { :group      => 'product_id',
      :conditions => ['created_on > ?', time_ago],
      :include    => :product,
      :order      => 'count(*)' }
  }
  ...
end

This all worked OK – but once the number of products being viewed started to grow into the thousands, it took longer and longer to generate the data for the view. At last count, it was running into the 50 second mark – way, way too long. The result of the calculation is cached for 30 minutes, but that means that every 30 minutes some poor user had to wait ~50 seconds for the “Most Viewed” section to render. Time for a rethink.

There’s one obvious problem with the above method – every product viewed in the last 7 days is returned from the named_scope and instantiated, and only then are the top count (by default, 10) Products sliced off the result and displayed. Time to update the named scope so that it returns only the required number of products and, as a bonus, returns them in the right order, removing the need for the reverse method call.

class Product < ActiveRecord::Base
  ... 
  def self.get_popular(period = 7.days.ago, count = 10 ) 
    View.popular(period, count).collect {|v| v.product } 
  end 
  ... 
end 

class View < ActiveRecord::Base
  ...
  # Top `freq` products by view count since time_ago, most viewed first
  named_scope :popular, lambda { |time_ago, freq|
    { :group      => 'product_id',
      :conditions => ['created_on > ?', time_ago],
      :include    => :product,
      :order      => 'count(*) desc',
      :limit      => freq }
  }
  ...
end

Updating the named_scope to return only the required number of Products (in the right order) reduced the time from 56 seconds to 6. In fact, subsequent calls returned in around 2 seconds, no doubt due to some caching on the database side. The graph of network traffic to the database host shows periodic spikes every 30 minutes, as the query ran and the database was hit for the thousands of Products to be instantiated. After the update, just after 15:00, the traffic becomes much steadier.

Moving that logic from the Ruby side to the database side resulted in a pretty substantial performance improvement.

To compress?

May 16, 2009 | Booko, Ruby | Dan Milne

If Booko looks up 1000 book prices in 1 day, it will be making 1000 queries to each of 33 online stores. How much quota could be saved by using HTTP compression? Picking a page from Book Depository, I did a quick test. First, I fired up the Booko console, set the URL, grabbed the page and checked how big it was:

>> url = "http://www.bookdepository.co.uk/browse/book/isbn/9780141014593"
>> open(url).length
=> 38122

That’s the length in characters, which is ~ 37 KB. Let’s turn on compression and see if it makes much difference:

>> open(url, "Accept-encoding" => "gzip;q=1.0,deflate;q=0.6,identity;q=0.3").length
=> 7472

That’s around 7 KB, which is about 20% of the non-compressed version.
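One practical consequence: if the client hands back the still-compressed bytes (as the shorter length above suggests), the body has to be decoded before it can be parsed. A minimal sketch using Ruby's standard zlib library follows; `decode_body` is a hypothetical helper, and it assumes you have the response's Content-Encoding header value to hand.

```ruby
require "zlib"
require "stringio"

# Decode an HTTP response body according to its Content-Encoding value.
# Anything unrecognised (including nil) is returned untouched.
def decode_body(body, content_encoding)
  case content_encoding.to_s.downcase
  when "gzip", "x-gzip"
    Zlib::GzipReader.new(StringIO.new(body)).read
  when "deflate"
    # Assumes the zlib-wrapped format; some servers send raw deflate instead.
    Zlib::Inflate.inflate(body)
  else
    body
  end
end
```

Checking the header rather than assuming gzip matters because a store is free to ignore the Accept-Encoding request and reply uncompressed.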

So, 1000 books from 33 shops is 33,000 requests per day. If they were all 37KB (of course they aren’t but let’s play along) we get around 1,200 MB of data or 1.2 GB. If they were all compressed down to 7KB, that would come to around 235 MB. Using compression means there’s a trade off – higher CPU utilisation. However, the price grabbers spend most of their time waiting for data to be transmitted – any reduction in this time should yield faster results with significantly lower bandwidth usage.
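For anyone who wants to check the arithmetic, here it is as a few lines of Ruby, using the two page lengths measured in the console and treating 1 MB as 1024 × 1024 bytes. The constant and method names are just for this example.

```ruby
REQUESTS_PER_DAY = 1000 * 33     # 1000 books x 33 shops
PLAIN_BYTES      = 38_122        # uncompressed page length from the console
GZIP_BYTES       = 7_472         # compressed page length from the console
MIB              = 1024.0 * 1024 # bytes per megabyte (binary)

# Daily transfer in MB if every page were bytes_per_page long.
def daily_mib(bytes_per_page, requests = REQUESTS_PER_DAY)
  bytes_per_page * requests / MIB
end

puts format("uncompressed: %.0f MB/day", daily_mib(PLAIN_BYTES)) # about 1200
puts format("compressed:   %.0f MB/day", daily_mib(GZIP_BYTES))  # about 235
puts format("ratio:        %.1f%%", 100.0 * GZIP_BYTES / PLAIN_BYTES)
```

The ratio comes out just under 20%, matching the estimate above.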

No prizes for guessing the next feature I’m working on adding to Booko 😉

Update: Thanks to Anish for fixing my overly pessimistic calculation that 1,200 MB == 1.2 TB.

New hardwarez for Booko

May 16, 2009 | Booko, Ruby | Dan Milne

After a false start on Thursday morning, Booko is now distributed across multiple servers: the Web server, the Database server and the Price server (well, price grabber). Having these components separated out will make expanding Booko much easier than when it was deployed to a single host.

Once the web server load grows higher than a single host can handle, the next step will be to load balance the web servers – with nginx or Varnish or HAProxy or maybe even Apache.

I took the opportunity of some downtime to upgrade to the latest versions of Ruby Enterprise Edition and Passenger (mod_rails) from the awesome guys at Phusion.

All this means that Booko should now be a touch snappier. Enjoy!
