Category Archives: Ruby

On Daemons

As you might imagine (depending on just how nerdy and imaginative you are), Booko is a poster child for the concept of long-running background tasks. Grabbing prices from 40 online stores isn't a fast process, and you certainly wouldn't want your front-end web servers making your users wait as long as the slowest of the 40 stores before responding to a request.

Over the years, I've tried various approaches to running user-level daemons. My first attempt was OK – I rolled my own and slowly improved it. It could handle HUP signals, write PID files, die gracefully, and it knew if it hadn't died properly and would attempt to kill zombie versions of itself. It had stop / start / restart commands. But it wasn't all sweetness and light. What happens when it dies? This is probably the trickiest part of running daemons (well, having to fork twice and make sure you've detached from the terminal is probably trickier, but still).

So, how do you make sure your daemon is running? Cron immediately springs to mind. So part two of writing your own daemons is writing something to keep them going. You may have found yourself in this position and felt a little tickle in the back of your mind when you set up a cron job to solve this problem. My cron job looked at the daemon's log file's modified time and, if it was more than 5 minutes old, looked for the PID file and sent that process a KILL signal.
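In Ruby, the watchdog half of that cron job might look something like this – a sketch only, with made-up log and PID file paths:

```ruby
# Sketch of the cron-driven watchdog described above.
# The paths here are illustrative, not Booko's real ones.
LOG_FILE = "/var/log/booko/fetcher.log"
PID_FILE = "/var/run/booko/fetcher.pid"
MAX_AGE  = 5 * 60 # five minutes, in seconds

# Has the daemon stopped writing to its log?
def stale?(log_file, max_age = MAX_AGE)
  Time.now - File.mtime(log_file) > max_age
end

# If the log has gone quiet, kill the recorded PID so cron (or init)
# can start a fresh copy.
def kill_stale_daemon(log_file = LOG_FILE, pid_file = PID_FILE)
  return unless stale?(log_file)
  pid = File.read(pid_file).to_i
  Process.kill("KILL", pid)
rescue Errno::ENOENT, Errno::ESRCH
  # Log or PID file missing, or process already gone - nothing to do.
end
```

Run from cron every five minutes, this is all the "monitoring" the original setup amounted to.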

It’s an easy, stable solution to the problem at hand – albeit with a 5 minute lag to detect crashed daemons. It’s ok because I run multiple daemons which can take the load if one dies.  But, what happens if you only have a single daemon? Increase the frequency of checking?  Cron’s smallest resolution is 1 minute – that’s not really ok (depending on what your daemon does, it may be fine).  But now you have to make sure that your daemon’s writing to the log at least every minute.  Ugh.

This solution is starting to smell. So, what does everyone else do? Well, I checked out God – but it just doesn't feel like an elegant solution to this problem. It may solve it nicely, but surely there's a better way? Hardcore nerds would probably move on to daemontools, but that's too much work for me.

That tickle you may have had in the back of your mind earlier was your subconscious telling you the problem is already solved and you already use it for your webserver, mail server, DNS server, ssh server and more. Your operating system can provide this exact service for you. Since I’m using Ubuntu that service is provided by Upstart.

Running your service with Upstart has two very nice consequences. Firstly, you can remove all the code used to manage daemonising. You can now write your code to hang around in the foreground. Leaving your code in the foreground while you're in development mode is good anyway – you can watch it more closely. If you really want to daemonise in your dev environment, bang up a tiny Ruby script with the daemons gem, which calls your actual script and manages PIDs, signals and a stop/start interface for you.

Setting up a service to run with Upstart requires just a config file – here’s one I prepared earlier:

[sourcecode lang="bash"]
description "Price Fetcher Upstart script"
author "Dan Milne"

start on startup
stop on shutdown

respawn
console output

instance $FID

env RAILS_ENV=production
export RAILS_ENV

exec sudo -u booko RAILS_ENV=production /opt/ruby-enterprise/bin/ruby /var/www/ $FID
[/sourcecode]

That file gets named "fetcher.conf" and goes in the /etc/init/ directory. This has some nice features, the first of which is that once it's started, it will keep running. If it dies, it'll respawn (you can see the option right there in the script). The fact that it died goes in /var/log/daemons – but what's even awesomer, you can run multiple instances of the same script by passing in FID=0 or FID=1 etc. when starting it. Finally, it gets the standard init features; you can start it with 'service fetcher start FID=0', for example.

The only missing feature that I can see is that, because I need to pass FID=0 to the script, it doesn't start at boot. There appears to be no way of saying "start two of these at boot time".

In summary, if you use your OS init services, you get to write simpler code, get respawning at an OS level and you get all the normal daemon control features.

Booko’s moved, features added.

I’ve been working on a beta version of Booko for, like, 7 months now.  I finally upgraded it while moving from Slicehost to Linode.  I’ve made a large number of changes to the way Booko performs long running tasks and how those tasks communicate. But I’ll leave the nerdy stuff for later.

The biggest change from a user point of view is some integration with Freebase. You'll notice extra information appear in book listings now. For example, the Booko page for The Girl with the Dragon Tattoo now tells you:

  • that the book is part of the Millennium Trilogy
  • the other books in the series (The Girl Who Played with Fire & The Girl Who Kicked the Hornets' Nest)
  • the other editions of The Girl with the Dragon Tattoo – hardcover and paperback.

The data at Freebase is a long way from complete, but it’s constantly growing. This should be a very useful feature. I’ll be doing more to integrate Freebase into the search results – for example, so a book shows up only once, listing the different editions in that single search result item.

List management also got an overhaul. That old list manager page was pretty bad. There’s still work to do, but it should be far more usable now. Log in and check it out!

There are still bugs to fix (soooo many missing images for cover art) and features to add (smarter list price calculation, used books).

Google / Yahoo user?

Logging into Booko just got easier. If you have a Google or Yahoo account, just hit the appropriate button and you’re in, registration included.

Turns out, this was super easy to add to Booko since it already does vanilla OpenID logins. Basically, Booko just fills in the OpenID URL with “” or “” and OpenID Directed Identity does the rest. You could also just type those URLs in yourself and it’ll work just the same.  Sweet!

So awesome.

There are times I really enjoy using Ruby on Rails. Recently, Fishpond started 403'ing HTTP requests for cover images if the referrer isn't their own site. Sites do this so that other sites don't steal their bandwidth. Really, Booko should be downloading the images and serving them itself (it's on the todo list, BTW). Since Booko had been using Fishpond image URLs to display covers, you may have noticed a bunch of missing cover images – some of them are caused by Fishpond's new (completely reasonable) policy.

So I've updated the code so I don't link to Fishpond images, but now I need to go through every product Booko's ever seen and update those with a Fishpond image URL. This is laughably easy with Ruby on Rails. Just fire up the console and run this:

[sourcecode lang="ruby"]
Product.find_each do |p|
  if p.image_url =~ /fishpond/
    puts "updating details for #{p.gtin}"
    # Look up a fresh cover elsewhere, or clear the stale URL.
    # (find_replacement_image stands in for the real lookup code.)
    p.image_url = find_replacement_image(p)
    p.save
  end
end
[/sourcecode]
The Rails console gives you access to all the data and models of your application – and this code, just pasted in, will find links to all Fishpond images, find a replacement image, or set it to nil. Point of interest – Booko has 396,456 products in its database.  Iterating with Product.all.each would load every product into memory before hitting the each – that would probably never return. On the other hand Product.find_each loads records in batches of 1000 by default.  Pretty cool.
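Under the hood, find_each works roughly like the following toy sketch – batches keyed off the primary key, with a fake in-memory table standing in for ActiveRecord:

```ruby
# Toy illustration of batched iteration in the spirit of find_each.
# FakeTable stands in for the products table; it is not ActiveRecord.
class FakeTable
  def initialize(rows)
    @rows = rows.sort_by { |r| r[:id] }
  end

  # Up to batch_size rows with id greater than last_id, in id order.
  def batch_after(last_id, batch_size)
    @rows.select { |r| r[:id] > last_id }.first(batch_size)
  end
end

# Walk the whole table without ever holding more than one batch
# in memory - this is the trick that makes 396,456 rows tractable.
def find_each_sketch(table, batch_size = 1000)
  last_id = 0
  loop do
    batch = table.batch_after(last_id, batch_size)
    break if batch.empty?
    batch.each { |row| yield row }
    last_id = batch.last[:id]
  end
end
```

Each round trip fetches one batch, so memory stays flat no matter how many products there are.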

* Thanks to the blog post describing this feature.

Fun with git post-commit

While developing new features or bug fixes for Booko, I usually work in branches. This makes keeping things separate easy, and means I can easily keep the current production version clean and easy to find. But when changing branches I often have to restart the Rails server and the price grabber to pick up any changes. For example, if I'm adding a new shop in a branch, when I switch branches I want the price grabber to restart.

Turns out git makes this super easy. You just create a shell script: .git/hooks/post-checkout

That script gets called after checkout. So, mine is pretty simple:

[sourcecode language="bash"]
#!/bin/sh
./bin/fetch_price.rb 0 restart
thin restart
[/sourcecode]

There’s probably a better way to get Thin to reload itself, but this works nicely.

You can check out all the available hooks in the githooks documentation.

On Users and Passwords

Update: thanks to the commenters for pointing out some flaws in the logic of the previous version of this page. I’ve updated the page to incorporate their feedback.

I’ve been thinking about adding wish lists to Booko. Wishlists require an implementation of Users – after all, what’s the point of having Wishlists if you can’t change them or publish them?  Booko’s built on Ruby on Rails, so I had a look around for plugins, but, truth be told, I’m too much of a Ruby n00b to trust other people’s plugins.  I’m sure they’re easy to install, but how do you keep them up to date? Finding any kind of bug will mean reading and understanding the code and seriously, that’s as much work as implementing Users on my own.  Plus, I have concerns about their implementation which I’ll discuss more later.

So, I figure having users requires two bits of data:

  • Email address
  • Password

Having the email address as the login name makes sense to me – it’s unique and if someone forgets their password I can email them a password reset link.  No need to remember a separate username.  One day I’ll add OpenID because I’m a freetard and like the concept.

Now, passwords are valuable bits of information and they need to be protected. This may sound obvious, but they need to be protected for a couple of reasons:

  • only the actual user (or owner of the email address) can log in and manipulate Booko Wishlists
  • many users use the same email address and password on other sites (PayPal & Amazon for example).

Maybe you don't do this, but if you have a separate password for every site requiring a login, you're a better person than me. Someone getting hold of your email and password can lead to some … difficulties on other sites where you've used the same combination.

So, we need to protect passwords from prying eyes.  There are two main ways for your password to be discovered:

  • Database compromise
  • Web sniffing proxies.

So, what to do? We turn to hashing functions. Hashing functions (like MD5 & SHA*) can take information like a password and send it on a one-way trip. In effect, it is infeasible to work backwards from the hash to the password. However, some clever people have calculated the hashes of hundreds of millions of passwords and stored them – then they can simply look up the hash stored in Booko against their pre-calculated hashes and find the password. This is known as a dictionary attack.

To thwart this attack, we introduce a "salt". The salt is a random string of characters which is combined with the password before it is hashed. This makes the dictionary attack useless – the attacker would need an entire dictionary which combined your salt with all those guessed passwords. The correct way of combining a password with a salt is to use an HMAC function. In Ruby, it looks like this:

[sourcecode language="ruby"]
require 'hmac-sha1'

def self.do_hashing(password, salt)
  passwd_hmac = HMAC::SHA1.new(salt)
  passwd_hmac << password
  passwd_hmac.hexdigest
end
[/sourcecode]

Ideally, each User has their own salt. This means an attacker would need to generate an entire dictionary attack per user.
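Generating a per-user salt is a one-liner with Ruby's SecureRandom (the method name here is just for illustration):

```ruby
require 'securerandom'

# Give each new user record its own random salt.
def new_user_salt
  SecureRandom.hex(32) # 64 hex characters of randomness
end
```

Store the salt alongside the password hash; it doesn't need to be secret, only unique per user.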

In any case, this is where the other implementations of Users seem to stop. When you type your password into a form on a web page, it gets sent to the server – the server hashes your password (along with the salt) and checks if that hash matches the stored hash. If they match, you're in.

But there is still the matter of your password being sent over the internet, possibly via proxy servers which can listen in on the traffic. There are two ways to stop the password being sent in the clear: either hash the password first, or use SSL. If you're using SSL to send the passwords, you could stop here. Booko currently doesn't have SSL certificates, so we need another way to stop passwords travelling over the internet in the clear. How? Easy: we hash the password before sending it. We'll create a single salt for this hash and call it the transport salt.

On the server side, we’ll also hash the password with the transport salt, prior to hashing it with the per-user salt.

This means we send the browser the transport salt, and the browser calculates the hash. This hash is sent to the server. The server can validate the password as shown below:

[sourcecode language="ruby"]
require 'hmac-sha1'

def self.do_hashing(password, salt)
  passwd_hmac = HMAC::SHA1.new(salt)
  passwd_hmac << password
  passwd_hmac.hexdigest
end

final_hash_to_check = do_hashing(password_hash_from_browser, user_salt)

if final_hash_to_check == stored_password_hash
  # User provided the correct password.
  session[:user_id] = user.id
end
[/sourcecode]
In the code above, when the response comes back from the client, the server calculates the final hash by using the user_salt. This final hash is compared with the stored_password_hash – if they’re the same the client provided the correct password.
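For anyone without the hmac-sha1 gem, the same two-stage scheme can be sketched with Ruby's stdlib OpenSSL. The method names and the transport salt value here are made up for illustration:

```ruby
require 'openssl'

TRANSPORT_SALT = "site-wide-transport-salt" # illustrative value only

def hmac_sha1(salt, message)
  OpenSSL::HMAC.hexdigest('sha1', salt, message)
end

# Client side (in practice this runs in JavaScript in the browser):
# hash the password with the transport salt before transmission.
def client_hash(password)
  hmac_sha1(TRANSPORT_SALT, password)
end

# Server side: hash the transport hash again with the per-user salt
# and compare against the stored value.
def valid_password?(hash_from_browser, user_salt, stored_hash)
  hmac_sha1(user_salt, hash_from_browser) == stored_hash
end
```

The stored value is built the same way at registration time: hash the client hash with the user's own salt.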

So, where does this leave us?

  • At no point is the password sent in the clear, nor stored in the clear.
  • The client only sees the transport salt, not the per-user salt and can’t precalculate a dictionary attack
  • Each user has a separate salt, making it far more difficult for an attacker to perform a dictionary attack
  • Compromising the database and retrieving the hashes doesn't allow you to log in with that hash
  • Compromising the database and retrieving the hashes doesn’t allow you to log onto any other sites
  • Sniffing the hashed transport hash will allow an attacker to access your account

So, we've achieved some of our goals. The password is never sent in the clear. However, if an attacker snoops traffic between the client and server and gets a copy of the hashed password, that hash can be used to log on to the service.

If that's unacceptable, then moving to SSL is the next step. You can use SSL to protect the hash as it moves between the client and server, and it also protects the plain password from being discovered. Is there any point in doing both? Hashing the password prior to SSL is slightly more secure: if you hash the password before it leaves the client, there's no danger of the password appearing in the log files of your web server or application server. And if you have a dedicated host decrypting the SSL before passing requests to your web server, the password could otherwise be sniffed between those servers.

After all that, I’ve decided that Booko will use both hashing the password before transmission and, eventually, SSL.

Notes on SHA1 and extra security:
SHA1 hashes are very secure and can be calculated very fast. But the faster the hash, the faster someone can build a dictionary attack. Ideally, for this scenario, we want a slow hashing function – or, more correctly, we want the method that generates our hashes to be slow. This can easily be achieved by simply hashing the password multiple times. Here are some timings:

[sourcecode light="true" language="ruby"]
>> require 'hmac-sha1'
>> require 'benchmark'
>> salt = (0..256).map { ((0..9).to_a + ('a'..'z').to_a + ('A'..'Z').to_a).rand }.join
>> hash = nil
>> Benchmark.realtime { hash = HMAC::SHA1.new(salt) << "MyPassword" }
=> 6.60419464111328e-05
>> Benchmark.realtime { 100.times { hash = HMAC::SHA1.new(salt) << hash.hexdigest } }
=> 0.00437498092651367
>> Benchmark.realtime { 1000.times { hash = HMAC::SHA1.new(salt) << hash.hexdigest } }
=> 0.0426349639892578
>> Benchmark.realtime { 10000.times { hash = HMAC::SHA1.new(salt) << hash.hexdigest } }
=> 0.463771820068359
>> Benchmark.realtime { 100000.times { hash = HMAC::SHA1.new(salt) << hash.hexdigest } }
=> 4.64294099807739
[/sourcecode]

So, 10,000 hashes took around half a second on my 2.8GHz Core 2 Duo MacBook Pro. You can see that performing 10,000 hashes will seriously slow any attempt at creating a dictionary attack on your passwords. I haven't timed how long it takes in JavaScript, but you'd want to keep it under a second so that logging in isn't too painful.
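A tidier way to get a deliberately slow hash is PBKDF2, which applies an HMAC thousands of times internally and ships in Ruby's OpenSSL bindings – a sketch, with the iteration count playing the role of the repeated hashing above:

```ruby
require 'openssl'

# PBKDF2 key derivation: crank the iteration count up to slow
# down dictionary attacks, exactly as the manual loop above does.
def slow_hash(password, salt, iterations = 10_000)
  key = OpenSSL::PKCS5.pbkdf2_hmac_sha1(password, salt, iterations, 20)
  key.unpack('H*').first # hex-encode the 20-byte derived key
end
```

The result is deterministic for a given password, salt and iteration count, so it can be stored and compared like any other hash.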


You can find JavaScript libraries for doing HMAC on PAJ's Homepage or jsSHA on SourceForge.

On being Google’d

I love Google – they send me stacks of traffic and make sites like Booko reach a far greater audience than I could on my own. Recently, however, Google's taken a bigger interest in Booko than usual. These kinds of numbers are no problem in general – the webserver and database are easily capable of handling the load.

The problem for Booko, when Google comes calling, is that they request pages for specific books.

When one of these requests comes in, Booko will check how old the prices are – if they're more than 24 hours old, Booko will attempt to update them. Booko used to load the prices into the browser via AJAX – so, as far as I can tell, Google wasn't even seeing the prices. Further, Booko has a queuing system for price lookups, so each page Google requests adds a book to the lookup queue. Google views books faster than Booko can grab the prices, so we end up with hundreds of books scheduled for lookup, frustrating normal Booko users, who see a page full of spinning wheels and wonder why Booko isn't giving them prices. Meanwhile, the price grabbers are hammering through hundreds of requests from Google and, in turn, hammering all the sites Booko indexes. So, what to do?

Well, the first thing I did was drop Google’s traffic. I hate the idea of doing this – but really Booko needs to be available to people to use and being indexed by Google won’t help if you can’t actually use it. So to the iptables command we go:

[sourcecode language='shell']
iptables -I INPUT -s <googlebot address range> -j DROP
iptables -I INPUT -s <another googlebot range> -j DROP
[/sourcecode]

These commands will drop all Google traffic.

The next step was to sign up for Google Webmaster Tools and reduce the page crawl rate.

Google Webmaster Tools

Once I’d dialled back Google’s crawl rate, I dropped the iptables rules:

[sourcecode language='shell']
iptables -F
[/sourcecode]

To make Booko more Google friendly, the first code change was to have book pages rendered immediately with the available pricing (provided it’s complete) and have updates to that pricing delivered via AJAX. Google now gets to see the entire page and should (hopefully) provide better indexing.

The second change was to create a second queue for price updates – the bulk queue. The price grabbers will first check for regular price update requests – meaning people will get their prices first. Requests by bulk users, such as Google, Yahoo & Bing, will be added to the bulk queue and looked up when there are no normal requests.  In addition, I can restrict the number of price grabbers which will service the bulk queue.
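The two-queue idea boils down to a priority pop – a toy sketch, not Booko's actual code:

```ruby
# Two-queue scheduler: normal user requests always win, and bulk
# crawler requests (Google, Yahoo, Bing) wait until the normal
# queue is empty.
class PriceQueue
  def initialize
    @normal = []
    @bulk   = []
  end

  def push(book, bulk = false)
    (bulk ? @bulk : @normal) << book
  end

  # Grabbers call this in a loop: drain normal work first.
  def pop
    @normal.shift || @bulk.shift
  end
end
```

Restricting how many grabbers are allowed to service the bulk queue is then just a matter of which workers call pop when the normal queue is empty.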

This work has now opened up a new idea I’ve been thinking about – pre-emptively grab the prices of the previous day or week’s most popular titles. The idea would be to add these popular titles to the bulk queue during the quiet time between 03:00 and 06:00.   That would mean that when people viewed the title later that day, they’d be fresh.

I’ve just pushed these changes into the Booko site and with some luck, Google & Co will be happier, Booko users will be happier and I should be able to build new features with this ground work laid. Nice for a Sunday evening’s work.

Moving processing to the Database

It's been great fun writing Booko in Ruby on Rails for lots of reasons, and the ORM module, ActiveRecord, is a big part of what makes it enjoyable. I know ORMs exist in plenty of other languages, but RoR was my first exposure to one, and it makes writing database-backed applications much less tedious. As with all abstractions, though, looking under the covers can help you solve problems and improve performance.

Booko has a “Most Viewed” section, which finds the products which have been viewed the most over the last 7 days. It does this by having a model called “View” which, as you might guess, records product views. The Product class asks the View class for the top 10 viewed products:

[sourcecode language='ruby']
class Product < ActiveRecord::Base
  ...
  def self.get_popular(period = 7.days.ago, count = 10)
    View.popular(period).collect {|v| v.product }.reverse.slice(0...count)
  end
  ...
end
[/sourcecode]

The View makes use of named scopes:

[sourcecode language='ruby']
class View < ActiveRecord::Base
  ...
  named_scope :popular, lambda { |time_ago| {
    :group      => 'product_id',
    :conditions => ['created_on > ?', time_ago],
    :include    => :product,
    :order      => 'count(*)' } }
  ...
end
[/sourcecode]


This all worked OK – but once the number of products being viewed grew into the thousands, generating the data for the view took longer and longer. At last count, it was running into the 50-second mark – way, way too long. The result of the calculation is cached for 30 minutes, but that means that every 30 minutes some poor user had to wait ~50 seconds for the "Most Viewed" section to render. Time for a rethink.

There's one obvious problem with the above method: all products viewed in the last 7 days are returned from the named_scope and instantiated, and only then are count (by default, 10) products sliced off the result and displayed. Time to update the named scope so that it returns only the required number of products – and, as a bonus, returns them in the right order, removing the need for the reverse call.

[sourcecode language='ruby']
class Product < ActiveRecord::Base
  ...
  def self.get_popular(period = 7.days.ago, count = 10)
    View.popular(period, count).collect {|v| v.product }
  end
  ...
end
[/sourcecode]

[sourcecode language='ruby']
class View < ActiveRecord::Base
  ...
  named_scope :popular, lambda { |time_ago, freq| {
    :group      => 'product_id',
    :conditions => ['created_on > ?', time_ago],
    :include    => :product,
    :order      => 'count(*) desc',
    :limit      => freq } }
  ...
end
[/sourcecode]


Updating the named_scope to return only the required number of Products (in the right order) reduced the time from 56 seconds to 6. In fact, subsequent calls returned in the ~2 second range, no doubt due to some caching on the database side. Below is a graph showing network traffic to the database host. You can see periodic spikes, every 30 minutes, as the query ran and the database was hit for the thousands of Products to be instantiated. After the update, just after 15:00, the traffic becomes much steadier.

database traffic

Moving that logic from the Ruby side to the database side resulted in a pretty substantial performance improvement.

To compress?

If Booko looks up 1000 book prices in 1 day, it will be making 1000 queries to 33 online stores. How much quota could be saved by using HTTP compression? Picking a page such as the one below from Book Depository, I did a quick test. First, I fired up the Booko console and set the URL:
[sourcecode language='ruby']
>> url = ""
=> ""
[/sourcecode]

Next, grab the page and see how big it is:
[sourcecode language='ruby']
>> open(url).length
=> 38122
[/sourcecode]

That’s the length in characters, which is ~ 37 KB. Let’s turn on compression and see if it makes much difference:

[sourcecode language='ruby']
>> open(url, "Accept-encoding" => "gzip;q=1.0,deflate;q=0.6,identity;q=0.3").length
=> 7472
[/sourcecode]

That’s around 7 KB, which is about 20% of the non-compressed version.

So, 1000 books from 33 shops is 33,000 requests per day. If they were all 37 KB (of course they aren't, but let's play along) we get around 1,200 MB of data, or 1.2 GB. If they were all compressed down to 7 KB, that would come to around 235 MB. Using compression means a trade-off: higher CPU utilisation. However, the price grabbers spend most of their time waiting for data to be transmitted – any reduction in that time should yield faster results with significantly lower bandwidth usage.
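One wrinkle: open-uri won't decompress a gzip body for you, so the grabber has to inflate it itself. A sketch (URI.open is the modern spelling of the open call in the console session above; fetch_page and gunzip are made-up names):

```ruby
require 'open-uri'
require 'zlib'
require 'stringio'

# Inflate a gzip-encoded response body.
def gunzip(body)
  Zlib::GzipReader.new(StringIO.new(body)).read
end

# Fetch a page while advertising gzip support, inflating the body
# only if the server actually compressed it.
def fetch_page(url)
  URI.open(url, "Accept-Encoding" => "gzip;q=1.0,deflate;q=0.6,identity;q=0.3") do |f|
    body = f.read
    f.content_encoding.include?("gzip") ? gunzip(body) : body
  end
end
```

Either way the grabber ends up with the same HTML, just having moved far fewer bytes over the wire.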

No prizes for guessing the next feature I’m working on adding to Booko 😉

Update: Thanks to Anish for fixing my overly pessimistic calculation that 1,200 MB == 1.2 TB.