Category Archives: Sysadmin

Technical posts about the inner workings of Booko

On Daemons

As you might imagine (depending on just how nerdy and imaginative you are), Booko is a poster child for the concept of long-running background tasks. Grabbing prices from 40 online stores isn’t fast, and you certainly don’t want your front-end webservers making users wait for the slowest of those 40 stores before responding.

Over the years, I’ve tried various approaches to running user-level daemons. My first attempt was ok – I rolled my own and slowly improved it. It could handle HUP signals, write PID files, die gracefully, and it knew if it hadn’t died properly and would attempt to kill zombie versions of itself. It had stop / start / restart commands. But it wasn’t all sweetness and light. What happens when it dies? That’s probably the trickiest part of running daemons (well, having to fork twice and make sure you’ve detached from the terminal is probably trickier, but still).
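For reference, the daemonising dance itself looks roughly like this – a minimal Ruby sketch of the classic double-fork, not Booko’s actual code:

def daemonise(pid_file)
  exit if fork                     # parent exits; first child carries on
  Process.setsid                   # become session leader, drop the controlling TTY
  exit if fork                     # second fork: we can never re-acquire a TTY
  Dir.chdir "/"                    # don't keep a mounted directory busy
  STDIN.reopen "/dev/null"         # detach the standard streams
  STDOUT.reopen "/dev/null", "a"
  STDERR.reopen STDOUT
  File.open(pid_file, "w") { |f| f.puts Process.pid }
end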

So, how do you make sure your daemon is running? Cron immediately springs to mind. So part two of writing your own daemons is writing something to keep them going. You may have found yourself in this position and felt a little tickle in the back of your mind when you set up a cron job to solve this problem. My cron job looked at the daemon’s log file’s modified time and, if it was more than 5 minutes old, looked up the PID file and sent that process a KILL signal.
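Concretely, the watchdog was along these lines – a reconstruction with hypothetical paths, run from cron every five minutes:

*/5 * * * * /usr/bin/ruby /home/booko/bin/check_fetcher.rb

# check_fetcher.rb – if the log has been quiet for five minutes,
# assume the daemon is wedged, kill it and start a fresh one.
LOG = "/var/log/fetcher.log"
PID = "/var/run/fetcher.pid"

if File.exist?(LOG) && Time.now - File.mtime(LOG) > 5 * 60
  pid = (File.read(PID).to_i rescue nil)
  Process.kill("KILL", pid) if pid && pid > 0
  system "/home/booko/bin/fetcher start"   # hypothetical restart command
end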

It’s an easy, stable solution to the problem at hand – albeit with a 5 minute lag to detect crashed daemons. It’s ok because I run multiple daemons which can take the load if one dies.  But, what happens if you only have a single daemon? Increase the frequency of checking?  Cron’s smallest resolution is 1 minute – that’s not really ok (depending on what your daemon does, it may be fine).  But now you have to make sure that your daemon’s writing to the log at least every minute.  Ugh.

This solution is starting to smell. So, what does everyone else do? Well, I checked out God – but it just doesn’t feel like an elegant solution to this problem. It may solve it nicely, but surely there’s a better way. Hard-core nerds would probably move on to daemontools, but that’s too much work for me.

That tickle you may have felt in the back of your mind earlier was your subconscious telling you the problem is already solved, and that you already use the solution for your webserver, mail server, DNS server, SSH server and more. Your operating system can provide this exact service for you. Since I’m using Ubuntu, that service is provided by Upstart.

Running your service with Upstart has two very nice consequences. Firstly, you can remove all the code used to manage daemonising and simply write your code to hang around in the foreground. Keeping your code in the foreground during development is good anyway – you can watch it more closely. If you really want to daemonise in your dev environment, bang up a tiny Ruby script with the daemons gem, which calls your actual script and manages PIDs, signals and a stop/start interface for you. Secondly – and we’ll get to this – the OS takes over keeping it alive.
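That wrapper can be tiny. Something like this, using the daemons gem (the control script here is an assumption, not Booko’s actual code):

#!/usr/bin/env ruby
# fetcher_ctl.rb – dev-only wrapper: the daemons gem provides
# start/stop/restart/run commands, PID files and signal handling.
require 'rubygems'
require 'daemons'

Daemons.run(File.expand_path('../fetcher.rb', __FILE__))

Run it with ‘ruby fetcher_ctl.rb start’ (or stop, restart, or run to stay in the foreground).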

Setting up a service to run with Upstart requires just a config file – here’s one I prepared earlier:

description "Price Fetcher Upstart script"
author "Dan Milne"

start on startup
stop on shutdown

console output

respawn
instance $FID

script
env RAILS_ENV=production
export RAILS_ENV

exec sudo -u booko RAILS_ENV=production /opt/ruby-enterprise/bin/ruby /var/www/booko.com.au/booko/bin/fetcher.rb $FID
end script

That file gets named “fetcher.conf” and goes in the /etc/init/ directory. This setup has some nice features. The first is that once the job is started, Upstart keeps it running: if it dies, it respawns (you can see the respawn option right there in the config) and the fact that it died goes in /var/log/daemons. Even better, you can run multiple instances of the same job by passing in FID=0 or FID=1 and so on when starting it. Finally, you get the standard init features – you can start it with ‘service fetcher start FID=0’, for example.
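Managing instances from the shell looks like this (initctl’s start/stop/status commands take the instance variable as a KEY=value argument):

sudo start fetcher FID=0
sudo start fetcher FID=1
sudo status fetcher FID=0
sudo stop fetcher FID=1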

The only missing feature I can see is that, because I need to pass FID=0 to the job, it doesn’t start at boot. There appears to be no way of saying “start up 2 of these at boot time”.
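One possible workaround – a common Upstart pattern, though I haven’t battle-tested this particular config – is a tiny companion task whose only job is to start the instances you want at boot:

# /etc/init/fetchers.conf – hypothetical helper job
description "Start all fetcher instances at boot"
start on startup
task

script
  start fetcher FID=0 || true
  start fetcher FID=1 || true
end script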

In summary, if you use your OS init services, you get to write simpler code, get respawning at an OS level and you get all the normal daemon control features.

Booko’s moved, features added.

I’ve been working on a beta version of Booko for, like, 7 months now.  I finally upgraded it while moving from Slicehost to Linode.  I’ve made a large number of changes to the way Booko performs long running tasks and how those tasks communicate. But I’ll leave the nerdy stuff for later.

The biggest change from a user’s point of view is some integration with Freebase.com. You’ll notice extra information appearing in book listings now. For example, the Booko page for The Girl with the Dragon Tattoo now tells you:

  • that the book is part of the Millennium Trilogy
  • the other books in the series (The Girl who played with Fire & The Girl who kicked the Hornets’ Nest)
  • the other editions of The Girl with the Dragon Tattoo – hardcover and paperback.

The data at Freebase is a long way from complete, but it’s constantly growing, and this should be a very useful feature. I’ll be doing more to integrate Freebase into the search results – for example, showing a book only once, with the different editions listed within that single search result.
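Under the hood, Freebase answers MQL queries – JSON in, JSON out. A query along these lines pulls the related data (a sketch only: “part_of_series” is an illustrative property name; the real ones live in Freebase’s /book schema):

[{
  "type": "/book/book",
  "name": "The Girl with the Dragon Tattoo",
  "editions": [{ "name": null }],
  "part_of_series": [{ "name": null }]
}]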

List management also got an overhaul. That old list manager page was pretty bad. There’s still work to do, but it should be far more usable now. Log in and check it out!

There are still bugs to fix (soooo many missing images for cover art) and features to add (smarter list price calculation, used books).

On being Google’d

[Graph: Google crawl activity on Booko]

I love Google – they send me stacks of traffic and help sites like Booko reach a far greater audience than I could on my own. Recently, however, Google has taken a bigger interest in Booko than usual. These kinds of numbers are no problem in general – the webserver and database are easily capable of handling the load.

The problem for Booko, when Google comes calling, is that they request pages for specific books such as:

http://www.booko.com.au/books/isbn/9780140232929

When this request comes in, Booko checks how old the prices are – if they’re more than 24 hours old, Booko attempts to update them. Booko used to load prices into the browser via AJAX, so as far as I can tell, Google wasn’t even seeing the prices. Further, Booko has a queuing system for price lookup requests, so each page Google requests adds a book to the lookup queue. Google views books faster than Booko can grab prices, so we end up with hundreds of books scheduled for lookup. That frustrates normal Booko users, who see a page full of spinning wheels and wonder why Booko isn’t giving them prices, while the price grabbers hammer through hundreds of requests from Google – in turn hammering all the sites Booko indexes. So, what to do?
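The logic amounts to something like this – a sketch with made-up names, not Booko’s actual code:

MAX_AGE = 24 * 60 * 60   # refresh prices older than 24 hours

def show
  @book = Book.find_by_isbn(params[:isbn])
  if Time.now - @book.prices_updated_at > MAX_AGE
    PriceQueue.push(@book)   # the price grabbers work through this queue
  end
  # at the time, the page then waited on AJAX for the results –
  # which is why Google saw spinners instead of prices
end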

Well, the first thing I did was drop Google’s traffic. I hate the idea of doing this – but really Booko needs to be available to people to use and being indexed by Google won’t help if you can’t actually use it. So to the iptables command we go:

iptables -I INPUT -s 66.249.71.108 -j DROP
iptables -I INPUT -s 66.249.71.130 -j DROP

These commands drop all traffic from those two Googlebot addresses.

The next step was to sign up for Google Webmaster Tools and reduce the page crawl rate.

[Screenshot: crawl rate settings in Google Webmaster Tools]

Once I’d dialled back Google’s crawl rate, I dropped the iptables rules (-F flushes every rule, which is fine here, as those two DROP rules were the only ones I’d added):

iptables -F

To make Booko more Google friendly, the first code change was to have book pages rendered immediately with the available pricing (provided it’s complete) and have updates to that pricing delivered via AJAX. Google now gets to see the entire page, which should (hopefully) mean better indexing.

The second change was to create a second queue for price updates – the bulk queue. The price grabbers will first check for regular price update requests – meaning people will get their prices first. Requests by bulk users, such as Google, Yahoo & Bing, will be added to the bulk queue and looked up when there are no normal requests.  In addition, I can restrict the number of price grabbers which will service the bulk queue.
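In sketch form (hypothetical names, assuming simple list-like queues), each grabber’s main loop becomes:

# Always prefer user-facing requests; only some grabbers
# are allowed to service the bulk queue.
def run(serves_bulk = false)
  loop do
    book = PriceQueue.pop                    # normal users first
    book ||= BulkQueue.pop if serves_bulk    # then Google, Yahoo & Bing
    book ? update_prices(book) : sleep(1)    # idle briefly when both are empty
  end
end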

This work has now opened up a new idea I’ve been thinking about: pre-emptively grabbing the prices of the previous day’s or week’s most popular titles. The idea would be to add those titles to the bulk queue during the quiet time between 03:00 and 06:00, so that when people viewed them later that day, the prices would already be fresh.
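That could be as simple as a nightly cron entry feeding the bulk queue (hypothetical script and finder):

0 3 * * * /usr/bin/ruby /var/www/booko/script/warm_popular.rb

# warm_popular.rb – queue yesterday's most-viewed titles for refresh
Book.popular_yesterday(500).each do |book|   # hypothetical finder
  BulkQueue.push(book)
end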

I’ve just pushed these changes into the Booko site, and with some luck, Google & Co will be happier, Booko users will be happier, and I should be able to build new features on this groundwork. Nice for a Sunday evening’s work.