How Much Do We Want A Decent Version of Git for Windows? T-H-I-S M-U-C-H.

We want it bad. Reeeeeal bad.

Bonanzle is still running on Subversion (via TortoiseSVN), because comparing its UI to the UI of the de facto git standard (msysgit) is like comparing Rails to ASP on Visual Basic. Yes, the difference is that big. Msysgit is ugly as Pauly Shore, most of its windows can't be resized, it crashes regularly, and trying to decipher the intent of its UI is like reading the Dead Sea Scrolls.

Yes, yes, I know: if you don't like a piece of open source software, you should shut up and fix it. Unfortunately, I'm sort of single-handedly maintaining this website, which has grown to about 150k uniques this month from about 10k uniques two months ago. I can't end world hunger and rescue my cat from a tree at the same time.

But if there is any intelligent life out there with spare programming cycles and the desire to make a huge difference in the world, this is my personal plea that they consider giving some love to the woebegone msysgit… or maybe just start their own Windows git client. I can't imagine it would take a real hacker more than a week or two to match the feature set of the existing msysgit.

I'd really like to move Savage Beast to GitHub, and I'd really like to collaborate on the other projects happening there, but it just doesn't make sense to go from a slick, error-free, decipherable UI like TortoiseSVN's to the meager helpings of msysgit.

I'd happily donate to a project for a better Windows git client.

Preemptive note to smart alecks: No, I'm not moving to Mac (or Linux) now. There are plenty of reasons why; I'll tell you all about them some other time. Incidentally, what is the preeminent GUI for git on Mac these days? From what I understand, many of the real hackers are perfectly content using git from the command line…? I shudder to think of reading diffs and histories that way.

Rails Thinking Sphinx Plugin: Full Text Searching that’s Cooler than a Polar Bear’s Toenails

Continuing my series of reviews of the plugins and products that have made Bonanzle great, today I'll talk about Thinking Sphinx: how we've used it, what it's done, and why it's a dandy.

What It Is Bro, What It Is

What it is is a full text search Rails plugin that uses the Sphinx search engine to let you search big tables for data that would take a long-assed time (and a lot of custom application code) to find if you used MySQL full text searching.

What Are Your Other Options?

In the space of legitimate Rails full text plugins, the commonly mentioned choices are Sphinx (via Thinking Sphinx or Ultra Sphinx), Xapian (via acts_as_xapian), Solr (via acts_as_solr) and (shudder) Ferret (via acts_as_ferret).

Jim Mulholland does a great job of covering the various choices at a glance, so if you'd like a good overview, start with his blog post about the choices.

To his commentary, I would add that Solr looks complicated to get running, appears to have been abandoned by its creator, and hasn't been updated in quite a while. It should also be mentioned that if you were to choose Solr, every time you wished to talk about it online you'd have the burdensome task of backspacing the "a" out of the name your fingers were intent on typing.

Xapian seems alright, but the documentation on it seemed lacking and not a little arcane. Despite Jim's post on how to use it, the Xapian Rails community seemed pretty sparse. My impression was that if it didn't "just work," it would be I alone who would have to figure out why. Also, from what I could tell in Jim's post, it sounded like one has to stop Xapian from serving search requests in order to rebuild the index. Update: the FUD patrol informs me that you can index and serve concurrently. Oh, what joy!

Ferret sucks. We tried it in our early days. It caused mysterious indexing exceptions left and right whenever we changed our models or ran migrations. The day we expunged it from our system was the day I started programming our site and stopped worrying about what had broken Ferret that day.

Ultra Sphinx looks OK, but as you can read here, its ease of use leaves something to be desired compared to the star of our blog post, who has now entered the building. Ladies and gentlemen, may I present to you, hailing from Australia and weighing in at many thousand lines of code:

Thinking Sphinx!

There's a lot to like about Thinking Sphinx: it has easy-to-read docs with examples, it has an extremely active Google Group behind it, and it supports useful features like location-based searches and delta indexing (i.e., keeping search results up to date in near-real time).

But if there is one reason I would recommend Thinking Sphinx above your other choices, it's that you probably don't care a hell of a lot about full text searching. Because I didn't. I care about writing my website. This is where Thinking Sphinx really shines. With the tutorials and Railscasts that exist for Thinking Sphinx, you can write an index for your model and actually be serving results within a couple hours' time. That doesn't mean it's an oversimplified app, though. Its feature list is long (most of the features we don't yet use), but smart defaults are assumed, and it's super easy to get rolling with a basic setup, allowing you to hone the parameters of the search as your situation dictates.
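
To make that concrete, here's a minimal sketch of what an index definition looks like. The Item model and its columns here are hypothetical placeholders, not Bonanzle's actual schema:

class Item < ActiveRecord::Base
  define_index do
    # Full text fields to search against
    indexes title, :sortable => true
    indexes description

    # Attributes available for filtering and sorting
    has created_at, price

    # Delta indexing keeps newly created records searchable between
    # full reindexes (requires a boolean "delta" column on the table)
    set_property :delta => true
  end
end

After a rake thinking_sphinx:index and rake thinking_sphinx:start, serving results is a one-liner:

Item.search "vintage lamp", :page => 1, :per_page => 20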

Also extremely important in choosing a full text search system is reliability. Programming a full text engine (and its interface into your application) is rocket science, as far as I’m concerned. I don’t want to spend my time interpreting esoteric error messages from my full text search engine. It must work. Consistently. All the time. Without me knowing anything about it. Thinking Sphinx has done just that for us. In more than a month since we started using it, it’s been a solid, reliable champ.

A final, if somewhat lesser, consideration in my recommendation of TS is who you'll be dealing with if something goes wrong. Being open source, my usual expectation is that if Google and I can't solve a particular problem, it will be a long wait for a response from a random, ever-so-slightly-more-experienced-than-me user of the system in question who will hopefully, eventually answer my question in a forum. Thinking Sphinx's creator Pat Allen blows away this expectation by tirelessly answering almost all questions about Thinking Sphinx in its Google Group. From what I can tell, he does this practically every night. This is a man possessed. I don't claim to know or understand what's in the punch he's drinking (probably not beginner's enthusiasm, since TS has been around for quite some time now), but whatever's driving him, I would recommend you take advantage of his expertise soon, before he becomes jaded and sour like the rest of us.

What About the Performance and Results?

Performance: great. Our usual TS query returns in a fraction of a second from a table of more than 200,000 rows indexed on numerous attributes. Indexing the table currently takes about 1-2 minutes and doesn't lock the database. Nevertheless, we recently moved our indexing to a remote server, since having it constantly running did bog down the system somewhat. I plan to describe in the next couple days how we got remote indexing working, but suffice it to say, it wasn't very hard (especially with Pat's guidance on the Google Group).

Results: fine. I don't know what the pertinent metrics are here, but you can use weighting for your results and search on any number of criteria. Our users are happy enough with the search results they're getting from TS out of the box, and when we do get more customized with our search weighting, I have little doubt that TS will be up to the task, and that it'll probably be easy to set up.
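
When that day comes, the change should be modest. Here's a hedged sketch of search-time weighting (field names hypothetical), where matches in a title count ten times as much as matches in a description:

# Title matches weigh 10x description matches when ranking results.
Item.search "hammer", :field_weights => { :title => 10, :description => 1 }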

Final Analysis

If you want to do full text searching on a Rails model, do yourself a favor and join the bandwagon enjoying Thinking Sphinx. It’s one of the best written and supported plugin/systems I’ve stumbled across so far in the creation of Bonanzle.

I’m Bill Harding, and I approved of this message.

MySQL: Use "DISTINCT" and "ORDER BY" with Multiple Columns, AKA Apply "ORDER BY" before "GROUP BY"

I've had a devil of a time trying to get Google to tell me how to write a MySQL query that 1) filters rows on a distinct column, 2) returns other columns in the query besides the distinct column, and 3) allows us to order by a column. In our case, we (and you, if you're running Savage Beast!) have a list of the most recent forum posts on the site. Currently, if you list all recent posts, the search just finds all posts and orders them by date of creation, but this makes for some dumb-looking output, since you often end up with a list where 10 of the 20 posts are all from the same forum topic. All the user really wants to know is which topics have a new post in them, and to get a brief glimpse of what that new post might be.

Thus, we want to create a query that returns the new posts, ordered by date of creation, that have a distinct topic_id.

Here’s the SQL that can make it happen:

Post.find_by_sql("select posts.* from posts
  LEFT JOIN posts t2
    ON posts.topic_id = t2.topic_id AND posts.created_at < t2.created_at
  WHERE t2.topic_id IS NULL
  ORDER BY posts.created_at DESC")

Hope that Google sees fit to lead other people here instead of letting them struggle to get GROUP BY to order results beforehand (GROUP BY posts.topic_id works, but it returns the first post in each distinct topic rather than the last post, as we desire), or to get SELECT DISTINCT to return more than one column, as many forum posters unhelpfully suggested in all the results I was getting.

Update 11/26/08 – A Word of Caution

I finally got around to setting up some profiling for our site yesterday and was surprised to discover that the above query was taking longer per execution than almost anything else on our entire site. The SQL EXPLAIN was not much help in explaining why, but it showed three joins, with the join on the topics table involving every row of that table (presently almost 10,000 rows).

Takeaway: for this query to work, it seems to consider every distinct topic in the table, rather than being smart and stopping when it hits the per-page paginated limit. Since I had already determined that GROUP BY and DISTINCT were non-starters for picking the newest post in each topic, I ended up revising the logic into an easier-to-manage and far more DB-efficient form:

We now track in each topic the newest post_id within that topic. While this adds a bit of overhead to keeping the topic updated when new posts are made to it, it allows us to do a far simpler query where we just select the most recent topics, join each to its newest post, and order by the age of those posts.
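
For illustration, here's a minimal sketch of the shape of that denormalization, assuming a hypothetical last_post_id column (this isn't our exact code):

# Migration: give each topic a pointer to its newest post.
add_column :topics, :last_post_id, :integer
add_index  :topics, :last_post_id

# app/models/post.rb
class Post < ActiveRecord::Base
  belongs_to :topic

  # Keep the topic pointed at its newest post.
  def after_create
    topic.update_attribute(:last_post_id, id)
  end
end

# The "recent posts, one per topic" list becomes a simple indexed join:
Topic.find(:all,
  :joins => "INNER JOIN posts ON posts.id = topics.last_post_id",
  :order => "posts.created_at DESC",
  :limit => 20)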

If you have the ability to create an analogous situation to solve your problem, your database will thank you for it. The above query starts getting extremely slow with more than a few thousand rows. Yet, I defy you to find an alternative to it that works at all using “group” or “distinct.”

Rails Hosting: Review of Slicehost vs. EC2

It's a goal of mine to write a series of reviews for all the major plugins and services that have gone into the creation of Bonanzle. Previously, I reviewed Fiveruns and gave it a "thumbs down," which gave me the blues, since I'd like Fiveruns to be the killer app for monitoring Rails performance. Unfortunately, though two or three different Fiveruns salespeople have noticed Bonanzle and told me I should use Fiveruns, none have gotten back to me with a promise that they could make it easier to use after I pointed them to my review. But I digress. Today we discuss Slicehost.

Synopsis

Slicehost has been very good to Bonanzle. After a short and bad experience with another Rails hosting provider that gave limited shell access, we started using Slicehost almost a year ago, first with a 256 MB slice to host our main Bonanzle server. A Slicehost "slice" is their name for a server partition, very much like an EC2 server instance (I'll get to a comparison of the two shortly). When you sign up for a slice, you can pick from a number of sizes: 256 MB, 512 MB, 1 GB, 2 GB, or 4 GB. When you set up your slice, you can choose from a variety of OSes to have pre-installed (including most all the flavors of Ubuntu). You have full shell access with any slice you set up, so you essentially have the full range of configuration possibilities you'd have if the server were in your basement. If you choose to add more slices in the future, you can copy the disk image from your existing slices as a starting point for the new slice (as long as you have backups turned on for the slice whose disk image you want to copy). This has been very convenient for us, as it saves us the trouble of repeatedly installing basic stuff like MySQL and Apache on each new slice we add.

How We’ve Used It

From that initial 256 MB slice we started with a year ago, we now have grown to seven slices ranging in size from 512 MB to 4 GB. As mentioned above, it’s very convenient to get a new slice up to speed using the disk image of an old slice. It’s also very fast — when we’ve put in our request to get a new slice, it has taken from 30 minutes to a couple hours max to get the new slice created.

Uptime

None of our slices have gone down in a year of use. That’s nice.

Performance

The bigger your slice, the more CPU you get to use in times of contention. According to the support personnel I've spoken with, the servers are hosted on quad-core, 64-bit 2 GHz Opteron machines, and a 4 GB slice gets up to 25% of the CPU cycles in times of contention (which there rarely are). Scale down from that 25% for each level down in slice size (e.g., a 2 GB slice gets 12.5%, or 1/8th, of the cycles).

In terms of practical speed, we're currently serving about 50,000 pages/day, mostly non-cached, on a site that has a lot of interactive features and image processing. We're doing this on one 4 GB slice that currently runs 8 Mongrels AND the MySQL server itself. Most page load times are less than a second; creating images takes longer. Good enough for me for now.

Compared to EC2

The closest comparable EC2 offering to a 4 GB Slicehost slice is the following:

7.5 GB of memory, $288.00/month, 850 GB of instance storage, no bandwidth included in the price, 4 EC2 Compute Units

Compare to Slicehost:

4 GB of memory, $280.00/month ($250/month with the automatic 10% discount), 160 GB HD, 1600 GB of bandwidth included in the price, and the equivalent of 2 EC2 Compute Units (i.e., one 2 GHz processor) during resource contention, more otherwise

EC2 jumps out to the early lead, as it offers about twice as much computing power and RAM for $30 more. But Slicehost catches up quickly when you consider bandwidth and storage:

EC2 bandwidth = $0.10-$0.17 per GB transferred. Slicehost = up to 1600 GB transfer free.

That is, if you were to use all of your slice's bandwidth, you'd save yourself something in the neighborhood of $250/month vs. Amazon (1600 GB at $0.10-$0.17 per GB works out to $160-$272). For storage, Amazon offers more space by default, but they make no guarantee that your instance storage won't evaporate at any time, which is why they also offer Elastic Block Storage (EBS), which is intended to be your "real" disk when operating in an EC2 instance. EBS costs $0.10 per GB-month plus $0.10 per million I/O requests, which Amazon estimates adds up to about $26/month more for a "medium sized web site."

When you add up the total costs, assuming you were going to use your storage and bandwidth, Slicehost offers about half the memory and half the computing power, but it does so at less than half the price of EC2. And a 4GB Slicehost slice is no small computing organism. As mentioned above, it’s serving 50k daily pages of dynamic content and getting by well enough (except when it comes to image creation, which can take 5-10 seconds to process including thumbnails).

Where does EC2 win?

Still, there are a number of advantages to EC2. The first is that 4 GB (the size I've been discussing) is the largest instance size currently listed at Slicehost, whereas Amazon has a couple of instances with significantly more computing power and memory available. This alone is reason enough that we will probably need to switch to EC2 in the not-distant future, since at times of peak traffic we're pushing the maximum performance of our current slice. Update: The Slicehost support team informs me that they also have 8 GB and 15.5 GB slices available by request. Both of the unlisted, larger-sized slices have corresponding 2x or 4x increases in HD space and processing power (and, of course, cost).

Another annoying limitation of Slicehost is that all traffic is throttled at 10 Mbps. While that's not a "low" amount per se (Wikipedia says 8-12 Mbps is equivalent to a "medium to high-definition digital channel with DVD quality data," i.e., about a megabyte of transfer per second), it is not conducive to high-traffic, image-heavy sites, and it is annoying that the throttle is set at the same level regardless of slice size. Update: The Slicehost support team informs me that this limit can be adjusted as necessary by request. I requested that they double our bandwidth allowance and they had it done within an hour.

Where does Slicehost win?

Firstly, there are the cost wins described above if you are hosting a site that uses lots of bandwidth.

Secondly, I get the sense (from documents I'd previously read but can no longer locate) that it is far less likely that one's instance storage will evaporate with Slicehost. I know that it's never happened to us in the year we've been hosted, whereas I recall reading that EC2 makes no guarantee that instance storage will be available at any given time. I'd love more details on this if anyone can cite where I might have read it.

Another great feature of Slicehost that’s easy to underestimate is the availability of their help. They have a Slicehost chat room that is staffed by a handful of Slicehost employees during all normal business hours (Update: and non-normal hours too… I was talking to them at 3 AM last night about the progress of our resize to an 8GB slice. There were two Slicehost employees manning the chat window at that hour (!)). I’ve ended up visiting this chat room on numerous occasions when I want instant answers to my questions, and I’ve found the people in the chat room to be very knowledgeable and patient. Getting good support at Amazon is very expensive ($100-$400 per month, or more, for a service Slicehost provides free of charge).

Also, I've found that our slice almost always has more than the "guaranteed" CPU cycles available: according to top, our slice regularly uses more than "100%" CPU (100% = one of the four cores, which is what's guaranteed with a 4 GB slice).

Final Summary

I hope to continue adding to this article as I gain experience with the two services. As mentioned above, we have stuck exclusively with Slicehost so far, but if our site gets into the millions of uniques we might end up making the move to EC2. Update: I did some research on EC2 recently and was pretty surprised at how esoteric their documentation is (see the section on creating your own AMI if you need to lull yourself to sleep), so I'd just as soon stay at Slicehost, where there are fewer proprietary concepts involved. For people making the decision today about where to host, I'd pick Slicehost if you're looking for high configurability, less learning about proprietary concepts, more human support, and lower, more predictable costs. I'd pick EC2 if you already know how to use it or are planning to run a complex, scalable architecture where you want to be able to swap in more servers on a whim. I'd imagine EC2 is pretty easy to get up and running with some of the pre-configured AMIs (I haven't researched it, but I'm sure there's one for Rails). But then again, Slicehost is pretty damn easy to get Rails rolling on too, since you can follow any of the kajillion tutorials about setting up Rails on an Ubuntu machine. (Or you can use mod_rails, which from what I've heard is pretty much plug-and-use.)

Stay tuned for updates, and if you have comparable experience with either, please post it below!

Rails MySQL Indexes: Step 1 in Pitiful-to-Prime Performance

Like any breathing Rails developer, I love blogging about performance. I do it all the time. I’ve done it here, here, and quite famously, here.

But one thing I haven’t done is blog about Rails performance from a perspective of experience. But tripling in traffic for a few months in a row has a way of changing that.

So now I'm a real Rails performance guy. Ask me anything about Rails performance, and I'll tell you to get back to me in a couple months, because this ain't exactly yellowpages.com I'm running here. BUT, these are the lessons and facts from our first few months of operation:

  • One combined Rails server + MySQL slice at Slicehost is handling about 3,000 daily visits and 30,000 daily pageviews (on a highly real-time, interactive site) with relative ease. Almost all pageviews render in less than 2 seconds, most in less than 1.
  • Memcached saves our ass repeatedly
  • Full text searching (we’re using Thinking Sphinx) saves our ass repeatedly
  • BackgroundRb will ruin your life; cron-scheduled rake tasks will save it (see the crontab sketch after this list)
  • Database ain’t nothing but a chicken wing with indexing
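
To elaborate on the BackgroundRb point: the replacement is nothing fancier than a crontab entry invoking a rake task. A sketch, with a hypothetical task name and app path:

# Run a maintenance rake task at the top of every hour, in production mode.
0 * * * * cd /var/www/bonanzle && RAILS_ENV=production rake maintenance:hourly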

Now, those are five salient observations to take from a growing site, but you'll notice that it was the last one I chose to single out in the title of this blog. Why? Because if I called this entry "Rails Performance Blog," your eyes would glaze over and you wouldn't be able to read through the hazy glare.

Why else? Because the day I spent indexing our tables was the only time in the history of Bonanzle that I will ever bring forth a sitewide 2x-3x performance increase within about 4 hours' time. God damn, that was a fantastic day. I spent the second half of it writing airy musings to my girlfriend and anyone who would listen about how much fun web sites are to program. Then I drank beer and went rafting. Those who haven't indexed their DB lately: don't you hate me and want to be like me more than you ever have before?

Well, I can’t help you with the former, but the latter, that we can work on.

  1. Download the Query Analyzer plugin, which logs an EXPLAIN alongside each query in your development log.
  2. Delete your development.log file. Start your site in development mode. Go to your slowest page. Open your development.log file in an editor that can automatically update as the file changes.
  3. Look through the queries your Rails site is making. Any query where the "type" column reads "ALL" is scanning every row of the table to satisfy the query (a full table scan). Hundreds of rows? OK, whatever. Thousands of rows? Ouch. Tens of thousands of rows (or more)? Your request might never be heard from again.
  4. Create indexes to make those “ALL”s go away. Adding an index in Rails is the simplest thing ever. In a migration: add_index :table_name, :column_name and you’re done. remove_index :table_name, :column_name and you’re undone.
  5. Observe that, at least for MySQL, queries that filter on more than one attribute in the where clause (e.g., select * from items where status = "active" and hidden = false) are still slow if you only create separate indexes for "status" and "hidden." Why? Because MySQL generally picks just one index per table to satisfy a query, so the other condition still gets checked row by row. What I do know is that add_index :items, [:status, :hidden] creates a compound index that will get you back to log(n) time on queries with compound where clauses. (See the migration sketch after this list.)
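
Here are steps 4 and 5 rolled into a migration sketch; the items table and its columns are hypothetical stand-ins for your own schema:

class AddIndexesToItems < ActiveRecord::Migration
  def self.up
    # Index a foreign key we join on regularly...
    add_index :items, :user_id

    # ...and add a compound index for the two-clause where from step 5.
    add_index :items, [:status, :hidden]
  end

  def self.down
    remove_index :items, :user_id
    remove_index :items, :column => [:status, :hidden]
  end
end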

Now, if you are like me or the 50 people in the Rails wiki and forums who have just learned about this crazy, wonderful thing called "indexes," your first question is: "Indexing sounds pretty bangin. Why not just index the hell out of everything?"

Answer: Them indexes aren’t immaculately conceived, son. Every index you create has to be generated and maintained. So the more indexes you create, the more overhead there is to inserting or deleting records from your table. Of course, most queries on most sites are read queries, so you will make up the extra insert/delete time by 10x or more, but if you were to go buck wild and index the farm, you probably wouldn’t be much better off on balance than if you indexed nothing at all. You see why downloading Query Analyzer was the first step?

The general rule given for indexes is that most any foreign key should be indexed, as should any criteria upon which you regularly search or sort. That's worked well for us. For tables with fewer than 500 rows, I usually get lazy and don't do any indexing, and that seems fine. But assuredly, if you're working with a table of 1,000 or more rows and you're querying on columns that aren't indexed, you are 15 minutes away from a beer-enabling, management-impressing performance optimization that would make Ferris Bueller proud.

Change ACL of Amazon S3 Files in Recursive Batch

We're in the process of moving our images to be served off of S3, and I wanted to share a quick recommendation I came across this evening while trying to change our presently-private S3 image files to be public. The answer is Bucket Explorer. All things being equal, you certainly won't mistake it for a high-budget piece of UI mastery, but it is surprisingly capable of doing many things that have been troublesome for me with the Firefox S3 plugin (which is a major pain to even get working with Firefox 2, which is itself a major pain to upgrade to Firefox 3: I upgraded for a month, spent 5 hours trying to figure out why some pages seemed to randomly freeze indefinitely, then gave up and downgraded; my best guess was that it was Flash-related), the AWS-S3 gem, and the other free S3 browsing web service I found somewhere or another.

In addition to providing a capable, FTP-like interface to one's S3 buckets, it can also get stats on directories, do the aforementioned recursive batch permission setting, delete buckets (the S3 gem won't let me, even with the :force => true option), and a bunch of other things. Probably most importantly (for me): it works! Tra-lee!
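
That said, if you'd rather script the permission change than buy a GUI (and have better luck with the AWS-S3 gem than I did), the gem does expose an ACL grants API. A rough sketch, with bucket name and credentials as placeholders; note that Bucket.objects returns at most 1,000 keys per call, so a run over 20,000 files would need to page through with the :marker option:

require 'rubygems'
require 'aws/s3'
include AWS::S3

Base.establish_connection!(
  :access_key_id     => 'YOUR_ACCESS_KEY',
  :secret_access_key => 'YOUR_SECRET_KEY'
)

# Add a public-read grant to every object in the bucket.
Bucket.objects('your-images-bucket').each do |object|
  policy = object.acl
  policy.grants << ACL::Grant.grant(:public_read)
  object.acl(policy)  # write the modified policy back to S3
end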

Bucket Explorer is $50 to buy, but once it finishes changing batch permissions on the roughly 20,000 image files it's currently processing, I would seriously consider paying for it. For the time being, I'm on a fully functional 30-day trial.

Bonanzle: “The Best eBay Alternative They’ve Seen”

An incredible accolade for a site that's still technically in beta: Ecommerce Guide just named Bonanzle "The Best eBay Alternative They've Seen" in four years of reviewing eBay alternatives. The pessimistic side of me says that an article this effusive is an open invitation for every Tom, Dick and Harry to quibble and point out the faults of Bonanzle (of which there are admittedly still several… we haven't even officially launched yet, people), or to question how Bonanzle can be called an "eBay alternative" when it doesn't even do auctions.

That said, it’s hard to imagine this project going much better than it has so far. While I’m fully aware that the hundreds of PHP eBay lookalikes are going to slowly start nibbling at what are now Bonanzle-only features, it’s comforting to know that they’re going to have to program those features in PHP (or maybe Java).

If you haven’t already, pay a visit to Bonanzle and cast your vote that Rails is an unfair advantage.

Bloggity – An Idea for a Rails Blog Plugin

Update: This plugin now exists! It’s under active development, and has some pretty cool features. Read more about it through the link.

There comes a time in most every Rails application's life when its owner wants to up the "Web 2.0" factor and create a blog presence. Luckily, this is Rails, a framework ideally suited to creating a blog. Unluckily, there doesn't seem to be a de facto Rails blog plugin floating around the community yet.

Now I know what you're thinking… Mephisto! No thanks. The same factors that drove me away from Beast drive me away from Mephisto as well. Who wants to manage multiple subdomains, multiple app repositories, two sets of Mongrels, and cross-subdomain authentication issues, all while figuring out how to share resources (like views and images) between your Mephisto app and your standard app?

Upon discovering that no good blog plugin existed, my first instinct was to see if I could bring to life a sort of “Savage Mephisto.” Unfortunately, whereas I had a little bit of time to squeeze in Savage Beast development during Bonanzle’s final run to launch, I have virtually no time whatsoever to work on a project like this, now that Bonanzle is live and doubling in traffic every few weeks. Nevertheless, I did at least give a half-hearted attempt to make Mephisto into a plugin before realizing that as vast and deep as Beast is, Mephisto is probably twice as complicated, and thus, half as plugin-able.

So I turned my attention to creating a blog plugin from scratch. You can get a sense of how far I've gotten on the other Bonanzle blog. Like SB, I wrote it using the Engines plugin, so it could theoretically be extracted to a repository for download with relatively little effort. However, I'm not sure what the demand for a plugin like this would be. I mean, you could write your own fairly functional blog in a day or less, or you could conquer the cross-domain Mephisto issues, or you could try out one of the not-well-rated blog plugin attempts on Agile Web Development.

Anyway, I’ve decided to use the responses to this post to gauge how much enthusiasm there is for an easy-to-setup Rails blog plugin. If there’s lots, I’ll try to make the time frame to launch short. If there’s little, it probably remains a Bonanzle-only plugin.

Rails script/server command line options

Usage: server [options]
  -p, --port=port              Runs Rails on the specified port. Default: 3000
  -b, --binding=ip             Binds Rails to the specified ip. Default: 0.0.0.0
  -d, --daemon                 Make server run as a Daemon.
  -u, --debugger               Enable ruby-debugging for the server.
  -e, --environment=name       Specifies the environment to run this server under (test/development/production). Default: development
  -s, --config-script=path     Uses the specified mongrel config script.
  -h, --help                   Show this help message.
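
For example, to daemonize a production server on port 3001:

script/server -e production -p 3001 -d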