MySql Use “Distinct” and “Order by” with Multiple Columns AKA Apply “Order by” before “Group”

I’ve had a devil of a time trying to get Google to tell me how to write a Mysql query that allows us to somehow perform a MySql query that 1) filters rows on a distinct column 2) returns other columns in the query besides the distinct column and 3) allows us to order by a column. In our case, we (and you, if you’re running Savage Beast!) have a list of most recent forum posts on the site — currently, if you list all recent posts, the search will just find all posts and order by date of creation, but this makes for some dumb-looking output since you often end up with a list where 10 of the 20 posts are all from the same forum topic. All the user really wants to know is what topics have a new post in them, and to get a brief glimpse as to what that new post might be.

Thus, we want to create a query that returns the new posts, ordered by date of creation, that have a distinct topic_id.

Here’s the SQL that can make it happen:

Post.find_by_sql(“select posts.* from posts LEFT JOIN posts t2 on posts.topic_id=t2.topic_id AND posts.created_at < t2.created_at WHERE t2.topic_id IS NULL ORDER BY posts.created_at DESC”)

Hope that Google sees fit to lead other people here instead of struggling to get GROUP BY to order results beforehand (GROUP BY ‘posts.topic_id’ works, but it returns the first post in each distinct topic, rather than the last post as we desire), or get SELECT DISTINCT to return more than one column, as many forum posters unhelpfully suggested in all the results I was getting.

Update 11/26/08 – A Word of Caution

I finally got around to setting up some profiling for our site yesterday and was surprised to discover that the above query was taking longer per execution than almost anything else on our entire site. The SQL Explain was not too helpful to explain why, but it showed three joins, the join on the topics table involving every row of the table (which is presently almost 10,000).

Takeaway: for this query to work, it seems to consider every distinct topic in the table, rather than being smart and stopping when it hits the per-page paginated limit. Since I already determined that “group” and “distinct” were non-starters for being able to pick the newest post in each topic, I ended up revising the way the logic was done to an easier to manage and far-more-DB-efficient way:

We now track in each topic the newest post_id within that topic. While this adds a bit of overhead to keeping the topic updated when new posts are made to it, it allows us to do a far simpler query where we just select the more recent topics, joining to the most recent post in each topic, and then ordered by the age of those posts.

If you have the ability to create an analogous situation to solve your problem, your database will thank you for it. The above query starts getting extremely slow with more than a few thousand rows. Yet, I defy you to find an alternative to it that works at all using “group” or “distinct.”

7 Replies to “MySql Use “Distinct” and “Order by” with Multiple Columns AKA Apply “Order by” before “Group””

  1. Hi Bob,

    Not quite sure what aspect of the query you find confusing, but I’ll break it down as best I understand it:

    Basically, there are two tables, a “posts” and a “topics” table. We are trying to get entries from the posts table where each post has a unique topic_id. Thus, the query selects all data fields from within the post table (select posts.* from posts), joins the post to other posts in the same the topic (LEFT JOIN posts t2 on posts.topic_id=t2.topic_id), picks the most recent post within the topics table (posts.created_at < t2.created_at). I'm still not clear myself what the "t2.topic_id IS NULL" clause is does, but without it, I believe you get duplicate entries. The ORDER BY posts.created_at DESC means that the final SQL result orders your list of unique posts starting with the most recently created. Hope this helps to clarify things some. If we're lucky a SQL expert will stop by eventually who knows the purpose of the t2.topic_id IS NULL fragment.

  2. You are damn right! This is the only place I found a correct solution to what I was looking for after I spent 3+ hours trying other things I found online (mostly the GROUP BY bs you were talking about).

    The query above worked like a charm but the overhead, as you mentioned, was so much for my table that I had to resort to another implementation. Basically I ending up building a query that grabs a much larger data set, ‘100’ (when I want data from a distinct 10 users) then I process the data in the php and grab as many records of unique users as possible. This ended up being the best ‘quick solution’ I could come up with right now but it does not guarantee you get your 10 unique users. The only way after that would be to continue grabbing more records…

  3. I love you!

    This example works brilliantly. I wanted to which user had which phone extension at a certain time, which was impossible using GROUP BY, DISTINCT or other SQL magic. Your example worked wonderfully!

  4. Well, I found a solution. I have many columns. I want unique date, the one with last date_ini. Also I filter by some fileds and date range. Definitelly group by does not work.

    select A.fcstout_tag, A.var, A.date, A.value, A.date_ini from (select date,max(date_ini) as max_date_ini from data_fcstout_c0 where date between ‘2013-05-19 22:00:00’ and ‘2013-05-20 03:00:00′ && var=’pow’ && fcstout_tag=’NetN30′ group by date) as B, data_fcstout_c0 as A where A.date=B.date && A.date_ini=B.max_date_ini && A.var=’pow’ && A.fcstout_tag=’NetN30′ order by A.date;

Leave a Reply to Matt Cancel reply

Your email address will not be published. Required fields are marked *