There is a table messages that contains data as shown below:

Id   Name   Other_Columns
-------------------------
1    A       A_data_1
2    A       A_data_2
3    A       A_data_3
4    B       B_data_1
5    B       B_data_2
6    C       C_data_1

If I run the query select * from messages group by name, I will get the result:

1    A       A_data_1
4    B       B_data_1
6    C       C_data_1

What query will return the following result?

3    A       A_data_3
5    B       B_data_2
6    C       C_data_1

That is, the last record in each group should be returned.

At present, this is the query that I use:

select * from (select * from messages ORDER BY id DESC) AS x GROUP BY name

But this looks highly inefficient. Any other ways to achieve the same result?

see accepted answer in //allinonescript.com/questions/1379565/… for a more efficient solution – eyaler
Why can't you just add DESC, i.e. select * from messages group by name DESC – Kim Prince
@KimPrince worked for me man !!!, one word just did the trick !!! – Accountant م
@KimPrince It seems like the answer you are suggesting doesn't do what is expected! I just tried your method and it took FIRST row for each group and ordered DESC. It does NOT take the last row of each group – Ayrat
For more efficiency, see mysql.rjweb.org/doc.php/groupwise_max – Rick James

18 Answers

Use your subquery to return the correct grouping, because you're halfway there.

Try this:

select
    a.*
from
    messages a
    inner join 
        (select name, max(id) as maxid from messages group by name) as b on
        a.id = b.maxid

If it's not id you want the max of:

select
    a.*
from
    messages a
    inner join 
        (select name, max(other_col) as other_col 
         from messages group by name) as b on
        a.name = b.name
        and a.other_col = b.other_col

This way, you avoid correlated subqueries and/or ordering in your subqueries, which tend to be very slow/inefficient.
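A side note (my assumption, not part of the original answer): this join benefits from a composite index covering the grouping column and the max column, so the derived table's GROUP BY can be resolved from the index alone. A minimal sketch for the question's table:

-- Hypothetical supporting index: lets the derived table's
-- "SELECT name, MAX(id) ... GROUP BY name" be served from the index.
ALTER TABLE messages ADD INDEX idx_messages_name_id (name, id);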

Epic solution is epic. – Notinlist
Eric solution is Eric? :) – JYelton
Note a caveat for the solution with other_col: if that column is not unique you may get multiple records back with the same name, if they tie for max(other_col). I found this post that describes a solution for my needs, where I need exactly one record per name. – Eric Simonton
In some situations you can only use this solution and not the accepted one. – tom10271

Here are two suggestions. First, if your version of MySQL supports ROW_NUMBER(), it's very simple:

WITH Ranked AS (
  SELECT Id, Name, OtherColumns,
    ROW_NUMBER() OVER (
      PARTITION BY Name
      ORDER BY Id DESC
    ) AS rk
  FROM messages
)
SELECT Id, Name, OtherColumns
FROM Ranked
WHERE rk = 1;

I'm assuming by "last" you mean last in Id order. If not, change the ORDER BY clause of the ROW_NUMBER() window accordingly. Second, if ROW_NUMBER() isn't available, this is often a good way to proceed:

SELECT
  Id, Name, OtherColumns
FROM messages
WHERE NOT EXISTS (
  SELECT * FROM messages as M2
  WHERE M2.Name = messages.Name
  AND M2.Id > messages.Id
)

In other words, select messages where there is no later-Id message with the same Name.
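For completeness, the same "no later row with this Name" idea can be phrased as a correlated subquery (a sketch, not part of the original answer; performance again hinges on an index over (Name, Id)):

SELECT Id, Name, OtherColumns
FROM messages AS M1
WHERE Id = (
  SELECT MAX(M2.Id)          -- the latest Id for this Name
  FROM messages AS M2
  WHERE M2.Name = M1.Name
);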

MySQL doesn't support ROW_NUMBER() or CTE's. – Bill Karwin
Accepted answer:

MySQL 8.0 now supports windowing functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:

WITH ranked_messages AS (
  SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
  FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;

Below is the original answer I wrote for this question in 2009:


I write the solution this way:

SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
 ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;

Regarding performance, one solution or the other can be better, depending on the nature of your data. So you should test both queries and use the one that is better at performance given your database.

For example, I have a copy of the StackOverflow August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts table. This is running on MySQL 5.0.75 on my Macbook Pro 2.40GHz.

I'll write a query to find the most recent post for a given user ID (mine).

First using the technique shown by @Eric with the GROUP BY in a subquery:

SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
            FROM Posts pi GROUP BY pi.owneruserid) p2
  ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;

1 row in set (1 min 17.89 sec)

Even the EXPLAIN analysis takes over 16 seconds:

+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table      | type   | possible_keys              | key         | key_len | ref          | rows    | Extra       |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                       | NULL        | NULL    | NULL         |   76756 |             | 
|  1 | PRIMARY     | p1         | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY     | 8       | p2.maxpostid |       1 | Using where | 
|  2 | DERIVED     | pi         | index  | NULL                       | OwnerUserId | 8       | NULL         | 1151268 | Using index | 
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)

Now produce the same query result using my technique with LEFT JOIN:

SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
  ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;

1 row in set (0.28 sec)

The EXPLAIN analysis shows that both tables are able to use their indexes:

+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys              | key         | key_len | ref   | rows | Extra                                |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
|  1 | SIMPLE      | p1    | ref  | OwnerUserId                | OwnerUserId | 8       | const | 1384 | Using index                          | 
|  1 | SIMPLE      | p2    | ref  | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8       | const | 1384 | Using where; Using index; Not exists | 
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)

Here's the DDL for my Posts table:

CREATE TABLE `posts` (
  `PostId` bigint(20) unsigned NOT NULL auto_increment,
  `PostTypeId` bigint(20) unsigned NOT NULL,
  `AcceptedAnswerId` bigint(20) unsigned default NULL,
  `ParentId` bigint(20) unsigned default NULL,
  `CreationDate` datetime NOT NULL,
  `Score` int(11) NOT NULL default '0',
  `ViewCount` int(11) NOT NULL default '0',
  `Body` text NOT NULL,
  `OwnerUserId` bigint(20) unsigned NOT NULL,
  `OwnerDisplayName` varchar(40) default NULL,
  `LastEditorUserId` bigint(20) unsigned default NULL,
  `LastEditDate` datetime default NULL,
  `LastActivityDate` datetime default NULL,
  `Title` varchar(250) NOT NULL default '',
  `Tags` varchar(150) NOT NULL default '',
  `AnswerCount` int(11) NOT NULL default '0',
  `CommentCount` int(11) NOT NULL default '0',
  `FavoriteCount` int(11) NOT NULL default '0',
  `ClosedDate` datetime default NULL,
  PRIMARY KEY  (`PostId`),
  UNIQUE KEY `PostId` (`PostId`),
  KEY `PostTypeId` (`PostTypeId`),
  KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
  KEY `OwnerUserId` (`OwnerUserId`),
  KEY `LastEditorUserId` (`LastEditorUserId`),
  KEY `ParentId` (`ParentId`),
  CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;
Really? What happens if you have a ton of entries? For example, if you're working w/ an in-house version control, say, and you have a ton of versions per file, that join result would be massive. Have you ever benchmarked the subquery method with this one? I'm pretty curious to know which would win, but not curious enough to not ask you first. – Eric
Thanks Bill. That works perfectly. Can you provide more information regarding the performance of this query against the join provided by Eric? – Vijay Dev
Did some testing. On a small table (~300k records, ~190k groups, so not massive groups or anything), the queries tied (8 seconds each). – Eric
I should note that's with a composite key and no indexing. It was a throw-away staging table :) – Eric
Wow, great info. I ran my test against SQL Server 2008, so it's intriguing to see how MySQL differs with these queries. Again showing you that explain is your friend! – Eric
Aha! I was wondering how you got such different results from mine. In many cases of using GROUP BY, MySQL creates a temporary table on disk, leading to expensive I/O. Best to avoid GROUP BY if you can in MySQL. And yes, always analyze queries with EXPLAIN when performance is important. – Bill Karwin
@Bill: SQL Server hates or, MySQL hates group by. One of these days we'll get an RDBMS that likes all of SQL. Though looking at the explain, it looks like if you put the where clause inside the subquery as well, it would return a much smaller rowset. – Eric
@Eric: regarding putting a WHERE restriction inside the subquery, yes, but then you don't need the GROUP BY either. – Bill Karwin
@Bill: Ah, I always forget that MySQL will let you drop GROUP BY. Of course, dropping it would be the most efficient way to run that query for a specific user. SQL Server is less forgiving with its GROUP BY. If it's in the select, it has to be in the GROUP BY. Of course, it can be accomplished with the OVER clause, which is just magical, really. – Eric
As @newt indicates, this query was slow for me (10+ minutes on SQL Server 2008) with large datasets. I need to select the last data per group from a 3.5 million row table. – JYelton
@JYelton, with SQL Server 2008 you should use a CTE with windowing functions. – Bill Karwin
@BillKarwin: See meta.stackexchange.com/questions/123017, especially the comments below Adam Rackis' answer. Let me know if you want to reclaim your answer on the new question. – Robert Harvey
@RobertHarvey, thanks, I will follow up on the Meta post you linked to. – Bill Karwin
SELECT m1.* FROM messages m1 LEFT JOIN messages m2 ON (m1.name = m2.name AND m1.id < m2.id) WHERE m2.id IS NULL and m1.anotherID; fails if you have only one record at anotherID – webenformasyon
@webenformasyon, the way you've written that condition, the query would fail if m1.anotherID is zero. You have no comparison term, you have only treated anotherID as if it is a boolean. – Bill Karwin
I just got a downvote. Downvoter, can you please explain why you object to this answer? Perhaps I can improve it. – Bill Karwin
Just wanted to mention that this solution works on Derby databases also. – Hybris95
@BillKarwin this does not work with non-unique id's since it relies on the < comparison - is it possible to use <= somehow so it works when you have duplicate id's? – Tim
@Tim, no, <= will not help if you have a non-unique column. You must use a unique column as a tiebreaker. – Bill Karwin
The performance degrades exponentially as the number of rows increases or when groups become larger. For example a group consisting of 5 dates will yield 4+3+2+1+1 = 11 rows via left join out of which one row is filtered in the end. Performance of joining with grouped results is almost linear. Your tests look flawed. – Salman A
@SalmanA nevertheless, I did run these tests and got the results I show. If you want to do your own test, and post your own answer showing the results, be my guest. – Bill Karwin
thank god. you exist in the world sir – Accountant م
@BillKarwin what if I've non-unique column, what can be done to get rid of duplicates? – ahmed
@ahmed, see "solution 2" in newtover's answer. – Bill Karwin
@BillKarwin my workaround is that I added an extra or condition in the join and where clauses in your proposed query: SELECT m1.* FROM messages m1 LEFT JOIN messages m2 ON (m1.name = m2.name AND (m1.id < m2.id OR m1.non-unique-column < m2.non-unique-column)) WHERE m2.id IS NULL AND m2.non-unique-column IS NULL; it did the trick, although it's not really that optimized, but I'm using limit for pagination and it's fast enough for my case. Thank you – ahmed
@BillKarwin where did you get the copy of StackOverflow database? – Wakan Tanka
@BillKarwin thanks for the query. I am quite new to SQL/joins and wondering how to modify the same query to do something similar: 1) get the first record instead of the last, and 2) only get records for a certain date (my table has a date field). Thanks once again – M.M
@MenonM, You should be able to do that yourself given what I have shown above. – Bill Karwin
This is a godsend. Your query does the job and is lightning fast. I needed to grab the latest login time of each user and it worked. Thanks a lot! – The Sexiest Man in Jamaica
If the EXPLAIN takes almost as long as the query itself, does that mean that if this were a prepared query it would run much quicker? It seems most of the time is spent deciding how to retrieve the data rather than actually retrieving it. – Cruncher
@Cruncher, Not necessarily. It takes a long time to do the EXPLAIN because the optimizer actually executes the subquery for the derived table and creates a temp table for it, before it can estimate the optimization plan. – Bill Karwin

Is there any way we could use this method to delete duplicates in a table? The result set is basically a collection of unique records, so if we could delete all records not in the result set, we would effectively have no duplicates. I tried this, but MySQL gave a 1093 error.

DELETE FROM messages WHERE id NOT IN
 (SELECT m1.id  
 FROM messages m1 LEFT JOIN messages m2  
 ON (m1.name = m2.name AND m1.id < m2.id)  
 WHERE m2.id IS NULL)

Is there a way to maybe save the output to a temp variable then delete from NOT IN (temp variable)? @Bill thanks for a very useful solution.

EDIT: I think I found the solution:

DROP TABLE IF EXISTS UniqueIDs; 
CREATE Temporary table UniqueIDs (id Int(11)); 

INSERT INTO UniqueIDs 
    (SELECT T1.ID FROM `Table` T1 LEFT JOIN `Table` T2 ON 
    (T1.Field1 = T2.Field1 AND T1.Field2 = T2.Field2 #Comparison Fields  
    AND T1.ID < T2.ID) 
    WHERE T2.ID IS NULL); 

DELETE FROM `Table` WHERE id NOT IN (SELECT ID FROM UniqueIDs);
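An alternative worth noting (my suggestion, not from the original post): the 1093 error occurs because MySQL refuses to DELETE from a table that a subquery reads directly; wrapping the subquery in one more derived table forces it to be materialized first, which sidesteps the error without an explicit temporary table:

DELETE FROM messages WHERE id NOT IN (
  SELECT keep_id FROM (
    -- materialized as an implicit temporary table, so error 1093 no longer applies
    SELECT m1.id AS keep_id
    FROM messages m1
    LEFT JOIN messages m2 ON m1.name = m2.name AND m1.id < m2.id
    WHERE m2.id IS NULL
  ) AS keepers
);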

Try this:

SELECT jos_categories.title AS name,
       joined.catid,
       joined.title,
       joined.introtext
FROM   jos_categories
       INNER JOIN (SELECT *
                   FROM   (SELECT `title`,
                                  catid,
                                  `created`,
                                  introtext
                           FROM   `jos_content`
                           WHERE  `sectionid` = 6
                           ORDER  BY `id` DESC) AS yes
                   GROUP  BY `yes`.`catid` DESC
                   ORDER  BY `yes`.`created` DESC) AS joined
         ON( joined.catid = jos_categories.id )  

The query below will work as per your question.

SELECT M1.* 
FROM MESSAGES M1,
(
 SELECT SUBSTR(Other_Columns,1,2), MAX(Other_Columns) AS Max_Other_Columns
 FROM MESSAGES
 GROUP BY 1
) M2
WHERE M1.Other_Columns = M2.Max_Other_Columns
ORDER BY M1.Other_Columns;

UPD 2017-03-31: version 5.7.5 of MySQL enabled the ONLY_FULL_GROUP_BY switch by default (hence, non-deterministic GROUP BY queries are now rejected). Moreover, they updated the GROUP BY implementation, and the solution might not work as expected anymore even with the switch disabled. One needs to check.
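To check whether that switch is active on your server, and to turn it off for the current session while experimenting (a sketch; disabling it restores the non-deterministic GROUP BY behavior, so use it with care):

SELECT @@sql_mode;  -- look for ONLY_FULL_GROUP_BY in the list

SET SESSION sql_mode = (SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));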

Bill Karwin's solution above works fine when the item count within groups is rather small, but the performance of the query becomes bad when the groups are rather large, since the solution requires about n*n/2 + n/2 IS NULL comparisons (n being the size of a group).

I made my tests on an InnoDB table of 18,684,446 rows with 1,182 groups. The table contains test results for functional tests and has (test_id, request_id) as the primary key. Thus, test_id is a group, and I was searching for the last request_id for each test_id.

Bill's solution has already been running for several hours on my Dell E4310, and I do not know when it is going to finish, even though it operates on a covering index (hence Using index in EXPLAIN).

I have a couple of other solutions that are based on the same ideas:

  • if the underlying index is a BTREE index (which is usually the case), the largest (group_id, item_value) pair is the last value within each group_id, that is, the first for each group_id if we walk through the index in descending order;
  • if we read the values which are covered by an index, the values are read in the order of the index;
  • each secondary index implicitly contains the primary key columns appended to it (that is, the primary key is in the covering index). In the solutions below I operate directly on the primary key; in your case, you will just need to add the primary key columns to the result;
  • in many cases it is much cheaper to collect the required row ids in the required order in a subquery and join the result of the subquery on the id. Since for each row in the subquery result MySQL will need a single fetch based on the primary key, the subquery will be put first in the join and the rows will be output in the order of the ids in the subquery (if we omit an explicit ORDER BY for the join).

3 ways MySQL uses indexes is a great article for understanding some of the details.

Solution 1

This one is incredibly fast, about 0.8 seconds on my 18M+ rows:

SELECT test_id, MAX(request_id), request_id
FROM testresults
GROUP BY test_id DESC;

If you want to change the order to ASC, put it in a subquery, return the ids only and use that as the subquery to join to the rest of the columns:

SELECT test_id, request_id
FROM (
    SELECT test_id, MAX(request_id), request_id
    FROM testresults
    GROUP BY test_id DESC) as ids
ORDER BY test_id;

This one takes about 1.2 seconds on my data.

Solution 2

Here is another solution that takes about 19 seconds for my table:

SELECT test_id, request_id
FROM testresults, (SELECT @group:=NULL) as init
WHERE IF(IFNULL(@group, -1)=@group:=test_id, 0, 1)
ORDER BY test_id DESC, request_id DESC

It returns tests in descending order as well. It is much slower, since it does a full index scan, but it is here to give you an idea how to output the N max rows for each group (see the sketch below, after the note on caching).

The disadvantage of the query is that its result cannot be cached by the query cache.
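Here is one way that idea can be extended to the top N rows per group (a sketch only, not from the original answer; it assumes the pre-8.0 behavior of user variables evaluated row by row in an ordered derived table, which the MySQL manual warns is not guaranteed):

SELECT test_id, request_id
FROM (
  SELECT test_id, request_id,
         @n := IF(@group = test_id, @n + 1, 1) AS rn,  -- row rank within the group
         @group := test_id                             -- remember the current group
  FROM (SELECT test_id, request_id
        FROM testresults
        ORDER BY test_id DESC, request_id DESC) AS ordered,
       (SELECT @group := NULL, @n := 0) AS init
) AS ranked
WHERE rn <= 3;  -- keep the top 3 request_ids per test_id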

Please link to a dump of your tables so that people can test it on their platforms. – Pacerier
Solution 1 can't work: you can't select request_id without having it in the group by clause. – giò
@giò, this answer is 5 years old. Until MySQL 5.7.5, ONLY_FULL_GROUP_BY was disabled by default and this solution worked out of the box dev.mysql.com/doc/relnotes/mysql/5.7/en/…. Now I'm not sure if the solution still works when you disable the mode, because the implementation of GROUP BY has been changed. – newtover
First solution worked for me. Thanks. – Alok Patel
If you wanted ASC in the first solution, would it work if you turn MAX to MIN? – Jin Izzraeel
@JinIzzraeel, you have MIN by default at the top of each group (it is the order of the covering index): SELECT test_id, request_id FROM testresults GROUP BY test_id; would return the minimum request_id for each test_id. – newtover

I arrived at a different solution, which is to get the IDs for the last post within each group, then select from the messages table using the result from the first query as the argument for a WHERE x IN construct:

SELECT id, name, other_columns
FROM messages
WHERE id IN (
    SELECT MAX(id)
    FROM messages
    GROUP BY name
);

I don't know how this performs compared to some of the other solutions, but it worked spectacularly for my table with 3+ million rows. (4 second execution with 1200+ results)

This should work both on MySQL and SQL Server.

Just make sure you have an index on (name, id). – Samuel Åslund
Much better than self joins – anwerj

I've not yet tested with a large DB, but I think this could be faster than joining tables:

SELECT *, Max(Id) FROM messages GROUP BY Name
This returns arbitrary data. In other words, the returned columns might not be from the record with MAX(Id). – harm
Useful to select the max Id from a set of records with a WHERE condition: "SELECT Max(Id) FROM Prod WHERE Pn='" + Pn + "'". It returns the max Id from a set of records with the same Pn. In C#, use reader.GetString(0) to get the result. – Nicola

Solution by subquery (fiddle link):

select * from messages where id in
(select max(id) from messages group by Name)

Solution by join condition (fiddle link):

select m1.* from messages m1 
left outer join messages m2 
on ( m1.id<m2.id and m1.name=m2.name )
where m2.id is null

The reason for this post is to give the fiddle links only; the same SQL is already provided in other answers.

Here is another way to get the last related record, using GROUP_CONCAT with ORDER BY and SUBSTRING_INDEX to pick one of the records from the list:

SELECT 
  `Id`,
  `Name`,
  SUBSTRING_INDEX(
    GROUP_CONCAT(
      `Other_Columns` 
      ORDER BY `Id` DESC 
      SEPARATOR '||'
    ),
    '||',
    1
  ) Other_Columns 
FROM
  messages 
GROUP BY `Name` 

The above query will group all the Other_Columns that are in the same Name group and, using ORDER BY Id DESC, will join them in descending order with the provided separator (in my case I have used ||). SUBSTRING_INDEX over this list then picks the first one.
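One caveat worth noting with this approach (my addition, not from the original answer): GROUP_CONCAT truncates its result at group_concat_max_len, which defaults to 1024 bytes, so for large groups the limit should be raised first:

-- The default limit is 1024 bytes; raise it so the concatenated
-- list is not silently truncated before SUBSTRING_INDEX runs.
SET SESSION group_concat_max_len = 1000000;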

Fiddle Demo

SELECT 
  column1,
  column2 
FROM
  table_name 
WHERE id IN 
  (SELECT 
    MAX(id) 
  FROM
    table_name 
  GROUP BY column1) 
ORDER BY column1 ;
Could you elaborate a bit on your answer? Why is your query preferable to Vijay's original query? – janfoeh

Hi @Vijay Dev, if your table messages contains an Id which is an auto-increment primary key, then to fetch the latest record based on the primary key your query should read as below:

SELECT m1.* FROM messages m1 
INNER JOIN (SELECT MAX(Id) AS lastmsgId FROM messages GROUP BY Name) m2 
ON m1.Id = m2.lastmsgId

You can view it here as well:

http://sqlfiddle.com/#!9/ef42b/9

FIRST SOLUTION

SELECT d1.ID,Name,City FROM Demo_User d1
INNER JOIN
(SELECT MAX(ID) AS ID FROM Demo_User GROUP By NAME) AS P ON (d1.ID=P.ID);

SECOND SOLUTION

SELECT * FROM (SELECT * FROM Demo_User ORDER BY ID DESC) AS T GROUP BY NAME ;
Second Solution is best answer here (for me) – user2029890
you're welcome :) – Shrikant Gupta
Second Solution doesn't work for my case – dikirill

If you want the last row for each Name, you can give a row number to each row, grouped by Name and ordered by Id in descending order.

QUERY

SELECT t1.Id, 
       t1.Name, 
       t1.Other_Columns
FROM 
(
     SELECT Id, 
            Name, 
            Other_Columns,
    (
        CASE Name WHEN @curA 
        THEN @curRow := @curRow + 1 
        ELSE @curRow := 1 AND @curA := Name END 
    ) + 1 AS rn 
    FROM messages t, 
    (SELECT @curRow := 0, @curA := '') r 
    ORDER BY Name,Id DESC 
)t1
WHERE t1.rn = 1
ORDER BY t1.Id;

SQL Fiddle

select * from messages group by name desc
this works fine! see here also //allinonescript.com/questions/1313120/… – user2241289

How about this:

SELECT DISTINCT ON (name) *
FROM messages
ORDER BY name, id DESC;

I had a similar issue (on PostgreSQL, though) with a 1M-record table. This solution takes 1.7s vs 44s produced by the one with LEFT JOIN. In my case I had to filter the correspondent of your name field against NULL values, resulting in even better performance (0.2s).

Here is my solution:

SELECT 
  DISTINCT NAME,
  MAX(MESSAGES) OVER(PARTITION BY NAME) MESSAGES 
FROM MESSAGE;
