As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.

Specifically, if I've got a purchases table that looks like this:

SELECT * FROM purchases;
id | customer | total
---+----------+------
 1 | Joe      | 5
 2 | Sally    | 3
 3 | Joe      | 2
 4 | Sally    | 1

I'd like to query for the id of the largest purchase (total) made by each customer. Something like this:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY total DESC;
FIRST(id) | customer | FIRST(total)
----------+----------+-------------
        1 | Joe      | 5
        2 | Sally    | 3



On Oracle 9.2+, SQL Server 2005+, PostgreSQL 8.4+, DB2, Firebird 3.0+, Teradata, Sybase, and Vertica:

WITH summary AS (
    SELECT p.id, 
           p.customer, 
           p.total, 
           ROW_NUMBER() OVER(PARTITION BY p.customer 
                                 ORDER BY p.total DESC) AS rk
      FROM PURCHASES p)
SELECT s.*
  FROM summary s
 WHERE s.rk = 1
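
If multiple purchases tie for the highest total, this returns an arbitrary winner. A sketch of a deterministic variant (assuming you want the smallest id among the tied rows, as Erwin Brandstetter suggests in the comments below):

WITH summary AS (
    SELECT p.id, 
           p.customer, 
           p.total, 
           ROW_NUMBER() OVER(PARTITION BY p.customer 
                                 ORDER BY p.total DESC, p.id) AS rk
      FROM PURCHASES p)
SELECT s.*
  FROM summary s
 WHERE s.rk = 1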

Supported by any database, but you need to add logic to break ties:

  SELECT MIN(x.id),  -- change to MAX if you want the highest
         x.customer, 
         x.total
    FROM PURCHASES x
    JOIN (SELECT p.customer,
                 MAX(total) AS max_total
            FROM PURCHASES p
        GROUP BY p.customer) y ON y.customer = x.customer
                              AND y.max_total = x.total
GROUP BY x.customer, x.total
Teradata only allows the WITH clause after version 14, I believe; I am sure it does not work with version 12. – Ryoku
Informix 12.x also supports window functions (the CTE needs to be converted to a derived table though). And Firebird 3.0 will also support window functions. – a_horse_with_no_name
ROW_NUMBER() OVER(PARTITION BY [...]) along with some other optimizations helped me get a query down from 30 seconds to a few milliseconds. Thanks! (PostgreSQL 9.2) – Sam
If there are multiple purchases with equally the highest total for one customer, the 1st query returns an arbitrary winner (depending on implementations details; the id can change for every execution!). Typically (not always) you would want one row per customer, defined by additional criteria like "the one with the smallest id". To fix, append id to ORDER BY list of row_number(). Then you get the same result as with the 2nd query, which is very inefficient for this case. Also, you'd need another subquery for every additional column. – Erwin Brandstetter
Your query worked perfectly in Spark SQL. – zzztimbo
Google's BigQuery also supports the first query's ROW_NUMBER() command. Worked like a charm for us – Praxiteles

In PostgreSQL this is typically simpler and faster (more performance optimization below):

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

SELECT DISTINCT ON (2)
       id, customer, total
FROM   purchases
ORDER  BY 2, 3 DESC, 1;

If total can be NULL (won't hurt either way, but you'll want to match existing indexes):

...
ORDER  BY customer, total DESC NULLS LAST, id;

Major points

  • DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).

  • List any number of expressions in the DISTINCT ON clause; the combined row value defines duplicates. The manual:

    Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.

  • DISTINCT ON can be combined with ORDER BY. Leading expressions have to match leading DISTINCT ON expressions in the same order. You can add additional expressions to ORDER BY to pick a particular row from each group of peers. I added id as last item to break ties:

    "Pick the row with the smallest id from each group sharing the highest total."

    If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST as demonstrated.

  • The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way. (Not needed in the simple case above):

    • You don't have to include any of the expressions in DISTINCT ON or ORDER BY.

    • You can include any other expression in the SELECT list. This is instrumental in replacing much more complex queries with subqueries and aggregate / window functions (see the sketch after this list).

  • I tested with versions 8.3 – 10. But the feature has been there at least since version 7.1, so basically always.
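
For example, a minimal sketch showing the SELECT list carrying an expression that appears in neither DISTINCT ON nor ORDER BY (note is a hypothetical extra column on purchases, added here for illustration):

SELECT DISTINCT ON (customer)
       id, customer, total, note  -- hypothetical column; not in DISTINCT ON / ORDER BY
FROM   purchases
ORDER  BY customer, total DESC, id;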

Index

The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

May be too specialized for real world applications. But use it if read performance is crucial. If you have DESC NULLS LAST in the query, use the same in the index so Postgres knows sort order matches.
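
For example, to match the NULLS LAST variant of the query above (the index name is my assumption):

CREATE INDEX purchases_3c_nl_idx ON purchases (customer, total DESC NULLS LAST, id);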

Effectiveness / Performance optimization

You have to weigh cost and benefit before you create a tailored index for every query. The potential of above index largely depends on data distribution.

The index is used because it delivers pre-sorted data, and in Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.

Benchmark

I had a simple benchmark here for Postgres 9.1, which was outdated by 2016. So I ran a new one with a better, reproducible setup for Postgres 9.4 and 9.5 and added the detailed results in another answer.

This is a great answer for most database sizes, but I want to point out that as you approach ~million rows DISTINCT ON becomes extremely slow. The implementation always sorts the entire table and scans through it for duplicates, ignoring all indices (even if you have created the required multi-column index). See explainextended.com/2009/05/03/postgresql-optimizing-distinct for a possible solution. – Meekohi
@Meekohi: I added a chapter discussing effectiveness of the index and alternatives. – Erwin Brandstetter
Using ordinals to "make the code shorter" is a terrible idea. How about leaving the column names in to make it readable? – KOTJMF
@KOTJMF: I suggest you go with your personal preference then. I demonstrate both options to educate. The syntax shorthand can be useful for long expressions in the SELECT list. – Erwin Brandstetter
@ErwinBrandstetter any scripts of your benchmark? I'm interested in performance for 9.5. – jangorecki
@jangorecki: The original benchmark is from 2011; I don't have the setup any more. But it was about time to run tests with pg 9.4 and pg 9.5 anyway. See details in the added answer. You might add a comment with results from your installation below? – Erwin Brandstetter
Nice! I wasn't aware of DISTINCT ON until now. Looks like a very useful tool. – isapir

This solution is not very efficient, as pointed out by Erwin, because of the presence of correlated subqueries:

select *
from   purchases p1
where  total in (select max(total) from purchases where customer = p1.customer)
order  by total desc;
Thanks, yes, I agree with you; the join between the subquery and the outer query actually takes longer. IN won't be an issue here, as the subquery returns only one row. BTW, what syntax error are you pointing to? – user2407394
Ohh, I'm used to Teradata; edited now. However, breaking ties is not required here, as it needs to find the highest total for each customer. – user2407394
You are aware that you get multiple rows for a single customer in case of a tie? Whether that is desirable depends on exact requirements. Normally, it isn't. For the question at hand, the title is pretty clear. – Erwin Brandstetter
This is not clear from the question; if the same customer has the max purchase for 2 different ids, I think we should display both. – user2407394

This is a common problem, which already has well-tested and highly optimized solutions. Personally, I prefer the LEFT JOIN solution by Bill Karwin (the original post with lots of other solutions).

Note that a bunch of solutions to this common problem can, surprisingly, be found in one of the most official sources, the MySQL manual! See Examples of Common Queries :: The Rows Holding the Group-wise Maximum of a Certain Column.
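
For reference, a sketch of that LEFT JOIN (anti-join) approach adapted to this question's table: a row survives only if no other row of the same customer has a higher total (with id breaking ties):

SELECT p1.id, p1.customer, p1.total
FROM   purchases p1
LEFT   JOIN purchases p2 ON p2.customer = p1.customer
                        AND (p2.total > p1.total
                             OR (p2.total = p1.total AND p2.id < p1.id))
WHERE  p2.id IS NULL;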

How is the MySQL manual in any way "official" for Postgres / SQLite (not to mention SQL) questions? Also, to be clear, the DISTINCT ON version is much shorter, simpler and generally performs better in Postgres than alternatives with a self LEFT JOIN or semi-anti-join with NOT EXISTS. It is also "well tested". – Erwin Brandstetter
Additionally to what Erwin wrote, I'd say that using a window function (which is common SQL functionality nowadays) is almost always faster than using a join with a derived table – a_horse_with_no_name
Great references. I didn't know this was called the greatest-n-per-group problem. Thank you. – David Mann
The question does not ask for the greatest n per group but the first n. – reinierpost
@a_horse_with_no_name any references to the fact window functions are faster, please? – Artem Novikov
In a case with two order fields that I tried, the "left join solution by Bill Karwin" gave poor performance. See my comment below //allinonescript.com/a/8749095/684229 – Johnny Wong

Very fast solution

SELECT a.* 
FROM
    purchases a 
    JOIN ( 
        SELECT customer, min( id ) as id 
        FROM purchases 
        GROUP BY customer 
    ) b USING ( id );

and really very fast if the table is indexed by id:

create index purchases_id on purchases (id);
The USING clause is very much standard. It's just that some minor database systems don't have it. – Holger Jakobs
That's true. The post has been edited. – Alejandro Salamanca Mazuelo
This doesn't find each customer's purchase with the largest total – Johnny Wong

In Postgres you can use array_agg like this:

SELECT  customer,
        (array_agg(id ORDER BY total DESC))[1],
        max(total)
FROM purchases
GROUP BY customer

This will give you the id of each customer's largest purchase.

Some things to note:

  • array_agg is an aggregate function, so it works with GROUP BY.
  • array_agg lets you specify an ordering scoped to just itself, so it doesn't constrain the structure of the whole query. There is also syntax for how you sort NULLs, if you need to do something different from the default (see the sketch after this list).
  • Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed).
  • You could use array_agg in a similar way for your third output column, but max(total) is simpler.
  • Unlike DISTINCT ON, using array_agg lets you keep your GROUP BY, in case you want that for other reasons.
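
For instance, a minimal sketch of the same query, assuming total can be NULL and you want non-null totals to win:

SELECT  customer,
        (array_agg(id ORDER BY total DESC NULLS LAST))[1],
        max(total)
FROM purchases
GROUP BY customer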

I use this approach (PostgreSQL only): https://wiki.postgresql.org/wiki/First/last_%28aggregate%29

-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $1;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.first (
        sfunc    = public.first_agg,
        basetype = anyelement,
        stype    = anyelement
);

-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $2;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.last (
        sfunc    = public.last_agg,
        basetype = anyelement,
        stype    = anyelement
);

Then your example should work almost as is:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY FIRST(total) DESC;

CAVEAT: It ignores NULL values.


Edit 1 - Use the postgres extension instead

Now I use this approach: http://pgxn.org/dist/first_last_agg/

To install on Ubuntu 14.04:

apt-get install postgresql-server-dev-9.3 git build-essential -y
git clone git://github.com/wulczer/first_last_agg.git
cd first_last_agg
make && sudo make install
psql -c 'create extension first_last_agg'

It's a postgres extension that gives you first and last functions; apparently faster than the above way.


Edit 2 - Ordering and filtering

If you use aggregate functions (like these), you can order the results without needing the input data to be ordered beforehand:

http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-AGGREGATES

So the equivalent example, with ordering, would be something like:

SELECT first(id order by id), customer, first(total order by id)
  FROM purchases
 GROUP BY customer
 ORDER BY first(total);

Of course you can order and filter as you deem fit within the aggregate; it's very powerful syntax.
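
For example, a sketch combining an aggregate-level ORDER BY with a FILTER clause (PostgreSQL 9.4+ for FILTER; skipping NULL totals is just an illustrative assumption) to pick the id of each customer's highest total:

SELECT first(id ORDER BY total DESC) FILTER (WHERE total IS NOT NULL) AS id
     , customer
     , max(total) AS total
FROM   purchases
GROUP  BY customer;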

Benchmark

Testing the most interesting candidates with Postgres 9.4 and 9.5 with a halfway realistic table of 200k rows in purchases and 10k distinct customer_id (avg. 20 rows per customer).

For Postgres 9.5 I ran a 2nd test with effectively 86446 distinct customers. See below (avg. 2.3 rows per customer).

Setup

Main table

CREATE TABLE purchases (
  id          serial
, customer_id int  -- REFERENCES customer
, total       int  -- could be amount of money in Cent
, some_column text -- to make the row bigger, more realistic
);

I use a serial (PK constraint added below) and an integer customer_id since that's a more typical setup. Also added some_column to make up for typically more columns.

Dummy data, PK, index - a typical table also has some dead tuples:

INSERT INTO purchases (customer_id, total, some_column)    -- insert 200k rows
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9; -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;

customer table - for superior query

CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM   purchases
GROUP  BY 1
ORDER  BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;

In my second test for 9.5 I used the same setup, but with random() * 100000 to generate customer_id to get only few rows per customer_id.

Object sizes for table purchases

Generated with this query.

               what                | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
 core_relation_size                | 20496384 | 20 MB        |           102
 visibility_map                    |        0 | 0 bytes      |             0
 free_space_map                    |    24576 | 24 kB        |             0
 table_size_incl_toast             | 20529152 | 20 MB        |           102
 indexes_size                      | 10977280 | 10 MB        |            54
 total_size_incl_toast_and_indexes | 31506432 | 30 MB        |           157
 live_rows_in_text_representation  | 13729802 | 13 MB        |            68
 ------------------------------    |          |              |
 row_count                         |   200045 |              |
 live_tuples                       |   200045 |              |
 dead_tuples                       |    19955 |              |

Queries

1. row_number() in CTE (see other answer)

WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;

2. row_number() in subquery (my optimization)

SELECT id, customer_id, total
FROM   (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;

3. DISTINCT ON (see other answer)

SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;

4. rCTE with LATERAL subquery (see here)

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;

5. customer table with LATERAL (see here)

SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;

6. array_agg() with ORDER BY (see other answer)

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;

Results

Execution times for the above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.

All queries used an Index Only Scan on purchases_3c_idx (among other steps). Some of them just benefit from the smaller size of the index, others use it more effectively.

A. Postgres 9.4 with 200k rows and ~ 20 per customer_id

1. 273.274 ms  
2. 194.572 ms  
3. 111.067 ms  
4.  92.922 ms  
5.  37.679 ms  -- winner
6. 189.495 ms

B. The same with Postgres 9.5

1. 288.006 ms
2. 223.032 ms  
3. 107.074 ms  
4.  78.032 ms  
5.  33.944 ms  -- winner
6. 211.540 ms  

C. Same as B., but with ~ 2.3 rows per customer_id

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms

Original (outdated) benchmark from 2011

I ran three tests with PostgreSQL 9.1 on a real-life table of 65579 rows and single-column btree indexes on each of the three columns involved, and took the best execution time of 5 runs.
Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):

  1. Select the whole table (results in 5958 rows in this case).

    A: 567.218 ms
    B: 386.673 ms
    
  2. Use condition WHERE customer BETWEEN x AND y resulting in 1000 rows.

    A: 249.136 ms
    B:  55.111 ms
    
  3. Select a single customer with WHERE customer = x.

    A:   0.143 ms
    B:   0.072 ms
    

Same test repeated with the index described in the other answer

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

1A: 277.953 ms  
1B: 193.547 ms

2A: 249.796 ms -- special index not used  
2B:  28.679 ms

3A:   0.120 ms  
3B:   0.048 ms
Thanks for a great benchmark. I was wondering if querying events data, where you have a timestamp instead of total, would benefit from the new BRIN index. This can potentially give a speedup for temporal queries. – jangorecki
@jangorecki: Any huge table with physically sorted data can profit from a BRIN index. – Erwin Brandstetter
@ErwinBrandstetter In the 2. row_number() and 5. customer table with LATERAL examples, what ensures that the id will be the smallest? – Artem Novikov
@ArtemNovikov: Nothing. The objective is to retrieve, per customer_id the row with the highest total. It's a misleading coincidence in the test data of the question that the id in the selected rows happens to also be the smallest per customer_id. – Erwin Brandstetter
@ErwinBrandstetter then why do you need it in the index? – Artem Novikov
@ArtemNovikov: To allow index-only scans. – Erwin Brandstetter

The accepted "Supported by any database" solution from OMG Ponies has good speed in my test.

Here I provide a same-approach but more complete and clean any-database solution. Ties are handled (assuming the desire to get only one row for each customer, even when there are multiple records for the max total per customer), and other purchase fields (e.g. purchase_payment_id) will be selected for the real matching rows in the purchase table.

Supported by any database:

select * from purchase
join (
    select min(id) as id from purchase
    join (
        select customer, max(total) as total from purchase
        group by customer
    ) t1 using (customer, total)
    group by customer
) t2 using (id)
order by customer

This query is reasonably fast, especially when there is a composite index like (customer, total) on the purchase table, as sketched below.
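
A minimal sketch of such an index (the name is an assumption):

create index purchase_customer_total_idx on purchase (customer, total);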

Remark:

  1. t1 and t2 are subquery aliases, which could be removed depending on the database.

  2. Caveat: the using (...) clause is currently not supported in MS-SQL and Oracle db as of this edit in Jan 2017. You have to expand it yourself, e.g. to on t2.id = purchase.id etc., as sketched below. The USING syntax works in SQLite, MySQL and PostgreSQL.
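
An untested sketch of the same query with using (...) expanded into explicit ON conditions, for databases that lack USING support:

select purchase.* from purchase
join (
    select min(purchase.id) as id from purchase
    join (
        select customer, max(total) as total from purchase
        group by customer
    ) t1 on t1.customer = purchase.customer
        and t1.total = purchase.total
    group by purchase.customer
) t2 on t2.id = purchase.id
order by purchase.customer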
