# The SQL DISTINCT
clause
In this chapter we'll learn about the DISTINCT
clause, which we can use to SELECT
unique results from a result set.
As an example for this chapter, we will try to get the most voted options across all polls. That means that for each poll, we want to get the most voted option.
Since we're interested in the most voted option, we'll begin by finding the vote count on each option. We can reuse part of the query we wrote when calculating the vote percentages.
SELECT
options.id,
options.option_text,
options.poll_id,
COUNT(votes.option_id) as vote_count
FROM options
LEFT JOIN votes on options.id = votes.option_id
GROUP BY options.id;
Note that here we're not limiting to one particular poll, so we're getting back all the options in the database: their id
, option_text
, poll_id
, and count of votes.
Now we want to only get the option with the highest votes in each poll. We want to make sure that each options.poll_id
only appears once.
Let's say we have three options in one poll:
id | option_text | poll_id |
---|---|---|
1 | Flask | 1 |
2 | Django | 1 |
3 | It depends | 1 |
If we ask PostgreSQL to only give us one poll_id
for this, naturally we will lose two of the rows.
So which row will we keep?
The answer is: the first one. Therefore, we need to order the results so that we will keep the first row that we're interested in.
To order results, we'll ORDER BY vote_count DESC
:
SELECT
options.id,
options.option_text,
options.poll_id,
COUNT(votes.option_id) as vote_count
FROM options
LEFT JOIN votes on options.id = votes.option_id
GROUP BY options.id
ORDER BY vote_count DESC;
And to get distinct poll_id
, we'll use the DISTINCT ON
clause. To do this, it must move to the top:
SELECT
DISTINCT ON (options.poll_id) poll_id,
options.id,
options.option_text,
COUNT(votes.option_id) as vote_count
FROM options
LEFT JOIN votes on options.id = votes.option_id
GROUP BY options.id
ORDER BY poll_id, vote_count DESC;
# The DISTINCT ON
clause
DISTINCT ON
does exactly what the example suggests: gets the results back as a whole, and then keeps the first row for each value in the column we've marked as distinct.
Note that for consistency in the order of rows returned, the column on which we apply DISTINCT ON
must be used as the first colum in the ORDER BY
clause.
The poll_id
after the DISTINCT ON ()
clause is a column alias: just a name we use to refer to it in the ORDER BY
clause later. It is optional.
# The DISTINCT
clause
Together with DISTINCT ON
, there is also an (arguably more popular) DISTINCT
clause on its own.
The DISTINCT
clause keeps one row for each group of duplicates, and it can be used on one or more columns of a table. So for example, if we had this table:
id | option_text | poll_id |
---|---|---|
1 | Flask | 1 |
2 | Django | 1 |
3 | It depends | 1 |
4 | Bob | 2 |
5 | Rolf | 2 |
6 | Flask | 3 |
7 | Django | 3 |
8 | Circuits | 3 |
Where we have 3 polls and 8 options. Note that Flask
and Django
appear twice.
We could use this option to get the unique values from column option_text
:
SELECT
DISTINCT option_text
FROM
options;
That would give us this:
option_text |
---|
Flask |
Django |
It depends |
Bob |
Rolf |
Circuits |
Notice that Flask
and Django
only appear once.
If we apply the DISTINCT
clause on two columns, then it would check duplication on both column values:
SELECT
DISTINCT option_text, poll_id
FROM
options;
Now we would get all the options, because there are no rows with the same option_text
and poll_id
as any other rows:
option_text | poll_id |
---|---|
Flask | 1 |
Django | 1 |
It depends | 1 |
Bob | 2 |
Rolf | 2 |
Flask | 3 |
Django | 3 |
Circuits | 3 |
# Wrapping Up
Alright, now that we know about the DISTINCT
and DISTINCT ON
clauses, we can use that to get the most voted option for each poll!