# GROUP BY
and calculating vote percentages
Using SQL we can group together rows, so that all rows with a given value for a column become a single row of result data.
Then, on that result, we can use aggregate functions to retrieve group results.
Here's what that means...
If you GROUP BY
without using an aggregate function, normally the result will be more or less meaningless. Let's say you've got this table, votes
:
username TEXT | vote INTEGER | poll INTEGER |
---|---|---|
jose | 1 | 1 |
jose | 2 | 3 |
anne | 1 | 1 |
rolf | 2 | 1 |
jose | 1 | 2 |
charlie | 3 | 1 |
By looking at that data, you can see that jose
has voted on polls 1
, 2
, and 3
.
Let's say we wanted to count the number of polls each user has voted on...
You might try to do something like this:
SELECT username, COUNT(*) FROM votes;
But you can't do this. The attempt is almost correct, but we haven't told PostgreSQL that we want to count in groups--i.e. each user separately.
If PostgreSQL did let us do this, the result would be something like this:
username | COUNT(*) |
---|---|
jose | 6 |
jose | 6 |
anne | 6 |
rolf | 6 |
jose | 6 |
charlie | 6 |
What this shows is we've got the username and the count of all rows. Not what we would ever want, and that's why it's not allowed.
That's where GROUP BY
comes into play. When we use GROUP BY
with a particular column, only that column and aggregate functions on that column or other columns may be used.
What that means is that if we ask PostgreSQL to group together all the data based on username
, we'll only have access to username
in that query, and the other columns through aggregate functions.
That is because upon grouping, PostgreSQL doesn't know what to do with the data in the other columns.
I'll explain with an example. Let's say we do this:
SELECT username, COUNT(username) FROM votes GROUP BY username;
This is correct, and you get back the username
groups:
username | COUNT(username) |
---|---|
rolf | 1 |
jose | 3 |
anne | 1 |
charlie | 1 |
But what has happened to the data in the vote
and poll
columns?
Well, should it be added together? Should the average be calculated? Should it be discarded?
PostgreSQL doesn't know what to do with the data in the other columns, because we've grouped by the username
column.
You can use aggregate functions[1] on the other columns to interact with the data as a group. For example, you could do this (although it is not terribly useful):
SELECT username, COUNT(username), AVG(vote) FROM votes GROUP BY username;
And you'd get back:
username | COUNT(username) | AVG(vote) |
---|---|---|
rolf | 1 | 2.000 |
jose | 3 | 1.333 |
anne | 1 | 1.000 |
charlie | 1 | 3.000 |
This is the key part of GROUP BY
: you no longer have the table as you did without grouping. You have access to:
- The column you grouped on; and
- The aggregation of other columns (via aggregate functions).
# Calculating the vote percentage in our app
Alright, with this newfound knowledge, let's start calculating the vote percentages in our app.
The first thing we need to be able to do to calculate this is to get the data from the database.
We'll need to get, for each option in a poll, the votes that were cast.
Let's say that we have one poll with three options, and six votes cast in total. Like this:
polls
id | title | owner_username |
---|---|---|
1 | Flask vs. Django? | jose |
options
id | option_text | poll_id |
---|---|---|
1 | Flask | 1 |
2 | Django | 1 |
3 | It depends | 1 |
votes
username | option_id |
---|---|
jose | 1 |
rolf | 2 |
anne | 3 |
charlie | 1 |
ammar | 1 |
bob | 2 |
So there we have it! Our poll, "Flask vs. Django?", has received 6 votes:
- 3 for "Flask" (that's 50%)
- 2 for "Django" (that's 33%)
- 1 for "It depends" (that's 17%)
Let's start off by getting a set of results that contains all the votes, and for each vote we'll also get the option that they refer to:
SELECT
options.*,
votes.*
FROM options
LEFT JOIN votes on options.id = votes.option_id
WHERE options.poll_id = 1;
This will give us a table that looks like this:
username | option_id | id | option_text | poll_id |
---|---|---|---|---|
jose | 1 | 1 | Flask | 1 |
rolf | 2 | 2 | Django | 1 |
anne | 3 | 3 | It depends | 1 |
charlie | 1 | 1 | Flask | 1 |
ammar | 1 | 1 | Flask | 1 |
bob | 2 | 2 | Django | 1 |
Now the next step is for us to group by options.id
. That'll give us a row per option, and the aggregate data for each option:
SELECT
options.id,
options.option_text,
COUNT(votes.option_id)
FROM options
LEFT JOIN votes on options.id = votes.option_id
WHERE options.poll_id = 1
GROUP BY options.id;
Here we're now getting the option's id
and option_text
, and then the COUNT(votes.option_id)
as we can only access that data now via an aggregate function.
This gives us something closer to what we want:
options.id | options.option_text | COUNT(votes.option_id) |
---|---|---|
1 | Flask | 3 |
2 | Django | 2 |
3 | It depends | 1 |
Finally, we can use what we have already learned to calculate the average:
SELECT
options.id,
options.option_text,
COUNT(votes.option_id),
COUNT(votes.option_id) * 100.0 / sum(count(votes.option_id))
FROM options
LEFT JOIN votes on options.id = votes.option_id
WHERE options.poll_id = 1
GROUP BY options.id;
options.id | options.option_text | COUNT(votes.option_id) | COUNT(votes.option_id) * 100.0 / sum(count(votes.option_id)) |
---|---|---|---|
1 | Flask | 3 | 50.000 |
2 | Django | 2 | 33.333 |
3 | It depends | 1 | 16.667 |
Notice that the last two columns especially have extremely wide names. We can fix that!
# Using AS
to alias names
In PostgreSQL we can alias column and table names so that we can refer to them more easily (and so they show up in results in a more readable way). To do so, just use the AS
keyword:
SELECT
options.id,
options.option_text,
COUNT(votes.option_id) AS vote_count,
COUNT(votes.option_id) * 100.0 / sum(count(votes.option_id)) AS vote_percentage
FROM options
LEFT JOIN votes on options.id = votes.option_id
WHERE options.poll_id = 1
GROUP BY options.id;
Now it looks much better:
options.id | options.option_text | vote_count | vote_percentage |
---|---|---|---|
1 | Flask | 3 | 50.000 |
2 | Django | 2 | 33.333 |
3 | It depends | 1 | 16.667 |
# Changes to our code
In order to make use of this in our code, we'll first create the query variable:
SELECT_LATEST_POLL_WITH_OPTIONS = """SELECT * FROM polls
JOIN options ON polls.id = options.poll_id
WHERE polls.id = (
SELECT id FROM polls ORDER BY id DESC LIMIT 1
);"""
+SELECT_POLL_VOTE_DETAILS = """
+SELECT
+ options.id,
+ options.option_text,
+ COUNT(votes.option_id) as vote_count,
+ COUNT(votes.option_id) * 100.0 / sum(count(votes.option_id)) as vote_percentage
+FROM options
+LEFT JOIN votes on options.id = votes.option_id
+WHERE options.poll_id = %s
+GROUP BY options.id;"""
SELECT_RANDOM_VOTE = "SELECT * FROM votes WHERE option_id = %s ORDER BY RANDOM() LIMIT 1;"
Then, we'll use this in our get_poll_and_vote_results()
function:
def get_poll_and_vote_results(connection, poll_id):
with connection:
with connection.cursor() as cursor:
- pass
+ cursor.execute(SELECT_POLL_VOTE_DETAILS, (poll_id,))
+ return cursor.fetchall()
# Wrapping Up
Now that we've added this code, we have completed our Python project that uses the advanced concepts in PostgreSQL.
However, in the next few chapters we'll cover some more PostgreSQL topics that can be really useful as well, such as the DISTINCT
and HAVING
clauses, as well as VIEW
.
We won't be including them in the Python project, but you can try to do so if you want! Remember, practice is the key to mastery in programming. Challenge yourself by adding more features to this project as we go along!
In later sections though, we will continue creating Python projects that use increasingly advanced PostgreSQL topics.