# GROUP BY and calculating vote percentages

Using SQL we can group together rows, so that all rows with a given value for a column become a single row of result data.

Then, on that result, we can use aggregate functions to retrieve group results.

Here's what that means...

If you GROUP BY without using an aggregate function, normally the result will be more or less meaningless. Let's say you've got this table, votes:

username TEXT vote INTEGER poll INTEGER
jose 1 1
jose 2 3
anne 1 1
rolf 2 1
jose 1 2
charlie 3 1

By looking at that data, you can see that jose has voted on polls 1, 2, and 3.

Let's say we wanted to count the number of polls each user has voted on...

You might try to do something like this:

SELECT username, COUNT(*) FROM votes;

But you can't do this. The attempt is almost correct, but we haven't told PostgreSQL that we want to count in groups--i.e. each user separately.

If PostgreSQL did let us do this, the result would be something like this:

username COUNT(*)
jose 6
jose 6
anne 6
rolf 6
jose 6
charlie 6

What this shows is we've got the username and the count of all rows. Not what we would ever want, and that's why it's not allowed.

That's where GROUP BY comes into play. When we use GROUP BY with a particular column, only that column and aggregate functions on that column or other columns may be used.

What that means is that if we ask PostgreSQL to group together all the data based on username, we'll only have access to username in that query, and the other columns through aggregate functions.

That is because upon grouping, PostgreSQL doesn't know what to do with the data in the other columns.

I'll explain with an example. Let's say we do this:

SELECT username, COUNT(username) FROM votes GROUP BY username;

This is correct, and you get back the username groups:

username COUNT(username)
rolf 1
jose 3
anne 1
charlie 1

But what has happened to the data in the vote and poll columns?

Well, should it be added together? Should the average be calculated? Should it be discarded?

PostgreSQL doesn't know what to do with the data in the other columns, because we've grouped by the username column.

You can use aggregate functions[1] on the other columns to interact with the data as a group. For example, you could do this (although it is not terribly useful):

SELECT username, COUNT(username), AVG(vote) FROM votes GROUP BY username;

And you'd get back:

username COUNT(username) AVG(vote)
rolf 1 2.000
jose 3 1.333
anne 1 1.000
charlie 1 3.000

This is the key part of GROUP BY: you no longer have the table as you did without grouping. You have access to:

  • The column you grouped on; and
  • The aggregation of other columns (via aggregate functions).

# Calculating the vote percentage in our app

Alright, with this newfound knowledge, let's start calculating the vote percentages in our app.

The first thing we need to be able to do to calculate this is to get the data from the database.

We'll need to get, for each option in a poll, the votes that were cast.

Let's say that we have one poll with three options, and six votes cast in total. Like this:

polls

id title owner_username
1 Flask vs. Django? jose

options

id option_text poll_id
1 Flask 1
2 Django 1
3 It depends 1

votes

username option_id
jose 1
rolf 2
anne 3
charlie 1
ammar 1
bob 2

So there we have it! Our poll, "Flask vs. Django?", has received 6 votes:

  • 3 for "Flask" (that's 50%)
  • 2 for "Django" (that's 33%)
  • 1 for "It depends" (that's 17%)

Let's start off by getting a set of results that contains all the votes, and for each vote we'll also get the option that they refer to:

SELECT
    options.*,
    votes.*
FROM options
LEFT JOIN votes on options.id = votes.option_id
WHERE options.poll_id = 1;

This will give us a table that looks like this:

username option_id id option_text poll_id
jose 1 1 Flask 1
rolf 2 2 Django 1
anne 3 3 It depends 1
charlie 1 1 Flask 1
ammar 1 1 Flask 1
bob 2 2 Django 1

Now the next step is for us to group by options.id. That'll give us a row per option, and the aggregate data for each option:

SELECT
    options.id,
    options.option_text,
    COUNT(votes.option_id)
FROM options
LEFT JOIN votes on options.id = votes.option_id
WHERE options.poll_id = 1
GROUP BY options.id;

Here we're now getting the option's id and option_text, and then the COUNT(votes.option_id) as we can only access that data now via an aggregate function.

This gives us something closer to what we want:

options.id options.option_text COUNT(votes.option_id)
1 Flask 3
2 Django 2
3 It depends 1

Finally, we can use what we have already learned to calculate the average:

SELECT
    options.id,
    options.option_text,
    COUNT(votes.option_id),
    COUNT(votes.option_id) * 100.0 / sum(count(votes.option_id))
FROM options
LEFT JOIN votes on options.id = votes.option_id
WHERE options.poll_id = 1
GROUP BY options.id;
options.id options.option_text COUNT(votes.option_id) COUNT(votes.option_id) * 100.0 / sum(count(votes.option_id))
1 Flask 3 50.000
2 Django 2 33.333
3 It depends 1 16.667

Notice that the last two columns especially have extremely wide names. We can fix that!

# Using AS to alias names

In PostgreSQL we can alias column and table names so that we can refer to them more easily (and so they show up in results in a more readable way). To do so, just use the AS keyword:

SELECT
    options.id,
    options.option_text,
    COUNT(votes.option_id) AS vote_count,
    COUNT(votes.option_id) * 100.0 / sum(count(votes.option_id)) AS vote_percentage
FROM options
LEFT JOIN votes on options.id = votes.option_id
WHERE options.poll_id = 1
GROUP BY options.id;

Now it looks much better:

options.id options.option_text vote_count vote_percentage
1 Flask 3 50.000
2 Django 2 33.333
3 It depends 1 16.667

# Changes to our code

In order to make use of this in our code, we'll first create the query variable:

 SELECT_LATEST_POLL_WITH_OPTIONS = """SELECT * FROM polls
 JOIN options ON polls.id = options.poll_id
 WHERE polls.id = (
     SELECT id FROM polls ORDER BY id DESC LIMIT 1
 );"""
+SELECT_POLL_VOTE_DETAILS = """
+SELECT
+    options.id,
+    options.option_text,
+    COUNT(votes.option_id) as vote_count,
+    COUNT(votes.option_id) * 100.0 / sum(count(votes.option_id)) as vote_percentage
+FROM options
+LEFT JOIN votes on options.id = votes.option_id
+WHERE options.poll_id = %s
+GROUP BY options.id;"""
 SELECT_RANDOM_VOTE = "SELECT * FROM votes WHERE option_id = %s ORDER BY RANDOM() LIMIT 1;"

Then, we'll use this in our get_poll_and_vote_results() function:

 def get_poll_and_vote_results(connection, poll_id):
     with connection:
         with connection.cursor() as cursor:
-            pass
+            cursor.execute(SELECT_POLL_VOTE_DETAILS, (poll_id,))
+            return cursor.fetchall()

# Wrapping Up

Now that we've added this code, we have completed our Python project that uses the advanced concepts in PostgreSQL.

However, in the next few chapters we'll cover some more PostgreSQL topics that can be really useful as well, such as the DISTINCT and HAVING clauses, as well as VIEW.

We won't be including them in the Python project, but you can try to do so if you want! Remember, practice is the key to mastery in programming. Challenge yourself by adding more features to this project as we go along!

In later sections though, we will continue creating Python projects that use increasingly advanced PostgreSQL topics.


  1. Aggregate functions in PostgreSQL (Official Documentation) (opens new window) ↩ī¸Ž