Postgresql for analytics

brief overview


In this article we will discover main analytics features of postgresql. If you are familiar with classic sql, you can find some postgres specific features witch will boost your analytic experience.

Topics

Dataset

Schema:

City table:

Country table:

Countrylanguage table:

[db-world.zip]


# CTE (Common table expression)

CTE (with expression) allow you to write your queries in chain style rather that classic sql statements where your write "deep queries". Step by step you can extend your query using previous results. Lets look in example above:

SELECT countrycode, langcount
FROM (
         SELECT countrycode, COUNT(*) as langcount
         FROM countrylanguage
         GROUP BY countrycode
     ) country_lang_count
WHERE langcount > 1
ORDER BY langcount DESC;
WITH country_lang_count AS (SELECT countrycode, COUNT(*) as langcount
                            FROM countrylanguage
                            GROUP BY countrycode)
SELECT countrycode, langcount
FROM country_lang_count
WHERE langcount > 1
ORDER BY langcount DESC;

As you can see CTE simplify understanding of your queries. But real power of WITH expression is RECURSIVE ability. Lets implement query that evaluate sum of first N numbers (sum[1..N]) with RECURSIVE WITH:

WITH RECURSIVE t(n) AS (
    VALUES (1)
    UNION ALL
    SELECT n + 1
    FROM t
    WHERE n < 100
)
SELECT SUM(n)
FROM t;

You can find details in documentation.

# Filtering

If we want to filter data in sql query than we typically user WHERE statement with combination of conditions like query above, but postgres allows to use filter inside select for groupby.

SELECT countrycode,
       avg(population) FILTER ( WHERE population > 10000 ) AS avg_population
FROM city
GROUP BY countrycode;
SELECT countrycode,
       avg(population) AS avg_population
FROM (
         SELECT countrycode, population
         FROM city
         WHERE population > 10000
     ) city_population
GROUP BY countrycode;

Note that in first query results can contains null in case when countrycode has no cities with population > 10000.

# Joining

-- left join
SELECT city.name, c.continent
FROM city
         LEFT JOIN country c ON city.countrycode = c.code;

-- cross join
SELECT city.name as from_city, c.continent as to_continent
FROM city
         CROSS JOIN country c;

-- cross join with self
SELECT city.name as from_city, c.name as to_city
FROM city
         CROSS JOIN city c
WHERE city.id != c.id;

# Windows

This part of postgresql is useful when you have to analyse time series data. So we will extend our schema with population dataset [csv] (extracted from worldbank). New table population contains three columns: code, year, population.

Percentiles

When we want to measure percentile [?] we shall use ordered sets. In example above we percentile_disc and within group for measure median.

SELECT percentile_disc(0.5) within group ( order by population ) as median_population
FROM population;

Rank

Rank is probability aggregate that answer to question: what position in rating will be placed if value will be equal this. For example: if gnp [Gross National Product] will be equal 500 than what position will be in each region.

SELECT
    region,
    rank(500) WITHIN GROUP ( ORDER BY gnp DESC NULLS LAST)
FROM country
GROUP BY ROLLUP (region);

Windows

We can use ROWS BETWEEN for prev and next values in aggregate function.


SELECT code,
       year,
       avg(population.population) over
        (PARTITION BY code ORDER BY year ROWS BETWEEN 0 PRECEDING AND 3 FOLLOWING)
FROM population
WHERE code = 'ABW';

# Cubes

We can use OLAP Cubes with postgres cube extension. The main idea of cubes is cache aggregates. It is look like groupby with multiple arguments. Lets look to example above:

SELECT
    continent, region, sum(gnp)
FROM country
GROUP BY continent, CUBE (region)
ORDER BY continent, region;

   continent   |          region           |    sum
---------------+---------------------------+------------
 Africa        | Central Africa            |   32938.00
 Africa        | Eastern Africa            |   69925.00
 Africa        | Northern Africa           |  243870.00
 Africa        | Southern Africa           |  126931.00
 Africa        | Western Africa            |  106711.00
 Africa        |                           |  580375.00
 Antarctica    | Antarctica                |       0.00
 Antarctica    |                           |       0.00
 Asia          | Eastern Asia              | 5524885.00
 Asia          | Middle East               |  677260.00
 Asia          | Southeast Asia            |  642643.00
 Asia          | Southern and Central Asia |  810604.00
 Asia          |                           | 7655392.00
 Europe        | Baltic Countries          |   22418.00
 Europe        | British Islands           | 1454251.00
 Europe        | Eastern Europe            |  659980.00
 Europe        | Nordic Countries          |  676655.00
 Europe        | Southern Europe           | 2012289.00
 Europe        | Western Europe            | 4673272.00
 Europe        |                           | 9498865.00
 North America | Caribbean                 |  103586.20
 North America | Central America           |  473151.00
 North America | North America             | 9111890.00
 North America |                           | 9688627.20
 Oceania       | Australia and New Zealand |  405851.00
 Oceania       | Melanesia                 |   10530.00
 Oceania       | Micronesia                |    1848.70
 Oceania       | Micronesia/Caribbean      |       0.00
 Oceania       | Polynesia                 |    1545.00
 Oceania       |                           |  419774.70
 South America | South America             | 1511874.00
 South America |                           | 1511874.00
(32 rows)

Result contains two types of group by: (continent, *), (continent, region). So we can see gnp in whole continent and in region.

If we want to save this cube we can use MATERIALIZED VIEW. In opposite to VIEW this view is not fetched every time, so we need to refresh this manually.

CREATE MATERIALIZED VIEW continent_region_gnp AS
SELECT continent,
       region,
       sum(gnp)
FROM country
GROUP BY continent, CUBE (region)
ORDER BY continent, region;

-- refresh
REFRESH MATERIALIZED VIEW continent_region_gnp;

-- drop materialized view
DROP MATERIALIZED VIEW continent_region_gnp;

Near CUBE you can find ROLLUP lets explain:

ROLLUP(e1, e2, e3, ...)

(e1, e2, e3, ...)
...
(e1, e2),
(e1),
(e1)
CUBE(e1, e2, e3)

(e1, e2, e3)
(e1, e2, *)
(e1, , e3)
(e1, *, *)
(*, e2, e3)
(*, e2, e3)
(*, e2, *)
(*, *, e3)
(*, *, e3)

Author @mrkandreev

Machine Learning Engineer