pg_clickhouse Tutorial - ClickHouse Documentation

Overview

This tutorial follows the [ClickHouse tutorial], but runs all of its queries through pg_clickhouse.

Start ClickHouse

First, create a ClickHouse database if you don't already have one. A quick way to get started is to use the Docker image:

docker run -d --network host --name clickhouse -p 8123:8123 -p 9000:9000 --ulimit nofile=262144:262144 clickhouse
docker exec -it clickhouse clickhouse-client

Create a table

We'll use the [ClickHouse tutorial] as the basis for creating a simple database with the New York City taxi dataset:

CREATE DATABASE taxi;
CREATE TABLE taxi.trips
(
    trip_id UInt32,
    vendor_id Enum8(
        '1'      =  1, '2'      =  2, '3'      =  3, '4'      =  4,
        'CMT'    =  5, 'VTS'    =  6, 'DDS'    =  7, 'B02512' = 10,
        'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14,
        ''       = 15
    ),
    pickup_date Date,
    pickup_datetime DateTime,
    dropoff_date Date,
    dropoff_datetime DateTime,
    store_and_fwd_flag UInt8,
    rate_code_id UInt8,
    pickup_longitude Float64,
    pickup_latitude Float64,
    dropoff_longitude Float64,
    dropoff_latitude Float64,
    passenger_count UInt8,
    trip_distance Float64,
    fare_amount Decimal(10, 2),
    extra Decimal(10, 2),
    mta_tax Decimal(10, 2),
    tip_amount Decimal(10, 2),
    tolls_amount Decimal(10, 2),
    ehail_fee Decimal(10, 2),
    improvement_surcharge Decimal(10, 2),
    total_amount Decimal(10, 2),
    payment_type Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
    trip_type UInt8,
    pickup FixedString(25),
    dropoff FixedString(25),
    cab_type Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
    pickup_nyct2010_gid Int8,
    pickup_ctlabel Float32,
    pickup_borocode Int8,
    pickup_ct2010 String,
    pickup_boroct2010 String,
    pickup_cdeligibil String,
    pickup_ntacode FixedString(4),
    pickup_ntaname String,
    pickup_puma UInt16,
    dropoff_nyct2010_gid UInt8,
    dropoff_ctlabel Float32,
    dropoff_borocode UInt8,
    dropoff_ct2010 String,
    dropoff_boroct2010 String,
    dropoff_cdeligibil String,
    dropoff_ntacode FixedString(4),
    dropoff_ntaname String,
    dropoff_puma UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;

Add the dataset

Next, import the data:

INSERT INTO taxi.trips
SELECT * FROM s3(
    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
    'TabSeparatedWithNames', "
    trip_id UInt32,
    vendor_id Enum8(
        '1'      =  1, '2'      =  2, '3'      =  3, '4'      =  4,
        'CMT'    =  5, 'VTS'    =  6, 'DDS'    =  7, 'B02512' = 10,
        'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14,
        ''       = 15
    ),
    pickup_date Date,
    pickup_datetime DateTime,
    dropoff_date Date,
    dropoff_datetime DateTime,
    store_and_fwd_flag UInt8,
    rate_code_id UInt8,
    pickup_longitude Float64,
    pickup_latitude Float64,
    dropoff_longitude Float64,
    dropoff_latitude Float64,
    passenger_count UInt8,
    trip_distance Float64,
    fare_amount Decimal(10, 2),
    extra Decimal(10, 2),
    mta_tax Decimal(10, 2),
    tip_amount Decimal(10, 2),
    tolls_amount Decimal(10, 2),
    ehail_fee Decimal(10, 2),
    improvement_surcharge Decimal(10, 2),
    total_amount Decimal(10, 2),
    payment_type Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
    trip_type UInt8,
    pickup FixedString(25),
    dropoff FixedString(25),
    cab_type Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
    pickup_nyct2010_gid Int8,
    pickup_ctlabel Float32,
    pickup_borocode Int8,
    pickup_ct2010 String,
    pickup_boroct2010 String,
    pickup_cdeligibil String,
    pickup_ntacode FixedString(4),
    pickup_ntaname String,
    pickup_puma UInt16,
    dropoff_nyct2010_gid UInt8,
    dropoff_ctlabel Float32,
    dropoff_borocode UInt8,
    dropoff_ct2010 String,
    dropoff_boroct2010 String,
    dropoff_cdeligibil String,
    dropoff_ntacode FixedString(4),
    dropoff_ntaname String,
    dropoff_puma UInt16
") SETTINGS input_format_try_infer_datetimes = 0

Make sure you can query it, then quit the client:

SELECT count() FROM taxi.trips;
quit

Install pg_clickhouse

Build and install pg_clickhouse from PGXN or GitHub. Alternatively, start a Docker container using the [pg_clickhouse image], which simply adds pg_clickhouse to the Docker [Postgres image]:

docker run -d --network host --name pg_clickhouse -e POSTGRES_PASSWORD=my_pass \
       ghcr.io/clickhouse/pg_clickhouse:18

Connect pg_clickhouse

Now connect to Postgres:

docker exec -it pg_clickhouse psql -U postgres

Then create the pg_clickhouse extension:

CREATE EXTENSION pg_clickhouse;

Create a foreign server using the host name, port, and database of your ClickHouse instance:

CREATE SERVER taxi_srv FOREIGN DATA WRAPPER clickhouse_fdw
       OPTIONS(driver 'binary', host 'localhost', dbname 'taxi');

Here we chose the binary driver, which uses the ClickHouse binary protocol. You can also use the "http" driver, which uses the HTTP interface. Next, map a PostgreSQL user to a ClickHouse user. The simplest way to do so is to map the current PostgreSQL user to a remote user on the foreign server:

CREATE USER MAPPING FOR CURRENT_USER SERVER taxi_srv
       OPTIONS (user 'default');

You can also specify the password option. Now add the taxi table by importing all of the tables in the remote ClickHouse database into a Postgres schema:

CREATE SCHEMA taxi;
IMPORT FOREIGN SCHEMA taxi FROM SERVER taxi_srv INTO taxi;

The table should now be imported. In psql, use \det+ to see it:

taxi=# \det+ taxi.*
                                       List of foreign tables
 Schema | Table |  Server  |                        FDW options                        | Description
--------+-------+----------+-----------------------------------------------------------+-------------
 taxi   | trips | taxi_srv | (database 'taxi', table_name 'trips', engine 'MergeTree') | [null]
(1 row)

Success! Use \d to show all of the columns:

taxi=# \d taxi.trips
                                   Foreign table "taxi.trips"
        Column         |           Type           | Collation | Nullable | Default | FDW options
-----------------------+--------------------------+-----------+----------+---------+-------------
 trip_id               | bigint                   |           | not null |         |
 vendor_id             | text                     |           | not null |         |
 pickup_date           | date                     |           | not null |         |
 pickup_datetime       | timestamp with time zone |           | not null |         |
 dropoff_date          | date                     |           | not null |         |
 dropoff_datetime      | timestamp with time zone |           | not null |         |
 store_and_fwd_flag    | smallint                 |           | not null |         |
 rate_code_id          | smallint                 |           | not null |         |
 pickup_longitude      | double precision         |           | not null |         |
 pickup_latitude       | double precision         |           | not null |         |
 dropoff_longitude     | double precision         |           | not null |         |
 dropoff_latitude      | double precision         |           | not null |         |
 passenger_count       | smallint                 |           | not null |         |
 trip_distance         | double precision         |           | not null |         |
 fare_amount           | numeric(10,2)            |           | not null |         |
 extra                 | numeric(10,2)            |           | not null |         |
 mta_tax               | numeric(10,2)            |           | not null |         |
 tip_amount            | numeric(10,2)            |           | not null |         |
 tolls_amount          | numeric(10,2)            |           | not null |         |
 ehail_fee             | numeric(10,2)            |           | not null |         |
 improvement_surcharge | numeric(10,2)            |           | not null |         |
 total_amount          | numeric(10,2)            |           | not null |         |
 payment_type          | text                     |           | not null |         |
 trip_type             | smallint                 |           | not null |         |
 pickup                | character varying(25)    |           | not null |         |
 dropoff               | character varying(25)    |           | not null |         |
 cab_type              | text                     |           | not null |         |
 pickup_nyct2010_gid   | smallint                 |           | not null |         |
 pickup_ctlabel        | real                     |           | not null |         |
 pickup_borocode       | smallint                 |           | not null |         |
 pickup_ct2010         | text                     |           | not null |         |
 pickup_boroct2010     | text                     |           | not null |         |
 pickup_cdeligibil     | text                     |           | not null |         |
 pickup_ntacode        | character varying(4)     |           | not null |         |
 pickup_ntaname        | text                     |           | not null |         |
 pickup_puma           | integer                  |           | not null |         |
 dropoff_nyct2010_gid  | smallint                 |           | not null |         |
 dropoff_ctlabel       | real                     |           | not null |         |
 dropoff_borocode      | smallint                 |           | not null |         |
 dropoff_ct2010        | text                     |           | not null |         |
 dropoff_boroct2010    | text                     |           | not null |         |
 dropoff_cdeligibil    | text                     |           | not null |         |
 dropoff_ntacode       | character varying(4)     |           | not null |         |
 dropoff_ntaname       | text                     |           | not null |         |
 dropoff_puma          | integer                  |           | not null |         |
Server: taxi_srv
FDW options: (database 'taxi', table_name 'trips', engine 'MergeTree')

Now query the table:

 SELECT count(*) FROM taxi.trips;
   count
 ---------
  1999657
 (1 row)

Note how quickly the query ran. pg_clickhouse pushes down the entire query, including the COUNT() aggregate, so that it runs on ClickHouse and returns only a single row to Postgres. Use EXPLAIN to see this:

 EXPLAIN select count(*) from taxi.trips;
                    QUERY PLAN
 -------------------------------------------------
  Foreign Scan  (cost=1.00..-0.90 rows=1 width=8)
    Relations: Aggregate on (trips)
 (2 rows)

Note that "Foreign Scan" appears at the root of the plan, which means that the entire query was pushed down to ClickHouse.
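
The same check can be applied to any other query before running it. As a sketch, the statement below asks for the plan of a grouped aggregate over taxi.trips. No output is shown here because the exact plan depends on the pg_clickhouse version and on which expressions it can push down. If the root node is anything other than a Foreign Scan, part of the work is being done in Postgres.

```sql
-- Plan only: EXPLAIN without ANALYZE does not run the query itself.
EXPLAIN
SELECT passenger_count, avg(total_amount)
FROM taxi.trips
GROUP BY passenger_count;
```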

Analyze the data

Run a few queries to analyze the data. Explore the following examples or try your own SQL query.

Calculate the average tip amount:

taxi=# \timing
Timing is on.
taxi=# SELECT round(avg(tip_amount), 2) FROM taxi.trips;
 round
-------
  1.68
(1 row)

Time: 9.438 ms

Calculate the average cost based on the number of passengers:

taxi=# SELECT
        passenger_count,
        avg(total_amount)::NUMERIC(10, 2) AS average_total_amount
    FROM taxi.trips
    GROUP BY passenger_count;
 passenger_count | average_total_amount
-----------------+----------------------
               0 |                22.68
               1 |                15.96
               2 |                17.14
               3 |                16.75
               4 |                17.32
               5 |                16.34
               6 |                16.03
               7 |                59.79
               8 |                36.40
               9 |                 9.79
(10 rows)

Time: 27.266 ms

Calculate the daily number of pickups per neighborhood:

taxi=# SELECT
    pickup_date,
    pickup_ntaname,
    SUM(1) AS number_of_trips
FROM taxi.trips
GROUP BY pickup_date, pickup_ntaname
ORDER BY pickup_date ASC LIMIT 10;
 pickup_date |         pickup_ntaname         | number_of_trips
-------------+--------------------------------+-----------------
 2015-07-01  | Williamsburg                   |               1
 2015-07-01  | park-cemetery-etc-Queens       |               6
 2015-07-01  | Maspeth                        |               1
 2015-07-01  | Stuyvesant Town-Cooper Village |              44
 2015-07-01  | Rego Park                      |               1
 2015-07-01  | Greenpoint                     |               7
 2015-07-01  | Highbridge                     |               1
 2015-07-01  | Briarwood-Jamaica Hills        |               3
 2015-07-01  | Airport                        |             550
 2015-07-01  | East Harlem North              |              32
(10 rows)

Time: 30.978 ms

Calculate the length of each trip in minutes, then group the results by trip length:

taxi=# SELECT
    avg(tip_amount) AS avg_tip,
    avg(fare_amount) AS avg_fare,
    avg(passenger_count) AS avg_passenger,
    count(*) AS count,
    round((date_part('epoch', dropoff_datetime) - date_part('epoch', pickup_datetime)) / 60) as trip_minutes
FROM taxi.trips
WHERE round((date_part('epoch', dropoff_datetime) - date_part('epoch', pickup_datetime)) / 60) > 0
GROUP BY trip_minutes
ORDER BY trip_minutes DESC
LIMIT 5;
      avg_tip      |     avg_fare     |  avg_passenger   | count | trip_minutes
-------------------+------------------+------------------+-------+--------------
              1.96 |                8 |                1 |     1 |        27512
                 0 |               12 |                2 |     1 |        27500
 0.562727272727273 | 17.4545454545455 | 2.45454545454545 |    11 |         1440
 0.716564885496183 | 14.2786259541985 | 1.94656488549618 |   131 |         1439
  1.00945205479452 | 12.8787671232877 | 1.98630136986301 |   146 |         1438
(5 rows)

Time: 45.477 ms

Show the number of pickups in each neighborhood, broken down by hour of the day:

taxi=# SELECT
    pickup_ntaname,
    date_part('hour', pickup_datetime) as pickup_hour,
    SUM(1) AS pickups
FROM taxi.trips
WHERE pickup_ntaname != ''
GROUP BY pickup_ntaname, pickup_hour
ORDER BY pickup_ntaname, date_part('hour', pickup_datetime)
LIMIT 5;
 pickup_ntaname | pickup_hour | pickups
----------------+-------------+---------
 Airport        |           0 |    3509
 Airport        |           1 |    1184
 Airport        |           2 |     401
 Airport        |           3 |     152
 Airport        |           4 |     213
(5 rows)

Time: 36.895 ms

Set the display time zone to New York, then retrieve rides to LaGuardia or JFK airports:

taxi=# SET timezone = 'America/New_York';
SET
taxi=# SELECT
    pickup_datetime,
    dropoff_datetime,
    total_amount,
    pickup_nyct2010_gid,
    dropoff_nyct2010_gid,
    CASE
        WHEN dropoff_nyct2010_gid = 138 THEN 'LGA'
        WHEN dropoff_nyct2010_gid = 132 THEN 'JFK'
    END AS airport_code,
    EXTRACT(YEAR FROM pickup_datetime) AS year,
    EXTRACT(DAY FROM pickup_datetime) AS day,
    EXTRACT(HOUR FROM pickup_datetime) AS hour
FROM taxi.trips
WHERE dropoff_nyct2010_gid IN (132, 138)
ORDER BY pickup_datetime
LIMIT 5;
    pickup_datetime     |    dropoff_datetime    | total_amount | pickup_nyct2010_gid | dropoff_nyct2010_gid | airport_code | year | day | hour
------------------------+------------------------+--------------+---------------------+----------------------+--------------+------+-----+------
 2015-06-30 20:04:14-04 | 2015-06-30 20:15:29-04 |        13.30 |                 -34 |                  132 | JFK          | 2015 |  30 |   20
 2015-06-30 20:09:42-04 | 2015-06-30 20:12:55-04 |         6.80 |                  50 |                  138 | LGA          | 2015 |  30 |   20
 2015-06-30 20:23:04-04 | 2015-06-30 20:24:39-04 |         4.80 |                -125 |                  132 | JFK          | 2015 |  30 |   20
 2015-06-30 20:27:51-04 | 2015-06-30 20:39:02-04 |        14.72 |                -101 |                  138 | LGA          | 2015 |  30 |   20
 2015-06-30 20:32:03-04 | 2015-06-30 20:55:39-04 |        39.34 |                  48 |                  138 | LGA          | 2015 |  30 |   20
(5 rows)

Time: 17.450 ms

Create a dictionary

Create a dictionary associated with a table in your ClickHouse service. The table and dictionary are based on a CSV file that contains one row for each neighborhood in New York City. The neighborhoods are mapped to the names of the five New York City boroughs (Bronx, Brooklyn, Manhattan, Queens, and Staten Island), as well as Newark Airport (EWR). Here is an excerpt from the CSV file, shown as a table. The LocationID column in the file maps to the pickup_nyct2010_gid and dropoff_nyct2010_gid columns in your trips table:

LocationID	Borough	Zone	service_zone
1	EWR	Newark Airport	EWR
2	Queens	Jamaica Bay	Boro Zone
3	Bronx	Allerton/Pelham Gardens	Boro Zone
4	Manhattan	Alphabet City	Yellow Zone
5	Staten Island	Arden Heights	Boro Zone

Still in Postgres, use the clickhouse_raw_query function to create a ClickHouse [dictionary] named taxi_zone_dictionary and populate it from the CSV file in S3:

SELECT clickhouse_raw_query($$
    CREATE DICTIONARY taxi.taxi_zone_dictionary (
        LocationID Int64 DEFAULT 0,
        Borough String,
        zone String,
        service_zone String
    )
    PRIMARY KEY LocationID
    SOURCE(HTTP(URL 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv' FORMAT 'CSVWithNames'))
    LIFETIME(MIN 0 MAX 0)
    LAYOUT(HASHED_ARRAY())
$$, 'host=localhost dbname=taxi');

Setting LIFETIME to 0 disables automatic updates to avoid unnecessary traffic to our S3 bucket. In other cases, you might configure it differently. For details, see Refreshing dictionary data using LIFETIME.
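
For a source that does change, the LIFETIME clause takes a range in seconds, and ClickHouse picks a refresh time at random within that range. The fragment below is illustrative only: the 300 and 360 second bounds are assumed values, not a recommendation from this tutorial. It would replace the LIFETIME(MIN 0 MAX 0) line in the CREATE DICTIONARY statement above.

```sql
-- Hypothetical bounds: refresh at a random point between 300 and 360 seconds
LIFETIME(MIN 300 MAX 360)
```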

Now import the dictionary into Postgres:

    IMPORT FOREIGN SCHEMA taxi LIMIT TO (taxi_zone_dictionary)
    FROM SERVER taxi_srv INTO taxi;

Confirm that we can query it:

    taxi=# SELECT * FROM taxi.taxi_zone_dictionary limit 3;
     LocationID |  Borough  |                     Zone                      | service_zone
    ------------+-----------+-----------------------------------------------+--------------
             77 | Brooklyn  | East New York/Pennsylvania Avenue             | Boro Zone
            106 | Brooklyn  | Gowanus                                       | Boro Zone
            103 | Manhattan | Governor's Island/Ellis Island/Liberty Island | Yellow Zone
    (3 rows)

Excellent. Now use the dictGet function to retrieve a borough's name in a query. The following query counts the taxi rides, per pickup borough, that end at either LaGuardia or JFK airport:

    taxi=# SELECT
            count(1) AS total,
            COALESCE(NULLIF(dictGet(
                'taxi.taxi_zone_dictionary', 'Borough',
                toUInt64(pickup_nyct2010_gid)
            ), ''), 'Unknown') AS borough_name
        FROM taxi.trips
        WHERE dropoff_nyct2010_gid = 132 OR dropoff_nyct2010_gid = 138
        GROUP BY borough_name
        ORDER BY total DESC;
     total | borough_name
    -------+---------------
     23683 | Unknown
      7053 | Manhattan
      6828 | Brooklyn
      4458 | Queens
      2670 | Bronx
       554 | Staten Island
        53 | EWR
    (7 rows)

    Time: 66.245 ms

Note that quite a few of these trips have an unknown pickup neighborhood.

Perform a join

Write a few queries that join taxi_zone_dictionary with your trips table.

Start with a simple JOIN that works similarly to the previous airport query:

taxi=# SELECT
    count(1) AS total,
    "Borough"
FROM taxi.trips
JOIN taxi.taxi_zone_dictionary
  ON trips.pickup_nyct2010_gid = toUInt64(taxi.taxi_zone_dictionary."LocationID")
WHERE pickup_nyct2010_gid > 0
  AND dropoff_nyct2010_gid IN (132, 138)
GROUP BY "Borough"
ORDER BY total DESC;
 total |    Borough
-------+---------------
  7053 | Manhattan
  6828 | Brooklyn
  4458 | Queens
  2670 | Bronx
   554 | Staten Island
    53 | EWR
(6 rows)

Time: 48.449 ms

Note that the output of the JOIN query above matches that of the earlier dictGet query, except that the Unknown values are not included. Behind the scenes, ClickHouse actually calls the dictGet function for the taxi_zone_dictionary dictionary, but the JOIN syntax is more familiar to SQL developers. EXPLAIN shows that the whole join is pushed down to ClickHouse:

    taxi=# explain SELECT
            count(1) AS total,
            "Borough"
        FROM taxi.trips
        JOIN taxi.taxi_zone_dictionary
          ON trips.pickup_nyct2010_gid = toUInt64(taxi.taxi_zone_dictionary."LocationID")
        WHERE pickup_nyct2010_gid > 0
          AND dropoff_nyct2010_gid IN (132, 138)
        GROUP BY "Borough"
        ORDER BY total DESC;
                                  QUERY PLAN
    -----------------------------------------------------------------------
     Foreign Scan  (cost=1.00..5.10 rows=1000 width=40)
       Relations: Aggregate on ((trips) INNER JOIN (taxi_zone_dictionary))
    (2 rows)

    Time: 2.012 ms

This query returns the rows for the 1000 trips with the highest tip amount, then performs an inner join of each row with the dictionary:

taxi=# SELECT *
FROM taxi.trips
JOIN taxi.taxi_zone_dictionary
    ON trips.dropoff_nyct2010_gid = taxi.taxi_zone_dictionary."LocationID"
WHERE tip_amount > 0
ORDER BY tip_amount DESC
LIMIT 1000;

In general, avoid SELECT * in both PostgreSQL and ClickHouse. Retrieve only the columns you actually need.
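
As a sketch of that advice, the previous join could name its columns explicitly. The particular columns chosen here, and the dropoff_borough alias, are assumptions about what a report on large tips might need. All of the columns appear in the table and dictionary definitions above.

```sql
-- Same join, filter, ordering, and limit as before, with an explicit column list
SELECT
    trips.pickup_datetime,
    trips.dropoff_datetime,
    trips.tip_amount,
    trips.total_amount,
    taxi_zone_dictionary."Borough" AS dropoff_borough
FROM taxi.trips
JOIN taxi.taxi_zone_dictionary
    ON trips.dropoff_nyct2010_gid = taxi.taxi_zone_dictionary."LocationID"
WHERE tip_amount > 0
ORDER BY tip_amount DESC
LIMIT 1000;
```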

