Sharding functions

The sharding function inside PgDog transforms column values in SQL queries to specific shard numbers, which are in turn used for routing queries to one or more databases in the configuration.

How it works

The PgDog sharding function is based on PostgreSQL declarative partitions. This choice is intentional: it allows data to be sharded both inside PgDog and inside PostgreSQL, with the use of the same partition functions.

PgDog supports all three PostgreSQL partition functions and uses them for sharding data between nodes:

Function	Description
Hash	`PARTITION BY HASH` function, using a special hashing function implemented by both PgDog and Postgres.
List	`PARTITION BY LIST` function, used for splitting rows by an explicitly defined mapping of values to shard numbers.
Range	`PARTITION BY RANGE` function, similar to list sharding, except the mapping is defined with a bounded range.

The sharding functions are configurable in pgdog.toml on a per-table and/or per-column basis.

Multiple sharding functions

Since sharding is configured for each table or column name, this allows storing tables with different sharding functions in the same database.

While this works for some cross-shard queries, joins between tables using a different sharding function are not possible for direct-to-shard queries.

Hash

The hash function evenly distributes data between all shards. It ingests bytes and returns a single 64-bit unsigned integer which we then modulo by the number of shards in the configuration.

hash(user_id) mod shards

The hash function is used by default when configuring sharded tables in pgdog.toml:

[[sharded_tables]]
database = "prod"
column = "user_id"
data_type = "bigint"

All queries referencing the user_id column will be automatically sent to the matching shard(s) and data in those tables will be split between all data nodes evenly. See below for a list of supported data types. Each can be specified as follows:

IntegersTextUUID

Integer types

Different integer types are treated the same by the query router. If you're using BIGINT, INTEGER or SMALLINT as your sharding key, you can specify bigint in the configuration:

data_type = "bigint"

Text types

VARCHAR, VARCHAR(n), and TEXT use the same encoding and are treated the same by the query router. For either one, you can specify varchar in the configuration:

data_type = "varchar"

UUID types

Only UUIDv4 is currently supported for sharding in the query router.

data_type = "uuid"

List

The list sharding function distributes data between shards according to a value <-> shard mapping. It's useful for low-cardinality sharding keys, like country codes or region names, or when you want to control how your data is distributed between the data nodes. The most common use case for this is multitenant systems.

To enable this sharding function on a table or column, you need to specify additional value <-> shard mappings in pgdog.toml, for example:

[[sharded_mappings]]
database = "prod"
column = "user_id"
kind = "list"
values = [1, 2, 3]
shard = 0

This example will route all queries with user_id equal to one, two, or three to shard zero. Unlike hash sharding, a value <-> shard mapping is required for all values of the sharding key. If a value is used that doesn't have a mapping, the query will be sent to all shards.

Required configuration

The [[sharded_tables]] configuration entry is still required for list and range sharding. It specifies the data type of the column, which tells PgDog how to parse its value at runtime.

Range

Sharding by range is similar to list sharding, except instead of specifying the values explicitly, you can specify a bounding range. All values that are included in the range will be sent to the specified shard, for example:

[[sharded_mappings]]
database = "prod"
column = "user_id"
kind = "range"
start = 1
end = 100
shard = 0

This will route queries that refer to the user_id column, with values between 1 and 100 (exclusively), to shard zero. For open-ended ranges, you can specify either the start or the end value. The start value is included in the range, while the end value is excluded.

Required configuration

The [[sharded_tables]] configuration entry is still required for list and range sharding. It specifies the data type of the column, which tells PgDog how to parse its value at runtime.

Supported data types

PostgreSQL has dozens of data types. PgDog supports a subset of those for sharding purposes and they are listed below.

Work in progress

This list will continue to get longer as the development of PgDog continues. Check back soon or create an issue to request support for a data type you need.

Data type	Hash	List	Range
`BIGINT` / `INTEGER` / `SMALLINT`
`VARCHAR` / `TEXT`			No
`UUID`			No

Schema-based sharding

In addition to splitting the tables themselves, PgDog can shard Postgres databases by placing different schemas on different shards. This is useful for multi-tenant applications that have stricter separation between their users' data.

When enabled, PgDog will route queries that fully qualify tables based on their respective schema names.

Schema-to-shard mapping

Schemas are mapped to their shards in pgdog.toml, for example:

[[sharded_schemas]]
database = "prod"
name = "customer_a"
shard = 0

[[sharded_schemas]]
database = "prod"
name = "customer_b"
shard = 1

Queries that include the schema name in the tables they are referring to can be routed to the right shard. For example:

SELECT * FROM customer_a.users WHERE email = $1;

Since the users table is fully qualified as customer_a.users, the query will be routed to shard zero.

DDL

Unlike other sharding functions, schema-based sharding will also route DDL (e.g., CREATE TABLE, CREATE INDEX, etc.) queries to their respective shard, as long as the entity name is fully qualified.

For example:

CREATE TABLE customer_b.users (
    id BIGSERIAL PRIMARY KEY,
    email VARCHAR NOT NULL
);

CREATE UNIQUE INDEX ON customer_b.users USING btree(email);

Both of these DDL statements will be sent to shard one, because they explicitly refer to tables in schema customer_b, which is mapped to shard one.

Default routing

If a schema isn't mapped to a shard number, PgDog will fallback to using other configured sharding functions. If none are set, the query will be sent to all shards.

To avoid this behavior and send all other queries to a particular shard, you can add a default schema mapping:

[[sharded_schemas]]
database = "prod"
shard = 0

This will send all queries that don't specify a schema or use a schema without a mapping to shard zero.

Rewrite behavior

PgDog can transparently move writes between shards when the rewrite feature is enabled.

Sharding key updates

Sharding key updates handle the situation when a query is changing the value of the sharding key, which could require the row to be moved to a different shard.

For example:

UPDATE users SET id = $2 WHERE id = $1;

If configured, PgDog can rewrite this query into three statements, executed inside a cross-shard transaction:

SELECT * FROM users WHERE id = $1; /* query 1 */
DELETE FROM users WHERE id = $1; /* query 2 */
INSERT INTO users VALUES ($1, $2); /* row fetched in query #1 */

Limitations

The row returned by query 1, constructed from the WHERE clause of the original UPDATE statement, has to match exactly one row. If that's not the case, the operation will be aborted and PgDog will raise an error.

Multi-tuple inserts

Multi-tuple INSERT statements may write rows that belong on separate shards. To handle this situation, if configured, PgDog can rewrite the statement into multiple, single-tuple statements, and send them to their respective shards in a cross-shard transaction.

For example:

INSERT INTO users (id, email) VALUES ($1, $2), ($3, $4);

This statement will be rewritten into the following two queries:

INSERT INTO users (id, email) VALUES ($1, $2);
INSERT INTO users (id, email) VALUES ($3, $4);

Limitations

Since PgDog starts a cross-shard transaction to make this operation atomic, the original INSERT statement must not be sent inside an explicit transaction by the client. If that's the case, PgDog will abort the operation and return an error.

Configuration

Both features require the enabled flag to be set to true, for example:

[rewrite]
enabled = true
split_inserts = "rewrite"
shard_key = "rewrite"

If a safe rewrite plan cannot be determined, PgDog will abort the transaction and return an error. To guarantee cross-shard atomicity of the operation, consider enabling two-phase commit.

Sharding functions

How it works

Hash

List

Range

Supported data types

Schema-based sharding

Schema-to-shard mapping

DDL

Default routing

Rewrite behavior

Sharding key updates

Limitations

Multi-tuple inserts

Limitations

Configuration

Read more