Sharding functions
The sharding function inside PgDog transforms column values in SQL queries to specific shard numbers. They are in turn used for routing queries to one or more databases in the configuration.
How it works
PgDog sharding function is based on PostgreSQL declarative partitions. This choice is intentional: it allows data to be sharded both inside PgDog and inside PostgreSQL, with the use of the same partition functions.
PgDog supports all three PostgreSQL partition functions and uses them for sharding data between nodes:
| Function | Description |
|---|---|
| Hash | PARTITION BY HASH function, using a special hashing function implemented by both PgDog and Postgres. |
| List | PARTITION BY LIST function, used for splitting rows by an explicitly defined mapping of values to shard numbers. |
| Range | PARTITION BY RANGE function, similar to list sharding, except the mapping is defined with a bounded range. |
The sharding functions are configurable in pgdog.toml on a per-table and/or per-column basis.
Multiple sharding functions
Since sharding is configured for each table or column name, this allows storing tables with different sharding functions in the same database.
While this works for some cross-shard queries, joins between tables using a different sharding function are not possible for direct-to-shard queries.
Hash
The hash function evenly distributes data between all shards. It ingests bytes and returns a single 64-bit unsigned integer which we then modulo by the number of shards in the configuration.
hash(user_id) mod shards
The hash function is used by default when configuring sharded tables in pgdog.toml:
All queries referencing the user_id column will be automatically sent to the matching shard(s) and data in those tables will be split between all data nodes evenly. See below for a list of supported data types. Each can be specified as follows:
Integer types
Different integer types are treated the same by the query router. If you're using BIGINT, INTEGER or SMALLINT as your sharding key, you can specify bigint in the configuration:
Text types
VARCHAR, VARCHAR(n), and TEXT use the same encoding and are treated the same by the query router. For either one, you can specify varchar in the configuration:
List
The list sharding function distributes data between shards according to a value <-> shard mapping. It's useful for low-cardinality sharding keys, like country codes or region names, or when you want to control how your data is distributed between the data nodes. The most common use case for this is multitenant systems.
To enable this sharding function on a table or column, you need to specify additional value <-> shard mappings in pgdog.toml, for example:
[[sharded_mappings]]
database = "prod"
column = "user_id"
kind = "list"
values = [1, 2, 3]
shard = 0
This example will route all queries with user_id equals to one, two or three to shard zero. Unlike hash sharding, a value <-> shard mapping is required for all values of the sharding key. If a value is used that doesn't have a mapping, the query will be sent to all shards.
Required configuration
The [[sharded_tables]] configuration entry is still required for list and range sharding. It specifies the data type of the column, which tells PgDog how to parse its value at runtime.
Range
Sharding by range function is similar to list sharding function, except instead of specifying the values explicitly, you can specify a bounding range. All values which are included in the range will be sent to the specified shard, for example:
[[sharded_mappings]]
database = "prod"
column = "user_id"
kind = "range"
start = 1
end = 100
shard = 0
This will route queries that refer to the user_id column, with values between 1 and 100 (exclusively), to shard zero. For open-ended ranges, you can specify either the start or the end value. The start value is included in the range, while the end value is excluded.
Required configuration
The [[sharded_tables]] configuration entry is still required for list and range sharding. It specifies the data type of the column, which tells PgDog how to parse its value at runtime.
Supported data types
PostgreSQL has dozens of data types. PgDog supports a subset of those for sharding purposes and they are listed below.
Work in progress
This list will continue to get longer as the development of PgDog continues. Check back soon or create an issue to request support for a data type you need.
| Data type | Hash | List | Range |
|---|---|---|---|
BIGINT / INTEGER / SMALLINT |
|||
VARCHAR / TEXT |
No | ||
UUID |
No |
Schema-based sharding
In addition to splitting the tables themselves, PgDog can shard Postgres databases by placing different schemas on different shards. This is useful for multi-tenant applications that have stricter separation between their users' data.
When enabled, PgDog will route queries that fully qualify tables based on their respective schema names.
Schema-to-shard mapping
Schemas are mapped to their shards in pgdog.toml, for example:
[[sharded_schemas]]
database = "prod"
name = "customer_a"
shard = 0
[[sharded_schemas]]
database = "prod"
name = "customer_b"
shard = 1
Queries that include the schema name in the tables they are referring to can be routed to the right shard. For example:
Since the users table is fully qualified as customer_a.users, the query will be routed to shard zero.
DDL
Unlike other sharding functions, schema-based sharding will also route DDL (e.g., CREATE TABLE, CREATE INDEX, etc.) queries to their respective shard, as long as the entity name is fully qualified.
For example:
CREATE TABLE customer_b.users (
id BIGSERIAL PRIMARY KEY,
email VARCHAR NOT NULL
);
CREATE UNIQUE INDEX ON customer_b.users USING btree(email);
Both of these DDL statements will be sent to shard one, because they explicitly refer to tables in schema customer_b, which is mapped to shard one.
Default routing
If a schema isn't mapped to a shard number, PgDog will fallback to using other configured sharding functions. If none are set, the query will be sent to all shards.
To avoid this behavior and send all other queries to a particular shard, you can add a default schema mapping:
This will send all queries that don't specify a schema or use a schema without a mapping to shard zero.