Skip to content

COPY

COPY is a special PostgreSQL command that ingests a file directly into a database table. This allows ingesting data faster than by using individual INSERT queries. PgDog supports parsing this command, sharding the file automatically, and splitting the data between shards, invisible to the application.

Cross-shard queries

How it works

PgDog supports data sent via COPY formatted using any one of the 3 possible formats:

  • CSV (comma-separated values)
  • Text (PostgreSQL version of CSV, with \t as the delimiter)
  • Binary (not frequently used)

Expected syntax

COPY commands sent through PgDog should specify table columns explicitly. This allows it to parse the data stream correctly, knowing which column is the sharding key.

Take the following example:

CREATE TABLE "users" (
  id BIGINT PRIMARY KEY,
  email VARCHAR NOT NULL UNIQUE
)

To ingest this data into the table, the following query should be used:

COPY "users" ("id", "email") FROM STDIN CSV HEADER;

This query specifies the column order, the file format, and that the file contains a header which should be ignored. If you're using psql, replace the COPY with the special \copy command.

Note

While it's technically feasible to use the CSV header to determine the column order, it's possible to supply files without headers, and that would require PgDog to fetch the schema definition from the database, making this a more complex, multi-step process.

Performance

By adding N nodes to a sharded cluster, the performance of COPY increases N times. Data sent through COPY is ingested into shards in parallel. This makes the performance of COPY as fast as data nodes can write data and the network can send/receive messages. The cost of parsing and sharding CSV data in PgDog is negligibly small.