Skip to content

PlinkingDuck

A DuckDB extension for reading PLINK 2 genomics file formats and running common genetic analyses directly in SQL.

PlinkingDuck brings PLINK genotype, variant, and sample data into DuckDB, letting you query genomics datasets with standard SQL instead of format-specific command-line tools. Read files, filter variants, compute allele frequencies, test Hardy-Weinberg equilibrium, measure missingness, calculate linkage disequilibrium, run polygenic scoring, and perform genome-wide association testing — all from a SQL prompt.

Built on the shoulders of DuckHTS

DuckHTS pioneered the idea of querying genomics file formats directly in DuckDB using htslib. It supports VCF, BCF, BAM, CRAM, FASTA, FASTQ, GTF, GFF, and tabix-indexed files — the sequencing side of the house. PlinkingDuck picks up where DuckHTS leaves off, covering the PLINK genotype file formats used in population genetics and GWAS. If you work with both sequencing and genotype data, you'll want both extensions.

Quick Example

-- Load the extension
LOAD plinking_duck;

-- Read variant metadata
SELECT CHROM, POS, ID, REF, ALT
FROM read_pvar('cohort.pvar')
WHERE CHROM = '22';

-- Read genotypes with sample info
SELECT chrom, pos, iid, genotype
FROM read_pfile('cohort', orient := 'genotype')
WHERE genotype = 2;

-- Compute allele frequencies
SELECT ID, ALT_FREQ
FROM plink_freq('cohort.pgen')
WHERE ALT_FREQ > 0.01;

-- Hardy-Weinberg equilibrium test
SELECT ID, P_HWE
FROM plink_hardy('cohort.pgen')
WHERE P_HWE < 1e-6;

Functions

PlinkingDuck provides 5 file readers and 6 analysis functions:

File Readers

Function Description
read_pvar(path) Read .pvar or .bim variant metadata
read_psam(path) Read .psam or .fam sample metadata
read_pgen(path) Read .pgen binary genotype files
read_pfile(prefix) Read a complete PLINK fileset with orient modes
read_plink_vcf(path) Fast biallelic genotype extraction from VCF

Analysis Functions

Function Description
plink_freq(path) Per-variant allele frequencies
plink_hardy(path) Hardy-Weinberg equilibrium exact test
plink_missing(path) Per-variant or per-sample missingness
plink_ld(path) Pairwise linkage disequilibrium
plink_score(path) Polygenic risk scoring
plink_glm(prefix) Per-variant GWAS regression (linear, logistic, Firth)

Features

  • Native SQL access to PLINK 2 binary genotype data
  • Projection pushdown -- skip genotype decoding when only metadata is queried
  • Parallel scan -- multi-threaded variant processing for large files
  • Sample subsetting -- filter to specific samples by IID or index
  • Region filtering -- restrict to genomic regions (chr:start-end)
  • Filter pushdown -- pre-filter variants by allele frequency, count, or genotype value
  • GWAS regression -- per-variant association testing with covariates and Firth correction
  • Legacy format support -- read PLINK 1 .bim and .fam files
  • VCF import -- fast biallelic genotype extraction from VCF files via plink-ng
  • VFS integration -- works with DuckDB's filesystem abstraction