PlinkingDuck¶
A DuckDB extension for reading PLINK 2 genomics file formats and running common genetic analyses directly in SQL.
PlinkingDuck brings PLINK genotype, variant, and sample data into DuckDB, letting you query genomics datasets with standard SQL instead of format-specific command-line tools. Read files, filter variants, compute allele frequencies, test Hardy-Weinberg equilibrium, measure missingness, calculate linkage disequilibrium, run polygenic scoring, and perform genome-wide association testing — all from a SQL prompt.
Built on the shoulders of DuckHTS
DuckHTS pioneered the idea of querying genomics file formats directly in DuckDB using htslib. It supports VCF, BCF, BAM, CRAM, FASTA, FASTQ, GTF, GFF, and tabix-indexed files — the sequencing side of the house. PlinkingDuck picks up where DuckHTS leaves off, covering the PLINK genotype file formats used in population genetics and GWAS. If you work with both sequencing and genotype data, you'll want both extensions.
Quick Example¶
-- Load the extension
LOAD plinking_duck;
-- Read variant metadata
SELECT CHROM, POS, ID, REF, ALT
FROM read_pvar('cohort.pvar')
WHERE CHROM = '22';
-- Read genotypes with sample info
SELECT chrom, pos, iid, genotype
FROM read_pfile('cohort', orient := 'genotype')
WHERE genotype = 2;
-- Compute allele frequencies
SELECT ID, ALT_FREQ
FROM plink_freq('cohort.pgen')
WHERE ALT_FREQ > 0.01;
-- Hardy-Weinberg equilibrium test
SELECT ID, P_HWE
FROM plink_hardy('cohort.pgen')
WHERE P_HWE < 1e-6;
Functions¶
PlinkingDuck provides 5 file readers and 6 analysis functions:
File Readers¶
| Function | Description |
|---|---|
read_pvar(path) |
Read .pvar or .bim variant metadata |
read_psam(path) |
Read .psam or .fam sample metadata |
read_pgen(path) |
Read .pgen binary genotype files |
read_pfile(prefix) |
Read a complete PLINK fileset with orient modes |
read_plink_vcf(path) |
Fast biallelic genotype extraction from VCF |
Analysis Functions¶
| Function | Description |
|---|---|
plink_freq(path) |
Per-variant allele frequencies |
plink_hardy(path) |
Hardy-Weinberg equilibrium exact test |
plink_missing(path) |
Per-variant or per-sample missingness |
plink_ld(path) |
Pairwise linkage disequilibrium |
plink_score(path) |
Polygenic risk scoring |
plink_glm(prefix) |
Per-variant GWAS regression (linear, logistic, Firth) |
Features¶
- Native SQL access to PLINK 2 binary genotype data
- Projection pushdown -- skip genotype decoding when only metadata is queried
- Parallel scan -- multi-threaded variant processing for large files
- Sample subsetting -- filter to specific samples by IID or index
- Region filtering -- restrict to genomic regions (
chr:start-end) - Filter pushdown -- pre-filter variants by allele frequency, count, or genotype value
- GWAS regression -- per-variant association testing with covariates and Firth correction
- Legacy format support -- read PLINK 1
.bimand.famfiles - VCF import -- fast biallelic genotype extraction from VCF files via plink-ng
- VFS integration -- works with DuckDB's filesystem abstraction