Tutorial: Genomic Analysis with SQL¶

This tutorial walks through a complete genomic analysis workflow using PlinkingDuck — from loading data through quality control to downstream analysis. By the end, you'll know how to explore genomic datasets, assess data quality, measure variant correlation, and compute polygenic risk scores, all in SQL.

No prior genomics knowledge is required. We'll explain concepts as they arise.

Prerequisites

Before starting, make sure PlinkingDuck is installed and loaded. See the Getting Started guide for setup instructions.

1. Meet the Dataset¶

Genomic data in PLINK 2 format is stored as a fileset — a group of files sharing a common prefix:

.pvar — variant metadata (which genetic markers were measured)
.psam — sample metadata (who was measured)
.pgen — genotype data (the actual measurements, in compact binary)

Our tutorial dataset has 4 genetic variants measured across 4 individuals. Let's start by examining what variants and samples are present.

Variant Metadata¶

Each variant (a position in the genome where people differ) has a chromosome, position, identifier, and the two alleles observed:

SELECT * FROM read_pvar('test/data/pgen_example.pvar');

CHROM	POS	ID	REF	ALT
1	10000	rs1	A	G
1	20000	rs2	C	T
1	30000	rs3	G	A
2	15000	rs4	T	C

Three variants sit on chromosome 1 and one on chromosome 2. REF is the reference allele (the common form) and ALT is the alternate allele.

What are rsIDs?

Variant identifiers like rs1 and rs2 are shorthand here. In real data, these are RefSNP IDs (e.g., rs7903146) that uniquely identify known variants across studies.

Since this is standard SQL, you can summarize the data however you like:

SELECT CHROM, COUNT(*) AS n_variants
FROM read_pvar('test/data/pgen_example.pvar')
GROUP BY CHROM
ORDER BY CHROM;

CHROM	n_variants
1	3
2	1

Sample Metadata¶

Now let's look at who was genotyped. We'll use a companion .psam file that includes family IDs and sex:

SELECT * FROM read_psam('test/data/pfile_example.psam');

FID	IID	SEX
FAM001	SAMPLE1	1
FAM001	SAMPLE2	2
FAM002	SAMPLE3	NULL
FAM002	SAMPLE4	1

IID is the individual identifier. FID groups individuals into families. SEX uses PLINK conventions: 1 = male, 2 = female, NULL = unknown.

Note

Different .psam files can have different columns. The minimal format has just IID. See read_psam for format details.

2. Reading Genotypes¶

Now for the core data — the actual genotype calls. A genotype is an individual's genetic state at a variant: how many copies of the alternate allele they carry.

List Mode (One Row per Variant)¶

read_pgen returns one row per variant, with all genotypes packed into a list:

SELECT * FROM read_pgen('test/data/pgen_example.pgen');

CHROM	POS	ID	REF	ALT	genotypes
1	10000	rs1	A	G	[0, 1, 2, NULL]
1	20000	rs2	C	T	[1, 1, 0, 2]
1	30000	rs3	G	A	[2, NULL, 1, 0]
2	15000	rs4	T	C	[0, 0, 1, 2]

Each number in the genotype list is an alternate allele count: 0 = homozygous reference (no copies of ALT), 1 = heterozygous (one copy), 2 = homozygous alternate (two copies), NULL = missing. See Genotype Encoding for the full reference.

For variant rs1, SAMPLE1 has genotype 0 (AA), SAMPLE2 has 1 (AG), SAMPLE3 has 2 (GG), and SAMPLE4 is missing.

Genotype Orient Mode (One Row per Observation)¶

For analysis with SQL joins and aggregations, genotype orient mode is often more convenient. It produces one row per variant-sample pair:

SELECT CHROM, POS, ID, IID, genotype
FROM read_pfile('test/data/pfile_example', orient := 'genotype')
ORDER BY ID, IID;

CHROM	POS	ID	IID	genotype
1	10000	rs1	SAMPLE1	0
1	10000	rs1	SAMPLE2	1
1	10000	rs1	SAMPLE3	2
1	10000	rs1	SAMPLE4	NULL
1	20000	rs2	SAMPLE1	1
1	20000	rs2	SAMPLE2	1
1	20000	rs2	SAMPLE3	0
1	20000	rs2	SAMPLE4	2
1	30000	rs3	SAMPLE1	2
1	30000	rs3	SAMPLE2	NULL
1	30000	rs3	SAMPLE3	1
1	30000	rs3	SAMPLE4	0
2	15000	rs4	SAMPLE1	0
2	15000	rs4	SAMPLE2	0
2	15000	rs4	SAMPLE3	1
2	15000	rs4	SAMPLE4	2

read_pfile vs read_pgen

read_pfile takes a prefix (e.g., 'test/data/pfile_example') and auto-discovers the .pgen, .pvar, and .psam files. read_pgen takes the .pgen path directly. See read_pfile and read_pgen for details.

With genotype orient data, standard SQL works naturally. For example, count the total alternate alleles carried by each individual:

SELECT IID, SUM(genotype) AS total_alt_alleles
FROM read_pfile('test/data/pfile_example', orient := 'genotype')
WHERE genotype IS NOT NULL
GROUP BY IID
ORDER BY IID;

IID	total_alt_alleles
SAMPLE1	3
SAMPLE2	2
SAMPLE3	4
SAMPLE4	4

Or find all heterozygous calls (genotype = 1):

SELECT ID, IID
FROM read_pfile('test/data/pfile_example', orient := 'genotype')
WHERE genotype = 1
ORDER BY ID, IID;

ID	IID
rs1	SAMPLE2
rs2	SAMPLE1
rs2	SAMPLE2
rs3	SAMPLE3
rs4	SAMPLE3

3. Quality Control¶

Before running any analysis, you need to check data quality. PlinkingDuck provides dedicated functions for the three pillars of genotype QC: missingness, allele frequency, and Hardy-Weinberg equilibrium. These use optimized paths in pgenlib that are faster than computing from raw genotypes.

Missingness¶

Missingness is the fraction of genotype calls that failed. High missingness for a variant suggests a technical problem with that assay; high missingness for a sample suggests a problem with that individual's DNA.

Per-Variant Missingness¶

SELECT ID, MISSING_CT, OBS_CT, F_MISS
FROM plink_missing('test/data/pgen_example.pgen');

ID	MISSING_CT	OBS_CT	F_MISS
rs1	1	3	0.25
rs2	0	4	0.0
rs3	1	3	0.25
rs4	0	4	0.0

Variants rs1 and rs3 each have one missing call (25% missingness). In a real study, you would typically remove variants exceeding a threshold (e.g., 5%):

SELECT ID, F_MISS
FROM plink_missing('test/data/pgen_example.pgen')
WHERE F_MISS > 0.05;

Per-Sample Missingness¶

Switch to sample mode to check individual-level data quality:

SELECT IID, MISSING_CT, OBS_CT, F_MISS
FROM plink_missing('test/data/pgen_example.pgen',
    mode := 'sample', psam := 'test/data/pfile_example.psam');

IID	MISSING_CT	OBS_CT	F_MISS
SAMPLE1	0	4	0.0
SAMPLE2	1	3	0.25
SAMPLE3	0	4	0.0
SAMPLE4	1	3	0.25

SAMPLE2 and SAMPLE4 each have one missing genotype.

Allele Frequency¶

Allele frequency is how common each allele is in the population. The alternate allele frequency (ALT_FREQ) ranges from 0 to 1. Variants with very low frequency (rare variants) may need special handling.

SELECT ID, ALT_FREQ, OBS_CT
FROM plink_freq('test/data/pgen_example.pgen');

ID	ALT_FREQ	OBS_CT
rs1	0.5	6
rs2	0.5	8
rs3	0.5	6
rs4	0.375	8

OBS_CT is allele count, not sample count

OBS_CT is the number of observed alleles (2 per non-missing sample, since humans are diploid). With 3 non-missing samples, OBS_CT = 6.

Use counts := true to see the underlying genotype counts:

SELECT ID, HOM_REF_CT, HET_CT, HOM_ALT_CT, MISSING_CT
FROM plink_freq('test/data/pgen_example.pgen', counts := true);

ID	HOM_REF_CT	HET_CT	HOM_ALT_CT	MISSING_CT
rs1	1	1	1	1
rs2	1	2	1	0
rs3	1	1	1	1
rs4	2	1	1	0

Hardy-Weinberg Equilibrium¶

Hardy-Weinberg equilibrium (HWE) predicts the expected genotype frequencies given the allele frequencies. Departures from HWE can indicate genotyping errors, population structure, or selection. The HWE exact test produces a p-value — small values suggest the variant deviates from equilibrium.

SELECT ID, HOM_REF_CT, HET_CT, HOM_ALT_CT,
       ROUND(P_HWE, 4) AS P_HWE
FROM plink_hardy('test/data/pgen_example.pgen');

ID	HOM_REF_CT	HET_CT	HOM_ALT_CT	P_HWE
rs1	1	1	1	1.0
rs2	1	2	1	1.0
rs3	1	1	1	1.0
rs4	2	1	1	0.4286

All variants are in equilibrium (p-values well above any reasonable threshold). In practice, variants with P_HWE < 1e-6 are often excluded.

Putting It Together: A QC Summary¶

The power of SQL shines when you combine multiple analyses. This CTE joins frequency, missingness, and HWE results into a single QC summary:

WITH freq AS (
    SELECT ID, ALT_FREQ
    FROM plink_freq('test/data/pgen_example.pgen')
),
miss AS (
    SELECT ID, F_MISS
    FROM plink_missing('test/data/pgen_example.pgen')
),
hardy AS (
    SELECT ID, P_HWE
    FROM plink_hardy('test/data/pgen_example.pgen')
)
SELECT f.ID, f.ALT_FREQ, m.F_MISS, h.P_HWE
FROM freq f
JOIN miss m ON f.ID = m.ID
JOIN hardy h ON f.ID = h.ID
ORDER BY f.ID;

ID	ALT_FREQ	F_MISS	P_HWE
rs1	0.5	0.25	1.0
rs2	0.5	0.0	1.0
rs3	0.5	0.25	1.0
rs4	0.375	0.0	0.4285714285714286

From here you could filter to only variants passing QC:

-- Variants passing all QC thresholds
-- (not run here since our tiny dataset passes everything)
WHERE f.ALT_FREQ > 0.01          -- MAF > 1%
  AND m.F_MISS < 0.05            -- missingness < 5%
  AND h.P_HWE > 1e-6             -- HWE p-value > 1e-6

Note

For a deeper dive into QC workflows, see the Quality Control Guide.

4. Linkage Disequilibrium¶

Linkage disequilibrium (LD) measures the statistical correlation between genotypes at different variants. Variants in high LD tend to be inherited together. LD is fundamental to understanding genome structure and interpreting association results.

Pairwise LD¶

To measure LD between two specific variants, use pairwise mode:

SELECT ID_A, ID_B, ROUND(R2, 4) AS R2, D_PRIME, OBS_CT
FROM plink_ld('test/data/pgen_example.pgen',
    variant1 := 'rs1', variant2 := 'rs2');

ID_A	ID_B	R2	D_PRIME	OBS_CT
rs1	rs2	0.75	0.5	3

r² is the squared correlation between genotypes (0 = independent, 1 = perfectly correlated). D' is a normalized measure of allelic association. OBS_CT is the number of samples with non-missing genotypes at both variants.

Windowed LD¶

To scan all variant pairs within a genomic window, omit the variant parameters. Set r2_threshold to control which pairs are reported:

SELECT ID_A, ID_B, ROUND(R2, 4) AS R2,
       ROUND(D_PRIME, 4) AS D_PRIME, OBS_CT
FROM plink_ld('test/data/pgen_example.pgen',
    r2_threshold := 0.0);

ID_A	ID_B	R2	D_PRIME	OBS_CT
rs1	rs2	0.75	0.5	3
rs1	rs3	1.0	1.0	2
rs2	rs3	0.25	0.3333	3

By default, only same-chromosome pairs within 1000 kb are tested. Variants rs1 and rs3 show perfect LD (r² = 1.0), though with only 2 shared observations.

Cross-Chromosome LD¶

To include pairs across chromosomes (useful for checking for population structure artifacts), set inter_chr := true:

SELECT ID_A, ID_B, ROUND(R2, 4) AS R2, OBS_CT
FROM plink_ld('test/data/pgen_example.pgen',
    r2_threshold := 0.0, inter_chr := true);

ID_A	ID_B	R2	OBS_CT
rs1	rs2	0.75	3
rs1	rs3	1.0	2
rs1	rs4	0.75	3
rs2	rs3	0.25	3
rs2	rs4	0.1818	4
rs3	rs4	1.0	3

Warning

High LD between variants on different chromosomes is unexpected in real data and usually signals population stratification or genotyping artifacts.

5. Polygenic Risk Scoring¶

A polygenic risk score (PRS) summarizes an individual's genetic predisposition by weighting each variant's genotype by its effect size from a prior study. PlinkingDuck's plink_score computes this efficiently, handling missing data with mean imputation by default.

Positional Weights¶

The simplest approach provides one weight per variant, in order:

SELECT IID, SCORE_SUM, SCORE_AVG
FROM plink_score('test/data/pgen_example.pgen',
    psam := 'test/data/pfile_example.psam',
    weights := [0.5, -0.3, 1.2, 0.8]);

IID	SCORE_SUM	SCORE_AVG
SAMPLE1	2.1	0.2625
SAMPLE2	1.4	0.175
SAMPLE3	3.0	0.375
SAMPLE4	1.5	0.1875

Each score is sum(weight × dosage) across all variants. SCORE_AVG normalizes by the number of observed alleles.

ID-Keyed Weights¶

For real applications, you typically have a scoring file that maps variant IDs and alleles to weights. ID-keyed mode handles this — variants not found in the data are silently skipped:

SELECT IID, ROUND(SCORE_SUM, 2) AS SCORE_SUM,
       ROUND(SCORE_AVG, 4) AS SCORE_AVG
FROM plink_score('test/data/pgen_example.pgen',
    psam := 'test/data/pfile_example.psam',
    weights := [
        {'id': 'rs1', 'allele': 'G', 'weight': 0.5},
        {'id': 'rs2', 'allele': 'T', 'weight': -0.3},
        {'id': 'rs4', 'allele': 'C', 'weight': 0.8}
    ]);

IID	SCORE_SUM	SCORE_AVG
SAMPLE1	-0.3	-0.05
SAMPLE2	0.2	0.0333
SAMPLE3	1.8	0.3
SAMPLE4	1.5	0.25

The allele field specifies which allele the weight applies to. If it matches ALT, the dosage is used directly; if it matches REF, the dosage is flipped.

Missing data handling

By default, missing genotypes are imputed with the variant's mean dosage. Use no_mean_imputation := true to skip missing variants instead (each sample's ALLELE_CT will reflect only non-missing variants). See plink_score for details.

6. Genome-Wide Association Testing¶

A genome-wide association study (GWAS) tests every variant for association with a trait. plink_glm runs per-variant regression using plink2's proven regression engine — the same math that powers plink2 --glm.

Linear Regression (Continuous Trait)¶

For a continuous trait like height or BMI, plink_glm fits a linear regression at each variant:

SELECT ID, ROUND(BETA, 4) AS BETA, ROUND(P, 6) AS P
FROM plink_glm('test/data/pgen_example',
    phenotype := [1.5, 2.3, 3.7, 0.8])
ORDER BY P;

ID	BETA	P
rs2	-1.45	0.04880
rs1	1.1	0.099425
rs4	-0.3364	0.741259
rs3	0.35	0.851413

Each row reports the effect size (BETA — how much the trait changes per copy of the ALT allele) and p-value (how unlikely this association is under the null hypothesis of no effect).

Logistic Regression (Case/Control)¶

For binary traits (disease yes/no), plink_glm auto-detects the 0/1 phenotype and switches to logistic regression, reporting odds ratios:

SELECT ID, ROUND(BETA, 4), ROUND("OR", 4), ROUND(P, 4), FIRTH_YN
FROM plink_glm('test/data/large_example',
    phenotype := [0, 1, 0, 1, 1, 0, 1, 0])
WHERE ID = 'var1';

ID	BETA	OR	P	FIRTH_YN
var1	0.0	1.0	1.0	N

The OR (odds ratio) is exp(BETA). An OR of 1.0 means no association. When logistic regression fails to converge (common with rare variants), plink2's Firth correction kicks in automatically — the FIRTH_YN column tells you which method was used.

Adding Covariates¶

Real GWAS always adjusts for confounders like age, sex, and ancestry principal components:

SELECT ID, ROUND(BETA, 4) AS BETA, ROUND(P, 6) AS P
FROM plink_glm('test/data/large_example',
    phenotype := [1.2, 3.4, 2.1, 5.6, 4.3, 0.9, 3.8, 2.7],
    covariates := {
        'age': [25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0]
    })
WHERE ID = 'var1';

ID	BETA	P
var1	-1.1205	0.369083

Phenotypes from External Files¶

For real datasets, phenotypes and covariates live in external files. Use DuckDB's SET VARIABLE to load them, then pass with getvariable().

The phenotype and covariate lists must be in the same order as samples in the .pgen file. The safest approach is to join against the .psam file:

-- Join phenotype file against psam to match pgen sample order
SET VARIABLE pheno = (
    SELECT list(p.pheno_quant ORDER BY s.rowid)
    FROM read_psam('data/cohort.psam') s
    JOIN 'phenotypes.parquet' p ON s.IID = p.IID
);

SET VARIABLE covars = (
    SELECT {
        'age': list(p.age ORDER BY s.rowid),
        'bmi': list(p.bmi ORDER BY s.rowid),
        'pc1': list(p.pc1 ORDER BY s.rowid)
    }
    FROM read_psam('data/cohort.psam') s
    JOIN 'covariates.parquet' p ON s.IID = p.IID
);

-- Run GWAS: top hits
SELECT ID, CHROM, POS, ROUND(BETA, 4) AS BETA, P
FROM plink_glm('data/cohort',
    phenotype := getvariable('pheno'),
    covariates := getvariable('covars'))
WHERE P < 5e-8
ORDER BY P;

The ORDER BY s.rowid ensures values are emitted in .psam row order, which matches the .pgen sample order.

Pre-sorted shortcut

If your parquet file is already sorted to match pgen sample order (e.g., exported from the same pipeline), you can skip the join: SELECT list(pheno_quant) FROM 'phenotypes.parquet'.

See plink_glm for the full parameter and output reference.

7. Working at Scale¶

The examples above used a tiny 4-variant dataset. Real genomic data has millions of variants and thousands of samples. PlinkingDuck is designed for this — parallel scanning, projection pushdown, and region filtering all help keep queries fast.

Let's demonstrate with a larger dataset: 3,000 variants across 8 samples on 3 chromosomes.

SELECT
    (SELECT COUNT(*) FROM read_pvar('test/data/large_example.pvar')) AS n_variants,
    (SELECT COUNT(*) FROM read_psam('test/data/large_example.psam')) AS n_samples;

n_variants	n_samples
3000	8

SELECT CHROM, COUNT(*) AS n_variants
FROM read_pvar('test/data/large_example.pvar')
GROUP BY CHROM
ORDER BY CHROM;

CHROM	n_variants
1	1000
2	1000
3	1000

Region Filtering¶

When you only need variants in a specific genomic region, the region parameter restricts processing before any computation begins. This is much more efficient than filtering with WHERE after the fact:

-- Efficient: only reads variants in the region
SELECT COUNT(*) AS n_variants
FROM plink_freq('test/data/large_example.pgen',
    region := '1:1-50000');

n_variants
500

Region filtering works with all analysis functions (plink_freq, plink_hardy, plink_missing, plink_ld, plink_score) and read_pfile.

-- Combine region filtering with aggregation
SELECT COUNT(*) AS n_variants, ROUND(AVG(ALT_FREQ), 4) AS mean_freq
FROM plink_freq('test/data/large_example.pgen',
    region := '1:1-50000');

n_variants	mean_freq
500	0.5

Performance tips

Use region instead of WHERE CHROM = ... — it avoids reading genotype data outside the region
Select only the columns you need — projection pushdown skips unused computation
Use dedicated analysis functions (plink_freq, plink_hardy) instead of computing from raw genotypes — they use optimized pgenlib paths

See the Performance Guide for the full set of optimization strategies.

Next Steps¶

You've now seen the complete PlinkingDuck workflow: loading data, exploring genotypes, running QC, analyzing LD, computing PRS scores, running GWAS, and working efficiently at scale. Here are some directions to explore next:

Function Reference — full parameter and output documentation for every function
Quality Control Guide — detailed QC workflows with recommended thresholds
Performance Guide — optimization strategies for large datasets
Common Parameters — sample subsetting, region filtering, and companion file options
Genotype Encoding — how genotype values map to allele counts