Skip to content

Tutorial: Genomic Analysis with SQL

This tutorial walks through a complete genomic analysis workflow using PlinkingDuck — from loading data through quality control to downstream analysis. By the end, you'll know how to explore genomic datasets, assess data quality, measure variant correlation, and compute polygenic risk scores, all in SQL.

No prior genomics knowledge is required. We'll explain concepts as they arise.

Prerequisites

Before starting, make sure PlinkingDuck is installed and loaded. See the Getting Started guide for setup instructions.


1. Meet the Dataset

Genomic data in PLINK 2 format is stored as a fileset — a group of files sharing a common prefix:

  • .pvar — variant metadata (which genetic markers were measured)
  • .psam — sample metadata (who was measured)
  • .pgen — genotype data (the actual measurements, in compact binary)

Our tutorial dataset has 4 genetic variants measured across 4 individuals. Let's start by examining what variants and samples are present.

Variant Metadata

Each variant (a position in the genome where people differ) has a chromosome, position, identifier, and the two alleles observed:

SELECT * FROM read_pvar('test/data/pgen_example.pvar');
CHROM POS ID REF ALT
1 10000 rs1 A G
1 20000 rs2 C T
1 30000 rs3 G A
2 15000 rs4 T C

Three variants sit on chromosome 1 and one on chromosome 2. REF is the reference allele (the common form) and ALT is the alternate allele.

What are rsIDs?

Variant identifiers like rs1 and rs2 are shorthand here. In real data, these are RefSNP IDs (e.g., rs7903146) that uniquely identify known variants across studies.

Since this is standard SQL, you can summarize the data however you like:

SELECT CHROM, COUNT(*) AS n_variants
FROM read_pvar('test/data/pgen_example.pvar')
GROUP BY CHROM
ORDER BY CHROM;
CHROM n_variants
1 3
2 1

Sample Metadata

Now let's look at who was genotyped. We'll use a companion .psam file that includes family IDs and sex:

SELECT * FROM read_psam('test/data/pfile_example.psam');
FID IID SEX
FAM001 SAMPLE1 1
FAM001 SAMPLE2 2
FAM002 SAMPLE3 NULL
FAM002 SAMPLE4 1

IID is the individual identifier. FID groups individuals into families. SEX uses PLINK conventions: 1 = male, 2 = female, NULL = unknown.

Note

Different .psam files can have different columns. The minimal format has just IID. See read_psam for format details.


2. Reading Genotypes

Now for the core data — the actual genotype calls. A genotype is an individual's genetic state at a variant: how many copies of the alternate allele they carry.

List Mode (One Row per Variant)

read_pgen returns one row per variant, with all genotypes packed into a list:

SELECT * FROM read_pgen('test/data/pgen_example.pgen');
CHROM POS ID REF ALT genotypes
1 10000 rs1 A G [0, 1, 2, NULL]
1 20000 rs2 C T [1, 1, 0, 2]
1 30000 rs3 G A [2, NULL, 1, 0]
2 15000 rs4 T C [0, 0, 1, 2]

Each number in the genotype list is an alternate allele count: 0 = homozygous reference (no copies of ALT), 1 = heterozygous (one copy), 2 = homozygous alternate (two copies), NULL = missing. See Genotype Encoding for the full reference.

For variant rs1, SAMPLE1 has genotype 0 (AA), SAMPLE2 has 1 (AG), SAMPLE3 has 2 (GG), and SAMPLE4 is missing.

Genotype Orient Mode (One Row per Observation)

For analysis with SQL joins and aggregations, genotype orient mode is often more convenient. It produces one row per variant-sample pair:

SELECT CHROM, POS, ID, IID, genotype
FROM read_pfile('test/data/pfile_example', orient := 'genotype')
ORDER BY ID, IID;
CHROM POS ID IID genotype
1 10000 rs1 SAMPLE1 0
1 10000 rs1 SAMPLE2 1
1 10000 rs1 SAMPLE3 2
1 10000 rs1 SAMPLE4 NULL
1 20000 rs2 SAMPLE1 1
1 20000 rs2 SAMPLE2 1
1 20000 rs2 SAMPLE3 0
1 20000 rs2 SAMPLE4 2
1 30000 rs3 SAMPLE1 2
1 30000 rs3 SAMPLE2 NULL
1 30000 rs3 SAMPLE3 1
1 30000 rs3 SAMPLE4 0
2 15000 rs4 SAMPLE1 0
2 15000 rs4 SAMPLE2 0
2 15000 rs4 SAMPLE3 1
2 15000 rs4 SAMPLE4 2

read_pfile vs read_pgen

read_pfile takes a prefix (e.g., 'test/data/pfile_example') and auto-discovers the .pgen, .pvar, and .psam files. read_pgen takes the .pgen path directly. See read_pfile and read_pgen for details.

With genotype orient data, standard SQL works naturally. For example, count the total alternate alleles carried by each individual:

SELECT IID, SUM(genotype) AS total_alt_alleles
FROM read_pfile('test/data/pfile_example', orient := 'genotype')
WHERE genotype IS NOT NULL
GROUP BY IID
ORDER BY IID;
IID total_alt_alleles
SAMPLE1 3
SAMPLE2 2
SAMPLE3 4
SAMPLE4 4

Or find all heterozygous calls (genotype = 1):

SELECT ID, IID
FROM read_pfile('test/data/pfile_example', orient := 'genotype')
WHERE genotype = 1
ORDER BY ID, IID;
ID IID
rs1 SAMPLE2
rs2 SAMPLE1
rs2 SAMPLE2
rs3 SAMPLE3
rs4 SAMPLE3

3. Quality Control

Before running any analysis, you need to check data quality. PlinkingDuck provides dedicated functions for the three pillars of genotype QC: missingness, allele frequency, and Hardy-Weinberg equilibrium. These use optimized paths in pgenlib that are faster than computing from raw genotypes.

Missingness

Missingness is the fraction of genotype calls that failed. High missingness for a variant suggests a technical problem with that assay; high missingness for a sample suggests a problem with that individual's DNA.

Per-Variant Missingness

SELECT ID, MISSING_CT, OBS_CT, F_MISS
FROM plink_missing('test/data/pgen_example.pgen');
ID MISSING_CT OBS_CT F_MISS
rs1 1 3 0.25
rs2 0 4 0.0
rs3 1 3 0.25
rs4 0 4 0.0

Variants rs1 and rs3 each have one missing call (25% missingness). In a real study, you would typically remove variants exceeding a threshold (e.g., 5%):

SELECT ID, F_MISS
FROM plink_missing('test/data/pgen_example.pgen')
WHERE F_MISS > 0.05;

Per-Sample Missingness

Switch to sample mode to check individual-level data quality:

SELECT IID, MISSING_CT, OBS_CT, F_MISS
FROM plink_missing('test/data/pgen_example.pgen',
    mode := 'sample', psam := 'test/data/pfile_example.psam');
IID MISSING_CT OBS_CT F_MISS
SAMPLE1 0 4 0.0
SAMPLE2 1 3 0.25
SAMPLE3 0 4 0.0
SAMPLE4 1 3 0.25

SAMPLE2 and SAMPLE4 each have one missing genotype.

Allele Frequency

Allele frequency is how common each allele is in the population. The alternate allele frequency (ALT_FREQ) ranges from 0 to 1. Variants with very low frequency (rare variants) may need special handling.

SELECT ID, ALT_FREQ, OBS_CT
FROM plink_freq('test/data/pgen_example.pgen');
ID ALT_FREQ OBS_CT
rs1 0.5 6
rs2 0.5 8
rs3 0.5 6
rs4 0.375 8

OBS_CT is allele count, not sample count

OBS_CT is the number of observed alleles (2 per non-missing sample, since humans are diploid). With 3 non-missing samples, OBS_CT = 6.

Use counts := true to see the underlying genotype counts:

SELECT ID, HOM_REF_CT, HET_CT, HOM_ALT_CT, MISSING_CT
FROM plink_freq('test/data/pgen_example.pgen', counts := true);
ID HOM_REF_CT HET_CT HOM_ALT_CT MISSING_CT
rs1 1 1 1 1
rs2 1 2 1 0
rs3 1 1 1 1
rs4 2 1 1 0

Hardy-Weinberg Equilibrium

Hardy-Weinberg equilibrium (HWE) predicts the expected genotype frequencies given the allele frequencies. Departures from HWE can indicate genotyping errors, population structure, or selection. The HWE exact test produces a p-value — small values suggest the variant deviates from equilibrium.

SELECT ID, HOM_REF_CT, HET_CT, HOM_ALT_CT,
       ROUND(P_HWE, 4) AS P_HWE
FROM plink_hardy('test/data/pgen_example.pgen');
ID HOM_REF_CT HET_CT HOM_ALT_CT P_HWE
rs1 1 1 1 1.0
rs2 1 2 1 1.0
rs3 1 1 1 1.0
rs4 2 1 1 0.4286

All variants are in equilibrium (p-values well above any reasonable threshold). In practice, variants with P_HWE < 1e-6 are often excluded.

Putting It Together: A QC Summary

The power of SQL shines when you combine multiple analyses. This CTE joins frequency, missingness, and HWE results into a single QC summary:

WITH freq AS (
    SELECT ID, ALT_FREQ
    FROM plink_freq('test/data/pgen_example.pgen')
),
miss AS (
    SELECT ID, F_MISS
    FROM plink_missing('test/data/pgen_example.pgen')
),
hardy AS (
    SELECT ID, P_HWE
    FROM plink_hardy('test/data/pgen_example.pgen')
)
SELECT f.ID, f.ALT_FREQ, m.F_MISS, h.P_HWE
FROM freq f
JOIN miss m ON f.ID = m.ID
JOIN hardy h ON f.ID = h.ID
ORDER BY f.ID;
ID ALT_FREQ F_MISS P_HWE
rs1 0.5 0.25 1.0
rs2 0.5 0.0 1.0
rs3 0.5 0.25 1.0
rs4 0.375 0.0 0.4285714285714286

From here you could filter to only variants passing QC:

-- Variants passing all QC thresholds
-- (not run here since our tiny dataset passes everything)
WHERE f.ALT_FREQ > 0.01          -- MAF > 1%
  AND m.F_MISS < 0.05            -- missingness < 5%
  AND h.P_HWE > 1e-6             -- HWE p-value > 1e-6

Note

For a deeper dive into QC workflows, see the Quality Control Guide.


4. Linkage Disequilibrium

Linkage disequilibrium (LD) measures the statistical correlation between genotypes at different variants. Variants in high LD tend to be inherited together. LD is fundamental to understanding genome structure and interpreting association results.

Pairwise LD

To measure LD between two specific variants, use pairwise mode:

SELECT ID_A, ID_B, ROUND(R2, 4) AS R2, D_PRIME, OBS_CT
FROM plink_ld('test/data/pgen_example.pgen',
    variant1 := 'rs1', variant2 := 'rs2');
ID_A ID_B R2 D_PRIME OBS_CT
rs1 rs2 0.75 0.5 3

is the squared correlation between genotypes (0 = independent, 1 = perfectly correlated). D' is a normalized measure of allelic association. OBS_CT is the number of samples with non-missing genotypes at both variants.

Windowed LD

To scan all variant pairs within a genomic window, omit the variant parameters. Set r2_threshold to control which pairs are reported:

SELECT ID_A, ID_B, ROUND(R2, 4) AS R2,
       ROUND(D_PRIME, 4) AS D_PRIME, OBS_CT
FROM plink_ld('test/data/pgen_example.pgen',
    r2_threshold := 0.0);
ID_A ID_B R2 D_PRIME OBS_CT
rs1 rs2 0.75 0.5 3
rs1 rs3 1.0 1.0 2
rs2 rs3 0.25 0.3333 3

By default, only same-chromosome pairs within 1000 kb are tested. Variants rs1 and rs3 show perfect LD (r² = 1.0), though with only 2 shared observations.

Cross-Chromosome LD

To include pairs across chromosomes (useful for checking for population structure artifacts), set inter_chr := true:

SELECT ID_A, ID_B, ROUND(R2, 4) AS R2, OBS_CT
FROM plink_ld('test/data/pgen_example.pgen',
    r2_threshold := 0.0, inter_chr := true);
ID_A ID_B R2 OBS_CT
rs1 rs2 0.75 3
rs1 rs3 1.0 2
rs1 rs4 0.75 3
rs2 rs3 0.25 3
rs2 rs4 0.1818 4
rs3 rs4 1.0 3

Warning

High LD between variants on different chromosomes is unexpected in real data and usually signals population stratification or genotyping artifacts.


5. Polygenic Risk Scoring

A polygenic risk score (PRS) summarizes an individual's genetic predisposition by weighting each variant's genotype by its effect size from a prior study. PlinkingDuck's plink_score computes this efficiently, handling missing data with mean imputation by default.

Positional Weights

The simplest approach provides one weight per variant, in order:

SELECT IID, SCORE_SUM, SCORE_AVG
FROM plink_score('test/data/pgen_example.pgen',
    psam := 'test/data/pfile_example.psam',
    weights := [0.5, -0.3, 1.2, 0.8]);
IID SCORE_SUM SCORE_AVG
SAMPLE1 2.1 0.2625
SAMPLE2 1.4 0.175
SAMPLE3 3.0 0.375
SAMPLE4 1.5 0.1875

Each score is sum(weight × dosage) across all variants. SCORE_AVG normalizes by the number of observed alleles.

ID-Keyed Weights

For real applications, you typically have a scoring file that maps variant IDs and alleles to weights. ID-keyed mode handles this — variants not found in the data are silently skipped:

SELECT IID, ROUND(SCORE_SUM, 2) AS SCORE_SUM,
       ROUND(SCORE_AVG, 4) AS SCORE_AVG
FROM plink_score('test/data/pgen_example.pgen',
    psam := 'test/data/pfile_example.psam',
    weights := [
        {'id': 'rs1', 'allele': 'G', 'weight': 0.5},
        {'id': 'rs2', 'allele': 'T', 'weight': -0.3},
        {'id': 'rs4', 'allele': 'C', 'weight': 0.8}
    ]);
IID SCORE_SUM SCORE_AVG
SAMPLE1 -0.3 -0.05
SAMPLE2 0.2 0.0333
SAMPLE3 1.8 0.3
SAMPLE4 1.5 0.25

The allele field specifies which allele the weight applies to. If it matches ALT, the dosage is used directly; if it matches REF, the dosage is flipped.

Missing data handling

By default, missing genotypes are imputed with the variant's mean dosage. Use no_mean_imputation := true to skip missing variants instead (each sample's ALLELE_CT will reflect only non-missing variants). See plink_score for details.


6. Genome-Wide Association Testing

A genome-wide association study (GWAS) tests every variant for association with a trait. plink_glm runs per-variant regression using plink2's proven regression engine — the same math that powers plink2 --glm.

Linear Regression (Continuous Trait)

For a continuous trait like height or BMI, plink_glm fits a linear regression at each variant:

SELECT ID, ROUND(BETA, 4) AS BETA, ROUND(P, 6) AS P
FROM plink_glm('test/data/pgen_example',
    phenotype := [1.5, 2.3, 3.7, 0.8])
ORDER BY P;
ID BETA P
rs2 -1.45 0.04880
rs1 1.1 0.099425
rs4 -0.3364 0.741259
rs3 0.35 0.851413

Each row reports the effect size (BETA — how much the trait changes per copy of the ALT allele) and p-value (how unlikely this association is under the null hypothesis of no effect).

Logistic Regression (Case/Control)

For binary traits (disease yes/no), plink_glm auto-detects the 0/1 phenotype and switches to logistic regression, reporting odds ratios:

SELECT ID, ROUND(BETA, 4), ROUND("OR", 4), ROUND(P, 4), FIRTH_YN
FROM plink_glm('test/data/large_example',
    phenotype := [0, 1, 0, 1, 1, 0, 1, 0])
WHERE ID = 'var1';
ID BETA OR P FIRTH_YN
var1 0.0 1.0 1.0 N

The OR (odds ratio) is exp(BETA). An OR of 1.0 means no association. When logistic regression fails to converge (common with rare variants), plink2's Firth correction kicks in automatically — the FIRTH_YN column tells you which method was used.

Adding Covariates

Real GWAS always adjusts for confounders like age, sex, and ancestry principal components:

SELECT ID, ROUND(BETA, 4) AS BETA, ROUND(P, 6) AS P
FROM plink_glm('test/data/large_example',
    phenotype := [1.2, 3.4, 2.1, 5.6, 4.3, 0.9, 3.8, 2.7],
    covariates := {
        'age': [25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0]
    })
WHERE ID = 'var1';
ID BETA P
var1 -1.1205 0.369083

Phenotypes from External Files

For real datasets, phenotypes and covariates live in external files. Use DuckDB's SET VARIABLE to load them, then pass with getvariable().

The phenotype and covariate lists must be in the same order as samples in the .pgen file. The safest approach is to join against the .psam file:

-- Join phenotype file against psam to match pgen sample order
SET VARIABLE pheno = (
    SELECT list(p.pheno_quant ORDER BY s.rowid)
    FROM read_psam('data/cohort.psam') s
    JOIN 'phenotypes.parquet' p ON s.IID = p.IID
);

SET VARIABLE covars = (
    SELECT {
        'age': list(p.age ORDER BY s.rowid),
        'bmi': list(p.bmi ORDER BY s.rowid),
        'pc1': list(p.pc1 ORDER BY s.rowid)
    }
    FROM read_psam('data/cohort.psam') s
    JOIN 'covariates.parquet' p ON s.IID = p.IID
);

-- Run GWAS: top hits
SELECT ID, CHROM, POS, ROUND(BETA, 4) AS BETA, P
FROM plink_glm('data/cohort',
    phenotype := getvariable('pheno'),
    covariates := getvariable('covars'))
WHERE P < 5e-8
ORDER BY P;

The ORDER BY s.rowid ensures values are emitted in .psam row order, which matches the .pgen sample order.

Pre-sorted shortcut

If your parquet file is already sorted to match pgen sample order (e.g., exported from the same pipeline), you can skip the join: SELECT list(pheno_quant) FROM 'phenotypes.parquet'.

See plink_glm for the full parameter and output reference.


7. Working at Scale

The examples above used a tiny 4-variant dataset. Real genomic data has millions of variants and thousands of samples. PlinkingDuck is designed for this — parallel scanning, projection pushdown, and region filtering all help keep queries fast.

Let's demonstrate with a larger dataset: 3,000 variants across 8 samples on 3 chromosomes.

SELECT
    (SELECT COUNT(*) FROM read_pvar('test/data/large_example.pvar')) AS n_variants,
    (SELECT COUNT(*) FROM read_psam('test/data/large_example.psam')) AS n_samples;
n_variants n_samples
3000 8
SELECT CHROM, COUNT(*) AS n_variants
FROM read_pvar('test/data/large_example.pvar')
GROUP BY CHROM
ORDER BY CHROM;
CHROM n_variants
1 1000
2 1000
3 1000

Region Filtering

When you only need variants in a specific genomic region, the region parameter restricts processing before any computation begins. This is much more efficient than filtering with WHERE after the fact:

-- Efficient: only reads variants in the region
SELECT COUNT(*) AS n_variants
FROM plink_freq('test/data/large_example.pgen',
    region := '1:1-50000');
n_variants
500

Region filtering works with all analysis functions (plink_freq, plink_hardy, plink_missing, plink_ld, plink_score) and read_pfile.

-- Combine region filtering with aggregation
SELECT COUNT(*) AS n_variants, ROUND(AVG(ALT_FREQ), 4) AS mean_freq
FROM plink_freq('test/data/large_example.pgen',
    region := '1:1-50000');
n_variants mean_freq
500 0.5

Performance tips

  • Use region instead of WHERE CHROM = ... — it avoids reading genotype data outside the region
  • Select only the columns you need — projection pushdown skips unused computation
  • Use dedicated analysis functions (plink_freq, plink_hardy) instead of computing from raw genotypes — they use optimized pgenlib paths

See the Performance Guide for the full set of optimization strategies.


Next Steps

You've now seen the complete PlinkingDuck workflow: loading data, exploring genotypes, running QC, analyzing LD, computing PRS scores, running GWAS, and working efficiently at scale. Here are some directions to explore next: