Performance
PlinkingDuck is designed for efficient access to large PLINK datasets. This guide explains the performance features and how to take advantage of them.
Projection Pushdown
All PlinkingDuck functions support projection pushdown: if a column is not referenced in your query, the work to compute it is skipped.
This is most impactful for genotype data. When you only need variant metadata, genotype decoding is skipped entirely:
-- Fast: no genotype decoding
SELECT CHROM, POS, ID FROM read_pgen('large_dataset.pgen');
-- Slower: requires decoding genotypes for every variant
SELECT * FROM read_pgen('large_dataset.pgen');
Projection pushdown also applies to the analysis functions:
-- Fast: skips PgrGetCounts, only reads variant metadata
SELECT CHROM, POS, ID FROM plink_freq('large_dataset.pgen');
-- Normal: computes frequencies
SELECT ID, ALT_FREQ FROM plink_freq('large_dataset.pgen');
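The column-skipping behavior described above can be pictured with a small Python sketch. This is an illustration only, not PlinkingDuck's implementation: `scan`, `decode_genotypes`, and the `packed` field are hypothetical names, and the call counter exists just to make the skipped work visible.

```python
# Hypothetical sketch of projection pushdown; none of these names come from PlinkingDuck.
decode_calls = 0


def decode_genotypes(variant):
    """Stand-in for an expensive genotype decode."""
    global decode_calls
    decode_calls += 1
    return list(variant["packed"])


def scan(variants, projected_columns):
    """Yield one row per variant, computing only the projected columns."""
    for v in variants:
        row = {}
        for col in projected_columns:
            if col == "genotypes":
                # Only reached when the query actually references genotypes.
                row[col] = decode_genotypes(v)
            else:
                row[col] = v[col]
        yield row


variants = [
    {"CHROM": "1", "POS": 100, "ID": "rs1", "packed": (0, 1, 2)},
    {"CHROM": "1", "POS": 200, "ID": "rs2", "packed": (2, 2, 0)},
]

# Metadata-only projection: the decoder is never invoked.
metadata_rows = list(scan(variants, ["CHROM", "POS", "ID"]))
calls_after_metadata_scan = decode_calls

# Full projection: every variant is decoded once.
full_rows = list(scan(variants, ["CHROM", "POS", "ID", "genotypes"]))
calls_after_full_scan = decode_calls
```

The scan decides what to compute from the projected column list, so a metadata-only query pays nothing for genotype decoding.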
Parallel Scanning
Most functions that read genotype data from .pgen files scan in parallel across multiple threads; text-file readers and a few modes run sequentially:
| Function | Parallel | Notes |
|---|---|---|
| read_plink_vcf | No | Sequential text VCF parsing |
| read_pvar | No | Sequential text file reading |
| read_psam | No | Sequential text file reading; .psam files are small |
| read_pgen | Yes | Per-thread pgenlib readers |
| read_pfile (default) | Yes | Same as read_pgen |
| read_pfile (genotype orient) | No | Row expansion is single-threaded |
| read_pfile (sample orient) | No | Pre-reads all genotypes at bind time |
| plink_freq | Yes | Per-thread readers, atomic batch claiming |
| plink_hardy | Yes | Same pattern as plink_freq |
| plink_missing (variant) | Yes | Parallel missingness extraction |
| plink_missing (sample) | Yes | Parallel per-sample accumulation |
| plink_ld (windowed) | Yes | Per-thread anchor claiming |
| plink_ld (pairwise) | No | Single pair computation |
| plink_score | Yes | Parallel per-sample accumulation |
| plink_glm | Yes | Per-thread regression |
Each parallel function uses atomic batch claiming: threads claim batches of variants from a shared counter, ensuring even work distribution without lock contention.
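A rough Python sketch of the batch-claiming pattern follows. It is an assumption-laden illustration rather than the extension's code: `BatchClaimer`, `worker`, and `run` are hypothetical names, and because Python has no native atomic integer, a `threading.Lock` stands in for the atomic fetch-and-add that a native implementation would presumably use.

```python
import threading


class BatchClaimer:
    """Hands out disjoint [start, end) variant ranges from a shared counter."""

    def __init__(self, total_variants, batch_size):
        self.total = total_variants
        self.batch_size = batch_size
        self._next = 0
        # Stand-in for an atomic fetch-and-add; Python lacks a native atomic int.
        self._lock = threading.Lock()

    def claim(self):
        """Return the next unclaimed (start, end) range, or None when exhausted."""
        with self._lock:
            start = self._next
            if start >= self.total:
                return None
            self._next = min(start + self.batch_size, self.total)
            return start, self._next


def worker(claimer, processed, processed_lock):
    """Keep claiming batches until the shared counter runs out."""
    while True:
        batch = claimer.claim()
        if batch is None:
            return
        start, end = batch
        with processed_lock:
            processed.extend(range(start, end))


def run(total_variants, batch_size, n_threads):
    """Process all variant indices with n_threads workers; return them sorted."""
    claimer = BatchClaimer(total_variants, batch_size)
    processed, processed_lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(claimer, processed, processed_lock))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(processed)
```

Because a thread only asks for more work when it finishes its current batch, faster threads naturally take on more batches and no variant is processed twice.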
The maximum thread count scales with the number of variants (typically min(variants/500 + 1, 16)).
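The formula above can be written out as a short Python helper. The function name and parameter names are hypothetical, and integer division is an assumption; the text only gives the expression `min(variants/500 + 1, 16)` as the typical case.

```python
def max_threads(n_variants, variants_per_thread=500, cap=16):
    """Typical thread ceiling: one extra thread per 500 variants, capped at 16.

    Assumes integer (floor) division, which the source does not state explicitly.
    """
    return min(n_variants // variants_per_thread + 1, cap)
```

Under these assumptions a 100-variant file gets a single thread, while any file with 7,500 or more variants reaches the cap of 16.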
Region Filtering
For large datasets, region filtering restricts processing to a genomic region before any computation begins:
-- Only processes variants on chr22 between positions 1-50M
SELECT * FROM plink_freq('biobank.pgen',
    region := '22:1-50000000');
Region filtering works by scanning the pre-loaded variant metadata (sorted by chromosome and position) to find the matching variant index range. The .pgen reader then starts at the first matching variant and stops after the last.
This avoids reading genotype data for variants outside the region.
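The index-range lookup described above might look like the following Python sketch. The helper names are hypothetical, the region bounds are assumed to be inclusive, and the linear scan mirrors the "scanning" wording in the text; since the metadata is sorted, a binary search would be an obvious alternative.

```python
def parse_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def region_index_range(variants, region):
    """Return (first, end_exclusive) variant indices for the region, or None.

    variants: list of (chrom, pos) tuples sorted by chromosome, then position.
    Bounds are treated as inclusive, which is an assumption.
    """
    chrom, start, end = parse_region(region)
    first = last = None
    for i, (c, pos) in enumerate(variants):
        if c == chrom and start <= pos <= end:
            if first is None:
                first = i
            last = i
    if first is None:
        return None
    return first, last + 1


variants = [("21", 100), ("22", 10), ("22", 500), ("22", 60000000), ("X", 5)]
index_range = region_index_range(variants, "22:1-50000000")
```

A reader would then seek to the first index and stop at the end index, never touching genotype records outside that range.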
Filter Pushdown
The af_range, ac_range, and genotype_range parameters let you filter variants or genotype values before full decompression:
-- Skip variants outside the allele frequency range (uses PgrGetCounts, no decompression)
SELECT * FROM read_pfile('biobank',
    af_range := {min: 0.01, max: 0.5});
-- Only keep heterozygous calls
SELECT * FROM read_pgen('biobank.pgen',
    genotype_range := {min: 1, max: 1});
af_range and ac_range use pgenlib's PgrGetCounts to count genotypes without decompressing them, then skip variants that fail the filter. genotype_range uses PgrGetCounts as a pre-check as well: if a variant has no genotypes in range, it is skipped entirely without decompression.
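As a sketch of how count-based filtering can decide a variant's fate before any genotypes are expanded, consider the Python below. Everything here is an assumption for illustration: the function names are hypothetical, variants are taken to be diploid and biallelic, the frequency is the ALT allele frequency with missing calls excluded from the denominator, and range bounds are inclusive.

```python
def alt_freq(hom_ref, het, hom_alt):
    """ALT allele frequency from genotype counts (missing calls excluded)."""
    n_alleles = 2 * (hom_ref + het + hom_alt)
    if n_alleles == 0:
        return None
    return (het + 2 * hom_alt) / n_alleles


def passes_af_range(counts, af_min, af_max):
    """True if the variant's ALT frequency lies within [af_min, af_max]."""
    freq = alt_freq(counts["hom_ref"], counts["het"], counts["hom_alt"])
    return freq is not None and af_min <= freq <= af_max


def has_genotype_in_range(counts, g_min, g_max):
    """True if at least one non-missing genotype value falls in [g_min, g_max]."""
    by_value = {0: counts["hom_ref"], 1: counts["het"], 2: counts["hom_alt"]}
    return any(n > 0 for g, n in by_value.items() if g_min <= g <= g_max)


common = {"hom_ref": 90, "het": 9, "hom_alt": 1, "missing": 0}
rare = {"hom_ref": 99, "het": 1, "hom_alt": 0, "missing": 0}
monomorphic = {"hom_ref": 100, "het": 0, "hom_alt": 0, "missing": 0}
```

Only the four counts are needed for either decision, so a variant that fails the check can be dropped without its per-sample genotypes ever being materialized.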
Sample Subsetting
When using the samples parameter, only the specified samples are processed:
SELECT * FROM plink_freq('biobank.pgen',
    samples := [0, 100, 200]);
Sample subsetting is handled at the pgenlib level using PgrGetCounts / PgrGet with a sample bitmask. For functions using PgrGetCounts (plink_freq, plink_hardy), this is very efficient since the counting path directly respects the bitmask without extracting individual genotypes.
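To give a feel for why counting under a bitmask is cheap, here is a simplified Python sketch. It does not reproduce pgenlib's actual storage layout: the per-category bitmaps, the function names, and the use of Python integers as bit vectors are all assumptions chosen to keep the idea visible.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def build_planes(genotypes):
    """Build one bitmap per genotype category; bit i corresponds to sample i.

    genotypes: list of 0, 1, 2, or None (missing). Illustrative layout only.
    """
    planes = {"het": 0, "hom_alt": 0, "missing": 0}
    for i, g in enumerate(genotypes):
        if g == 1:
            planes["het"] |= 1 << i
        elif g == 2:
            planes["hom_alt"] |= 1 << i
        elif g is None:
            planes["missing"] |= 1 << i
    return planes


def sample_mask(indices):
    """Bitmask with a bit set for each selected sample index."""
    mask = 0
    for i in indices:
        mask |= 1 << i
    return mask


def masked_counts(planes, mask):
    """Genotype counts restricted to the masked samples, via AND plus popcount."""
    het = popcount(planes["het"] & mask)
    hom_alt = popcount(planes["hom_alt"] & mask)
    missing = popcount(planes["missing"] & mask)
    hom_ref = popcount(mask) - het - hom_alt - missing
    return {"hom_ref": hom_ref, "het": het, "hom_alt": hom_alt, "missing": missing}


planes = build_planes([0, 1, 2, None, 1, 0])
subset_counts = masked_counts(planes, sample_mask([0, 1, 4]))
```

The counts come from a few bitwise ANDs and popcounts over whole words, so no individual genotype is extracted for the selected samples.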
Tips for Large Datasets
Use region filtering for targeted analyses
Instead of scanning all variants and filtering with WHERE:
-- Less efficient: scans all variants, filters in SQL
SELECT * FROM plink_freq('biobank.pgen')
WHERE CHROM = '22' AND POS BETWEEN 1 AND 50000000;
Use the region parameter:
-- More efficient: only reads variants in the region
SELECT * FROM plink_freq('biobank.pgen',
    region := '22:1-50000000');
Select only the columns you need
With projection pushdown, columns you do not select are never computed:
-- Faster: only decodes what's needed
SELECT ID, ALT_FREQ FROM plink_freq('biobank.pgen');
-- Slower: computes everything including counts
SELECT * FROM plink_freq('biobank.pgen', counts := true);
Use analysis functions instead of raw genotypes
When you need summary statistics, use the dedicated functions rather than computing from raw genotypes. They use optimized pgenlib paths:
-- Efficient: uses PgrGetCounts (no decompression)
SELECT ID, ALT_FREQ FROM plink_freq('data.pgen');
-- Less efficient: decompresses genotypes, then aggregates in SQL
SELECT ID, AVG(genotype) / 2.0 AS alt_freq
FROM read_pfile('data', orient := 'genotype')
GROUP BY ID;
Subset samples when possible
If your analysis only involves a subset of samples, specify them upfront:
-- Efficient: pgenlib only processes 3 samples per variant
SELECT * FROM plink_freq('data.pgen',
    samples := ['CASE1', 'CASE2', 'CASE3']);