Skip to content

read_plink_vcf

Read genotypes from a VCF file using plink-ng's optimized GT parsing. Designed as a fast path for extracting biallelic genotypes — skips multiallelic variants and outputs in the same format as read_pfile().

For full VCF/BCF parsing (multiallelic, INFO fields, annotations), use DuckHTS read_bcf() instead.

Synopsis

read_plink_vcf(path VARCHAR [, genotypes := ..., phased := ...,
               region := ..., min_gq := ..., min_dp := ..., max_dp := ...,
               halfcall := ...]) -> TABLE

Parameters

Name Type Default Description
path VARCHAR (required) Path to .vcf or .vcf.gz file
genotypes VARCHAR 'auto' Output format: 'array', 'list', or 'columns'
phased BOOLEAN false Output phased haplotype pairs (ARRAY(TINYINT, 2))
region VARCHAR All Filter to genomic region (chr or chr:start-end)
min_gq INTEGER -1 (disabled) Minimum genotype quality (GQ); samples below → NULL
min_dp INTEGER -1 (disabled) Minimum read depth (DP); samples below → NULL
max_dp INTEGER -1 (disabled) Maximum read depth (DP); samples above → NULL
halfcall VARCHAR 'missing' Half-call handling: 'missing', 'reference', 'haploid', or 'error'

Genotype Output Modes

  • 'auto''array' if sample count ≤ 100,000, else 'list'
  • 'array'ARRAY(TINYINT, N) where N = sample count (fixed-size, fastest)
  • 'list'LIST(TINYINT) (variable-size, no sample count limit)
  • 'columns' — one TINYINT column per sample, named by sample ID

Quality Filtering

Quality filters apply per-sample: if a sample's GQ or DP falls outside the specified range, that sample's genotype is set to NULL (missing) for that variant. The variant itself is still output.

Phased Mode

When phased := true, genotype elements become ARRAY(TINYINT, 2) representing [allele1, allele2]: - [0, 0] = hom ref - [0, 1] = ref|alt (standard het) - [1, 0] = alt|ref (phase-flipped het) - [1, 1] = hom alt - NULL = missing

Output Columns

Column Type Description
CHROM VARCHAR Chromosome
POS INTEGER Position (1-based)
ID VARCHAR Variant identifier (NULL if .)
REF VARCHAR Reference allele
ALT VARCHAR Alternate allele
genotypes ARRAY(TINYINT, N) Per-sample genotype values

In 'columns' mode, the genotypes column is replaced by one column per sample.

Genotype Encoding

Value Meaning
0 Homozygous reference
1 Heterozygous
2 Homozygous alternate
NULL Missing genotype

Behavior

  • Multiallelic variants are skipped — a warning reports the count at scan completion
  • GT must be the first FORMAT subfield — throws an error otherwise
  • Compressed VCF (.vcf.gz, bgzf) is transparently decompressed via DuckDB VFS
  • Projection pushdown — genotype parsing is skipped entirely when only metadata columns are referenced

Examples

-- Basic variant count
SELECT COUNT(*) FROM read_plink_vcf('cohort.vcf.gz');

-- Genotype distribution
SELECT
  genotypes[1] as gt,
  COUNT(*) as n
FROM read_plink_vcf('sample.vcf')
GROUP BY gt;

-- High-quality variants on chr1
SELECT *
FROM read_plink_vcf('wgs.vcf.gz', region := 'chr1', min_gq := 30, min_dp := 10)
LIMIT 100;

-- Phased haplotypes
SELECT id, genotypes
FROM read_plink_vcf('phased.vcf', phased := true);

-- Per-sample columns for easy pivoting
SELECT SAMPLE1, SAMPLE2
FROM read_plink_vcf('trio.vcf', genotypes := 'columns');

-- Cross-validate VCF against pfile
SELECT COUNT(*) FROM (
  SELECT v.id
  FROM read_plink_vcf('data.vcf') v
  JOIN read_pfile('data') p ON v.id = p.id
  WHERE v.genotypes != p.genotypes
);

Limitations

  • Only biallelic variants (multiallelic are skipped with a warning)
  • No INFO/QUAL/FILTER field access (use DuckHTS for that)
  • Single-threaded sequential scan (no parallel chunking)
  • genotypes := 'struct', 'counts', and 'stats' modes are not supported
  • BCF (binary VCF) is not supported — only text VCF and gzipped VCF