Genotype Encoding¶

All genotype data in PlinkingDuck uses TINYINT values with a simple additive encoding.

Encoding Table¶

Value	Meaning	VCF Equivalent
`0`	Homozygous reference	`0/0`
`1`	Heterozygous	`0/1`
`2`	Homozygous alternate	`1/1`
`NULL`	Missing genotype	`./.`

The value represents the count of alternate alleles: 0 copies, 1 copy, or 2 copies.

Contexts¶

Genotypes appear in two contexts depending on the function and mode:

List Context¶

In read_pgen, read_pfile default mode, and read_plink_vcf, genotypes are returned as an ARRAY(TINYINT, N) column named genotypes, where N is the sample count. Each array element corresponds to one sample, in the same order as the .psam file (or VCF header for read_plink_vcf).

-- Each row has a list of genotypes, one per sample
SELECT ID, genotypes FROM read_pgen('data.pgen');
-- rs1  [0, 1, 2, NULL]
-- rs2  [0, 0, 1, 1]

Scalar Context¶

In read_pfile genotype orient mode (orient := 'genotype'), each genotype is a scalar TINYINT column named genotype. There is one row per variant-sample combination.

-- Each row is one variant-sample pair
SELECT ID, IID, genotype
FROM read_pfile('data', orient := 'genotype');
-- rs1  SAMPLE1  0
-- rs1  SAMPLE2  1
-- rs1  SAMPLE3  2
-- rs1  SAMPLE4  NULL

Columns Context¶

With genotypes := 'columns', each sample (in variant orient) or each variant (in sample orient) gets its own scalar TINYINT column. Column names come from sample IIDs or variant IDs.

-- One column per sample
SELECT ID, SAMPLE1, SAMPLE2, SAMPLE3
FROM read_pfile('data', genotypes := 'columns');
-- rs1  0  1  2
-- rs2  1  1  0

Phased Haplotype Output¶

With phased := true (supported by read_pfile and read_plink_vcf), genotype elements change from scalar TINYINT to ARRAY(TINYINT, 2) representing [allele1, allele2] haplotype pairs:

Value	Meaning	VCF Equivalent
`[0, 0]`	Homozygous reference	`0\\|0`
`[0, 1]`	Ref/Alt (or unphased het)	`0\\|1`
`[1, 0]`	Alt/Ref	`1\\|0`
`[1, 1]`	Homozygous alternate	`1\\|1`
`NULL`	Missing	`.\\|.`

SELECT ID, genotypes
FROM read_pfile('data', phased := true);
-- rs1  [[0, 0], [0, 1], [1, 1], NULL]

Phased output works across all orient modes and genotypes modes (array, list). Incompatible with dosages := true.

Working with Genotypes¶

Counting alternate alleles per sample¶

SELECT IID, SUM(genotype) AS total_alt_alleles
FROM read_pfile('data', orient := 'genotype')
WHERE genotype IS NOT NULL
GROUP BY IID;

Filtering for heterozygous calls¶

SELECT ID, IID
FROM read_pfile('data', orient := 'genotype')
WHERE genotype = 1;

Computing allele frequency from genotypes¶

SELECT ID,
    AVG(genotype) / 2.0 AS alt_freq
FROM read_pfile('data', orient := 'genotype')
WHERE genotype IS NOT NULL
GROUP BY ID;

Note

For efficient allele frequency computation, prefer plink_freq() which uses pgenlib's fast counting path without full genotype decompression.

Internal Representation¶

PLINK 2 .pgen files store genotypes as 2-bit packed values (4 genotypes per byte). The raw encoding is:

2-bit value	Meaning
`00`	Homozygous reference
`01`	Heterozygous
`10`	Homozygous alternate
`11`	Missing

PlinkingDuck converts these to the user-facing 0/1/2/NULL encoding during reads. Missing values (11 = -9 in pgenlib's byte representation) are mapped to SQL NULL.