Genotype Encoding

All genotype data in PlinkingDuck uses TINYINT values with a simple additive encoding.

Encoding Table

| Value | Meaning | VCF Equivalent |
|-------|---------|----------------|
| 0 | Homozygous reference | 0/0 |
| 1 | Heterozygous | 0/1 |
| 2 | Homozygous alternate | 1/1 |
| NULL | Missing genotype | ./. |

The value represents the count of alternate alleles: 0 copies, 1 copy, or 2 copies.
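
As an illustration of the table above, the mapping from a VCF GT string to the additive encoding can be sketched in Python. This is not PlinkingDuck's code: `gt_to_additive` is a hypothetical helper, it assumes biallelic diploid calls, and it treats any call with a missing allele as missing, which is an assumption rather than documented behavior.

```python
from typing import Optional


def gt_to_additive(gt: str) -> Optional[int]:
    """Hypothetical helper: map a biallelic diploid VCF GT string to 0/1/2/None."""
    # Accept both unphased ("/") and phased ("|") separators.
    alleles = gt.replace("|", "/").split("/")
    # Assumption: non-diploid calls and calls with any missing allele (".")
    # are treated as missing, which corresponds to SQL NULL.
    if len(alleles) != 2 or "." in alleles:
        return None
    # Count copies of the alternate allele ("1"); biallelic sites only.
    return sum(1 for allele in alleles if allele == "1")


for gt in ["0/0", "0/1", "1/1", "./."]:
    print(gt, "->", gt_to_additive(gt))
```

Python's `None` stands in for SQL NULL here.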

Contexts

Genotypes appear in three contexts depending on the function and mode:

List Context

In read_pgen, read_pfile (default mode), and read_plink_vcf, genotypes are returned as an ARRAY(TINYINT, N) column named genotypes, where N is the sample count. Each array element corresponds to one sample, in the same order as the .psam file (or the VCF header for read_plink_vcf).

-- Each row has a list of genotypes, one per sample
SELECT ID, genotypes FROM read_pgen('data.pgen');
-- rs1  [0, 1, 2, NULL]
-- rs2  [0, 0, 1, 1]

Scalar Context

In read_pfile genotype orient mode (orient := 'genotype'), each genotype is a scalar TINYINT column named genotype. There is one row per variant-sample combination.

-- Each row is one variant-sample pair
SELECT ID, IID, genotype
FROM read_pfile('data', orient := 'genotype');
-- rs1  SAMPLE1  0
-- rs1  SAMPLE2  1
-- rs1  SAMPLE3  2
-- rs1  SAMPLE4  NULL
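
The relationship between the list and scalar contexts can be sketched in Python: one variant row with N genotypes becomes N variant-sample rows. The `explode_variant` function is a hypothetical illustration, not part of PlinkingDuck, and the sample IDs are taken from the example output above.

```python
from typing import Iterator, List, Optional, Tuple


def explode_variant(
    variant_id: str,
    sample_ids: List[str],
    genotypes: List[Optional[int]],
) -> Iterator[Tuple[str, str, Optional[int]]]:
    """Hypothetical helper: turn one list-context row into scalar-context rows."""
    # Genotype i belongs to sample i, so pair them up positionally.
    for iid, genotype in zip(sample_ids, genotypes):
        yield (variant_id, iid, genotype)


samples = ["SAMPLE1", "SAMPLE2", "SAMPLE3", "SAMPLE4"]
rows = list(explode_variant("rs1", samples, [0, 1, 2, None]))
for row in rows:
    print(row)
```

The positional pairing relies on the guarantee that array elements follow the sample order of the .psam file.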

Columns Context

With genotypes := 'columns', each sample (in variant orient) or each variant (in sample orient) gets its own scalar TINYINT column. Column names come from sample IIDs or variant IDs.

-- One column per sample
SELECT ID, SAMPLE1, SAMPLE2, SAMPLE3
FROM read_pfile('data', genotypes := 'columns');
-- rs1  0  1  2
-- rs2  1  1  0

Phased Haplotype Output

With phased := true (supported by read_pfile and read_plink_vcf), genotype elements change from scalar TINYINT to ARRAY(TINYINT, 2) representing [allele1, allele2] haplotype pairs:

| Value | Meaning | VCF Equivalent |
|-------|---------|----------------|
| [0, 0] | Homozygous reference | 0\|0 |
| [0, 1] | Ref/Alt (or unphased het) | 0\|1 |
| [1, 0] | Alt/Ref | 1\|0 |
| [1, 1] | Homozygous alternate | 1\|1 |
| NULL | Missing | .\|. |

SELECT ID, genotypes
FROM read_pfile('data', phased := true);
-- rs1  [[0, 0], [0, 1], [1, 1], NULL]

Phased output works across all orient modes and genotypes modes (array, list). It is incompatible with dosages := true.
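
A haplotype pair relates to the additive encoding by simple summation: the two allele values add up to the alternate allele count. A minimal Python sketch, where `haplotypes_to_count` is a hypothetical name used only for illustration:

```python
from typing import List, Optional


def haplotypes_to_count(pair: Optional[List[int]]) -> Optional[int]:
    """Hypothetical helper: collapse [allele1, allele2] to a 0/1/2 count."""
    # A missing phased genotype stays missing.
    if pair is None:
        return None
    allele1, allele2 = pair
    # Each allele is 0 (reference) or 1 (alternate), so the sum is the count.
    return allele1 + allele2


phased_row = [[0, 0], [0, 1], [1, 1], None]
counts = [haplotypes_to_count(pair) for pair in phased_row]
print(counts)  # [0, 1, 2, None]
```

Note that [0, 1] and [1, 0] both collapse to 1, so the phase information is lost in the additive form.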

Working with Genotypes

Counting alternate alleles per sample

SELECT IID, SUM(genotype) AS total_alt_alleles
FROM read_pfile('data', orient := 'genotype')
WHERE genotype IS NOT NULL
GROUP BY IID;

Filtering for heterozygous calls

SELECT ID, IID
FROM read_pfile('data', orient := 'genotype')
WHERE genotype = 1;

Computing allele frequency from genotypes

SELECT ID,
    AVG(genotype) / 2.0 AS alt_freq
FROM read_pfile('data', orient := 'genotype')
WHERE genotype IS NOT NULL
GROUP BY ID;

Note

For efficient allele frequency computation, prefer plink_freq(), which uses pgenlib's fast counting path without full genotype decompression.
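
The arithmetic behind the AVG(genotype) / 2.0 query can be checked with a small Python sketch. It assumes diploid genotypes in the 0/1/2/NULL encoding, and `alt_allele_frequency` is a hypothetical helper rather than a PlinkingDuck function.

```python
from typing import List, Optional


def alt_allele_frequency(genotypes: List[Optional[int]]) -> Optional[float]:
    """Hypothetical helper: alternate allele frequency for one variant."""
    # Drop missing calls, mirroring WHERE genotype IS NOT NULL.
    called = [g for g in genotypes if g is not None]
    if not called:
        return None  # No called genotypes, so the frequency is undefined.
    # Each called diploid sample contributes two alleles to the denominator.
    return sum(called) / (2 * len(called))


print(alt_allele_frequency([0, 1, 2, None]))  # 0.5
print(alt_allele_frequency([0, 0, 1, 1]))     # 0.25
```

Dividing the total alternate allele count by twice the number of called samples is the same as halving the mean genotype, which is what the SQL query computes.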

Internal Representation

PLINK 2 .pgen files store genotypes as 2-bit packed values (4 genotypes per byte). The raw encoding is:

| 2-bit value | Meaning |
|-------------|---------|
| 00 | Homozygous reference |
| 01 | Heterozygous |
| 10 | Homozygous alternate |
| 11 | Missing |

PlinkingDuck converts these to the user-facing 0/1/2/NULL encoding during reads. Missing values (2-bit value 11, which is -9 in pgenlib's byte representation) are mapped to SQL NULL.
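
To make the 2-bit packing concrete, here is a rough Python sketch that unpacks one byte into four genotypes. It is not how PlinkingDuck or pgenlib decode data: it assumes the first genotype sits in the lowest-order two bits, and it covers only plain packed values, not any compressed record layouts a real .pgen file may use. The name `unpack_genotype_byte` is hypothetical.

```python
from typing import List, Optional

MISSING_CODE = 0b11  # 2-bit value for a missing genotype


def unpack_genotype_byte(byte: int, count: int = 4) -> List[Optional[int]]:
    """Hypothetical helper: decode up to 4 packed 2-bit genotypes from a byte."""
    genotypes: List[Optional[int]] = []
    for i in range(count):
        # Assumption: genotype i occupies bits 2*i and 2*i + 1 (low bits first).
        code = (byte >> (2 * i)) & 0b11
        # Codes 0, 1, 2 are already the additive value; 3 becomes None (NULL).
        genotypes.append(None if code == MISSING_CODE else code)
    return genotypes


# 0b11_10_01_00 holds, from low bits to high: 00, 01, 10, 11.
print(unpack_genotype_byte(0b11100100))  # [0, 1, 2, None]
```

The `count` parameter lets a caller decode a final, partially filled byte when the sample count is not a multiple of four.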