read_pfile¶
Read a complete PLINK 2 fileset (.pgen + .pvar + .psam) from a common prefix. Supports multiple orient modes, sample subsetting, variant filtering, and region filtering.
Synopsis¶
read_pfile(prefix VARCHAR [, pgen := ..., pvar := ..., psam := ...,
orient := ..., genotypes := ..., phased := ..., dosages := ...,
samples := ..., variants := ..., region := ...,
af_range := ..., ac_range := ..., genotype_range := ...]) -> TABLE
Parameters¶
| Name | Type | Default | Description |
|---|---|---|---|
prefix |
VARCHAR |
(required) | Common file prefix (e.g., 'data/cohort') |
pgen |
VARCHAR |
prefix.pgen |
Explicit .pgen path |
pvar |
VARCHAR |
prefix.pvar |
Explicit .pvar or .bim path |
psam |
VARCHAR |
prefix.psam |
Explicit .psam or .fam path |
orient |
VARCHAR |
'variant' |
Output orientation: 'variant', 'genotype', or 'sample' |
genotypes |
VARCHAR |
'array' |
Genotype output format: 'array', 'list', or 'columns' |
phased |
BOOLEAN |
false |
Output phased haplotype pairs instead of dosage counts |
dosages |
BOOLEAN |
false |
Output dosage values instead of hard calls |
samples |
LIST(VARCHAR) or LIST(INTEGER) |
All | Filter to specific samples |
variants |
LIST(VARCHAR) or LIST(INTEGER) |
All | Filter to specific variants |
region |
VARCHAR |
All | Filter to genomic region (chr:start-end) |
af_range |
STRUCT(min DOUBLE, max DOUBLE) |
All | Filter variants by allele frequency range |
ac_range |
STRUCT(min INTEGER, max INTEGER) |
All | Filter variants by allele count range |
genotype_range |
STRUCT(min TINYINT, max TINYINT) |
All | Filter individual genotype values |
See Common Parameters for details on samples, region, and filter parameters.
Variant Filtering¶
The variants parameter accepts:
LIST(VARCHAR)-- variant IDs (matched against ID in.pvar)LIST(INTEGER)-- 0-based variant indices
When both region and variants are specified, the result is the intersection (variants must match both filters).
Output Columns¶
Variant Orient Mode (Default)¶
| Column | Type | Description |
|---|---|---|
CHROM |
VARCHAR |
Chromosome |
POS |
INTEGER |
Base-pair position |
ID |
VARCHAR |
Variant identifier |
REF |
VARCHAR |
Reference allele |
ALT |
VARCHAR |
Alternate allele |
genotypes |
ARRAY(TINYINT, N) |
Genotype calls, one per sample |
Genotype Orient Mode (orient := 'genotype')¶
| Column | Type | Description |
|---|---|---|
CHROM |
VARCHAR |
Chromosome |
POS |
INTEGER |
Base-pair position |
ID |
VARCHAR |
Variant identifier |
REF |
VARCHAR |
Reference allele |
ALT |
VARCHAR |
Alternate allele |
FID |
VARCHAR |
Family ID (from .psam) |
IID |
VARCHAR |
Individual ID (from .psam) |
SEX |
INTEGER |
Sex (from .psam) |
| additional | varies | Any extra .psam columns |
genotype |
TINYINT |
Single genotype value |
In genotype orient mode, each row represents one variant-sample combination. The sample metadata columns come from the .psam file and vary depending on what columns are present.
Sample Orient Mode (orient := 'sample')¶
| Column | Type | Description |
|---|---|---|
FID |
VARCHAR |
Family ID (from .psam) |
IID |
VARCHAR |
Individual ID (from .psam) |
SEX |
INTEGER |
Sex (from .psam) |
| additional | varies | Any extra .psam columns |
genotypes |
ARRAY(TINYINT, M) |
Genotype calls, one per variant |
In sample orient mode, each row represents one sample. The genotypes array contains one value per variant (M = effective variant count after filtering). All genotypes are pre-read at bind time.
Description¶
read_pfile is the most feature-rich reader function. It combines genotype, variant, and sample data with flexible filtering.
File Discovery¶
By default, read_pfile constructs file paths by appending extensions to the prefix: prefix.pgen, prefix.pvar, prefix.psam. Each can be overridden individually.
Genotype Output Formats¶
The genotypes parameter controls how genotype data is structured:
'array'(default): fixed-sizeARRAY(TINYINT, N)— fastest, requires compile-time size'list': variable-lengthLIST(TINYINT)— use for large cohorts (>100K samples)'columns': one scalarTINYINTcolumn per sample (variant orient) or per variant (sample orient); column names from IIDs or variant IDs. Incompatible withorient := 'genotype'.
Phased Output¶
With phased := true, genotype elements change from TINYINT to ARRAY(TINYINT, 2) representing [allele1, allele2] haplotype pairs. [0, 0] = hom ref, [0, 1] = REF|ALT, [1, 0] = ALT|REF, [1, 1] = hom alt, NULL = missing. Incompatible with dosages := true. See Genotype Encoding.
Filter Pushdown¶
af_range and ac_range filter variants by allele frequency or count using fast genotype counting (no decompression). genotype_range filters individual genotype values, setting non-matching values to NULL. See Common Parameters.
Projection Pushdown¶
Genotype decoding is skipped when genotype columns (genotypes or genotype) are not referenced in the query.
Parallel Scan¶
Default mode uses multi-threaded parallel scanning. Genotype orient mode is single-threaded because it expands each variant into multiple rows (one per sample). Sample orient mode is single-threaded (all genotypes are pre-read at bind time).
Examples¶
-- Read all variants in default mode
SELECT * FROM read_pfile('data/example');
-- Genotype orient mode: one row per variant-sample pair
SELECT chrom, pos, iid, genotype
FROM read_pfile('data/example', orient := 'genotype');
-- Sample orient: one row per sample
SELECT IID, genotypes
FROM read_pfile('data/example', orient := 'sample');
-- Sample subset by name
SELECT * FROM read_pfile('data/example',
orient := 'genotype',
samples := ['SAMPLE1', 'SAMPLE3']);
-- Region filter
SELECT * FROM read_pfile('data/example',
region := '22:1-50000000');
-- Variant filter by ID
SELECT * FROM read_pfile('data/example',
orient := 'genotype',
variants := ['rs1', 'rs2']);
-- Combined filters
SELECT iid, genotype
FROM read_pfile('data/example',
orient := 'genotype',
region := '1:10000-50000',
samples := ['SAMPLE1']);
-- Phased haplotype output
SELECT ID, genotypes
FROM read_pfile('data/example', phased := true);
-- Filter by allele frequency (common variants only)
SELECT * FROM read_pfile('data/example',
af_range := {min: 0.01, max: 0.5});
-- Columns pivot mode
SELECT * FROM read_pfile('data/example',
genotypes := 'columns');
See Also¶
- read_pgen -- lower-level genotype reader
- read_pvar -- variant metadata only
- read_psam -- sample metadata only
- Genotype Encoding -- encoding and phased output reference