Lecture 14 – Software for Population Genomic Analysis (PLINK)

Q1 Easy
What is PLINK primarily designed for?
A. De novo genome assembly from long reads
B. Whole genome association analysis and related large-scale genomic analyses
C. RNA-seq differential expression analysis
D. Multiple sequence alignment and phylogenetic tree construction
Explanation
PLINK is a free, open-source whole genome association analysis toolset designed to perform a range of basic, large-scale analyses in a computationally efficient manner. Its tasks include data management, quality control, population stratification detection, association testing, and more.
Q2 Medium
In a PLINK PED file, what are the first six mandatory columns (in order)?
A. Chromosome, SNP ID, Genetic distance, Position, Allele 1, Allele 2
B. Sample ID, Family ID, Sex, Phenotype, Paternal ID, Maternal ID
C. Family ID, Individual ID, Paternal ID, Maternal ID, Sex, Phenotype
D. Family ID, Individual ID, Sex, Phenotype, Paternal ID, Maternal ID
Explanation
The PED file has six mandatory columns in strict order: (1) Family ID, (2) Individual ID, (3) Paternal ID, (4) Maternal ID, (5) Sex (1=male, 2=female, 0=unknown), (6) Phenotype. The order matters — swapping Sex and Phenotype or rearranging IDs would break the format. Note that the PED file has NO header line.
Q3 Medium
How many columns does a PED file have if 500 biallelic SNP markers are genotyped for a diploid organism (assuming all six mandatory fields are present)?
A. 506
B. 500
C. 6 + 500 = 506
D. 6 + 2 × 500 = 1006
Explanation
The formula is: 6 + 2 × number of markers. The first 6 columns are the mandatory fields. Each marker requires 2 columns (one per allele in a diploid organism). So 6 + 2 × 500 = 1006 columns. This is a commonly tested calculation — don't forget the factor of 2!
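The column formula above can be checked with a short sketch. The function name and the `ploidy` parameter are illustrative and do not come from PLINK.

```python
def ped_column_count(n_markers: int, ploidy: int = 2) -> int:
    """Number of whitespace-separated columns in one PED line.

    Six mandatory fields come first, followed by one column per allele
    for every marker (two per marker in a diploid organism).
    """
    mandatory_fields = 6
    return mandatory_fields + ploidy * n_markers


print(ped_column_count(500))  # 1006
```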
Q4 Easy
Which four columns does a PLINK MAP file contain?
A. Chromosome, SNP identifier, Genetic distance (morgans), Base-pair position
B. Chromosome, SNP identifier, Minor allele frequency, P-value
C. Family ID, Individual ID, SNP identifier, Genotype
D. Chromosome, SNP identifier, Base-pair position, Allele
Explanation
Each line in a MAP file describes one marker with exactly 4 columns: (1) Chromosome, (2) rs# or SNP identifier, (3) Genetic distance in morgans, (4) Base-pair position. Note that genetic distance is often set to 0 when unknown.
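As a minimal sketch of the four-column layout described above, the following parser reads one MAP line. The record type, the function name, and the marker name `snp_example` are hypothetical and used only for illustration.

```python
from typing import NamedTuple


class MapRecord(NamedTuple):
    chrom: str
    snp_id: str
    genetic_distance: float  # in morgans; often 0 when unknown
    bp_position: int


def parse_map_line(line: str) -> MapRecord:
    """Split one MAP line into its four whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}")
    chrom, snp_id, dist, pos = fields
    return MapRecord(chrom, snp_id, float(dist), int(pos))


# Illustrative line: chromosome 1, made-up SNP name, unknown genetic distance.
marker = parse_map_line("1 snp_example 0 123456")
print(marker)
```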
Q5 Tricky
In the PED file, sex is encoded as numeric values. Which coding does PLINK use?
A. 0 = male, 1 = female, 2 = unknown
B. 1 = male, 2 = female, 0 = unknown
C. M = male, F = female, U = unknown
D. 1 = female, 2 = male, 0 = unknown
Explanation
PLINK uses: 1 = male, 2 = female, 0 = unknown. Option D is a common trap — it reverses male and female. This small detail is easily confused and exactly the kind of thing a professor might test.
Q6 Medium
Which PLINK flag is used to load text-format PED/MAP files?
A. --bfile
B. --ped
C. --file
D. --input
Explanation
--file is used to load text-format PED + MAP files (e.g., --file Altamurana looks for Altamurana.ped and Altamurana.map). --bfile is used for binary files (FAM, BIM, BED). This is an important and frequently tested distinction.
Q7 Medium
What three files constitute the PLINK binary file format?
A. .fam (individual info), .bim (marker info), .bed (genotypes)
B. .ped (individual info), .map (marker info), .log (results)
C. .fam (genotypes), .bim (individual info), .bed (marker info)
D. .fam (individual info), .bim (genotypes), .bed (marker info)
Explanation
The binary format has three files: .fam stores individual/phenotype info (analogous to first 6 columns of PED), .bim stores marker position info (analogous to MAP), and .bed stores genotypes in compressed binary. Options C and D are traps that shuffle which file stores what.
Q8 Tricky
A PED file contains a line: 1 3 2 1 1 1 A A T C. What can we conclude about individual 3?
A. Individual 3 is female, from family 1, with parents unknown
B. Individual 3 is female, father is individual 2, mother is individual 1
C. Individual 3 is male, father is individual 1, mother is individual 2
D. Individual 3 is male, father is individual 2, mother is individual 1, and is homozygous AA at locus 1
Explanation
Reading the columns: Family ID=1, Individual ID=3, Paternal ID=2, Maternal ID=1, Sex=1 (male), Phenotype=1. Then locus 1 = A A (homozygous), locus 2 = T C (heterozygous). The column order is Father then Mother (not the other way around), and sex=1 means male. Option B reverses the sex coding and parent assignment.
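The column-by-column reading above can be sketched as a small parser. This is an illustrative helper, not part of PLINK; the dictionary keys are chosen here for readability.

```python
SEX_CODES = {"1": "male", "2": "female"}  # any other code is treated as unknown


def parse_ped_line(line: str) -> dict:
    """Split a PED line into the six mandatory fields plus allele pairs."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a PED line needs at least six mandatory columns")
    fid, iid, father, mother, sex, phenotype = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two allele columns per marker")
    # Pair consecutive allele columns: (locus 1), (locus 2), ...
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {
        "family": fid,
        "individual": iid,
        "father": father,
        "mother": mother,
        "sex": SEX_CODES.get(sex, "unknown"),
        "phenotype": phenotype,
        "genotypes": genotypes,
    }


record = parse_ped_line("1 3 2 1 1 1 A A T C")
print(record["sex"], record["father"], record["mother"], record["genotypes"])
```

Applied to the line from the question, the record shows a male with father 2 and mother 1, homozygous at the first locus and heterozygous at the second.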
Q9 Easy
Does the PLINK PED file contain a header row?
A. Yes, the first row always lists column names
B. No, the PED file does not use any header
C. Only when the file is in binary format
D. Yes, but the header is optional
Explanation
The PED file does NOT use any header. Every row directly represents an individual. This is explicitly stated in the lecture and is a detail students might forget when working with other bioinformatics formats that do use headers (like VCF).
Q10 Medium
Which PLINK flag converts text PED/MAP files to binary BED/BIM/FAM format?
A. --make-bed
B. --recode
C. --convert-binary
D. --bfile
Explanation
--make-bed converts the input to binary format (.bed/.bim/.fam). --recode does the opposite — it outputs text PED/MAP format. --bfile is an input flag (to load binary files), not a conversion flag. Example: ./plink --file Altamurana --make-bed --out Altamurana_binary.
Q11 Medium
What does the PLINK flag --mind 0.1 do during quality control?
A. Excludes SNPs with more than 10% missing genotypes
B. Includes only individuals with at least 10% heterozygosity
C. Excludes samples (individuals) with more than 10% missing genotypes
D. Sets the minor allele frequency threshold to 0.1
Explanation
--mind filters individuals (samples), not SNPs. It removes samples with a missing genotype rate above the specified threshold (here 10%). The equivalent filter for SNPs is --geno. This is a classic exam trap: confusing --mind (individuals) with --geno (SNPs).
Q12 Tricky
Which of the following PLINK QC commands is correctly described?
A. --geno 0.1 excludes individuals with more than 10% missing data
B. --geno 0.1 includes only SNPs with a genotyping rate of at least 90%
C. --maf 0.05 removes SNPs with a minor allele frequency above 0.05
D. --hwe 0.01 removes SNPs that are in Hardy-Weinberg equilibrium
Explanation
Option B is correct: --geno 0.1 keeps only SNPs with ≤10% missing data (i.e., a genotyping rate of ≥90%). Option A confuses --geno (SNPs) with --mind (individuals). Option C is reversed: --maf 0.05 keeps SNPs with MAF ≥ 0.05 and removes those below it. Option D is also reversed: --hwe 0.01 keeps SNPs with an HWE p-value ≥ 0.01 and removes those that deviate significantly from HWE.
Q13 Medium
What is the default minor allele frequency (MAF) threshold in PLINK if --maf is used without a specified value?
A. 0.05
B. 0.10
C. 0.001
D. 0.01
Explanation
The default MAF threshold in PLINK is 0.01. Many students assume it's 0.05 because that's the most commonly used value in practice, but the actual default is 0.01. The lecture explicitly states: "include SNPs with MAF >= 0.05. The default value is 0.01."
Q14 Easy
Which PLINK command generates allele frequency statistics?
A. --freq
B. --hardy
C. --maf
D. --assoc
Explanation
--freq generates a .frq file with allele frequencies (CHR, SNP, A1, A2, MAF, NCHROBS). --hardy tests for Hardy-Weinberg equilibrium. --maf is a QC filter, not a statistics command. --assoc performs association testing.
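To make the .frq columns concrete, here is a minimal sketch that computes A1, A2, MAF, and NCHROBS for a single biallelic SNP from toy genotypes. It is an illustration under simple assumptions (missing alleles coded as "0", at most two alleles), not PLINK's own implementation, and the genotypes are invented.

```python
from collections import Counter


def allele_frequency(genotypes, missing="0"):
    """Summarise one SNP given a list of (allele, allele) pairs.

    A1 is the minor allele, A2 the major allele, MAF the minor allele
    frequency, and NCHROBS the number of non-missing allele observations.
    """
    counts = Counter(a for pair in genotypes for a in pair if a != missing)
    nchrobs = sum(counts.values())
    if nchrobs == 0:
        return {"A1": missing, "A2": missing, "MAF": None, "NCHROBS": 0}
    ranked = counts.most_common()  # most frequent allele first
    major = ranked[0][0]
    if len(ranked) == 1:
        # Monomorphic marker: no minor allele observed.
        return {"A1": missing, "A2": major, "MAF": 0.0, "NCHROBS": nchrobs}
    minor, minor_count = ranked[1]
    return {"A1": minor, "A2": major, "MAF": minor_count / nchrobs, "NCHROBS": nchrobs}


# Toy data: five individuals, the last one with a missing genotype.
toy = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "A"), ("0", "0")]
result = allele_frequency(toy)
print(result)  # {'A1': 'G', 'A2': 'A', 'MAF': 0.375, 'NCHROBS': 8}
```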
Q15 Medium
What is the main purpose of Multidimensional Scaling (MDS) in PLINK?
A. To identify runs of homozygosity across the genome
B. To compute the minor allele frequency of each SNP
C. To represent high-dimensional genetic data in a low-dimensional space to detect population stratification
D. To perform genome-wide association tests for quantitative traits
Explanation
MDS is a dimensionality reduction technique. In PLINK, it compresses the information contained in thousands of SNPs into a 2D (or few-D) space so you can visualize population structure. Clusters in the MDS plot often correspond to distinct breeds or populations. It operates on a genome-wide IBS (identity by state) pairwise distance matrix.
Q16 Hard
To perform MDS analysis in PLINK, which two-step process is required?
A. First run --freq, then run --mds-plot
B. First compute --genome (IBS pairwise distances), then run --cluster --mds-plot
C. First run --hardy, then run --cluster --mds-plot
D. First run --assoc, then run --mds-plot
Explanation
MDS requires two steps: (1) compute genome-wide IBS pairwise distances with --genome, which produces a .genome file; (2) load that file with --read-genome and run --cluster --mds-plot N where N is the number of dimensions. The lecture shows: step 1: --genome --out Al_Ap-Ba_genome, step 2: --read-genome ... --cluster --mds-plot 2.
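The two steps can be sketched as a pair of command lines assembled in Python. Only the flags named above are used; the dataset prefix `mydata`, the output names, and the executable name `plink` are placeholders, and the commands are only built and printed, not executed.

```python
def mds_commands(prefix, out_prefix, dims=2, plink="plink"):
    """Build the argument lists for the two-step MDS workflow.

    Step 1 writes <out_prefix>_genome.genome; step 2 reads that file back
    and requests an MDS plot with `dims` dimensions.
    """
    genome_out = f"{out_prefix}_genome"
    step1 = [plink, "--file", prefix, "--genome", "--out", genome_out]
    step2 = [
        plink, "--file", prefix,
        "--read-genome", f"{genome_out}.genome",
        "--cluster", "--mds-plot", str(dims),
        "--out", f"{out_prefix}_mds",
    ]
    return step1, step2


step1, step2 = mds_commands("mydata", "mydata")
print(" ".join(step1))
print(" ".join(step2))
```

Keeping the commands as lists makes it straightforward to hand them to `subprocess.run` once the actual file names are known.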
Q17 Easy
What are Runs of Homozygosity (ROH)?
A. Short regions with high heterozygosity across SNPs
B. Regions of copy number variation detected by aCGH
C. Inversions in the chromosome that prevent recombination
D. Long stretches of chromosome regions that are homozygous at each polymorphic position
Explanation
ROH are contiguous stretches of the genome where an individual is homozygous at every (or nearly every) SNP position. They arise from autozygosity — inheriting two copies of the same ancestral haplotype. ROH are indicators of inbreeding level.
Q18 Hard
In the PLINK ROH analysis shown in the lecture, which parameters were used for the sliding window?
A. Window of 1000 kbp, 0 heterozygous SNPs allowed, max 5 missing, final ROH ≥15 SNPs, density 1 SNP per 100 kb
B. Window of 500 kbp, 1 heterozygous SNP allowed, max 3 missing, final ROH ≥20 SNPs, density 1 SNP per 50 kb
C. Window of 1000 kbp, 1 heterozygous SNP allowed, max 5 missing, final ROH ≥15 SNPs, density 1 SNP per 100 kb
D. Window of 2000 kbp, 0 heterozygous SNPs allowed, max 10 missing, final ROH ≥30 SNPs, density 1 SNP per 200 kb
Explanation
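The lecture example uses: --homozyg-kb 1000 (1000 kbp = 1 Mbp; the lecture and the options above call this the window size, although in PLINK this flag sets the minimum length of a reported ROH), --homozyg-window-het 0 (no heterozygous SNPs allowed in a window), --homozyg-window-missing 5 (max 5 missing calls in a window), --homozyg-snp 15 (minimum 15 SNPs per ROH), --homozyg-density 100 (at least 1 SNP per 100 kb). Option C is the main trap: it allows 1 heterozygous SNP, but the lecture explicitly sets this to 0.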
Q19 Medium
What does the PLINK .hom.indiv output file contain?
A. A list of every SNP that falls within any ROH
B. One row per identified homozygous region with start/end positions
C. A per-individual summary including number of ROH segments (NSEG) and total ROH length (KB)
D. A summary of ROH frequency per chromosome
Explanation
The .hom.indiv file provides a per-individual summary with columns FID, IID, PHE, NSEG (number of segments), KB (total ROH length), and KBAVG (average ROH size). The .hom file (not .hom.indiv) contains one row per individual ROH region. These two are often confused.
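The relationship between the per-segment .hom rows and the per-individual .hom.indiv summary can be illustrated with a minimal sketch. The segment lengths are invented, and the function is an illustration rather than PLINK code.

```python
def summarize_roh(segment_lengths_kb):
    """Collapse one individual's ROH segment lengths (kb) into summary values."""
    nseg = len(segment_lengths_kb)
    total_kb = sum(segment_lengths_kb)
    kbavg = total_kb / nseg if nseg else 0.0
    return {"NSEG": nseg, "KB": total_kb, "KBAVG": kbavg}


# Three hypothetical ROH segments for one individual.
summary = summarize_roh([1500.0, 2500.0, 8000.0])
print(summary)  # {'NSEG': 3, 'KB': 12000.0, 'KBAVG': 4000.0}
```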
Q20 Medium
How is the genomic inbreeding coefficient FROH calculated?
A. FROH = Number of ROH segments / Total number of SNPs
B. FROH = Total length of all ROHs / Length of the autosomal genome
C. FROH = Average ROH length / Longest chromosome length
D. FROH = Number of homozygous SNPs / Total number of SNPs
Explanation
FROH = LROH / Laut, where LROH is the total length of all ROHs and Laut is the total length of the autosomal genome. Option D describes overall homozygosity but not the ROH-based inbreeding coefficient specifically. The distinction is important: FROH uses physical length of ROH segments, not just SNP counts.
Q21 Medium
What do short ROH versus long ROH indicate about an individual's ancestry?
A. Short ROH suggest remote common ancestors; long ROH suggest recent inbreeding
B. Short ROH suggest recent inbreeding; long ROH suggest remote ancestors
C. Both short and long ROH indicate recent inbreeding equally
D. ROH length is unrelated to the timing of inbreeding events
Explanation
Short ROH originate from remote common ancestors because recombination over many generations breaks up long stretches of DNA. Long ROH indicate autozygosity from more recent ancestors because fewer recombination events have had time to disrupt them. This is a key concept from Ceballos et al., 2018.
Q22 Medium
In PLINK GWAS for a quantitative trait, which statistical model is used?
A. Chi-squared contingency table test
B. Logistic regression
C. Fisher's exact test
D. Generalized linear model (GLM) / linear regression
Explanation
For quantitative traits, GWAS uses generalized linear models (GLM). For dichotomous case/control traits, contingency table methods or logistic regression are used. The lecture explicitly distinguishes these two approaches. The --assoc flag with quantitative phenotypes uses linear regression.
Q23 Hard
In a linear mixed model for GWAS (Y = Xb + Zu + e), what do the terms represent?
A. Xb = random genetic effects, Zu = fixed environmental effects, e = phenotype
B. Xb = genotype encoding, Zu = linkage disequilibrium, e = Hardy-Weinberg deviation
C. Xb = fixed effects (known constants), Zu = random effects (from subsampling), e = residual error
D. Xb = random effects, Zu = fixed effects, e = environmental variance
Explanation
In Y = Xb + Zu + e: Xb represents fixed effects (known constants that remain the same over repeated sampling, e.g., sex, age, SNP genotype), Zu represents random effects (random variables arising from subsampling, e.g., population structure), and e is the residual error. Options A and D swap fixed and random effects.
Q24 Medium
Why is covariate adjustment important in GWAS?
A. It increases the number of SNPs tested, improving genome coverage
B. It reduces spurious associations due to sampling artifacts, biases, or population substructure
C. It converts a quantitative trait into a case/control phenotype
D. It eliminates the need for quality control filtering
Explanation
Covariate adjustment reduces spurious associations caused by confounders like sex, age, study site, or population substructure. However, it comes at a cost: each additional covariate uses degrees of freedom, potentially reducing statistical power. Population substructure is noted as one of the most important covariates to consider.
Q25 Easy
Which PLINK command performs a basic quantitative trait association test (GWAS)?
A. --assoc
B. --genome
C. --homozyg
D. --freq
Explanation
--assoc performs association testing. For quantitative traits it produces a .qassoc file. The lecture example: ./plink --file Cattle --assoc --out GWAS_stature_cattle_no_covariates. --genome computes IBS distances, --homozyg detects ROH, and --freq calculates allele frequencies.
Q26 Tricky
An advantage of SNP panels listed in the lecture is that they provide "the most comprehensive view of the genome." However, what is a key limitation NOT mentioned?
A. Low per-sample cost
B. Scalable workflow for large populations
C. SNP panels are limited to pre-selected markers and cannot discover novel variants
D. They detect both SNPs and other variations across the genome
Explanation
The lecture lists many advantages of SNP panels (low cost, scalable, comprehensive, good data quality). However, a fundamental limitation is ascertainment bias: SNP panels only genotype pre-selected markers. They cannot discover new/novel variants the way whole-genome sequencing can. Options A, B, and D are stated advantages, not limitations.
Q27 Medium
In the PLINK .hom output file, which columns describe the boundaries of an identified ROH?
A. CHR, NSNP, DENSITY, PHOM
B. SNP1, SNP2, POS1, POS2
C. FID, IID, KB, KBAVG
D. A1, A2, MAF, NCHROBS
Explanation
In the .hom file, SNP1 and SNP2 are the SNPs at the start and end of the ROH, while POS1 and POS2 are the physical positions (bp) of those boundary SNPs. Option D describes columns from a .frq file (allele frequency output), and option C mixes .hom.indiv columns.
Q28 Tricky
You run wc -l Altamurana.ped and get 24. You also run wc -l Altamurana.map and get 54241. What do these numbers tell you?
A. 24 SNPs and 54241 individuals
B. 24 families and 54241 chromosomes
C. 24 individuals and 54241 alleles
D. 24 individuals (animals) and 54241 DNA markers (SNPs)
Explanation
In a PED file, each line = one individual, so 24 lines = 24 animals. In a MAP file, each line = one marker, so 54241 lines = 54241 SNPs. This is shown directly in the lecture as a quick way to check dataset dimensions using wc -l.
Q29 — Open Calculation
An individual has the following ROH data from PLINK: total ROH length (LROH) = 400,646 kb. The autosomal genome length of the species is 2,500,000 kb. Calculate the genomic inbreeding coefficient FROH for this individual.
✓ Model Answer

The formula for the genomic inbreeding coefficient is:

FROH = LROH / Laut
FROH = 400,646 kb / 2,500,000 kb
FROH = 0.1603 (or approximately 16.0%)

This means about 16% of this individual's autosomal genome is covered by runs of homozygosity, indicating a moderate level of genomic inbreeding. The population mean FROH would be calculated as the average FROH across all individuals.
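The calculation above can be reproduced with a short sketch. The function names are illustrative, and the population mean is shown only as the simple average described in the answer.

```python
def f_roh(total_roh_kb: float, autosome_kb: float) -> float:
    """Genomic inbreeding coefficient: total ROH length / autosomal genome length."""
    if autosome_kb <= 0:
        raise ValueError("autosomal genome length must be positive")
    return total_roh_kb / autosome_kb


def mean_f_roh(values):
    """Population mean FROH as the average over individuals."""
    return sum(values) / len(values)


froh = f_roh(400_646, 2_500_000)
print(round(froh, 4))  # 0.1603
```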

Q30 — Open Short Answer
Explain the four main PLINK quality control filters (--mind, --geno, --maf, --hwe). For each, state what it filters (individuals or SNPs) and what criterion is applied.
✓ Model Answer

--mind [threshold]: Filters individuals. Excludes samples with a proportion of missing genotypes exceeding the threshold. Example: --mind 0.1 removes individuals with >10% missing data.

--geno [threshold]: Filters SNPs. Excludes markers with a proportion of missing genotypes exceeding the threshold. Example: --geno 0.1 removes SNPs with >10% missing data (i.e., keeps SNPs with ≥90% call rate).

--maf [threshold]: Filters SNPs. Excludes markers with a minor allele frequency below the threshold. Example: --maf 0.05 removes SNPs with MAF < 0.05 (removes very rare variants or monomorphic markers). Default is 0.01.

--hwe [threshold]: Filters SNPs. Excludes markers whose Hardy-Weinberg equilibrium test p-value falls below the threshold. Example: --hwe 0.01 removes SNPs with HWE p < 0.01 (those significantly deviating from HWE, which may indicate genotyping errors).
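The individual-versus-SNP distinction between --mind and --geno can be sketched on a toy genotype table. This is not PLINK's implementation: the data layout (one list of genotype strings per individual, "00" for a missing call), the function names, and the order in which the two filters are applied are all assumptions made for illustration.

```python
MISSING = "00"  # assumed code for a missing genotype call


def apply_mind(matrix, threshold):
    """Keep individuals (rows) whose missing-genotype rate is <= threshold."""
    return {
        iid: calls
        for iid, calls in matrix.items()
        if calls.count(MISSING) / len(calls) <= threshold
    }


def apply_geno(matrix, threshold):
    """Return indices of SNPs (columns) whose missing rate is <= threshold."""
    rows = list(matrix.values())
    n_individuals = len(rows)
    n_snps = len(rows[0])
    return [
        j
        for j in range(n_snps)
        if sum(row[j] == MISSING for row in rows) / n_individuals <= threshold
    ]


# Toy data: three individuals genotyped at four SNPs.
toy = {
    "ind1": ["AA", "AG", "GG", "AA"],  # 0% missing
    "ind2": ["AA", "00", "GG", "AG"],  # 25% missing
    "ind3": ["00", "00", "00", "AA"],  # 75% missing
}

kept_individuals = apply_mind(toy, threshold=0.3)
kept_snps = apply_geno(kept_individuals, threshold=0.1)
print(sorted(kept_individuals))  # ['ind1', 'ind2']
print(kept_snps)  # [0, 2, 3]
```

The row filter removes ind3, and the column filter then drops the second SNP, which is missing in one of the two remaining individuals.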

Q31 — Open Short Answer
Describe the relationship between ROH length and inbreeding history. How can ROH length class distributions (e.g., 1–2 Mb, 2–4 Mb, 4–8 Mb, 8–16 Mb, >16 Mb) help reconstruct the demographic history of a breed?
✓ Model Answer

Short ROH (e.g., 1–4 Mb): Originate from remote common ancestors. Over many generations, recombination breaks long ancestral haplotypes into smaller fragments. A breed with predominantly short ROH likely experienced background relatedness long ago but has maintained a relatively large effective population size recently.

Long ROH (e.g., >8 Mb or >16 Mb): Indicate recent inbreeding, because few meiotic recombination events have occurred since the common ancestor. A high frequency of long ROH suggests recent bottlenecks, small population sizes, or close mating.

Demographic reconstruction: By plotting the frequency distribution of ROH across length classes, researchers can infer the timing and severity of inbreeding events. A breed with many long ROH has experienced recent, intense inbreeding. A breed with mainly short ROH has ancient background inbreeding but recent outcrossing. Additionally, plotting total ROH coverage (SROH) vs. number of ROH segments per individual helps distinguish populations: many short segments = ancient inbreeding; fewer but longer segments = recent inbreeding.
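A minimal sketch of the length-class tabulation described above, using the classes from the question. The boundary handling (half-open intervals, so a 2 Mb ROH falls in the 2-4 Mb class and a 16 Mb ROH in the top class) and the example lengths are assumptions for illustration.

```python
from collections import Counter

CLASS_BOUNDS_MB = [(1, 2), (2, 4), (4, 8), (8, 16)]


def roh_class(length_mb: float) -> str:
    """Assign one ROH to a length class (lower bound inclusive, upper exclusive)."""
    if length_mb < 1:
        return "<1 Mb"
    for low, high in CLASS_BOUNDS_MB:
        if low <= length_mb < high:
            return f"{low}-{high} Mb"
    return ">16 Mb"  # also used for exactly 16 Mb under this convention


def class_distribution(lengths_mb):
    """Count ROH per length class for one individual or one breed."""
    return Counter(roh_class(length) for length in lengths_mb)


# Hypothetical ROH lengths in Mb.
distribution = class_distribution([1.2, 1.8, 3.0, 9.5, 20.0])
print(dict(distribution))
```

Comparing such counts between breeds shows whether short classes (older inbreeding) or long classes (recent inbreeding) dominate.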

Q32 — Open Tricky
You want to construct a PED file for 3 individuals genotyped at 5 loci. Individual A is a female (family 1, no parents known, phenotype = 160, genotypes: AA, TC, GG, AT, CC). Individual B is a male (family 1, no parents known, phenotype = 185, genotypes: AG, TT, GA, AA, CT). Individual C is male (family 1, father = B, mother = A, phenotype = 175, genotypes: AG, TC, GA, AT, CC). Write the complete PED file content.
✓ Model Answer

Remember: columns are FamilyID, IndividualID, PaternalID, MaternalID, Sex (1=M, 2=F), Phenotype, then 2 columns per locus. No header row!

1 A 0 0 2 160 A A T C G G A T C C
1 B 0 0 1 185 A G T T G A A A C T
1 C B A 1 175 A G T C G A A T C C

Key details: (1) No header; (2) Unknown parents = 0; (3) Sex: A is female → 2, B and C are male → 1; (4) Each genotype takes 2 columns (one per allele); (5) Individual C has father=B and mother=A (paternal before maternal). Total columns = 6 + 2×5 = 16.
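The same PED content can be generated programmatically. The sketch below is illustrative: the helper name and the input layout (genotypes given as two-letter strings) are assumptions, not a PLINK interface.

```python
SEX_TO_CODE = {"male": "1", "female": "2"}  # anything else is written as 0


def ped_line(fid, iid, father, mother, sex, phenotype, genotypes):
    """Format one individual as a space-separated PED line (no header)."""
    sex_code = SEX_TO_CODE.get(sex, "0")
    # Each two-letter genotype contributes two allele columns.
    alleles = [allele for genotype in genotypes for allele in genotype]
    fields = [str(fid), iid, father, mother, sex_code, str(phenotype), *alleles]
    return " ".join(fields)


individuals = [
    (1, "A", "0", "0", "female", 160, ["AA", "TC", "GG", "AT", "CC"]),
    (1, "B", "0", "0", "male", 185, ["AG", "TT", "GA", "AA", "CT"]),
    (1, "C", "B", "A", "male", 175, ["AG", "TC", "GA", "AT", "CC"]),
]

ped_lines = [ped_line(*ind) for ind in individuals]
print("\n".join(ped_lines))
```

Each generated line has 6 + 2 × 5 = 16 columns, matching the count given above.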