API Reference

API documentation for the kpop module.

Individual

Each element of a population is an instance of kpop.Individual. A kpop.Individual behaves like a list of genotypes or like a 2D array of genotypes.

class kpop.Individual(data, id=None, population=None, allele_names=None, dtype=None, meta=None, admixture_q=None, num_alleles=None)

Represents a single individual genotype.

Genotype data must be an integer array of shape (num_loci, ploidy).

Parameters:
  • data – Either a string of values or a list of raw genotype values represented as integers.
  • population – Population to which the individual belongs.
num_loci

Number of loci in the raw genotype data

ploidy

Genotype’s ploidy

data

A numpy array of integers with genotype data. Allele types are represented sequentially by 1, 2, 3, etc. Missing data is represented by zero. By default, data is stored in uint8 form. This supports up to 255 different allele types plus zero.
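The layout described above can be illustrated with plain NumPy. This is only a sketch of the encoding; it does not call kpop, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical diploid genotype with 4 loci: rows are loci, columns are
# the allele copies. 0 marks missing data; 1, 2, 3, ... are allele types.
genotype = np.array(
    [[1, 2],
     [2, 2],
     [0, 1],   # first copy is missing at this locus
     [3, 1]],
    dtype=np.uint8,
)

num_loci, ploidy = genotype.shape
missing = int((genotype == 0).sum())  # number of missing allele entries
print(num_loci, ploidy, missing)
```

With uint8 storage each entry fits in one byte, which is where the limit of 255 allele types plus zero comes from.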

allele_names

A list of mappings from allele integer values to a character representation. If not given, it is inherited from the parent population.

breed(other, id=None, **kwargs)

Breeds with another individual.

Creates a new genotype in which features are selected from both parents.

copy(data=None, *, meta=<object object>, **kwargs)

Creates a copy of the individual.

count(value) → integer -- return number of occurrences of value
classmethod from_freqs(freqs, ploidy=2, **kwargs)

Returns a random individual from the given frequency distribution.

Parameters:
  • freqs – A frequency distribution. Can be a sequence of Prob() elements or a square array of frequencies.
  • ploidy – The individual’s ploidy.
  • **kwargs – Additional keyword arguments passed to the constructor.
Returns:

A new Individual instance.
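The idea of drawing an individual from per-locus allele frequencies can be sketched with NumPy alone. The function sample_genotype below is hypothetical and is not part of kpop; it assumes that row j of the frequency array holds the probabilities of alleles 1, 2, ... at locus j.

```python
import numpy as np

def sample_genotype(freqs, ploidy=2, seed=None):
    """Draw a (num_loci, ploidy) genotype from per-locus allele frequencies."""
    rng = np.random.default_rng(seed)
    freqs = np.asarray(freqs, dtype=float)
    alleles = np.arange(1, freqs.shape[1] + 1)  # allele labels 1, 2, ...
    # Each allele copy at a locus is drawn independently from that locus' row.
    rows = [rng.choice(alleles, size=ploidy, p=p) for p in freqs]
    return np.array(rows, dtype=np.uint8)

freqs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
genotype = sample_genotype(freqs, ploidy=2, seed=0)
print(genotype.shape)  # (3, 2)
```

The first two loci are fixed by construction, so only the third locus varies between draws.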

haplotypes()

Return a sequence with one array per haplotype (ploidy arrays in total).

This operation is a simple transpose of genotype data.
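As a NumPy-only illustration of the transpose mentioned above (the array is made up, and kpop itself is not used):

```python
import numpy as np

# Genotype data has shape (num_loci, ploidy) ...
genotype = np.array([[1, 2], [2, 2], [1, 1]], dtype=np.uint8)

# ... so its transpose has shape (ploidy, num_loci): one row per haplotype.
haplotypes = genotype.T
first, second = haplotypes
print(first.tolist(), second.tolist())  # [1, 2, 1] [2, 2, 1]
```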

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

render(id_align=None, max_loci=None)

Renders individual genotype.

render_csv(sep=', ')

Render individual in CSV.

render_ped(family_id='FAM001', individual_id=0, paternal_id=0, maternal_id=0, sex=0, phenotype=0, memo=None)

Render the individual as a line of a PLINK .ped file.

Parameters:
  • family_id – A string or number representing the individual’s family.
  • individual_id – A number representing the individual’s id.
  • paternal_id, maternal_id – A number representing the id of the individual’s father/mother.
  • sex – The sex (1=male, 2=female, other=unknown).
  • phenotype – A number representing the optional phenotype.
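For orientation, a PLINK .ped line starts with six whitespace-separated columns (family id, individual id, paternal id, maternal id, sex, phenotype), followed by the alleles of each marker. The helper ped_line below is a hypothetical sketch of that layout; the actual output of render_ped (separators, allele names) may differ.

```python
def ped_line(genotype, family_id="FAM001", individual_id=0,
             paternal_id=0, maternal_id=0, sex=0, phenotype=0):
    """Format a genotype (a list of per-locus allele lists) as one .ped line."""
    head = [family_id, individual_id, paternal_id, maternal_id, sex, phenotype]
    # Flatten the genotype locus by locus; 0 stands for a missing allele.
    alleles = [allele for locus in genotype for allele in locus]
    return " ".join(str(item) for item in head + alleles)

line = ped_line([[1, 2], [2, 2], [0, 1]], individual_id=7, sex=2)
print(line)  # FAM001 7 0 0 2 0 1 2 2 2 0 1
```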

Population objects

The main type in the kpop package is kpop.Population. A population is basically a list of individuals. Its interface is similar to that of a Python list or a NumPy array.

class kpop.Population(data=(), id=None, individual_ids=None, **kwargs)

A Population is a collection of individuals.

as_array(which='raw')

Convert to a numpy data array using the requested conversion method. This is a basic pre-processing step in many dimensionality reduction algorithms.

Genotypes are categorical data, and it usually doesn’t make sense to treat the integer encoding used in kpop as ordinal data (no ordering is implied between, say, allele 1, allele 2, and allele 3).

Conversion methods:
  • raw:
    A 3-dimensional array of shape (size, num_loci, ploidy) with raw genotype data. Each component represents the value of a single allele.
  • flat:
    Like raw, but flattens the last dimension into a (size, num_loci * ploidy) array. This creates a new feature per locus for each degree of ploidy in the data.
  • rflat:
    Flattens data, but first shuffles the positions of alleles at each locus. This is recommended if data does not carry reliable haplotype information.
  • raw-norm, flat-norm, rflat-norm:
    Normalized versions of the “raw”, “flat”, and “rflat” methods. All components are rescaled to zero mean and unit variance.
  • count:
    Forces conversion to biallelic data and counts the number of occurrences of the first allele. Most methods will require normalization, so you should probably consider a specific method such as count-unity, count-snp, etc.
  • count-norm:
    Normalized version of count, scaled to zero mean and unit variance.
  • count-snp:
    Normalizes each feature using the standard deviation expected under the assumption of Hardy-Weinberg equilibrium. This procedure is described in Patterson et al., “Population Structure and Eigenanalysis”, and is recommended for SNPs subject to genetic drift.
  • count-center:
    Instead of normalizing, simply centers data by subtracting half the ploidy to place it into a symmetric range. This puts data into a cube with a predictable origin and range. For diploid data, the components will be either -1, 0, or 1.
Returns: An ndarray with transformed data.
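Some of the conversion methods listed above can be approximated with a few lines of NumPy. The sketch below works on a made-up biallelic, diploid array without missing data and does not call kpop. The count-snp line follows the Patterson et al. normalization (center each locus and divide by sqrt(p * (1 - p))); writing p as the mean count divided by the ploidy is an assumption, and the exact scaling used by kpop may differ.

```python
import numpy as np

# Hypothetical population: 3 individuals, 2 loci, ploidy 2.
raw = np.array(
    [[[1, 1], [1, 2]],
     [[1, 2], [2, 2]],
     [[2, 2], [1, 2]]],
    dtype=np.uint8,
)
size, num_loci, ploidy = raw.shape

# "flat": merge the last two axes into num_loci * ploidy features.
flat = raw.reshape(size, num_loci * ploidy)

# "count": number of copies of allele 1 per individual and locus.
count = (raw == 1).sum(axis=2).astype(float)

# "count-center": subtract half the ploidy, giving -1, 0 or 1 here.
count_center = count - ploidy / 2

# "count-snp" (assumed form): center each locus and scale by sqrt(p * (1 - p)),
# where p is the estimated frequency of allele 1 at that locus.
mu = count.mean(axis=0)
p = mu / ploidy
count_snp = (count - mu) / np.sqrt(p * (1 - p))
```

A monomorphic locus has p equal to 0 or 1 and would produce a division by zero in the last step, so such loci need to be removed first.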
count(value) → integer -- return number of occurrences of value
drop_individuals(indexes, **kwargs)

Creates a new population without the individuals at the given indexes.

drop_loci(indexes, **kwargs)

Create a new population with all loci in the given indexes removed.

drop_missing_data(axis=0, thresh=0.0, **kwargs)

Drop all individuals or loci that have a proportion of missing data higher than the given threshold.

Parameters:
  • axis (0 or 1) – If axis=0 or ‘individuals’ (default), it drops the individuals whose proportion of missing data exceeds the threshold. If axis=1 or ‘loci’, it drops the loci whose proportion of missing data exceeds the threshold.
  • thresh (float, between 0 and 1) – The maximum proportion of missing data tolerated.
Returns:

A new population.

drop_non_biallelic(**kwargs)

Creates a new population removing all non-biallelic loci.

find_missing_data(axis=0, thresh=0.0)

Return the indexes of all individuals or loci that have a proportion of missing data higher than the given threshold.

Parameters:
  • axis (0 or 1) – If axis=0 or ‘individuals’ (default), it scans individuals for a proportion of missing data above the threshold. If axis=1 or ‘loci’, it scans loci instead.
  • thresh (float, between 0 and 1) – The maximum proportion of missing data tolerated.
Returns:

An array of indexes.
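The threshold logic shared by find_missing_data and drop_missing_data can be sketched on a raw (size, num_loci, ploidy) array. The functions find_missing and drop_missing are hypothetical stand-ins written with NumPy only; they assume that zeros mark missing alleles and that the comparison with thresh is strict.

```python
import numpy as np

def find_missing(data, axis=0, thresh=0.0):
    """Indexes of individuals (axis=0) or loci (axis=1) whose proportion
    of missing alleles (zeros) is higher than thresh."""
    missing = np.asarray(data) == 0            # shape (size, num_loci, ploidy)
    other_axes = (1, 2) if axis == 0 else (0, 2)
    proportion = missing.mean(axis=other_axes)
    return np.flatnonzero(proportion > thresh)

def drop_missing(data, axis=0, thresh=0.0):
    """Copy of data without the individuals or loci found by find_missing."""
    data = np.asarray(data)
    return np.delete(data, find_missing(data, axis, thresh), axis=axis)

data = np.array(
    [[[1, 2], [0, 0]],    # individual 0: 2 of 4 alleles missing
     [[1, 1], [2, 1]],    # individual 1: complete
     [[0, 1], [2, 2]]],   # individual 2: 1 of 4 alleles missing
    dtype=np.uint8,
)
bad_individuals = find_missing(data, axis=0, thresh=0.3)
kept = drop_missing(data, axis=0, thresh=0.3)
print(bad_individuals.tolist(), kept.shape)  # [0] (2, 2, 2)
```

With the default thresh=0.0, any individual or locus with at least one missing allele is selected.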

find_non_biallelic()

Finds all non-biallelic loci in population.

force_biallelic(**kwargs)

Return a new population with forced biallelic data.

If a locus has more than 2 alleles, the most common allele is picked as allele 1 and the alternate allele 2 comprises all the other alleles.
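The relabelling rule described above can be sketched as follows. The function is a hypothetical NumPy illustration, not the kpop implementation; it assumes that zeros are missing data and that loci with at most two observed alleles are left untouched.

```python
import numpy as np

def force_biallelic(data):
    """Collapse loci with more than two alleles into alleles 1 and 2."""
    data = np.asarray(data)
    out = data.copy()
    for j in range(data.shape[1]):
        locus = data[:, j, :]
        values, counts = np.unique(locus[locus != 0], return_counts=True)
        if values.size <= 2:
            continue  # nothing to collapse at this locus
        major = values[np.argmax(counts)]  # most common allele becomes 1
        out[:, j, :] = np.where(locus == 0, 0, np.where(locus == major, 1, 2))
    return out

data = np.array(
    [[[3, 3], [1, 2]],
     [[3, 1], [2, 2]],
     [[2, 0], [1, 1]]],
    dtype=np.uint8,
)
biallelic = force_biallelic(data)
print(biallelic[:, 0, :].tolist())  # [[1, 1], [1, 2], [2, 0]]
```

At the first locus, allele 3 is the most common and becomes allele 1, while alleles 1 and 2 are merged into allele 2.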

freqs

Return a list of Prob instances representing the frequencies in each locus.

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

random(size=0, num_loci=0, alleles=2, ploidy=2, id=None, seed=None)

Creates a new random population.

Parameters:
  • size – Number of individuals. If a list of numbers is given, creates a MultiPopulation object with sub-populations of the assigned sizes.
  • num_loci – Number of loci in the genotype.
  • alleles – Number of alleles for all loci.
  • ploidy – Ploidy of genotype.
  • min_prob – Minimum value for a frequency probability.
Returns:

A new population object.

shuffle_loci(**kwargs)

Return a copy with shuffled contents of each locus.

size

Return the number of items in a container.

sort_by_allele_freq(**kwargs)

Return a new population in which the alleles of each locus are renumbered by their frequency in the population. In the result, allele 1 is the most common, allele 2 is the second most common, and so on.
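A frequency-based renumbering of this kind can be sketched with NumPy. The function below is a hypothetical illustration on a raw (size, num_loci, ploidy) array; it assumes that zeros are missing data and stay zero, and it breaks ties in favor of the smaller original label.

```python
import numpy as np

def sort_by_allele_freq(data):
    """Relabel alleles at each locus so that 1 is the most frequent."""
    data = np.asarray(data)
    out = np.zeros_like(data)
    for j in range(data.shape[1]):
        locus = data[:, j, :]
        values, counts = np.unique(locus[locus != 0], return_counts=True)
        # Old labels ordered from most to least frequent.
        order = values[np.argsort(-counts, kind="stable")]
        for new_label, old_label in enumerate(order, start=1):
            out[:, j, :][locus == old_label] = new_label
    return out

data = np.array([[[1, 2]], [[2, 2]], [[3, 2]], [[3, 0]]], dtype=np.uint8)
relabelled = sort_by_allele_freq(data)
print(relabelled[:, 0, :].tolist())  # [[3, 1], [1, 1], [2, 1], [2, 0]]
```

Here allele 2 appears four times, allele 3 twice and allele 1 once, so they become alleles 1, 2 and 3 respectively.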

Population vs MultiPopulation

kpop uses two classes with essentially the same interface to represent populations. A MultiPopulation is a population structured into several sub-populations.

class kpop.MultiPopulation(populations=(), freqs=None, **kwargs)

A population formed by several sub-populations.

add_population(population)

Adds a new sub-population.

Parameters: population – A Population instance.
as_array(which='raw')

Convert to a numpy data array using the requested conversion method. This is a basic pre-processing step in many dimensionality reduction algorithms.

Genotypes are categorical data, and it usually doesn’t make sense to treat the integer encoding used in kpop as ordinal data (no ordering is implied between, say, allele 1, allele 2, and allele 3).

Conversion methods:
  • raw:
    A 3-dimensional array of shape (size, num_loci, ploidy) with raw genotype data. Each component represents the value of a single allele.
  • flat:
    Like raw, but flattens the last dimension into a (size, num_loci * ploidy) array. This creates a new feature per locus for each degree of ploidy in the data.
  • rflat:
    Flattens data, but first shuffles the positions of alleles at each locus. This is recommended if data does not carry reliable haplotype information.
  • raw-norm, flat-norm, rflat-norm:
    Normalized versions of the “raw”, “flat”, and “rflat” methods. All components are rescaled to zero mean and unit variance.
  • count:
    Forces conversion to biallelic data and counts the number of occurrences of the first allele. Most methods will require normalization, so you should probably consider a specific method such as count-unity, count-snp, etc.
  • count-norm:
    Normalized version of count, scaled to zero mean and unit variance.
  • count-snp:
    Normalizes each feature using the standard deviation expected under the assumption of Hardy-Weinberg equilibrium. This procedure is described in Patterson et al., “Population Structure and Eigenanalysis”, and is recommended for SNPs subject to genetic drift.
  • count-center:
    Instead of normalizing, simply centers data by subtracting half the ploidy to place it into a symmetric range. This puts data into a cube with a predictable origin and range. For diploid data, the components will be either -1, 0, or 1.
Returns: An ndarray with transformed data.
count(value) → integer -- return number of occurrences of value
drop_individuals(indexes, **kwargs)

Creates a new population without the individuals at the given indexes.

drop_loci(indexes, **kwargs)

Create a new population with all loci in the given indexes removed.

drop_missing_data(axis=0, thresh=0.0, **kwargs)

Drop all individuals or loci that have a proportion of missing data higher than the given threshold.

Parameters:
  • axis (0 or 1) – If axis=0 or ‘individuals’ (default), it drops the individuals whose proportion of missing data exceeds the threshold. If axis=1 or ‘loci’, it drops the loci whose proportion of missing data exceeds the threshold.
  • thresh (float, between 0 and 1) – The maximum proportion of missing data tolerated.
Returns:

A new population.

drop_non_biallelic(**kwargs)

Creates a new population removing all non-biallelic loci.

find_missing_data(axis=0, thresh=0.0)

Return the indexes of all individuals or loci that have a proportion of missing data higher than the given threshold.

Parameters:
  • axis (0 or 1) – If axis=0 or ‘individuals’ (default), it scans individuals for a proportion of missing data above the threshold. If axis=1 or ‘loci’, it scans loci instead.
  • thresh (float, between 0 and 1) – The maximum proportion of missing data tolerated.
Returns:

An array of indexes.

find_non_biallelic()

Finds all non-biallelic loci in population.

force_biallelic(**kwargs)

Return a new population with forced biallelic data.

If a locus has more than 2 alleles, the most common allele is picked as allele 1 and the alternate allele 2 comprises all the other alleles.

freqs

Return a list of Prob instances representing the frequencies in each locus.

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

random(size=0, num_loci=0, alleles=2, ploidy=2, id=None, seed=None)

Creates a new random population.

Parameters:
  • size – Number of individuals. If a list of numbers is given, creates a MultiPopulation object with sub-populations of the assigned sizes.
  • num_loci – Number of loci in the genotype.
  • alleles – Number of alleles for all loci.
  • ploidy – Ploidy of genotype.
  • min_prob – Minimum value for a frequency probability.
Returns:

A new population object.

shuffle_loci(**kwargs)

Return a copy with shuffled contents of each locus.

size

Return the number of items in a container.

slice_indexes(indexes)

Map indexes to a list of indexes for each sub-population.
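One plausible reading of this mapping is sketched below: global positions in the concatenated population are translated into positions local to each sub-population. The function and its sizes argument are hypothetical, and the actual return format of slice_indexes may differ.

```python
import bisect
from itertools import accumulate

def slice_indexes(indexes, sizes):
    """Split global indexes into one list of local indexes per sub-population.

    sizes holds the number of individuals in each sub-population.
    """
    ends = list(accumulate(sizes))          # cumulative end position of each block
    result = [[] for _ in sizes]
    for idx in indexes:
        k = bisect.bisect_right(ends, idx)  # sub-population that contains idx
        start = ends[k - 1] if k else 0
        result[k].append(idx - start)
    return result

# Three sub-populations with 3, 2 and 4 individuals.
local = slice_indexes([0, 4, 8, 3], sizes=[3, 2, 4])
print(local)  # [[0], [1, 0], [3]]
```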

sort_by_allele_freq(**kwargs)

Return a new population in which the alleles of each locus are renumbered by their frequency in the population. In the result, allele 1 is the most common, allele 2 is the second most common, and so on.

The .plot attribute

Each kpop.Population or kpop.MultiPopulation instance has a .plot attribute that defines a namespace with many different plotting utilities.

Other utility types

Representing probabilities

class kpop.prob.Prob(data, normalize=True, support=None)

A dictionary-like object that maps categories to their respective probabilities.

encode(coding=None)

Encode the probability distribution as a vector.

Parameters: coding – a sequence of ordered categories.

Example

>>> prob = Prob({'a': 0.75, 'b': 0.25})
>>> prob.encode(['b', 'a'])
[0.25, 0.75]
entropy()

Return the Shannon entropy for the probability distribution.
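For reference, the Shannon entropy of a distribution stored as a plain dictionary can be computed as in the sketch below. It uses the natural logarithm, which is an assumption; the base used by Prob.entropy() is not stated here.

```python
import math

def entropy(prob):
    """Shannon entropy (in nats) of a mapping from categories to probabilities."""
    # Terms with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in prob.values() if p > 0)

uniform = entropy({'a': 0.5, 'b': 0.5})   # equals ln(2)
certain = entropy({'a': 1.0, 'b': 0.0})   # equals 0
```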

kl_divergence(q: collections.abc.Mapping)

Return the Kullback-Leibler divergence relative to the probability distribution q.

This is given by the formula:

$KL = \sum_i p_i \ln \frac{p_i}{q_i},$

in which p_i comes from the probability object and q_i comes from the argument.
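The formula above translates directly into code when both distributions are plain dictionaries. The sketch is a standalone illustration, not the Prob implementation; it assumes that q assigns a non-zero probability to every category where p is non-zero.

```python
import math

def kl_divergence(p, q):
    """Sum of p_i * ln(p_i / q_i) over the categories of p."""
    # Categories with p_i == 0 contribute nothing to the sum.
    return sum(pi * math.log(pi / q[key]) for key, pi in p.items() if pi > 0)

p = {'a': 0.75, 'b': 0.25}
q = {'a': 0.5, 'b': 0.5}
divergence = kl_divergence(p, q)
same = kl_divergence(p, p)   # a distribution has zero divergence from itself
```

The divergence is not symmetric: swapping p and q generally gives a different value.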
max()

Return the value of the maximum probability.

classmethod mixture(coeffs, probs)

Create a mixture probability from the given coeffs and list of Prob objects.

Parameters:
  • coeffs – Mixture coefficients. These coefficients do not have to be normalized.
  • probs – List of Prob objects.
Returns:

A Prob object representing the mixture.
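A mixture of this kind is a weighted average of the component distributions. The sketch below uses plain dictionaries in place of Prob objects and divides by the sum of the coefficients, so that unnormalized coefficients are accepted; the function is hypothetical and only illustrates the arithmetic.

```python
def mixture(coeffs, probs):
    """Weighted average of distributions given as dicts of probabilities."""
    total = sum(coeffs)
    keys = set().union(*probs)   # every category seen in any component
    return {
        key: sum(c * p.get(key, 0.0) for c, p in zip(coeffs, probs)) / total
        for key in keys
    }

# Coefficients 1 and 3 are equivalent to weights 0.25 and 0.75.
mixed = mixture([1, 3], [{'a': 1.0}, {'a': 0.5, 'b': 0.5}])
```

In this example the result assigns 0.625 to 'a' and 0.375 to 'b'.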

mode()

Return the element with the maximum probability.

If more than one element shares the maximum probability, return an arbitrary value within this set.

mode_set()

Return a set of elements that share the maximum probability.

random()

Returns a random element.

random_sequence(size)

Returns a sequence of random elements.

set_support(support)

Defines the support set of the distribution.

Elements in support are forced to exist in the distribution, possibly with zero probability. If an element exists in the distribution but is not present in support, a ValueError is raised.
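The rule described above can be sketched with a plain dictionary standing in for the distribution. The function is a hypothetical illustration that returns a new dictionary, whereas the real method may modify the Prob object in place.

```python
def set_support(prob, support):
    """Return prob restricted to support, adding missing elements with probability 0."""
    support = set(support)
    extra = set(prob) - support
    if extra:
        # The distribution contains elements outside the requested support.
        raise ValueError(f"elements not in support: {extra!r}")
    return {key: prob.get(key, 0.0) for key in support}

extended = set_support({'a': 1.0}, ['a', 'b'])
print(extended == {'a': 1.0, 'b': 0.0})  # True
```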

sharp(mode_set=True)

Return a sharp version of the probability distribution.

All elements receive probability zero, except the mode which receives probability one.

update_support(support)

Force all elements in support to be explicitly present in the distribution (possibly with null probability).

Parameters: support – a list of elements in the support set of the probability distribution.

Utility modules

Plotting

kpop.plots contains a few useful plotting functions based on matplotlib.

Loading objects

Functions from the kpop.loaders module are responsible for loading Population objects from files.