Tutorial¶
Getting started¶
We assume you have already installed Kpop. The easiest way to start is to type the following command in the terminal:
$ kpop shell
This will start a Python session with all basic Kpop functionality available. It also tries to load all population files in the current directory, so they become easily available.
The kpop shell is just a convenience method of starting an IPython shell with a few useful imports:
This is useful for interactive and exploratory work. However, for more serious jobs you should probably make those imports manually and avoid the star import in the first line.
Basic Kpop concepts¶
You will notice that most Kpop interactions go through two main object types: kpop.Individual and kpop.Population. Let us start with the first of these two, kpop.Individual, which represents single individuals by their corresponding genotypes.
Individual¶
A kpop.Individual instance behaves essentially as a list of genotype values. Kpop represents genotypes by numbers, where zero encodes missing data and positive integers (1, 2, and so on) identify each allele. We can create a new individual from a list of pairs of numbers:
>>> ind = Individual([[1, 1], [1, 2], [2, 2], [1, 2]])
This is a genotype with 4 loci of biallelic data. As you might expect, it behaves just like a list of genotypes, one per locus. It accepts Python indexing, slicing and iteration:
>>> ind[0]
array([1, 1], dtype=uint8)
>>> [(1 in locus) for locus in ind]
[True, True, False, True]
kpop.Individual objects can also be inspected in several ways.
>>> ind.num_loci, ind.ploidy, ind.is_biallelic
(4, 2, True)
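To make these attributes concrete, the sketch below recomputes them from the same nested list in plain Python, without Kpop. The helper name describe_genotype is hypothetical, and reading is_biallelic as "no allele number exceeds 2" is an assumption rather than Kpop's documented definition.

```python
def describe_genotype(data):
    """Return (num_loci, ploidy, is_biallelic) for a nested list of alleles.

    Hypothetical helper for illustration only; zero encodes missing data.
    """
    num_loci = len(data)
    ploidy = len(data[0]) if data else 0
    # Assumption: "biallelic" means only alleles 1 and 2 (or missing) occur.
    is_biallelic = all(allele <= 2 for locus in data for allele in locus)
    return num_loci, ploidy, is_biallelic


genotype = [[1, 1], [1, 2], [2, 2], [1, 2]]
summary = describe_genotype(genotype)
print(summary)
```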
You should use the autocomplete feature of Kpop’s shell to discover more attributes. Just type ind. and hit the <tab> key to see a list of completions. Some of those options are methods (you will recognize them by the open-close parentheses at the end of their names). To get help on a method's behavior and signature, just use the ? helper as below:
>>> ind.breed?
Signature: ind.breed(other, id=None, **kwargs)
Docstring:
Breeds with other individual.

Creates a new genotype in which features are selected from both
parents.
File: ~/git/bio/kpop/src/kpop/individual.py
Type: method
You will notice that if you print an Individual in the terminal, it will be shown with the following notation:
>>> ind
Individual('ind: 11 12 22 12')
This is actually a different way to construct kpop.Individual instances. The part of the string before the colon is a label used to identify the given individual, and everything on the right-hand side is its genotype.
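As an illustration of how this notation maps onto the list-of-pairs form used earlier, here is a minimal plain-Python sketch. The function parse_individual is hypothetical (it is not part of Kpop), and it assumes single-digit alleles with one whitespace-separated token per locus.

```python
def parse_individual(text):
    """Split 'label: 11 12 ...' into a label and a list of allele lists.

    Hypothetical helper; assumes single-digit alleles, one token per locus.
    """
    label, _, genotype = text.partition(":")
    loci = [[int(ch) for ch in token] for token in genotype.split()]
    return label.strip(), loci


label, loci = parse_individual("ind: 11 12 22 12")
print(label, loci)
```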
Let us create a second individual to interact with the first.
>>> ind2 = Individual('ind2: 22 11 12 12')
>>> ind2.breed(ind)
Individual('ind2_: 21 12 12 12')
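The result above is consistent with each locus of the child receiving one allele from each parent. The sketch below reproduces that idea on plain lists. It is only an assumption about what breeding means here (Mendelian-style inheritance for diploid individuals) and is not taken from Kpop's implementation; the function name and the rng parameter are illustrative.

```python
import random


def breed(parent_a, parent_b, rng=random):
    """Build a child by drawing one allele per locus from each parent.

    Assumption: diploid parents with the same number of loci. This mimics
    Mendelian inheritance and is not taken from Kpop's source code.
    """
    return [
        [rng.choice(locus_a), rng.choice(locus_b)]
        for locus_a, locus_b in zip(parent_a, parent_b)
    ]


ind = [[1, 1], [1, 2], [2, 2], [1, 2]]
ind2 = [[2, 2], [1, 1], [1, 2], [1, 2]]
child = breed(ind2, ind, rng=random.Random(0))
print(child)
```

Because both parents are homozygous at the first locus, the child's first locus is always 2 followed by 1; the remaining loci depend on the random draws.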
Of course, handling a handful of individuals is not very useful. Let us create a list of individuals by drawing samples from a specific probability distribution. First, define a list of probabilities for each allele at each locus:
>>> freqs = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
Now we can create a random individual using the from_freqs method of the Individual class:
>>> random_ind = Individual.from_freqs(freqs)
... and now we create a bunch:
>>> list_of_individuals = []
>>> for _ in range(10):
...     new_ind = Individual.from_freqs(freqs)
...     list_of_individuals.append(new_ind)
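To see what sampling from allele frequencies involves, here is a plain-Python sketch that does not use Kpop. It assumes that every allele copy is drawn independently and that the k-th probability in a locus refers to allele number k + 1; whether from_freqs works exactly this way is not confirmed here. The names sample_genotype, ploidy and rng are illustrative.

```python
import random


def sample_genotype(freqs, ploidy=2, rng=random):
    """Draw a random genotype from per-locus allele frequencies.

    Assumption: every allele copy is drawn independently, and the k-th
    probability in a locus refers to allele number k + 1.
    """
    genotype = []
    for locus_freqs in freqs:
        alleles = list(range(1, len(locus_freqs) + 1))
        genotype.append(rng.choices(alleles, weights=locus_freqs, k=ploidy))
    return genotype


freqs = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
rng = random.Random(42)  # fixed seed so the run is repeatable
sample = [sample_genotype(freqs, rng=rng) for _ in range(10)]
for genotype in sample:
    # Print each genotype in the compact "22 12 11 12" notation.
    print(" ".join("".join(map(str, locus)) for locus in genotype))
```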
Population¶
Now that we have a bunch of individuals, we can make a population. Of course we could use the list of individuals directly, but Kpop provides the much more convenient kpop.Population type to represent a group of individuals.
>>> popA = Population(list_of_individuals, id='A')
>>> popA
ind1: 22 21 12 22
ind2: 22 11 11 21
ind3: 22 11 11 21
ind4: 22 11 11 21
ind5: 22 11 11 12
ind6: 22 22 11 21
ind7: 22 11 11 21
ind8: 22 21 11 22
ind9: 22 12 11 12
ind10: 22 22 11 21
We created the Population object from a list of individuals and gave it an optional label. The label is used to identify the population in several different contexts such as clustering, plotting, etc.
Just like kpop.Individual instances, kpop.Population objects have many associated methods and attributes. You can explore them by typing popA. and hitting the <tab> key (you will notice the list is much longer than for Individual instances).
In population genetics we are usually interested in comparing different populations rather than different individuals in the same population. We can easily create a new random population using the Population.random method:
>>> popB = Population.random(10, num_loci=4, id='B')
This will create a new population with 10 individuals and 4 loci. Now, let us combine this population with the previous one by creating a new generation that breeds individuals from the first population with individuals from the second:
>>> popC = popA.simulation.breed(popB, size=15, id='C')
We can combine all sub-populations into a single population containing all individuals by simply adding the population objects together
>>> pop_all = popA + popB + popC
This creates a kpop.MultiPopulation object, which behaves essentially as a Population but keeps track of sub-structuring.
Visualization¶
Kpop implements a few visualization methods through the Population.plot
attribute. The population.plot.?
namespace has methods for dimensionality
reduction (such as PCA),
Statistics¶
Admixture¶
Admixture analysis is the task of estimating the admixture coefficients of each individual in a population. This is the main concern of programs such as Structure and ADMIXTURE.
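In the model commonly used by such programs, the probability of observing an allele in an individual is the average of the parental allele frequencies, weighted by that individual's admixture coefficients. The sketch below computes this for one individual in plain Python. The parental frequencies and coefficients are made-up numbers, and admixed_frequencies is a hypothetical helper, not a Kpop function.

```python
def admixed_frequencies(coefficients, parental_freqs):
    """Mix per-locus allele frequencies using admixture coefficients.

    coefficients: one weight per parental population, summing to one.
    parental_freqs: for each population, the frequency of allele 1 at
    each locus.
    """
    if abs(sum(coefficients) - 1.0) > 1e-9:
        raise ValueError("admixture coefficients must sum to one")
    num_loci = len(parental_freqs[0])
    return [
        sum(q * freqs[j] for q, freqs in zip(coefficients, parental_freqs))
        for j in range(num_loci)
    ]


# Made-up example: two parental populations, four loci.
pop_a = [0.1, 0.5, 0.9, 0.5]
pop_b = [0.8, 0.2, 0.3, 0.5]
mixed = admixed_frequencies([0.75, 0.25], [pop_a, pop_b])
print([round(p, 3) for p in mixed])
```

Admixture analysis solves the inverse problem: given the observed genotypes, it estimates the coefficients (and usually the parental frequencies as well).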
Projections¶
All dimensionality reduction methods from the above section are implemented in the population.projection namespace. Those methods provide the raw data for dimensionality reduction and may be useful in contexts other than data visualization.
# TODO.
Clusterization¶
Clusterization is the task of splitting data into separate groups without providing a training set of correct classifications. This is often referred to as “unsupervised learning”. Notice that “unsupervised” does not mean “completely independent of human intervention”, since almost all clustering algorithms require some sort of tuning.
Kpop provides a few methods for performing clustering of individuals. They are all implemented under the population.cluster namespace.
# TODO.
Classification¶
Unlike clustering, a supervised classification task learns from a dataset in which every item is labeled with its correct class. A classification task is useful when it can generalize this mapping to data points outside of the training set.
In population genetics this often maps to the situation in which we have a group of individuals with known parental populations and we want to classify additional specimens into one of those populations. Notice that this is different from admixture analysis, which tries to infer the fractions of DNA belonging to each parental population. Here the classification is sharp: the individual is said to belong to a single parental population.
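One simple way to picture such a sharp assignment is a likelihood comparison: compute how probable the genotype is under each population's allele frequencies and pick the most likely population. The sketch below does this in plain Python. It is an illustration under stated assumptions (independent loci, independent allele copies, no zero frequencies) and does not describe the classifiers Kpop actually provides; all names and frequencies are made up.

```python
import math


def log_likelihood(genotype, freqs):
    """Log-probability of a genotype given per-locus allele frequencies.

    Assumes independent loci and independent allele copies; alleles are
    numbered from one and zero (missing data) is skipped. Frequencies
    must be strictly positive for the alleles that occur.
    """
    total = 0.0
    for locus, locus_freqs in zip(genotype, freqs):
        for allele in locus:
            if allele != 0:
                total += math.log(locus_freqs[allele - 1])
    return total


def classify(genotype, populations):
    """Return the label of the population with the highest likelihood."""
    return max(
        populations,
        key=lambda name: log_likelihood(genotype, populations[name]),
    )


# Made-up allele frequencies for two parental populations.
populations = {
    "A": [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1], [0.5, 0.5]],
    "B": [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.5, 0.5]],
}
label = classify([[2, 2], [1, 2], [1, 1], [1, 2]], populations)
print(label)
```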
All classification methods live under the population.classification namespace.
# TODO