Classification

Kpop population objects have built-in tools for building classifiers. As with any classification task, you must provide a labeled training data set, and the classifier algorithm will be trained to replicate those labels and generalize to new sample points. A very basic classification task can start with a population object and a list of labels:

>>> from kpop import Population
>>> pop = Population.random(5, 10)
>>> classifier = pop.classification(['A', 'A', 'B', 'B', 'A'])

The method returns a trained classifier object that associates each individual in the population with the given labels. Notice that the random population has 5 individuals, so we had to provide the same number of labels.

Classifier objects are used as callables that receive a single population argument and return a list of labels corresponding to the assigned classification of each individual. When we classify the training set, there is a fair chance of recovering the original labels:

>>> classifier(pop)                                             
['A', 'A', 'B', 'B', 'A']

Different classification algorithms can be accessed either by passing the method name, as in pop.classification(labels, <method>), or by using the corresponding attribute, pop.classification.<method>(labels). For instance, we could try different classifiers:

>>> labels = ['A', 'A', 'B', 'B', 'A']
>>> nb = pop.classification.naive_bayes(labels)
>>> svm = pop.classification.svm(labels)

You can check the :class:`kpop.population.classification.Classification` class to see all available classifiers.

Easy labels

The default procedure for training a classifier involves passing a list of labels to the training algorithm. Sometimes those labels are stored as metadata in the population object or can be derived from the population somehow. If the labels argument is a string, kpop will try to obtain the label list by using the first valid option:

  • Use population.meta[<label>], if it exists.
  • If label equals ‘ancestry’, it creates a list of labels assigning the
    id of each sub-population to all its individuals.
  • If label is the empty string or None, it looks for a ‘labels’ column in
    the meta information and returns it.
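The resolution order above can be sketched in plain Python. This helper is a hypothetical illustration, not kpop's actual implementation; `meta` stands in for the population's metadata table and `ancestry_ids` for the sub-population id of each individual:

```python
# Hypothetical sketch of the label-resolution order described above.
# resolve_labels is NOT kpop's actual code; it only mirrors the documented
# fallback order for illustration.

def resolve_labels(label, meta, ancestry_ids):
    """Return a list of labels following the documented resolution order."""
    # 1. A metadata column with that exact name wins.
    if label in meta:
        return list(meta[label])
    # 2. The special string 'ancestry' uses each sub-population's id.
    if label == 'ancestry':
        return list(ancestry_ids)
    # 3. The empty string or None falls back to the 'labels' column.
    if not label:
        return list(meta['labels'])
    raise KeyError(label)

meta = {'labels': ['A', 'A', 'B'], 'region': ['N', 'S', 'S']}
ancestry_ids = ['popA', 'popA', 'popB']
print(resolve_labels('region', meta, ancestry_ids))    # ['N', 'S', 'S']
print(resolve_labels('ancestry', meta, ancestry_ids))  # ['popA', 'popA', 'popB']
print(resolve_labels(None, meta, ancestry_ids))        # ['A', 'A', 'B']
```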

This interface makes it very convenient to train classifiers to infer population ancestry. Remember that this is not an admixture analysis since we are assuming that all individuals belong to a single population.

>>> popA = Population.random(5, 20, id='A')
>>> popB = Population.random(5, 20, id='B')
>>> pop = popA + popB
>>> classifier = pop.classification(labels='ancestry')
>>> classifier(popA)
['A', 'A', 'A', 'A', 'A']
>>> classifier(popB)
['B', 'B', 'B', 'B', 'B']

Probabilistic classifiers

Some classifiers allow for probabilistic classification. That is, instead of assigning a single label per individual, they assign a probability distribution giving the probability that each individual belongs to each label. This is accomplished by the .prob_* methods of the classifier; each method represents the probability distribution in a different way.

>>> probs = classifier.prob_list(popA)
>>> probs[0]                                                    
Prob({'A': 0.951, 'B': 0.049})
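To make the printed Prob value concrete, here is a minimal, hypothetical sketch of a normalized label-to-probability mapping (an illustration of the idea, not kpop's actual Prob class):

```python
# Minimal sketch of a label -> probability mapping, normalized to sum to 1.
# This only illustrates the concept behind the Prob output shown above.

def as_prob(scores):
    """Normalize a mapping of label -> nonnegative score into probabilities."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}

probs = as_prob({'A': 19, 'B': 1})
print(probs)                      # {'A': 0.95, 'B': 0.05}
print(max(probs, key=probs.get))  # most likely label: 'A'
```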

API docs

class kpop.population.classification.Classification[source]

Implements the population.classification attribute.

naive_bayes(labels=None, data='count', prior='uniform', alpha=0.5)[source]

Classify objects using a naive Bayes classifier.

Parameters:
  • labels – List of labels or a string with the metadata column used as label. Optionally, the ‘ancestry’ string classifies using the sub-populations as labels.
  • alpha – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
  • prior – The prior probability for each label. The default value ‘uniform’ assigns a fixed uniform prior; None learns the priors from data; it can also be given as a Prob() object or a mapping from labels to probabilities.
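As a rough illustration of how the alpha and prior parameters interact, here is a self-contained Bernoulli naive Bayes sketch over 0/1 features. This is a toy under stated assumptions, not kpop's implementation:

```python
import math

# Hypothetical sketch of Bernoulli naive Bayes with Laplace/Lidstone
# smoothing (alpha) and a configurable prior, mirroring the parameters
# documented above. Features are assumed to be 0/1 values.

def train_nb(X, labels, alpha=0.5, prior='uniform'):
    classes = sorted(set(labels))
    n_features = len(X[0])
    theta = {}
    for c in classes:
        rows = [x for x, y in zip(X, labels) if y == c]
        # Smoothed per-feature probability of observing a 1 in class c.
        theta[c] = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                    for j in range(n_features)]
    if prior == 'uniform':
        log_prior = {c: math.log(1 / len(classes)) for c in classes}
    else:  # prior=None: learn class frequencies from the data
        log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    return theta, log_prior

def predict(x, theta, log_prior):
    def score(c):
        return log_prior[c] + sum(
            math.log(p) if xi else math.log(1 - p)
            for xi, p in zip(x, theta[c]))
    return max(theta, key=score)

X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
labels = ['A', 'A', 'B', 'B']
theta, log_prior = train_nb(X, labels)
print([predict(x, theta, log_prior) for x in X])  # ['A', 'A', 'B', 'B']
```

With alpha=0 there is no smoothing and an unseen feature value zeroes out a class; larger alpha pulls every feature probability toward 1/2.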
sklearn(classifier, labels=None, data='count', **kwargs)[source]

Uses a scikit-learn classifier to classify the population.

Parameters:
  • classifier – A scikit-learn classifier class (e.g., sklearn.naive_bayes.BernoulliNB)
  • labels – A sequence of labels used to train the classifier.
  • data (str) – The method used to convert the population to a usable data set. It uses the same options as in the Population.as_array() method.
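This hook only relies on the usual scikit-learn fit/predict shape, so any estimator class with that interface can be plugged in. A hypothetical stdlib sketch follows; NearestNeighbor1 and classify_with are illustrations of the contract, not kpop or scikit-learn code:

```python
# Hypothetical sketch of the fit/predict contract the sklearn() hook relies
# on. NearestNeighbor1 is a toy stand-in estimator; in kpop the population
# would first be converted to a matrix (the `data` option) before this step.

class NearestNeighbor1:
    """Toy 1-nearest-neighbour classifier with the fit/predict shape."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def closest(x):
            dists = [sum((a - b) ** 2 for a, b in zip(x, row))
                     for row in self.X]
            return self.y[dists.index(min(dists))]
        return [closest(x) for x in X]

def classify_with(estimator_cls, X, labels, **kwargs):
    """Mirror of the documented call shape: instantiate, fit, predict."""
    model = estimator_cls(**kwargs).fit(X, labels)
    return model.predict

X = [[0, 0], [0, 1], [5, 5], [5, 6]]
labels = ['A', 'A', 'B', 'B']
predict = classify_with(NearestNeighbor1, X, labels)
print(predict([[0, 2], [6, 6]]))  # ['A', 'B']
```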
svm(labels=None, data='count', **kwargs)[source]

Classify objects using the Support Vector Machine (SVM) classifier.