Extending PyPruning
If you want to implement your own pruning method, there are two ways:
Implementing a custom metric
You can implement your own metric for GreedyPruningClassifier, MIQPPruningClassifier, or RankPruningClassifier. To do so, you simply have to implement a Python function that should be minimized. The specific interface required by each method differs slightly, so please check the documentation for the method of your choice. In all cases, each method expects a function with at least three parameters:
i (int): The index of the classifier which should be rated
ensemble_proba (a (M, N, C) matrix): All N predictions of all M classifiers in the entire ensemble for all C classes
target (list / array): A list / array of class targets
Note that ensemble_proba contains all class probabilities predicted by all members in the ensemble, so in order to get the individual predictions of the i-th classifier you can access them via ensemble_proba[i,:,:]. A complete example which simply computes the error of each classifier would be:
def individual_error(i, ensemble_proba, target):
iproba = ensemble_proba[i,:,:]
return (iproba.argmax(axis=1) != target).mean()
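The metric above can be checked on a hand-crafted ensemble_proba matrix; the numbers below are purely illustrative:

```python
import numpy as np

def individual_error(i, ensemble_proba, target):
    # Error of the i-th classifier: fraction of wrong argmax predictions
    iproba = ensemble_proba[i,:,:]
    return (iproba.argmax(axis=1) != target).mean()

# Toy ensemble: M = 2 classifiers, N = 4 examples, C = 3 classes
ensemble_proba = np.array([
    [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8],
     [0.8, 0.1, 0.1]],  # classifier 0 predicts classes 0, 1, 2, 0
    [[0.1, 0.8, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8],
     [0.1, 0.1, 0.8]],  # classifier 1 predicts classes 1, 1, 2, 2
])
target = np.array([0, 1, 2, 0])

print(individual_error(0, ensemble_proba, target))  # 0.0 -- always correct
print(individual_error(1, ensemble_proba, target))  # 0.5 -- wrong on two of four examples
```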
Implementing a custom pruner
You can implement your own pruner as well. In this case you just have to extend the PruningClassifier class. To do so, you need to implement the prune_(self, proba, target, data=None) function, which receives the predictions of all classifiers as well as the corresponding data and targets. The function is supposed to return a list of indices of the chosen estimators as well as their corresponding weights. If you need access to the estimators themselves (and not just their predictions), you can access self.estimators_, which already contains a copy of each classifier. For more details have a look at the PruningClassifier.py interface. An example implementation could be:
import numpy as np

class RandomPruningClassifier(PruningClassifier):

    def __init__(self, n_estimators = 5):
        super().__init__()
        self.n_estimators = n_estimators

    def prune_(self, proba, target, data = None):
        n_received = len(proba)
        if self.n_estimators >= n_received:
            # Fewer classifiers received than requested: keep all of them, equally weighted
            return range(0, n_received), [1.0 / n_received for _ in range(n_received)]
        else:
            # Select n_estimators distinct members uniformly at random, each equally weighted
            return np.random.choice(range(0, n_received), size=self.n_estimators, replace=False), [1.0 / self.n_estimators for _ in range(self.n_estimators)]
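The selection logic of this sketch can be exercised on its own, outside of the PruningClassifier machinery; the numbers here are illustrative:

```python
import numpy as np

n_received = 10   # number of classifiers handed to prune_
n_estimators = 3  # number of classifiers to keep

# Draw three distinct indices uniformly at random and weight them equally
idx = np.random.choice(range(0, n_received), size=n_estimators, replace=False)
weights = [1.0 / n_estimators for _ in range(n_estimators)]
print(idx, weights)
```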
PyPruning.PruningClassifier module
- class PyPruning.PruningClassifier.PruningClassifier
Bases:
abc.ABC
This abstract class forms the basis of all pruning methods and offers a unified interface. New pruning methods must extend this class and implement the prune_ method as detailed below.
- weights_
An array of weights corresponding to each classifier in self.estimators_
- Type
numpy array
- estimators_
A list of estimators
- Type
list
- n_classes_
The number of classes the pruned ensemble supports.
- Type
int
- predict(X)
Predict classes using the pruned model.
- Parameters
X (array-like or sparse matrix, shape (n_samples, n_features)) – The samples to be predicted.
- Returns
y – The predicted classes.
- Return type
array, shape (n_samples,)
- predict_proba(X)
Predict class probabilities using the pruned model.
- Parameters
X (array-like or sparse matrix, shape (n_samples, n_features)) – The samples to be predicted.
- Returns
y – The predicted class probabilities.
- Return type
array, shape (n_samples, C)
- prune(X, y, estimators, classes=None, n_classes=None)
Prunes the given ensemble on the supplied dataset. There are a few assumptions placed on the behavior of the individual classifiers in estimators. If you use scikit-learn classifiers, or any classifiers implementing their interface, they should work without a problem. The detailed assumptions are listed below:
predict_proba: Each estimator should offer a predict_proba function which returns the class probabilities for each class on a batch of data
n_classes_: Each estimator should offer a field containing the number of classes it has been trained on. Ideally, this should be the same for all classifiers in the ensemble, but it might differ e.g. due to different bootstrap samples. This field is not accessed if you manually supply n_classes as a parameter to this function
classes_: Each estimator should offer a class mapping which shows the order of classes returned by predict_proba. Usually this is simply [0,1,2,3,4] for 5 classes, but if your classifier returns class probabilities in a different order, e.g. [2,1,0,3,4], you should store this order in classes_. This field is not accessed if you manually supply classes as a parameter to this function
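The classes_ assumption above can be illustrated with a small sketch: if a base learner reports its probability columns in a non-natural order, they have to be permuted back before they can be compared across the ensemble. The permutation below is hypothetical:

```python
import numpy as np

# Hypothetical base learner trained with classes_ = [2, 0, 1]: the first
# column of its predict_proba output belongs to class 2, the second to
# class 0, and the third to class 1.
classes_ = np.array([2, 0, 1])
proba = np.array([[0.7, 0.2, 0.1]])  # 70% class 2, 20% class 0, 10% class 1

# Scatter the columns back into the natural order 0, 1, ..., C-1
n_classes = 3
aligned = np.zeros((proba.shape[0], n_classes))
aligned[:, classes_] = proba
print(aligned)  # [[0.2 0.1 0.7]]
```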
For pruning, this function calls predict_proba on each classifier in estimators and then calls prune_ of the implementing class. After pruning, it extracts the selected classifiers and their corresponding weights from estimators and stores them in self.weights_ and self.estimators_
- Parameters
X (numpy matrix) – A (N, d) matrix with the datapoints used for pruning where N is the number of data points and d is the dimensionality
y (numpy array / list of ints) – A numpy array or list of N integers where each integer represents the class of the corresponding example. Classes should start with 0, so that for C classes the integers 0,1,…,C-1 are used
estimators (list) – A list of estimators from which the pruned ensemble is selected.
classes (numpy array / list of ints) – Contains the class mappings of each base learner in the order which is returned by predict_proba. Usually this should be something like [0,1,2,3,4] for a 5 class problem. However, sometimes the mapping differs, e.g. [2,1,0,3,4]. In this case, you can manually supply the list of mappings
n_classes (int) – The total number of classes. Usually, this should be n_classes = len(classes). However, sometimes estimators are only fitted on a subset of the data (e.g. during cross-validation or bootstrapping) and the pruning set might contain classes which are not in the original training set and vice versa. In this case it is best to supply n_classes beforehand.
- Return type
The pruned ensemble (self).
- abstract prune_(proba, target, data=None)
Prunes the ensemble using the ensemble predictions proba and the pruning data targets / data. If the pruning method requires access to the original ensemble members you can access these via self.estimators_. Note that self.estimators_ is already a deep-copy of the estimators so you are also free to change the estimators in this list if you want to.
- Parameters
proba (numpy matrix) – A (M, N, C) matrix which contains the individual predictions of each ensemble member on the pruning data. Each member's prediction is generated via predict_proba. M is the size of the base ensemble, N is the size of the pruning data and C is the number of classes
target (numpy array of ints) – A numpy array or list of N integers where each integer represents the class for each example. Classes should start with 0, so that for C classes the integer 0,1,…,C-1 are used
data (numpy matrix, optional) – The data points in a (N, d) matrix on which proba has been computed, where N is the pruning set size and d is the dimensionality. This can be used by a pruning method if required, but most methods only require the individual predictions and not the actual data points.
- Returns
A tuple of indices and weights (idx, weights) with the following properties
idx (numpy array / list of ints) – A list of integers indicating which classifiers should be selected from self.estimators_. Any changes made to self.estimators_ are also reflected here, so make sure that the order of classifiers in proba and self.estimators_ remains the same (or return idx accordingly)
weights (numpy array / list of floats) – The individual weights for each selected classifier. The size of this array should match the size of idx (and not the size of the original base ensemble).
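Given the (M, N, C) layout of proba described above, the returned idx and weights might be combined into a pruned ensemble prediction roughly as follows. This is a sketch of the idea, not the library's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy predictions: M = 4 classifiers, N = 2 examples, C = 3 classes
proba = rng.random((4, 2, 3))
proba /= proba.sum(axis=2, keepdims=True)  # normalize rows to valid probabilities

idx = [0, 2]          # indices as returned by prune_
weights = [0.5, 0.5]  # one weight per selected classifier

# Weighted average over the selected members only
pruned_proba = sum(w * proba[i] for i, w in zip(idx, weights))
predictions = pruned_proba.argmax(axis=1)  # final class predictions, shape (N,)
```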