Greedy

class PyPruning.GreedyPruningClassifier.GreedyPruningClassifier(n_estimators=5, metric=<function reduced_error>, n_jobs=8, metric_options=None)

Bases: PyPruning.PruningClassifier.PruningClassifier

Greedy / Ordering-based pruning.

Greedy or ordering-based methods order the estimators in the ensemble according to their performance. In contrast to ranking-based pruning, however, they also take the already selected sub-ensemble into account when selecting the next classifier. They start with the empty ensemble and in each round greedily add the classifier that minimizes the loss:

\[\arg\min_{h} L\left(\frac{K-1}{K} f(x) + \frac{1}{K} \cdot h(x), y\right)\]

where f is the already selected ensemble with K-1 members and h is the newly selected member. In this sense the selection re-orders the classifiers in the ensemble, which is why the name ordering-based pruning is sometimes used. In this implementation a loss function receives four parameters:

  • i (int): The index of the classifier to be rated

  • ensemble_proba (a (M, N, C) matrix): The predictions of all M classifiers in the entire ensemble for all N examples and all C classes

  • selected_models (list of ints): The indices of all models selected so far

  • target (list / array): A list / array of class targets

A simple loss function which minimizes the overall sub-ensemble's error would be:

def error(i, ensemble_proba, selected_models, target):
    # Average the probabilities of the already selected models and the i-th model
    iproba = ensemble_proba[i, :, :]
    sub_proba = ensemble_proba[selected_models, :, :]
    pred = 1.0 / (1 + len(sub_proba)) * (sub_proba.sum(axis=0) + iproba)
    # Return the error of the resulting sub-ensemble (smaller is better)
    return (pred.argmax(axis=1) != target).mean()
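A complete usage sketch could then look as follows. This is a minimal, illustrative example rather than canonical usage: the train / prune split, the RandomForestClassifier base ensemble, and all variable names are assumptions, while prune(X, y, estimators) is the interface inherited from PyPruning.PruningClassifier.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from PyPruning.GreedyPruningClassifier import GreedyPruningClassifier

# Illustrative data and base ensemble; any fitted ensemble whose members
# support predict_proba should work the same way
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_prune, y_train, y_prune = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=128, random_state=0)
model.fit(X_train, y_train)

# Greedily select 10 members using the error metric defined above
pruned = GreedyPruningClassifier(n_estimators=10, metric=error)
pruned.prune(X_prune, y_prune, model.estimators_)
y_pred = pruned.predict(X_prune)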
n_estimators

The number of estimators which should be selected.

Type

int, default is 5

metric

A function that assigns a score (smaller is better) to each classifier, which is then used to select the next classifier in each round.

Type

function, default is reduced_error

n_jobs

The number of threads used for computing the individual metrics for each classifier.

Type

int, default is 8

metric_options

Any additional metric_options are passed directly to the metric function via the ** operator, e.g. {"rho": 0.4} for drep (see the example below).

Type

dict or None, default is None

prune_(proba, target, data=None)

Prunes the ensemble using the ensemble predictions proba and the pruning targets / data. If the pruning method requires access to the original ensemble members, you can access them via self.estimators_. Note that self.estimators_ is already a deep copy of the estimators, so you are free to change the estimators in this list if you want to.

Parameters
  • proba (numpy matrix) – A (M, N, C) matrix which contains the individual predictions of each ensemble member on the pruning data. Each ensemble member's prediction is generated via predict_proba. M is the size of the base ensemble, N the size of the pruning data, and C the number of classes

  • target (numpy array of ints) – A numpy array or list of N integers where each integer represents the class of the corresponding example. Classes should start with 0, so that for C classes the integers 0, 1, …, C-1 are used

  • data (numpy matrix, optional) – The data points in a (N, d) matrix on which proba has been computed, where N is the pruning set size and d is the dimensionality of the data. This can be used by a pruning method if required, but most methods only need the individual predictions and not the actual data points.

Returns

  • A tuple of indices and weights (idx, weights) with the following properties

  • idx (numpy array / list of ints) – A list of integers indicating which classifiers should be selected from self.estimators_. Any changes made to self.estimators_ are also reflected here, so make sure that the order of classifiers in proba and self.estimators_ remains the same (or adjust idx accordingly)

  • weights (numpy array / list of floats) – The individual weights for each selected classifier. The size of this array should match the size of idx (and not the size of the original base ensemble).
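To illustrate this return contract, a minimal, hypothetical prune_ implementation that simply keeps the first n_estimators members with uniform weights (not a method offered by PyPruning, and assuming the class stores an n_estimators attribute) could look like this:

def prune_(self, proba, target, data=None):
    # Hypothetical example: keep the first n_estimators members.
    # proba.shape[0] is the number of members M in the base ensemble.
    k = min(self.n_estimators, proba.shape[0])
    idx = list(range(k))
    weights = [1.0 / k] * k
    return idx, weights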

PyPruning.GreedyPruningClassifier.complementariness(i, ensemble_proba, selected_models, target)

Computes the complementariness of the i-th classifier with respect to the current sub-ensemble. A classifier is complementary to the sub-ensemble if it disagrees with it while being correct, i.e. it is correct on examples on which the sub-ensemble is wrong.

Reference:

Martínez-Muñoz, G., & Suárez, A. (2004). Aggregation ordering in bagging. Proceedings of the IASTED International Conference. Applied Informatics, 258–263. Retrieved from https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.59.2035&rep=rep1&type=pdf
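For intuition, a simplified sketch of such a metric (illustrative only, not the library's exact implementation) could look like this:

import numpy as np

def complementariness_sketch(i, ensemble_proba, selected_models, target):
    # Simplified illustration; assumes selected_models is non-empty
    ipred = ensemble_proba[i, :, :].argmax(axis=1)
    sub_pred = ensemble_proba[selected_models, :, :].mean(axis=0).argmax(axis=1)
    target = np.asarray(target)
    # Count examples where the candidate is correct but the sub-ensemble is
    # wrong; smaller scores are better, so return the negative count
    return -np.sum((ipred == target) & (sub_pred != target))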

PyPruning.GreedyPruningClassifier.drep(i, ensemble_proba, selected_models, target, rho=0.25)

A multi-class version of a PAC-style bound which includes the diversity of the sub-ensemble. In essence, it counts the number of predictions on which the i-th classifier and the sub-ensemble differ. The paper suggests using \(\rho \in \{0.2, 0.25, \dots, 0.5\}\). Our default is 0.25 because it works well in practice. You can supply a different rho via the metric_options parameter of GreedyPruningClassifier.

Reference:

Li, N., Yu, Y., & Zhou, Z.-H. (2012). Diversity Regularized Ensemble Pruning. In P. A. Flach, T. De Bie, & N. Cristianini (Eds.), Machine Learning and Knowledge Discovery in Databases (pp. 330–345). Berlin, Heidelberg: Springer Berlin Heidelberg. https://link.springer.com/content/pdf/10.1007%2F978-3-642-33460-3.pdf
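For example, a different rho can be supplied as follows (the value 0.4 is an arbitrary choice for illustration):

from PyPruning.GreedyPruningClassifier import GreedyPruningClassifier, drep

# Greedily select 10 members using drep with a non-default rho
pruner = GreedyPruningClassifier(n_estimators=10, metric=drep, metric_options={"rho": 0.4})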

PyPruning.GreedyPruningClassifier.margin_distance(i, ensemble_proba, selected_models, target, p_range=[0, 0.25])

Computes how including the i-th classifier in the sub-ensemble changes its prediction towards a reference vector. The paper randomly samples p from \((0, 0.25)\), which we also use as the default here. You can supply a different p_range via the metric_options parameter of GreedyPruningClassifier.

Reference:

Martínez-Muñoz, G., & Suárez, A. (2004). Aggregation ordering in bagging. Proceedings of the IASTED International Conference. Applied Informatics, 258–263. Retrieved from https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.59.2035&rep=rep1&type=pdf
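A rough sketch of the underlying idea, signed per-example margins pulled towards a constant reference vector with entries p, might look like this (illustrative only, not the library's exact implementation):

import numpy as np

def margin_distance_sketch(i, ensemble_proba, selected_models, target, p=0.2):
    # Signed margins of the candidate sub-ensemble: +1 where a member is
    # correct, -1 where it is wrong, averaged over all members
    all_proba = ensemble_proba[selected_models + [i], :, :]
    correct = (all_proba.argmax(axis=2) == np.asarray(target)).astype(float)
    margins = (2.0 * correct - 1.0).mean(axis=0)
    # Distance of the margin vector to the reference vector (p, ..., p)
    return np.sum((margins - p) ** 2)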

PyPruning.GreedyPruningClassifier.neg_auc(i, ensemble_proba, selected_models, target)

Computes the (negative) ROC-AUC score of the sub-ensemble including the i-th classifier.
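Assuming sklearn's roc_auc_score, a sketch of this metric (not necessarily the library's exact implementation) could be:

from sklearn.metrics import roc_auc_score

def neg_auc_sketch(i, ensemble_proba, selected_models, target):
    # Average the candidate sub-ensemble's probabilities and score them with
    # the negative ROC-AUC, since smaller scores are better in the greedy step
    sub_proba = ensemble_proba[selected_models + [i], :, :].mean(axis=0)
    if sub_proba.shape[1] == 2:
        return -roc_auc_score(target, sub_proba[:, 1])
    return -roc_auc_score(target, sub_proba, multi_class="ovr")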

PyPruning.GreedyPruningClassifier.reduced_error(i, ensemble_proba, selected_models, target)

Computes the error of the sub-ensemble including the i-th classifier.

Reference:

Margineantu, D., & Dietterich, T. G. (1997). Pruning Adaptive Boosting. Proceedings of the Fourteenth International Conference on Machine Learning, 211–218. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.7017&rep=rep1&type=pdf