Clustering
- class PyPruning.ClusterPruningClassifier.ClusterPruningClassifier(n_estimators=5, cluster_estimators=<function kmeans>, select_estimators=<function random_selector>, cluster_mode='probabilities', cluster_options=None, selector_options=None)
Bases: PyPruning.PruningClassifier.PruningClassifier
Clustering-based pruning.
Clustering-based methods follow a two-step procedure. In the first step, they cluster the estimators in the ensemble according to some clustering algorithm; in the second step, a representative from each cluster is selected to form the pruned ensemble.
In this implementation, you must provide two functions
cluster_estimators: A function which clusters the estimators given their representation X (see cluster_mode for details) and returns the cluster assignment for each estimator. An example using kmeans clustering would be:
    from sklearn.cluster import KMeans

    def kmeans(X, n_estimators, **kwargs):
        kmeans = KMeans(n_clusters=n_estimators, **kwargs)
        assignments = kmeans.fit_predict(X)
        return assignments
select_estimators: A function which selects the estimators from the clustering and returns the selected indices. An example which selects the centroids would be:
    from sklearn.neighbors import NearestCentroid
    from sklearn.metrics import pairwise_distances_argmin_min

    def centroid_selector(x, assignments, target, **kwargs):
        clf = NearestCentroid()
        clf.fit(x, assignments)
        centroids = clf.centroids_
        centroid_idx, _ = pairwise_distances_argmin_min(centroids, x)
        return centroid_idx
If you want to pass additional parameters to cluster_estimators or select_estimators, you can do so via cluster_options and selector_options respectively. These parameters are passed via **kwargs to the functions, so please make sure that they are either None or valid Python dictionaries.
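For example, a complete round-trip could look as follows. This is a usage sketch, not verbatim library documentation: it assumes the prune(X, y, estimators) and predict interface of the PruningClassifier base class and a fitted scikit-learn ensemble whose members are accessible via estimators_.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from PyPruning.ClusterPruningClassifier import ClusterPruningClassifier, kmeans, random_selector

    X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5)
    X_train, X_prune, y_train, y_prune = train_test_split(X, y, test_size=0.25)

    # Train a large base ensemble, then prune it down to 5 members.
    forest = RandomForestClassifier(n_estimators=128).fit(X_train, y_train)

    pruner = ClusterPruningClassifier(
        n_estimators=5,
        cluster_estimators=kmeans,
        select_estimators=random_selector,
        cluster_mode="probabilities",
        cluster_options={"n_init": 10},  # forwarded to KMeans via **kwargs
    )
    pruner.prune(X_prune, y_prune, forest.estimators_)
    y_pred = pruner.predict(X_prune)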
- n_estimators
The number of estimators which should be selected.
- Type
int, default is 5
- cluster_estimators
A function that clusters the classifiers.
- Type
function, default is kmeans
- select_estimators
A function that selects representatives from each cluster
- Type
function, default is random_selector
- cluster_mode
The representation of each estimator used for clustering. Must be one of {“probabilities”, “predictions”, “accuracy”} (a sketch of these representations follows this attribute list):
“probabilities”: Uses the raw probability output of each estimator for clustering. For multi-class problems the vector is “flattened” to an N * C vector, where N is the number of data points in the pruning set and C is the number of classes
“predictions”: Same as “probabilities”, but uses the predictions instead of the probabilities.
“accuracy”: Computes the accuracy of each estimator on each datapoint and uses the corresponding vector for clustering.
- Type
str, default is “probabilities”
- cluster_options
Additional options passed to cluster_estimators
- Type
dict, default is None
- selector_options
Additional options passed to select_estimators
- Type
dict, default is None
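To make the three cluster_mode options concrete, the following sketch shows how each representation could be derived from the (N, M, C) proba tensor described under prune_ below. The function name and exact flattening are illustrative, not the library's internals.

    import numpy as np

    def to_representation(proba, target, cluster_mode="probabilities"):
        target = np.asarray(target)
        N, M, C = proba.shape
        if cluster_mode == "probabilities":
            # One flattened vector of length N * C per estimator.
            return proba.transpose(1, 0, 2).reshape(M, N * C)
        elif cluster_mode == "predictions":
            # Same layout, but with one-hot hard predictions instead of probabilities.
            onehot = np.eye(C)[proba.argmax(axis=2)]  # (N, M, C)
            return onehot.transpose(1, 0, 2).reshape(M, N * C)
        else:  # "accuracy"
            # One 0/1 correctness vector of length N per estimator.
            return (proba.argmax(axis=2) == target[:, None]).astype(float).T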
- prune_(proba, target, data=None)
Prunes the ensemble using the ensemble predictions proba and the pruning data targets / data. If the pruning method requires access to the original ensemble members, you can access these via self.estimators_. Note that self.estimators_ is already a deep copy of the estimators, so you are also free to change the estimators in this list if you want to. A minimal sketch of the expected return contract follows the parameter list below.
- Parameters
proba (numpy matrix) – A (N, M, C) matrix which contains the individual predictions of each ensemble member on the pruning data. Each ensemble prediction is generated via predict_proba. N is the size of the pruning data, M is the size of the base ensemble, and C is the number of classes
target (numpy array of ints) – A numpy array or list of N integers where each integer represents the class of each example. Classes should start with 0, so that for C classes the integers 0, 1, …, C-1 are used
data (numpy matrix, optional) – The data points in a (N, d) matrix on which proba has been computed, where N is the pruning set size and d is the dimensionality of the data points. This can be used by a pruning method if required, but most methods do not require the actual data points, only the individual predictions.
- Returns
A tuple of indices and weights (idx, weights) with the following properties
idx (numpy array / list of ints) – A list of integers indicating which classifiers should be selected from self.estimators_. Any changes made to self.estimators_ are also reflected here, so make sure that the order of classifiers in proba and self.estimators_ remains the same (or return idx accordingly)
weights (numpy array / list of floats) – The individual weights for each selected classifier. The size of this array should match the size of idx (and not the size of the original base ensemble).
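As a minimal sketch of this contract, a toy subclass that keeps the first n members with uniform weights could look like this. It is purely illustrative, not a useful pruning method, and assumes the PruningClassifier base constructor takes no required arguments.

    from PyPruning.PruningClassifier import PruningClassifier

    class FirstKPruner(PruningClassifier):
        # Toy pruner: keep the first n_estimators members, weighted uniformly.
        def __init__(self, n_estimators=5):
            super().__init__()  # assumption: no required base-class arguments
            self.n_estimators = n_estimators

        def prune_(self, proba, target, data=None):
            idx = list(range(self.n_estimators))
            weights = [1.0 / self.n_estimators] * self.n_estimators
            return idx, weights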
- PyPruning.ClusterPruningClassifier.agglomerative(X, n_estimators, **kwargs)
Performs agglomerative clustering on the given data X. The original publication (see below) considers the accuracy / error of each estimator, which can be achieved by setting cluster_mode = “accuracy” in the ClusterPruningClassifier.
- Reference:
Giacinto, G., Roli, F., & Fumera, G. (n.d.). Design of effective multiple classifier systems by clustering of classifiers. Proceedings 15th International Conference on Pattern Recognition. ICPR-2000. doi:10.1109/icpr.2000.906039
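A sketch of such a function built on scikit-learn's AgglomerativeClustering (the library ships its own implementation; this only illustrates the expected signature):

    from sklearn.cluster import AgglomerativeClustering

    def agglomerative(X, n_estimators, **kwargs):
        # Cluster the estimator representations into n_estimators groups.
        clustering = AgglomerativeClustering(n_clusters=n_estimators, **kwargs)
        return clustering.fit_predict(X)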
- PyPruning.ClusterPruningClassifier.centroid_selector(X, assignments, target)
Returns the centroid of each cluster. Bakker and Heskes propose this approach, although there are subtle differences: originally they propose annealing via an EM algorithm, whereas this implementation uses kmeans / agglomerative clustering.
- Reference:
Bakker, Bart, and Tom Heskes. “Clustering ensembles of neural network models.” Neural networks 16.2 (2003): 261-269.
- PyPruning.ClusterPruningClassifier.cluster_accuracy(X, assignments, target, n_classes=None)
Select the most accurate model from each cluster. Lazarevic and Obradovic propose this approach, although there are subtle differences: in the original paper they remove the least accurate classifiers as long as the performance of the sub-ensemble does not decrease, whereas this implementation simply selects the best / most accurate classifier from each cluster.
- Reference:
Lazarevic, A., & Obradovic, Z. (2001). Effective pruning of neural network classifier ensembles. Proceedings of the International Joint Conference on Neural Networks, 2(January), 796–801. https://doi.org/10.1109/ijcnn.2001.939461
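A sketch of this selection, assuming cluster_mode = “accuracy” so that each row of X is the per-datapoint correctness of one estimator (the function name is hypothetical and the library's own implementation may differ):

    import numpy as np

    def select_most_accurate(X, assignments, target, **kwargs):
        accuracies = X.mean(axis=1)  # mean correctness per estimator
        selected = []
        for c in np.unique(assignments):
            members = np.where(assignments == c)[0]
            # Keep the most accurate member of this cluster.
            selected.append(members[np.argmax(accuracies[members])])
        return selected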
- PyPruning.ClusterPruningClassifier.kmeans(X, n_estimators, **kwargs)
Performs kmeans clustering on the given data X. The original publication (see below) considers the predictions of each estimator, which can be achieved by setting cluster_mode = “predictions” in the ClusterPruningClassifier. Moreover, the original publication discusses binary classification problems; in this multi-class implementation the probabilities / predictions for each class are flattened before clustering.
- Reference:
Lazarevic, A., & Obradovic, Z. (2001). Effective pruning of neural network classifier ensembles. Proceedings of the International Joint Conference on Neural Networks, 2(January), 796–801. https://doi.org/10.1109/ijcnn.2001.939461
- PyPruning.ClusterPruningClassifier.largest_mean_distance(X, assignments, target, metric='euclidean', n_jobs=None)
From each cluster, select the classifier with the largest mean distance to all other clusters.
- Reference:
Giacinto, G., Roli, F., & Fumera, G. (n.d.). Design of effective multiple classifier systems by clustering of classifiers. Proceedings 15th International Conference on Pattern Recognition. ICPR-2000. doi:10.1109/icpr.2000.906039
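A sketch of the idea (the function name and details are illustrative, not the library's internals): for each cluster, pick the member whose mean distance to all estimators outside the cluster is largest. It assumes more than one cluster, since otherwise there are no outside estimators to measure against.

    import numpy as np
    from sklearn.metrics import pairwise_distances

    def largest_mean_distance_selector(X, assignments, target, metric="euclidean"):
        D = pairwise_distances(X, metric=metric)
        selected = []
        for c in np.unique(assignments):
            inside = np.where(assignments == c)[0]
            outside = np.where(assignments != c)[0]
            # Mean distance of each cluster member to all estimators
            # in the other clusters.
            mean_dist = D[np.ix_(inside, outside)].mean(axis=1)
            selected.append(inside[np.argmax(mean_dist)])
        return selected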
- PyPruning.ClusterPruningClassifier.random_selector(X, assignments, target)
Randomly select a classifier from each cluster.
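A sketch of what such a selector could look like (the library's own implementation may differ):

    import numpy as np

    def random_selector(X, assignments, target):
        # Pick one member uniformly at random from each cluster.
        rng = np.random.default_rng()
        return [rng.choice(np.where(assignments == c)[0])
                for c in np.unique(assignments)]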