diff --git a/README.md b/README.md
index b6c2a19..96851ea 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,6 @@
 ![https://img.shields.io/badge/python-3.8%2B-blue](https://img.shields.io/badge/python-3.8%2B-brightgreen)
 [![DOI](https://zenodo.org/badge/262658230.svg)](https://zenodo.org/badge/latestdoi/262658230)
-
 # STree
 
 Oblique Tree classifier based on SVM nodes. The nodes are built and split with sklearn SVC models. Stree is a sklearn estimator and can be integrated in pipelines, grid searches, etc.
 
@@ -39,24 +38,23 @@ Can be found in [stree.readthedocs.io](https://stree.readthedocs.io/en/stable/)
 
 ## Hyperparameters
 
-| | **Hyperparameter** | **Type/Values** | **Default** | **Meaning** |
-| --- | ------------------- | ------------------------------------------------------ | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| \* | C | \<float\> | 1.0 | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. |
-| \* | kernel | {"liblinear", "linear", "poly", "rbf", "sigmoid"} | linear | Specifies the kernel type to be used in the algorithm. It must be one of ‘liblinear’, ‘linear’, ‘poly’ or ‘rbf’. liblinear uses [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/) library and the rest uses [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) library through scikit-learn library |
-| \* | max_iter | \<int\> | 1e5 | Hard limit on iterations within solver, or -1 for no limit. |
-| \* | random_state | \<int\> | None | Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False.<br>Pass an int for reproducible output across multiple function calls |
-| | max_depth | \<int\> | None | Specifies the maximum depth of the tree |
-| \* | tol | \<float\> | 1e-4 | Tolerance for stopping criterion. |
-| \* | degree | \<int\> | 3 | Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels. |
-| \* | gamma | {"scale", "auto"} or \<float\> | scale | Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.<br>if gamma='scale' (default) is passed then it uses 1 / (n_features \* X.var()) as value of gamma,<br>if ‘auto’, uses 1 / n_features. |
-| | split_criteria | {"impurity", "max_samples"} | impurity | Decides (just in case of a multi class classification) which column (class) use to split the dataset in a node\*\*. max_samples is incompatible with 'ovo' multiclass_strategy |
-| | criterion | {“gini”, “entropy”} | entropy | The function to measure the quality of a split (only used if max_features != num_features).<br>Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. |
-| | min_samples_split | \<int\> | 0 | The minimum number of samples required to split an internal node. 0 (default) for any |
-| | max_features | \<int\>, \<float\><br>or {“auto”, “sqrt”, “log2”} | None | The number of features to consider when looking for the split:<br>If int, then consider max_features features at each split.<br>If float, then max_features is a fraction and int(max_features \* n_features) features are considered at each split.<br>If “auto”, then max_features=sqrt(n_features).<br>If “sqrt”, then max_features=sqrt(n_features).<br>If “log2”, then max_features=log2(n_features).<br>If None, then max_features=n_features. |
-| | splitter | {"best", "random", "mutual", "cfs", "fcbf"} | "random" | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **“best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **“random”**: The algorithm generates 5 candidates and choose the best (max. info. gain) of them. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter |
-| | normalize | \<bool\> | False | If standardization of features should be applied on each node with the samples that reach it |
-| \* | multiclass_strategy | {"ovo", "ovr"} | "ovo" | Strategy to use with multiclass datasets, **"ovo"**: one versus one. **"ovr"**: one versus rest |
-
+| | **Hyperparameter** | **Type/Values** | **Default** | **Meaning** |
+| --- | ------------------- | ------------------------------------------------------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| \* | C | \<float\> | 1.0 | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. |
+| \* | kernel | {"liblinear", "linear", "poly", "rbf", "sigmoid"} | linear | Specifies the kernel type to be used in the algorithm. It must be one of ‘liblinear’, ‘linear’, ‘poly’, ‘rbf’ or ‘sigmoid’. liblinear uses the [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/) library and the rest use the [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) library through scikit-learn |
+| \* | max_iter | \<int\> | 1e5 | Hard limit on iterations within solver, or -1 for no limit. |
+| \* | random_state | \<int\> | None | Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False.<br>Pass an int for reproducible output across multiple function calls |
+| | max_depth | \<int\> | None | Specifies the maximum depth of the tree |
+| \* | tol | \<float\> | 1e-4 | Tolerance for stopping criterion. |
+| \* | degree | \<int\> | 3 | Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels. |
+| \* | gamma | {"scale", "auto"} or \<float\> | scale | Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.<br>if gamma='scale' (default) is passed then it uses 1 / (n_features \* X.var()) as the value of gamma,<br>if ‘auto’, uses 1 / n_features. |
+| | split_criteria | {"impurity", "max_samples"} | impurity | Decides (only in multiclass classification) which column (class) to use to split the dataset in a node\*\*. max_samples is incompatible with the 'ovo' multiclass_strategy |
+| | criterion | {“gini”, “entropy”} | entropy | The function to measure the quality of a split (only used if max_features != num_features).<br>Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. |
+| | min_samples_split | \<int\> | 0 | The minimum number of samples required to split an internal node. 0 (default) imposes no minimum |
+| | max_features | \<int\>, \<float\><br>or {“auto”, “sqrt”, “log2”} | None | The number of features to consider when looking for the split:<br>If int, then consider max_features features at each split.<br>If float, then max_features is a fraction and int(max_features \* n_features) features are considered at each split.<br>If “auto”, then max_features=sqrt(n_features).<br>If “sqrt”, then max_features=sqrt(n_features).<br>If “log2”, then max_features=log2(n_features).<br>If None, then max_features=n_features. |
+| | splitter | {"best", "random", "mutual", "cfs", "fcbf", "iwss"} | "random" | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **“best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **“random”**: The algorithm generates 5 candidates and chooses the best (max. info. gain) of them. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter. **"iwss"**: Apply the IWSS (Incremental Wrapper Subset Selection) algorithm |
+| | normalize | \<bool\> | False | Whether standardization of features should be applied at each node, using the samples that reach it |
+| \* | multiclass_strategy | {"ovo", "ovr"} | "ovo" | Strategy to use with multiclass datasets. **"ovo"**: one versus one. **"ovr"**: one versus rest |
 
 \* Hyperparameter used by the support vector classifier of every node
diff --git a/docs/source/hyperparameters.md b/docs/source/hyperparameters.md
index 7e5fe33..949d6db 100644
--- a/docs/source/hyperparameters.md
+++ b/docs/source/hyperparameters.md
@@ -1,22 +1,22 @@
 # Hyperparameters
 
-| | **Hyperparameter** | **Type/Values** | **Default** | **Meaning** |
-| --- | ------------------- | ------------------------------------------------------ | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| \* | C | \<float\> | 1.0 | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. |
-| \* | kernel | {"liblinear", "linear", "poly", "rbf", "sigmoid"} | linear | Specifies the kernel type to be used in the algorithm. It must be one of ‘liblinear’, ‘linear’, ‘poly’ or ‘rbf’. liblinear uses [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/) library and the rest uses [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) library through scikit-learn library |
-| \* | max_iter | \<int\> | 1e5 | Hard limit on iterations within solver, or -1 for no limit. |
-| \* | random_state | \<int\> | None | Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False.<br>Pass an int for reproducible output across multiple function calls |
-| | max_depth | \<int\> | None | Specifies the maximum depth of the tree |
-| \* | tol | \<float\> | 1e-4 | Tolerance for stopping criterion. |
-| \* | degree | \<int\> | 3 | Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels. |
-| \* | gamma | {"scale", "auto"} or \<float\> | scale | Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.<br>if gamma='scale' (default) is passed then it uses 1 / (n_features \* X.var()) as value of gamma,<br>if ‘auto’, uses 1 / n_features. |
-| | split_criteria | {"impurity", "max_samples"} | impurity | Decides (just in case of a multi class classification) which column (class) use to split the dataset in a node\*\*. max_samples is incompatible with 'ovo' multiclass_strategy |
-| | criterion | {“gini”, “entropy”} | entropy | The function to measure the quality of a split (only used if max_features != num_features).<br>Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. |
-| | min_samples_split | \<int\> | 0 | The minimum number of samples required to split an internal node. 0 (default) for any |
-| | max_features | \<int\>, \<float\><br>or {“auto”, “sqrt”, “log2”} | None | The number of features to consider when looking for the split:<br>If int, then consider max_features features at each split.<br>If float, then max_features is a fraction and int(max_features \* n_features) features are considered at each split.<br>If “auto”, then max_features=sqrt(n_features).<br>If “sqrt”, then max_features=sqrt(n_features).<br>If “log2”, then max_features=log2(n_features).<br>If None, then max_features=n_features. |
-| | splitter | {"best", "random", "mutual", "cfs", "fcbf"} | "random" | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **“best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **“random”**: The algorithm generates 5 candidates and choose the best (max. info. gain) of them. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter |
-| | normalize | \<bool\> | False | If standardization of features should be applied on each node with the samples that reach it |
-| \* | multiclass_strategy | {"ovo", "ovr"} | "ovo" | Strategy to use with multiclass datasets, **"ovo"**: one versus one. **"ovr"**: one versus rest |
+| | **Hyperparameter** | **Type/Values** | **Default** | **Meaning** |
+| --- | ------------------- | ------------------------------------------------------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| \* | C | \<float\> | 1.0 | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. |
+| \* | kernel | {"liblinear", "linear", "poly", "rbf", "sigmoid"} | linear | Specifies the kernel type to be used in the algorithm. It must be one of ‘liblinear’, ‘linear’, ‘poly’, ‘rbf’ or ‘sigmoid’. liblinear uses the [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/) library and the rest use the [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) library through scikit-learn |
+| \* | max_iter | \<int\> | 1e5 | Hard limit on iterations within solver, or -1 for no limit. |
+| \* | random_state | \<int\> | None | Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False.<br>Pass an int for reproducible output across multiple function calls |
+| | max_depth | \<int\> | None | Specifies the maximum depth of the tree |
+| \* | tol | \<float\> | 1e-4 | Tolerance for stopping criterion. |
+| \* | degree | \<int\> | 3 | Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels. |
+| \* | gamma | {"scale", "auto"} or \<float\> | scale | Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.<br>if gamma='scale' (default) is passed then it uses 1 / (n_features \* X.var()) as the value of gamma,<br>if ‘auto’, uses 1 / n_features. |
+| | split_criteria | {"impurity", "max_samples"} | impurity | Decides (only in multiclass classification) which column (class) to use to split the dataset in a node\*\*. max_samples is incompatible with the 'ovo' multiclass_strategy |
+| | criterion | {“gini”, “entropy”} | entropy | The function to measure the quality of a split (only used if max_features != num_features).<br>Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. |
+| | min_samples_split | \<int\> | 0 | The minimum number of samples required to split an internal node. 0 (default) imposes no minimum |
+| | max_features | \<int\>, \<float\><br>or {“auto”, “sqrt”, “log2”} | None | The number of features to consider when looking for the split:<br>If int, then consider max_features features at each split.<br>If float, then max_features is a fraction and int(max_features \* n_features) features are considered at each split.<br>If “auto”, then max_features=sqrt(n_features).<br>If “sqrt”, then max_features=sqrt(n_features).<br>If “log2”, then max_features=log2(n_features).<br>If None, then max_features=n_features. |
+| | splitter | {"best", "random", "mutual", "cfs", "fcbf", "iwss"} | "random" | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **“best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **“random”**: The algorithm generates 5 candidates and chooses the best (max. info. gain) of them. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter. **"iwss"**: Apply the IWSS (Incremental Wrapper Subset Selection) algorithm |
+| | normalize | \<bool\> | False | Whether standardization of features should be applied at each node, using the samples that reach it |
+| \* | multiclass_strategy | {"ovo", "ovr"} | "ovo" | Strategy to use with multiclass datasets. **"ovo"**: one versus one. **"ovr"**: one versus rest |
 
 \* Hyperparameter used by the support vector classifier of every node
diff --git a/stree/Splitter.py b/stree/Splitter.py
index 2f6e590..2df3f57 100644
--- a/stree/Splitter.py
+++ b/stree/Splitter.py
@@ -271,10 +271,17 @@ class Splitter:
                 f"criteria has to be max_samples or impurity; got ({criteria})"
             )
 
-        if feature_select not in ["random", "best", "mutual", "cfs", "fcbf"]:
+        if feature_select not in [
+            "random",
+            "best",
+            "mutual",
+            "cfs",
+            "fcbf",
+            "iwss",
+        ]:
             raise ValueError(
-                "splitter must be in {random, best, mutual, cfs, fcbf} got "
-                f"({feature_select})"
+                "splitter must be in {random, best, mutual, cfs, fcbf, iwss} "
+                f"got ({feature_select})"
             )
         self.criterion_function = getattr(self, f"_{self._criterion}")
         self.decision_criteria = getattr(self, f"_{self._criteria}")
@@ -409,6 +416,31 @@ class Splitter:
         mufs = MUFS(max_features=max_features, discrete=False)
         return mufs.fcbf(dataset, labels, 5e-4).get_results()
 
+    @staticmethod
+    def _fs_iwss(
+        dataset: np.array, labels: np.array, max_features: int
+    ) -> tuple:
+        """Correlation-based feature selection based on IWSS with max_features
+        limit
+
+        Parameters
+        ----------
+        dataset : np.array
+            array of samples
+        labels : np.array
+            labels of the dataset
+        max_features : int
+            number of features of the subspace
+            (< number of features in dataset)
+
+        Returns
+        -------
+        tuple
+            indices of the features selected
+        """
+        mufs = MUFS(max_features=max_features, discrete=False)
+        return mufs.iwss(dataset, labels, 0.25).get_results()
+
     def partition_impurity(self, y: np.array) -> np.array:
         return self.criterion_function(y)
diff --git a/stree/tests/Splitter_test.py b/stree/tests/Splitter_test.py
index 39f80f5..b360801 100644
--- a/stree/tests/Splitter_test.py
+++ b/stree/tests/Splitter_test.py
@@ -285,3 +285,15 @@ class Splitter_test(unittest.TestCase):
         Xs, computed = tcl.get_subspace(X, y, rs)
         self.assertListEqual(expected, list(computed))
         self.assertListEqual(X[:, expected].tolist(), Xs.tolist())
+
+    def test_get_iwss_subspaces(self):
+        results = [
+            (4, [1, 5, 9, 12]),
+            (6, [1, 5, 9, 12, 4, 15]),
+        ]
+        for rs, expected in results:
+            X, y = load_dataset(n_features=20, n_informative=7)
+            tcl = self.build(feature_select="iwss", random_state=rs)
+            Xs, computed = tcl.get_subspace(X, y, rs)
+            self.assertListEqual(expected, list(computed))
+            self.assertListEqual(X[:, expected].tolist(), Xs.tolist())
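For anyone who wants to exercise the new `iwss` splitter end to end, a minimal usage sketch against the public `Stree` estimator follows. The hyperparameter names come from the tables above; the synthetic dataset and the chosen `max_features` value are illustrative assumptions, not part of this patch.

```python
# Minimal sketch: fit an STree using the new "iwss" feature-set splitter.
# Assumes the stree package (and its mufs dependency) is installed; the
# dataset below is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from stree import Stree

X, y = make_classification(
    n_samples=500, n_features=20, n_informative=7, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The splitter strategy is only used when max_features < n_features.
clf = Stree(splitter="iwss", max_features=8, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```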