Compare commits


5 Commits

Author SHA1 Message Date
b044a057df Update comments and README.md 2021-11-02 14:04:10 +01:00
fc48bc8ba4 Update docs and version number 2021-11-02 12:17:46 +01:00
Ricardo Montañana Gómez
8251f07674 Fix Citation (#49) 2021-11-02 10:58:30 +01:00
Ricardo Montañana Gómez
0b15a5af11 Fix space in CITATION.cff 2021-11-02 00:25:21 +01:00
Ricardo Montañana Gómez
28d905368b Create CITATION.cff 2021-11-02 00:20:49 +01:00
7 changed files with 101 additions and 43 deletions

37
CITATION.cff Normal file

@@ -0,0 +1,37 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+authors:
+  - family-names: "Montañana"
+    given-names: "Ricardo"
+    orcid: "https://orcid.org/0000-0003-3242-5452"
+  - family-names: "Gámez"
+    given-names: "José A."
+    orcid: "https://orcid.org/0000-0003-1188-1117"
+  - family-names: "Puerta"
+    given-names: "José M."
+    orcid: "https://orcid.org/0000-0002-9164-5191"
+title: "STree"
+version: 1.0.2
+doi: 10.5281/zenodo.5504083
+date-released: 2021-11-02
+url: "https://github.com/Doctorado-ML/STree"
+preferred-citation:
+  type: article
+  authors:
+    - family-names: "Montañana"
+      given-names: "Ricardo"
+      orcid: "https://orcid.org/0000-0003-3242-5452"
+    - family-names: "Gámez"
+      given-names: "José A."
+      orcid: "https://orcid.org/0000-0003-1188-1117"
+    - family-names: "Puerta"
+      given-names: "José M."
+      orcid: "https://orcid.org/0000-0002-9164-5191"
+  doi: "10.1007/978-3-030-85713-4_6"
+  journal: "Lecture Notes in Computer Science"
+  month: 9
+  start: 54
+  end: 64
+  title: "STree: A Single Multi-class Oblique Decision Tree Based on Support Vector Machines"
+  volume: 12882
+  year: 2021


@@ -37,7 +37,7 @@ Can be found in [stree.readthedocs.io](https://stree.readthedocs.io/en/stable/)
## Hyperparameters
| | **Hyperparameter** | **Type/Values** | **Default** | **Meaning** |
| --- | ------------------- | ------------------------------------------------------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| --- | ------------------- | -------------------------------------------------------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| \* | C | \<float\> | 1.0 | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. |
| \* | kernel | {"liblinear", "linear", "poly", "rbf", "sigmoid"} | linear | Specifies the kernel type to be used in the algorithm. It must be one of liblinear, linear, poly or rbf. liblinear uses [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/) library and the rest uses [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) library through scikit-learn library |
| \* | max_iter | \<int\> | 1e5 | Hard limit on iterations within solver, or -1 for no limit. |
@@ -50,7 +50,7 @@ Can be found in [stree.readthedocs.io](https://stree.readthedocs.io/en/stable/)
| | criterion | {“gini”, “entropy”} | entropy | The function to measure the quality of a split (only used if max_features != num_features). <br>Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. |
| | min_samples_split | \<int\> | 0 | The minimum number of samples required to split an internal node. 0 (default) for any |
| | max_features | \<int\>, \<float\> <br><br>or {“auto”, “sqrt”, “log2”} | None | The number of features to consider when looking for the split:<br>If int, then consider max_features features at each split.<br>If float, then max_features is a fraction and int(max_features \* n_features) features are considered at each split.<br>If “auto”, then max_features=sqrt(n_features).<br>If “sqrt”, then max_features=sqrt(n_features).<br>If “log2”, then max_features=log2(n_features).<br>If None, then max_features=n_features. |
-| | splitter | {"best", "random", "mutual", "cfs", "fcbf", "iwss"} | "random" | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **random”**: The algorithm generates 5 candidates and choose the best (max. info. gain) of them. **trandom”**: The algorithm generates a true random combination. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter. **"iwss"**: IWSS based algorithm |
+| | splitter | {"best", "random", "trandom", "mutual", "cfs", "fcbf", "iwss"} | "random" | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **random”**: The algorithm generates 5 candidates and choose the best (max. info. gain) of them. **trandom”**: The algorithm generates only one random combination. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter. **"iwss"**: IWSS based algorithm |
| | normalize | \<bool\> | False | If standardization of features should be applied on each node with the samples that reach it |
| \* | multiclass_strategy | {"ovo", "ovr"} | "ovo" | Strategy to use with multiclass datasets, **"ovo"**: one versus one. **"ovr"**: one versus rest |
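
The `splitter` row above distinguishes "random" (generate 5 candidate feature combinations and keep the best) from "trandom" (generate only one random combination). The difference can be sketched as follows. This is a minimal illustration, not STree's implementation: `random_candidates`, `choose_features`, and the `score` callable (standing in for information gain) are hypothetical names introduced here.

```python
import random


def random_candidates(n_features, max_features, n_candidates, rng):
    """Draw n_candidates random feature subsets, each of size max_features."""
    return [
        tuple(sorted(rng.sample(range(n_features), max_features)))
        for _ in range(n_candidates)
    ]


def choose_features(strategy, n_features, max_features, score, rng=None):
    """Pick a feature subset following the 'random' or 'trandom' strategy.

    score is a hypothetical callable (feature subset -> float) that stands
    in for the information gain obtained when splitting on that subset.
    """
    if rng is None:
        rng = random.Random()
    if strategy == "trandom":
        # Only one random combination is generated and used as is.
        return random_candidates(n_features, max_features, 1, rng)[0]
    if strategy == "random":
        # Five candidates are generated and the best-scoring one is kept.
        candidates = random_candidates(n_features, max_features, 5, rng)
        return max(candidates, key=score)
    raise ValueError(f"strategy not covered by this sketch: {strategy}")
```

With the same seed, "trandom" returns the first subset drawn, while "random" returns whichever of the first five subsets scores highest, so "random" costs five score evaluations per node instead of none.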
@@ -73,3 +73,7 @@ python -m unittest -v stree.tests
## License
STree is [MIT](https://github.com/doctorado-ml/stree/blob/master/LICENSE) licensed
+## Reference
+R. Montañana, J. A. Gámez, J. M. Puerta, "STree: a single multi-class oblique decision tree based on support vector machines.", 2021 LNAI 12882, pg. 54-64


@@ -54,4 +54,4 @@ html_theme = "sphinx_rtd_theme"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
-html_static_path = ["_static"]
+html_static_path = []


@@ -1,7 +1,7 @@
# Hyperparameters
| | **Hyperparameter** | **Type/Values** | **Default** | **Meaning** |
| --- | ------------------- | ------------------------------------------------------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| --- | ------------------- | -------------------------------------------------------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| \* | C | \<float\> | 1.0 | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. |
| \* | kernel | {"liblinear", "linear", "poly", "rbf", "sigmoid"} | linear | Specifies the kernel type to be used in the algorithm. It must be one of liblinear, linear, poly or rbf. liblinear uses [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/) library and the rest uses [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) library through scikit-learn library |
| \* | max_iter | \<int\> | 1e5 | Hard limit on iterations within solver, or -1 for no limit. |
@@ -14,7 +14,7 @@
| | criterion | {“gini”, “entropy”} | entropy | The function to measure the quality of a split (only used if max_features != num_features). <br>Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. |
| | min_samples_split | \<int\> | 0 | The minimum number of samples required to split an internal node. 0 (default) for any |
| | max_features | \<int\>, \<float\> <br><br>or {“auto”, “sqrt”, “log2”} | None | The number of features to consider when looking for the split:<br>If int, then consider max_features features at each split.<br>If float, then max_features is a fraction and int(max_features \* n_features) features are considered at each split.<br>If “auto”, then max_features=sqrt(n_features).<br>If “sqrt”, then max_features=sqrt(n_features).<br>If “log2”, then max_features=log2(n_features).<br>If None, then max_features=n_features. |
-| | splitter | {"best", "random", "mutual", "cfs", "fcbf", "iwss"} | "random" | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **random”**: The algorithm generates 5 candidates and choose the best (max. info. gain) of them. **trandom”**: The algorithm generates a true random combination. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter. **"iwss"**: IWSS based algorithm |
+| | splitter | {"best", "random", "trandom", "mutual", "cfs", "fcbf", "iwss"} | "random" | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **random”**: The algorithm generates 5 candidates and choose the best (max. info. gain) of them. **trandom”**: The algorithm generates only one random combination. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter. **"iwss"**: IWSS based algorithm |
| | normalize | \<bool\> | False | If standardization of features should be applied on each node with the samples that reach it |
| \* | multiclass_strategy | {"ovo", "ovr"} | "ovo" | Strategy to use with multiclass datasets, **"ovo"**: one versus one. **"ovr"**: one versus rest |
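
The `max_features` row above maps each accepted value to a number of features. A minimal sketch of that mapping follows. `resolve_max_features` is a hypothetical helper, not part of STree, and truncating the "sqrt" and "log2" results to an integer is an assumption (the table only states the formulas).

```python
from math import log2, sqrt


def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a feature count (sketch)."""
    if max_features is None:
        # None: consider every feature.
        return n_features
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            # Assumption: the square root is truncated to an integer.
            return int(sqrt(n_features))
        if max_features == "log2":
            # Assumption: the logarithm is truncated to an integer.
            return int(log2(n_features))
        raise ValueError(f"unknown max_features value: {max_features}")
    if isinstance(max_features, int):
        # int: used directly as the number of features.
        return max_features
    if isinstance(max_features, float):
        # float: a fraction of the available features, as in the table.
        return int(max_features * n_features)
    raise ValueError(f"unsupported max_features type: {type(max_features)}")
```

For example, with 16 features both "auto" and "sqrt" resolve to 4, and 0.5 resolves to 8.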


@@ -202,7 +202,8 @@ class Splitter:
max_features < num_features). Supported strategies are: “best”: sklearn
SelectKBest algorithm is used in every node to choose the max_features
best features. “random”: The algorithm generates 5 candidates and
-choose the best (max. info. gain) of them. "mutual": Chooses the best
+choose the best (max. info. gain) of them. “trandom”: The algorithm
+generates only one random combination. "mutual": Chooses the best
features w.r.t. their mutual info with the label. "cfs": Apply
Correlation-based Feature Selection. "fcbf": Apply Fast Correlation-
Based, by default None
@@ -478,6 +479,18 @@ class Splitter:
     @staticmethod
     def _entropy(y: np.array) -> float:
+        """Compute entropy of a labels set
+        Parameters
+        ----------
+        y : np.array
+            set of labels
+        Returns
+        -------
+        float
+            entropy
+        """
         n_labels = len(y)
         if n_labels <= 1:
             return 0
@@ -485,10 +498,13 @@ class Splitter:
         proportions = counts / n_labels
         n_classes = np.count_nonzero(proportions)
         if n_classes <= 1:
-            return 0.0
-        from scipy.stats import entropy
-        return entropy(y, base=n_classes)
+            return 0
+        entropy = 0.0
+        # Compute standard entropy.
+        for prop in proportions:
+            if prop != 0.0:
+                entropy -= prop * log(prop, n_classes)
+        return entropy
     def information_gain(
         self, labels: np.array, labels_up: np.array, labels_dn: np.array
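
The loop added in `_entropy` above computes the entropy of the label proportions using the number of classes present as the logarithm base, so the result lies between 0 and 1. A standalone sketch of the same calculation, using only the standard library, is shown here. `normalized_entropy` is a hypothetical name, and `collections.Counter` replaces the `counts` array that the diff context does not show. One possible reason for dropping the `scipy.stats.entropy(y, ...)` call, offered only as a guess, is that SciPy treats its first argument as a distribution, so passing raw labels does not measure label proportions.

```python
from collections import Counter
from math import log


def normalized_entropy(labels):
    """Entropy of the label proportions, with base = number of classes."""
    n_labels = len(labels)
    if n_labels <= 1:
        return 0.0
    counts = Counter(labels)
    n_classes = len(counts)  # only classes that actually occur
    if n_classes <= 1:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        prop = count / n_labels
        # Every counted class occurs at least once, so prop > 0 here.
        entropy -= prop * log(prop, n_classes)
    return entropy
```

A perfectly balanced label set yields 1 regardless of how many classes it has, and a single-class set yields 0.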


@@ -82,7 +82,8 @@ class Stree(BaseEstimator, ClassifierMixin):
max_features < num_features). Supported strategies are: “best”: sklearn
SelectKBest algorithm is used in every node to choose the max_features
best features. “random”: The algorithm generates 5 candidates and
-choose the best (max. info. gain) of them. "mutual": Chooses the best
+choose the best (max. info. gain) of them. “trandom”: The algorithm
+generates only one random combination. "mutual": Chooses the best
features w.r.t. their mutual info with the label. "cfs": Apply
Correlation-based Feature Selection. "fcbf": Apply Fast Correlation-
Based , by default "random"
@@ -128,7 +129,7 @@ class Stree(BaseEstimator, ClassifierMixin):
References
----------
R. Montañana, J. A. Gámez, J. M. Puerta, "STree: a single multi-class
-oblique decision tree based on support vector machines.", 2021 LNAI...
+oblique decision tree based on support vector machines.", 2021 LNAI 12882
"""


@@ -1,6 +1,6 @@
from .Strees import Stree, Siterator
-__version__ = "1.2.1"
+__version__ = "1.2.2"
__author__ = "Ricardo Montañana Gómez"
__copyright__ = "Copyright 2020-2021, Ricardo Montañana Gómez"