Change entropy function with scipy (#38)

Update benchmark hyperparams of STree
Remove obsolete binder links
2025-08-17 16:36:01 +00:00 · 2021-11-01 18:41:15 +01:00 · 2021-10-31 12:41:30 +01:00 · 2021-10-31 11:51:31 +01:00 · 2021-10-29 12:59:03 +02:00 · 2021-10-29 11:49:46 +02:00
14 changed files with 323 additions and 76 deletions
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -26,7 +26,6 @@ jobs:
          pip install -q --upgrade pip
          pip install -q -r requirements.txt
          pip install -q --upgrade codecov coverage black flake8 codacy-coverage
-          pip install -q git+https://github.com/doctorado-ml/mfs
      - name: Lint
        run: |
          black --check --diff stree
--- a/1
+++ b/1
@@ -26,6 +26,7 @@ doc:  ## Update documentation

 build:  ## Build package
 	rm -fr dist/*
+	rm -fr build/*
 	python setup.py sdist bdist_wheel

 doc-clean:  ## Update documentation
--- a/README.md
+++ b/README.md
@@ -2,6 +2,9 @@
 [![codecov](https://codecov.io/gh/doctorado-ml/stree/branch/master/graph/badge.svg)](https://codecov.io/gh/doctorado-ml/stree)
 [![Codacy Badge](https://app.codacy.com/project/badge/Grade/35fa3dfd53a24a339344b33d9f9f2f3d)](https://www.codacy.com/gh/Doctorado-ML/STree?utm_source=github.com&utm_medium=referral&utm_content=Doctorado-ML/STree&utm_campaign=Badge_Grade)
 [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/Doctorado-ML/STree.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/Doctorado-ML/STree/context:python)
+[![PyPI version](https://badge.fury.io/py/STree.svg)](https://badge.fury.io/py/STree)
+![https://img.shields.io/badge/python-3.8%2B-blue](https://img.shields.io/badge/python-3.8%2B-brightgreen)
+[![DOI](https://zenodo.org/badge/262658230.svg)](https://zenodo.org/badge/latestdoi/262658230)

 # STree

@@ -23,8 +26,6 @@ Can be found in [stree.readthedocs.io](https://stree.readthedocs.io/en/stable/)

 ### Jupyter notebooks

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/Doctorado-ML/STree/master?urlpath=lab/tree/notebooks/benchmark.ipynb) Benchmark
-
 - [![benchmark](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Doctorado-ML/STree/blob/master/notebooks/benchmark.ipynb) Benchmark

 - [![features](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Doctorado-ML/STree/blob/master/notebooks/features.ipynb) Some features
@@ -36,7 +37,7 @@ Can be found in [stree.readthedocs.io](https://stree.readthedocs.io/en/stable/)
 ## Hyperparameters

 |     | **Hyperparameter**  | **Type/Values**                                        | **Default** | **Meaning**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
-| --- | ------------------- | ------------------------------------------------------ | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| --- | ------------------- | ------------------------------------------------------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | \*  | C                   | \<float\>                                              | 1.0         | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
 | \*  | kernel              | {"liblinear", "linear", "poly", "rbf", "sigmoid"}      | linear      | Specifies the kernel type to be used in the algorithm. It must be one of ‘liblinear’, ‘linear’, ‘poly’ or ‘rbf’. liblinear uses [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/) library and the rest uses [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) library through scikit-learn library                                                                                                                                                                                                                                                                                                                        |
 | \*  | max_iter            | \<int\>                                                | 1e5         | Hard limit on iterations within solver, or -1 for no limit.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
@@ -49,11 +50,10 @@ Can be found in [stree.readthedocs.io](https://stree.readthedocs.io/en/stable/)
 |     | criterion           | {“gini”, “entropy”}                                    | entropy     | The function to measure the quality of a split (only used if max_features != num_features). <br>Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.                                                                                                                                                                                                                                                                                                                                                                                                                                           |
 |     | min_samples_split   | \<int\>                                                | 0           | The minimum number of samples required to split an internal node. 0 (default) for any                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
 |     | max_features        | \<int\>, \<float\> <br><br>or {“auto”, “sqrt”, “log2”} | None        | The number of features to consider when looking for the split:<br>If int, then consider max_features features at each split.<br>If float, then max_features is a fraction and int(max_features \* n_features) features are considered at each split.<br>If “auto”, then max_features=sqrt(n_features).<br>If “sqrt”, then max_features=sqrt(n_features).<br>If “log2”, then max_features=log2(n_features).<br>If None, then max_features=n_features.                                                                                                                                                                                  |
-|     | splitter            | {"best", "random", "mutual", "cfs", "fcbf"}            | "random"    | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **“best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **“random”**: The algorithm generates 5 candidates and choose the best (max. info. gain) of them. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter |
+|     | splitter            | {"best", "random", "trandom", "mutual", "cfs", "fcbf", "iwss"} | "random"    | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **“best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **“random”**: The algorithm generates 5 candidates and chooses the best (max. info. gain) of them. **“trandom”**: The algorithm generates a true random combination. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter. **"iwss"**: IWSS based algorithm |
 |     | normalize           | \<bool\>                                               | False       | If standardization of features should be applied on each node with the samples that reach it                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
 | \*  | multiclass_strategy | {"ovo", "ovr"}                                         | "ovo"       | Strategy to use with multiclass datasets, **"ovo"**: one versus one. **"ovr"**: one versus rest                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |

-
 \* Hyperparameter used by the support vector classifier of every node

 \*\* **Splitting in a STree node**
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -1,4 +1,4 @@
 sphinx
 sphinx-rtd-theme
 myst-parser
-git+https://github.com/doctorado-ml/stree
+mufs
--- a/docs/source/example.md
+++ b/docs/source/example.md
@@ -2,8 +2,6 @@

 ## Notebooks

- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/Doctorado-ML/STree/master?urlpath=lab/tree/notebooks/benchmark.ipynb) Benchmark
-
 - [![benchmark](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Doctorado-ML/STree/blob/master/notebooks/benchmark.ipynb) Benchmark

 - [![features](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Doctorado-ML/STree/blob/master/notebooks/features.ipynb) Some features
--- a/docs/source/hyperparameters.md
+++ b/docs/source/hyperparameters.md
@@ -1,7 +1,7 @@
 # Hyperparameters

 |     | **Hyperparameter**  | **Type/Values**                                        | **Default** | **Meaning**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
-| --- | ------------------- | ------------------------------------------------------ | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| --- | ------------------- | ------------------------------------------------------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | \*  | C                   | \<float\>                                              | 1.0         | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
 | \*  | kernel              | {"liblinear", "linear", "poly", "rbf", "sigmoid"}      | linear      | Specifies the kernel type to be used in the algorithm. It must be one of ‘liblinear’, ‘linear’, ‘poly’ or ‘rbf’. liblinear uses [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/) library and the rest uses [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) library through scikit-learn library                                                                                                                                                                                                                                                                                                                        |
 | \*  | max_iter            | \<int\>                                                | 1e5         | Hard limit on iterations within solver, or -1 for no limit.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
@@ -14,7 +14,7 @@
 |     | criterion           | {“gini”, “entropy”}                                    | entropy     | The function to measure the quality of a split (only used if max_features != num_features). <br>Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.                                                                                                                                                                                                                                                                                                                                                                                                                                           |
 |     | min_samples_split   | \<int\>                                                | 0           | The minimum number of samples required to split an internal node. 0 (default) for any                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
 |     | max_features        | \<int\>, \<float\> <br><br>or {“auto”, “sqrt”, “log2”} | None        | The number of features to consider when looking for the split:<br>If int, then consider max_features features at each split.<br>If float, then max_features is a fraction and int(max_features \* n_features) features are considered at each split.<br>If “auto”, then max_features=sqrt(n_features).<br>If “sqrt”, then max_features=sqrt(n_features).<br>If “log2”, then max_features=log2(n_features).<br>If None, then max_features=n_features.                                                                                                                                                                                  |
-|     | splitter            | {"best", "random", "mutual", "cfs", "fcbf"}            | "random"    | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **“best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **“random”**: The algorithm generates 5 candidates and choose the best (max. info. gain) of them. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter |
+|     | splitter            | {"best", "random", "trandom", "mutual", "cfs", "fcbf", "iwss"} | "random"    | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies are: **“best”**: sklearn SelectKBest algorithm is used in every node to choose the max_features best features. **“random”**: The algorithm generates 5 candidates and chooses the best (max. info. gain) of them. **“trandom”**: The algorithm generates a true random combination. **"mutual"**: Chooses the best features w.r.t. their mutual info with the label. **"cfs"**: Apply Correlation-based Feature Selection. **"fcbf"**: Apply Fast Correlation-Based Filter. **"iwss"**: IWSS based algorithm |
 |     | normalize           | \<bool\>                                               | False       | If standardization of features should be applied on each node with the samples that reach it                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
 | \*  | multiclass_strategy | {"ovo", "ovr"}                                         | "ovo"       | Strategy to use with multiclass datasets, **"ovo"**: one versus one. **"ovr"**: one versus rest                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |

--- a/docs/source/stree.md
+++ b/docs/source/stree.md
@@ -1,9 +1,12 @@
 # STree

-[![Codeship Status for Doctorado-ML/STree](https://app.codeship.com/projects/8b2bd350-8a1b-0138-5f2c-3ad36f3eb318/status?branch=master)](https://app.codeship.com/projects/399170)
+![CI](https://github.com/Doctorado-ML/STree/workflows/CI/badge.svg)
 [![codecov](https://codecov.io/gh/doctorado-ml/stree/branch/master/graph/badge.svg)](https://codecov.io/gh/doctorado-ml/stree)
 [![Codacy Badge](https://app.codacy.com/project/badge/Grade/35fa3dfd53a24a339344b33d9f9f2f3d)](https://www.codacy.com/gh/Doctorado-ML/STree?utm_source=github.com&utm_medium=referral&utm_content=Doctorado-ML/STree&utm_campaign=Badge_Grade)
 [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/Doctorado-ML/STree.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/Doctorado-ML/STree/context:python)
+[![PyPI version](https://badge.fury.io/py/STree.svg)](https://badge.fury.io/py/STree)
+![https://img.shields.io/badge/python-3.8%2B-blue](https://img.shields.io/badge/python-3.8%2B-brightgreen)
+[![DOI](https://zenodo.org/badge/262658230.svg)](https://zenodo.org/badge/latestdoi/262658230)

 Oblique Tree classifier based on SVM nodes. The nodes are built and splitted with sklearn SVC models. Stree is a sklearn estimator and can be integrated in pipelines, grid searches, etc.

--- a/notebooks/benchmark.ipynb
+++ b/notebooks/benchmark.ipynb
@@ -178,7 +178,7 @@
   "outputs": [],
   "source": [
    "# Stree\n",
-    "stree = Stree(random_state=random_state, C=.01, max_iter=1e3)"
+    "stree = Stree(random_state=random_state, C=.01, max_iter=1e3, kernel=\"liblinear\", multiclass_strategy=\"ovr\")"
   ]
  },
  {
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,2 +1,2 @@
 scikit-learn>0.24
-mfs
+mufs
--- a/setup.py
+++ b/setup.py
@@ -44,7 +44,7 @@ setuptools.setup(
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
        "Intended Audience :: Science/Research",
    ],
-    install_requires=["scikit-learn", "numpy", "mfs"],
+    install_requires=["scikit-learn", "mufs"],
    test_suite="stree.tests",
    zip_safe=False,
 )
--- a/stree/Splitter.py
+++ b/stree/Splitter.py
@@ -12,12 +12,32 @@ from sklearn.feature_selection import SelectKBest, mutual_info_classif
 from sklearn.preprocessing import StandardScaler
 from sklearn.svm import SVC
 from sklearn.exceptions import ConvergenceWarning
-from mfs import MFS
+from mufs import MUFS


 class Snode:
-    """Nodes of the tree that keeps the svm classifier and if testing the
+    """
+    Nodes of the tree that keeps the svm classifier and if testing the
    dataset assigned to it
+
+    Parameters
+    ----------
+    clf : SVC
+        Classifier used
+    X : np.ndarray
+        input dataset in train time (only in testing)
+    y : np.ndarray
+        input labels in train time
+    features : np.array
+        features used to compute hyperplane
+    impurity : float
+        impurity of the node
+    title : str
+        label describing the route to the node
+    weight : np.ndarray, optional
+        weights applied to input dataset in train time, by default None
+    scaler : StandardScaler, optional
+        scaler used if any, by default None
    """

    def __init__(
@@ -165,6 +185,55 @@ class Siterator:


 class Splitter:
+    """
+    Splits a dataset in two based on different criteria
+
+    Parameters
+    ----------
+    clf : SVC, optional
+        classifier, by default None
+    criterion : str, optional
+        The function to measure the quality of a split (only used if
+        max_features != num_features). Supported criteria are “gini” for the
+        Gini impurity and “entropy” for the information gain, by default
+        None
+    feature_select : str, optional
+        The strategy used to choose the feature set at each node (only used if
+        max_features < num_features). Supported strategies are: “best”: sklearn
+        SelectKBest algorithm is used in every node to choose the max_features
+        best features. “random”: The algorithm generates 5 candidates and
+        chooses the best (max. info. gain) of them. "mutual": Chooses the best
+        features w.r.t. their mutual info with the label. "cfs": Apply
+        Correlation-based Feature Selection. "fcbf": Apply Fast
+        Correlation-Based Filter, by default None
+    criteria : str, optional
+        Decides (just in case of a multi class classification) which column
+        (class) to use to split the dataset in a node. max_samples is
+        incompatible with 'ovo' multiclass_strategy, by default None
+    min_samples_split : int, optional
+        The minimum number of samples required to split an internal node. 0
+        (default) for any, by default None
+    random_state : optional
+        Controls the pseudo random number generation for shuffling the data for
+        probability estimates. Ignored when probability is False. Pass an int
+        for reproducible output across multiple function calls, by
+        default None
+    normalize : bool, optional
+        If standardization of features should be applied on each node with the
+        samples that reach it, by default False
+
+    Raises
+    ------
+    ValueError
+        clf has to be a sklearn estimator
+    ValueError
+        criterion must be gini or entropy
+    ValueError
+        criteria has to be max_samples or impurity
+    ValueError
+        splitter must be in {random, trandom, best, mutual, cfs, fcbf, iwss}
+    """
+
    def __init__(
        self,
        clf: SVC = None,
@@ -175,6 +244,7 @@ class Splitter:
        random_state=None,
        normalize=False,
    ):
+
        self._clf = clf
        self._random_state = random_state
        if random_state is not None:
@@ -201,10 +271,19 @@ class Splitter:
                f"criteria has to be max_samples or impurity; got ({criteria})"
            )

-        if feature_select not in ["random", "best", "mutual", "cfs", "fcbf"]:
+        if feature_select not in [
+            "random",
+            "trandom",
+            "best",
+            "mutual",
+            "cfs",
+            "fcbf",
+            "iwss",
+        ]:
            raise ValueError(
-                "splitter must be in {random, best, mutual, cfs, fcbf} got "
-                f"({feature_select})"
+                "splitter must be in {random, trandom, best, mutual, cfs, "
+                "fcbf, iwss} "
+                f"got ({feature_select})"
            )
        self.criterion_function = getattr(self, f"_{self._criterion}")
        self.decision_criteria = getattr(self, f"_{self._criteria}")
@@ -235,6 +314,31 @@ class Splitter:
        features_sets = self._generate_spaces(n_features, max_features)
        return self._select_best_set(dataset, labels, features_sets)

+    @staticmethod
+    def _fs_trandom(
+        dataset: np.array, labels: np.array, max_features: int
+    ) -> tuple:
+        """Return a random feature set combination
+
+        Parameters
+        ----------
+        dataset : np.array
+            array of samples
+        labels : np.array
+            labels of the dataset
+        max_features : int
+            number of features of the subspace
+            (< number of features in dataset)
+
+        Returns
+        -------
+        tuple
+            indices of the features selected
+        """
+        # Random feature reduction
+        n_features = dataset.shape[1]
+        return tuple(sorted(random.sample(range(n_features), max_features)))
+
    @staticmethod
    def _fs_best(
        dataset: np.array, labels: np.array, max_features: int
@@ -312,8 +416,8 @@ class Splitter:
        tuple
            indices of the features selected
        """
-        mfs = MFS(max_features=max_features, discrete=False)
-        return mfs.cfs(dataset, labels).get_results()
+        mufs = MUFS(max_features=max_features, discrete=False)
+        return mufs.cfs(dataset, labels).get_results()

    @staticmethod
    def _fs_fcbf(
@@ -336,8 +440,33 @@ class Splitter:
        tuple
            indices of the features selected
        """
-        mfs = MFS(max_features=max_features, discrete=False)
-        return mfs.fcbf(dataset, labels, 5e-4).get_results()
+        mufs = MUFS(max_features=max_features, discrete=False)
+        return mufs.fcbf(dataset, labels, 5e-4).get_results()
+
+    @staticmethod
+    def _fs_iwss(
+        dataset: np.array, labels: np.array, max_features: int
+    ) -> tuple:
+        """Correlation-based feature selection based on iwss with max_features
+        limit
+
+        Parameters
+        ----------
+        dataset : np.array
+            array of samples
+        labels : np.array
+            labels of the dataset
+        max_features : int
+            number of features of the subspace
+            (< number of features in dataset)
+
+        Returns
+        -------
+        tuple
+            indices of the features selected
+        """
+        mufs = MUFS(max_features=max_features, discrete=False)
+        return mufs.iwss(dataset, labels, 0.25).get_results()

    def partition_impurity(self, y: np.array) -> np.array:
        return self.criterion_function(y)
@@ -349,18 +478,6 @@ class Splitter:

    @staticmethod
    def _entropy(y: np.array) -> float:
-        """Compute entropy of a labels set
-
-        Parameters
-        ----------
-        y : np.array
-            set of labels
-
-        Returns
-        -------
-        float
-            entropy
-        """
        n_labels = len(y)
        if n_labels <= 1:
            return 0
@@ -368,13 +485,10 @@ class Splitter:
        proportions = counts / n_labels
        n_classes = np.count_nonzero(proportions)
        if n_classes <= 1:
-            return 0
-        entropy = 0.0
-        # Compute standard entropy.
-        for prop in proportions:
-            if prop != 0.0:
-                entropy -= prop * log(prop, n_classes)
-        return entropy
+            return 0.0
+        from scipy.stats import entropy
+
+        return entropy(proportions, base=n_classes)

    def information_gain(
        self, labels: np.array, labels_up: np.array, labels_dn: np.array
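
Review note on the `_entropy` change above: `scipy.stats.entropy` expects a distribution (probabilities or unnormalized counts) as its first argument, not the raw label vector. A minimal standalone sketch of the same normalized entropy follows. The function name `normalized_entropy` is hypothetical and is not part of the Splitter API.

```python
import numpy as np
from scipy.stats import entropy


def normalized_entropy(y: np.ndarray) -> float:
    """Entropy of a label vector, with log base = number of classes present.

    Hypothetical helper for illustration; not the project's implementation.
    """
    if len(y) <= 1:
        return 0.0
    # Count occurrences of each distinct label (works for any label dtype).
    _, counts = np.unique(y, return_counts=True)
    n_classes = len(counts)
    if n_classes <= 1:
        return 0.0
    # scipy normalizes the counts to probabilities before computing entropy.
    return float(entropy(counts, base=n_classes))


if __name__ == "__main__":
    print(normalized_entropy(np.array([0, 0, 1, 1])))
    print(normalized_entropy(np.array([0, 0, 0, 1])))
```

Using the number of observed classes as the log base keeps the result in [0, 1], which matches the loop that this change removes.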
--- a/stree/Strees.py
+++ b/stree/Strees.py
@@ -20,11 +20,117 @@ from .Splitter import Splitter, Snode, Siterator


 class Stree(BaseEstimator, ClassifierMixin):
-    """Estimator that is based on binary trees of svm nodes
+    """
+    Estimator that is based on binary trees of svm nodes
    can deal with sample_weights in predict, used in boosting sklearn methods
    inheriting from BaseEstimator implements get_params and set_params methods
    inheriting from ClassifierMixin implement the attribute _estimator_type
    with "classifier" as value
+
+    Parameters
+    ----------
+    C : float, optional
+        Regularization parameter. The strength of the regularization is
+        inversely proportional to C. Must be strictly positive, by default 1.0
+    kernel : str, optional
+        Specifies the kernel type to be used in the algorithm. It must be one
+        of ‘liblinear’, ‘linear’, ‘poly’ or ‘rbf’. liblinear uses
+        [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/) library and
+        the rest use [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/)
+        library through scikit-learn, by default "linear"
+    max_iter : int, optional
+        Hard limit on iterations within solver, or -1 for no limit, by default
+        1e5
+    random_state : int, optional
+        Controls the pseudo random number generation for shuffling the data for
+        probability estimates. Ignored when probability is False. Pass an int
+        for reproducible output across multiple function calls, by
+        default None
+    max_depth : int, optional
+        Specifies the maximum depth of the tree, by default None
+    tol : float, optional
+        Tolerance for stopping, by default 1e-4
+    degree : int, optional
+        Degree of the polynomial kernel function (‘poly’). Ignored by all other
+        kernels, by default 3
+    gamma : str, optional
+        Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. If gamma='scale'
+        (default) is passed then it uses 1 / (n_features * X.var()) as value
+        of gamma, if ‘auto’, uses 1 / n_features, by default "scale"
+    split_criteria : str, optional
+        Decides (just in case of a multiclass classification) which column
+        (class) to use to split the dataset in a node. max_samples is
+        incompatible with 'ovo' multiclass_strategy, by default "impurity"
+    criterion : str, optional
+        The function to measure the quality of a split (only used if
+        max_features != num_features). Supported criteria are “gini” for the
+        Gini impurity and “entropy” for the information gain, by default
+        "entropy"
+    min_samples_split : int, optional
+        The minimum number of samples required to split an internal node. 0
+        (default) for any, by default 0
+    max_features : optional
+        The number of features to consider when looking for the split: If int,
+        then consider max_features features at each split. If float, then
+        max_features is a fraction and int(max_features * n_features) features
+        are considered at each split. If “auto”, then max_features=
+        sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features). If
+        “log2”, then max_features=log2(n_features). If None, then max_features=
+        n_features, by default None
+    splitter : str, optional
+        The strategy used to choose the feature set at each node (only used if
+        max_features < num_features). Supported strategies are: “best”: sklearn
+        SelectKBest algorithm is used in every node to choose the max_features
+        best features. “random”: The algorithm generates 5 candidates and
+        chooses the best (max. info. gain) of them. "mutual": Chooses the best
+        features w.r.t. their mutual info with the label. "cfs": Apply
+        Correlation-based Feature Selection. "fcbf": Apply Fast Correlation-
+        Based Filter, by default "random"
+    multiclass_strategy : str, optional
+        Strategy to use with multiclass datasets, "ovo": one versus one. "ovr":
+        one versus rest, by default "ovo"
+    normalize : bool, optional
+        If standardization of features should be applied on each node with the
+        samples that reach it, by default False
+
+    Attributes
+    ----------
+    classes_ : ndarray of shape (n_classes,)
+        The class labels.
+
+    n_classes_ : int
+        The number of classes
+
+    n_iter_ : int
+        Max number of iterations in classifier
+
+    depth_ : int
+        Max depth of the tree
+
+    n_features_ : int
+        The number of features when ``fit`` is performed.
+
+    n_features_in_ : int
+        Number of features seen during :term:`fit`.
+
+    max_features_ : int
+        Number of features to use in hyperplane computation
+
+    tree_ : Node
+        root of the tree
+
+    X_ : ndarray
+        points to the input dataset
+
+    y_ : ndarray
+        points to the input labels
+
+    References
+    ----------
+    R. Montañana, J. A. Gámez, J. M. Puerta, "STree: a single multi-class
+    oblique decision tree based on support vector machines.", 2021 LNAI...
+
+
    """

    def __init__(
@@ -45,6 +151,7 @@ class Stree(BaseEstimator, ClassifierMixin):
        multiclass_strategy: str = "ovo",
        normalize: bool = False,
    ):
+
        self.max_iter = max_iter
        self.C = C
        self.kernel = kernel
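
Review note on the `max_features` entry in the docstring above: the rules it lists can be sketched as a small resolver. This is only an illustration under stated assumptions. The helper name `resolve_max_features`, the truncation of the square root and logarithm with `int()`, and the floor of 1 are assumptions and are not taken from the Stree source.

```python
import math
from typing import Union


def resolve_max_features(
    max_features: Union[int, float, str, None], n_features: int
) -> int:
    """Turn a max_features setting into a concrete feature count.

    Hypothetical helper that mirrors the docstring rules, not Stree's code.
    """
    if max_features is None:
        # None: use every feature.
        return n_features
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            # Assumption: truncate and never go below one feature.
            return max(1, int(math.sqrt(n_features)))
        if max_features == "log2":
            return max(1, int(math.log2(n_features)))
        raise ValueError(f"Unknown max_features value: {max_features!r}")
    if isinstance(max_features, int):
        # int: taken as an absolute number of features.
        return max_features
    if isinstance(max_features, float):
        # float: taken as a fraction of the available features.
        return int(max_features * n_features)
    raise ValueError(f"Unsupported max_features type: {type(max_features)}")


if __name__ == "__main__":
    for setting in (None, "sqrt", "log2", 0.5, 3):
        print(setting, resolve_max_features(setting, 20))
```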
--- a/stree/__init__.py
+++ b/stree/__init__.py
@@ -1,6 +1,6 @@
 from .Strees import Stree, Siterator

-__version__ = "1.2"
+__version__ = "1.2.1"

 __author__ = "Ricardo Montañana Gómez"
 __copyright__ = "Copyright 2020-2021, Ricardo Montañana Gómez"
--- a/stree/tests/Splitter_test.py
+++ b/stree/tests/Splitter_test.py
@@ -285,3 +285,28 @@ class Splitter_test(unittest.TestCase):
            Xs, computed = tcl.get_subspace(X, y, rs)
            self.assertListEqual(expected, list(computed))
            self.assertListEqual(X[:, expected].tolist(), Xs.tolist())
+
+    def test_get_iwss_subspaces(self):
+        results = [
+            (4, [1, 5, 9, 12]),
+            (6, [1, 5, 9, 12, 4, 15]),
+        ]
+        for rs, expected in results:
+            X, y = load_dataset(n_features=20, n_informative=7)
+            tcl = self.build(feature_select="iwss", random_state=rs)
+            Xs, computed = tcl.get_subspace(X, y, rs)
+            self.assertListEqual(expected, list(computed))
+            self.assertListEqual(X[:, expected].tolist(), Xs.tolist())
+
+    def test_get_trandom_subspaces(self):
+        results = [
+            (4, [3, 7, 9, 12]),
+            (6, [0, 1, 2, 8, 15, 18]),
+            (7, [1, 2, 4, 8, 10, 12, 13]),
+        ]
+        for rs, expected in results:
+            X, y = load_dataset(n_features=20, n_informative=7)
+            tcl = self.build(feature_select="trandom", random_state=rs)
+            Xs, computed = tcl.get_subspace(X, y, rs)
+            self.assertListEqual(expected, list(computed))
+            self.assertListEqual(X[:, expected].tolist(), Xs.tolist())
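
Review note on `test_get_trandom_subspaces` above: the expected index lists are sorted and contain no repeats, which is consistent with sampling feature indices without replacement. A minimal sketch of that idea follows. The name `random_subspace` and the use of `numpy.random.default_rng` are assumptions for illustration, so this will not reproduce the exact indices expected by the test.

```python
import numpy as np


def random_subspace(n_features: int, max_features: int, seed: int) -> tuple:
    """Pick max_features distinct feature indices uniformly at random.

    Hypothetical helper; the Splitter's "trandom" code may differ.
    """
    rng = np.random.default_rng(seed)
    # Sample without replacement so no feature index is repeated.
    chosen = rng.choice(n_features, size=max_features, replace=False)
    # Return sorted indices, matching the ordering seen in the tests.
    return tuple(sorted(int(i) for i in chosen))


if __name__ == "__main__":
    X = np.arange(60).reshape(3, 20)
    features = random_subspace(n_features=20, max_features=4, seed=4)
    # Column selection yields the reduced dataset for the node.
    print(features, X[:, features].shape)
```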
Author	SHA1	Message	Date
Ricardo Montañana	7a625eee09	Change entropy function with scipy (#38 )	2021-11-01 18:41:15 +01:00
Ricardo Montañana	e5d49132ec	Update benchmark hyperparams os STree	2021-10-31 12:41:30 +01:00
Ricardo Montañana	8daecc4726	Remove obsolete binder links	2021-10-31 11:51:31 +01:00
Ricardo Montañana Gómez	bf678df159	(#46 ) Implement true random feature selection (#48 ) * (#46) Implement true random feature selection	2021-10-29 12:59:03 +02:00
Ricardo Montañana Gómez	36b08b1bcf	Implement iwss feature selection (#45 ) (#47 )	2021-10-29 11:49:46 +02:00
Ricardo Montañana	36ff3da26d	Update Docs	2021-09-13 18:32:59 +02:00
Ricardo Montañana Gómez	6b281ebcc8	Add DOI to README	2021-09-13 18:23:11 +02:00
Ricardo Montañana Gómez	3aaddd096f	Add package version badge in README	2021-08-17 12:00:36 +02:00
Ricardo Montañana Gómez	15a5a4c407	Add python 3.8 badge to README Add badge from shields.io	2021-08-12 11:05:07 +02:00
Ricardo Montañana Gómez	0afe14a447	Mfstomufs #43 (#44 ) * Implement module mfs changed name to mufs * Update github CI file	2021-08-02 18:03:59 +02:00