Initial Commit

2025-08-18 17:06:02 +00:00 · 2020-11-20 11:23:40 +01:00
commit 5611e5bc01
2914 changed files with 2625178 additions and 0 deletions
--- a/data/tanveer/musk-2/clean2.info
+++ b/data/tanveer/musk-2/clean2.info
@@ -0,0 +1,153 @@
+1. Title: MUSK "Clean2" database
+
+2. Sources:
+   (a) Creators:  AI Group at Arris Pharmaceutical Corporation
+        contact:  David Chapman or Ajay Jain
+                  Arris Pharmaceutical Corporation
+                  385 Oyster Point Blvd.
+                  South San Francisco, CA 94080
+                  415-737-8600
+                  zvona@arris.com, jain@arris.com
+   (b) Donor:     Tom Dietterich
+                  Department of Computer Science
+                  Oregon State University
+                  Corvallis, OR 97331
+                  503-737-5559
+                  tgd@cs.orst.edu
+   (c) Date received: September 12, 1994
+
+3. Past Usage:
+
+   (a) Dietterich, T. G., Jain, A., Lathrop, R., Lozano-Perez, T. (1994).
+       A comparison of dynamic reposing and tangent distance for drug
+       activity prediction.  Advances in Neural Information Processing
+       Systems, 6.  San Mateo, CA: Morgan Kaufmann.  216--223.
+
+       The clean2 dataset included here is derived from the starting
+       poses employed in this paper.  The paper reports the following
+       results:
+
+       Algorithm:                                 20-fold XVAL:
+       1-nearest neighbor (euclidean distance)    75%
+       neural network (standard poses)            75%
+       1-nearest neighbor (tangent distance)      79%
+       neural network (dynamic reposing)          91%
+
+       The tangent distance and dynamic reposing techniques require
+       computation of the molecular surface, which cannot be done
+       using the feature vectors included in this data set.
+
+   (b) Jain, A. N., Dietterich, T. G., Lathrop, R. H., 
+       Chapman, D., Critchlow, R. E., Bauer, B. E., Webster, T. A.,
+       Lozano-Perez, T.  Compass: A shape-based machine learning tool for
+       drug design.  Accepted for publication in Computer-Aided
+       Molecular Design. 
+
+       This paper describes the dynamic reposing technique in more
+       detail and reports the same result for dynamic reposing as
+       above.  The paper also gives a complete description of each of
+       the 102 molecules in the data set.
+
+   (c) Dietterich, T. G., Lathrop, R. H., Lozano-Perez, T. (submitted)
+       Solving the multiple-instance problem with axis-parallel rectangles.
+       Submitted to Artificial Intelligence.
+
+       This paper describes a family of axis-parallel rectangle
+       algorithms and compares various approaches to the multiple
+       instance problem.  It includes the following table:
+
+        Algorithm             TP FN FP TN errs %correct [CI]
+        iterated-discrim APR  30  9  2 61  11  89.2 [83.2--95.2]
+        GFS elim-kde APR      32  7 13 50  20  80.4 [72.7--88.1]
+        GFS elim-count APR    31  8 17 46  25  75.5 [67.1--83.8]
+        all-positive APR      34  5 23 40  28  72.6 [63.9--81.2]
+        backpropagation       16 23 10 53  33  67.7 [58.6--76.7]
+        GFS all-positive APR  37  2 32 31  34  66.7 [57.5--75.8]
+        most frequent class    0 39  0 63  39  61.8 [52.3--71.2]
+        C4.5 (pruned)         32  7 35 28  42  58.8 [49.3--68.4]
+        
+        key: TP = true positives
+             FN = false negatives
+             FP = false positives
+             TN = true negatives
+             errs = errors = FN+FP
+             %correct = 10-fold cross-validation %correct.
+             CI = 95% confidence interval on proportion of correct
+             predictions.
+             For explanations of the various algorithms, see the
+             paper. 
+
+        C4.5 and backprop were applied ignoring the multiple instance
+        problem (see below) during training, but obeying it during
+        testing.  
+
+        This paper also gives more details on the construction of the
+        data set. 
+
+4. Relevant Information:
+   This dataset describes a set of 102 molecules of which 39 are judged
+   by human experts to be musks and the remaining 63 molecules are
+   judged to be non-musks.  The goal is to learn to predict whether
+   new molecules will be musks or non-musks.  However, the 166 features
+   that describe these molecules depend upon the exact shape, or
+   conformation, of the molecule.  Because bonds can rotate, a single
+   molecule can adopt many different shapes.  To build this data
+   set, all the low-energy conformations of the molecules were
+   generated, yielding 6,598 conformations.  A feature vector
+   describing each conformation was then extracted.
+
+   This many-to-one relationship between feature vectors and molecules
+   is called the "multiple instance problem".  When learning a
+   classifier for this data, the classifier should classify a molecule
+   as "musk" if ANY of its conformations is classified as a musk.  A
+   molecule should be classified as "non-musk" if NONE of its
+   conformations is classified as a musk.
+
+5. Number of Instances: 6,598
+
+6. Number of Attributes: 168 (2 names and 166 features) plus the class.
+
+7. For Each Attribute:
+   
+   Attribute:           Description:
+   molecule_name:       Symbolic name of each molecule.  Musks have names such
+                        as MUSK-188.  Non-musks have names such as
+                        NON-MUSK-jp13.
+   conformation_name:   Symbolic name of each conformation.  These
+                        have the format MOL_ISO+CONF, where MOL is the
+                        molecule number, ISO is the stereoisomer
+                        number (usually 1), and CONF is the
+                        conformation number. 
+   f1 through f162:     These are "distance features" along rays (see
+                        paper cited above).  The distances are
+                        measured in hundredths of Angstroms.  The
+                        distances may be negative or positive, since
+                        they are actually measured relative to an
+                        origin placed along each ray.  The origin was
+                        defined by a "consensus musk" surface that is
+                        no longer used.  Hence, any experiments with
+                        the data should treat these feature values as
+                        lying on an arbitrary continuous scale.  In
+                        particular, the algorithm should not make any
+                        use of the zero point or the sign of each
+                        feature value. 
+   f163:                This is the distance of the oxygen atom in the
+                        molecule to a designated point in 3-space.
+                        This is also called OXY-DIS.
+   f164:                OXY-X: X-displacement from the designated
+                        point.
+   f165:                OXY-Y: Y-displacement from the designated
+                        point.
+   f166:                OXY-Z: Z-displacement from the designated
+                        point. 
+   class:               0 => non-musk, 1 => musk
+
+   Please note that the molecule_name and conformation_name attributes
+   should not be used to predict the class.
+
+8. Missing Attribute Values: none.
+
+9. Class Distribution: 
+   Musks:     39
+   Non-musks: 63
+
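Note on clean2.info (not part of the committed file): the description in section 4 says a molecule is a musk if ANY of its conformations is classified as a musk, and the table in section 3(c) reports %correct with a 95% confidence interval. The sketch below illustrates both under stated assumptions. The function names `predict_molecules` and `accuracy_with_ci` are hypothetical, the per-conformation predictions are made-up illustration data, and the interval uses the normal approximation to a binomial proportion, which the file does not name as its method but which appears to reproduce the quoted interval for the iterated-discrim APR row.

```python
import math
from collections import defaultdict


def predict_molecules(conformation_preds):
    """Aggregate per-conformation predictions into per-molecule predictions.

    conformation_preds: iterable of (molecule_name, predicted_class) pairs,
    where predicted_class is 0 (non-musk) or 1 (musk).
    A molecule is labelled 1 if ANY of its conformations is labelled 1,
    and 0 only if NONE of them is.
    """
    bags = defaultdict(int)  # unseen molecules start at 0 (non-musk)
    for molecule_name, predicted_class in conformation_preds:
        bags[molecule_name] = max(bags[molecule_name], int(predicted_class))
    return dict(bags)


def accuracy_with_ci(tp, fn, fp, tn, z=1.96):
    """Return (accuracy, lower, upper) from molecule-level confusion counts.

    Assumption: normal approximation to the binomial proportion,
    p +/- z * sqrt(p * (1 - p) / n), with z = 1.96 for a 95% interval.
    """
    n = tp + fn + fp + tn
    p = (tp + tn) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


# Hypothetical conformation-level predictions for two molecules.
example_preds = [
    ("MUSK-188", 0),
    ("MUSK-188", 1),       # one positive conformation makes the molecule a musk
    ("NON-MUSK-jp13", 0),
    ("NON-MUSK-jp13", 0),
]
print(predict_molecules(example_preds))
# {'MUSK-188': 1, 'NON-MUSK-jp13': 0}

# Counts taken from the iterated-discrim APR row of the table in section 3(c).
acc, low, high = accuracy_with_ci(tp=30, fn=9, fp=2, tn=61)
print(f"{acc:.1%} [{low:.1%}--{high:.1%}]")
# 89.2% [83.2%--95.2%]
```

Taking the maximum over a molecule's conformations is just one way to express the ANY rule for 0/1 labels; a classifier that outputs scores rather than labels would need a threshold chosen before this step.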