mirror of
https://github.com/Doctorado-ML/Stree_datasets.git
synced 2025-08-17 16:36:02 +00:00
154 lines
6.9 KiB
Plaintext
Executable File
154 lines
6.9 KiB
Plaintext
Executable File
1. Title: MUSK "Clean2" database
|
|
|
|
2. Sources:
|
|
(a) Creators: AI Group at Arris Pharmaceutical Corporation
|
|
contact: David Chapman or Ajay Jain
|
|
Arris Pharmaceutical Corporation
|
|
385 Oyster Point Blvd.
|
|
South San Francisco, CA 94080
|
|
415-737-8600
|
|
zvona@arris.com, jain@arris.com
|
|
(b) Donor: Tom Dietterich
|
|
Department of Computer Science
|
|
Oregon State University
|
|
Corvallis, OR 97331
|
|
503-737-5559
|
|
tgd@cs.orst.edu
|
|
(c) Date received: September 12, 1994
|
|
|
|
3. Past Usage:
|
|
|
|
(a) Dietterich, T. G., Jain, A., Lathrop, R., Lozano-Perez, T. (1994).
|
|
A comparison of dynamic reposing and tangent distance for drug
|
|
activity prediction. Advances in Neural Information Processing
|
|
Systems, 6. San Mateo, CA: Morgan Kaufmann. 216--223.
|
|
|
|
The clean2 dataset included here is derived from the starting
|
|
poses employed in this paper. The paper reports the following
|
|
results:
|
|
|
|
Algorithm: 20-fold XVAL:
|
|
1-nearest neighbor (euclidean distance) 75%
|
|
neural network (standard poses) 75%
|
|
1-nearest neighbor (tangent distance) 79%
|
|
neural network (dynamic reposing) 91%
|
|
|
|
The tangent distance and dynamic reposing technique require
|
|
computation of the molecular surface, which cannot be done
|
|
using the feature vectors included in this data set.
|
|
|
|
(b) Jain, A. N., Dietterich, T. G., Lathrop, R. H.,
|
|
Chapman, D., Critchlow, R. E., Bauer, B. E., Webster, T. A.,
|
|
Lozano-Perez, T. Compass: A shape-based machine learning tool for
|
|
drug design. Accepted for publication in Computer-Aided
|
|
Molecular Design.
|
|
|
|
This paper describes the dynamic reposing technique in more
|
|
detail and reports the same result for dynamic reposing as
|
|
above. The paper also gives a complete description of each of
|
|
the 102 molecules in the data set.
|
|
|
|
(c) Dietterich, T. G., Lathrop, R. H., Lozano-Perez, T. (submitted)
|
|
Solving the multiple-instance problem with axis-parallel rectangles.
|
|
Submitted to Artificial Intelligence.
|
|
|
|
This paper describes a family of axis-parallel rectangle
|
|
algorithms and compares various approaches to the multiple
|
|
instance problem. It includes the following table:
|
|
|
|
Algorithm TP FN FP TN errs %correct [CI]
|
|
iterated-discrim APR 30 9 2 61 11 89.2 [83.2--95.2]
|
|
GFS elim-kde APR 32 7 13 50 20 80.4 [72.7--88.1]
|
|
GFS elim-count APR 31 8 17 46 25 75.5 [67.1--83.8]
|
|
all-positive APR 34 5 23 40 28 72.6 [63.9--81.2]
|
|
backpropagation 16 23 10 53 33 67.7 [58.6--76.7]
|
|
GFS all-positive APR 37 2 32 31 34 66.7 [57.5--75.8]
|
|
most frequent class 0 39 0 63 39 61.8 [52.3--71.2]
|
|
C4.5 (pruned) 32 7 35 28 42 58.8 [49.3--68.4]
|
|
|
|
key: TP = true positives
|
|
FN = false negatives
|
|
FP = false positives
|
|
TN = true negatives
|
|
errs = errors = FN+FP
|
|
%correct = 10-fold cross-validation %correct.
|
|
CI = 95% confidence interval on proportion of correct
|
|
predictions.
|
|
For explanations of the various algorithms, see the
|
|
paper.
|
|
|
|
C4.5 and backprop were applied ignoring the multiple instance
|
|
problem (see below) during training, but obeying it during
|
|
testing.
|
|
|
|
This paper also gives more details on the construction of the
|
|
data set.
|
|
|
|
4. Relevant Information:
|
|
This dataset describes a set of 102 molecules of which 39 are judged
|
|
by human experts to be musks and the remaining 63 molecules are
|
|
judged to be non-musks. The goal is to learn to predict whether
|
|
new molecules will be musks or non-musks. However, the 166 features
|
|
that describe these molecules depend upon the exact shape, or
|
|
conformation, of the molecule. Because bonds can rotate, a single
|
|
molecule can adopt many different shapes. To generate this data
|
|
set, all the low-energy conformations of the molecules were
|
|
generated to produce 6,598 conformations. Then, a feature vector
|
|
was extracted that describes each conformation.
|
|
|
|
This many-to-one relationship between feature vectors and molecules
|
|
is called the "multiple instance problem". When learning a
|
|
classifier for this data, the classifier should classify a molecule
|
|
as "musk" if ANY of its conformations is classified as a musk. A
|
|
molecule should be classified as "non-musk" if NONE of its
|
|
conformations is classified as a musk.
|
|
|
|
5. Number of Instances 6,598
|
|
|
|
6. Number of Attributes 168 plus the class.
|
|
|
|
7. For Each Attribute:
|
|
|
|
Attribute: Description:
|
|
molecule_name: Symbolic name of each molecule. Musks have names such
|
|
as MUSK-188. Non-musks have names such as
|
|
NON-MUSK-jp13.
|
|
conformation_name: Symbolic name of each conformation. These
|
|
have the format MOL_ISO+CONF, where MOL is the
|
|
molecule number, ISO is the stereoisomer
|
|
number (usually 1), and CONF is the
|
|
conformation number.
|
|
f1 through f162: These are "distance features" along rays (see
|
|
paper cited above). The distances are
|
|
measured in hundredths of Angstroms. The
|
|
distances may be negative or positive, since
|
|
they are actually measured relative to an
|
|
origin placed along each ray. The origin was
|
|
defined by a "consensus musk" surface that is
|
|
no longer used. Hence, any experiments with
|
|
the data should treat these feature values as
|
|
lying on an arbitrary continuous scale. In
|
|
particular, the algorithm should not make any
|
|
use of the zero point or the sign of each
|
|
feature value.
|
|
f163: This is the distance of the oxygen atom in the
|
|
molecule to a designated point in 3-space.
|
|
This is also called OXY-DIS.
|
|
f164: OXY-X: X-displacement from the designated
|
|
point.
|
|
f165: OXY-Y: Y-displacement from the designated
|
|
point.
|
|
f166: OXY-Z: Z-displacement from the designated
|
|
point.
|
|
class: 0 => non-musk, 1 => musk
|
|
|
|
Please note that the molecule_name and conformation_name attributes
|
|
should not be used to predict the class.
|
|
|
|
8. Missing Attribute Values: none.
|
|
|
|
9. Class Distribution:
|
|
Musks: 39
|
|
Non-musks: 63
|
|
|