Initial commit

2020-11-20 11:23:40 +01:00
commit 5611e5bc01
2914 changed files with 2625178 additions and 0 deletions

5
data/tanveer/adult/adult.cost Executable file

@@ -0,0 +1,5 @@
% Rows Columns
2 2
% Matrix elements
0.0 1.0
1.0 0.0
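
The adult.cost file above appears to store a misclassification cost matrix: a dimensions line followed by the matrix rows, with `%` starting a comment line. A minimal Python sketch of reading that layout, assuming exactly this structure (the function name `parse_cost_matrix` is hypothetical and not part of the commit):

```python
def parse_cost_matrix(text):
    """Parse a 'rows cols' line followed by matrix rows; '%' lines are comments."""
    # Drop blank lines and lines starting with '%', which are treated as comments.
    stripped = [line.strip() for line in text.splitlines()]
    data = [line for line in stripped if line and not line.startswith("%")]
    # The first data line declares the matrix dimensions.
    rows, cols = (int(token) for token in data[0].split())
    matrix = [[float(token) for token in line.split()] for line in data[1:1 + rows]]
    if len(matrix) != rows or any(len(row) != cols for row in matrix):
        raise ValueError("matrix shape does not match the declared dimensions")
    return matrix


# Contents of adult.cost as shown in the diff.
ADULT_COST = """% Rows Columns
2 2
% Matrix elements
0.0 1.0
1.0 0.0
"""

cost = parse_cost_matrix(ADULT_COST)
```

With zeros on the diagonal and ones elsewhere, every misclassification costs the same, so minimizing expected cost under this matrix is equivalent to minimizing the error rate.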

32562
data/tanveer/adult/adult.data Executable file

File diff suppressed because it is too large

15
data/tanveer/adult/adult.desc Executable file

@@ -0,0 +1,15 @@
1 continua
2 discreta 8 Private Self-emp-not-inc Self-emp-inc Federal-gov Local-gov State-gov Without-pay Never-worked
3 continua
4 discreta 16 Bachelors Some-college 11th HS-grad Prof-school Assoc-acdm Assoc-voc 9th 7th-8th 12th Masters 1st-4th 10th Doctorate 5th-6th Preschool
5 continua
6 discreta 7 Married-civ-spouse Divorced Never-married Separated Widowed Married-spouse-absent Married-AF-spouse
7 discreta 14 Tech-support Craft-repair Other-service Sales Exec-managerial Prof-specialty Handlers-cleaners Machine-op-inspct Adm-clerical Farming-fishing Transport-moving Priv-house-serv Protective-serv Armed-Forces
8 discreta 6 Wife Own-child Husband Not-in-family Other-relative Unmarried
9 discreta 5 White Asian-Pac-Islander Amer-Indian-Eskimo Other Black
10 discreta 2 Female Male
11 continua
12 continua
13 continua
14 discreta 41 United-States Cambodia England Puerto-Rico Canada Germany Outlying-US(Guam-USVI-etc) India Japan Greece South China Cuba Iran Honduras Philippines Italy Poland Jamaica Vietnam Mexico Portugal Ireland France Dominican-Republic Laos Ecuador Taiwan Haiti Columbia Hungary Guatemala Nicaragua Scotland Thailand Yugoslavia El-Salvador Trinadad&Tobago Peru Hong Holand-Netherlands
individual <=50K >50K
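
The adult.desc layout above reads as one line per input: an index, the type (`continua` for continuous, `discreta` for discrete), and for discrete inputs the number of values followed by the values themselves. The final line seems to list the class labels. A Python sketch under those assumptions (the function `parse_desc` and the reading of the last line are guesses, not something the commit documents):

```python
def parse_desc(text):
    """Return ({input_index: list of values, or None if continuous}, class_labels)."""
    inputs = {}
    classes = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if not tokens[0].isdigit():
            # Assumption: a line without a leading index lists the class labels
            # after its first token.
            classes = tokens[1:]
            continue
        index, kind = int(tokens[0]), tokens[1]
        if kind == "discreta":
            count, values = int(tokens[2]), tokens[3:]
            if len(values) != count:
                raise ValueError(f"input {index}: expected {count} values, got {len(values)}")
            inputs[index] = values
        elif kind == "continua":
            inputs[index] = None
        else:
            raise ValueError(f"input {index}: unknown type {kind!r}")
    return inputs, classes


# A three-line excerpt in the same layout as adult.desc.
SAMPLE_DESC = """1 continua
10 discreta 2 Female Male
individual <=50K >50K
"""

inputs, classes = parse_desc(SAMPLE_DESC)
```

The count check is the useful part: because values are separated by spaces, a value containing a space would make the declared count and the token count disagree.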

110
data/tanveer/adult/adult.names Executable file

@@ -0,0 +1,110 @@
| This data was extracted from the census bureau database found at
| http://www.census.gov/ftp/pub/DES/www/welcome.html
| Donor: Ronny Kohavi and Barry Becker,
| Data Mining and Visualization
| Silicon Graphics.
| e-mail: ronnyk@sgi.com for questions.
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances : 6
| Class probabilities for adult.all file
| Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database. A set of
| reasonably clean records was extracted using the following conditions:
| ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
|
| Prediction task is to determine whether a person makes over 50K
| a year.
|
| First cited in:
| @inproceedings{kohavi-nbtree,
| author={Ron Kohavi},
| title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a
| Decision-Tree Hybrid},
| booktitle={Proceedings of the Second International Conference on
| Knowledge Discovery and Data Mining},
| year = 1996,
| pages={to appear}}
|
| Accuracy reported as follows (after removal of unknowns from
| train/test sets):
| C4.5 : 84.46+-0.30
| Naive-Bayes: 83.88+-0.30
| NBTree : 85.90+-0.28
|
|
| Following algorithms were later run with the following error rates,
| all after removal of unknowns and using the original train/test split.
| All these numbers are straight runs using MLC++ with default values.
|
|    Algorithm               Error
| -- ----------------        -----
| 1  C4.5                    15.54
| 2  C4.5-auto               14.46
| 3  C4.5 rules              14.94
| 4  Voted ID3 (0.6)         15.64
| 5  Voted ID3 (0.8)         16.47
| 6  T2                      16.84
| 7  1R                      19.54
| 8  NBTree                  14.10
| 9  CN2                     16.00
| 10 HOODG                   14.82
| 11 FSS Naive Bayes         14.05
| 12 IDTM (Decision table)   14.46
| 13 Naive-Bayes             16.12
| 14 Nearest-neighbor (1)    21.42
| 15 Nearest-neighbor (3)    20.35
| 16 OC1                     15.04
| 17 Pebls                   Crashed.  Unknown why (bounds WERE increased)
|
| Conversion of original data as follows:
| 1. Discretized agrossincome into two ranges with threshold 50,000.
| 2. Convert U.S. to US to avoid periods.
| 3. Convert Unknown to "?"
| 4. Run MLC++ GenCVFiles to generate data,test.
|
| Description of fnlwgt (final weight)
|
| The weights on the CPS files are controlled to independent estimates of the
| civilian noninstitutional population of the US. These are prepared monthly
| for us by Population Division here at the Census Bureau. We use 3 sets of
| controls.
| These are:
| 1. A single cell estimate of the population 16+ for each state.
| 2. Controls for Hispanic Origin by age and sex.
| 3. Controls by Race, age and sex.
|
| We use all three sets of controls in our weighting program and "rake" through
| them 6 times so that by the end we come back to all the controls we used.
|
| The term estimate refers to population totals derived from CPS by creating
| "weighted tallies" of any specified socio-economic characteristics of the
| population.
|
| People with similar demographic characteristics should have
| similar weights. There is one important caveat to remember
| about this statement. That is that since the CPS sample is
| actually a collection of 51 state samples, each with its own
| probability of selection, the statement only applies within
| state.
>50K, <=50K.
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
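
The two result tables in adult.names are consistent with each other once accuracy and error are treated as complements: the C4.5 figure of 84.46 in the first table and 15.54 in the second sum to 100. The stated split sizes also add up to the stated totals. A small Python check of that arithmetic, using only numbers quoted in the file (the helper name is illustrative):

```python
def error_to_accuracy(error_percent):
    """Accuracy and error rate, both in percent, are complements."""
    return 100.0 - error_percent


# Error rates from the second table, keyed by algorithm name.
errors = {"C4.5": 15.54, "Naive-Bayes": 16.12, "NBTree": 14.10}
accuracies = {name: error_to_accuracy(err) for name, err in errors.items()}

# Split sizes quoted in the header of adult.names.
total_instances = 32561 + 16281          # train + test
total_without_unknowns = 30162 + 15060   # train + test after removing unknowns
```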

16282
data/tanveer/adult/adult.test Executable file

File diff suppressed because it is too large

10
data/tanveer/adult/adult.txt Executable file

@@ -0,0 +1,10 @@
n_entradas= 14
n_clases= 2
n_arquivos= 2
fich1= adult_train_R.dat
n_patrons1= 32561
fich2= adult_test_R.dat
n_patrons2= 16281
n_patrons_entrena= 16281
n_patrons_valida= 16280
n_conxuntos= 1
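
The adult.txt file above is a list of `key= value` pairs. Assuming `n_patrons_entrena` and `n_patrons_valida` are the training and validation portions of the first file (an interpretation based on the names, which the commit does not document), they should sum to `n_patrons1`. A Python sketch that parses the layout and allows that check (`parse_key_values` is a hypothetical helper):

```python
def parse_key_values(text):
    """Parse 'key= value' lines into a dict, converting all-digit values to int."""
    config = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip()
        config[key.strip()] = int(value) if value.isdigit() else value
    return config


# Contents of adult.txt as shown in the diff.
ADULT_TXT = """n_entradas= 14
n_clases= 2
n_arquivos= 2
fich1= adult_train_R.dat
n_patrons1= 32561
fich2= adult_test_R.dat
n_patrons2= 16281
n_patrons_entrena= 16281
n_patrons_valida= 16280
n_conxuntos= 1
"""

config = parse_key_values(ADULT_TXT)
split_total = config["n_patrons_entrena"] + config["n_patrons_valida"]
```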

16298
data/tanveer/adult/adult_test.arff Executable file

File diff suppressed because it is too large

16282
data/tanveer/adult/adult_test_R.dat Executable file

File diff suppressed because it is too large

32578
data/tanveer/adult/adult_train.arff Executable file

File diff suppressed because it is too large

32562
data/tanveer/adult/adult_train_R.dat Executable file

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

77
data/tanveer/adult/le_datos.m Executable file

@@ -0,0 +1,77 @@
% adult
% Reads the adult train and test files and encodes every pattern numerically.
printf('reading adult problem...\n');
n_entradas= 14; n_clases= 2; n_fich= 2;
fich{1}= 'adult.data'; n_patrons(1)= 32561;
fich{2}= 'adult.test'; n_patrons(2)= 16281;
n_max= max(n_patrons);
% x(file, pattern, input): encoded inputs; cl(file, pattern): class label (0 or 1)
x = zeros(n_fich, n_max, n_entradas); cl= zeros(n_fich, n_max);
% discreta(j)==1 marks input j as discrete (categorical), 0 as continuous
discreta = [0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1];
workclass = {'Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov', 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked'};
education = {'Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters', '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool'};
marital = {'Married-civ-spouse', 'Divorced', 'Never-married', 'Separated', 'Widowed', 'Married-spouse-absent', 'Married-AF-spouse'};
occupation = {'Tech-support', 'Craft-repair', 'Other-service', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 'Transport-moving', 'Priv-house-serv', 'Protective-serv', 'Armed-Forces'};
relationship = {'Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried'};
race = {'White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black'};
sex = {'Male', 'Female'};
country = {'United-States', 'Cambodia', 'England', 'Puerto-Rico', 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland', 'France', 'Dominican-Republic', 'Laos', 'Ecuador', 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 'Holand-Netherlands'};
n_workclass=8; n_education=16; n_marital=7; n_occupation=14; n_relationship=6; n_race=5; n_sex=2; n_country=41;
for i_fich = 1:n_fich
  f=fopen(fich{i_fich}, 'r');
  if -1==f
    error('fopen error opening %s\n', fich{i_fich});
  end
  for i=1:n_patrons(i_fich)
    fprintf(2,'%5.1f%%\r', 100*i/n_patrons(i_fich));  % progress indicator on stderr
    for j = 1:n_entradas
      if discreta(j)==1
        % read the token, then consume the single character that follows it
        s = fscanf(f,'%s',1); fscanf(f,'%c',1);
        % printf('%s ', s)
        if strcmp(s, '?') % missing input in this pattern
          x(i_fich,i,j)=0;
        else
          % select the value list for discrete input j
          if j==2
            n = n_workclass; p=workclass;
          elseif j==4
            n = n_education; p=education;
          elseif j==6
            n = n_marital; p=marital;
          elseif j==7
            n = n_occupation; p=occupation;
          elseif j==8
            n = n_relationship; p=relationship;
          elseif j==9
            n = n_race; p=race;
          elseif j==10
            n = n_sex; p=sex;
          elseif j==14
            n = n_country; p=country;
          end
          % map the value index k=1..n linearly onto [-1, 1]: k=1 -> -1, k=n -> +1
          a = 2/(n-1); b= (1+n)/(1-n);
          for k=1:n
            if strcmp(s, p{k})
              x(i_fich,i,j) = a*k + b; break
            end
          end
        end
      else
        % continuous input: read the number, then consume the following character
        x(i_fich,i,j) = fscanf(f,'%g',1); fscanf(f,'%c',1);
      end
      % printf('%g ', x(i_fich,i,j))
    end
    % the class label is the last token of the pattern
    s = fscanf(f,'%s',1); fscanf(f,'%c',1);
    if strcmp(s, '<=50K')
      cl(i_fich,i)=0;
    elseif strcmp(s, '>50K')
      cl(i_fich,i)=1;
    else
      error('unknown class %s\n', s)
    end
    % printf('\n')
    % disp(x(i_fich,i,:)); disp(cl(i_fich,i))
  end
  fclose(f);
end
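
The discrete-input encoding in le_datos.m computes `a*k + b` with `a = 2/(n-1)` and `b = (1+n)/(1-n)`, which simplifies to `(2k - (n+1))/(n-1)`: the first listed value maps to -1, the last to +1, and the rest are evenly spaced in between. A Python sketch of the same mapping, written in the simplified form (the function name `encode_discrete` is hypothetical, and this is an illustration rather than a port of the script):

```python
def encode_discrete(value, categories):
    """Map a category to [-1, 1] by its 1-based position; '?' (missing) maps to 0."""
    if value == "?":
        return 0.0
    n = len(categories)
    for k, name in enumerate(categories, start=1):
        if value == name:
            # Algebraically equal to a*k + b with a = 2/(n-1), b = (1+n)/(1-n).
            return (2 * k - (n + 1)) / (n - 1)
    # Unmatched values stay at 0, mirroring the zero-initialized array in the script.
    return 0.0


# Value lists copied from le_datos.m.
RELATIONSHIP = ["Wife", "Own-child", "Husband", "Not-in-family", "Other-relative", "Unmarried"]
RACE = ["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"]

first = encode_discrete("Wife", RELATIONSHIP)
last = encode_discrete("Unmarried", RELATIONSHIP)
missing = encode_discrete("?", RELATIONSHIP)
middle = encode_discrete("Amer-Indian-Eskimo", RACE)
```

One consequence worth noting: when a list has an odd number of values, the middle value also encodes to 0, so it cannot be told apart from a missing entry after encoding.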