mirror of
https://github.com/Doctorado-ML/Stree_datasets.git
synced 2025-08-18 00:46:03 +00:00
Initial commit
2
data/tanveer/page-blocks/conxuntos.dat
Executable file
File diff suppressed because one or more lines are too long
8
data/tanveer/page-blocks/conxuntos_kfold.dat
Executable file
File diff suppressed because one or more lines are too long
23
data/tanveer/page-blocks/le_datos.m
Executable file
@@ -0,0 +1,23 @@
printf('reading problem %s ...\n', problema);

n_entradas= 10; n_clases= 5; n_fich= 1; fich{1}= 'page-blocks.data'; n_patrons(1)= 5473;

n_max= max(n_patrons);
x = zeros(n_fich, n_max, n_entradas); cl= zeros(n_fich, n_max);

n_patrons_total = sum(n_patrons); n_iter=0;

for i_fich=1:n_fich
  f=fopen(fich{i_fich}, 'r');
  if -1==f
    error('error in fopen while opening %s\n', fich{i_fich});
  end
  for i=1:n_patrons(i_fich)
    fprintf(2,'%5.1f%%\r', 100*n_iter/n_patrons_total); n_iter = n_iter + 1;  % progress on stderr
    for j = 1:n_entradas
      x(i_fich,i,j) = fscanf(f,'%g',1);  % one input value per read
    end
    cl(i_fich,i) = fscanf(f,'%i',1) - 1;  % read the class label, shifted to start at 0
  end
  fclose(f);
end
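For readers who do not use Octave, the reading loop in le_datos.m can be sketched in Python. This is only an illustration under stated assumptions: the function name `read_patterns` is hypothetical, the input is assumed to be whitespace-separated with 10 input values followed by a 1-based class label per pattern, and the two sample rows are made up rather than taken from page-blocks.data.

```python
import io

N_INPUTS = 10  # mirrors n_entradas in le_datos.m


def read_patterns(stream, n_inputs=N_INPUTS):
    """Read whitespace-separated patterns: n_inputs values, then a 1-based class."""
    tokens = stream.read().split()
    row_len = n_inputs + 1
    if len(tokens) % row_len != 0:
        raise ValueError("token count is not a multiple of %d" % row_len)
    x, cl = [], []
    for start in range(0, len(tokens), row_len):
        row = tokens[start:start + row_len]
        x.append([float(v) for v in row[:n_inputs]])
        # Shift the label so classes start at 0, as the Octave script does.
        cl.append(int(row[n_inputs]) - 1)
    return x, cl


# Two made-up rows in the assumed layout (not real records from the data set).
sample = io.StringIO(
    "2 4 8 2.0 0.5 0.75 2.0 4 6 2 1\n"
    "3 3 9 1.0 0.333 0.667 1.5 3 6 2 4\n"
)
x, cl = read_patterns(sample)
```

Reading token by token, rather than line by line, matches how repeated `fscanf` calls consume the file regardless of line breaks.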
5486
data/tanveer/page-blocks/page-blocks.arff
Executable file
File diff suppressed because it is too large
8
data/tanveer/page-blocks/page-blocks.cost
Executable file
@@ -0,0 +1,8 @@
% Rows Columns
5 5
% Matrix elements
0.0 1.0 1.0 1.0 1.0
1.0 0.0 1.0 1.0 1.0
1.0 1.0 0.0 1.0 1.0
1.0 1.0 1.0 0.0 1.0
1.0 1.0 1.0 1.0 0.0
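The cost file above encodes a uniform 0/1 loss: zero on the diagonal, one everywhere else. A minimal sketch of how such a file could be parsed and applied follows. The helper names `parse_cost` and `total_cost` are hypothetical, and the convention that rows index the true class and columns the predicted class is an assumption (the matrix is symmetric, so the choice does not affect the result here).

```python
COST_TEXT = """% Rows Columns
5 5
% Matrix elements
0.0 1.0 1.0 1.0 1.0
1.0 0.0 1.0 1.0 1.0
1.0 1.0 0.0 1.0 1.0
1.0 1.0 1.0 0.0 1.0
1.0 1.0 1.0 1.0 0.0
"""


def parse_cost(text):
    """Parse the '% Rows Columns' / '% Matrix elements' layout into a list of rows."""
    lines = [ln.strip() for ln in text.splitlines()]
    data = [ln for ln in lines if ln and not ln.startswith("%")]
    n_rows, n_cols = (int(v) for v in data[0].split())
    matrix = [[float(v) for v in ln.split()] for ln in data[1:1 + n_rows]]
    if len(matrix) != n_rows or any(len(r) != n_cols for r in matrix):
        raise ValueError("matrix shape does not match the declared size")
    return matrix


def total_cost(cost, y_true, y_pred):
    """Sum cost[true][predicted] over paired 0-based labels (assumed convention)."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred))


cost = parse_cost(COST_TEXT)
# With a 0/1 matrix the total cost equals the number of misclassified examples.
errors = total_cost(cost, [0, 1, 2, 4], [0, 2, 2, 3])
```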
5473
data/tanveer/page-blocks/page-blocks.data
Executable file
File diff suppressed because it is too large
91
data/tanveer/page-blocks/page-blocks.names
Executable file
@@ -0,0 +1,91 @@
1. Title of Database: Blocks Classification
2. Sources:
   (a) Donato Malerba
       Dipartimento di Informatica
       University of Bari
       via Orabona 4
       70126 Bari - Italy
       phone: +39 - 80 - 5443269
       fax: +39 - 80 - 5443196
       malerbad@vm.csata.it
   (b) Donor: Donato Malerba
   (c) Date: July 1995
3. Past Usage:
   This data set has been used to try different simplification methods
   for decision trees. A summary of the results can be found in:

   Malerba, D., Esposito, F., and Semeraro, G.
   "A Further Comparison of Simplification Methods for Decision-Tree Induction."
   In D. Fisher and H. Lenz (Eds.), "Learning from Data:
   Artificial Intelligence and Statistics V", Lecture Notes in Statistics,
   Springer Verlag, Berlin, 1995.

   The problem consists of classifying all the blocks of the page
   layout of a document that have been detected by a segmentation
   process. This is an essential step in document analysis
   in order to separate text from graphic areas.
   The five classes are: text (1), horizontal line (2),
   picture (3), vertical line (4) and graphic (5).
   For a detailed presentation of the problem see:

   Esposito F., Malerba D., & Semeraro G.
   Multistrategy Learning for Document Recognition
   Applied Artificial Intelligence, 8, pp. 33-84, 1994

   All instances have been personally checked, so
   little noise is present in the data.

4. Relevant Information Paragraph:

   The 5473 examples come from 54 distinct documents.
   Each observation concerns one block.
   All attributes are numeric.
   Data are in a format readable by C4.5.

5. Number of Instances: 5473.

6. Number of Attributes: 10 (all numeric), plus the class

   height:   integer.    | Height of the block.
   lenght:   integer.    | Length of the block.
   area:     integer.    | Area of the block (height * lenght);
   eccen:    continuous. | Eccentricity of the block (lenght / height);
   p_black:  continuous. | Percentage of black pixels within the block (blackpix / area);
   p_and:    continuous. | Percentage of black pixels after the application of the Run Length Smoothing Algorithm (RLSA) (blackand / area);
   mean_tr:  continuous. | Mean number of white-black transitions (blackpix / wb_trans);
   blackpix: integer.    | Total number of black pixels in the original bitmap of the block.
   blackand: integer.    | Total number of black pixels in the bitmap of the block after the RLSA.
   wb_trans: integer.    | Number of white-black transitions in the original bitmap of the block.



7. Missing Attribute Values: No missing value.

8. Class Distribution:

                                   Valid      Cum
   Class        Frequency  Percent  Percent  Percent

   text              4913     89.8     89.8     89.8
   horiz. line        329      6.0      6.0     95.8
   graphic             28       .5       .5     96.3
   vert. line          88      1.6      1.6     97.9
   picture            115      2.1      2.1    100.0
                  -------  -------  -------
   TOTAL             5473    100.0    100.0

   Summary Statistics:

   Variable      Mean   Std Dev  Minimum   Maximum  Correlation

   HEIGHT       10.47     18.96        1       804        .3510
   LENGTH       89.57    114.72        1       553       -.0045
   AREA       1198.41   4849.38        7    143993        .2343
   ECCEN        13.75     30.70     .007    537.00        .0992
   P_BLACK        .37       .18     .052      1.00        .2130
   P_AND          .79       .17     .062      1.00       -.1771
   MEAN_TR       6.22     69.08     1.00   4955.00        .0723
   BLACKPIX    365.93   1270.33        7     33017        .1656
   BLACKAND    741.11   1881.50        7     46133        .1565
   WB_TRANS    106.66    167.31        1      3212        .0337
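The percentages in the class distribution table can be recomputed from the frequencies alone. The sketch below does that and also derives the accuracy of always predicting the majority class, which is a useful reference point for a data set this imbalanced. The variable names are illustrative and the counts are copied from the table above.

```python
# Class frequencies as listed in page-blocks.names, in table order.
counts = {
    "text": 4913,
    "horiz. line": 329,
    "graphic": 28,
    "vert. line": 88,
    "picture": 115,
}

total = sum(counts.values())

rows = []
running = 0
for name, n in counts.items():
    running += n
    percent = round(100 * n / total, 1)
    cumulative = round(100 * running / total, 1)
    rows.append((name, n, percent, cumulative))

# Accuracy of a trivial classifier that always answers "text".
majority_baseline = max(counts.values()) / total

for name, n, percent, cumulative in rows:
    print(f"{name:<12} {n:>6} {percent:>6.1f} {cumulative:>6.1f}")
```

Because roughly nine in ten blocks are text, plain accuracy says little about the minority classes; per-class metrics are usually more informative on this data.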
8
data/tanveer/page-blocks/page-blocks.txt
Executable file
@@ -0,0 +1,8 @@
n_entradas= 10
n_clases= 5
n_arquivos= 1
fich1= page-blocks_R.dat
n_patrons1= 5473
n_patrons_entrena= 2737
n_patrons_valida= 2736
n_conxuntos= 1
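The descriptor file page-blocks.txt is a flat list of `key= value` pairs. A small sketch of reading it and sanity-checking the split sizes is shown below. The function name `parse_config` is hypothetical, and the interpretation of `n_patrons_entrena` and `n_patrons_valida` as training and validation pattern counts is an assumption based on the key names.

```python
CONFIG_TEXT = """n_entradas= 10
n_clases= 5
n_arquivos= 1
fich1= page-blocks_R.dat
n_patrons1= 5473
n_patrons_entrena= 2737
n_patrons_valida= 2736
n_conxuntos= 1
"""


def parse_config(text):
    """Parse 'key= value' lines; purely numeric values are converted to int."""
    cfg = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        value = value.strip()
        cfg[key.strip()] = int(value) if value.isdigit() else value
    return cfg


cfg = parse_config(CONFIG_TEXT)

# The two split sizes should add up to the number of patterns in the file.
split_is_consistent = (
    cfg["n_patrons_entrena"] + cfg["n_patrons_valida"] == cfg["n_patrons1"]
)
```

Here 2737 + 2736 = 5473, so the split covers every pattern exactly once if the two subsets are disjoint.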
5474
data/tanveer/page-blocks/page-blocks_R.dat
Executable file
File diff suppressed because it is too large