Commit Inicial

2025-08-18 00:46:03 +00:00 · 2020-11-20 11:23:40 +01:00
commit 5611e5bc01
2914 changed files with 2625178 additions and 0 deletions
--- a/data/tanveer/pendigits/conxuntos.dat
+++ b/data/tanveer/pendigits/conxuntos.dat
--- a/data/tanveer/pendigits/conxuntos_kfold.dat
+++ b/data/tanveer/pendigits/conxuntos_kfold.dat
--- a/data/tanveer/pendigits/le_datos.m
+++ b/data/tanveer/pendigits/le_datos.m
@@ -0,0 +1,24 @@
+printf('lendo problema %s ...\n', problema);
+
+n_entradas= 16; n_clases= 10; 
+n_fich= 2; fich{1}= 'pendigits.tra'; n_patrons(1)= 7494; fich{2}= 'pendigits.tes'; n_patrons(2)= 3498;
+
+n_max= max(n_patrons);
+x = zeros(n_fich, n_max, n_entradas); cl= zeros(n_fich, n_max);
+
+n_patrons_total = sum(n_patrons); n_iter=0;
+
+for i_fich=1:n_fich
+  f=fopen(fich{i_fich}, 'r');
+  if -1==f
+	error('erro en fopen abrindo %s\n', fich{i_fich});
+  end
+  for i=1:n_patrons(i_fich)
+   	fprintf(2,'%5.1f%%\r', 100*n_iter++/n_patrons_total);
+	for j = 1:n_entradas
+	  x(i_fich,i,j) = fscanf(f,'%i',1);
+	end
+	cl(i_fich,i) = fscanf(f,'%i',1);  	% lectura da clase
+  end
+  fclose(f);
+end
--- a/data/tanveer/pendigits/pendigits.cost
+++ b/data/tanveer/pendigits/pendigits.cost
@@ -0,0 +1,13 @@
+% Rows	Columns
+10	10
+% Matrix elements
+ 0.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	
+ 1.0	 0.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	
+ 1.0	 1.0	 0.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	
+ 1.0	 1.0	 1.0	 0.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	
+ 1.0	 1.0	 1.0	 1.0	 0.0	 1.0	 1.0	 1.0	 1.0	 1.0	
+ 1.0	 1.0	 1.0	 1.0	 1.0	 0.0	 1.0	 1.0	 1.0	 1.0	
+ 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 0.0	 1.0	 1.0	 1.0	
+ 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 0.0	 1.0	 1.0	
+ 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 0.0	 1.0	
+ 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 1.0	 0.0	
--- a/data/tanveer/pendigits/pendigits.names
+++ b/data/tanveer/pendigits/pendigits.names
@@ -0,0 +1,136 @@
+1. Title of Database: Pen-Based Recognition of Handwritten Digits
+
+2. Source:
+	E. Alpaydin, Fevzi. Alimoglu
+	Department of Computer Engineering
+	Bogazici University, 80815 Istanbul Turkey
+	alpaydin@boun.edu.tr
+	July 1998
+
+3. Past Usage:
+	F. Alimoglu (1996) Combining Multiple Classifiers for Pen-Based
+	Handwritten Digit Recognition, 
+	MSc Thesis, Institute of Graduate Studies in Science and 
+	Engineering, Bogazici University.
+	http://www.cmpe.boun.edu.tr/~alimoglu/alimoglu.ps.gz
+
+	F. Alimoglu, E. Alpaydin, "Methods of Combining Multiple Classifiers 
+	Based on Different Representations for Pen-based Handwriting
+	Recognition," Proceedings of the Fifth Turkish Artificial 
+	Intelligence and Artificial Neural Networks Symposium (TAINN 96), 
+	June 1996, Istanbul, Turkey.
+	http://www.cmpe.boun.edu.tr/~alimoglu/tainn96.ps.gz
+
+	
+4. Relevant Information:
+
+	We create a digit database by collecting 250 samples from 44 writers.
+	The samples written by 30 writers are used for training,
+	cross-validation and writer dependent testing, and the digits 
+	written by the other 14 are used for writer independent testing. This
+	database is also available in the UNIPEN format.
+
+	We use a WACOM PL-100V pressure sensitive tablet with an integrated 
+	LCD display and a cordless stylus. The input and display areas are
+	located in the same place. Attached to the serial port of an Intel 
+	486 based PC, it allows us to collect handwriting samples. The tablet
+	sends $x$ and $y$ tablet coordinates and pressure level values of the
+	pen at fixed time intervals (sampling rate) of 100 miliseconds. 
+
+	These writers are asked to write 250 digits in random order inside 
+	boxes of 500 by 500 tablet pixel resolution.  Subject are monitored 
+	only during the first entry screens. Each screen contains five boxes
+	with the digits to be written displayed above. Subjects are told to
+	write only inside these boxes.  If they make a mistake or are unhappy
+	with their writing, they are instructed to clear the content of a box 
+	by using an on-screen button. The first ten digits are ignored 
+	because most writers are not familiar with this type of input devices,
+	but subjects are not aware of this. 
+
+	In our study, we use only ($x, y$) coordinate information. The stylus
+	pressure level values are ignored. First we apply normalization to 
+	make our representation invariant to translations and scale 
+	distortions. The raw data that we capture from the tablet consist of
+	integer values between 0 and 500 (tablet input box resolution). The 
+	new coordinates are such that the coordinate which has the maximum 
+	range varies between 0 and 100. Usually $x$ stays in this range, since
+	most characters are taller than they are wide.  
+
+	In order to train and test our classifiers, we need to represent 
+	digits as constant length feature vectors. A commonly used technique
+	leading to good results is resampling the ( x_t, y_t) points. 
+	Temporal resampling (points regularly spaced in time) or spatial
+	resampling (points regularly spaced in arc length) can be used here. 
+	Raw point data are already regularly spaced in time but the distance
+	between them is variable. Previous research showed that spatial
+	resampling to obtain a constant number of regularly spaced points 
+	on the trajectory yields much better performance, because it provides 
+	a better alignment between points. Our resampling algorithm uses 
+	simple linear interpolation between pairs of points. The resampled
+	digits are represented as a sequence of T points ( x_t, y_t )_{t=1}^T,
+	regularly spaced in arc length, as opposed to the input sequence, 
+	which is regularly spaced in time.
+
+	So, the input vector size is 2*T, two times the number of points
+	resampled. We considered spatial resampling to T=8,12,16 points in our
+	experiments and found that T=8 gave the best trade-off between 
+	accuracy and complexity.
+
+
+5. Number of Instances
+	pendigits.tra	Training	7494
+	pendigits.tes	Testing		3498
+	
+	The way we used the dataset was to use first half of training for 
+	actual training, one-fourth for validation and one-fourth
+	for writer-dependent testing. The test set was used for 
+	writer-independent testing and is the actual quality measure.
+
+6. Number of Attributes
+	16 input+1 class attribute
+
+7. For Each Attribute:
+	All input attributes are integers in the range 0..100.
+	The last attribute is the class code 0..9
+
+8. Missing Attribute Values
+	None
+
+9. Class Distribution
+	Class: No of examples in training set
+	0:  780
+	1:  779
+	2:  780
+	3:  719
+	4:  780
+	5:  720
+	6:  720
+	7:  778
+	8:  719
+	9:  719
+	Class: No of examples in testing set
+	0:  363
+	1:  364
+	2:  364
+	3:  336
+	4:  364
+	5:  335
+	6:  336
+	7:  364
+	8:  336
+	9:  336
+
+Accuracy on the testing set with k-nn 
+using Euclidean distance as the metric
+
+ k =  1 : 97.74
+ k =  2 : 97.37
+ k =  3 : 97.80
+ k =  4 : 97.66
+ k =  5 : 97.60
+ k =  6 : 97.57
+ k =  7 : 97.54
+ k =  8 : 97.54
+ k =  9 : 97.46
+ k = 10 : 97.48
+ k = 11 : 97.34
--- a/data/tanveer/pendigits/pendigits.tes
+++ b/data/tanveer/pendigits/pendigits.tes
--- a/data/tanveer/pendigits/pendigits.tra
+++ b/data/tanveer/pendigits/pendigits.tra
--- a/data/tanveer/pendigits/pendigits.txt
+++ b/data/tanveer/pendigits/pendigits.txt
@@ -0,0 +1,10 @@
+n_entradas= 16
+n_clases= 10
+n_arquivos= 2
+fich1= pendigits_train_R.dat
+n_patrons1= 7494
+fich2= pendigits_test_R.dat
+n_patrons2= 3498
+n_patrons_entrena= 3747
+n_patrons_valida= 3747
+n_conxuntos= 1
--- a/data/tanveer/pendigits/pendigits_test.arff
+++ b/data/tanveer/pendigits/pendigits_test.arff
--- a/data/tanveer/pendigits/pendigits_test_R.dat
+++ b/data/tanveer/pendigits/pendigits_test_R.dat
--- a/data/tanveer/pendigits/pendigits_train.arff
+++ b/data/tanveer/pendigits/pendigits_train.arff
--- a/data/tanveer/pendigits/pendigits_train_R.dat
+++ b/data/tanveer/pendigits/pendigits_train_R.dat