First commit

2025-06-22 00:31:33 +02:00
parent a52c20d1fb
commit 4bdbcad256
110 changed files with 31991 additions and 1 deletions
--- a/libsvm-3.36/tools/README
+++ b/libsvm-3.36/tools/README
@@ -0,0 +1,210 @@
+This directory includes some useful codes:
+
+1. subset selection tools.
+2. parameter selection tools.
+3. LIBSVM format checking tools
+
+Part I: Subset selection tools
+
+Introduction
+============
+
+Training large data is time consuming. Sometimes one should work on a
+smaller subset first. The python script subset.py randomly selects a
+specified number of samples. For classification data, we provide a
+stratified selection to ensure the same class distribution in the
+subset.
+
+Usage: subset.py [options] dataset number [output1] [output2]
+
+This script selects a subset of the given data set.
+
+options:
+-s method : method of selection (default 0)
+     0 -- stratified selection (classification only)
+     1 -- random selection
+
+output1 : the subset (optional)
+output2 : the rest of data (optional)
+
+If output1 is omitted, the subset will be printed on the screen.
+
+Example
+=======
+
+> python subset.py heart_scale 100 file1 file2
+
+From heart_scale 100 samples are randomly selected and stored in
+file1. All remaining instances are stored in file2.
+
+
+Part II: Parameter Selection Tools
+
+Introduction
+============
+
+grid.py is a parameter selection tool for C-SVM classification using
+the RBF (radial basis function) kernel. It uses cross validation (CV)
+technique to estimate the accuracy of each parameter combination in
+the specified range and helps you to decide the best parameters for
+your problem.
+
+grid.py directly executes libsvm binaries (so no python binding is needed)
+for cross validation and then draw contour of CV accuracy using gnuplot.
+You must have libsvm and gnuplot installed before using it. The package
+gnuplot is available at http://www.gnuplot.info/
+
+On Mac OSX, the precompiled gnuplot file needs the library Aquarterm,
+which thus must be installed as well. In addition, this version of
+gnuplot does not support png, so you need to change "set term png
+transparent small" and use other image formats. For example, you may
+have "set term pbm small color".
+
+Usage: grid.py [grid_options] [svm_options] dataset
+
+grid_options :
+-log2c {begin,end,step | "null"} : set the range of c (default -5,15,2)
+    begin,end,step -- c_range = 2^{begin,...,begin+k*step,...,end}
+    "null"         -- do not grid with c
+-log2g {begin,end,step | "null"} : set the range of g (default 3,-15,-2)
+    begin,end,step -- g_range = 2^{begin,...,begin+k*step,...,end}
+    "null"         -- do not grid with g
+-v n : n-fold cross validation (default 5)
+-svmtrain pathname : set svm executable path and name
+-gnuplot {pathname | "null"} :
+    pathname -- set gnuplot executable path and name
+    "null"   -- do not plot
+-out {pathname | "null"} : (default dataset.out)
+    pathname -- set output file path and name
+    "null"   -- do not output file
+-png pathname : set graphic output file path and name (default dataset.png)
+-resume [pathname] : resume the grid task using an existing output file (default pathname is dataset.out)
+    Use this option only if some parameters have been checked for the SAME data.
+
+svm_options : additional options for svm-train
+
+The program conducts v-fold cross validation using parameter C (and gamma)
+= 2^begin, 2^(begin+step), ..., 2^end.
+
+You can specify where the libsvm executable and gnuplot are using the
+-svmtrain and -gnuplot parameters.
+
+For windows users, please use pgnuplot.exe. If you are using gnuplot
+3.7.1, please upgrade to version 3.7.3 or higher. The version 3.7.1
+has a bug. If you use cygwin on windows, please use gunplot-x11.
+
+If the task is terminated accidentally or you would like to change the
+range of parameters, you can apply '-resume' to save time by re-using
+previous results.  You may specify the output file of a previous run
+or use the default (i.e., dataset.out) without giving a name. Please
+note that the same condition must be used in two runs. For example,
+you cannot use '-v 10' earlier and resume the task with '-v 5'.
+
+The value of some options can be "null." For example, `-log2c -1,0,1
+-log2 "null"' means that C=2^-1,2^0,2^1 and g=LIBSVM's default gamma
+value. That is, you do not conduct parameter selection on gamma.
+
+Example
+=======
+
+> python grid.py -log2c -5,5,1 -log2g -4,0,1 -v 5 -m 300 heart_scale
+
+Users (in particular MS Windows users) may need to specify the path of
+executable files. You can either change paths in the beginning of
+grid.py or specify them in the command line. For example,
+
+> grid.py -log2c -5,5,1 -svmtrain "c:\Program Files\libsvm\windows\svm-train.exe" -gnuplot c:\tmp\gnuplot\binary\pgnuplot.exe -v 10 heart_scale
+
+Output: two files
+dataset.png: the CV accuracy contour plot generated by gnuplot
+dataset.out: the CV accuracy at each (log2(C),log2(gamma))
+
+The following example saves running time by loading the output file of a previous run.
+
+> python grid.py -log2c -7,7,1 -log2g -5,2,1 -v 5 -resume heart_scale.out heart_scale
+
+Parallel grid search
+====================
+
+You can conduct a parallel grid search by dispatching jobs to a
+cluster of computers which share the same file system. First, you add
+machine names in grid.py:
+
+ssh_workers = ["linux1", "linux5", "linux5"]
+
+and then setup your ssh so that the authentication works without
+asking a password.
+
+The same machine (e.g., linux5 here) can be listed more than once if
+it has multiple CPUs or has more RAM. If the local machine is the
+best, you can also enlarge the nr_local_worker. For example:
+
+nr_local_worker = 2
+
+Example:
+
+> python grid.py heart_scale
+[local] -1 -1 78.8889  (best c=0.5, g=0.5, rate=78.8889)
+[linux5] -1 -7 83.3333  (best c=0.5, g=0.0078125, rate=83.3333)
+[linux5] 5 -1 77.037  (best c=0.5, g=0.0078125, rate=83.3333)
+[linux1] 5 -7 83.3333  (best c=0.5, g=0.0078125, rate=83.3333)
+.
+.
+.
+
+If -log2c, -log2g, or -v is not specified, default values are used.
+
+If your system uses telnet instead of ssh, you list the computer names
+in telnet_workers.
+
+Calling grid in Python
+======================
+
+In addition to using grid.py as a command-line tool, you can use it as a
+Python module.
+
+>>> rate, param = find_parameters(dataset, options)
+
+You need to specify `dataset' and `options' (default ''). See the following example.
+
+> python
+
+>>> from grid import *
+>>> rate, param = find_parameters('../heart_scale', '-log2c -1,1,1 -log2g -1,1,1')
+[local] 0.0 0.0 rate=74.8148 (best c=1.0, g=1.0, rate=74.8148)
+[local] 0.0 -1.0 rate=77.037 (best c=1.0, g=0.5, rate=77.037)
+.
+.
+[local] -1.0 -1.0 rate=78.8889 (best c=0.5, g=0.5, rate=78.8889)
+.
+.
+>>> rate
+78.8889
+>>> param
+{'c': 0.5, 'g': 0.5}
+
+
+Part III: LIBSVM format checking tools
+
+Introduction
+============
+
+`svm-train' conducts only a simple check of the input data. To do a
+detailed check, we provide a python script `checkdata.py.'
+
+Usage: checkdata.py dataset
+
+Exit status (returned value): 1 if there are errors, 0 otherwise.
+
+This tool is written by Rong-En Fan at National Taiwan University.
+
+Example
+=======
+
+> cat bad_data
+1 3:1 2:4
+> python checkdata.py bad_data
+line 1: feature indices must be in an ascending order, previous/current features 3:1 2:4
+Found 1 lines with error.
+
+
--- a/libsvm-3.36/tools/checkdata.py
+++ b/libsvm-3.36/tools/checkdata.py
@@ -0,0 +1,108 @@
+#!/usr/bin/env python
+
+#
+# A format checker for LIBSVM
+#
+
+#
+# Copyright (c) 2007, Rong-En Fan
+#
+# All rights reserved.
+#
+# This program is distributed under the same license of the LIBSVM package.
+#
+
+from sys import argv, exit
+import os.path
+
+def err(line_no, msg):
+    print("line {0}: {1}".format(line_no, msg))
+
+# works like float() but does not accept nan and inf
+def my_float(x):
+    if x.lower().find("nan") != -1 or x.lower().find("inf") != -1:
+        raise ValueError
+
+    return float(x)
+
+def main():
+    if len(argv) != 2:
+        print("Usage: {0} dataset".format(argv[0]))
+        exit(1)
+
+    dataset = argv[1]
+
+    if not os.path.exists(dataset):
+        print("dataset {0} not found".format(dataset))
+        exit(1)
+
+    line_no = 1
+    error_line_count = 0
+    for line in open(dataset, 'r'):
+        line_error = False
+
+        # each line must end with a newline character
+        if line[-1] != '\n':
+            err(line_no, "missing a newline character in the end")
+            line_error = True
+
+        nodes = line.split()
+
+        # check label
+        try:
+            label = nodes.pop(0)
+
+            if label.find(',') != -1:
+                # multi-label format
+                try:
+                    for l in label.split(','):
+                        l = my_float(l)
+                except:
+                    err(line_no, "label {0} is not a valid multi-label form".format(label))
+                    line_error = True
+            else:
+                try:
+                    label = my_float(label)
+                except:
+                    err(line_no, "label {0} is not a number".format(label))
+                    line_error = True
+        except:
+            err(line_no, "missing label, perhaps an empty line?")
+            line_error = True
+
+        # check features
+        prev_index = -1
+        for i in range(len(nodes)):
+            try:
+                (index, value) =  nodes[i].split(':')
+
+                index = int(index)
+                value = my_float(value)
+
+                # precomputed kernel's index starts from 0 and LIBSVM
+                # checks it. Hence, don't treat index 0 as an error.
+                if index < 0:
+                    err(line_no, "feature index must be positive; wrong feature {0}".format(nodes[i]))
+                    line_error = True
+                elif index <= prev_index:
+                    err(line_no, "feature indices must be in an ascending order, previous/current features {0} {1}".format(nodes[i-1], nodes[i]))
+                    line_error = True
+                prev_index = index
+            except:
+                err(line_no, "feature '{0}' not an <index>:<value> pair, <index> integer, <value> real number ".format(nodes[i]))
+                line_error = True
+
+        line_no += 1
+
+        if line_error:
+            error_line_count += 1
+
+    if error_line_count > 0:
+        print("Found {0} lines with error.".format(error_line_count))
+        return 1
+    else:
+        print("No error.")
+        return 0
+
+if __name__ == "__main__":
+    exit(main())
--- a/libsvm-3.36/tools/easy.py
+++ b/libsvm-3.36/tools/easy.py
@@ -0,0 +1,79 @@
+#!/usr/bin/env python
+
+import sys
+import os
+from subprocess import *
+
+if len(sys.argv) <= 1:
+    print('Usage: {0} training_file [testing_file]'.format(sys.argv[0]))
+    raise SystemExit
+
+# svm, grid, and gnuplot executable files
+
+is_win32 = (sys.platform == 'win32')
+if not is_win32:
+    svmscale_exe = "../svm-scale"
+    svmtrain_exe = "../svm-train"
+    svmpredict_exe = "../svm-predict"
+    grid_py = "./grid.py"
+    gnuplot_exe = "/usr/bin/gnuplot"
+else:
+        # example for windows
+    svmscale_exe = r"..\windows\svm-scale.exe"
+    svmtrain_exe = r"..\windows\svm-train.exe"
+    svmpredict_exe = r"..\windows\svm-predict.exe"
+    gnuplot_exe = r"c:\tmp\gnuplot\binary\pgnuplot.exe"
+    grid_py = r".\grid.py"
+
+assert os.path.exists(svmscale_exe),"svm-scale executable not found"
+assert os.path.exists(svmtrain_exe),"svm-train executable not found"
+assert os.path.exists(svmpredict_exe),"svm-predict executable not found"
+assert os.path.exists(gnuplot_exe),"gnuplot executable not found"
+assert os.path.exists(grid_py),"grid.py not found"
+
+train_pathname = sys.argv[1]
+assert os.path.exists(train_pathname),"training file not found"
+file_name = os.path.split(train_pathname)[1]
+scaled_file = file_name + ".scale"
+model_file = file_name + ".model"
+range_file = file_name + ".range"
+
+if len(sys.argv) > 2:
+    test_pathname = sys.argv[2]
+    file_name = os.path.split(test_pathname)[1]
+    assert os.path.exists(test_pathname),"testing file not found"
+    scaled_test_file = file_name + ".scale"
+    predict_test_file = file_name + ".predict"
+
+cmd = '{0} -s "{1}" "{2}" > "{3}"'.format(svmscale_exe, range_file, train_pathname, scaled_file)
+print('Scaling training data...')
+Popen(cmd, shell = True, stdout = PIPE).communicate()
+
+cmd = '{0} -svmtrain "{1}" -gnuplot "{2}" "{3}"'.format(grid_py, svmtrain_exe, gnuplot_exe, scaled_file)
+print('Cross validation...')
+f = Popen(cmd, shell = True, stdout = PIPE).stdout
+
+line = ''
+while True:
+    last_line = line
+    line = f.readline()
+    if not line: break
+c,g,rate = map(float,last_line.split())
+
+print('Best c={0}, g={1} CV rate={2}'.format(c,g,rate))
+
+cmd = '{0} -c {1} -g {2} "{3}" "{4}"'.format(svmtrain_exe,c,g,scaled_file,model_file)
+print('Training...')
+Popen(cmd, shell = True, stdout = PIPE).communicate()
+
+print('Output model: {0}'.format(model_file))
+if len(sys.argv) > 2:
+    cmd = '{0} -r "{1}" "{2}" > "{3}"'.format(svmscale_exe, range_file, test_pathname, scaled_test_file)
+    print('Scaling testing data...')
+    Popen(cmd, shell = True, stdout = PIPE).communicate()
+
+    cmd = '{0} "{1}" "{2}" "{3}"'.format(svmpredict_exe, scaled_test_file, model_file, predict_test_file)
+    print('Testing...')
+    Popen(cmd, shell = True).communicate()
+
+    print('Output prediction: {0}'.format(predict_test_file))
--- a/libsvm-3.36/tools/grid.py
+++ b/libsvm-3.36/tools/grid.py
@@ -0,0 +1,500 @@
+#!/usr/bin/env python
+__all__ = ['find_parameters']
+
+import os, sys, traceback, getpass, time, re
+from threading import Thread
+from subprocess import *
+
+if sys.version_info[0] < 3:
+    from Queue import Queue
+else:
+    from queue import Queue
+
+telnet_workers = []
+ssh_workers = []
+nr_local_worker = 1
+
+class GridOption:
+    def __init__(self, dataset_pathname, options):
+        dirname = os.path.dirname(__file__)
+        if sys.platform != 'win32':
+            self.svmtrain_pathname = os.path.join(dirname, '../svm-train')
+            self.gnuplot_pathname = '/usr/bin/gnuplot'
+        else:
+            # example for windows
+            self.svmtrain_pathname = os.path.join(dirname, r'..\windows\svm-train.exe')
+            # svmtrain_pathname = r'c:\Program Files\libsvm\windows\svm-train.exe'
+            self.gnuplot_pathname = r'c:\tmp\gnuplot\binary\pgnuplot.exe'
+        self.fold = 5
+        self.c_begin, self.c_end, self.c_step = -5,  15,  2
+        self.g_begin, self.g_end, self.g_step =  3, -15, -2
+        self.grid_with_c, self.grid_with_g = True, True
+        self.dataset_pathname = dataset_pathname
+        self.dataset_title = os.path.split(dataset_pathname)[1]
+        self.out_pathname = '{0}.out'.format(self.dataset_title)
+        self.png_pathname = '{0}.png'.format(self.dataset_title)
+        self.pass_through_string = ' '
+        self.resume_pathname = None
+        self.parse_options(options)
+
+    def parse_options(self, options):
+        if type(options) == str:
+            options = options.split()
+        i = 0
+        pass_through_options = []
+
+        while i < len(options):
+            if options[i] == '-log2c':
+                i = i + 1
+                if options[i] == 'null':
+                    self.grid_with_c = False
+                else:
+                    self.c_begin, self.c_end, self.c_step = map(float,options[i].split(','))
+            elif options[i] == '-log2g':
+                i = i + 1
+                if options[i] == 'null':
+                    self.grid_with_g = False
+                else:
+                    self.g_begin, self.g_end, self.g_step = map(float,options[i].split(','))
+            elif options[i] == '-v':
+                i = i + 1
+                self.fold = options[i]
+            elif options[i] in ('-c','-g'):
+                raise ValueError('Use -log2c and -log2g.')
+            elif options[i] == '-svmtrain':
+                i = i + 1
+                self.svmtrain_pathname = options[i]
+            elif options[i] == '-gnuplot':
+                i = i + 1
+                if options[i] == 'null':
+                    self.gnuplot_pathname = None
+                else:
+                    self.gnuplot_pathname = options[i]
+            elif options[i] == '-out':
+                i = i + 1
+                if options[i] == 'null':
+                    self.out_pathname = None
+                else:
+                    self.out_pathname = options[i]
+            elif options[i] == '-png':
+                i = i + 1
+                self.png_pathname = options[i]
+            elif options[i] == '-resume':
+                if i == (len(options)-1) or options[i+1].startswith('-'):
+                    self.resume_pathname = self.dataset_title + '.out'
+                else:
+                    i = i + 1
+                    self.resume_pathname = options[i]
+            else:
+                pass_through_options.append(options[i])
+            i = i + 1
+
+        self.pass_through_string = ' '.join(pass_through_options)
+        if not os.path.exists(self.svmtrain_pathname):
+            raise IOError('svm-train executable not found')
+        if not os.path.exists(self.dataset_pathname):
+            raise IOError('dataset not found')
+        if self.resume_pathname and not os.path.exists(self.resume_pathname):
+            raise IOError('file for resumption not found')
+        if not self.grid_with_c and not self.grid_with_g:
+            raise ValueError('-log2c and -log2g should not be null simultaneously')
+        if self.gnuplot_pathname and not os.path.exists(self.gnuplot_pathname):
+            sys.stderr.write('gnuplot executable not found\n')
+            self.gnuplot_pathname = None
+
+def redraw(db,best_param,gnuplot,options,tofile=False):
+    if len(db) == 0: return
+    begin_level = round(max(x[2] for x in db)) - 3
+    step_size = 0.5
+
+    best_log2c,best_log2g,best_rate = best_param
+
+    # if newly obtained c, g, or cv values are the same,
+    # then stop redrawing the contour.
+    if all(x[0] == db[0][0]  for x in db): return
+    if all(x[1] == db[0][1]  for x in db): return
+    if all(x[2] == db[0][2]  for x in db): return
+
+    if tofile:
+        gnuplot.write(b"set term png transparent small linewidth 2 medium enhanced\n")
+        gnuplot.write("set output \"{0}\"\n".format(options.png_pathname.replace('\\','\\\\')).encode())
+        #gnuplot.write(b"set term postscript color solid\n")
+        #gnuplot.write("set output \"{0}.ps\"\n".format(options.dataset_title).encode().encode())
+    elif sys.platform == 'win32':
+        gnuplot.write(b"set term windows\n")
+    else:
+        gnuplot.write( b"set term x11\n")
+    gnuplot.write(b"set xlabel \"log2(C)\"\n")
+    gnuplot.write(b"set ylabel \"log2(gamma)\"\n")
+    gnuplot.write("set xrange [{0}:{1}]\n".format(options.c_begin,options.c_end).encode())
+    gnuplot.write("set yrange [{0}:{1}]\n".format(options.g_begin,options.g_end).encode())
+    gnuplot.write(b"set contour\n")
+    gnuplot.write("set cntrparam levels incremental {0},{1},100\n".format(begin_level,step_size).encode())
+    gnuplot.write(b"unset surface\n")
+    gnuplot.write(b"unset ztics\n")
+    gnuplot.write(b"set view 0,0\n")
+    gnuplot.write("set title \"{0}\"\n".format(options.dataset_title).encode())
+    gnuplot.write(b"unset label\n")
+    gnuplot.write("set label \"Best log2(C) = {0}  log2(gamma) = {1}  accuracy = {2}%\" \
+                  at screen 0.5,0.85 center\n". \
+                  format(best_log2c, best_log2g, best_rate).encode())
+    gnuplot.write("set label \"C = {0}  gamma = {1}\""
+                  " at screen 0.5,0.8 center\n".format(2**best_log2c, 2**best_log2g).encode())
+    gnuplot.write(b"set key at screen 0.9,0.9\n")
+    gnuplot.write(b"splot \"-\" with lines\n")
+
+    db.sort(key = lambda x:(x[0], -x[1]))
+
+    prevc = db[0][0]
+    for line in db:
+        if prevc != line[0]:
+            gnuplot.write(b"\n")
+            prevc = line[0]
+        gnuplot.write("{0[0]} {0[1]} {0[2]}\n".format(line).encode())
+    gnuplot.write(b"e\n")
+    gnuplot.write(b"\n") # force gnuplot back to prompt when term set failure
+    gnuplot.flush()
+
+
+def calculate_jobs(options):
+
+    def range_f(begin,end,step):
+        # like range, but works on non-integer too
+        seq = []
+        while True:
+            if step > 0 and begin > end: break
+            if step < 0 and begin < end: break
+            seq.append(begin)
+            begin = begin + step
+        return seq
+
+    def permute_sequence(seq):
+        n = len(seq)
+        if n <= 1: return seq
+
+        mid = int(n/2)
+        left = permute_sequence(seq[:mid])
+        right = permute_sequence(seq[mid+1:])
+
+        ret = [seq[mid]]
+        while left or right:
+            if left: ret.append(left.pop(0))
+            if right: ret.append(right.pop(0))
+
+        return ret
+
+
+    c_seq = permute_sequence(range_f(options.c_begin,options.c_end,options.c_step))
+    g_seq = permute_sequence(range_f(options.g_begin,options.g_end,options.g_step))
+
+    if not options.grid_with_c:
+        c_seq = [None]
+    if not options.grid_with_g:
+        g_seq = [None]
+
+    nr_c = float(len(c_seq))
+    nr_g = float(len(g_seq))
+    i, j = 0, 0
+    jobs = []
+
+    while i < nr_c or j < nr_g:
+        if i/nr_c < j/nr_g:
+            # increase C resolution
+            line = []
+            for k in range(0,j):
+                line.append((c_seq[i],g_seq[k]))
+            i = i + 1
+            jobs.append(line)
+        else:
+            # increase g resolution
+            line = []
+            for k in range(0,i):
+                line.append((c_seq[k],g_seq[j]))
+            j = j + 1
+            jobs.append(line)
+
+    resumed_jobs = {}
+
+    if options.resume_pathname is None:
+        return jobs, resumed_jobs
+
+    for line in open(options.resume_pathname, 'r'):
+        line = line.strip()
+        rst = re.findall(r'rate=([0-9.]+)',line)
+        if not rst:
+            continue
+        rate = float(rst[0])
+
+        c, g = None, None
+        rst = re.findall(r'log2c=([0-9.-]+)',line)
+        if rst:
+            c = float(rst[0])
+        rst = re.findall(r'log2g=([0-9.-]+)',line)
+        if rst:
+            g = float(rst[0])
+
+        resumed_jobs[(c,g)] = rate
+
+    return jobs, resumed_jobs
+
+
+class WorkerStopToken:  # used to notify the worker to stop or if a worker is dead
+    pass
+
+class Worker(Thread):
+    def __init__(self,name,job_queue,result_queue,options):
+        Thread.__init__(self)
+        self.name = name
+        self.job_queue = job_queue
+        self.result_queue = result_queue
+        self.options = options
+
+    def run(self):
+        while True:
+            (cexp,gexp) = self.job_queue.get()
+            if cexp is WorkerStopToken:
+                self.job_queue.put((cexp,gexp))
+                # print('worker {0} stop.'.format(self.name))
+                break
+            try:
+                c, g = None, None
+                if cexp != None:
+                    c = 2.0**cexp
+                if gexp != None:
+                    g = 2.0**gexp
+                rate = self.run_one(c,g)
+                if rate is None: raise RuntimeError('get no rate')
+            except:
+                # we failed, let others do that and we just quit
+
+                traceback.print_exception(sys.exc_info()[0], sys.exc_info()[1], sys.exc_info()[2])
+
+                self.job_queue.put((cexp,gexp))
+                sys.stderr.write('worker {0} quit.\n'.format(self.name))
+                break
+            else:
+                self.result_queue.put((self.name,cexp,gexp,rate))
+
+    def get_cmd(self,c,g):
+        options=self.options
+        cmdline = '"' + options.svmtrain_pathname + '"'
+        if options.grid_with_c:
+            cmdline += ' -c {0} '.format(c)
+        if options.grid_with_g:
+            cmdline += ' -g {0} '.format(g)
+        cmdline += ' -v {0} {1} {2} '.format\
+            (options.fold,options.pass_through_string,'"' + options.dataset_pathname + '"')
+        return cmdline
+
+class LocalWorker(Worker):
+    def run_one(self,c,g):
+        cmdline = self.get_cmd(c,g)
+        result = Popen(cmdline,shell=True,stdout=PIPE,stderr=PIPE,stdin=PIPE).stdout
+        for line in result.readlines():
+            if str(line).find('Cross') != -1:
+                return float(line.split()[-1][0:-1])
+
+class SSHWorker(Worker):
+    def __init__(self,name,job_queue,result_queue,host,options):
+        Worker.__init__(self,name,job_queue,result_queue,options)
+        self.host = host
+        self.cwd = os.getcwd()
+    def run_one(self,c,g):
+        cmdline = 'ssh -x -t -t {0} "cd {1}; {2}"'.format\
+            (self.host,self.cwd,self.get_cmd(c,g))
+        result = Popen(cmdline,shell=True,stdout=PIPE,stderr=PIPE,stdin=PIPE).stdout
+        for line in result.readlines():
+            if str(line).find('Cross') != -1:
+                return float(line.split()[-1][0:-1])
+
+class TelnetWorker(Worker):
+    def __init__(self,name,job_queue,result_queue,host,username,password,options):
+        Worker.__init__(self,name,job_queue,result_queue,options)
+        self.host = host
+        self.username = username
+        self.password = password
+    def run(self):
+        import telnetlib
+        self.tn = tn = telnetlib.Telnet(self.host)
+        tn.read_until('login: ')
+        tn.write(self.username + '\n')
+        tn.read_until('Password: ')
+        tn.write(self.password + '\n')
+
+        # XXX: how to know whether login is successful?
+        tn.read_until(self.username)
+        #
+        print('login ok', self.host)
+        tn.write('cd '+os.getcwd()+'\n')
+        Worker.run(self)
+        tn.write('exit\n')
+    def run_one(self,c,g):
+        cmdline = self.get_cmd(c,g)
+        result = self.tn.write(cmdline+'\n')
+        (idx,matchm,output) = self.tn.expect(['Cross.*\n'])
+        for line in output.split('\n'):
+            if str(line).find('Cross') != -1:
+                return float(line.split()[-1][0:-1])
+
+def find_parameters(dataset_pathname, options=''):
+
+    def update_param(c,g,rate,best_c,best_g,best_rate,worker,resumed):
+        if (rate > best_rate) or (rate==best_rate and g==best_g and c<best_c):
+            best_rate,best_c,best_g = rate,c,g
+        stdout_str = '[{0}] {1} {2} (best '.format\
+            (worker,' '.join(str(x) for x in [c,g] if x is not None),rate)
+        output_str = ''
+        if c != None:
+            stdout_str += 'c={0}, '.format(2.0**best_c)
+            output_str += 'log2c={0} '.format(c)
+        if g != None:
+            stdout_str += 'g={0}, '.format(2.0**best_g)
+            output_str += 'log2g={0} '.format(g)
+        stdout_str += 'rate={0})'.format(best_rate)
+        print(stdout_str)
+        if options.out_pathname and not resumed:
+            output_str += 'rate={0}\n'.format(rate)
+            result_file.write(output_str)
+            result_file.flush()
+
+        return best_c,best_g,best_rate
+
+    options = GridOption(dataset_pathname, options);
+
+    if options.gnuplot_pathname:
+        gnuplot = Popen(options.gnuplot_pathname,stdin = PIPE,stdout=PIPE,stderr=PIPE).stdin
+    else:
+        gnuplot = None
+
+    # put jobs in queue
+
+    jobs,resumed_jobs = calculate_jobs(options)
+    job_queue = Queue(0)
+    result_queue = Queue(0)
+
+    for (c,g) in resumed_jobs:
+        result_queue.put(('resumed',c,g,resumed_jobs[(c,g)]))
+
+    for line in jobs:
+        for (c,g) in line:
+            if (c,g) not in resumed_jobs:
+                job_queue.put((c,g))
+
+    # hack the queue to become a stack --
+    # this is important when some thread
+    # failed and re-put a job. It we still
+    # use FIFO, the job will be put
+    # into the end of the queue, and the graph
+    # will only be updated in the end
+
+    job_queue._put = job_queue.queue.appendleft
+
+    # fire telnet workers
+
+    if telnet_workers:
+        nr_telnet_worker = len(telnet_workers)
+        username = getpass.getuser()
+        password = getpass.getpass()
+        for host in telnet_workers:
+            worker = TelnetWorker(host,job_queue,result_queue,
+                     host,username,password,options)
+            worker.start()
+
+    # fire ssh workers
+
+    if ssh_workers:
+        for host in ssh_workers:
+            worker = SSHWorker(host,job_queue,result_queue,host,options)
+            worker.start()
+
+    # fire local workers
+
+    for i in range(nr_local_worker):
+        worker = LocalWorker('local',job_queue,result_queue,options)
+        worker.start()
+
+    # gather results
+
+    done_jobs = {}
+
+    if options.out_pathname:
+        if options.resume_pathname:
+            result_file = open(options.out_pathname, 'a')
+        else:
+            result_file = open(options.out_pathname, 'w')
+
+
+    db = []
+    best_rate = -1
+    best_c,best_g = None,None
+
+    for (c,g) in resumed_jobs:
+        rate = resumed_jobs[(c,g)]
+        best_c,best_g,best_rate = update_param(c,g,rate,best_c,best_g,best_rate,'resumed',True)
+
+    for line in jobs:
+        for (c,g) in line:
+            while (c,g) not in done_jobs:
+                (worker,c1,g1,rate1) = result_queue.get()
+                done_jobs[(c1,g1)] = rate1
+                if (c1,g1) not in resumed_jobs:
+                    best_c,best_g,best_rate = update_param(c1,g1,rate1,best_c,best_g,best_rate,worker,False)
+            db.append((c,g,done_jobs[(c,g)]))
+        if gnuplot and options.grid_with_c and options.grid_with_g:
+            redraw(db,[best_c, best_g, best_rate],gnuplot,options)
+            redraw(db,[best_c, best_g, best_rate],gnuplot,options,True)
+
+
+    if options.out_pathname:
+        result_file.close()
+    job_queue.put((WorkerStopToken,None))
+    best_param, best_cg  = {}, []
+    if best_c != None:
+        best_param['c'] = 2.0**best_c
+        best_cg += [2.0**best_c]
+    if best_g != None:
+        best_param['g'] = 2.0**best_g
+        best_cg += [2.0**best_g]
+    print('{0} {1}'.format(' '.join(map(str,best_cg)), best_rate))
+
+    return best_rate, best_param
+
+
+if __name__ == '__main__':
+
+    def exit_with_help():
+        print("""\
+Usage: grid.py [grid_options] [svm_options] dataset
+
+grid_options :
+-log2c {begin,end,step | "null"} : set the range of c (default -5,15,2)
+    begin,end,step -- c_range = 2^{begin,...,begin+k*step,...,end}
+    "null"         -- do not grid with c
+-log2g {begin,end,step | "null"} : set the range of g (default 3,-15,-2)
+    begin,end,step -- g_range = 2^{begin,...,begin+k*step,...,end}
+    "null"         -- do not grid with g
+-v n : n-fold cross validation (default 5)
+-svmtrain pathname : set svm executable path and name
+-gnuplot {pathname | "null"} :
+    pathname -- set gnuplot executable path and name
+    "null"   -- do not plot
+-out {pathname | "null"} : (default dataset.out)
+    pathname -- set output file path and name
+    "null"   -- do not output file
+-png pathname : set graphic output file path and name (default dataset.png)
+-resume [pathname] : resume the grid task using an existing output file (default pathname is dataset.out)
+    This is experimental. Try this option only if some parameters have been checked for the SAME data.
+
+svm_options : additional options for svm-train""")
+        sys.exit(1)
+
+    if len(sys.argv) < 2:
+        exit_with_help()
+    dataset_pathname = sys.argv[-1]
+    options = sys.argv[1:-1]
+    try:
+        find_parameters(dataset_pathname, options)
+    except (IOError,ValueError) as e:
+        sys.stderr.write(str(e) + '\n')
+        sys.stderr.write('Try "grid.py" for more information.\n')
+        sys.exit(1)
--- a/libsvm-3.36/tools/subset.py
+++ b/libsvm-3.36/tools/subset.py
@@ -0,0 +1,120 @@
+#!/usr/bin/env python
+
+import os, sys, math, random
+from collections import defaultdict
+
+if sys.version_info[0] >= 3:
+    xrange = range
+
+def exit_with_help(argv):
+    print("""\
+Usage: {0} [options] dataset subset_size [output1] [output2]
+
+This script randomly selects a subset of the dataset.
+
+options:
+-s method : method of selection (default 0)
+     0 -- stratified selection (classification only)
+     1 -- random selection
+
+output1 : the subset (optional)
+output2 : rest of the data (optional)
+If output1 is omitted, the subset will be printed on the screen.""".format(argv[0]))
+    exit(1)
+
+def process_options(argv):
+    argc = len(argv)
+    if argc < 3:
+        exit_with_help(argv)
+
+    # default method is stratified selection
+    method = 0
+    subset_file = sys.stdout
+    rest_file = None
+
+    i = 1
+    while i < argc:
+        if argv[i][0] != "-":
+            break
+        if argv[i] == "-s":
+            i = i + 1
+            method = int(argv[i])
+            if method not in [0,1]:
+                print("Unknown selection method {0}".format(method))
+                exit_with_help(argv)
+        i = i + 1
+
+    dataset = argv[i]
+    subset_size = int(argv[i+1])
+    if i+2 < argc:
+        subset_file = open(argv[i+2],'w')
+    if i+3 < argc:
+        rest_file = open(argv[i+3],'w')
+
+    return dataset, subset_size, method, subset_file, rest_file
+
+def random_selection(dataset, subset_size):
+    l = sum(1 for line in open(dataset,'r'))
+    return sorted(random.sample(xrange(l), subset_size))
+
+def stratified_selection(dataset, subset_size):
+    labels = [line.split(None,1)[0] for line in open(dataset)]
+    label_linenums = defaultdict(list)
+    for i, label in enumerate(labels):
+        label_linenums[label] += [i]
+
+    l = len(labels)
+    remaining = subset_size
+    ret = []
+
+    # classes with fewer data are sampled first; otherwise
+    # some rare classes may not be selected
+    for label in sorted(label_linenums, key=lambda x: len(label_linenums[x])):
+        linenums = label_linenums[label]
+        label_size = len(linenums)
+        # at least one instance per class
+        s = int(min(remaining, max(1, math.ceil(label_size*(float(subset_size)/l)))))
+        if s == 0:
+            sys.stderr.write('''\
+Error: failed to have at least one instance per class
+    1. You may have regression data.
+    2. Your classification data is unbalanced or too small.
+Please use -s 1.
+''')
+            sys.exit(-1)
+        remaining -= s
+        ret += [linenums[i] for i in random.sample(xrange(label_size), s)]
+    return sorted(ret)
+
+def main(argv=sys.argv):
+    dataset, subset_size, method, subset_file, rest_file = process_options(argv)
+    #uncomment the following line to fix the random seed
+    #random.seed(0)
+    selected_lines = []
+
+    if method == 0:
+        selected_lines = stratified_selection(dataset, subset_size)
+    elif method == 1:
+        selected_lines = random_selection(dataset, subset_size)
+
+    #select instances based on selected_lines
+    dataset = open(dataset,'r')
+    prev_selected_linenum = -1
+    for i in xrange(len(selected_lines)):
+        for cnt in xrange(selected_lines[i]-prev_selected_linenum-1):
+            line = dataset.readline()
+            if rest_file:
+                rest_file.write(line)
+        subset_file.write(dataset.readline())
+        prev_selected_linenum = selected_lines[i]
+    subset_file.close()
+
+    if rest_file:
+        for line in dataset:
+            rest_file.write(line)
+        rest_file.close()
+    dataset.close()
+
+if __name__ == '__main__':
+    main(sys.argv)
+