550 lines
19 KiB
Plaintext
Executable File
550 lines
19 KiB
Plaintext
Executable File
-------------------------------------
|
|
--- Python interface of LIBLINEAR ---
|
|
-------------------------------------
|
|
|
|
Table of Contents
|
|
=================
|
|
|
|
- Introduction
|
|
- Installation via PyPI
|
|
- Installation via Sources
|
|
- Quick Start
|
|
- Quick Start with Scipy
|
|
- Design Description
|
|
- Data Structures
|
|
- Utility Functions
|
|
- Additional Information
|
|
|
|
Introduction
|
|
============
|
|
|
|
Python (http://www.python.org/) is a programming language suitable for rapid
|
|
development. This tool provides a simple Python interface to LIBLINEAR, a library
|
|
for support vector machines (http://www.csie.ntu.edu.tw/~cjlin/liblinear). The
|
|
interface is very easy to use as the usage is the same as that of LIBLINEAR. The
|
|
interface is developed with the built-in Python library "ctypes."
|
|
|
|
Installation via PyPI
|
|
=====================
|
|
|
|
To install the interface from PyPI, execute the following command:
|
|
|
|
> pip install -U liblinear-official
|
|
|
|
Installation via Sources
|
|
========================
|
|
|
|
Alternatively, you may install the interface from sources by
|
|
generating the LIBLINEAR shared library.
|
|
|
|
Depending on your use cases, you can choose between local-directory
|
|
and system-wide installation.
|
|
|
|
- Local-directory installation:
|
|
|
|
On Unix systems, type
|
|
|
|
> make
|
|
|
|
This generates a .so file in the LIBLINEAR main directory and you
|
|
can run the interface in the current python directory.
|
|
|
|
For Windows, starting from version 2.48, we no longer provide the
|
|
pre-built shared library liblinear.dll. To run the interface in the
|
|
current python directory, please follow the instruction of building
|
|
Windows binaries in LIBLINEAR README. You can copy liblinear.dll to
|
|
the system directory (e.g., `C:\WINDOWS\system32\') to make it
|
|
system-widely available.
|
|
|
|
- System-wide installation:
|
|
|
|
Type
|
|
|
|
> pip install -e .
|
|
|
|
or
|
|
|
|
> pip install --user -e .
|
|
|
|
The option --user would install the package in the home directory
|
|
instead of the system directory, and thus does not require the
|
|
root privilege.
|
|
|
|
Please note that you must keep the sources after the installation.
|
|
|
|
For Windows, to run the above command, Microsoft Visual C++ and
|
|
other tools are needed.
|
|
|
|
In addition, DON'T use the following FAILED commands
|
|
|
|
> python setup.py install (failed to run at the python directory)
|
|
> pip install .
|
|
|
|
Quick Start
|
|
===========
|
|
|
|
"Quick Start with Scipy" is in the next section.
|
|
|
|
There are two levels of usage. The high-level one uses utility
|
|
functions in liblinearutil.py and commonutil.py (shared with LIBSVM
|
|
and imported by svmutil.py). The usage is the same as the LIBLINEAR
|
|
MATLAB interface.
|
|
|
|
>>> from liblinear.liblinearutil import *
|
|
# Read data in LIBSVM format
|
|
>>> y, x = svm_read_problem('../heart_scale')
|
|
>>> m = train(y[:200], x[:200], '-c 4')
|
|
>>> p_label, p_acc, p_val = predict(y[200:], x[200:], m)
|
|
|
|
# Construct problem in python format
|
|
# Dense data
|
|
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
|
|
# Sparse data
|
|
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]
|
|
>>> prob = problem(y, x)
|
|
>>> param = parameter('-s 0 -c 4 -B 1')
|
|
>>> m = train(prob, param)
|
|
|
|
# Other utility functions
|
|
>>> save_model('heart_scale.model', m)
|
|
>>> m = load_model('heart_scale.model')
|
|
>>> p_label, p_acc, p_val = predict(y, x, m, '-b 1')
|
|
>>> ACC, MSE, SCC = evaluations(y, p_label)
|
|
|
|
# Getting online help
|
|
>>> help(train)
|
|
|
|
The low-level use directly calls C interfaces imported by liblinear.py. Note that
|
|
all arguments and return values are in ctypes format. You need to handle them
|
|
carefully.
|
|
|
|
>>> from liblinear.liblinear import *
|
|
>>> prob = problem([1,-1], [{1:1, 3:1}, {1:-1,3:-1}])
|
|
>>> param = parameter('-c 4')
|
|
>>> m = liblinear.train(prob, param) # m is a ctype pointer to a model
|
|
# Convert a Python-format instance to feature_nodearray, a ctypes structure
|
|
>>> x0, max_idx = gen_feature_nodearray({1:1, 3:1})
|
|
>>> label = liblinear.predict(m, x0)
|
|
|
|
Quick Start with Scipy
|
|
======================
|
|
|
|
Make sure you have Scipy installed to proceed in this section.
|
|
If numba (http://numba.pydata.org) is installed, some operations will be much faster.
|
|
|
|
There are two levels of usage. The high-level one uses utility functions
|
|
in liblinearutil.py and the usage is the same as the LIBLINEAR MATLAB interface.
|
|
|
|
>>> import numpy as np
|
|
>>> import scipy
|
|
>>> from liblinear.liblinearutil import *
|
|
# Read data in LIBSVM format
|
|
>>> y, x = svm_read_problem('../heart_scale', return_scipy = True) # y: ndarray, x: csr_matrix
|
|
>>> m = train(y[:200], x[:200, :], '-c 4')
|
|
>>> p_label, p_acc, p_val = predict(y[200:], x[200:, :], m)
|
|
|
|
# Construct problem in Scipy format
|
|
# Dense data: numpy ndarray
|
|
>>> y, x = np.asarray([1,-1]), np.asarray([[1,0,1], [-1,0,-1]])
|
|
# Sparse data: scipy csr_matrix((data, (row_ind, col_ind))
|
|
>>> y, x = np.asarray([1,-1]), scipy.sparse.csr_matrix(([1, 1, -1, -1], ([0, 0, 1, 1], [0, 2, 0, 2])))
|
|
>>> prob = problem(y, x)
|
|
>>> param = parameter('-s 0 -c 4 -B 1')
|
|
>>> m = train(prob, param)
|
|
|
|
# Apply data scaling in Scipy format
|
|
>>> y, x = svm_read_problem('../heart_scale', return_scipy=True)
|
|
>>> scale_param = csr_find_scale_param(x, lower=0)
|
|
>>> scaled_x = csr_scale(x, scale_param)
|
|
|
|
# Other utility functions
|
|
>>> save_model('heart_scale.model', m)
|
|
>>> m = load_model('heart_scale.model')
|
|
>>> p_label, p_acc, p_val = predict(y, x, m, '-b 1')
|
|
>>> ACC, MSE, SCC = evaluations(y, p_label)
|
|
|
|
# Getting online help
|
|
>>> help(train)
|
|
|
|
The low-level use directly calls C interfaces imported by liblinear.py. Note that
|
|
all arguments and return values are in ctypes format. You need to handle them
|
|
carefully.
|
|
|
|
>>> from liblinear.liblinear import *
|
|
>>> prob = problem(np.asarray([1,-1]), scipy.sparse.csr_matrix(([1, 1, -1, -1], ([0, 0, 1, 1], [0, 2, 0, 2]))))
|
|
>>> param = parameter('-s 1 -c 4')
|
|
# One may also direct assign the options after creating the parameter instance
|
|
>>> param = parameter()
|
|
>>> param.solver_type = 1
|
|
>>> param.C = 4
|
|
>>> m = liblinear.train(prob, param) # m is a ctype pointer to a model
|
|
# Convert a tuple of ndarray (index, data) to feature_nodearray, a ctypes structure
|
|
# Note that index starts from 0, though the following example will be changed to 1:1, 3:1 internally
|
|
>>> x0, max_idx = gen_feature_nodearray((np.asarray([0,2]), np.asarray([1,1])))
|
|
>>> label = liblinear.predict(m, x0)
|
|
|
|
Design Description
|
|
==================
|
|
|
|
There are two files liblinear.py and liblinearutil.py, which respectively correspond to
|
|
low-level and high-level use of the interface.
|
|
|
|
In liblinear.py, we adopt the Python built-in library "ctypes," so that
|
|
Python can directly access C structures and interface functions defined
|
|
in linear.h.
|
|
|
|
While advanced users can use structures/functions in liblinear.py, to
|
|
avoid handling ctypes structures, in liblinearutil.py we provide some easy-to-use
|
|
functions. The usage is similar to LIBLINEAR MATLAB interface.
|
|
|
|
Data Structures
|
|
===============
|
|
|
|
Three data structures derived from linear.h are node, problem, and
|
|
parameter. They all contain fields with the same names in
|
|
linear.h. Access these fields carefully because you directly use a C structure
|
|
instead of a Python object. The following description introduces additional
|
|
fields and methods.
|
|
|
|
Before using the data structures, execute the following command to load the
|
|
LIBLINEAR shared library:
|
|
|
|
>>> from liblinear.liblinear import *
|
|
|
|
- class feature_node:
|
|
|
|
Construct a feature_node.
|
|
|
|
>>> node = feature_node(idx, val)
|
|
|
|
idx: an integer indicates the feature index.
|
|
|
|
val: a float indicates the feature value.
|
|
|
|
Show the index and the value of a node.
|
|
|
|
>>> print(node)
|
|
|
|
- Function: gen_feature_nodearray(xi [,feature_max=None])
|
|
|
|
Generate a feature vector from a Python list/tuple/dictionary, numpy ndarray or tuple of (index, data):
|
|
|
|
>>> xi_ctype, max_idx = gen_feature_nodearray({1:1, 3:1, 5:-2})
|
|
|
|
xi_ctype: the returned feature_nodearray (a ctypes structure)
|
|
|
|
max_idx: the maximal feature index of xi
|
|
|
|
feature_max: if feature_max is assigned, features with indices larger than
|
|
feature_max are removed.
|
|
|
|
- class problem:
|
|
|
|
Construct a problem instance
|
|
|
|
>>> prob = problem(y, x [,bias=-1])
|
|
|
|
y: a Python list/tuple/ndarray of l labels (type must be int/double).
|
|
|
|
x: 1. a list/tuple of l training instances. Feature vector of
|
|
each training instance is a list/tuple or dictionary.
|
|
|
|
2. an l * n numpy ndarray or scipy spmatrix (n: number of features).
|
|
|
|
bias: if bias >= 0, instance x becomes [x; bias]; if < 0, no bias term
|
|
added (default -1)
|
|
|
|
You can also modify the bias value by
|
|
|
|
>>> prob.set_bias(1)
|
|
|
|
Note that if your x contains sparse data (i.e., dictionary), the internal
|
|
ctypes data format is still sparse.
|
|
|
|
Copy a problem instance.
|
|
DON'T use this unless you know what it does.
|
|
|
|
>>> prob_copy = prob.copy()
|
|
|
|
The reason we need to copy a problem instance is because, for example,
|
|
in multi-label tasks using the OVR setting, we need to train a binary
|
|
classification problem for each label on data with/without that label.
|
|
Since each training uses the same x but different y, simply looping
|
|
over labels and creating a new problem instance for each training would
|
|
introduce overhead either from repeatedly transforming x into feature_node
|
|
or from allocating additional memory space for x during parallel training.
|
|
|
|
With problem copying, suppose x represents data, y1 represents the label
|
|
for class 1 and y2 represent the label for class 2.
|
|
We can do:
|
|
|
|
>>> class1_prob = problem(y1, x)
|
|
>>> class2_prob = class1_prob.copy()
|
|
>>> class2_prob.y = (ctypes.c_double * class1_prob.l)(*y2)
|
|
|
|
Note that although the copied problem is a new instance, attributes such as
|
|
y (POINTER(c_double)), x (POINTER(POINTER(feature_node))), and x_space (list/np.ndarray)
|
|
are copied by reference. That is, class1_prob and class2_prob share the same
|
|
y, x and x_space after the copy.
|
|
|
|
- class parameter:
|
|
|
|
Construct a parameter instance
|
|
|
|
>>> param = parameter('training_options')
|
|
|
|
If 'training_options' is empty, LIBLINEAR default values are applied.
|
|
|
|
Set param to LIBLINEAR default values.
|
|
|
|
>>> param.set_to_default_values()
|
|
|
|
Parse a string of options.
|
|
|
|
>>> param.parse_options('training_options')
|
|
|
|
Show values of parameters.
|
|
|
|
>>> print(param)
|
|
|
|
- class model:
|
|
|
|
There are two ways to obtain an instance of model:
|
|
|
|
>>> model_ = train(y, x)
|
|
>>> model_ = load_model('model_file_name')
|
|
|
|
Note that the returned structure of interface functions
|
|
liblinear.train and liblinear.load_model is a ctypes pointer of
|
|
model, which is different from the model object returned
|
|
by train and load_model in liblinearutil.py. We provide a
|
|
function toPyModel for the conversion:
|
|
|
|
>>> model_ptr = liblinear.train(prob, param)
|
|
>>> model_ = toPyModel(model_ptr)
|
|
|
|
If you obtain a model in a way other than the above approaches,
|
|
handle it carefully to avoid memory leak or segmentation fault.
|
|
|
|
Some interface functions to access LIBLINEAR models are wrapped as
|
|
members of the class model:
|
|
|
|
>>> nr_feature = model_.get_nr_feature()
|
|
>>> nr_class = model_.get_nr_class()
|
|
>>> class_labels = model_.get_labels()
|
|
>>> is_prob_model = model_.is_probability_model()
|
|
>>> is_regression_model = model_.is_regression_model()
|
|
|
|
The decision function is W*x + b, where
|
|
W is an nr_class-by-nr_feature matrix, and
|
|
b is a vector of size nr_class.
|
|
To access W_kj (i.e., coefficient for the k-th class and the j-th feature)
|
|
and b_k (i.e., bias for the k-th class), use the following functions.
|
|
|
|
>>> W_kj = model_.get_decfun_coef(feat_idx=j, label_idx=k)
|
|
>>> b_k = model_.get_decfun_bias(label_idx=k)
|
|
|
|
We also provide a function to extract w_k (i.e., the k-th row of W) and
|
|
b_k directly as follows.
|
|
|
|
>>> [w_k, b_k] = model_.get_decfun(label_idx=k)
|
|
|
|
Note that w_k is a Python list of length nr_feature, which means that
|
|
w_k[0] = W_k1.
|
|
For regression models, W is just a vector of length nr_feature. Either
|
|
set label_idx=0 or omit the label_idx parameter to access the coefficients.
|
|
|
|
>>> W_j = model_.get_decfun_coef(feat_idx=j)
|
|
>>> b = model_.get_decfun_bias()
|
|
>>> [W, b] = model_.get_decfun()
|
|
|
|
For one-class SVM models, label_idx is ignored and b=-rho is
|
|
returned from get_decfun(). That is, the decision function is
|
|
w*x+b = w*x-rho.
|
|
|
|
>>> rho = model_.get_decfun_rho()
|
|
>>> [W, b] = model_.get_decfun()
|
|
|
|
Note that in get_decfun_coef, get_decfun_bias, and get_decfun, feat_idx
|
|
starts from 1, while label_idx starts from 0. If label_idx is not in the
|
|
valid range (0 to nr_class-1), then a NaN will be returned; and if feat_idx
|
|
is not in the valid range (1 to nr_feature), then a zero value will be
|
|
returned. For regression models, label_idx is ignored.
|
|
|
|
Utility Functions
|
|
=================
|
|
|
|
To use utility functions, type
|
|
|
|
>>> from liblinear.liblinearutil import *
|
|
|
|
The above command loads
|
|
train() : train a linear model
|
|
predict() : predict testing data
|
|
svm_read_problem() : read the data from a LIBSVM-format file or object.
|
|
load_model() : load a LIBLINEAR model.
|
|
save_model() : save model to a file.
|
|
evaluations() : evaluate prediction results.
|
|
csr_find_scale_param() : find scaling parameter for data in csr format.
|
|
csr_scale() : apply data scaling to data in csr format.
|
|
|
|
- Function: train
|
|
|
|
There are three ways to call train()
|
|
|
|
>>> model = train(y, x [, 'training_options'])
|
|
>>> model = train(prob [, 'training_options'])
|
|
>>> model = train(prob, param)
|
|
|
|
y: a list/tuple/ndarray of l training labels (type must be int/double).
|
|
|
|
x: 1. a list/tuple of l training instances. Feature vector of
|
|
each training instance is a list/tuple or dictionary.
|
|
|
|
2. an l * n numpy ndarray or scipy spmatrix (n: number of features).
|
|
|
|
training_options: a string in the same form as that for LIBLINEAR command
|
|
mode.
|
|
|
|
prob: a problem instance generated by calling
|
|
problem(y, x).
|
|
|
|
param: a parameter instance generated by calling
|
|
parameter('training_options')
|
|
|
|
model: the returned model instance. See linear.h for details of this
|
|
structure. If '-v' is specified, cross validation is
|
|
conducted and the returned model is just a scalar: cross-validation
|
|
accuracy for classification and mean-squared error for regression.
|
|
|
|
If the '-C' option is specified, best parameters are found
|
|
by cross validation. The parameter selection utility is supported
|
|
only by -s 0, -s 2 (for finding C) and -s 11 (for finding C, p).
|
|
The returned structure is a triple with the best C, the best p,
|
|
and the corresponding cross-validation accuracy or mean squared
|
|
error. The returned best p for -s 0 and -s 2 is set to -1 because
|
|
the p parameter is not used by classification models.
|
|
|
|
|
|
To train the same data many times with different
|
|
parameters, the second and the third ways should be faster..
|
|
|
|
Examples:
|
|
|
|
>>> y, x = svm_read_problem('../heart_scale')
|
|
>>> prob = problem(y, x)
|
|
>>> param = parameter('-s 3 -c 5 -q')
|
|
>>> m = train(y, x, '-c 5')
|
|
>>> m = train(prob, '-w1 5 -c 5')
|
|
>>> m = train(prob, param)
|
|
>>> CV_ACC = train(y, x, '-v 3')
|
|
>>> best_C, best_p, best_rate = train(y, x, '-C -s 0') # best_p is only for -s 11
|
|
>>> m = train(y, x, '-c {0} -s 0'.format(best_C)) # use the same solver: -s 0
|
|
|
|
- Function: predict
|
|
|
|
To predict testing data with a model, use
|
|
|
|
>>> p_labs, p_acc, p_vals = predict(y, x, model [,'predicting_options'])
|
|
|
|
y: a list/tuple/ndarray of l true labels (type must be int/double).
|
|
It is used for calculating the accuracy. Use [] if true labels are
|
|
unavailable.
|
|
|
|
x: 1. a list/tuple of l training instances. Feature vector of
|
|
each training instance is a list/tuple or dictionary.
|
|
|
|
2. an l * n numpy ndarray or scipy spmatrix (n: number of features).
|
|
|
|
predicting_options: a string of predicting options in the same format as
|
|
that of LIBLINEAR.
|
|
|
|
model: a model instance.
|
|
|
|
p_labels: a list of predicted labels
|
|
|
|
p_acc: a tuple including accuracy (for classification), mean
|
|
squared error, and squared correlation coefficient (for
|
|
regression).
|
|
|
|
p_vals: a list of decision values or probability estimates (if '-b 1'
|
|
is specified). If k is the number of classes, for decision values,
|
|
each element includes results of predicting k binary-class
|
|
SVMs. If k = 2 and solver is not MCSVM_CS, only one decision value
|
|
is returned. For probabilities, each element contains k values
|
|
indicating the probability that the testing instance is in each class.
|
|
Note that the order of classes here is the same as 'model.label'
|
|
field in the model structure.
|
|
|
|
Example:
|
|
|
|
>>> m = train(y, x, '-c 5')
|
|
>>> p_labels, p_acc, p_vals = predict(y, x, m)
|
|
|
|
- Functions: svm_read_problem/load_model/save_model
|
|
|
|
See the usage by examples:
|
|
|
|
>>> y, x = svm_read_problem('data.txt')
|
|
>>> with open('data.txt') as f:
|
|
>>> y, x = svm_read_problem(f)
|
|
>>> m = load_model('model_file')
|
|
>>> save_model('model_file', m)
|
|
|
|
- Function: evaluations
|
|
|
|
Calculate some evaluations using the true values (ty) and the predicted
|
|
values (pv):
|
|
|
|
>>> (ACC, MSE, SCC) = evaluations(ty, pv, useScipy)
|
|
|
|
ty: a list/tuple/ndarray of true values.
|
|
|
|
pv: a list/tuple/ndarray of predicted values.
|
|
|
|
useScipy: convert ty, pv to ndarray, and use scipy functions to do the evaluation
|
|
|
|
ACC: accuracy.
|
|
|
|
MSE: mean squared error.
|
|
|
|
SCC: squared correlation coefficient.
|
|
|
|
- Function: csr_find_scale_parameter/csr_scale
|
|
|
|
Scale data in csr format.
|
|
|
|
>>> param = csr_find_scale_param(x [, lower=l, upper=u])
|
|
>>> x = csr_scale(x, param)
|
|
|
|
x: a csr_matrix of data.
|
|
|
|
l: x scaling lower limit; default -1.
|
|
|
|
u: x scaling upper limit; default 1.
|
|
|
|
The scaling process is: x * diag(coef) + ones(l, 1) * offset'
|
|
|
|
param: a dictionary of scaling parameters, where param['coef'] = coef and param['offset'] = offset.
|
|
|
|
coef: a scipy array of scaling coefficients.
|
|
|
|
offset: a scipy array of scaling offsets.
|
|
|
|
Additional Information
|
|
======================
|
|
|
|
This interface was originally written by Hsiang-Fu Yu from Department of Computer
|
|
Science, National Taiwan University. If you find this tool useful, please
|
|
cite LIBLINEAR as follows
|
|
|
|
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin.
|
|
LIBLINEAR: A Library for Large Linear Classification, Journal of
|
|
Machine Learning Research 9(2008), 1871-1874. Software available at
|
|
http://www.csie.ntu.edu.tw/~cjlin/liblinear
|
|
|
|
For any question, please contact Chih-Jen Lin <cjlin@csie.ntu.edu.tw>,
|
|
or check the FAQ page:
|
|
|
|
http://www.csie.ntu.edu.tw/~cjlin/liblinear/faq.html
|