Planet Maemo: category "feed:46b1d6b26651a331cde2ad188d699e0c"

Large-scale sparse multiclass classification

Sun, 12 May 2013 12:52:56 +0000

I’m thrilled to announce that my paper “Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification” (published in the Machine Learning journal) is now online: PDF, BibTeX [*].

Abstract

Over the past decade, l1 regularization has emerged as a powerful way to learn classifiers with implicit feature selection. More recently, mixed-norm (e.g., l1/l2) regularization has been utilized as a way to select entire groups of features. In this paper, we propose a novel direct multiclass formulation specifically designed for large-scale and high-dimensional problems such as document classification. Based on a multiclass extension of the squared hinge loss, our formulation employs l1/l2 regularization so as to force weights corresponding to the same features to be zero across all classes, resulting in compact and fast-to-evaluate multiclass models. For optimization, we employ two globally-convergent variants of block coordinate descent, one with line search (Tseng and Yun, 2009) and the other without (Richtárik and Takáč, 2012). We present the two variants in a unified manner and develop the core components needed to efficiently solve our formulation. The end result is a couple of block coordinate descent algorithms specifically tailored to our multiclass formulation. Experimentally, we show that block coordinate descent performs favorably to other solvers such as FOBOS, FISTA and SpaRSA. Furthermore, we show that our formulation obtains very compact multiclass models and outperforms l1/l2-regularized multiclass logistic regression in terms of training speed, while achieving comparable test accuracy.

Code

The code of the proposed multiclass method is available in my Python/Cython machine learning library, lightning. Below is an example of how to use it on the News20 dataset.

from sklearn.datasets import fetch_20newsgroups_vectorized
from lightning.primal_cd import CDClassifier
 
bunch = fetch_20newsgroups_vectorized(subset="all")
X = bunch.data
y = bunch.target
 
clf = CDClassifier(penalty="l1/l2",
                   loss="squared_hinge",
                   multiclass=True,
                   max_iter=20,
                   alpha=1e-4,
                   C=1.0 / X.shape[0],
                   tol=1e-3)
clf.fit(X, y)
# accuracy on the training set
print(clf.score(X, y))
# percentage of selected features
print(clf.n_nonzero(percentage=True))

To use the variant without line search (as presented in the paper), add the max_steps=0 option to CDClassifier.

Data

I also released the Amazon7 dataset used in the paper. It contains 1,362,109 reviews of Amazon products. Each review may belong to one of 7 categories (apparel, book, dvd, electronics, kitchen & housewares, music, video) and is represented as a 262,144-dimensional vector. It is, to my knowledge, one of the largest publicly available multiclass classification datasets.

[*] The final publication is available here.


Transparent system-wide proxy

Sun, 26 Jun 2011 23:18:29 +0000

Proxies can be a powerful way to enforce anonymity or to bypass various kinds of restrictions on the Internet (government censorship, regional content, …). In this post, I’ll describe a simple technique to create a transparent proxy at the system level. It’s especially useful when you want to make sure that all connections go through the proxy or when your application of interest doesn’t have proxy support. I don’t think this technique is that well-known, hence this post.

1. Create SOCKS proxy with OpenSSH

If you have a user account on a remote machine, the simplest way to create a proxy is to use the following command.

$ ssh -N -D 1080 username@serverhost

It creates a SOCKS proxy listening on port 1080 on your local machine. The main advantage of the SOCKS protocol is that it can route connections from any port between the client and the server.

2. Forward connection transparently with iptables or ipfw

Most modern browsers can let the user define a SOCKS proxy in their advanced networking preference section. Yet, many applications may not support the SOCKS protocol at all. The solution is to use system tools such as iptables (Linux) or ipfw (FreeBSD, OS X) to enforce the routing at the system level. For example, on OS X, I use the following command to redirect port 80 (HTTP).

$ sudo ipfw add 100 fwd 127.0.0.1,12345 dst-port 80

3. Redirect connections to SOCKS proxy with redsocks

iptables and ipfw don’t have built-in support for the SOCKS protocol. It is thus necessary to use an additional program to transform the connections on the fly. This is where redsocks comes in.

$ redsocks -c config_file

In the configuration file, you need to configure redsocks to listen on port 12345 and to redirect to port 1080. “generic” can be used as the redirector option. The GitHub repository includes a sample configuration file.

And voilà! We have now configured the system to transparently route connections for us. To summarize, here is the big picture:

local machine -> redsocks -> SOCKS proxy -> target server


Regularized Least Squares

Wed, 09 Feb 2011 14:20:28 +0000

Recently, I’ve contributed a bunch of improvements (sparse matrix support, classification, generalized cross-validation) to the ridge module in scikits.learn. Since I’m receiving good feedback regarding my posts on Machine Learning, I’m taking this as an opportunity to summarize some important points about Regularized Least Squares, and more precisely Ridge regression.

Ridge regression

Given a centered input $\mathbf{x} \in \mathbb{R}^d$, a linear regression model predicts the output $y$ by

$$\hat{y} = \mathbf{w}^\top \mathbf{x}$$

where $\mathbf{w} \in \mathbb{R}^d$ is a weight vector.

Given a set of training instances $\mathbf{x}_1, \dots, \mathbf{x}_n$ and its associated set of outputs $y_1, \dots, y_n$, the method of least squares finds the weight vector $\mathbf{w}$ which minimizes the sum of squared differences between the predictions and the actual labels.

However, minimizing the sum of squared errors on the training set doesn’t necessarily mean that the model will do well on new data. For this reason, Ridge regression adds a regularization term, which controls the complexity of the weight vector. The objective function becomes

$$E(\mathbf{w}) = \frac{1}{2} \|\mathbf{y} - X\mathbf{w}\|^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$

where $X$ is the $n \times d$ matrix whose rows are the training instances, $\mathbf{y}$ is the vector of outputs and $\lambda > 0$ is a parameter which controls the regularization strength. Minimizing $E$ is thus a trade-off between minimizing the sum of errors and keeping the weight vector simple (i.e., small). Put differently, by introducing some bias (the regularization), we are purposely limiting the freedom of the model and thus reducing the variance of the weight vector. Regularization is a simple way to avoid overfitting and often improves the prediction accuracy on new data, especially if the training set is small.

Calculating the gradient of $E$ with respect to $\mathbf{w}$ and solving for $\mathbf{w}$, we find that $\mathbf{w}$ has a closed-form solution

$$\mathbf{w} = G^{-1} X^\top \mathbf{y}$$

where $G = X^\top X + \lambda I$. Note that the matrix inverse is just notational convenience. In practice, we simply solve the corresponding system of linear equations $G\mathbf{w} = X^\top \mathbf{y}$. Since $G$ is symmetric positive-definite, one typically first finds the Cholesky decomposition $G = LL^\top$. Then since $L$ is lower triangular, solving the system is simply a matter of applying forward and backward substitution. Another commonly used method is the conjugate gradient.

Note that not only does regularization improve accuracy, it is often necessary because it makes $G$ invertible when $X^\top X$ is not. See Tikhonov regularization.
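The closed-form solution can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of the ridge module: the function name `ridge_fit`, the toy data and the value of `lam` are made up for the example, and `np.linalg.solve` is a general solver that does not exploit the triangular structure (`scipy.linalg.solve_triangular` would).

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Solve (X^T X + lam * I) w = X^T y via a Cholesky factorization."""
    n_features = X.shape[1]
    G = X.T @ X + lam * np.eye(n_features)
    L = np.linalg.cholesky(G)            # G = L L^T with L lower triangular
    z = np.linalg.solve(L, X.T @ y)      # forward step: L z = X^T y
    return np.linalg.solve(L.T, z)       # backward step: L^T w = z

# Toy data (hypothetical): 50 instances, 5 features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)                      # centered inputs, as assumed in the text
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=50)

w = ridge_fit(X, y, lam=0.1)
```

With a small regularization strength and little noise, `w` should stay close to `w_true`; increasing `lam` shrinks it towards zero.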

Kernel ridge

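As per the representer theorem, just like we did for kernel perceptron, we can replace the weight vector $\mathbf{w}$ with $\sum_{i=1}^n \alpha_i \phi(\mathbf{x}_i)$, where $\phi$ is a feature-space transformation. Substituting into the ridge objective, then differentiating with respect to $\boldsymbol{\alpha}$ and solving for $\boldsymbol{\alpha}$, we find

$$\boldsymbol{\alpha} = G^{-1} \mathbf{y}$$

where $G = K + \lambda I$ and $K$ is the kernel matrix, $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. The prediction function becomes

$$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, \mathbf{x})$$

where $k$ is the kernel function. However, in contrast to SVM, $\boldsymbol{\alpha}$ is not sparse. This means that the whole training set needs to be used at prediction time.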
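As a sanity check of the kernelized solution, here is a small NumPy sketch (the helper names and the random data are hypothetical). It fits the dual coefficients by solving the linear system with the kernel matrix plus `lam` times the identity, and verifies that, with a linear kernel, the predictions coincide with those of ordinary ridge regression.

```python
import numpy as np

def linear_kernel(A, B):
    """Kernel matrix between the rows of A and the rows of B."""
    return A @ B.T

def kernel_ridge_fit(K, y, lam):
    """Return dual coefficients alpha solving (K + lam * I) alpha = y."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def kernel_ridge_predict(K_test_train, alpha):
    """Row i of K_test_train holds k(x_test_i, x_train_j) for all j."""
    return K_test_train @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
X_new = rng.normal(size=(5, 4))
lam = 0.5

alpha = kernel_ridge_fit(linear_kernel(X, X), y, lam)
pred_dual = kernel_ridge_predict(linear_kernel(X_new, X), alpha)

# Primal ridge regression on the same data, for comparison.
w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
pred_primal = X_new @ w
```

Note that prediction needs the kernel between the new points and every training point, which reflects the lack of sparsity mentioned above.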

Multiple output

So far we have considered the case when there is only one output per instance. We can easily extend to the $M$-output case. To do that, we simply need to solve $GW = X^\top Y$, where $W$ is $d \times M$ and $Y$ is $n \times M$. Note that the Cholesky decomposition of $G$ needs to be computed only once. Since it takes $O(d^3)$ time, it is much more efficient to solve $GW = X^\top Y$ directly than to solve $G\mathbf{w} = X^\top \mathbf{y}$, $M$ times.

Classification

M-class classification can easily be achieved by combining multiple-output ridge regression and the 1-of-M coding scheme. Some authors refer to this as the Regularized Least Squares Classifier (RLSC). Concretely, we just need to construct an $n \times M$ matrix $Y$ by setting $Y_{ij}$ to $1$ if the instance $i$ is associated with the label $j$ and to $-1$ otherwise. Then prediction is just a matter of taking the most confident label for each instance, just like one-vs-all SVM.

Note that the square error is sometimes criticized for classification because it penalizes overconfident predictions. For example, predicting 3 instead of 1 is penalized just as much as predicting -1 instead of 1, even though 3 would still give the correct prediction and -1 would not. However, in practice, RLSC often achieves accuracy on a par with other classifiers such as SVM.
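The 1-of-M construction is short enough to sketch directly. The code below is only an illustration under assumptions of mine: the function names, the three toy clusters and `lam=1.0` are invented, labels are coded as +1/-1, and no intercept is fitted because the inputs are centered.

```python
import numpy as np

def one_of_m(labels, n_classes):
    """Y[i, j] = +1 if instance i has label j, -1 otherwise."""
    Y = -np.ones((len(labels), n_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def rlsc_fit(X, labels, n_classes, lam=1.0):
    Y = one_of_m(labels, n_classes)
    G = X.T @ X + lam * np.eye(X.shape[1])
    # A single solve handles all M right-hand sides at once.
    return np.linalg.solve(G, X.T @ Y)

def rlsc_predict(X, W):
    # Most confident label: the column with the largest output.
    return np.argmax(X @ W, axis=1)

# Three well-separated clusters around the origin (toy data).
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0], [-2.0, 3.5], [-2.0, -3.5]])
labels = np.repeat(np.arange(3), 20)
X = centers[labels] + 0.5 * rng.normal(size=(60, 2))
X -= X.mean(axis=0)

W = rlsc_fit(X, labels, n_classes=3, lam=1.0)
accuracy = np.mean(rlsc_predict(X, W) == labels)
```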

Efficient leave-one-out cross-validation

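Estimating $\lambda$ is a model selection problem, usually performed by k-fold or leave-one-out cross-validation. The naive approach to leave-one-out is to learn $n$ models by holding out one instance at a time. Obviously, this is computationally very expensive. Fortunately, there exist nice computational tricks. Below, I show how to use them in the kernel case but they can equally be applied to the normal case.

The first trick is to take the eigendecomposition

$$K = Q \Lambda Q^\top$$

Using basic linear algebra, we obtain

$$G^{-1} = (K + \lambda I)^{-1} = Q (\Lambda + \lambda I)^{-1} Q^\top$$

Since $\Lambda + \lambda I$ is diagonal, it can easily be inverted by taking the reciprocal of its diagonal entries. Therefore, by first paying the price of an eigendecomposition, $G$ can be inverted inexpensively for many different $\lambda$.

Let $f_{-i}(\mathbf{x}_i)$ be the prediction value when the model was fitted to all training instances but $\mathbf{x}_i$. The second trick, sometimes called generalized cross-validation, will allow us to calculate the $f_{-i}(\mathbf{x}_i)$ almost for free.

First, let $\mathbf{y}^i$ be the label vector of the model when $\mathbf{x}_i$ is held out: it is equal to $\mathbf{y}$ except for its $i$-th element, which is set to $f_{-i}(\mathbf{x}_i)$.

When we replace $\mathbf{y}$ by $\mathbf{y}^i$ in the objective function, clearly, $\mathbf{x}_i$ doesn’t affect the solution (as desired) and we have

$$f_{-i}(\mathbf{x}_i) = (K G^{-1} \mathbf{y}^i)_i$$

However, this is a circular definition, since $\mathbf{y}^i$ itself depends on $f_{-i}(\mathbf{x}_i)$. We can calculate the difference with $f(\mathbf{x}_i) = (K G^{-1} \mathbf{y})_i$, which will allow us to factorize $f_{-i}(\mathbf{x}_i)$.

$$f_{-i}(\mathbf{x}_i) - f(\mathbf{x}_i) = \sum_{j=1}^n (K G^{-1})_{ij} (y^i_j - y_j)$$

By definition, the $i$-th element of $\mathbf{y}^i$ and $\mathbf{y}$ are $f_{-i}(\mathbf{x}_i)$ and $y_i$, respectively, and all other elements are equal. Therefore,

$$f_{-i}(\mathbf{x}_i) - f(\mathbf{x}_i) = (K G^{-1})_{ii} \left( f_{-i}(\mathbf{x}_i) - y_i \right)$$

and finally, by simplifying, we obtain

$$f_{-i}(\mathbf{x}_i) = \frac{f(\mathbf{x}_i) - (K G^{-1})_{ii} \, y_i}{1 - (K G^{-1})_{ii}}$$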

Similar computational tricks exist for Gaussian Process Regression as well.
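Both tricks can be checked numerically. The sketch below is an illustration with made-up names and data, not the scikits.learn implementation: it computes the leave-one-out predictions of kernel ridge for several regularization strengths from a single eigendecomposition, and a deliberately naive version that refits one model per held-out instance is included only to verify the shortcut.

```python
import numpy as np

def loo_predictions(K, y, lambdas):
    """Leave-one-out predictions of kernel ridge for several lambdas.

    One eigendecomposition of K is shared by all values of lambda.
    """
    eigvals, Q = np.linalg.eigh(K)               # K = Q diag(eigvals) Q^T
    Qty = Q.T @ y
    out = {}
    for lam in lambdas:
        ratio = eigvals / (eigvals + lam)        # spectrum of H = K (K + lam I)^{-1}
        f = Q @ (ratio * Qty)                    # predictions of the full model
        h = np.einsum("ij,j,ij->i", Q, ratio, Q) # diagonal of H
        out[lam] = (f - h * y) / (1.0 - h)
    return out

def loo_naive(K, y, lam):
    """Refit n models, each with one instance held out (for checking only)."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        K_sub = K[np.ix_(keep, keep)]
        alpha = np.linalg.solve(K_sub + lam * np.eye(n - 1), y[keep])
        preds[i] = K[i, keep] @ alpha
    return preds

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
K = X @ X.T                                      # linear kernel, for illustration
lambdas = [0.1, 1.0, 10.0]
fast = loo_predictions(K, y, lambdas)
```

Picking the regularization strength then amounts to comparing, for each candidate value, the squared differences between `fast[lam]` and `y`.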

References

Regularized Least Squares, Ryan Rifkin


Kernel Perceptron in Python

Sun, 31 Oct 2010 05:36:32 +0000

The Perceptron (Rosenblatt, 1957) is one of the oldest and simplest Machine Learning algorithms. It’s also trivial to kernelize, which makes it an ideal candidate to gain insights on kernel methods.

Perceptron

The Perceptron predicts the class of an input $\mathbf{x}$ with the function

$$f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \phi(\mathbf{x}))$$

where sign(a) = +1 if a > 0 and -1 if a < 0, $\phi$ is a feature-space transformation and $\mathbf{w}$ is a feature weight vector. If $\mathbf{x}$ is already a feature vector and a projection to another space is not needed, then $\phi(\mathbf{x}) = \mathbf{x}$.

Given a data set comprising $n$ training examples $\mathbf{x}_1, \dots, \mathbf{x}_n$ and their corresponding labels $y_1, \dots, y_n$, where $y_i \in \{-1, +1\}$, the Perceptron makes a prediction for each $\mathbf{x}_i$ using the current estimate of $\mathbf{w}$. When the prediction is correct (equal to the label $y_i$), the algorithm jumps to the next example. When the prediction is incorrect, in order to correct for the mistake, if $y_i = +1$ then $\mathbf{w}$ is incremented by $\phi(\mathbf{x}_i)$, otherwise it is decremented by $\phi(\mathbf{x}_i)$. Since $y_i \in \{-1, +1\}$, the update rule is thus:

$$\mathbf{w} \leftarrow \mathbf{w} + y_i \phi(\mathbf{x}_i)$$

Figure 1: The decision boundary (hyperplane), to which $\mathbf{w}$ is normal, partitions the vector space into two sets, one for each class. An update changes the direction of the decision boundary to correct for the incorrect prediction. (From this document)

Figure 2: Decision boundary in the linear case.

Kernel Perceptron

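From the update rule above, we clearly see that, if $\mathbf{w}$ is initialized to the zero vector, it is a linear combination of the training examples.

$$\mathbf{w} = \sum_{i=1}^n \alpha_i y_i \phi(\mathbf{x}_i)$$

Injecting $\mathbf{w}$ into the prediction function $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \phi(\mathbf{x}))$, we get

$$f(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^n \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) \right)$$

where $k(\mathbf{x}_i, \mathbf{x}) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x})$ is a Mercer kernel.

The update rule for $\alpha_i$ when a mistake is made predicting $\mathbf{x}_i$ now simply becomes

$$\alpha_i \leftarrow \alpha_i + 1$$

i.e., $\alpha_i$ is the number of times a mistake has been made with respect to $\mathbf{x}_i$.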
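To make the mistake-counting update concrete, here is a compact sketch of the kernel perceptron. It is independent of the gist linked at the end of this post; the function names, the Gaussian kernel width and the XOR toy problem are my own choices for the example, and a score of exactly zero is arbitrarily treated as class -1.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=0.5):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def kernel_perceptron_fit(X, y, kernel, n_epochs=10):
    n = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)                  # alpha[i] counts the mistakes made on example i
    for _ in range(n_epochs):
        mistakes = 0
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])
            pred = 1.0 if score > 0 else -1.0
            if pred != y[i]:
                alpha[i] += 1.0          # the whole update rule
                mistakes += 1
        if mistakes == 0:                # a full pass without mistakes: stop
            break
    return alpha

def kernel_perceptron_predict(X_train, y_train, alpha, kernel, x):
    sv = alpha > 0                       # only examples with mistakes are needed
    score = sum(a * t * kernel(xi, x)
                for a, t, xi in zip(alpha[sv], y_train[sv], X_train[sv]))
    return 1.0 if score > 0 else -1.0

# XOR: not linearly separable, but separable with a Gaussian kernel.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

alpha = kernel_perceptron_fit(X, y, gaussian_kernel)
preds = [kernel_perceptron_predict(X, y, alpha, gaussian_kernel, x) for x in X]
```

Precomputing the kernel matrix keeps the training loop simple but costs memory quadratic in the number of examples, so it is only reasonable for small data sets.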

A few remarks:

- Instead of learning a weight vector with respect to features, we’re now learning a weight vector with respect to examples.
- To predict the class of an input $\mathbf{x}$, we need to store the training examples $\mathbf{x}_i$. Kernel methods are memory-based methods, like k-NN. However, we only need to store examples for which a mistake has been made, i.e. $\alpha_i > 0$. In the context of Support Vector Machines (SVMs), these are called support vectors. SVMs, however, not only store examples for which a mistake has been made, they also store examples that lie inside the margin, i.e. $y_i (\mathbf{w}^\top \phi(\mathbf{x}_i) + b) < 1$. (See Figure 4 from my post on SVMs)
- In the online learning setting, the number of support vectors can grow and grow as more mistakes are made. The Forgetron is an extension of the Kernel Perceptron which can learn with a “memory budget”. When the budget is exceeded, some support vectors are “forgotten”.
- For some kinds of objects like sequences, trees and graphs, it might be difficult to map objects to feature vectors while it is easy to come up with a similarity measure between any two objects $x$ and $x'$. In this case, kernel methods are very useful, since they can be used “as is”.

For a brief and intuitive introduction to kernel classifiers, I highly recommend these slides, by Mark Johnson.

Figure 3: Decision boundary when using a gaussian kernel. Green dots indicate support vectors.

The voted and averaged Perceptrons are two straightforward extensions to the Perceptron, which for some applications are competitive with SVMs. For details, see “Large Margin Classification Using the Perceptron Algorithm” by Freund and Schapire.

Source

http://gist.github.com/656147


Support Vector Machines in Python

Sun, 19 Sep 2010 14:07:15 +0000

Support Vector Machines (SVM) are the state-of-the-art classifier in many applications and have become ubiquitous thanks to the wealth of open-source libraries implementing them. However, you learn a lot more by actually doing than by just reading, so let’s play a little bit with SVM in Python! To make it easier to read, we will use the same notation as in Christopher Bishop’s book “Pattern Recognition and Machine Learning”.

Maximum margin

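In SVM, the class of an input vector $\mathbf{x}$ can be decided by evaluating the sign of $y(\mathbf{x})$.

$$y(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) + b \qquad (7.1)$$

If $y(\mathbf{x}) > 0$ we assign $\mathbf{x}$ to class +1 and if $y(\mathbf{x}) < 0$, we assign it to class -1. Here $\phi(\mathbf{x})$ is a feature-space transformation, which can map $\mathbf{x}$ to a space of higher, possibly infinite, dimensions.

Given a data set comprising $N$ input vectors $\mathbf{x}_1, \dots, \mathbf{x}_N$ and their corresponding labels $t_1, \dots, t_N$, where $t_n \in \{-1, +1\}$, we would like to find $\mathbf{w}$ and $b$ such that they explain the training data: $y(\mathbf{x}_n) \ge 1$ when $t_n = +1$ and $y(\mathbf{x}_n) \le -1$ when $t_n = -1$. This can be rewritten in a single constraint:

$$t_n y(\mathbf{x}_n) \ge 1, \qquad n = 1, \dots, N \qquad (7.5)$$

In addition, $\mathbf{w}$ and $b$ are chosen so that the distance between the decision boundary (a line in the 2-d case, a plane in the 3-d case, a hyperplane in the n-d case) and the closest points to it is maximized. This distance is called the margin, hence the name maximum margin classifier. Geometrically, the margin is found to be $1 / \|\mathbf{w}\|$ and so the maximum margin problem can be equivalently expressed as the minimization problem:

$$\arg\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 \qquad (7.6)$$

subject to constraint (7.5).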

Figure 1: The linearly separable case. The decision boundary is the solid line and the margin is the gap between the two dotted lines. Only 3 support vectors (green dots) out of 180 training examples are necessary.

Dual representation

Since this is a constrained optimization problem, we can introduce Lagrange multipliers $a_n \ge 0$ (one per training example), differentiate the Lagrangian function with respect to $\mathbf{w}$ and $b$ and inject the solution back into the Lagrangian function (equations (7.7) to (7.9) in Bishop’s book), so that it depends only on the Lagrange multipliers. Doing so, we find that the maximum margin equivalently emerges from the maximization of

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^N a_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \qquad (7.10)$$

subject to the constraints

$$a_n \ge 0, \qquad n = 1, \dots, N \qquad (7.11)$$

$$\sum_{n=1}^N a_n t_n = 0 \qquad (7.12)$$

This is the so-called dual representation and is a quadratic programming (QP) problem. $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')$ is called the kernel function.

Similarly to the objective function, $y(\mathbf{x})$ can also be re-expressed solely in terms of the Lagrange multipliers.

$$y(\mathbf{x}) = \sum_{n=1}^N a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b \qquad (7.13)$$

The important thing to notice here is that we’ve gone from a sum over dimensions (the dot product in equation (7.1)) to a sum over points. This may seem like a disadvantage as the number of training examples $N$ is usually bigger than the number of dimensions $M$. However, this is very useful and is called the kernel trick: it allows us to use SVM, originally a linear classifier, to solve a non-linear problem by mapping the original non-linear observations into a higher-dimensional space, but without explicitly computing $\phi(\mathbf{x})$.

In many situations, only a small proportion of the Lagrange multipliers $a_n$ will be non-zero, therefore we only need to store the corresponding training examples $\mathbf{x}_n$. These are called the support vectors and this is why SVMs are sometimes called sparse kernel machines.

That being said, in the linear case, i.e. when $\phi(\mathbf{x}) = \mathbf{x}$, it is faster to directly compute $y(\mathbf{x})$ from equation (7.1). $\mathbf{w}$ and $b$ can be computed in terms of the Lagrange multipliers by equations (7.8) and (7.18) in Bishop’s book.

Figure 2: The non-linearly separable case. Example of a gaussian kernel with parameter sigma=5.0. Perfect prediction is achieved on the held-out 20 data points.

QP solver

We want to find the Lagrange multipliers maximizing equation (7.10) subject to the constraints (7.11) and (7.12). This can be done by a standard QP solver such as cvxopt, which expects problems of the following form.

Minimize

$$\frac{1}{2} \mathbf{x}^\top P \mathbf{x} + \mathbf{q}^\top \mathbf{x}$$

subject to

$$G \mathbf{x} \preceq \mathbf{h}$$ (inequality constraint)

$$A \mathbf{x} = \mathbf{b}$$ (equality constraint)

The unknown is $\mathbf{x}$, which in our case corresponds to the Lagrange multipliers $\mathbf{a}$. We just need to rework the formulation a little bit to use matrix notation and be a minimization (hence the -1 multiplicative factors).

# Gram matrix
K = np.zeros((n_samples, n_samples))
for i in range(n_samples):
    for j in range(n_samples):
        K[i,j] = self.kernel(X[i], X[j])
 
P = cvxopt.matrix(np.outer(y,y) * K)
q = cvxopt.matrix(np.ones(n_samples) * -1)
A = cvxopt.matrix(y, (1,n_samples))
b = cvxopt.matrix(0.0)
G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
h = cvxopt.matrix(np.zeros(n_samples))
 
# Solve QP problem
solution = cvxopt.solvers.qp(P, q, G, h, A, b)
 
# Lagrange multipliers
a = np.ravel(solution['x'])

Note here that $P$ is an $N \times N$ matrix. Thus, a standard QP solver can’t be used for a large number of training examples, as P needs to be stored in memory. There exist a number of algorithms that decompose the original QP problem into smaller QP problems that target only a few training samples at a time. One such algorithm is Sequential Minimal Optimization (SMO). One advantage of SMO is that the smaller QP problems can be solved analytically and so SMO doesn’t even need a QP solver.

Soft margin

The problem with the formulation we have used thus far is that it doesn’t allow for misclassification of the training examples. This can lead to poor generalization if there is overlap between the distributions of the two classes. To solve this problem, we can rework constraint (7.5) as

$$t_n y(\mathbf{x}_n) \ge 1 - \xi_n, \qquad n = 1, \dots, N \qquad (7.20)$$

$\xi_n \ge 0$ are called slack variables and are introduced to allow the misclassification of some examples. If $\xi_n = 0$, the corresponding training example is correctly classified. If $0 < \xi_n \le 1$, the training example lies inside the margin but is still on the correct side of the decision boundary. If $\xi_n > 1$, the training example is misclassified. Equation (7.6) then becomes

$$\arg\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; C \sum_{n=1}^N \xi_n + \frac{1}{2} \|\mathbf{w}\|^2 \qquad (7.21)$$

$C > 0$ is the parameter which controls the trade-off between the slack variable penalty and the margin. Again, we can introduce Lagrange multipliers, differentiate the Lagrangian function with respect to $\mathbf{w}$, $b$ and $\xi_n$, and inject the solutions back into the Lagrangian function (equations (7.22) to (7.31) in Bishop’s book).

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^N a_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \qquad (7.32)$$

Which is identical to the hard margin case! The constraints become:

$$0 \le a_n \le C, \qquad n = 1, \dots, N \qquad (7.33)$$

$$\sum_{n=1}^N a_n t_n = 0 \qquad (7.34)$$

Interestingly, the slack variables have vanished and the only difference with the hard margin is that the inequality constraint now has an upper bound, $C$.

The attentive reader will have noticed that the inequality constraints in cvxopt have an upper bound but no lower bound.

$$G \mathbf{x} \preceq \mathbf{h}$$

The trick is to rewrite constraint (7.33) as a system of inequalities, in matrix notation. Example with 2 training examples:

$$\begin{pmatrix} -1 & 0 \\ 0 & -1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} \preceq \begin{pmatrix} 0 \\ 0 \\ C \\ C \end{pmatrix}$$
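In code, stacking the lower and upper bounds could look like the following NumPy sketch. The helper name `box_constraints` is hypothetical; the two arrays it returns would still have to be wrapped with `cvxopt.matrix`, as in the hard margin snippet above, before being passed to the solver.

```python
import numpy as np

def box_constraints(n_samples, C=None):
    """Return (G, h) such that G a <= h encodes a >= 0, and a <= C if C is given."""
    G_lower = -np.eye(n_samples)         # -a_n <= 0, i.e. a_n >= 0
    h_lower = np.zeros(n_samples)
    if C is None:                        # hard margin: lower bounds only
        return G_lower, h_lower
    G = np.vstack((G_lower, np.eye(n_samples)))
    h = np.hstack((h_lower, np.full(n_samples, float(C))))
    return G, h

# Soft margin example with 2 training examples and C = 0.1.
G, h = box_constraints(2, C=0.1)
```

A candidate vector of multipliers `a` satisfies the box constraints exactly when `np.all(G @ a <= h)` holds.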

Figure 3: The hard margin case. 180 vectors out of 180 are support vectors! And the classifier only achieves 11 correct predictions out of 20, on held-out data.

Figure 4: The soft margin case (C=0.1). 36 vectors out of 180 are support vectors. The classifier achieves 19 correct predictions out of 20!

Source

http://gist.github.com/586753

References

Pattern Recognition and Machine Learning, Christopher Bishop, 2006.

Soft-margin SVM (ソフトマージンSVM), 人工知能に関する断想録


Latent Dirichlet Allocation in Python

Sat, 21 Aug 2010 20:52:00 +0000

Like Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA) (see my previous post “LSA and pLSA in Python”), Latent Dirichlet Allocation (LDA) is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by documents in that collection. LDA can be seen as a Bayesian extension of pLSA.

As Blei, the author of LDA, points out, the topic proportions in pLSA are tied to the training documents. This is problematic: 1) the number of parameters grows linearly with the number of training documents, which can cause serious overfitting and 2) it is difficult to generalize to new documents, which requires so-called “folding-in”. LDA fixes those issues by being a fully generative model: where pLSA uses a matrix of P(topic|document) probabilities, LDA uses a distribution over topics.

To date, there exist several parameter estimation schemes for LDA: variational Bayes, expectation propagation and Gibbs sampling. I’ve chosen to implement the last one. It was first described in a paper entitled “Finding scientific topics”, by Griffiths and Steyvers.

Artificial data

As with all model-based algorithms, during the early development phase, it is useful to work with artificial data, generated by following the model assumptions. In the case of LDA (and pLSA), the core assumption is that words (w) in documents are generated by a mixture of topics (z). In other words, the probability of a word is:

$$P(w) = \sum_z P(w|z) P(z)$$

The generative process can be summarized as follows: 1) set the topic proportions once and for all when the collection is instantiated and 2) for each document and for as many words as needed, sample a topic from the topic distribution and sample a word from the word distribution of the selected topic. Obviously, this is only an approximation of how documents are created in reality.

To generate an artificial dataset, we can fix the word distribution of each topic and then generate documents as explained above. Since we generated documents by sticking to the generative assumption of the model, if the algorithm is correctly implemented, it should be able to recover the word distribution of each topic, from the generated documents.

Graphical example

To gain insight and intuition, we can reuse the graphical example from Griffiths and Steyvers’ paper.

In the bag-of-words model, documents are represented by vectors of dimension $V$, where $V$ is the vocabulary size. Moreover, an image of size $\sqrt{V} \times \sqrt{V}$ has $V$ pixels: it can thus be stored as a string/vector of length/size $V$. This means that a document in the bag-of-words model can be represented as an image, where pixels correspond to words and pixel intensities correspond to word counts!

As put previously, we first need to fix the word distribution of each topic. Let’s arbitrarily create 10 topics.

5 with “vertical” bars:

and another 5 with “horizontal” bars:

Each topic distribution is represented by a 5×5 image, so the vocabulary is of size 25. Black pixels correspond to words that the topic will never possibly generate. White pixels correspond to words that the topic can generate with probability 1/5.

Now let’s generate 500 documents using the generative process previously described. Here are 3 examples of such generated documents.

We clearly see bars emerging from the documents and can thus confirm that documents are mixtures of topics.
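The artificial data set can be generated with a few lines of NumPy. The sketch below makes assumptions that are mine rather than the post’s: each document has 100 words, and the topic proportions of each document are drawn from a symmetric Dirichlet with parameter 1.0, following the usual LDA generative story. The function names are hypothetical.

```python
import numpy as np

def bar_topics(width=5):
    """2 * width topics over width * width words: one per column, one per row."""
    topics = []
    for k in range(width):
        vertical = np.zeros((width, width))
        vertical[:, k] = 1.0 / width     # each word of the bar has probability 1/width
        topics.append(vertical.ravel())
    for k in range(width):
        horizontal = np.zeros((width, width))
        horizontal[k, :] = 1.0 / width
        topics.append(horizontal.ravel())
    return np.array(topics)              # shape (n_topics, vocab_size)

def generate_documents(phi, n_docs=500, doc_length=100, alpha=1.0, seed=0):
    """Return an (n_docs, vocab_size) matrix of word counts."""
    rng = np.random.default_rng(seed)
    n_topics, vocab_size = phi.shape
    docs = np.zeros((n_docs, vocab_size), dtype=int)
    for d in range(n_docs):
        theta = rng.dirichlet(alpha * np.ones(n_topics))   # topic proportions
        topic_counts = rng.multinomial(doc_length, theta)  # words drawn per topic
        for k, n_k in enumerate(topic_counts):
            docs[d] += rng.multinomial(n_k, phi[k])        # words of topic k
    return docs

topics = bar_topics(5)
docs = generate_documents(topics)
```

Reshaping a row of `docs` to 5×5 gives the image representation of that document.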

We can now use the generated documents as training data. If the Gibbs sampler is correctly implemented, we should be able to recover the original topics. Here are the results for the 1st, 5th and 26th iterations. The number in parentheses is the log-likelihood.

1st iteration (-278541.7835):

5th iteration (-165139.56193):

[...]

26th iteration (-129272.328181):

After a few iterations, we see that the algorithm recovered the topics correctly. Also, the log-likelihood increases: as the number of iterations increases, it becomes more and more likely that the model generated the data. The fact that it works pretty well is not surprising: the data used were generated by sticking to the model assumptions.

Gibbs sampling

The Gibbs sampler used is said to be collapsed: the parameters of interest are not sampled directly. Instead we sample the topic assignments and the parameters can be computed in terms of those.

It is not necessarily obvious from the equation of the full conditional distribution (from which the topic assignments are sampled) but the sampler is naturally sparse: it doesn’t need to iterate over words with zero-count. This is a nice property, given that sampling algorithms are often considered slow.
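For reference, a collapsed Gibbs sampler fits in a few dozen lines. The sketch below is not the code from the gist linked in the next section: the function name, the hyperparameter values `alpha` and `beta`, the number of iterations and the tiny corpus are all assumptions for the example. It keeps only count matrices, resamples the topic of each word occurrence from its full conditional, and derives the parameters from the counts at the end.

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, beta=0.1, seed=0):
    """docs: (n_docs, vocab_size) integer matrix of word counts."""
    rng = np.random.default_rng(seed)
    n_docs, vocab_size = docs.shape
    n_dz = np.zeros((n_docs, n_topics))      # topic counts per document
    n_zw = np.zeros((n_topics, vocab_size))  # word counts per topic
    n_z = np.zeros(n_topics)                 # total number of words per topic
    # One (document, word) pair per word occurrence; zero counts are skipped.
    tokens = [(d, w) for d, w in zip(*np.nonzero(docs))
              for _ in range(docs[d, w])]
    z = rng.integers(n_topics, size=len(tokens))   # random initial assignments
    for (d, w), k in zip(tokens, z):
        n_dz[d, k] += 1
        n_zw[k, w] += 1
        n_z[k] += 1
    for _ in range(n_iter):
        for i, (d, w) in enumerate(tokens):
            k = z[i]                         # remove the current assignment
            n_dz[d, k] -= 1
            n_zw[k, w] -= 1
            n_z[k] -= 1
            # Full conditional over topics, up to a normalization constant.
            p = (n_dz[d] + alpha) * (n_zw[:, w] + beta) / (n_z + vocab_size * beta)
            k = rng.choice(n_topics, p=p / p.sum())
            z[i] = k                         # record the new assignment
            n_dz[d, k] += 1
            n_zw[k, w] += 1
            n_z[k] += 1
    phi = (n_zw + beta) / (n_z[:, None] + vocab_size * beta)
    theta = (n_dz + alpha) / (n_dz.sum(axis=1, keepdims=True) + n_topics * alpha)
    return phi, theta

# Tiny corpus: two documents use words 0-1, two documents use words 2-3.
docs = np.array([[5, 5, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 0, 6, 4]])
phi, theta = lda_gibbs(docs, n_topics=2)
```

Here `phi` holds the word distribution of each topic and `theta` the topic proportions of each document. This pure Python loop is slow, so it is only meant for toy data.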

Source code

http://gist.github.com/542786

Fairly readable and compact code but to be considered a toy implementation.

Useful Resources

MCMC

- “MCMC lecture at MLSS09” (Iain Murray). Nice for a first general overview and the insights.

- “Gibbs sampling for the uninitiated” (Resnik and Hardisty). Nice for a first general overview and the insights.

- “Pattern Recognition and Machine Learning” (Bishop), Chapters 8 and 11 on graphical models and sampling methods. Excellent chapters.

- “Review Course: Markov Chains and Monte Carlo Methods” (Cosma and Evers). Very nice free online course and solutions to exercises in Python and R!

LDA

- “Latent Dirichlet Allocation” (Blei et al, 2003). By Blei himself.

- “Finding scientific topics” (Griffiths and Steyvers). Insightful comments and nice intuitive graphical example.

- “Parameter Estimation for text analysis” (Heinrich). Very nice introduction to Bayesian thinking. Pseudo-code for the LDA Gibbs sampler.

- “On an equivalence between PLSI and LDA” (Girolami and Kaban). Connections between pLSA and LDA.

- “Integrating Out Multinomial Parameters in Latent Dirichlet Allocation and Naive Bayes for Collapsed Gibbs Sampling” (Carpenter). Very detailed, step-by-step derivation of the collapsed Gibbs samplers for LDA and NB.

- “Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details” (Wang). Insightful comments and pseudo-code of the LDA Gibbs sampler.

Other Python implementations

- nrolland’s pyLDA. Works fine but mixes Python-style and Numpy-style.

- alextp’s pylda. Numpy-style but not tested.


Semi-supervised Naive Bayes in Python

Mon, 21 Jun 2010 17:47:50 +0000

Expectation-Maximization

The Expectation-Maximization (EM) algorithm is a popular algorithm in statistics and machine learning to estimate the parameters of a model that depends on latent variables. (A latent variable is a variable that is not expressed in the dataset and thus that you can’t directly count. For example, in pLSA, the document topics z are latent variables.) EM is very intuitive. It works by pretending that we know what we’re looking for: the model parameters. First, we make an initial guess, which can be either random or “our best bet”. Then, in the E-step, we use our current model parameters to estimate some “measures”, the ones we would have used to compute the parameters, had they been available to us. In the M-step, we use these measures to compute the model parameters. The beauty of EM is that by iteratively repeating these two steps, the algorithm will provably converge to a local maximum for the likelihood that the model generated the data.

Naive Bayes trained with EM

In their paper “Semi-supervised Text Classification Using EM”, Nigam et al. describe how to use EM to train a Naive Bayes classifier in a semi-supervised fashion, that is with both labeled and unlabeled data. The algorithm is very intuitive:

Train a classifier with your labeled data
While the model likelihood increases:
- E-step: Use your current classifier to find P(c|x) for all classes c and all unlabeled examples x. These can be thought of as probabilistic/fractional labels.
- M-step: Train your classifier with the union of your labeled and probabilistically-labeled data.

The hope is that using (abundantly available) unlabeled data, in addition to (labor-intensive) labeled data, improves the quality of the classifier.
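The loop above can be sketched with NumPy as follows. This is a simplified illustration, not the seminb code: the function names are hypothetical, Laplace smoothing of 1 is assumed, and a fixed number of iterations replaces the “while the likelihood increases” test.

```python
import numpy as np

def nb_fit(X, R, smoothing=1.0):
    """M-step. X: (n, V) word counts; R: (n, C) fractional labels (rows sum to 1)."""
    log_prior = np.log(R.sum(axis=0) / R.sum())
    counts = R.T @ X + smoothing                       # expected word counts per class
    log_pw = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_prior, log_pw

def nb_posterior(X, log_prior, log_pw):
    """E-step: P(c|x) for each row of X, computed in the log domain."""
    z = X @ log_pw.T + log_prior
    z -= z.max(axis=1, keepdims=True)                  # avoid underflow
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def semi_supervised_nb(X_lab, y_lab, X_unlab, n_classes, n_iter=10):
    R_lab = np.eye(n_classes)[y_lab]                   # hard labels as one-hot rows
    log_prior, log_pw = nb_fit(X_lab, R_lab)           # initial classifier
    X_all = np.vstack((X_lab, X_unlab))
    for _ in range(n_iter):
        R_unlab = nb_posterior(X_unlab, log_prior, log_pw)              # E-step
        log_prior, log_pw = nb_fit(X_all, np.vstack((R_lab, R_unlab)))  # M-step
    return log_prior, log_pw

# Toy corpus with 4 words: class 0 favours words 0-1, class 1 words 2-3.
X_lab = np.array([[3, 1, 0, 0], [0, 0, 2, 3]])
y_lab = np.array([0, 1])
X_unlab = np.array([[4, 2, 0, 1], [0, 1, 3, 3], [5, 0, 1, 0]])

log_prior, log_pw = semi_supervised_nb(X_lab, y_lab, X_unlab, n_classes=2)
post = nb_posterior(X_lab, log_prior, log_pw)
```

The labeled examples keep their hard labels throughout, and only the unlabeled rows receive fractional labels that change from one iteration to the next.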

Code

I made a simple implementation of it in Python + Numpy. The code is fairly optimized.

$ git clone http://www.mblondel.org/code/seminb.git

web interface

Implementation details

Here are implementation details that were not mentioned in the original paper and that I found necessary to get a correct implementation.

Naive Bayes is called naive because of the (obviously wrong) assumption that words are conditionally independent given the class:

$$P(x|c) = \prod_{i=1}^{V} P(w_i|c)^{n_i}$$

where $n_i$ is the number of times the word $w_i$ occurs in the document $x$.

However, since the vocabulary size V can be pretty big and the probabilities P(w|c) can be pretty small, P(x|c) can quickly exceed the precision of the computer and become zero. The solution is to perform the computations in the log domain:

$$\log P(x|c) = \sum_{i=1}^{V} n_i \log P(w_i|c)$$

To obtain P(c|x) from P(x|c), we use Bayes’ rule:

$$P(y_i=c_j|x_i) = \frac{P(x_i|c_j) P(c_j)}{\sum_k P(x_i|c_k) P(c_k)}$$

By posing $z_j = \log P(x_i|c_j) + \log P(c_j)$, we get:

$$P(y_i=c_j|x_i) = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

This is the softmax function. However, we are back to our initial problem because, since the $z_j$ are likely to tend to -inf, the exponentials are likely to in turn underflow. The trick is to multiply the numerator and denominator by the same constant $e^{-m}$:

$$P(y_i=c_j|x_i) = \frac{e^{z_j-m}}{\sum_k e^{z_k-m}}$$

Setting m to $\max_k z_k$, the values $z_j - m$ will get closer to zero. The rationale for this is that computation of the exponential overflows earlier and is less precise for big values (positive or negative) than for small values.

In case this is not enough, we can further define:

$$P(y_i=c_j|x_i) = \begin{cases} 0, & \mbox{if } z_j - m \le t \\ \frac{e^{z_j-m}}{\sum_{\{k~:~z_k-m > t\}} e^{z_k-m}}, & \mbox{otherwise} \end{cases}$$

This truncates the exponentials to zero when $z_j - m \le t$. For t=-10, this corresponds to $e^{-10} \approx 0.000045$. Equivalently we can see it as setting the exponentials to zero when $z_j \le t + m$. Since both t and m are negative, this shows that subtracting the maximum m, as explained before, does help improve the precision.
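The effect of subtracting the maximum is easy to reproduce. In the NumPy sketch below (the function name and the example log-scores are made up), the naive exponentials all underflow to zero, whereas the shifted version yields a proper distribution; the optional threshold `t` implements the truncation.

```python
import numpy as np

def stable_softmax(z, t=None):
    """Softmax of the log-scores z, shifted by their maximum.

    If t is given, exponentials whose shifted argument is <= t are set to zero.
    """
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()                # m = max_k z_k, so shifted <= 0
    e = np.exp(shifted)
    if t is not None:
        e[shifted <= t] = 0.0
    return e / e.sum()

z = np.array([-1000.0, -1001.0, -1020.0])   # typical magnitudes of log-probabilities
naive = np.exp(z)                           # every entry underflows to 0.0
p = stable_softmax(z)
p_trunc = stable_softmax(z, t=-10.0)
```

Since the largest shifted value is always zero, at least one exponential equals 1 and the denominator can never vanish.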

Reference

Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Schölkopf, B. (Eds.) Semi-Supervised Learning. MIT Press: Boston. 2006.


LSA and pLSA in Python

Sun, 13 Jun 2010 17:42:58 +0000

Latent Semantic Analysis (LSA) and its probabilistic counterpart pLSA are two well-known techniques in Natural Language Processing that aim to analyze the co-occurrences of terms in a corpus of documents in order to find hidden/latent factors, regarded as topics or concepts. Since the number of topics/concepts is usually much smaller than the number of words and since it is not necessary to know the document categories/classes, LSA and pLSA are thus unsupervised dimensionality reduction techniques. Applications include information retrieval, document classification and collaborative filtering.

Note: LSA and pLSA are also known in the Information Retrieval community as LSI and pLSI, where I stands for Indexing.

Comparison

	LSA	pLSA
1. Theoretical background	Linear Algebra	Probabilities and Statistics
2. Objective function	Frobenius norm	Likelihood function
3. Polysemy	No	Yes
4. Folding-in	Straightforward	Complicated

1. LSA stems from Linear Algebra as it is nothing more than a Singular Value Decomposition. On the other hand, pLSA has a strong probabilistic grounding (latent variable models).

2. SVD is a least squares method (it finds a low-rank matrix approximation that minimizes the Frobenius norm of the difference with the original matrix). Moreover, as it is well known in Machine Learning, the least squares solution corresponds to the Maximum Likelihood solution when experimental errors are gaussian. Therefore, LSA makes an implicit assumption of gaussian noise on the term counts. On the other hand, the objective function maximized in pLSA is the likelihood function of multinomial sampling.

The values in the concept-term matrix found by LSA are not normalized and may even contain negative values. On the other hand, values found by pLSA are probabilities which means they are interpretable and can be combined with other models.

Note: SVD is equivalent to PCA (Principal Component Analysis) when the data is centered (has zero-mean).

3. Both LSA and pLSA can handle synonymy but LSA cannot handle polysemy, as words are defined by a unique point in a space.

4. LSA and pLSA analyze a corpus of documents in order to find a new low-dimensional representation of it. In order to be comparable, new documents that were not originally in the corpus must be projected into the lower-dimensional space too. This is called “folding-in”. Clearly, folded-in documents don’t contribute to learning the factored representation, so it is necessary to rebuild the model using all the documents from time to time.

In LSA, folding-in is as easy as a matrix-vector product. In pLSA, this requires several iterations of the EM algorithm.
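A minimal sketch of LSA folding-in, assuming a terms x documents matrix and the convention that a document is represented by its coordinates U_k^T d in the concept space (other scalings exist). The helper names lsa_fit and fold_in are hypothetical and are not taken from the repository below.

```python
import numpy as np

def lsa_fit(X, k):
    """Rank-k LSA of a terms x documents matrix via a thin SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(U_k, doc):
    """Project a new document (vector of term counts) into the concept space."""
    return U_k.T @ doc  # a single matrix-vector product

# Made-up corpus: 5 terms x 4 documents.
X = np.array([[2., 0., 1., 0.],
              [1., 0., 0., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.],
              [1., 0., 2., 0.]])
U_k, s_k, Vt_k = lsa_fit(X, k=2)

# A document that was not part of the corpus.
new_doc = np.array([1., 1., 0., 0., 1.])
new_doc_coords = fold_in(U_k, new_doc)
```

With this convention, folding-in a document that was already in the corpus gives back its original coordinates, i.e. the corresponding column of diag(s_k) @ Vt_k.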

Implementation in Python

LSA is straightforward to implement as it is nothing more than an SVD, and Numpy’s Linear Algebra module already has an “svd” function. This function has an argument full_matrices which, when set to False, greatly reduces the time required. This argument doesn’t mean that the SVD is not full, just that the returned matrices don’t contain vectors corresponding to zero singular values. Scipy’s Linear Algebra package unfortunately doesn’t seem to have a sparse SVD. Likewise, there’s no truncated SVD (there exist fast algorithms to directly compute a truncated SVD rather than computing the full SVD and then taking the top K singular values).

pLSA’s source code is a bit longer although quite compact too. Although the Python/Numpy code was quite optimized, it took half a day to compute on a 50000 x 8000 term-document matrix. I rewrote the training part in C and it now takes half an hour. Keeping the Python version is quite nice for checking the correctness of the C version and as a reference as the C version is a straightforward port of it.

The implementation is sparse. It works with both Numpy’s ndarrays and Scipy’s sparse matrices.

$ git clone http://www.mblondel.org/code/plsa.git

web interface
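For readers who want a feel for the EM updates before cloning the repository, here is a small dense sketch of pLSA. It is not the code from the repository (which is sparse and has a C version): it assumes the asymmetric parameterization P(w|d) = sum_z P(w|z) P(z|d), a documents x words count matrix, random initialization and a fixed number of iterations. The function name plsa and its parameters are hypothetical.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM. counts: (n_docs, n_words) array of term counts."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialization, normalized into probability distributions.
    p_z_d = rng.random((n_docs, n_topics))      # P(z | d)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))     # P(w | z)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # E-step: P(z | d, w), shape (n_docs, n_words, n_topics).
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + eps
        # M-step: re-estimate from expected counts n(d, w) * P(z | d, w).
        weighted = counts[:, :, None] * post
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z

# Made-up corpus: 4 documents x 4 words with two obvious themes.
counts = np.array([[4., 3., 0., 0.],
                   [5., 2., 0., 1.],
                   [0., 0., 3., 4.],
                   [0., 1., 4., 4.]])
p_z_d, p_w_z = plsa(counts, n_topics=2)
```

The dense (n_docs, n_words, n_topics) array keeps the sketch short but does not scale; a sparse implementation would only loop over the non-zero counts.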

Next, I would like to explore Fisher Kernels as they seem to interact nicely with pLSA. I would also like to implement Latent Dirichlet Allocation (LDA), although it’s more challenging. LDA is a Bayesian extension of pLSA: pLSA is equivalent to LDA under a uniform Dirichlet prior distribution.

Seam Carving in Python

Tue, 09 Feb 2010 15:57:20 +0000

Seam Carving is an algorithm for image resizing introduced in 2007 by S. Avidan and A. Shamir in their paper “Seam Carving for Content-Aware Image Resizing”.

Miyako Island, Okinawa, Japan.

The principle is very simple. Find the connected paths of low energy pixels (“the seams”). This can be done efficiently by dynamic programming (see my post on DTW).

Same image in the gradient domain showing the vertical and horizontal seams of lowest cumulated energy.

The seams of lowest cumulated energy can be seen as the pixels contributing the least to an image. By repeatedly removing or adding seams, it is thus possible to perform “content-aware” image reduction or extension. The resulting images feel more natural, less “stretched”.
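The dynamic programming step can be sketched in a few lines of Numpy. This is an illustration and not the implementation from the repository below: it assumes the energy is the sum of absolute gradients, it only handles vertical seams, and the function names are hypothetical. Horizontal seams can be obtained by transposing the image.

```python
import numpy as np

def energy_map(gray):
    """Gradient-magnitude energy of a 2-D grayscale image."""
    gy, gx = np.gradient(gray.astype(float))
    return np.abs(gx) + np.abs(gy)

def find_vertical_seam(energy):
    """For each row, the column of the connected seam of lowest cumulated energy."""
    h, w = energy.shape
    cost = energy.astype(float).copy()
    # Forward pass: cheapest way to reach each pixel from the top row.
    for i in range(1, h):
        prev = cost[i - 1]
        left = np.concatenate(([np.inf], prev[:-1]))
        right = np.concatenate((prev[1:], [np.inf]))
        cost[i] += np.minimum(np.minimum(left, prev), right)
    # Backward pass: follow the cheapest neighbours from bottom to top.
    seam = np.empty(h, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for i in range(h - 2, -1, -1):
        j = seam[i + 1]
        lo, hi = max(j - 1, 0), min(j + 2, w)
        seam[i] = lo + int(np.argmin(cost[i, lo:hi]))
    return seam

def remove_vertical_seam(image, seam):
    """Remove one pixel per row; works for (H, W) and (H, W, C) arrays."""
    h, w = image.shape[:2]
    mask = np.ones((h, w), dtype=bool)
    mask[np.arange(h), seam] = False
    return image[mask].reshape(h, w - 1, *image.shape[2:])

# Toy energy map whose cheapest seam is the diagonal.
energy = np.array([[1., 9., 9.],
                   [9., 1., 9.],
                   [9., 9., 1.]])
seam = find_vertical_seam(energy)
carved = remove_vertical_seam(energy, seam)
```

Reducing the width by n pixels then amounts to recomputing the energy and removing one seam, n times.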

Height reduced by 50% by seam carving.

Height reduced by 50% by traditional rescaling.

Although seam carving doesn’t need human intervention, in the original paper, a graphical user interface (GUI) was also developed to let the user define areas that can’t be removed, or conversely, that must be removed.

In my opinion, seam carving is simple and elegant. No sophisticated object recognition algorithm was used, yet the results are quite impressive.

You can find my implementation in 250 lines of Python in my git repo:

$ git clone http://www.mblondel.org/code/seam-carving.git

web interface

Unfortunately, it’s too slow to be real-time.

Easy parallelization with data decomposition

Fri, 27 Nov 2009 18:17:28 +0000

Recently I came across this blog post which introduced me to the new multiprocessing module in Python 2.6, a module to execute multiple concurrent processes. It makes parallelizing your programs very easy. The author also provided a smart code snippet that makes using multiprocessing even easier. I studied how the snippet works and I came up with an alternative solution which is in my opinion very elegant and easy to read. I’m so excited about the new possibilities provided by this module that I had to spread the word. But first, off to some background.

The multi-core trend

Moore’s law states that:

The density of transistors on chips doubles every 24 months.

Although Moore’s law, contrary to what is often thought, still holds true, the exponential growth in transistor count that it predicts does not always translate into exponentially greater practical computing performance. Therefore parallel computation has recently become necessary to take full advantage of the gains allowed by Moore’s law. This explains the recent multi-core trend: most recent computers are now equipped with 2 or more cores.

The problem is that you can’t just use multi-core equipped computers and hope that your programs will run faster on them. Programs need to be modified to operate in a parallel fashion as opposed to a sequential fashion.

At the same time, languages like Ruby and Python are famous for their GIL (Global Interpreter Lock). Because of the GIL, even programs that are designed to be parallel can effectively use only one core at a time, resulting in no speed improvement. Parallelism here is just an illusion: the processor switches between threads, but does so frequently enough that the user perceives the operations as being performed in parallel.

The novelty of the multiprocessing module in Python 2.6 is that it uses processes instead of threads (see Threads compared with processes) and it does not suffer from the GIL. Programs running on multi-cores can therefore operate in a truly parallel fashion.

Parallelizing programs

To make things simpler, let me quote the excellent blog post How-to Split a Problem into Tasks.

The very first step in every successful parallelization effort is always the same: you take a look at the problem that needs to be solved and start splitting it into tasks that can be computed in parallel. [...] what I am describing here is also called problem decomposition. The goal here is to divide the problem into several smaller subproblems, called tasks that can be computed in parallel later on. The tasks can be of different size and must not necessarily be independent.

And, about data decomposition:

When data structures with large amounts of similar data need to be processed, data decomposition is usually a well-performing decomposition technique. The tasks in this strategy consist of groups of data. These can be either input data, output data or even intermediate data, decompositions to all varieties are possible and may be useful. All processors perform the same operations on these data, which are often independent from one another. This is my favorite decomposition technique, because it is usually easy to do, often has no dependencies in between tasks and scales really well.

Data decomposition is so straightforward that it can without any doubt be called embarrassingly parallel.

Map

If you are a Python user, you most probably know list comprehensions:

>>> from math import sqrt
>>> [sqrt(i) for i in [1, 4, 9, 16]] 
[1.0, 2.0, 3.0, 4.0]

In this example, sqrt is applied to each element of the list and a list is returned. The resulting list and the input list are therefore the same size.

Probably less known are generator comprehensions, which can be written by replacing the outer brackets with parentheses:

>>> gen = (sqrt(i) for i in [1, 4, 9, 16])
>>> gen
<generator object at 0xb7cec56c>
>>> for i in gen: print i
1.0
2.0
3.0
4.0

The difference between list and generator comprehensions is that list comprehensions are evaluated entirely before returning, while generator comprehensions yield results one by one. Generators are therefore more “lazy” and can result in big memory savings when iterating over large lists.

The outer parentheses can even be omitted when calling functions with only 1 argument:

>>> print(sqrt(i) for i in [1, 4, 9, 16]) 
<generator object at 0xb7cec68c>

For those more familiar with functional programming, list comprehensions are similar to the map higher-order function.

In fact, Python has a built-in map function too.

>>> map(sqrt, [1, 4, 9, 16])
[1.0, 2.0, 3.0, 4.0]

Reduce

While map applies a function to each element of a list and returns the resulting list, reduce is a higher-order function that uses another function to combine the elements of a list in some way.

>>> plus = concatenate = lambda x,y: x+y
>>> reduce(plus, [1,2,3,4])
10
>>> reduce(concatenate, [[1,2], [3,4]])
[1, 2, 3, 4]

multiprocessing.Pool ‘s map

Applying a function to each element of a list with map more or less assumes that the function is pure, i.e. that the result returned by the function depends only on its input arguments. Although nothing prevents you from giving an impure function as argument to map, it is dirty, potentially dangerous and not in the functional philosophy. Concretely, it means that your functions had better not use global variables or anything whose state may change during the program execution. This also includes instance methods (an object, in essence, encapsulates a state).

To reuse the terminology above, if we think of applying our function to each element of the list as tasks, then our tasks are independent from each other and so there is no reason to operate over the list sequentially. Independence is also very nice because communication and collaboration between threads/processes happen to be one of the most difficult aspects of concurrent programming. Here, no communication between threads/processes is required.

And here comes the new multiprocessing module and more particularly its Pool class. This class represents pools of worker processes and has a map method, which is similar to the map built-in function.

>>> from multiprocessing import Pool
>>> pool = Pool(processes=4)
>>> pool.map(sqrt, [1,4,9,16])
[1.0, 2.0, 3.0, 4.0]

The difference with the built-in map here is that 4 processes are used. For functions that are expensive enough to outweigh the inter-process overhead, this can give up to about a 4x speedup if the computer running the program has at least 4 cores. Of course, sqrt is a toy example but here’s a real-life example in a Machine Learning context.

>>> image_sets = [set1, ..., setn]
>>> preprocessed = pool.map(preprocess, image_sets)
>>> feat_sets = pool.map(feat_extract, preprocessed)
>>> models = pool.map(train, feat_sets)

As long as you can write your code as list comprehensions, you can apply the data decomposition approach. It’s easy, abuse it!

However, handing work to another process has a cost: processes must be created and scheduled, and arguments and results must be passed between them. Therefore, when the function to be applied to each element returns almost instantaneously, it may be worth splitting the data into larger chunks, running each chunk in a separate process and then recombining the results with reduce. (See also MapReduce)
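The chunking idea can be sketched as follows. The helper names (chunks, apply_to_chunk, chunked_map) and the map_fn parameter are made up for this illustration, and the sketch is written for Python 3, where reduce lives in functools. By default it runs sequentially with the built-in map so that it works anywhere.

```python
from functools import partial, reduce
from math import sqrt
from operator import add

def chunks(seq, size):
    """Split seq into consecutive lists of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def apply_to_chunk(func, chunk):
    """Run func over a whole chunk inside a single task."""
    return [func(x) for x in chunk]

def chunked_map(func, seq, size, map_fn=map):
    """Map func over seq one chunk at a time, then recombine with reduce."""
    partial_results = map_fn(partial(apply_to_chunk, func), chunks(seq, size))
    # Each partial result is a list; concatenating them restores the order.
    return reduce(add, partial_results, [])

result = chunked_map(sqrt, [1, 4, 9, 16, 25], size=2)
```

Passing map_fn=pool.map, where pool is a multiprocessing.Pool, runs each chunk in a worker process. Pool.map also accepts a chunksize argument that batches items in a similar way.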

Helpers

Here are some helpers which make parallelizing your list comprehensions even more straightforward and easy to read.

As mentioned before, the blog post that introduced me to this new multiprocessing module also came with a smart code snippet. I reworked it to fit my liking and this is what it looks like now:

>>> sqrtd = delayed(sqrt)
>>> powd = delayed(pow)
>>> squares = [1, 4, 9, 16]
 
>>> pool_parallelize([sqrtd(i) for i in squares], njobs=4)
[1.0, 2.0, 3.0, 4.0]
 
>>> pool_parallelize([powd(i, 0.5) for i in squares], njobs=4)
[1.0, 2.0, 3.0, 4.0]

Contrary to Pool’s map, this supports parallelizing functions of any arity.

Then I came up with this solution, which reduces the typing and is quite elegant.

>>> sqrtp = parallelized(sqrt)
>>> powp = parallelized(pow)
 
>>> sqrtp(squares, njobs=4)
[1.0, 2.0, 3.0, 4.0]
 
>>> powp([(i, 0.5) for i in squares], njobs=4)
[1.0, 2.0, 3.0, 4.0]

Code available here.
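For readers without access to the linked code, here is a guess at how helpers with this interface could be written. It is only a sketch in Python 3 and not the linked code: the internals, the _run helper and the choice of representing a delayed call as a (func, args, kwargs) tuple are all assumptions.

```python
from math import sqrt
from multiprocessing import Pool

def delayed(func):
    """Wrap func so that calling it records (func, args, kwargs) instead of running it."""
    def record(*args, **kwargs):
        return func, args, kwargs
    return record

def _run(task):
    """Execute one recorded call (module-level so that it can be pickled)."""
    func, args, kwargs = task
    return func(*args, **kwargs)

def pool_parallelize(tasks, njobs=1):
    """Run recorded calls, sequentially if njobs == 1, else in a process pool."""
    tasks = list(tasks)
    if njobs == 1:
        return [_run(task) for task in tasks]
    with Pool(processes=njobs) as pool:
        return pool.map(_run, tasks)

def parallelized(func):
    """Return a function that applies func to each item of a list of arguments."""
    def run(arg_list, njobs=1):
        # A tuple item is unpacked as positional arguments, anything else is one argument.
        tasks = [(func, a if isinstance(a, tuple) else (a,), {}) for a in arg_list]
        return pool_parallelize(tasks, njobs=njobs)
    return run

squares = [1, 4, 9, 16]
roots = pool_parallelize([delayed(sqrt)(i) for i in squares], njobs=1)
roots_again = parallelized(pow)([(i, 0.5) for i in squares], njobs=1)
```

With njobs greater than 1, the calling code should sit under an `if __name__ == "__main__":` guard on platforms where worker processes are spawned rather than forked.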

Conclusion

You really gotta love functional programming in Python! (By the way, see also Charming Python: Functional programming in Python part 1 and part 2)

First release of Tegaki

Wed, 11 Feb 2009 22:33:25 +0000

Today I’m releasing Tegaki 0.1. Tegaki is an ongoing project which aims to develop a free and open-source modern implementation of handwriting recognition software, that is suitable for both the desktop and mobile devices, and that is designed from the ground up to work well with Chinese and Japanese.

Screencast video: ogg or youtube.

This release features desktop and SCIM integration. However, the main “innovation” brought to you by this release is the user interface. It uses two drawing areas for continuous writing. The user can then fix recognition errors, if any, by choosing alternative candidates or editing characters. Since a video is worth a thousand words, see the screencast above. This interface is largely inspired by the Nintendo DS game “Kanji Sono Mama Rakubiki Jiten” (漢字そのまま楽引辞典).

Tegaki is designed to be able to use several recognition engines. However, so far it only supports Zinnia, which is the only recognition engine that I know of with acceptable recognition accuracy and good performance on mobile devices. One challenge of the project in the future will be to create a new recognition engine that can yield better results than Zinnia.

My approach in this project is to use Python whenever possible and only use C or C++ when performance is critical, like in recognition engines. Compared to Tomoe, which implements everything in C and provides bindings for several languages, this means less reusability of the components but I hope this will make the project go forward faster.

There are still a lot of things that can be done in various areas but I really wanted to release the code I’ve put together so far because I think it can already be useful to end-users. By the way, Maemo supports both pygtk and SCIM through third-party projects, thus Tegaki is just a few Debian packages away from being available on Maemo.

For further details:
http://tegaki.sourceforge.net/

Linux in a Virtual Machine

Fri, 26 Dec 2008 12:39:14 +0000

I own a Macbook on which I’ve been running Linux 99% of the time for over a year now. Although a Macbook is not necessarily the best choice to run Linux, I made that decision because installing Linux on a Macbook is very well documented. However, no matter how far you get, it’s always difficult to reach a configuration you are 100% happy with (no subwoofer support, flaky suspend…). With recent advances in virtualization technologies, both in software and hardware, I’ve been wanting to try running Linux and Windows (the guest OSes) inside Mac OS X (the host OS).

Feature set

Here are some of the features you can expect from virtualization software.

- Near-native speed: thanks to recent hardware support for virtualization, it’s now possible to run an x86-based guest OS inside an x86 host OS with a considerably reduced speed penalty. As a result, the guest OS is perfectly comfortable to use. See X86 virtualization.

- Hardware emulation: in order to run unmodified guest OSes, virtual machines emulate hardware for which most OSes have drivers. As a result, it’s very easy to install the guest OS because the emulated hardware is standard. The virtual machine can be seen as an abstraction layer between the actual hardware and the guest OS.

- Growing-size hard disk: while it’s usually possible to boot directly from a disk partition, most virtualization software supports growing-size file disks. Those files contain the contents of the emulated hard disk and are able to grow dynamically in size. If you allocated 32 GB to your virtual hard disk but it only contains 8 GB so far, the guest OS will see a 32 GB disk available but it will only take 8 GB on the actual hard disk.

Personally, I have a partition for Windows on my Macbook but I used it only once to release Fantasdic for Windows. With their dynamically growing file disk, virtual machines would have allowed me to save a lot of space. Some software, like QEMU, supports compressed disk files, to save even more space.

- NAT: this allows network traffic to be transferred to and from the virtual machine. Whether you’re connected to the Internet through Ethernet, Wifi or 3G makes no difference to the guest OS. It only sees an Ethernet network card and you can connect to the Internet from the guest OS transparently. However, this also has some drawbacks because you wouldn’t be able to run software that directly needs the Wifi card (e.g. network scanner).

- Bridge networking or Host interface networking: this allows your virtual machines to have their own IP address in your local network. If you install the Apache HTTP server in your Linux virtual machine, you can access it from Mac OS X or any other computer on the local network. This also allows me to ssh into my Nokia internet tablet or copy files to it.

Although most virtualization software has built-in convenience support for shared folders, bridge networking allows you to share folders across virtual machines and even real machines over Samba or SSH.

- USB: if you plug USB devices in your computer, they can be redirected at will to the virtual machine.

- CD Rom: you can redirect your real CD Rom to the virtual machine or select an .iso image and it will be seen as a CD Rom from the virtual machine.

- Seamless integration in the host OS: seamless mouse pointer integration, copy-paste between host and guest OS…

- 3D acceleration: although I don’t play games on computers, this may be of interest for some people.

Parallels Desktop and Vmware Fusion

The first thing I did was to try out the trial versions of the commercial products Parallels Desktop and Vmware Fusion. Both work well, are easy to use, are highly integrated in Mac OS X and share a fair amount of features. Both cost $79.99. I didn’t test them in full depth but from what I saw, it would be difficult for me to choose between the two if I had to.

I digress, but I witnessed a shameless commercial practice at Parallels Desktop. Since my browser is in French, it was automatically redirected to the French version of Parallels Desktop’s website and the software was priced 79.99 euros. However, if you go to the English version, the software costs 79.99 dollars. That is 56.98 euros… Is that the cost of localization?!

QEMU

I got all excited about virtualization but not to the point of paying this prohibitive price so I started to have a look at open-source solutions. I tried out Q, which is a Mac OS X port of QEMU. QEMU does more than Parallels Desktop and Vmware Fusion. It can run a guest OS built for one processor architecture inside a host OS running on another processor architecture. It’s thus also a processor emulator.

Although it worked well, Q was terribly slow. While looking for acceleration extensions for QEMU, I found QVM86 but Wikipedia mentions that the developer ceased development when VirtualBox was released so I quickly switched to trying out VirtualBox.

VirtualBox

VirtualBox was THE pleasant surprise of all the software I tested. It’s free, it’s open-source, it’s easy to use, it’s well integrated in Mac OS X. VirtualBox has every feature mentioned above. I’m simply amazed by this software.

A limited set of components such as USB support are closed-source. This means that if you install VirtualBox on a Linux host using the distribution package, you get the open-source edition, which misses some features. However, it is always possible to download the binaries from the official website and install them by hand.

VirtualBox was originally developed by the German company innotek. However, the company was acquired by Sun in February 2008. Also note that VirtualBox includes code from QEMU.

Maemo development

I wanted to check whether VirtualBox could be used for Maemo development or not. Since the Maemo SDK itself relies on virtualization software (QEMU) to emulate the ARMEL architecture, my concern was that running virtualization software inside virtualization software may not work well or be too slow.

I installed Scratchbox and the Maemo SDK inside my Linux virtual machine and it turned out that my concerns were not justified. The SDK works well and is very responsive, even when running graphical applications with Xephyr.

As mentioned before, Bridge networking allowed me to ssh into my tablet and copy files to it with scp, from the virtual machine. While the USB is correctly redirected to the virtual machine when the tablet is in USB mass storage mode, I was unable to get USB networking working because Mac OS X intercepts/detects the network connection.

I think VirtualBox is a good solution for Mac OS X users willing to develop Maemo applications.

Zinnia

Sat, 20 Sep 2008 07:55:45 +0000

In my last post, I was writing about this impressive Chinese character recognition demo using AJAX on the client side and Support Vector Machines (SVM) on the server side, for the recognition process. Well, I don’t know if it’s just a coincidence (this demo was from 2 years ago) but Taku Kudo released last week the backend he’s using as free software. Needless to say that this was awesome news for me! I know the basic principle of SVM but time to learn more about it I guess…

His project, called Zinnia, has been rewritten from scratch to be more flexible and reusable. Models for Japanese and Chinese are included but models for other languages can be built easily provided that you have training data. I’m pretty sure that this package could also be useful for Gesture Recognition because it’s so close to Handwriting Recognition…

For the sake of comparison, I wanted to evaluate how Zinnia performs compared to both Tomoe and my own HMM experiment. I used the same evaluation corpus that I wrote about in earlier posts, that is, two sets of 50 kanji, one written by a Japanese friend of mine and one by me. The characters have the correct stroke order and were drawn carefully. Therefore, the results below indicate how the different recognizers perform in ideal conditions and don’t indicate how robust they would be in more difficult conditions.

Tomoe - Zinnia

	Tomoe	Zinnia
1st match accuracy	61%	77%
5 matches accuracy	74%	92%
10 matches accuracy	74%	93%
Recognition time	21 / 100 = 0.21 s	3 / 100 = 0.03 s
Total number of kanji	3000	3000

1st match accuracy is the percentage of characters that were recognized as first match.
5 matches accuracy is the percentage of characters that were recognized in the first 5 matches.
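As an illustration of these definitions (this is not the evaluation script mentioned below), the n matches accuracy can be computed from the ranked candidate lists returned by a recognizer. The function name and the sample data are made up.

```python
def top_n_accuracy(candidates, truths, n):
    """Fraction of samples whose true character is among the first n candidates."""
    hits = sum(1 for cands, truth in zip(candidates, truths) if truth in cands[:n])
    return hits / len(truths)

# Made-up recognizer output: one ranked candidate list per handwriting sample.
candidates = [["日", "目", "白"],
              ["木", "本", "末"],
              ["土", "士", "工"],
              ["未", "末", "木"]]
truths = ["日", "本", "工", "大"]

first_match = top_n_accuracy(candidates, truths, n=1)
three_matches = top_n_accuracy(candidates, truths, n=3)
```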

You can download my evaluation script for Zinnia here. Tomoe’s evaluation script is sitting in Tomoe’s SVN, in the benchmark/ folder.

A few remarks:
- Zinnia is notably better than Tomoe in terms of accuracy
- Zinnia is about 7 times faster than Tomoe, making it a good candidate for an embedded platform
- In both cases, 5 matches and 10 matches accuracy are about the same, meaning that it would be enough for the user interface to display the first 5 matches only.

Project Tegaki - Zinnia

Due to lack of training data, my personal HMM experiment (project Tegaki) was only conducted over a set of 50 characters. However, Zinnia supports over 3000 characters. For fair comparison, I thus created new models for Zinnia using the same training data as I used for my experiment.

Zinnia was trained with only one sample per character, using the same data as Tomoe, which is template-based. While SVM seems to be able to cope with only one sample per character, it’s a little bit more complicated to do that with HMM because of the need to find the parameters of the Observation Probability Density Function (e.g. mean and variance for a Gaussian).

	Project Tegaki	Zinnia
1st match accuracy	92%	100%
5 matches accuracy	100%	100%
10 matches accuracy	100%	100%
Recognition time	14 / 100 = 0.14 s	1.50 / 100 = 0.015 s
Total number of kanji	50	50

A few remarks:
- My experiment is slow, which is probably due to the fact that I’m using Character-level models. Stroke-level models are known to scale much better.
- My experiment has slightly worse accuracy, which is probably because I’m only using two features per observation.

Handwriting database

If you follow my adventures in the world of handwritten Chinese character recognition, you probably know that I’m planning to create a handwriting database website. This database will aim to 1) make it easy and attractive for people to contribute their handwriting samples and 2) make it easy for the database staff to manage and organize what is supposed to become a large collection of handwriting samples.

The database will use a client/server architecture. So far I’m thinking of four important clients:
- A client that people will be able to use directly in their web browser, using my web canvas
- A client for the Maemo platform
- A client for the iPhone
- A multi-platform client for the desktop

A client of slightly lesser priority would be a Facebook application.

The handwriting samples collected will be distributed under a free software license. For projects like Zinnia or Project Tegaki, this will mean more training data and more means to evaluate the performance. I consider this database one of my priorities among my free software projects but it’s going to be quite hard for me to find time for that before December…

Contribute

As always, more people are welcome to contribute.

To download the source code of my work,

$ git clone http://www.mblondel.org/code/hwr.git

web interface

Web Canvas

Fri, 01 Aug 2008 13:46:35 +0000

In my last post, I was calling for contributors to write a web canvas using the HTML <canvas> element.

Since nobody responded to my call (sic), I decided to tackle it by myself. It turned out to be a nice little project. The canvas Javascript API is very similar to the cairo API so it was easy to use. I also improved my Javascript a lot. So far the web canvas supports drawing, import (JSON), export (XML), saving as an image and replay (stroke by stroke animation).

You can try it by using the online DEMO.

What can it be useful for?

- I’m planning to use it for the handwriting database website that I wrote about some time ago. While it will be possible to contribute your handwriting using a pygtk client (Desktop or Maemo), you will also be able to contribute your handwriting using your browser directly. Not having to install any program should help increase the number of people contributing their handwriting.

- A second way of using it would be to do handwriting recognition directly in the browser. For example, one could install Tomoe (or my recognizer when it’s ready ;-)) on the server side and the web canvas on the client side. Since Tomoe has Python and Ruby bindings, this is fairly easy!

You can reuse the web canvas for your own projects if you like but I would appreciate if you could send me any feature improvement. In particular, the web canvas has a bug under Internet Explorer that I couldn’t figure out…

Source code (GPL) : gitweb

Handwriting renderers

Sun, 13 Jul 2008 07:03:22 +0000

Canvas

If you didn’t read my previous post, in short, project Tegaki is a framework for handwritten Chinese character recognition (HCCR) written in Python. It includes reusable components and is a placeholder for experimentation. The goal is to create the next-generation open-source HCCR software but it may be useful for academic researchers as well.

One reusable component is the Canvas. This is the user interface component that lets the user draw characters. In addition, the Canvas supports “replaying” the character (stroke by stroke animation) and setting a background model (to help users draw an unknown character). It is multi-platform.

Example of a character drawn using the Canvas provided by libtegaki-gtk

The Canvas has a get_writing() method. It returns the Writing object for the handwriting currently displayed in the Canvas.

XML representation

The Writing object supports reading from and writing to an XML file. The XML file can optionally be compressed using gzip or bz2. On my hard drive, I have a small set of handwriting samples. 500 characters take about 10 MB. That’s why compression is very useful.

The XML representation of a handwriting sample looks like that.

Renderers

I’ve recently added support for what I named “renderers”. They take a Writing object as parameter and generate a visual representation of it. Since I used the cairo graphics library as drawing backend, the representation can be saved to PNG, SVG and PDF! Those renderers will be very useful for the handwriting database website that I wrote about in my previous post!

Complete character renderer

Stroke order renderer

Stroke order with each single stroke

Stroke order with stroke groups

Strokes can be grouped together when the stroke order is obvious. However, this requires knowing which strokes to combine. A dictionary must be created for that. An example entry would be:

駅 1,1,3,1,4,2,2
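A possible way to read such an entry is sketched below. It assumes that each number is the size of a group of consecutive strokes (the numbers above add up to 14, the stroke count of 駅). Both this interpretation and the function names are assumptions, since the dictionary format is not specified further.

```python
def parse_entry(line):
    """Parse 'CHAR n1,n2,...' into (char, [n1, n2, ...])."""
    char, sizes = line.split()
    return char, [int(n) for n in sizes.split(",")]

def group_strokes(strokes, sizes):
    """Split a list of strokes into consecutive groups of the given sizes."""
    if sum(sizes) != len(strokes):
        raise ValueError("group sizes do not match the number of strokes")
    groups, start = [], 0
    for size in sizes:
        groups.append(strokes[start:start + size])
        start += size
    return groups

char, sizes = parse_entry("駅 1,1,3,1,4,2,2")
# Strokes are stood in for by their 1-based index in the stroke order.
groups = group_strokes(list(range(1, 15)), sizes)
```

A stroke order renderer could then draw one frame per group instead of one frame per stroke.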

The canvas I was writing about above is written in pygtk and is intended to be used for the Desktop or for Maemo. However, in the case of the handwriting database website, since we want as many people to contribute their handwriting as possible, it would be nice to not require any particular installation. For that, a canvas directly in the browser would be the ideal solution.

One solution would be to use Flash but I would prefer to use the HTML <canvas> element.

I am looking for a contributor to create a new canvas using this technology. The canvas should support drawing, displaying existing handwriting and replay (stroke by stroke animation).

For more information:

Canvas (HTML element) (Wikipedia)
Canvas tutorial (developer.mozilla.org)
Canvas painter (a paint-like application in the browser)
ExplorerCanvas (by Google)

GIF stroke animation

Even though GIF uses a patented compression, GIF is still the only format with support for animations and wide support in the browsers. Therefore it would be very cool to be able to generate GIF stroke animations from a writing object.

I had a look at python-imagemagick and Python Imaging Library (PIL) but they both seem to have very limited support for GIF animations. So I’m thinking of writing my own library for GIF generation in Python. Byzanz, a software to create screencasts as GIF animations, can be used as inspiration because it includes a GIF encoder. It also supports color quantization (using octrees) and dithering. From what I see, it should take less than 1000 lines of Python code.

I read a little bit about color quantization. I found it very interesting. Here’s a short explanation about color quantization for those who don’t know about it. Basically, each pixel in an image may have three components (red, green and blue), each stored in one byte. For a 400×400 picture, this is 400*400*3 = 480,000 bytes (about 480 KB). To save space, an idea is to store colors in a palette (a table index => color). Then each pixel only needs to refer to the index in the palette instead of having to define the three components. For a 256-color palette, this saves two bytes for each pixel. However, since we now use 256 colors only instead of 256 * 256 * 256 = 16,777,216 colors, there’s a loss of color precision. The challenge is thus to find which colors to put in the palette to have the smallest precision loss possible. For example, we may want to put in the palette colors that are the closest to the most frequently used colors. This is a 3-dimensional clustering problem, thus it reminded me of Machine Learning, a topic in which I’ve been very interested recently.
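To make this concrete, here is a deliberately naive sketch: the palette is simply made of the most frequent colors, and every pixel is mapped to the nearest palette entry. This "popularity" strategy is chosen for brevity and is not the octree method mentioned above. The function name and the toy image are made up.

```python
import numpy as np

def quantize(image, n_colors=256):
    """image: (H, W, 3) uint8 array. Returns (palette, indices), n_colors <= 256."""
    pixels = image.reshape(-1, 3)
    colors, inverse, counts = np.unique(
        pixels, axis=0, return_inverse=True, return_counts=True)
    # Palette = the n_colors most frequent distinct colors.
    palette = colors[np.argsort(-counts, kind="stable")[:n_colors]]
    # Squared RGB distance from every distinct color to every palette entry.
    diff = colors[:, None, :].astype(int) - palette[None, :, :].astype(int)
    nearest = np.argmin((diff ** 2).sum(axis=2), axis=1)
    # One palette index (one byte) per pixel.
    indices = nearest[inverse.ravel()].reshape(image.shape[:2]).astype(np.uint8)
    return palette, indices

# Toy 2x2 image: two red pixels, one blue, one slightly darker red.
image = np.array([[[255, 0, 0], [255, 0, 0]],
                  [[0, 0, 255], [250, 0, 0]]], dtype=np.uint8)
palette, indices = quantize(image, n_colors=2)

# Storage for a 400x400 picture, in bytes.
raw_bytes = 400 * 400 * 3                  # three bytes per pixel
indexed_bytes = 400 * 400 * 1 + 256 * 3    # one byte per pixel plus the palette
```

The image is recovered (with some loss) as palette[indices]. Here the darker red is replaced by the pure red of the palette.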

For more information, I recommend the reading of those Wikipedia articles:

A roadmap for project Tegaki

Fri, 04 Jul 2008 18:16:49 +0000

Codename Project Tegaki

I wrote in a previous post about my first experiment with applying a modern technique, namely Hidden Markov Models, to handwritten Chinese character recognition. I’m quite motivated to make this more than a single isolated experiment, so I decided to give the project a name: Project Tegaki. This is going to be the codename for the effort from now on. Tegaki means handwriting in Japanese.

Project statement

The aim of Project Tegaki is to push forward the creation of the next-generation open-source handwritten Chinese character recognition (HCCR) software.

Currently, the only open-source package for HCCR is Tomoe. This is a project that I have been contributing to and that I used for my Google Summer of Code project, “Japanese/Chinese handwriting recognition on maemo”. Maemo is the open-source platform used by Nokia PDAs. I have decided to start Project Tegaki as an external effort because I considered that Tomoe would not be a good environment to welcome the effort. However, if the Tomoe community is ready to help me in this effort, I will be happy to merge Project Tegaki back into Tomoe once Project Tegaki becomes ready for prime-time.

Handwritten Chinese character recognition in a PDA…

Here are some goals for the project:

- Free and open-source. The goal is to produce the next-generation free and open-source HCCR software.

- Modern. The software should use modern approaches to handwriting recognition and stay closely connected with research.

- Embedded. The project must be designed to work with devices with restricted resources such as cell phones or PDAs.

- Online, as opposed to offline. In online recognition, characters are drawn using a device, typically a mouse, a tablet or a PDA stylus. In this setting, characters can be represented as sequences of points. In offline recognition, characters are scanned a posteriori. In this setting, characters are represented as images (width * height pixels).

- Isolated Chinese character recognition. Here “Chinese character” is not restricted to the Chinese language, since Japanese kanji are also Chinese characters! Even though the package should theoretically generalize to any kind of character, Chinese characters pose specific challenges, and some approaches that give good results for them may not give good results for other kinds of characters. “Isolated character recognition” means that the user will have to draw one character at a time in a separate box, as opposed to continuous handwriting recognition. This makes things much easier and, in the case of Chinese characters, it is a reasonable limitation.

- Stroke order dependent and independent. Both situations have useful applications so Project Tegaki should ideally support both.
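The difference between the online and offline settings can be sketched in a few lines of Python. This is a toy illustration, not libtegaki code: the stroke format (lists of integer (x, y) points) and the `rasterize` helper are assumptions made for the example.

```python
def rasterize(strokes, width, height):
    """Turn an online character into an offline bitmap.

    `strokes` is a list of strokes, each stroke being a list of integer
    (x, y) points inside the image. Consecutive points are joined by
    straight segments; single-point strokes are ignored in this sketch.
    """
    image = [[0] * width for _ in range(height)]
    for stroke in strokes:
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            steps = max(abs(x1 - x0), abs(y1 - y0), 1)
            for i in range(steps + 1):
                x = round(x0 + (x1 - x0) * i / steps)
                y = round(y0 + (y1 - y0) * i / steps)
                image[y][x] = 1
    return image


# Online representation of 十: a horizontal stroke, then a vertical stroke.
strokes = [[(0, 2), (4, 2)], [(2, 0), (2, 4)]]

# Offline representation: a 5x5 bitmap of the same character.
image = rasterize(strokes, 5, 5)

# Drawing the strokes in the opposite order gives the same bitmap.
same_image = rasterize(strokes[::-1], 5, 5)
```

The bitmap no longer carries any stroke order, whereas the online representation does. This is why stroke order dependent recognition only makes sense in the online setting.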

Python?

Usually I’m more of a Ruby fan but the project was started in Python due to dependencies on third-party libraries that only exist in Python. Even though I’m slowly getting away from those dependencies, I don’t want to re-implement everything just for the sake of using Ruby. So I’m sticking with Python.

As emphasized above, this project is highly experimental. Moreover, a collaborative website will be created (see below) and it will reuse a number of existing components. It thus makes sense to use a high-level language to focus on the experiments and to create the website.

Subprojects

Project Tegaki is now split into several subprojects.

libtegaki

This Python library contains functionality that will be useful to other subprojects. This includes array manipulation, character input/output, a Viterbi decoder…

libtegaki-gtk

This Python library contains user interface elements that will be useful to other subprojects. So far it only includes a Canvas, which can be used to draw characters. It is a replacement for TomoeCanvas with some additional benefits:

- Truly reusable. TomoeCanvas assumes that a recognizer is connected to the canvas. However, there are situations when a recognizer is not needed.

- Resizable. TomoeCanvas cannot be resized at will.

- Animation. A stroke animation of a character can be displayed.

- Background character. A background character can be set as a model and animations will be displayed to help draw the same character stroke by stroke.

- Features other than (x,y) coordinates are supported such as pen pressure and pen inclination when available, stroke duration, point timestamp.

libtegaki-gtk is written in pygtk and depends on libtegaki.

tegaki-db

The most successful handwriting recognition systems nowadays use a “learn by example” philosophy. For each character supported, several samples of the handwritten character must be provided to the system in order to learn from them. Because those samples are used to train the system, they are called “training samples”. The challenge for the final recognizer is to be able to recognize unseen handwritten instances of the same characters. This is the ability of the recognizer to “generalize” the acquired knowledge.

A “training corpus” is a set of training samples. A good corpus should contain dozens of handwritten samples for each character. The corpus should be representative enough of all handwriting styles. Collecting all the handwriting samples and designing a good corpus is a huge task for Chinese characters because there exist thousands of them!

Such handwritten Chinese character databases do exist but they are not free of charge and they are usually restricted to academic research. They are by no means suitable for free software. The goal of the tegaki-db subproject is to create a collaborative web platform to collect handwriting samples. Native speakers and learners alike will be able to log in and contribute their own handwriting. The collected data will be published under a free license so that it can benefit academic research as well. tegaki-db will use a client / server architecture.

tegaki-db-client

tegaki-db-client is a client for people to input their handwriting. It will be written in Python and use the canvas provided by libtegaki-gtk. The client will communicate with the server through web services. The client should be distributed for several platforms such as Linux, Windows and Maemo to increase the number of potential contributors. A detailed specification of tegaki-db and tegaki-db-client will be provided later in a separate post.

tegaki-models

tegaki-models is by no means an end-user package and will only be used by developers. It is the place for experimentation: this is where model ideas will be tested and evaluated.

I have continued to work on new model ideas… However, because my current training corpus is so small, it’s kind of pointless to spend too much time on models. The top priority now is to create tegaki-db.

tegaki-decoder

tegaki-decoder is going to be a high-performance decoder (recognizer). It should be a fast implementation of the Viterbi decoder. It will be written in C and designed to work with embedded systems. This is going to be the end-product that people will use. Once sufficient data have been collected, good models have been generated and the tegaki decoder is ready, then Project Tegaki will be ready for real use! Currently, implementing tegaki-decoder is not the top priority.
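For readers who have never seen it, here is a minimal sketch of the Viterbi algorithm for a discrete Hidden Markov Model, in plain Python. It is only meant to show what such a decoder computes: the state names, observation symbols and probabilities below are made-up toy values, and the real tegaki-decoder is planned in C with a different design.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence and its log-probability."""
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t][s]: best predecessor of state s at time t + 1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


def to_log(table):
    """Convert a (possibly nested) dict of non-zero probabilities to logs."""
    return {
        key: to_log(value) if isinstance(value, dict) else math.log(value)
        for key, value in table.items()
    }


# Toy model with two hidden states and three observation symbols.
states = ["s1", "s2"]
log_start = to_log({"s1": 0.6, "s2": 0.4})
log_trans = to_log({"s1": {"s1": 0.7, "s2": 0.3},
                    "s2": {"s1": 0.4, "s2": 0.6}})
log_emit = to_log({"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
                   "s2": {"a": 0.1, "b": 0.3, "c": 0.6}})

path, log_prob = viterbi(["a", "b", "c"], states, log_start, log_trans, log_emit)
```

Working in log space avoids numerical underflow on long observation sequences, which matters when the observations are the many points of a handwritten character.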

Roadmap

- Launch tegaki-db and tegaki-db-client.
- Hope that the collaborative effort is successful and collect lots of handwriting samples from many different people.
- Create new models, especially stroke-based models.
- Implement tegaki-decoder.

If I continue to be the only one interested in this project, at this rate it will take from several months to a couple of years to achieve everything. That’s why I hope I can attract a few contributors.

Download

The work completed so far is still very experimental and thus targets potential contributors. If you want to test it with your own handwriting anyway, please see my previous post.

To download the source code, you can use

$ git clone http://www.mblondel.org/code/hwr.git

or, if you already have the repository on your computer, run

$ git pull

from the repository folder.

The code can be browsed online using gitweb. By clicking the “snapshot” links you can get a complete copy of the source code at a given revision.

See my memo on git if you don’t know it yet.

I published my work under the GPL license.


Handwriting recognition inside the VKB

Sat, 08 Sep 2007 12:49:59 +0000

Since my last post, MaemoCJK has been released for beta-testing with support for Japanese/Chinese handwriting recognition, together with a few interesting new features such as switching between input methods at runtime.

This screenshot showcases the latest development I made. It is not shipped with the beta release mentioned above but should hopefully be part of the upcoming release. As suggested in the comments of my last post, I tried to include the input pad directly inside the VKB (Virtual Keyboard) rather than using a separate window. Quite unexpectedly (for me), it gave very good results in terms of usability. The main advantage of this approach is that the input pad doesn’t take up the whole screen, so you can still see the characters being entered when you click on them. Even though the input pad is much smaller than before, it doesn’t seem to affect recognition accuracy, which is a very good point. As you can see from the screenshot, you can switch between VKB mode and handwriting mode with the button on the right. You can also switch between Chinese and Japanese with the menu on the left. A positive side effect of the inclusion in the VKB is that we are sure that the dictionary gets loaded only once.

Attentive readers will have noticed that there is a small bug with fonts. Even though Chinese is selected on this screenshot, Pango (GTK’s text layout engine) picks a Japanese font for some characters and a Chinese font for others. This explains why the characters’ fonts don’t look consistent and why the first candidate is displayed as 真 (Japanese) instead of 真 (Chinese). The problem comes from the fact that those two characters share the same Unicode value, so if Pango cannot detect the language from the context, no assumption can be made as to which font will be selected to render the character (see Han unification controversy for more details). To solve this problem, we just have to tell Pango explicitly which language it is.
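One way to tell Pango the language is its markup syntax, where a span can carry a lang attribute. The sketch below only builds such a markup string in plain Python; the `tag_language` helper is a made-up name for this example, and whether MaemoCJK ends up using markup or another Pango API is not decided here. Which font actually gets picked still depends on the fonts installed on the device.

```python
from xml.sax.saxutils import escape, quoteattr


def tag_language(text, lang):
    """Wrap `text` in a Pango markup span with an explicit language tag."""
    return "<span lang=%s>%s</span>" % (quoteattr(lang), escape(text))


# The same code point, tagged once as Chinese and once as Japanese.
chinese = tag_language("真", "zh-cn")
japanese = tag_language("真", "ja")
```

In a GTK application, such a string would typically be handed to a widget that accepts Pango markup, for example through a label’s set_markup method.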

Apart from the inclusion in the VKB, I helped with other parts of the MaemoCJK project. I added scim-anthy support, added a French keyboard layout (I had to fix a problem in libfakekey for that) and improved the scim-config UI on small screens, among other things.

In other news, Nokia announced this week that they released their proprietary Hildon IM engine under LGPL. I wonder how this will affect the overall MaemoCJK project.

The Google Summer of Code is now over. I think that overall it went quite well. I reached my objectives, which were to add handwriting recognition to MaemoCJK with a particular focus on performance and smooth integration. Performance is now quite reasonable and integration is pretty good with the inclusion in the VKB. One big regret is that many of my changes to Tomoe have not been merged into the upstream project yet, and I think I could have contributed much more if the maintainers had been ready for that. However, I now know the code base well so I’m quite confident that I can help the project in the future. My main interest is in adding handwriting recognition using Hidden Markov Models. And of course, I’ll continue to maintain the handwriting recognition portion of MaemoCJK.


