Planet maemo: category "feed:46b1d6b26651a331cde2ad188d699e0c"

mblondel

Transparent system-wide proxy

2011-06-26 23:18 UTC  by  mblondel

Proxies can be a powerful way to improve anonymity or to bypass various kinds of restrictions on the Internet (government censorship, region-locked content, …). In this post, I’ll describe a simple technique to create a transparent proxy at the system level. It’s especially useful when you want to make sure that all connections go through the proxy or when the application you care about doesn’t have proxy support. I don’t think this technique is that well known, hence this post.

1. Create SOCKS proxy with OpenSSH

If you have a user account on a remote machine, the simplest way to create a proxy is to use the following command.

$ ssh -N -D 1080 username@serverhost

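This creates a SOCKS proxy listening on port 1080 of your local machine (-D enables dynamic port forwarding; -N tells ssh not to run a remote command). The main advantage of the SOCKS protocol is that it is not tied to one application protocol: it can relay connections to any destination port, with the remote server making the connection on your behalf.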
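To make the "any port" point concrete, here is a minimal sketch of the first messages a SOCKS5 client sends to the proxy, following the layout in RFC 1928. The function names are my own and purely illustrative, and nothing here opens a network connection; it only builds the bytes.

```python
import struct

SOCKS_VERSION = 5      # SOCKS5
NO_AUTH = 0x00         # "no authentication required" method
CMD_CONNECT = 0x01     # ask the proxy to open a TCP connection
ATYP_DOMAIN = 0x03     # destination is given as a domain name


def socks5_greeting():
    """Greeting: version, number of auth methods, then the method list."""
    return bytes([SOCKS_VERSION, 1, NO_AUTH])


def socks5_connect_request(host, port):
    """CONNECT request for an arbitrary host and TCP port.

    The destination port is an ordinary 16-bit field in network byte
    order, which is why the proxy is not limited to HTTP on port 80.
    """
    name = host.encode("ascii")
    if len(name) > 255:
        raise ValueError("domain name too long for SOCKS5")
    header = bytes([SOCKS_VERSION, CMD_CONNECT, 0x00, ATYP_DOMAIN, len(name)])
    return header + name + struct.pack("!H", port)


if __name__ == "__main__":
    print(socks5_greeting().hex())
    print(socks5_connect_request("example.org", 22).hex())
```

A real client would write these bytes to 127.0.0.1:1080 and read the proxy's replies in between; in practice a SOCKS-aware library or redsocks (below) does this for you.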

2. Forward connections transparently with iptables or ipfw

Most modern browsers let the user define a SOCKS proxy in their advanced network preferences. Many other applications, however, don’t support the SOCKS protocol at all. The solution is to use system tools such as iptables (Linux) or ipfw (FreeBSD, OS X) to enforce the routing at the system level. For example, on OS X, I use the following command to redirect connections to port 80 (HTTP) to local port 12345.

$ sudo ipfw add 100 fwd 127.0.0.1,12345 dst-port 80

3. Redirect connections to SOCKS proxy with redsocks

iptables and ipfw don’t have built-in support for the SOCKS protocol. It is thus necessary to use an additional program to transform the connections on the fly. This is where redsocks comes in.

$ redsocks -c config_file

In the configuration file, you need to configure redsocks to listen on port 12345 and to forward to the SOCKS proxy on port 1080. “generic” can be used as the redirector option. The GitHub repository includes a sample configuration file.
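As a rough illustration, the small script below prints a configuration with those settings. The section and key names (base, redsocks, local_port, port, type, redirector) are written from my recollection of the sample file, so treat them as assumptions and check them against the sample configuration shipped in the repository; the function name is hypothetical.

```python
def render_redsocks_config(listen_port=12345, socks_port=1080,
                           redirector="generic"):
    """Return the text of a minimal redsocks configuration file.

    Key names are assumed from the project's sample configuration and
    should be verified against the version you install.
    """
    return f"""base {{
    log_info = on;
    daemon = off;
    redirector = {redirector};
}}

redsocks {{
    // where redsocks accepts the connections redirected by ipfw/iptables
    local_ip = 127.0.0.1;
    local_port = {listen_port};

    // where the SOCKS proxy created by ssh -D is listening
    ip = 127.0.0.1;
    port = {socks_port};
    type = socks5;
}}
"""


if __name__ == "__main__":
    print(render_redsocks_config())
```

Redirecting the output to a file gives something you can pass to redsocks with the -c option shown above.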

And voilà! We have now configured the system to transparently route connections for us. To summarize, here is the big picture:

local machine -> redsocks -> SOCKS proxy -> target server
Categories: In English

Regularized Least Squares

2011-02-09 14:20 UTC  by  mblondel

Recently, I’ve contributed a bunch of improvements (sparse matrix support, classification, generalized cross-validation) to the ridge module in scikits.learn. Since I’m receiving good feedback regarding my posts on Machine Learning, I’m taking this as an opportunity to summarize some important points about Regularized Least Squares, and more precisely Ridge regression.

Categories: In English

Kernel Perceptron in Python

2010-10-31 05:36 UTC  by  mblondel

The Perceptron (Rosenblatt, 1957) is one of the oldest and simplest Machine Learning algorithms. It’s also trivial to kernelize, which makes it an ideal candidate for gaining insight into kernel methods.
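To give an idea of how little kernelizing takes, here is a minimal sketch (not the code from the full post). Instead of a weight vector, it keeps one mistake counter per training example and scores a point with a kernel-weighted vote. The polynomial kernel and the XOR-style toy data are my own choices for illustration.

```python
def polynomial_kernel(x, z, degree=2):
    """K(x, z) = (1 + <x, z>)^degree."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree


def decision_function(X, y, alpha, kernel, x):
    """Kernel-weighted vote of the training examples for point x."""
    return sum(a * y_j * kernel(x_j, x) for a, y_j, x_j in zip(alpha, y, X))


def train_kernel_perceptron(X, y, kernel, epochs=10):
    """Return alpha, where alpha[i] counts the mistakes made on example i."""
    alpha = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i, (x_i, y_i) in enumerate(zip(X, y)):
            # Same update rule as the linear perceptron: act only on mistakes.
            if y_i * decision_function(X, y, alpha, kernel, x_i) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break  # every training example is classified correctly
    return alpha


def predict(X, y, alpha, kernel, x):
    return 1 if decision_function(X, y, alpha, kernel, x) > 0 else -1


if __name__ == "__main__":
    # XOR-like labels: not linearly separable in the input space.
    X = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
    y = [-1, -1, 1, 1]
    alpha = train_kernel_perceptron(X, y, polynomial_kernel)
    print(alpha, [predict(X, y, alpha, polynomial_kernel, x) for x in X])
```

Swapping in a different kernel function changes the decision boundary without touching the training loop.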

Categories: In English

Support Vector Machines in Python

2010-09-19 14:07 UTC  by  mblondel

Support Vector Machines (SVMs) are state-of-the-art classifiers in many applications and have become ubiquitous thanks to the wealth of open-source libraries implementing them. However, you learn a lot more by actually doing than by just reading, so let’s play a little bit with SVMs in Python! To make it easier to read, we will use the same notation as in Christopher Bishop’s book “Pattern Recognition and Machine Learning”.

Categories: In English

Latent Dirichlet Allocation in Python

2010-08-21 20:52 UTC  by  mblondel

Like Latent Semantic Analysis (LSA) and probabilistic LSA (pLSA), which I covered in my previous post “LSA and pLSA in Python”, Latent Dirichlet Allocation (LDA) is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by documents in that collection. LDA can be seen as a Bayesian extension of pLSA.

Categories: In English

Semi-supervised Naive Bayes in Python

2010-06-21 17:47 UTC  by  mblondel

Expectation-Maximization

The Expectation-Maximization (EM) algorithm is a popular algorithm in statistics and machine learning for estimating the parameters of a model that depends on latent variables. (A latent variable is a variable that is not observed in the dataset and that you therefore can’t count directly. For example, in pLSA, the document topics z are latent variables.) EM is very intuitive. It works by pretending that we already know what we’re looking for: the model parameters. First, we make an initial guess, which can be either random or “our best bet”. Then, in the E-step, we use our current model parameters to estimate some “measures”, the ones we would have used to compute the parameters, had they been available to us. In the M-step, we use these measures to recompute the model parameters. The beauty of EM is that each iteration provably never decreases the likelihood of the data under the model, so repeating these two steps converges to a local maximum of that likelihood (not necessarily the global one).
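The two steps are easier to see on a toy problem, so here is a minimal sketch. It is not the Naive Bayes model of this post: it assumes two coins with unknown biases, where each trial picks one coin (with equal probability, an assumption) and flips it several times, and the identity of the coin is the latent variable. The data and starting values are made up for illustration.

```python
import math


def em_two_coins(trials, theta_a, theta_b, iterations=20):
    """Estimate the head probabilities of two coins from unlabeled trials.

    trials: list of (heads, flips) pairs; which coin produced each trial
    is unknown. Returns (theta_a, theta_b, log_likelihoods), where the
    last item holds the log-likelihood (up to an additive constant)
    measured at the start of each iteration.
    """
    log_likelihoods = []
    for _ in range(iterations):
        heads_a = flips_a = heads_b = flips_b = 0.0
        log_likelihood = 0.0
        for heads, flips in trials:
            tails = flips - heads
            like_a = theta_a ** heads * (1 - theta_a) ** tails
            like_b = theta_b ** heads * (1 - theta_b) ** tails
            log_likelihood += math.log(0.5 * like_a + 0.5 * like_b)
            # E-step: probability that coin A produced this trial.
            resp_a = like_a / (like_a + like_b)
            resp_b = 1.0 - resp_a
            # Expected counts: the "measures" we could not observe directly.
            heads_a += resp_a * heads
            flips_a += resp_a * flips
            heads_b += resp_b * heads
            flips_b += resp_b * flips
        log_likelihoods.append(log_likelihood)
        # M-step: re-estimate the parameters from the expected counts.
        theta_a = heads_a / flips_a
        theta_b = heads_b / flips_b
    return theta_a, theta_b, log_likelihoods


if __name__ == "__main__":
    trials = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]
    theta_a, theta_b, history = em_two_coins(trials, 0.6, 0.5)
    print(theta_a, theta_b, history[-1])
```

Printing the returned log-likelihoods is a handy sanity check: they should never go down from one iteration to the next.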

Categories: In English

LSA and pLSA in Python

2010-06-13 17:42 UTC  by  mblondel

Latent Semantic Analysis (LSA) and its probabilistic counterpart pLSA are two well-known techniques in Natural Language Processing that analyze the co-occurrences of terms in a corpus of documents in order to find hidden (latent) factors, regarded as topics or concepts. Since the number of topics is usually much smaller than the number of words, and since the document categories do not need to be known, LSA and pLSA are unsupervised dimensionality reduction techniques. Applications include information retrieval, document classification and collaborative filtering.

Categories: In English

The Little Machine Learner

2010-02-18 11:57 UTC  by  mblondel

The idea

I’ve had this idea in mind for quite some time: wouldn’t it be nice to write a book about Machine Learning where each chapter is a literate program?

Categories: In English

Seam Carving in Python

2010-02-09 15:57 UTC  by  mblondel

Seam Carving is an algorithm for image resizing introduced in 2007 by S. Avidan and A. Shamir in their paper “Seam Carving for Content-Aware Image Resizing”.


Miyako Island, Okinawa, Japan.

The principle is very simple. Find the connected paths of low-energy pixels (“the seams”). This can be done efficiently by dynamic programming (see my post on DTW).


Same image in the gradient domain showing the vertical and horizontal seams of lowest cumulative energy.

The seams of lowest cumulative energy can be seen as the pixels contributing the least to an image. By repeatedly removing or adding seams, it is thus possible to perform “content-aware” image reduction or extension. The resulting images feel more natural, less “stretched”.
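The dynamic programming step can be sketched in a few lines. This is a simplified illustration rather than the code from my repository: it assumes the energy map is already computed and stored as a list of rows, and the function names are made up for the example. It handles vertical seams; horizontal seams work the same way on the transposed image.

```python
def find_vertical_seam(energy):
    """Return one column index per row for the seam of lowest total energy.

    energy is a list of rows of numbers. Consecutive seam pixels are
    connected: the column changes by at most one from a row to the next.
    """
    rows, cols = len(energy), len(energy[0])
    # cost[r][c] = lowest cumulative energy of a seam ending at (r, c).
    cost = [list(energy[0])]
    for r in range(1, rows):
        prev = cost[-1]
        row = []
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            row.append(energy[r][c] + min(prev[lo:hi + 1]))
        cost.append(row)
    # Backtrack from the cheapest cell of the last row.
    c = min(range(cols), key=cost[-1].__getitem__)
    seam = [c]
    for r in range(rows - 2, -1, -1):
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        c = min(range(lo, hi + 1), key=cost[r].__getitem__)
        seam.append(c)
    seam.reverse()
    return seam


def remove_vertical_seam(grid, seam):
    """Drop one pixel per row, making the grid one column narrower."""
    return [row[:c] + row[c + 1:] for row, c in zip(grid, seam)]


if __name__ == "__main__":
    energy = [
        [1, 4, 3],
        [5, 2, 6],
        [7, 8, 1],
    ]
    seam = find_vertical_seam(energy)
    print(seam, remove_vertical_seam(energy, seam))
```

Resizing then amounts to recomputing the energy map and repeating this step once per pixel of width (or height) to remove.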


Height reduced by 50% by seam carving.


Height reduced by 50% by traditional rescaling.

Although seam carving doesn’t need human intervention, the original paper also presents a graphical user interface (GUI) that lets the user mark areas that must not be removed or, conversely, areas that must be removed.

In my opinion, seam carving is simple and elegant. No sophisticated object recognition algorithm was used, yet the results are quite impressive.

You can find my implementation in 250 lines of Python in my git repo:

$ git clone http://www.mblondel.org/code/seam-carving.git

web interface

Unfortunately, it’s too slow to be real-time.

Categories: Image Processing

Caching computation tasks

2010-01-27 14:12 UTC  by  mblondel

When I work on computationally expensive projects (e.g., Machine Learning), I always find myself in the same situation: my programs can be broken down into a chain of tasks, where tasks may depend on the results of other tasks. A typical such chain would be:

Categories: In English

First release of Tegaki

2009-02-11 22:33 UTC  by  mblondel

Today I’m releasing Tegaki 0.1. Tegaki is an ongoing project which aims to develop a free and open-source modern implementation of handwriting recognition software, that is suitable for both the desktop and mobile devices, and that is designed from the ground up to work well with Chinese and Japanese.

Screencast video: ogg or youtube.

This release features desktop and SCIM integration. However, the main “innovation” brought to you by this release is the user interface. It uses two drawing areas for continuous writing. If needed, the user can then fix recognition errors by choosing alternative candidates or editing characters. Since a video is worth a thousand words, see the screencast above. This interface is largely inspired by the Nintendo DS game “Kanji Sono Mama Rakubiki Jiten” (漢字そのまま楽引辞典).

Tegaki is designed to be able to use several recognition engines. However, so far it only supports Zinnia, which is the only recognition engine that I know of with acceptable recognition accuracy and good performance on mobile devices. One future challenge for the project will be to create a new recognition engine that can yield better results than Zinnia.

My approach in this project is to use Python wherever possible and to use C or C++ only when performance is critical, as in recognition engines. Compared to Tomoe, which implements everything in C and provides bindings for several languages, this means the components are less reusable, but I hope it will make the project move forward faster.

There are still a lot of things that can be done in various areas but I really wanted to release the code I’ve put together so far because I think it can already be useful to end-users. By the way, Maemo supports both pygtk and SCIM through third-party projects, thus Tegaki is just a few Debian packages away from being available on Maemo.

For further details:
http://tegaki.sourceforge.net/

Categories: Projects

Linux in a Virtual Machine

2008-12-26 12:39 UTC  by  mblondel

I own a MacBook on which I’ve been running Linux 99% of the time for over a year now. Although a MacBook is not necessarily the best choice for running Linux, I made that decision because installing Linux on a MacBook is very well documented. However far you get, though, it’s difficult to reach a configuration you are 100% happy with (no subwoofer support, flaky suspend…). With recent advances in virtualization technologies, both in software and hardware, I’ve been wanting to try running Linux and Windows (the guest OSes) inside Mac OS X (the host OS).

Categories: Sysadmin