A roadmap for project Tegaki
From http://www.mblondel.org/journal/2008/07/04/a-roadm
Posted on 2008-07-04 21:16:49 EEST.
Codename Project Tegaki
I wrote in a previous post about my first experiment with applying a modern technique, namely Hidden Markov Models, to handwritten Chinese character recognition. I’m quite motivated to make this more than a single isolated experiment, so I decided to give the project a name: Project Tegaki. This will be the codename for the effort from now on. Tegaki means “handwriting” in Japanese.
Project statement
The aim of Project Tegaki is to push forward the creation of the next-generation open-source handwritten Chinese character recognition (HCCR) software.
Currently, the only open-source package for HCCR is Tomoe. This is a project that I have been contributing to and that I used for my Google Summer of Code project, “Japanese/Chinese handwriting recognition on maemo”. Maemo is the open-source platform used by Nokia PDAs. I have decided to start Project Tegaki as an external effort because I did not think Tomoe would be a good environment to host it. However, if the Tomoe community is ready to help me in this effort, I will be happy to merge Project Tegaki back into Tomoe once it is ready for prime time.

[Image: handwritten Chinese character recognition on a PDA]
Here are some goals for the project:
- Free and open-source. The goal is to produce the next-generation free and open-source HCCR software.
- Modern. The software should use modern approaches to handwriting recognition and stay closely connected with research.
- Embedded. The project must be designed to work with devices with restricted resources such as cell phones or PDAs.
- Online, as opposed to offline. In online recognition, characters are drawn using a device, typically a mouse, a tablet or a PDA stylus. In this setting, characters can be represented as sequences of points. In offline recognition, characters are scanned a posteriori. In this setting, characters are represented as images (width * height pixels).
- Isolated Chinese character recognition. Here, “Chinese character” is not restricted to the Chinese language, since Japanese kanji are also Chinese characters! Even though the package should theoretically be generalizable to any kind of character, Chinese characters pose specific challenges, and approaches that give good results for them may not give good results for other kinds of characters. “Isolated character recognition” means that the user will have to draw one character at a time in a separate box, as opposed to continuous handwriting recognition. This makes things much easier, and in the case of Chinese characters it is a reasonable limitation.
- Stroke order dependent and independent. Both situations have useful applications so Project Tegaki should ideally support both.
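To make the online/offline distinction concrete, here is a minimal sketch in plain Python. It is not code from the project: the names `online_char` and `rasterize` and the tiny 5 x 5 grid are assumptions chosen for illustration. The online form keeps strokes as ordered point sequences, while the offline form is only a grid of pixels.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Stroke = List[Point]

# Online representation: a character is an ordered list of strokes,
# and each stroke is an ordered list of pen positions.
# This toy example is the character "十" drawn on a 5 x 5 grid.
online_char: List[Stroke] = [
    [(0, 2), (4, 2)],  # horizontal stroke, left to right
    [(2, 0), (2, 4)],  # vertical stroke, top to bottom
]


def rasterize(strokes: List[Stroke], width: int, height: int) -> List[List[int]]:
    """Offline representation: a height x width grid of 0/1 pixels.

    Consecutive points of a stroke are joined by a straight segment.
    A stroke with a single point has no segment and is not drawn here.
    """
    image = [[0] * width for _ in range(height)]
    for stroke in strokes:
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            steps = max(abs(x1 - x0), abs(y1 - y0), 1)
            for i in range(steps + 1):
                x = round(x0 + (x1 - x0) * i / steps)
                y = round(y0 + (y1 - y0) * i / steps)
                if 0 <= x < width and 0 <= y < height:
                    image[y][x] = 1
    return image


image = rasterize(online_char, 5, 5)
for row in image:
    print("".join("#" if pixel else "." for pixel in row))

# Drawing the strokes in the opposite order gives the same pixels:
# the image no longer records stroke order or direction.
same_image = rasterize(online_char[::-1], 5, 5)
```

The last two lines hint at why stroke order only matters in the online setting: once the character is an image, the order in which the strokes were drawn is gone.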
Python?
Usually I’m more of a Ruby fan, but the project was started in Python because it depended on third-party libraries that only exist in Python. Even though I’m slowly moving away from those dependencies, I don’t want to re-implement everything just for the sake of using Ruby, so I’m sticking with Python.
As emphasized above, this project is highly experimental. Moreover, a collaborative website will be created (see below) and it will reuse a number of existing components. It thus makes sense to use a high-level language, both to focus on the experiments and to create the website.
Subprojects
Project Tegaki is now split into several subprojects.
libtegaki
This Python library contains functionality that will be useful to other subprojects. This includes array manipulation, character input/output, a Viterbi decoder…
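For readers unfamiliar with Viterbi decoding, the following is a minimal textbook sketch for a discrete Hidden Markov Model, written with the standard library only. It is not libtegaki's actual API: the function `viterbi`, the helper `logs`, and the toy two-state model (a "horizontal" state `h` followed by a "vertical" state `v`, observing quantized pen directions `E` and `S`) are all assumptions made up for this example.

```python
import math


def logs(probabilities):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: math.log(value) for key, value in probabilities.items()}


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for a discrete HMM."""
    # scores[s] = best log-probability of any state path ending in s
    scores = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor of s given the scores so far
            prev = max(states, key=lambda p: scores[p] + log_trans[p][s])
            new_scores[s] = scores[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path back from the best final state
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


# Toy model (made-up numbers): pen moves east (E) while in state "h",
# then south (S) while in state "v".
states = ["h", "v"]
log_start = logs({"h": 0.9, "v": 0.1})
log_trans = {"h": logs({"h": 0.6, "v": 0.4}), "v": logs({"h": 0.1, "v": 0.9})}
log_emit = {"h": logs({"E": 0.8, "S": 0.2}), "v": logs({"E": 0.2, "S": 0.8})}

score, path = viterbi(["E", "E", "S", "S"], states, log_start, log_trans, log_emit)
print(path)  # ['h', 'h', 'v', 'v']
```

Working in log-probabilities avoids numerical underflow on long observation sequences, which is the usual choice for this algorithm.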
libtegaki-gtk
This Python library contains user interface elements that will be useful to other subprojects. So far it only includes a Canvas, which can be used to draw characters. It is a replacement for TomoeCanvas with some additional benefits:
- Truly reusable. TomoeCanvas assumes that a recognizer is connected to the canvas. However, there are situations when a recognizer is not needed.
- Resizable. TomoeCanvas cannot be resized at will.
- Animation. A stroke animation of a character can be displayed.
- Background character. A background character can be set as a model and animations will be displayed to help draw the same character stroke by stroke.
- Extra features. Features other than (x, y) coordinates are supported, such as pen pressure and pen inclination (when available), stroke duration and point timestamps.
libtegaki-gtk is written in pygtk and depends on libtegaki.
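As an illustration of what a point carrying more than (x, y) could look like, here is a small sketch using modern Python dataclasses. The class and field names (`Point`, `Stroke`, `pressure`, `xtilt`, `ytilt`, `timestamp`, `duration`) are assumptions for this example and are not taken from the libtegaki-gtk source. Optional features default to `None` because not every device reports them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Point:
    x: int
    y: int
    pressure: Optional[float] = None  # pen pressure, if the device reports it
    xtilt: Optional[float] = None     # pen inclination along x, if available
    ytilt: Optional[float] = None     # pen inclination along y, if available
    timestamp: Optional[int] = None   # milliseconds since the stroke began


@dataclass
class Stroke:
    points: List[Point] = field(default_factory=list)

    @property
    def duration(self) -> Optional[int]:
        """Time between the first and last timestamped point, or None."""
        stamps = [p.timestamp for p in self.points if p.timestamp is not None]
        if len(stamps) < 2:
            return None
        return max(stamps) - min(stamps)


timed_stroke = Stroke([
    Point(0, 0, pressure=0.4, timestamp=0),
    Point(5, 1, pressure=0.7, timestamp=40),
    Point(9, 3, pressure=0.5, timestamp=95),
])
untimed_stroke = Stroke([Point(0, 0), Point(3, 3)])

print(timed_stroke.duration)    # 95
print(untimed_stroke.duration)  # None
```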
tegaki-db
The most successful handwriting recognition systems nowadays use a “learn by example” philosophy. For each character supported, several samples of the handwritten character must be provided to the system in order to learn from them. Because those samples are used to train the system, they are called “training samples”. The challenge for the final recognizer is to be able to recognize unseen handwritten instances of the same characters. This is the ability of the recognizer to “generalize” the acquired knowledge.
A “training corpus” is a set of training samples. A good corpus should contain dozens of handwritten samples for each character. The corpus should be representative enough of all handwriting styles. Collecting all the handwriting samples and designing a good corpus is a huge task for Chinese characters because there exist thousands of them!
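One common way to check generalization is to hold back part of the corpus and evaluate the recognizer only on those unseen samples. The sketch below shows that idea under simple assumptions: the corpus is a dict mapping each character to a list of samples, the samples are placeholder strings, and the function name `split_corpus` is hypothetical.

```python
import random


def split_corpus(corpus, test_fraction=0.2, seed=0):
    """Split each character's samples into training and held-out test sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = {}, {}
    for char, samples in corpus.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test[char] = shuffled[:n_test]
        train[char] = shuffled[n_test:]
    return train, test


# Placeholder corpus: ten samples for each of two characters.
corpus = {
    "一": ["ichi-sample-%d" % i for i in range(10)],
    "二": ["ni-sample-%d" % i for i in range(10)],
}
train, test = split_corpus(corpus)

for char in corpus:
    print(char, len(train[char]), "training samples,", len(test[char]), "held out")
```

Splitting per character, rather than over the whole corpus at once, guarantees that every character keeps some training samples.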
Such handwritten Chinese character databases do exist, but they require a fee and are usually restricted to academic research. They are by no means suitable for free software. The goal of the tegaki-db subproject is to create a collaborative web platform to collect handwriting samples. Native speakers and learners alike will be able to log in and contribute their own handwriting. The collected data will be published under a free license so that it can benefit academic research as well. tegaki-db will use a client/server architecture.
tegaki-db-client
tegaki-db-client is a client for people to input their handwriting. It will be written in Python and use the canvas provided by libtegaki-gtk. The client will communicate with the server through web services. The client should be distributed for several platforms such as Linux, Windows and Maemo to increase the number of potential contributors. A detailed specification of tegaki-db and tegaki-db-client will be provided later in a separate post.
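Since the specification has not been published yet, the following is only a guess at what a client-to-server payload might look like. The use of JSON, the field names (`character`, `writer`, `strokes`) and the functions `encode_sample` and `decode_sample` are all hypothetical and chosen purely for illustration.

```python
import json


def encode_sample(character, strokes, writer_id):
    """Serialize one handwriting sample to a JSON string (hypothetical format)."""
    payload = {
        "character": character,
        "writer": writer_id,
        # Each stroke is a list of [x, y] pairs
        "strokes": [[list(point) for point in stroke] for stroke in strokes],
    }
    return json.dumps(payload, ensure_ascii=False)


def decode_sample(text):
    """Inverse of encode_sample: return (character, strokes, writer_id)."""
    payload = json.loads(text)
    strokes = [[tuple(point) for point in stroke] for stroke in payload["strokes"]]
    return payload["character"], strokes, payload["writer"]


sample_strokes = [[(0, 2), (4, 2)], [(2, 0), (2, 4)]]
message = encode_sample("十", sample_strokes, "writer-001")
decoded = decode_sample(message)
print(message)
```

A text-based format like this is easy to inspect and to produce from any client platform, at the cost of being larger than a binary encoding.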
tegaki-models
tegaki-models is by no means an end-user package and will only be used by developers. It is the place for experimentation, where model ideas will be tested and evaluated.
I continued to work on new model ideas… However, because my current training corpus is so small, there is little point in spending too much time on models. The top priority now is to create tegaki-db.
tegaki-decoder
tegaki-decoder is going to be a high-performance decoder (recognizer). It should be a fast implementation of the Viterbi decoder. It will be written in C and designed to work with embedded systems. This is going to be the end product that people will use. Once sufficient data have been collected, good models have been generated and tegaki-decoder is ready, Project Tegaki will be ready for real use! Currently, implementing tegaki-decoder is not the top priority.
Roadmap
- Launch tegaki-db and tegaki-db-client.
- Hope that the collaborative effort is successful and collects lots of handwriting samples from many different people.
- Create new models, especially stroke-based models.
- Implement tegaki-decoder.
If I continue to be the only one interested in this project, at this rate it will take from several months to a couple of years to achieve everything. That’s why I hope I can attract a few contributors.
Download
The work completed so far is still very experimental and thus targets potential contributors. If you want to test it with your own handwriting anyway, please see my previous post.
To download the source code, you can use
$ git clone http://www.mblondel.org/code/hwr.git
or
$ git pull
from the repository folder if you already have the repository on your computer.
The code can be browsed online using gitweb. By clicking the “snapshot” links you can get a complete copy of the source code at a given revision.
See my memo on git if you don’t know it yet.
I published my work under the GPL license.
