Planet maemo: category "feed:43af5b2374081abdd0dbc4ba26a0b54c"

Philip Van Hoof

I wrote some documentation for the SPARQL IN feature that we recently added. It includes some interesting use-cases, like doing an insert and a delete based on IN values.

For the new class signal API that we’re developing this week and next, we’ll probably emit the IDs that tracker:id() would give you if you used it on a resource. This makes IN very useful for fetching the metadata of the resources whose IDs you just received from the class signal.

We never documented tracker:id() very much, as it’s not an RDF standard; it’s Tracker-specific. But the class signals aren’t an RDF standard either; they are Tracker-specific too. I guess that makes the two usable in combination, and makes tracker:id()’s status of ‘internal API’ irrelevant.

We’re prototyping the new class signals API right now. Its D-Bus signature will probably be “sa(iii)a(iii)”.

That’s the class-name and two arrays of (subject-id, predicate-id, object-id). The class-name is there to allow D-Bus filtering. The first array holds the deletes and the second holds the inserts. We’ll only give you the object-ids of non-literal objects (literal objects have no internal object-id). This means that we don’t send literals in the signal: for a literal object we’ll put 0 in the signal, and you need to make a query to get its value.
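To make the layout concrete, here is a minimal sketch of how a client might hold one such signal in memory once it has been demarshaled. The type and function names (TripleIds, ClassSignal, count_literal_objects) are hypothetical and not part of Tracker; the only things taken from the text are the “sa(iii)a(iii)” shape and the convention that an object-id of 0 stands for a literal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One "(iii)" entry: subject-id, predicate-id, object-id.
 * D-Bus type "i" is a signed 32-bit integer. */
typedef struct {
    int32_t subject_id;
    int32_t predicate_id;
    int32_t object_id;   /* 0 means: the object is a literal */
} TripleIds;

/* Hypothetical in-memory view of one "sa(iii)a(iii)" signal. */
typedef struct {
    const char      *class_name;  /* "s": what clients filter on */
    const TripleIds *deletes;     /* first "a(iii)" */
    size_t           n_deletes;
    const TripleIds *inserts;     /* second "a(iii)" */
    size_t           n_inserts;
} ClassSignal;

/* Count the entries whose object is a literal (object_id == 0).
 * Those are the ones that need a follow-up query to get the value. */
static size_t
count_literal_objects (const TripleIds *triples, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++) {
        if (triples[i].object_id == 0)
            count++;
    }
    return count;
}
```

A client could use such a count to decide whether a roundtrip query is needed at all: if none of the inserts has a literal object, everything it needs is already in the signal as IDs.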

We give you the object-ids because of a use-case that we didn’t cover yet:

Given the triple <a> nie:isLogicalPartOf <b>: when <a> is deleted, how do you know <b> at the time of the signal? The feature request was to be able to do a select ?b { <a> nie:isLogicalPartOf ?b } when <a> is deleted, which the client can’t do because by then the triple is already gone.

With the new signal we’ll give you the ID of <b> when <a> is deleted. We’ll also implement a tracker:uri(integer id) allowing you to get <b> out of that ID. It’ll do something like this, but then much faster: select ?subject { ?subject a rdfs:Resource . FILTER (tracker:id(?subject) IN (%d)) }
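Until tracker:uri() exists, a client would have to build that IN query itself from the IDs it received. A minimal sketch, assuming the IDs arrive as plain C ints; build_id_query is a hypothetical helper, not a Tracker function, and the query text is the one quoted above with the %d expanded into a comma-separated list.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write a SPARQL query into buf that selects the resources whose
 * tracker:id() is one of ids[0..n-1].
 * Returns 0 on success, -1 if n is 0 or buf is too small. */
static int
build_id_query (char *buf, size_t size, const int *ids, size_t n)
{
    size_t used, i;
    int written;

    if (n == 0)
        return -1;

    written = snprintf (buf, size,
                        "SELECT ?subject { ?subject a rdfs:Resource . "
                        "FILTER (tracker:id(?subject) IN (");
    if (written < 0 || (size_t) written >= size)
        return -1;
    used = (size_t) written;

    /* Append the IDs, separated by ", " */
    for (i = 0; i < n; i++) {
        written = snprintf (buf + used, size - used, "%s%d",
                            i == 0 ? "" : ", ", ids[i]);
        if (written < 0 || (size_t) written >= size - used)
            return -1;
        used += (size_t) written;
    }

    /* Close the IN list, the FILTER and the graph pattern */
    written = snprintf (buf + used, size - used, ")) }");
    if (written < 0 || (size_t) written >= size - used)
        return -1;

    return 0;
}
```

Because the IDs are integers formatted with %d, there is no escaping to worry about here; that would be different for a query built from URIs or literals.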

I know there will be people screaming for all objects, also literals, in the signals, but we don’t want to flood your D-Bus daemon with all that data. Scream all you want. Really, we don’t. Just do a roundtrip query.

Categories: condescending
Philip Van Hoof

Busy handling

Categories: condescending
Philip Van Hoof

Neelie Kroes on open source

2010-07-15 10:18 UTC  by  Philip Van Hoof

Video link

Categories: english
Philip Van Hoof

Support for domain specific indexes is finished and awaiting review, although we can optimize it further now. More on that later in this post. Imagine that you have this ontology:

Categories: condescending
Philip Van Hoof

SQLite’s WAL

SQLite is working on WAL, which stands for Write Ahead Logging.

The new logging technique means that we can probably keep read statements open for multiple processes. It’s not full MVCC yet, as simultaneous writes are still not possible. But in our use-case, reading with multiple processes is vastly more important anyway.

We’re investigating SQLite’s WAL mode thoroughly over the next few days. Jürg is doing most of the work on this at the moment. If WAL fits our purpose, then we’ll probably also start developing a direct-access library that’ll allow your process to connect directly to our SQLite database, avoiding any form of IPC.

Adrien’s FD-passing is in master, though. And it’s performing quite well!

We’re thrilled that SQLite’s team is taking this direction with WAL. Very awesome, guys!

Domain specific indexes

Yesterday I worked on support for deleting a domain specific index from the ontology. Because SQLite’s ALTER TABLE doesn’t support dropping a column, I had to do it by renaming the original table, recreating the table without the mirror column, copying the data over from the renamed table, and finally dropping the renamed table. It’s nasty, but it works. I think SQLite should just add DROP COLUMN to ALTER. Why is this so hard to add?
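The rename, recreate, copy, drop sequence can be sketched as a helper that generates the four SQL statements. This is not Tracker’s code: build_drop_column_sql, the _TEMP suffix and the example table and column names are assumptions made for illustration. The generated text is meant to be run inside one transaction, for example with sqlite3_exec(). Much later SQLite releases (3.35.0 and up) did add ALTER TABLE … DROP COLUMN, which makes this workaround unnecessary there.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the four statements that emulate "DROP COLUMN" into buf.
 *   table     : the table that loses a column
 *   col_defs  : definitions of the columns that are kept
 *   col_names : comma-separated names of the columns that are kept
 * The strings are assumed to come from the schema (trusted input);
 * no quoting or escaping is done on them here.
 * Returns 0 on success, -1 if buf is too small.
 */
static int
build_drop_column_sql (char *buf, size_t size, const char *table,
                       const char *col_defs, const char *col_names)
{
    int written = snprintf (buf, size,
        /* 1. move the original table out of the way */
        "ALTER TABLE \"%s\" RENAME TO \"%s_TEMP\";\n"
        /* 2. recreate it without the dropped column */
        "CREATE TABLE \"%s\" (%s);\n"
        /* 3. copy the data of the remaining columns */
        "INSERT INTO \"%s\" (%s) SELECT %s FROM \"%s_TEMP\";\n"
        /* 4. drop the renamed original */
        "DROP TABLE \"%s_TEMP\";\n",
        table, table,
        table, col_defs,
        table, col_names, col_names, table,
        table);

    return (written < 0 || (size_t) written >= size) ? -1 : 0;
}
```

Note that indexes on the original table are not carried over by this sequence; any index that should survive has to be created again on the new table.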

I finally got it working. Now it must of course be tested, and then tested again.

Next for the feature is adapting the SPARQL engine to start using the indexed mirror column and produce better performing SQL queries.

Categories: english
Philip Van Hoof

Working on domain specific indexes

2010-07-01 15:01 UTC  by  Philip Van Hoof

So … what is involved in a “simple change” like what I wrote about yesterday?

First you add support for annotating the domain specific index in the ontology files. This is straightforward, as we of course have a generic Turtle parser; it’s just a matter of adding properties to certain classes and filling in the values from the ontology in the instances of our in-memory representation of the ontology. You of course also need to change the CREATE TABLE statements. Trivial.

Then you implement detecting changes in the ontology and, more complex, coping with those changes. This means doing ALTER on the SQL tables. When a domain specific index is added during an ontology change, you also need to copy the values from the InformationElement table to the MusicPiece table (I’m using MusicPiece to clarify; it’s of course generic) and put an implicit index on the column. After all, that index is why we’re doing this.

I finished those two yesterday. I have not finished detecting the deletion of a domain specific index yet. That will have to ALTER the table with a DROP of the column. The most difficult part here is detecting the deletion itself. We don’t yet have any code to diff multi-value properties in the ontology (the ontology is a collection of RDF statements like everything else, describing itself).

Today I finished writing the code that copies values to the MusicPiece table’s mirror column.

The next few days will be about adapting the SPARQL engine and of course coping with the deletion of a domain specific index. And then testing, and testing again. Mind that this has to work in a journal replay situation too, in which case no ontology is involved (it’s all stored in the history of the persistent journal).

Where’s my Redbull? Ah, waiting for me in the fridge. Good!

Categories: english
Philip Van Hoof

Domain specific indexes

2010-06-30 13:51 UTC  by  Philip Van Hoof

We store our data in a decomposed way. For single value properties we create a table per class and have a column per property. Multi value properties go in a separate table. For now I’ll focus on those single value properties.

Imagine you have a MusicPiece. In Nepomuk that’s a subclass of InformationElement. InformationElement adds properties like title and subject. MusicPiece has performer, which is a Contact, and duration, an integer. A Contact has a fullname.

Alright, that looks like this in our internal storage.

Querying that in SPARQL goes like this. I’ll add the Nepomuk prefixes.

SELECT ?musicpiece ?title ?subject ?performer {
   ?musicpiece a nmm:MusicPiece ;
               nmm:performer ?p ;
               nie:title ?title ;
               nie:subject ?subject .
   ?p nco:fullname ?performer .
} ORDER BY ?title

A problem with ORDER BY on the title field is that Tracker needs to make a join with, and a full table scan of, the InformationElement table.

So we’re working on what we’ll call domain specific indexes. It means that for certain properties we’ll have a redundant mirror column in the subclass’s table, and we’ll place the index on that column. The native SQL query will be generated to use that mirror column instead. A good example is nie:title for nmm:MusicPiece.

ps. A normal triple store has instead a huge table with just three columns: subject, predicate and object. That wouldn’t help you much with optimizing of course.

Categories: Informatics and programming
Philip Van Hoof

IPC performance, the report

2010-05-13 19:58 UTC  by  Philip Van Hoof

The Tracker team will be doing a codecamp this month. Among the subjects we will address is the IPC overhead of tracker-store, our RDF query service.

Categories: Informatics and programming
Philip Van Hoof

The crawler’s modification time queries

Yesterday we optimized the crawler’s query that gets the modification time of files. We use this timestamp to know whether or not a file must be reindexed.

Originally, we used a custom SQLite function called tracker:uri-is-parent() in SPARQL. This, however, caused a full table scan. As long as your SQL table for nfo:FileDataObjects wasn’t too large, that wasn’t a huge problem. But it didn’t scale linearly. I started by optimizing the function itself. It was using strlen(), so I replaced that with sqlite3_value_bytes(). We only store UTF-8, so that worked fine. It gained me ~ 10%; not enough.

So this commit was a better improvement. First it makes nfo:belongsToContainer an indexed property. For file resources, x nfo:belongsToContainer p means that x is in directory p. The commit then changes the query to use the property that is now indexed.

The original query, before we started this optimization, took 1.090s with ~ 300,000 nfo:FileDataObject resources. The new query takes about 0.090s. It’s of course an unfair comparison, because now we use an indexed property. Adding the index only took a total of 10s for a table of ~ 300,000 rows, and the table is being queried while we index (while we insert into it). Do the math, it’s a huge win in all situations. For the SQLite freaks: the SQLite database grew by 4 MB, with all items in the table indexed.

PDF extractor

Another optimization I did earlier was the PDF extractor. Originally, we used the poppler-glib library. This library doesn’t allow us to set the OutputDev at runtime. If compiled with Cairo, the OutputDev is in some versions a CairoOutputDev. We don’t want all images in the PDF to be rendered to a Cairo surface. So I ported this back to C++ and made it always use a TextOutputDev instead. In poppler-glib master this appears to have improved (in git master poppler_page_get_text_page is always using a TextOutputDev).

Another major problem with poppler-glib is the huge amount of string copying on the heap. The time to extract metadata and content text for a 70 page PDF document without any images went from 1.050s to 0.550s. A lot of the original overhead was caused by copying strings and by GValue boxing due to GObject properties.

Table locked problem

Last week I improved D-Bus marshaling by using a database cursor. I forgot to handle SQLITE_LOCKED while Jürg and Carlos had been introducing multithreaded SELECT support. Not good. I fixed this; it was causing random Table locked errors.

Categories: Informatics and programming
Philip Van Hoof

RDF propaganda, time for change

2010-04-27 21:06 UTC  by  Philip Van Hoof

I’m not supposed to but I’m proud. It’s not only me who’s doing it.

Adrien is one of the new guys on the block. He’s working on integrating Tracker’s RDF service with various web services like Flickr, Facebook, Twitter, picasaweb and RSS. This is the kind of guy several companies should be afraid of. His work is competing with what they are trying to do: integrating the social web with mobile.

Oh come on Steve, stop pretending that you aren’t. And you better come up with something good, because we are.

Not only that, Adrien is implementing so-called writeback. It means that when you change a local resource’s properties, this integration will update Flickr, Facebook, picasaweb and Twitter.

You change a piece of info about a photo on your phone, and it’ll be replicated to Flickr. It’ll also be synchronized onto your phone as soon as somebody else makes a change.

This is the future of computing and information technology. Integration with social networking and the phone is what people want. Dear Mark, it’s unstoppable. You better keep your eyes open, because we are going fast. Faster than your business.

I’m not somebody trying to guess how technology will look in a few years. I try to be in the middle of the technical challenge of actually doing it. Talking about it is telling history before your lip’s muscles moved.

At the Tracker project we are building a SPARQL endpoint that uses D-Bus as IPC. This is ideal on Nokia’s MeeGo. It’ll be a centerpiece for information gathering. On MeeGo you won’t ask the filesystem; instead you’ll ask Tracker, using SPARQL and RDF.

To be challenged is likely the most beautiful state of mind.

I invite everybody to watch this demo by Adrien. It’s just the beginning. It’s going to get better.

Tracker writeback & web service integration demo / MeegoTouch UI from Adrien Bustany on Vimeo.

I tagged this as ‘extremely controversial’. That’s fine, Adrien told me that “people are used to me anyway”.

Categories: Informatics and programming
Philip Van Hoof

Before

For returning the results of a SPARQL SELECT query we used to have a callback like this. I removed the error handling; you can find the original here.

We need to marshal a database result_set to a GPtrArray because dbus-glib fancies that. This is a lot of boxing the strings into GValue and GStrv. It does allocations, so not good.

static void
query_callback(TrackerDBResultSet *result_set,GError *error,gpointer user_data)
{
  TrackerDBusMethodInfo *info = user_data;
  GPtrArray *values = tracker_dbus_query_result_to_ptr_array (result_set);
  dbus_g_method_return (info->context, values);
  tracker_dbus_results_ptr_array_free (&values);
}

void
tracker_resources_sparql_query (TrackerResources *self, const gchar *query,
                                DBusGMethodInvocation *context, GError **error)
{
  TrackerDBusMethodInfo *info = ...; guint request_id;
  TrackerResourcesPrivate *priv= ...; gchar *sender;
  info->context = context;
  tracker_store_sparql_query (query, TRACKER_STORE_PRIORITY_HIGH,
                              query_callback, ...,
                              info, destroy_method_info);
}

After

Last week I changed the asynchronous callback to return a database cursor. In SQLite that means an sqlite3_step(). SQLite returns const pointers to the data in the cell with its sqlite3_column_* APIs.

This means that now we’re not even copying the strings out of SQLite. Instead, we’re using them as const to fill in a raw DBusMessage:

static void
query_callback(TrackerDBCursor *cursor,GError *error,gpointer user_data)
{
  TrackerDBusMethodInfo *info = user_data;
  DBusMessage *reply; DBusMessageIter iter, rows_iter;
  guint cols; guint length = 0;
  reply = dbus_g_method_get_reply (info->context);
  dbus_message_iter_init_append (reply, &iter);
  cols = tracker_db_cursor_get_n_columns (cursor);
  dbus_message_iter_open_container (&iter, DBUS_TYPE_ARRAY,
                                    "as", &rows_iter);
  while (tracker_db_cursor_iter_next (cursor, NULL)) {
    DBusMessageIter cols_iter; guint i;
    dbus_message_iter_open_container (&rows_iter, DBUS_TYPE_ARRAY,
                                      "s", &cols_iter);
    for (i = 0; i < cols; i++, length++) {
      const gchar *result_str = tracker_db_cursor_get_string (cursor, i);
      dbus_message_iter_append_basic (&cols_iter,
                                      DBUS_TYPE_STRING,
                                      &result_str);
    }
    dbus_message_iter_close_container (&rows_iter, &cols_iter);
  }
  dbus_message_iter_close_container (&iter, &rows_iter);
  dbus_g_method_send_reply (info->context, reply);
}

Results

The test is a query on 13500 resources where we ask for two strings, repeated eleven times. I removed the first run from each round, because the first time the sqlite3_stmt still has to be created, which would add a few milliseconds to the measurement. I also redirected standard output to /dev/null to avoid the overhead created by the terminal. The results you see below are the values for “real”.

There is of course an overhead created by the “tracker-sparql” program. It does demarshaling using normal dbus-glib. If your application uses DBusMessage directly, then it can avoid that overhead. But since I used the same “tracker-sparql” for both rounds, it doesn’t matter for the measurement.

$ time tracker-sparql -q "SELECT ?u  ?m { ?u a rdfs:Resource ;
          tracker:modified ?m }" > /dev/null

Without the optimization:

0.361s, 0.399s, 0.327s, 0.355s, 0.340s, 0.377s, 0.346s, 0.380s, 0.381s, 0.393s, 0.345s

With the optimization:

0.279s, 0.271s, 0.305s, 0.296s, 0.295s, 0.294s, 0.295s, 0.244s, 0.289s, 0.237s, 0.307s

The improvement ranges between 7% and 40%, with an average improvement of 22%.
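The average can be checked from the two lists of timings above. A small sketch, assuming that “average improvement” means the relative reduction of the mean wall-clock time, (before - after) / before; the function and array names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* The "real" timings in seconds, copied from the two lists above */
static const double without_opt[] = {
    0.361, 0.399, 0.327, 0.355, 0.340, 0.377,
    0.346, 0.380, 0.381, 0.393, 0.345
};
static const double with_opt[] = {
    0.279, 0.271, 0.305, 0.296, 0.295, 0.294,
    0.295, 0.244, 0.289, 0.237, 0.307
};

/* Arithmetic mean of n values (n must be > 0) */
static double
mean (const double *values, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += values[i];
    return sum / (double) n;
}

/* Relative reduction of the mean: (before - after) / before */
static double
mean_improvement (const double *before, size_t n_before,
                  const double *after, size_t n_after)
{
    double mean_before = mean (before, n_before);

    return (mean_before - mean (after, n_after)) / mean_before;
}
```

With these numbers the means are roughly 0.364s and 0.283s, so mean_improvement() comes out at a little over 0.22, which agrees with the 22% quoted above.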

Categories: Informatics and programming
Philip Van Hoof

Focus on query performance

2010-04-12 23:57 UTC  by  Philip Van Hoof

Every (good) developer knows that copying of memory and boxing, especially when dealing with a large amount of pieces like members of collections and the cells in a table, are a bad thing for your performance.

More experienced developers also know that novice developers tend to focus on just their algorithms to improve performance, while often the single biggest bottleneck is needless boxing and allocating. Experienced developers come up with algorithms that avoid boxing and copying; they master clever, pragmatic engineering and know how to improve algorithms. A lot of newcomers use virtual machines and scripting languages that are terrible at giving you the tools to control this, and then they start endless religious debates about how great their programming language is (as if it matters). (Anti-.NET people, don’t get on your horses too soon: if you know what you are doing, C# is actually quite good here.)

We were of course doing some silly copying ourselves. Apparently it had a significant impact on performance.

Once Jürg and Carlos have finished the work on parallelizing SELECT queries we plan to let the code that walks the SQLite statement fill in the DBusMessage directly without any memory copying or boxing (for marshalling to DBus). We found the get_reply and send_reply functions; they sound useful for this purpose.

I still don’t really like DBus as IPC for the data transfer of the query results of Tracker’s RDF store. Personally I think I would go for a custom Unix socket here. But Jürg so far isn’t convinced. Admittedly he’s probably right; he’s always right. Still, DBus to me doesn’t feel like a good IPC for this data transfer.

We know about the requests to have direct access to the SQLite database from your own process. I explained in the bug that SQLite3 isn’t MVCC and that this means that your process will often get blocked for a long time on our transaction. A longer time than any IPC overhead takes.

Categories: Informatics and programming