Planet Maemo: Philip Van Hoof

Philip Van Hoof

Focus on query performance

2010-04-12 23:57 UTC  by  Philip Van Hoof

Every (good) developer knows that copying memory and boxing are bad for your performance, especially when you are dealing with a large number of pieces, like the members of collections or the cells of a table.

More experienced developers also know that novice developers tend to focus only on their algorithms to improve performance, while the single biggest bottleneck is often needless boxing and allocating. Experienced developers come up with algorithms that avoid boxing and copying; they master clever, pragmatic engineering and know how to improve algorithms. A lot of newcomers use virtual machines and scripting languages that are terrible at giving you the tools to control this, and then they start endless religious debates about how great their programming language is (as if it matters). (Anti-.NET people, don’t get on your horses too soon: if you know what you are doing, C# is actually quite good here.)

We were of course doing some silly copying ourselves. Apparently it had a significant impact on performance.

Once Jürg and Carlos have finished the work on parallelizing SELECT queries we plan to let the code that walks the SQLite statement fill in the DBusMessage directly without any memory copying or boxing (for marshalling to DBus). We found the get_reply and send_reply functions; they sound useful for this purpose.
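
To illustrate the idea, here is a minimal sketch of what such a marshalling loop could look like with libdbus. This is not Tracker’s actual code: the helper name build_reply is hypothetical, and it assumes the result set only contains text columns.

#include <dbus/dbus.h>
#include <sqlite3.h>

/* Hypothetical helper: walk an already prepared SQLite statement and
 * append every row straight into the DBus reply, so the values are not
 * boxed into an intermediate collection first. */
static DBusMessage *
build_reply (DBusMessage *method_call, sqlite3_stmt *stmt)
{
  DBusMessage *reply;
  DBusMessageIter iter, rows, cols;
  int i, n_cols;

  reply = dbus_message_new_method_return (method_call);
  dbus_message_iter_init_append (reply, &iter);

  /* The result set is marshalled as an array of arrays of strings: "aas" */
  dbus_message_iter_open_container (&iter, DBUS_TYPE_ARRAY, "as", &rows);

  n_cols = sqlite3_column_count (stmt);

  while (sqlite3_step (stmt) == SQLITE_ROW) {
    dbus_message_iter_open_container (&rows, DBUS_TYPE_ARRAY, "s", &cols);

    for (i = 0; i < n_cols; i++) {
      /* libdbus copies the string into the message buffer here; there is
       * no extra boxing step in between */
      const char *value = (const char *) sqlite3_column_text (stmt, i);

      if (value == NULL)
        value = "";

      dbus_message_iter_append_basic (&cols, DBUS_TYPE_STRING, &value);
    }

    dbus_message_iter_close_container (&rows, &cols);
  }

  dbus_message_iter_close_container (&iter, &rows);

  return reply;
}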

I still don’t really like DBus as IPC for transferring the query results of Tracker’s RDF store. Personally I would go for a custom Unix socket here. But Jürg so far isn’t convinced. Admittedly he’s probably right; he’s always right. Still, DBus doesn’t feel to me like a good IPC for this data transfer.

We know about the requests to have direct access to the SQLite database from your own process. I explained in the bug that SQLite3 isn’t MVCC, and that this means your process will often get blocked for a long time on our transactions: a longer time than any IPC overhead takes.
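
To make that concrete, here is a minimal sketch (not Tracker code; the database path and table name are assumptions) of what direct read-only access would run into while the store is committing a write transaction:

#include <sqlite3.h>
#include <stdio.h>

int
main (void)
{
  sqlite3 *db;
  sqlite3_stmt *stmt;
  int rc;

  /* Assumed location of the store's cache database */
  sqlite3_open ("/home/user/.cache/tracker/meta.db", &db);

  /* Give up after 200 ms instead of spinning forever */
  sqlite3_busy_timeout (db, 200);

  sqlite3_prepare_v2 (db, "SELECT COUNT(*) FROM Resource", -1, &stmt, NULL);

  rc = sqlite3_step (stmt);
  if (rc == SQLITE_BUSY)
    printf ("blocked on the store's transaction\n");
  else if (rc == SQLITE_ROW)
    printf ("%d resources\n", sqlite3_column_int (stmt, 0));

  sqlite3_finalize (stmt);
  sqlite3_close (db);

  return 0;
}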

Categories: Informatics and programming
Philip Van Hoof

Supporting ontology changes in Tracker

2010-04-09 12:03 UTC  by  Philip Van Hoof

It used to be that in Tracker you couldn’t just change the ontology. When you did, you had to reboot the database, which means losing all the non-embedded data: for example your tags, or other such information that is stored only in Tracker’s RDF store.

Categories: Informatics and programming
Philip Van Hoof

Zürichsee

2010-04-03 17:44 UTC  by  Philip Van Hoof

Today after I brought Tinne to the airport I drove around Zürichsee. She can’t stay in Switzerland the entire month; she has to go back to school on Monday.

While driving on the Seestrasse I started counting luxury cars. After I reached two Lamborghinis and three Ferraris I started thinking: Zimmerberg, Sihltal and Pfannenstiel must be expensive districts too. And yes, they are.

I was lucky that the weather was nice today. But wow, what a view of the mountain tops when you look south over Zürichsee. People from Zürich, you guys are so lucky! Such an immense calming feeling that view gives me! For me, it beats sauna. And I’m a real sauna fan.

I’m thinking of checking out the area south of Zürich, but not the canton itself; I think house prices are just exaggeratedly high in the canton of Zürich. I was thinking of Sankt Gallen, Toggenburg. I’ve never been there; I’ll check it out tomorrow.

Hmmr, MeteoSwiss gives rain for tomorrow. Doesn’t matter.

Actually, when I came back from the airport the first thing I did was fix coping with property changes in ontologies for Tracker. Yesterday wasn’t my day, I think: I couldn’t find that damn problem in my code! And in the evening I lost three chess games in a row against Tinne. That’s a really bad score for me. Maybe after two weeks of playing chess almost every evening she has gotten better than me? Hmmrr, that’s a troubling idea.

Anyway, so when I got back from the airport I couldn’t resist beating the code problem that I didn’t find on Friday. I found it! It works!

I guess I’m both a dreamer and a realist programmer. But don’t tell my customers that I’m such a dreamer.

Categories: Art & culture
Philip Van Hoof

Reporting busy status

2010-03-26 14:44 UTC  by  Philip Van Hoof

We’re nearing our first release in a very long time, so I’ll do another technical blog post about Tracker ;)

When the RDF store is replaying its journal at startup, or when it is restoring a backup, it can be in a busy state. This means that we can’t handle your DBus requests during that time; your DBus call will return late.

Because that’s not very nice from a UI perspective (the “uh, what is going on??” syndrome kicks in), we’re adding a signal that emits the progress and status. You can also ask for them using the DBus methods GetProgress and GetStatus.

The miners already had something like this, so I kept the API more or less the same.

signal sender=:1.99 -> dest=(null destination) serial=1454
  path=/org/freedesktop/Tracker1/Status;
  interface=org.freedesktop.Tracker1.Status; member=Progress
   string "Journal replaying"
   double 0.197824
signal sender=:1.99 -> dest=(null destination) serial=1455
  path=/org/freedesktop/Tracker1/Status;
  interface=org.freedesktop.Tracker1.Status; member=Progress
   string "Journal replaying"
   double 0.698153
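
For completeness, here is a minimal client-side sketch (not Tracker code) that polls the same progress value with plain libdbus, assuming GetProgress returns a single double, as the Progress signal above suggests:

#include <dbus/dbus.h>
#include <stdio.h>

int
main (void)
{
  DBusConnection *connection;
  DBusMessage *call, *reply;
  DBusError error;
  double progress = 0.0;

  dbus_error_init (&error);
  connection = dbus_bus_get (DBUS_BUS_SESSION, &error);
  if (connection == NULL) {
    fprintf (stderr, "No session bus: %s\n", error.message);
    dbus_error_free (&error);
    return 1;
  }

  call = dbus_message_new_method_call ("org.freedesktop.Tracker1",
                                       "/org/freedesktop/Tracker1/Status",
                                       "org.freedesktop.Tracker1.Status",
                                       "GetProgress");

  reply = dbus_connection_send_with_reply_and_block (connection, call, -1, &error);
  dbus_message_unref (call);

  if (reply == NULL) {
    fprintf (stderr, "GetProgress failed: %s\n", error.message);
    dbus_error_free (&error);
    return 1;
  }

  dbus_message_get_args (reply, &error,
                         DBUS_TYPE_DOUBLE, &progress,
                         DBUS_TYPE_INVALID);
  dbus_message_unref (reply);

  printf ("Progress: %f\n", progress);

  return 0;
}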

Jürg just reviewed yesterday’s SPARQL regex performance improvement, so that’s now in master. If you want these busy-status notifications today already, you can test with the busy-notifications branch.

Categories: Informatics and programming
Philip Van Hoof

The original SPARQL regex support in Tracker used a custom SQLite function. But of course, back when we wrote it we didn’t yet think much about optimizing. As a result, we were using g_regex_match_simple, which recompiles the regular expression on each call.

Today Jürg and I found out about sqlite3_get_auxdata and sqlite3_set_auxdata, which allow us to cache the compiled regex for a specific custom SQLite function for the duration of the query.

This is much better:

#include <sqlite3.h>
#include <glib.h>

static void
function_sparql_regex (sqlite3_context *context,
                       int              argc,
                       sqlite3_value   *argv[])
{
  gboolean ret;
  const gchar *text, *pattern, *flags;
  GRegexCompileFlags regex_flags;
  GRegex *regex;

  if (argc != 3) {
    sqlite3_result_error (context, "Invalid argument count", -1);
    return;
  }

  /* Reuse the GRegex compiled by an earlier invocation in this query */
  regex = sqlite3_get_auxdata (context, 1);
  text = (const gchar *) sqlite3_value_text (argv[0]);
  flags = (const gchar *) sqlite3_value_text (argv[2]);

  if (regex == NULL) {
    gchar *err_str;
    GError *error = NULL;

    pattern = (const gchar *) sqlite3_value_text (argv[1]);
    regex_flags = 0;
    while (*flags) {
      switch (*flags) {
      case 's': regex_flags |= G_REGEX_DOTALL; break;
      case 'm': regex_flags |= G_REGEX_MULTILINE; break;
      case 'i': regex_flags |= G_REGEX_CASELESS; break;
      case 'x': regex_flags |= G_REGEX_EXTENDED; break;
      default:
        err_str = g_strdup_printf ("Invalid SPARQL regex flag '%c'", *flags);
        sqlite3_result_error (context, err_str, -1);
        g_free (err_str);
        return;
      }
      flags++;
    }

    regex = g_regex_new (pattern, regex_flags, 0, &error);
    if (error) {
      sqlite3_result_error (context, error->message, -1);
      g_clear_error (&error);
      return;
    }

    /* Cache the compiled regex for the rest of the query; SQLite calls
     * g_regex_unref when it is done with it */
    sqlite3_set_auxdata (context, 1, regex, (void (*) (void*)) g_regex_unref);
  }

  ret = g_regex_match (regex, text, 0, NULL);
  sqlite3_result_int (context, ret);
}

Before (this was a test on a huge amount of resources):

$ time tracker-sparql -q "select ?u { ?u a rdfs:Resource . FILTER (regex(?u, '^titl', 'i')) }"
real	0m3.337s
user	0m0.004s
sys	0m0.008s

After:

$ time tracker-sparql -q "select ?u { ?u a rdfs:Resource . FILTER (regex(?u, '^titl', 'i')) }"
real	0m1.887s
user	0m0.008s
sys	0m0.008s

This will hit Tracker’s master today or tomorrow.

Categories: Informatics and programming
Philip Van Hoof

Working hard at the Tracker project

2010-03-17 17:41 UTC  by  Philip Van Hoof

Today we improved journal replaying from 1050s to 58s for my test of 25,249 resources.

Journal replaying happens when your cache database gets corrupted, and also when you restore a backup: restore uses the same code as journal replaying, since a backup is just a copy of your journal.

During the performance improvements we of course found other areas related to data entry that could be improved. It looks like we’re entering a period of focus on performance, as we already have a few interesting ideas for next week. Those ideas will focus on the performance of some SPARQL functions, like regex.

Meanwhile, Michele Tameni and Roberto Guido are working on an RSS miner for Tracker, and Adrien Bustany has been working on other web miners, for example for Flickr, GData, Twitter and Facebook.

I think the first pieces of the RSS and other web miners will start becoming available in this week’s unstable 0.7 release. Martyn is still reviewing their branches, but we’re very lucky to have such good software developers as contributors. Very nice work, Michele, Roberto and Adrien!

Categories: Informatics and programming
Philip Van Hoof

Tinymail 1.0!

2010-03-05 17:35 UTC  by  Philip Van Hoof

Tinymail’s co-maintainer Sergio Villar just made Tinymail’s first release.

Psst, I have inside information, which I might not be allowed to share, that 1.2 is already being prepared. It will have bodystructure and envelope summary fetching, and it’ll fetch E-mail body content per requested MIME part instead of always fetching entire E-mails. Whoohoo!

Categories: Informatics and programming
Philip Van Hoof

An ode to our testers

2010-03-02 13:49 UTC  by  Philip Van Hoof

You know about those guys who use your software against huge datasets, like their entire filesystem with thousands of files?

We do. His name is Tshepang Lekhonkhobe and we owe him a few beers for reporting to us many scalability issues.

Today we found and fixed such a scalability issue: the update query that resets the availability of file resources (this is part of the support for removable media) was causing at least a linear increase in VmRSS usage with the number of file resources. For Tshepang’s situation that meant 600 MB of VmRSS. Jürg reduced this to 30 MB of peak VmRSS in the same use case, and turned minutes of work into two or three seconds. And without memory fragmentation: glibc returns almost all of the VmRSS back to the kernel.

Thursday is our usual release day. I invite all of the 0.7 pioneers to test us with your huge filesystems, just like Tshepang always does.

So long and thanks for all the testing, Tshepang! I’m glad we finally found it.

Categories: Informatics and programming
Philip Van Hoof

Invisible costs

2010-03-01 17:49 UTC  by  Philip Van Hoof


We would rather suffer the visible costs of a few bad decisions than incur the many invisible costs that come from decisions made too slowly - or not at all - because of a stifling bureaucracy.

Letter by Warren E. Buffett to the shareholders of Berkshire, February 26, 2010

Categories: Art & culture
Philip Van Hoof

Working hard

2010-02-18 22:09 UTC  by  Philip Van Hoof

I don’t decide about Tracker’s release. The team of course does.

But when you look at our roadmap you notice one remaining ‘big feature’: coping with modest ontology changes.

Right now, if we made even a small ontology change, all of our users would have to recreate their databases. We also don’t support restoring a backup of your metadata over a modified ontology.

This is about to change. This week I started working in a branch on supporting class and property ontology additions.

I finished it today and it appears to be working. The patches obviously need a thorough review by the other team members, and then testing of course. I invite all the contributors and people who have been testing Tracker 0.7’s releases to try out the branch. It only supports additions, so don’t try to change or remove properties or classes; you can only add new ones. You might have noticed the nao:deprecated property in the ontology files? That’s what we do with deleted properties.

Anyway

Meanwhile, Martyn and Carlos are working on a bugfix in the miner for duplicate entries for file resources, and on a timeout for the extractor so that extraction of large or complicated documents doesn’t block the entire filesystem miner.

Jürg is working on timezone storage for xsd:dateTime fields, and over the last few days he implemented limited support for named graphs.

By the looks of it, one would almost believe that Tracker’s first new stable release is almost ready!

Categories: Informatics and programming
Philip Van Hoof

This (super) cool .NET developer and good friend came to me at the FOSDEM bar to tell me he was confused about why during the Tracker presentation I was asking people to replace F-Spot and Banshee.

I hope I didn’t say it like that; I would never intend to say that. But I’ll review the video of the presentation as soon as Rob publishes it.

Anyway, to ensure everybody understood correctly what I did want to say (whether or not I actually said it that way is another question):

The call was to inspire people to reimplement, or to provide different implementations of, F-Spot’s and Banshee’s data backends, so that they would use an RDF store like tracker-store instead of each app having its own metadata database.
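
Purely as an illustration of that idea (this is not code from any of these apps, and the SPARQL and ontology terms are only meant as an example), a music application could fetch titles and artists from tracker-store over DBus roughly like this:

#include <dbus/dbus.h>
#include <stdio.h>

int
main (void)
{
  const char *query =
    "SELECT ?title ?artist WHERE { "
    "  ?song a nmm:MusicPiece ; "
    "        nie:title ?title ; "
    "        nmm:performer [ nmm:artistName ?artist ] "
    "}";
  DBusConnection *connection;
  DBusMessage *call, *reply;
  DBusMessageIter iter, rows, cols;
  DBusError error;

  dbus_error_init (&error);
  connection = dbus_bus_get (DBUS_BUS_SESSION, &error);
  if (connection == NULL) {
    fprintf (stderr, "No session bus: %s\n", error.message);
    dbus_error_free (&error);
    return 1;
  }

  call = dbus_message_new_method_call ("org.freedesktop.Tracker1",
                                       "/org/freedesktop/Tracker1/Resources",
                                       "org.freedesktop.Tracker1.Resources",
                                       "SparqlQuery");
  dbus_message_append_args (call, DBUS_TYPE_STRING, &query, DBUS_TYPE_INVALID);

  reply = dbus_connection_send_with_reply_and_block (connection, call, -1, &error);
  dbus_message_unref (call);

  if (reply == NULL) {
    fprintf (stderr, "SparqlQuery failed: %s\n", error.message);
    dbus_error_free (&error);
    return 1;
  }

  /* The result comes back as an array of rows, each row an array of strings */
  dbus_message_iter_init (reply, &iter);
  dbus_message_iter_recurse (&iter, &rows);
  while (dbus_message_iter_get_arg_type (&rows) == DBUS_TYPE_ARRAY) {
    const char *title = NULL, *artist = NULL;

    dbus_message_iter_recurse (&rows, &cols);
    if (dbus_message_iter_get_arg_type (&cols) == DBUS_TYPE_STRING)
      dbus_message_iter_get_basic (&cols, &title);
    if (dbus_message_iter_next (&cols))
      dbus_message_iter_get_basic (&cols, &artist);

    printf ("%s - %s\n", artist, title);
    dbus_message_iter_next (&rows);
  }

  dbus_message_unref (reply);

  return 0;
}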

I think I also mentioned Rhythmbox in the same sentence, because the last thing I would want is to turn this into a .NET vs. anti-.NET debate. It just happens to be that the best GNOME software for photo and music management is written in .NET (and there is a good reason for that).

People who know me also know that I think those anti-.NET people are disruptive and best ignored. I also actively and willingly ignore them (and they should know this). I’m actually a big fan of the Mono platform.

I’ll try to ensure that I don’t create this confusion during presentations anymore.

Categories: Informatics and programming
Philip Van Hoof

Hmrrr

2010-02-07 11:39 UTC  by  Philip Van Hoof

In line with what I usually do at conferences, I lost my glasses at the GNOME Beer event this year. If somebody found them, and maybe even has them, please let me know. It’s kinda hard to see presentations without them.

Categories: Informatics and programming