Nikolas Askitis

A retired researcher of high performance data structures and algorithms

An example image of a hybrid HAT-trie

Greetings. Thanks for visiting my site.

Just a heads-up, I am no longer a researcher. Nonetheless, over the years, people from around the world including software developers, entrepreneurs, business managers and owners, students, researchers and even work colleagues have reached out for advice on data structures. I didn't expect all of the feedback and interaction. I'm really glad to know that my research has helped a lot of people. On that note, I would like to thank all of you who took the time to read my thesis, and who reached out to me and/or mentioned my papers online. Some examples that were found online are provided below for your reference (and in no particular order).

People have also been asking for my original source code – the one used in my thesis. You can find it at github.com/naskitis. I have the old datasets too, but I'll have to dig them out first. Keep in mind that my code is old – like 20+ years old. It was written back in the days of the Intel Pentium 4. So don't expect it to compile anymore or be tuned for modern processors and operating systems. However, it can still be used as a reference to help you learn. In addition, researchers have cited my papers, so I suggest checking out recent literature – to see if there are any better alternatives.

As a side note: companies have also reached out to me over the years seeking help with their choice and implementation of data structures (some of which were experiencing significant performance bottlenecks). Here is what I've commonly suggested, amongst several other key performance improvements. The array hash table – although fast and compact – was implemented (to a degree) for general purpose use. This, in turn, allowed for a fairer comparison against existing data structures.

... but ... ahem ... if you don't care about being fair or supporting general purpose use, then there's a relatively simple implementation quirk that can help boost the performance of the array hash (and the HAT-trie too). As described in my papers, we store strings in a dynamic array like so: 5hello5there4mate\0. If we want to search for the word "hello", we send in the query token "hello\0" – note the null character; as that's usually how variable-length strings are terminated.

Here is a snippet of the code for searching a null-terminated query token:

 
/* compare the query to the word in the array, a character a time
 * until a mismatch occurs or a null character is encountered 
 */ 

for (; query != '\0' && query == *array; query++, array++); 

/* if every character of the query string was compared and the
 * length of the query matches the length of the string compared,
 * then its a match 
 */ 

if (*query == '\0' && (array-word_start) == len) 

Assuming that we are using the original ASCII table (the one with a character range of 0 to 127), lets choose a character that is unused in our data – and importantly, one that is not expected to be encountered in future datasets. A character that is outside the ASCII range would be nice, say 0xff. Now for the quirk: instead of null-terminating your query token, terminate it using 0xff. Thus, the query token becomes: "hello0xff". Only the query token changes; the array hash and its dynamic array structure/format stay the same. With this in mind, I invite you to study the code snippet shown above. Can you spot the changes that can be made to reduce both computational cost and branch mispredictions, as well to promote better compile-time optimisations?

I won't provide my email address here, but if you like to contact me, then you can find my Gmail somewhere within my GitHub repositories. Don't expect a quick response though, as I'm often busy at work.

Cheers.