A retired researcher of high-performance data structures and algorithms
Greetings. Thanks for visiting my site.
Just a heads-up: I am no longer a researcher. Nonetheless, over the years, many people from around the world (students, software developers, entrepreneurs, and others) have reached out for advice on data structures after reading my papers. I'm really glad to know that my research has helped so many people. On that note, I would like to thank all of you who took the time to read my papers, as well as those who spent time verifying my results and those who wrote and published versions in other programming languages. Some examples are provided below for your reference (in no particular order).
Many have also asked for my original source code – the one used in my thesis. You can find it at github.com/naskitis. I have the old datasets too, but I'll have to dig them out first. Keep in mind that my code is old – 20+ years old. It was written back in the days of the Intel Pentium 4, so it may not compile anymore, especially on macOS (though it probably won't take much to get it working). Nonetheless, the code can still serve as a reference. In addition, researchers have since cited my papers, so I suggest checking out the recent literature for more modern alternatives.
As a side note: companies have also reached out to me for help with their choice and implementation of data structures (some of which were experiencing significant performance bottlenecks). Here is what I've commonly suggested, among several other key performance improvements. The array hash table – although fast and compact – was implemented (to a degree) for general-purpose use. This, in turn, allowed for a fairer comparison against existing data structures, which is important for research publications.
... but ... ahem ... if you don't care about being fair or supporting general-purpose use, then there's a relatively simple implementation quirk that can help boost the performance of the array hash (and the HAT-trie too). As described in my papers, we store strings within a dynamic array like so: 5hello5there4mate\0 – each word prefixed with its length. If we want to search for the word "hello", we send in the query token "hello\0" – note the trailing null character.
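To make the layout concrete, here is a minimal sketch of how such a length-prefixed array might be scanned for a null-terminated query token. The array contents and the helper name `array_contains` are illustrative only – this is not the thesis code, just the idea:

```c
/* A hypothetical length-prefixed array: each word is stored as a
 * one-byte length followed by its characters, and a single '\0'
 * marks the end of the array (example data only). */
static const char word_array[] = "\x05hello\x05there\x04mate";

/* Scan the array for a null-terminated query token, mirroring the
 * per-character comparison described above.
 * Returns 1 if the word is found, 0 otherwise. */
static int array_contains(const char *array, const char *query)
{
    while (*array != '\0') {
        unsigned int len = (unsigned char)*array++;  /* length prefix */
        const char *a = array;
        const char *q = query;

        /* compare a character at a time until a mismatch occurs
         * or the end of the query is reached */
        for (; *q != '\0' && *q == *a; q++, a++);

        /* a match requires the whole query to be consumed and the
         * stored word length to agree */
        if (*q == '\0' && (unsigned int)(a - array) == len)
            return 1;

        array += len;  /* skip ahead to the next word */
    }
    return 0;
}
```

Note the two tests per character in the inner loop – that detail is what the quirk below targets.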
Here is a snippet of the code for searching a null-terminated query token:
/* compare the query to the word in the array, a character at a time
 * until a mismatch occurs or a null character is encountered */
for (; *query != '\0' && *query == *array; query++, array++);

/* if every character of the query string was compared and the
 * length of the query matches the length of the string compared,
 * then it's a match */
if (*query == '\0' && (array - word_start) == len)
Assuming that we are using the original ASCII table (the one with a character range of 0 to 127), let's choose a character that is unused in our data – and, importantly, one that is not expected to appear in future datasets. A character outside the ASCII range would be nice, say 0xff. Now for the quirk: instead of null-terminating your query token, terminate it with 0xff. Thus, the query token becomes "hello" followed by the byte 0xff. Only the query token changes; the array hash and its dynamic array structure/format stay the same. With this in mind, I invite you to study the code snippet shown above. Can you spot the changes that can be made to reduce both computational cost and branch mispredictions, as well as to promote better compile-time optimisations?
If you would like to contact me, you can find my Gmail address within my GitHub repository; but don't expect a quick response.
Cheers.