All posts by murrayc

About murrayc

Murray Cumming is a software developer who works with Linux, C++, and Java. He runs Openismus GmbH - a small consultancy that works with open source platforms.

C++: std::string_view not so useful when calling C functions

string_view does not own string data

C++17 adds std::string_view, which is a thin view of a character array, holding just a pointer and a length. This makes it easy to provide just one method that can efficiently take either a const char*, or a std::string, without unnecessary copying of the underlying array. For instance:

void use_string(std::string_view str);

You can then call that function like so:

use_string("abc");

or

std::string str("abc");
use_string(str);

This involves no deep copying of the character array until the function’s implementation needs to do that. Most obviously, it involves no copying when you are just passing a string literal to the function. For instance it doesn’t create a temporary std::string just to call the function, as would be necessary if the function took std::string.

string_view knows nothing of null-termination

However, though the string literal (“abc”) is null-terminated, and the std::string is almost-certainly  null-terminated (but implementation defined), our use_string() function cannot know for sure that the underlying array is null terminated. It could have been called liked so:

const char* str = "abc"; // null-terminated
use_string(std::string_view(str, 2));  // not 3.

or even like so:

const char str[] = {'a', 'b', 'c'}; //not null-terminated
use_string(std::string_view(str, 3));

or as a part of a much larger string that we are parsing.

Unlike std::string, there is no std::string_view::c_str() which will give you a null-terminated character array. There is std::string_view::data() but, like std::string::data(), that doesn’t guarantee the the character array will be null-terminated. (update: since C++11, std::string::data() is guaranteed to be null-terminated, but std::string_view::data() in C++17 is not.)

So if you call a typical C function, such as gtk_label_set_text(), you have to construct a temporary std::string, like so:

void use_string(std::string_view str) {
  gtk_label_set_text(label, std::string(str).c_str());
}

But that creates a copy of the array inside the std::string, even if that wasn’t really necessary. std::string_view has no way to know if the original array is null-terminated, so it can’t copy only when necessary.

This is understandable, and certainly useful for pure C++ code bases, or when using C APIs that deal with lengths instead of just null termination. I do like that it’s in the standard library now. But it’s a little disappointing in my real world of integrating with typical C APIs, for instance when implementing gtkmm.

Implementation affecting the interface

Of course, any C function that is expected to take a large string would have a version that takes a length. For instance, gtk_text_buffer_set_text(), so we can (and gtkmm will) use std::string_view as the parameter for any C++ function that uses that C function. But it’s a shame that we can’t have a uniform API that uses the same type for all string parameters. I don’t like when the implementation details dictate the types used in our API.

There is probably no significant performance issue for small strings, even when using the temporary std::string() technique, but it wouldn’t be nice to make the typical cases worse, even just theoretically.

In gtkmm, we could create our own string_view type, which is aware of null termination, but we are trying to be as standard as possible so our API is as obvious as possible.

C++ Implementations of “Engineering the Compiler” Pseudocode

Implementing is understanding

I’m gradually reading through the Engineering a Compiler book. I’m enjoying it but after a while I started wondering if I really understood the algorithms and I wanted to make sure of that before going further. So I implemented the algorithms in C++ (C++17, because it’s nice), testing them against the example inputs and grammars from the book. For instance, here is my code to construct, and use, the action and goto tables which define a DFA for a bottom-up table-driven LR(1) parser, along with some test code to use it with a simple expression grammar.

So far I’ve done this for Chapter 2 (Scanners) and Chapter 3 (Parsers) and I’ll try to keep going. It is a useful exercise for me, and maybe it’s useful to someone else reading the book.  Please note that the code will probably be meaningless to you if you haven’t read the book or something similar. On the other hand, the code will probably seem childlike if you’ve studied compilers properly.

Trying to get  the code to work often showed me that I had had only the illusion of fully understanding. Much of the pseudocode in the book is not very clear, even when you adjust your mind to its vaguely mathematical syntax, and to know what it really means you have to closely read the text descriptions, sometimes finding clues several pages back. For instance it’s not always clear what is meant by a particular symbol, and I found at least one conditional check that appeared in the description but not in the code. I would much rather see descriptions inline in the code.

Pseudocode is vague

I’m not a fan of pseudocode either, though I understand the difficulty in choosing a real programming language for examples. It means putting some readers off. But there could at least be some real code in an appendix. The book has some detailed walkthroughs that show that the authors must have implemented the algorithms somehow, so it’s a shame that I couldn’t just look at that code. I wonder what real programming language might be the most compact for manipulating sets of states when dealing with finite state automata states and grammar symbols.

For the parsers chapter this wasn’t helped by the, presumably traditional, ambiguity of the “FIRST” sets terminology. FIRST(symbol), FIRST(symbols), FIRST(production), and FIRST+(production) are all things.

My code isn’t meant to be efficient. For instance, my LR(1) parser table-building code has some awfully inefficient use of std::set as a key in a std::map. The code is already lengthier than I’d like, so I would prefer not to make it more efficient at the cost of readability. I’ve also probably overdone it with the half-constexpr and vaguely-concepty-generic Grammar classes. But I am tempted by the idea of making these fully constexpr with some kind of constexpr map, and then one day having the compiler build the generated (DFA) tables at compile time, so I wanted to explore that just a little.

But the book is good

Despite my complaints about the presentation of the algorithms, so far I’d still recommend the book. I feel like it’s getting the ideas into my head , it’s not really that bad, and as far as I know there’s nothing better. Of course, given that I haven’t read the alternatives, my recommendation shouldn’t mean much.

Also, these days I always write up my notes as a bigoquiz.com quiz, so here are my quizzable compiler notes so far.

Boost Graph Library: modernization

Modern C++ (C++11 and later) can greatly simplify generic templated C++ code. I’ve recently been playing around with modernizing the Boost Graph Library code. The experience is similar to how I modernized the libsigc++ code, though I have not gone nearly into that much depth yet. The BGL is currently a big messy jumble of code that isn’t getting much love, and modernizing it could start to let its accumulated wisdom shine through, while also freeing it of other boost dependencies.

Please note that this is just an experiment in my own fork that even I am not pursuing particularly seriously. These changes are not likely to ever be accepted into the BGL because it would mean requiring modern C++ compilers. Personally, I think that’s the only way to move forward. I also think that Boost’s monolithic release process, and lack of a real versioning strategy, holds its libraries back from evolving. At the least, I think only a small set of generally-useful libraries should be released together, and I think that set should drop anything that’s now been moved into the standard library. BGL seems to have been stagnant for the last decade, so there doesn’t seem to be much to lose.

I’ve modernized the example code (and also tried to remove the “using namespace boost” lines), and done some work to modernize the boost graph code itself.

At the least, liberal use of auto can let you delete lots of ugly type declarations that really exist just to make things compile rather than to provide useful clues to the reader. For instance, auto makes the example code less cluttered with magic incantations. Range-based for loops simplify more code – for instance, in the examples.The simpler code is then easier to understand, letting you see the useful work underneath the boiler plate.

I’ve jumped right into C++17 and used structured bindings (and in the examples) because they are particularly helpful with the BGL API, which has many methods that return the begin and end of a range inside a std::pair. This lets us avoid some more type declarations. However, in the examples, I used a make_range_pair() utility function in many places instead, so I could use a simple range-based for. I guess that the range library would provide a type to use instead, maybe as part of the standard library one day.

I also replaced most uses of the boost type traits (some from boost::mpl) with the std:: equivalents. For instance, std::is_same instead of boost::is_same. It should now be possible to remove the boost::mpl dependency completely with a little work.

I’ve also tried putting all of BGL in the boost::graph namespace, instead of just in the boost namespace, but the API currently expects application code to use “using namespace boost”, to make its generic API work, and this makes that even more obvious.

As a next step, I guess that the boost graph implementation code could be simplified much more by use of decltype(auto). As I was modernizing libsigc++, I sometimes found templates that were used only by specializations that were used only by typedefs that were no longer needed, because they could be replaced by decltype(auto) return types. You have to pull at the threads one by one.

gtkmm 4 progress

gtkmm 4 is alive

We (mostly Kjell Ahlstedt and myself) have been quietly working away on gtkmm 4 and an associated ABI-breaking version of glibmm. We’ve been tracking GTK+ 4 from git master, making sure that gtkmm builds against it, and making various API-breaking or ABI-breaking changes that had been left in the code as TODO comments waiting for a chance like this.

This includes simple ABI-breaking (but not API-breaking) stuff such as adding a base classes to widget esor changing the type sof a method parameters. Many other changes are about updating the code to be more modern C++, trying to do things the right way and avoiding duplication with new API in the C++ Standard library.

These changes pleases us as purists but, honestly, as an application developer it isn’t going to give you interesting new behaviour in your applications. If you too are enthusiastic about the C++ renaissance, porting to the new APIs might feel rewarding and correct, and you’ll have to do it someday anyway to get whatever new features arrive later, but I cannot claim there is a more compelling reason to do the work. You might get shiny new features via GTK+ 4, but so far it feels like a similar exercise in internal cleanup and removal of unloved API.

There are still people stuck on GTK+ 2, because most people only port code when they need to. I don’t think the transition to GTK+ 4 or gtkmm 4 will be any faster. Certainly not until the GTK+ 4 porting advice gets a lots better. Porting gtkmm and glom to GTK+ 4 has not been fun, particularly as the trend for API changes without explanation has only increased.

Anyway, here are some of the more significant changes so far in glibmm-2.54 (a horrible ABI name, but there are reasons) and gtkmm-4.0, both still thoroughly unstable and subject to API change:

(Lots of this is currently on in git master but will be in tarball releases soonish, when there are glib and gtk+ releases we can depend on.)

Deprecated API is gone

Anything that was deprecated in gtkmm 3 (or glibmm-2.4) is now completely removed from gtkmm 4 (or glibmm-2.54). You’ll want to build your application with gtkmm 3 with all deprecated API disabled before attempting to build against gtkmm 4.

In some cases this included deprecated base classes that we couldn’t let you optionally disable without breaking ABI, but now they are really really gone.

This is a perfect example of API changes that make us feel better but which are really not of much direct benefit to you as an application developer. It’s for your own good, really.

Glib::RefPtr is now std::shared_ptr

In gtkmm 3, Glib::RefPtr<> is a reference-counting smart pointer, mostly used to hide the manual GObject reference-counting that’s needed in C. It worked well, but C++11 introduced std::shared_ptr so we saw a chance to delete some code and  make our API even more “standard”. So, Glib::RefPtr is now just an alias for std::shared_ptr and we will probably change our APIs to use std::shared_ptr instead of Glib::RefPtr.

Glib::RefPtr is an intrusive smart pointer, meaning that it needs, and uses, a reference count in the underlying object rather than in the smartpointer itself. But std::shared_ptr is non-intrusive. So we now just take one reference when we instantiate the std::shared_ptr and release that one reference when the last std::shared_ptr is destroyed, via its Deleter callback. That means we always need to instantiate the std::shared_ptr via Glib::make_refptr_for_instance(), but we hide that inside glibmm/gtkmm code, and application developers would rarely need to do this anyway.

std::shared_ptr<> is a familiar type, so using it in our API should make it easier for the average person to reason about our API. And using it makes us part of the wider ongoing conversation in the C++ community about how to indicate and enforce ownership. For instance, we might start taking Thing* parameters instead of std::shared_ptr<Thing> parameters when we know that the called method will not need to share ownership. I mentioned this idea in an earlier post. However, we often cannot assume much about what the underlying C function really does.

This is a big change. Hopefully it will work.

Now uses libsigc++-3.0 instead of libsigc++-2.0

I rewrote libsigc++ for modern C++, using variadic templates and some other modern C++ techniques. It feels like libsigc++-2.0, but the code is much simpler. Compiler errors might be slightly less cryptic. This requires C++14.

Enums are inside related classes

Enums that were only used with a particular class are now inside that class. For instance, Gio::ApplicationFlags is now Gio::Application::Flags. This makes the API slightly clearer. This didn’t need C++11, but it did need an API break.

Enums are now C++11 enum classes

Enums are now declared as “enum class” (scoped enumerations) instead of “enum” (unscoped enumerations), so they cannot be implicitly converted to other types such as bool or int. That lets the compiler find some subtle programmer errors.

The enum values must now be prefixed by the enum name rather than having a prefix as part of the value name. For instance, we now use Gio::Application::Flags::HANDLES_OPEN instead of Gio::Application::FLAGS_HANDLES_OPEN (actually Gio::APPLICATION_FLAGS_HANDLES_OPEN before we put enums inside classes).

Gtk::TreeView and Gtk::TextView now have real const_iterators

This lets us make the API more const-correct, requiring less arbitrary const_casts<>s in application code.

Removed the old intermediate ListHandle/SListHandle/ArrayHandle/SArrayHandle types

Long ago, the gtkmm API used these intermediate types, wrapping glib types such as GList, GSList, and C arrays, so you didn’t need to choose between using a std::list, std::vector, or other standard container. Since gtkmm 3 we decided that this was more trouble than it was worth, and decided to just uses std::vector everywhere, but it’s only now that we’ve been able to remove them from the glibmm, pangomm, and atkmm APIs.

A possible change: Not using Glib::ustring

We are still considering whether to replace uses of Glib::ustring in our API with std::string, maybe just keeping Glib::ustring for when people really want to manipulate UTF-8 “characters”. I would much prefer standard C++ to have real UTF-8 support, for instance letting us step through and insert characters in a UTF-8 string, but that doesn’t look like it will happen in the foreseeable future.

Glib::ustring still wraps useful UTF-8 APIs in glib, in a std::string-like API, so we wouldn’t want to remove it.

Also, currently it’s useful to know that, for instance, a gtkmm method that returns a Glib::ustring is giving us a UTF-8 string (such as Gtk::FileChooser::get_current_name()), rather than something of unknown encoding (such as Gtk::FileChooser::get_filename()). We allow implicit conversions, for convenience, so we can’t use the compiler to check for awareness of these encoding differences, but having it in the method signature still feels nicer than having to read a method’s documentation.

Trying googletest

Recently I’ve played a bit with googletest in a couple of small projects: in murrayc-tuple-utils (and some later commits) and in prefixsuffix. It’s pretty straightforward and improved my test code and test result output (test-suite.log with autotools). Here are some notes.

Code Changes

I usually just use the standard autotools and CMake test features, sprinkling assert()s and “return EXIT_FAILURE;” in my C++ test code. However, googletest lets me replace a simple assert():

assert(foo.get_thing() == "hello");

with an EXPECT_EQ() call that will give me a clue when the test fails:

EXPECT_EQ("hello", foo.get_thing());

There are some other simple assertions.

googletest also wants me to declare test functions, previously simply declared like so,

void test_thing_something() {

now like so:

TEST("TestThing", "Something") {

which is all a bit macrotastic, unfortunately, but it works.

I chose to use the standard main() implementation, so I just removed my main() function that called these tests and compiled gtest_main.cc into the gtest library, as suggested in the googletest documentation.

Build Changes

googletest is meant to be built in your project, not built separately and just linked to as a library. For performance testing, I can see the benefit of building the test framework with exactly the same environment and options as the code being tested, but this does seem rather awkward. I guess it’s just what Google do in their monolithic builds so there’s nobody making an effort to support any other build system.

(Update: I wrote that in 2016, but now in 2019, I’ve noticed that added documentation about using pkg-config in August 2017 (now here) and removed that advice about building it yourself, in July 2018. I’m very pleased about that. However, your Linux distro probably doesn’t have a package for the library. The Debian googletest package and the Ubuntu googletest package (the same as the libgtest-dev package), for instance, “does not contain a library to link against, but rather the source code to build the google test and mock libraries”.

Anyway, it’s fairly easy to add googletest as a git submodule and build the library in your autotools or CMake project. For autotools, I did this with libtool in murrayc-tuple-utils and without libtool in prefixsuffix.

Unfortunately, I had to list the individual googletest source files in EXTRA_DIST to make this work with autotool’s “make distcheck”.

This is easier with CMake, because googletest has a CMakeList.txt file so you can use “add_subdirectory (googletest)”. Of course, CMake doesn’t have an equivalent for “make distcheck”.

Also unfortunately, I had to stop using the wonderful -Wsuggest-override warning with warnings as error, because googletest doesn’t use the override keyword. I think it hasn’t really caught up with C++11 yet, which seems odd as I guess Google is using at least C++11 in all its code.

Conclusion

The end result is not overwhelming so far, and it’s arguable if it’s worth having to deal with the git submodule awkwardness. But in theory, when the tests fail, I now won’t have to add so many printfs to find out what happened. This makes it a bit more like using JUnit with Java projects.

Also in theory, the test output would integrate nicely with a proper continuous integration system, such as Jenkins, or whatever Google use internally. But I’m not using any of those on these projects. Travis-CI is free with github projects, but it seems to be all about builds, with no support for test results.

Suffix Tree: Suffix Array and LCP Array, C++

I played a bit more with my C++ Suffix Tree experiment, gradually exploring how Suffix Trees are equivalent to a Suffix Array and its associated LCP Array, allowing us to do anything with Suffix Arrays (and LCP arrays – Longest Common Prefix) that we can do with Suffix Trees, but with less memory (but only by a constant factor) and, I think, more cache-friendly use of memory. This code is still far from suitable for any real use.

Converting between Suffix Trees and Suffix Arrays

I first moved my previous code for Ukkonen’s algorithm from a branch into master, so we can now construct a SuffixTree from a string, in O(n) linear time. Subsequent insert()s still use the naive O(n^2) code. Traditionally, you’d concatenate all your strings with unique separator characters, but I’m trying to avoid that. And, really, I’m just trying to explore the use of the relevant algorithms – it’s a bonus if I can wrap it up into something useful.

I then added a get_suffix_array_and_lcp_array() method to get the Suffix Array and associated LCP Array, in O(n) linear time. This works by doing a lexicographically-ordered depth first search.

Finally, I added a SuffixTree constructor that takes such a Suffix Array and LCP Array to build the suffix tree, again in O(n) linear time. This iterates over the suffixes in the Suffix Array adding a node for each, walking back up the tree to split at the point specified by the LCP array. I kept the path in a stack for that walk up. This constructor let me test that I could use a Suffix Array (and LCP Array) obtained from the SuffixTree to recreate the SuffixTree correctly.

This was based on the ideas I learned from Dan Gusfield’s Fundamentals of Stringology lectures, specifically part 3. Incidentally, his later discussion of the Burrows Wheeler Transform led me to Colt McAnlis’ wonderful Compressor Head series, which far more people should watch.

Towards a usable SuffixArray API.

I’ve already created a SuffixArray class to do much of what the SuffixTree class can do, such as finding matches for a substring. It currently does an O(log(n)) binary search over the suffix array. But this should be possible in O(m) time, where m is the length of the substring, hopefully with far less string comparisons, just as it is possible with a Suffix Array. This can apparently be done with the help of the LCP Array. Hopefully I’ll get to that.

I also want to implement Kärkkäinen & Sanders DC3 algorithm to construct the Suffix Array directly from the string in O(n) linear time. And, if I understand Dan Gusfield’s lecture properly, we can use the Burrows-Wheeler transform to construct the associated LCP Array also in linear time. Hopefully I’ll get to that too.

gtkmm 4 started

We (the gtkmm developers) have started work on an ABI-breaking gtkmm-4.0, as well as an ABI-breaking glibmm, target GTK+ 4, and letting us clean up some cruft that has gathered over the years. These install in parallel with the existing gtkmm-3.0 and glibmm-2.4 APIs/ABIs.

A couple of days ago I released first versions of glibmm (2.51.1) and gtkmm (3.89.1), as well as accompanying pangomm and atkmm releases.

This also lets us use my rewrite of libsigc++ for libsigc++-3.0, bringing us more fully into the world of “modern C++”. We might use this opportunity to make other fundamental changes, so now is the time to make suggestions.

We did a parallel-installing ABI-breaking glibmm even though glib isn’t doing that. That’s because, now that GTK+ is forcing us to do it for gtkmm, this seems like as good a time as any to do it for glibmm too. It’s generally harder to maintain C++ ABIs than C ABIs, largely because C++ is just more complicated and its types are more exactly specified. For instance we’d never have been able to switch to libsigc++-3.0 without breaking ABI.

 

Suffix Tree, Ukkonen, C++

I’ve been playing with suffix trees recently, in particular trying to get my head around Ukkonen’s algorithm for suffix tree construction. I’ve got the hang of it now after trying out some code and checking it against this excellent video walk through of Ukkonen’s algorithm by Tushar Roy.

Tries, Radix Trees, Suffix Trees

A suffix tree stores every suffix of a string, or of several strings, and makes it easy to find prefixes of those suffixes, in linear O(m) time, where m is the length of the substring. So it lets you find substrings in O(m) time. So it can be used for full text search, and it’s often used for DNA comparison.

A suffix tree is actually a radix tree (also known as a radix trie), with suffixes inserted instead of just the whole string. A radix tree  is itself a compacted trie (also known as a prefix tree).

A trie (pronounced try) is a tree with characters on the edges, letting you discover the whole string as you traverse the edges, staring with the edge from the root for the first character of the string. For instance, this lets you offer text completion if the user gives you the first few correct characters of a word.

A radix tree compacts the Trie by letting each edge have several characters, splitting the edge only when necessary to distinguish different strings. So, while a Trie containing “abc” would have a chain of root->a->b->c, a Radix Tree would have just root->abc. Adding “abd” would change that to root->ab->c, with another for root->ab->d.

A suffix tree, or radix tree, can save space by storing just the start and end indices of its strings instead of the strings themselves. Then each edge has constant size, and the space for the tree is linear O(m), where m is the length of one inserted string. Or maybe you could say its linear O(nk), for n strings of length k, where k is m considered effectively constant.

For the string “banana” a suffix tree would look like this, with both leaf nodes and intermediate end markers marked with *.

root
    banana*
    a*
     na*
       na*
    na*
      na*

We can add a second string in the same suffix tree, associating values with the leaf at the end of each path, letting us know which items have a particular substring. For instance, if I add “banana” (value 0) and “bandana” (value 1) to a suffix tree, I can ask about “ana” and discover that it identifies both items (value 0 and value 1), via the a->n->a path. The tree looks like this:

ban
   ana (0)
   dana (1)
a (0, 1)
 n
  a (0, 1)
   na (0)
  dana (1)
n
 a (0, 1)
  na (0)
 dana (1)
dana (1)

Suffix Tree Construction

The hard part is constructing the suffix tree. When we insert one string of length m, we actually insert m suffixes. If each suffix insertion takes O(m) time, then the whole construction time is at least O(m^2).

Ukkonen’s algorithm reduces this to O(m) by iterating through the string just once from start to end and using the ideas of an implicit “global end”, the “active point” and “suffix links” to jump to wherever the rest of a smaller suffix should be inserted.

Each “phase” deals with one character from the string, in sequence, performing “extensions” within that “phase” to add suffixes that start with that character. This is useful because of the repeating sub-structure of suffix trees – each sub-tree can appear again as part of a smaller suffix.

Global End

Whenever the algorithm adds a node, that’s going to be a leaf node, meaning the edge to the leaf node will have the actual end of the inserted string. So the algorithm assigns a global end to the edge. We increment that end index each time we process the next character, so these edges get longer automatically.

Active point

The active point identifies a point on an edge in the tree by specifying a node, an edge from that node (by specifying a first character), and a length along that edge. Within every “extension”, the algorithm starts looking for the next character at the active point. Sometimes it will find the next character only because an edge got longer automatically when we incremented our global end.

Whenever it finds the next character from that active point, maybe even after stepping across an intermediate node,  it sets that newly-found position as the active point. This is called Rule 3.

Whenever it doesn’t find the next character from that active point, it adds a node, splitting the edge if necessary. This is called Rule 2. Whenever it splits an edge and creates an internal node, it changes the active point to the same character in a smaller edge, and looks again for the same next character. This lets it do the same split in the smaller suffix. It keeps doing this until all the smaller suffixes have been split in the same way.

This example, with the string “banana”, shows the use of the global end and of Rule 3 and Rule 2. It doesn’t show the use of suffix links yet, because it always splits edges from the root. Note that Ukkonen’s algorithm typically appends a special out-of-alphabet “$” character to the string so it can insert leaf nodes.

b: Look from root. Add to root.

b

a: Look from root. Add to root.

b
ba (because we incremented the global end).

n: Look from root. Add to root.

ban (because we incremented the global end).
an (because we incremented the global end).
n

a: Look from root. Find it (“ana”). Set it as the active point.

bana (because we incremented the global end).
ana (because we incremented the global end).
na

n: Look from previously-found “a”, in “_a_na”. Find it. Set it as the active point.

banan (because we incremented the global end).
anan (because we incremented the global end).
nan

a: Look from previously-found “n” in “a_n_an”. Find it. Set it as the active point.

banana (because we incremented the global end).
anana (because we incremented the global end).
nana

$: Look from previously-found “a” in “an_a_na”. Not found, so add an internal node and add the $. Change the active point to same character in a smaller (suffix) edge: “a” in “n_ana”.

banana$ (because we incremented the global end).
ana
   na$ (because we incremented the global end).
   $
nana$

$: Look from “a” in “n_a_na$”. Not found, so add an internal node and the $.
Change the active point to same character in a smaller (suffix) edge: “a” in “_a_na$”.

banana$
ana
   na$
   $
na
  na$
  $

$: Look from “a” in “_a_na$”. Not found, so add an internal node and the $. Change the active point to root.

banana$
a
 na
   na$
   $
 $
na
  na$
  $

$: Look from root.  Add to root.

 banana$
 a
  na
    na$
    $
  $
 na
   na$
   $
 $

Suffix Links

In Rule 2, when the algorithm splits an edge, adding an internal node, if sets a suffix link for that internal node, pointing to root. If it has previously set a suffix link for any other internal node, while looking for the same character (in the same “phase”), it updates the link for that previously-created internal node to point to the new node.

If the active node is not root after adding the internal node, it also changes the active node via the active node’s suffix link, taking us to a previously-created internal node. As before, this lets us jump to the smaller suffix of the same substring, to make the same split there, but suffix links let us do it even when we are not on an edge from root.

We can see this when using “banbadbans” as an example. We reach this point:

a: Look from “n” in “ba->_n_badban$”. Not found, so add an internal node and add the $. Follow the suffix link from the “n” to the “n” in “a->_n_badban$”.

ba
  n
   badban$
   $
  dban$
a
 nbadban$
 dban$
nbadban$
dban$

a: Look from “n” in “a->_n_badban$”. Not found, so add an internal node and add the $. Follow the suffix link from the “n” to the “n” in “_n_badban$”.

ba
  n
   badban$
   $
  dban$
a
 n
  badban$
  $
 dban$
nbadban$
dban$

a: Look from “n” in “_n_badban$”. Not found, so add an internal node and add the $. Follow the suffix link from the “n” to root.

ba
  n
   badban$
   $
  dban$
a
 n
  badban$
  $
 dban$
n
 badban$
 $
dban$

a: Look from root. Not found. Add to root.

ba
  n
   badban$
   $
  dban$
a
 n
  badban$
  $
 dban$
n
 badban$
 $
dban$
$

Example Code

Here is some example C++ code. It’s also in github, with far more comments.

const auto key_start = key.start_;
const auto key_end = str_end(key);

ActivePoint active;
active.node = &root_;
active.edge_valid = false; // Instead of -1.
active.length = 0;

std::size_t remaining = 0;
auto end_ptr = std::make_shared<KeyIterator>(key_start);
KeyIterator& end = *end_ptr;

// The "phases"
for (auto i = key_start; i != key_end; ++i) {
  ++remaining;
  ++end;

  Node* prev_created_internal_node = nullptr;

  // The "extensions".
  while(remaining) {
    const auto edge_match = (active.edge_valid && active.length) ?
      find_partial_edge(active, i) :
      find_partial_edge(active.node, i);
    const auto edge = edge_match.edge_;
    const auto part_len_used = edge_match.edge_part_used_;

    if (!edge_match.char_found_) {
      KeyInternal prefix(i, end_ptr);

      // Rule 2 extension: There is no match:
      if (part_len_used == 0) {
        active.node->append_node(prefix, value);
      } else {
        auto extra_node = edge->split(part_len_used);
        extra_node->append_node(prefix, value);
        extra_node->suffix_link_ = &root_;

        if (prev_created_internal_node) {
          prev_created_internal_node->suffix_link_ = extra_node;
        }
        prev_created_internal_node = extra_node;

        if (active.node != &root_) {
          active.node = active.node->suffix_link_;
        } else {
          --active.length;
          ++active.edge;
        }
      }

      --remaining;
      continue;
    }

    // Rule 3 extension:
    active.node = edge_match.parent_node_;
    active.edge = edge->part_.start_;
    active.edge_valid = true;
    active.length = part_len_used;

    break;
  }
}

Generic Suffix Tree in C++

However, the basic Ukkonen’s algorithm just stores one string. Before trying Ukkonen’s algorithm, I wrote a templated C++ suffix tree that can contain multiple strings (see the test code), with a simple O(m^2), or possibly O(m^3) construction time.

I created a version that uses Ukkonen’s algorithm, reducing construction to O(m) time, but it doesn’t support multiple inserts. In this implementation, I’ve avoided the need for extra “$” leaf nodes. I’ve also stored the suffix links in a hash table, to avoid wasting space per node even after construction, though that replaces a true constant-time read/write with an amortized constant time hash table insert/lookup.

I’m trying to adapt the Ukkonen-based suffix tree to support multiple inserts. Usually, to create such a “generalized” suffix tree with Ukkonen’s algorithm, you concatenate the strings together, separated by unique out-of-alphabet characters (“$”, “#”, etc), insert the whole thing, then prune the paths you don’t need, but I’d like to avoid that.

I’m not suggesting that anyone use this code for serious work. For now, it’s littered with asserts() and debug output and it’s in need of optimization because it’s generalized for any standard C++ container, such as std::string. For instance, it doesn’t seem wise to store std::string::iterators, instead of pointers, for the start/end ranges on edges. I’m also bothered by the use of std::shared_pointer<>s to share the “global end”, which still take space per edge even after construction.

Alternative Algorithms

Update: I’ve just watched Erik Demaine’s 2012 MIT “Strings” lecture about suffix trees (I needed to watch his previous “Static Strings” lecture first). He doesn’t mention Ukkonen’s algorithm, but does mention a “DC3 (Difference Cover 3)” algorithm by Kärkkäinen-Sanders-Burkhardt (at least the lecture’s student notes call it that) for building suffix arrays in linear time. He shows that a suffix tree can then be constructed from a  suffix array by using the LCP (Longest Common Prefix) of the suffix array to build a cartesian tree, again in linear time. So the whole suffix array construction would be linear time. Suffix arrays themselves anyway offer better data locality than suffix trees and this DC3 algorithm apparently allows parallelization. I hope I find time to explore all that sometime in code.

He also mentions the use of a “suffix tray” to store the edges in each node, to get  O(P  + lg(Σ))) query time, much the same as the O(P) time you’d get with an array, while reducing the overall space complexity to O(T) instead of O(TΣ ). Here P is the size of the query string, T is the size of the text being searched, and Σ is the size of the alphabet. He also mentioned instead using a combination of hash tables and van Emde Boas trees to get O(P + lglg(Σ)) query time. Those parts of the lecture felt quite rushed to me, so I’d like to investigate that sometime too if I find time.

C++17: constexpr if

Yesterday I played with constexpr if a little, using clang++ built from the latest LLVM sources. C++17 will not have many of the features I was looking forward to, but this is a really nice convenience.

It lets us clean up some of those little templates-plus-specializations that only exist to execute different code in a parent template depending on the template’s type, particularly when the specialization is on a bool or integer value. It won’t help if the template specializations need different member variables or need to provide different type aliases or typedefs.

For instance, libsigc++-3.0 still had a little with_type template with partial template specializations for true and false. Replacing it with constexpr if makes the intention clearer.

I made many similar simplifications by using constexpr if in my murrayc-tuple-utils, which is used by libsigc++-3.0 via copy/paste/rename.

In my murrayc-dp-algorithms project, some code became much clearer and more efficient with constexpr if, allowing me to remove some template specializations that only existed to satisfy the compiler.

Galaxy Zoo iPhone/iPad app

A few weeks ago I finally released my iOS app for Galaxy Zoo on the Apple App Store. Now that it’s been used a bit and I’ve had the chance to fix a couple of crashes, I’m announcing it properly. I am a little afraid of harsh reviews but please try it out and  be kind.

I’m still not quite as comfortable with it as I am with my Android app for Galaxy Zoo, partly because I just don’t feel at home with Objective-C and because iOS lacks something like SyncProvider. But it seems to work.

I actually finished writing the app about eight months ago, but I had to wait until Apple gave me permission to use the old “Galaxy Zoo” app name, allowing this app to appear as just a new version of the old (withdrawn) Galaxy Zoo app from 2014 or so. Joe Zuntz, who wrote the old app, was very helpful with that, pestering Apple until it happened. Unlike that old app, it should be possible to keep this one working even if the server changes, because it’s open source on github.

Here are some screenshots:

screenshot_iphone_4p7inch_classify
screenshot_iphone_4p7inch_list
screenshot_iphone_4p7inch_help