Trying googletest

Recently I’ve played a bit with googletest in a couple of small projects: in murrayc-tuple-utils (and some later commits) and in prefixsuffix. It’s pretty straightforward and improved my test code and test result output (test-suite.log with autotools). Here are some notes.

Code Changes

I usually just use the standard autotools and CMake test features, sprinkling assert()s and “return EXIT_FAILURE;” in my C++ test code. However, googletest lets me replace a simple assert():

assert(foo.get_thing() == "hello");

with an EXPECT_EQ() call that will give me a clue when the test fails:

EXPECT_EQ("hello", foo.get_thing());

There are some other simple assertions.

googletest also wants me to declare test functions, previously simply declared like so,

void test_thing_something() {

now like so:

TEST("TestThing", "Something") {

which is all a bit macrotastic, unfortunately, but it works.

I chose to use the standard main() implementation, so I just removed my main() function that called these tests and compiled gtest_main.cc into the gtest library, as suggested in the googletest documentation.

Build Changes

googletest is meant to be built in your project, not built separately and just linked to as a library. For performance testing, I can see the benefit of building the test framework with exactly the same environment and options as the code being tested, but this does seem rather awkward. I guess it’s just what Google do in their monolithic builds so there’s nobody making an effort to support any other build system.

Anyway, it’s fairly easy to add googletest as a git submodule and build the library in your autotools or CMake project. For autotools, I did this with libtool in murrayc-tuple-utils and without libtool in prefixsuffix.

Unfortunately, I had to list the individual googletest source files in EXTRA_DIST to make this work with autotool’s “make distcheck”.

This is easier with CMake, because googletest has a CMakeList.txt file so you can use “add_subdirectory (googletest)”. Of course, CMake doesn’t have an equivalent for “make distcheck”.

Also unfortunately, I had to stop using the wonderful -Wsuggest-override warning with warnings as error, because googletest doesn’t use the override keyword. I think it hasn’t really caught up with C++11 yet, which seems odd as I guess Google is using at least C++11 in all its code.

Conclusion

The end result is not overwhelming so far, and it’s arguable if it’s worth having to deal with the git submodule awkwardness. But in theory, when the tests fail, I now won’t have to add so many printfs to find out what happened. This makes it a bit more like using JUnit with Java projects.

Also in theory, the test output would integrate nicely with a proper continuous integration system, such as Jenkins, or whatever Google use internally. But I’m not using any of those on these projects. Travis-CI is free with github projects, but it seems to be all about builds, with no support for test results.

Suffix Tree: Suffix Array and LCP Array, C++

I played a bit more with my C++ Suffix Tree experiment, gradually exploring how Suffix Trees are equivalent to a Suffix Array and its associated LCP Array, allowing us to do anything with Suffix Arrays (and LCP arrays – Longest Common Prefix) that we can do with Suffix Trees, but with less memory (but only by a constant factor) and, I think, more cache-friendly use of memory. This code is still far from suitable for any real use.

Converting between Suffix Trees and Suffix Arrays

I first moved my previous code for Ukkonen’s algorithm from a branch into master, so we can now construct a SuffixTree from a string, in O(n) linear time. Subsequent insert()s still use the naive O(n^2) code. Traditionally, you’d concatenate all your strings with unique separator characters, but I’m trying to avoid that. And, really, I’m just trying to explore the use of the relevant algorithms – it’s a bonus if I can wrap it up into something useful.

I then added a get_suffix_array_and_lcp_array() method to get the Suffix Array and associated LCP Array, in O(n) linear time. This works by doing a lexicographically-ordered depth first search.

Finally, I added a SuffixTree constructor that takes such a Suffix Array and LCP Array to build the suffix tree, again in O(n) linear time. This iterates over the suffixes in the Suffix Array adding a node for each, walking back up the tree to split at the point specified by the LCP array. I kept the path in a stack for that walk up. This constructor let me test that I could use a Suffix Array (and LCP Array) obtained from the SuffixTree to recreate the SuffixTree correctly.

This was based on the ideas I learned from Dan Gusfield’s Fundamentals of Stringology lectures, specifically part 3. Incidentally, his later discussion of the Burrows Wheeler Transform led me to Colt McAnlis’ wonderful Compressor Head series, which far more people should watch.

Towards a usable SuffixArray API.

I’ve already created a SuffixArray class to do much of what the SuffixTree class can do, such as finding matches for a substring. It currently does an O(log(n)) binary search over the suffix array. But this should be possible in O(m) time, where m is the length of the substring, hopefully with far less string comparisons, just as it is possible with a Suffix Array. This can apparently be done with the help of the LCP Array. Hopefully I’ll get to that.

I also want to implement Kärkkäinen & Sanders DC3 algorithm to construct the Suffix Array directly from the string in O(n) linear time. And, if I understand Dan Gusfield’s lecture properly, we can use the Burrows-Wheeler transform to construct the associated LCP Array also in linear time. Hopefully I’ll get to that too.

gtkmm 4 started

We (the gtkmm developers) have started work on an ABI-breaking gtkmm-4.0, as well as an ABI-breaking glibmm, target GTK+ 4, and letting us clean up some cruft that has gathered over the years. These install in parallel with the existing gtkmm-3.0 and glibmm-2.4 APIs/ABIs.

A couple of days ago I released first versions of glibmm (2.51.1) and gtkmm (3.89.1), as well as accompanying pangomm and atkmm releases.

This also lets us use my rewrite of libsigc++ for libsigc++-3.0, bringing us more fully into the world of “modern C++”. We might use this opportunity to make other fundamental changes, so now is the time to make suggestions.

We did a parallel-installing ABI-breaking glibmm even though glib isn’t doing that. That’s because, now that GTK+ is forcing us to do it for gtkmm, this seems like as good a time as any to do it for glibmm too. It’s generally harder to maintain C++ ABIs than C ABIs, largely because C++ is just more complicated and its types are more exactly specified. For instance we’d never have been able to switch to libsigc++-3.0 without breaking ABI.

 

Suffix Tree, Ukkonen, C++

I’ve been playing with suffix trees recently, in particular trying to get my head around Ukkonen’s algorithm for suffix tree construction. I’ve got the hang of it now after trying out some code and checking it against this excellent video walk through of Ukkonen’s algorithm by Tushar Roy.

Tries, Radix Trees, Suffix Trees

A suffix tree stores every suffix of a string, or of several strings, and makes it easy to find prefixes of those suffixes, in linear O(m) time, where m is the length of the substring. So it lets you find substrings in O(m) time. So it can be used for full text search, and it’s often used for DNA comparison.

A suffix tree is actually a radix tree (also known as a radix trie), with suffixes inserted instead of just the whole string. A radix tree  is itself a compacted trie (also known as a prefix tree).

A trie (pronounced try) is a tree with characters on the edges, letting you discover the whole string as you traverse the edges, staring with the edge from the root for the first character of the string. For instance, this lets you offer text completion if the user gives you the first few correct characters of a word.

A radix tree compacts the Trie by letting each edge have several characters, splitting the edge only when necessary to distinguish different strings. So, while a Trie containing “abc” would have a chain of root->a->b->c, a Radix Tree would have just root->abc. Adding “abd” would change that to root->ab->c, with another for root->ab->d.

A suffix tree, or radix tree, can save space by storing just the start and end indices of its strings instead of the strings themselves. Then each edge has constant size, and the space for the tree is linear O(m), where m is the length of one inserted string. Or maybe you could say its linear O(nk), for n strings of length k, where k is m considered effectively constant.

For the string “banana” a suffix tree would look like this, with both leaf nodes and intermediate end markers marked with *.

root
    banana*
    a*
     na*
       na*
    na*
      na*

We can add a second string in the same suffix tree, associating values with the leaf at the end of each path, letting us know which items have a particular substring. For instance, if I add “banana” (value 0) and “bandana” (value 1) to a suffix tree, I can ask about “ana” and discover that it identifies both items (value 0 and value 1), via the a->n->a path. The tree looks like this:

ban
   ana (0)
   dana (1)
a (0, 1)
 n
  a (0, 1)
   na (0)
  dana (1)
n
 a (0, 1)
  na (0)
 dana (1)
dana (1)

Suffix Tree Construction

The hard part is constructing the suffix tree. When we insert one string of length m, we actually insert m suffixes. If each suffix insertion takes O(m) time, then the whole construction time is at least O(m^2).

Ukkonen’s algorithm reduces this to O(m) by iterating through the string just once from start to end and using the ideas of an implicit “global end”, the “active point” and “suffix links” to jump to wherever the rest of a smaller suffix should be inserted.

Each “phase” deals with one character from the string, in sequence, performing “extensions” within that “phase” to add suffixes that start with that character. This is useful because of the repeating sub-structure of suffix trees – each sub-tree can appear again as part of a smaller suffix.

Global End

Whenever the algorithm adds a node, that’s going to be a leaf node, meaning the edge to the leaf node will have the actual end of the inserted string. So the algorithm assigns a global end to the edge. We increment that end index each time we process the next character, so these edges get longer automatically.

Active point

The active point identifies a point on an edge in the tree by specifying a node, an edge from that node (by specifying a first character), and a length along that edge. Within every “extension”, the algorithm starts looking for the next character at the active point. Sometimes it will find the next character only because an edge got longer automatically when we incremented our global end.

Whenever it finds the next character from that active point, maybe even after stepping across an intermediate node,  it sets that newly-found position as the active point. This is called Rule 3.

Whenever it doesn’t find the next character from that active point, it adds a node, splitting the edge if necessary. This is called Rule 2. Whenever it splits an edge and creates an internal node, it changes the active point to the same character in a smaller edge, and looks again for the same next character. This lets it do the same split in the smaller suffix. It keeps doing this until all the smaller suffixes have been split in the same way.

This example, with the string “banana”, shows the use of the global end and of Rule 3 and Rule 2. It doesn’t show the use of suffix links yet, because it always splits edges from the root. Note that Ukkonen’s algorithm typically appends a special out-of-alphabet “$” character to the string so it can insert leaf nodes.

b: Look from root. Add to root.

b

a: Look from root. Add to root.

b
ba (because we incremented the global end).

n: Look from root. Add to root.

ban (because we incremented the global end).
an (because we incremented the global end).
n

a: Look from root. Find it (“ana”). Set it as the active point.

bana (because we incremented the global end).
ana (because we incremented the global end).
na

n: Look from previously-found “a”, in “_a_na”. Find it. Set it as the active point.

banan (because we incremented the global end).
anan (because we incremented the global end).
nan

a: Look from previously-found “n” in “a_n_an”. Find it. Set it as the active point.

banana (because we incremented the global end).
anana (because we incremented the global end).
nana

$: Look from previously-found “a” in “an_a_na”. Not found, so add an internal node and add the $. Change the active point to same character in a smaller (suffix) edge: “a” in “n_ana”.

banana$ (because we incremented the global end).
ana
   na$ (because we incremented the global end).
   $
nana$

$: Look from “a” in “n_a_na$”. Not found, so add an internal node and the $.
Change the active point to same character in a smaller (suffix) edge: “a” in “_a_na$”.

banana$
ana
   na$
   $
na
  na$
  $

$: Look from “a” in “_a_na$”. Not found, so add an internal node and the $. Change the active point to root.

banana$
a
 na
   na$
   $
 $
na
  na$
  $

$: Look from root.  Add to root.

 banana$
 a
  na
    na$
    $
  $
 na
   na$
   $
 $

Suffix Links

In Rule 2, when the algorithm splits an edge, adding an internal node, if sets a suffix link for that internal node, pointing to root. If it has previously set a suffix link for any other internal node, while looking for the same character (in the same “phase”), it updates the link for that previously-created internal node to point to the new node.

If the active node is not root after adding the internal node, it also changes the active node via the active node’s suffix link, taking us to a previously-created internal node. As before, this lets us jump to the smaller suffix of the same substring, to make the same split there, but suffix links let us do it even when we are not on an edge from root.

We can see this when using “banbadbans” as an example. We reach this point:

a: Look from “n” in “ba->_n_badban$”. Not found, so add an internal node and add the $. Follow the suffix link from the “n” to the “n” in “a->_n_badban$”.

ba
  n
   badban$
   $
  dban$
a
 nbadban$
 dban$
nbadban$
dban$

a: Look from “n” in “a->_n_badban$”. Not found, so add an internal node and add the $. Follow the suffix link from the “n” to the “n” in “_n_badban$”.

ba
  n
   badban$
   $
  dban$
a
 n
  badban$
  $
 dban$
nbadban$
dban$

a: Look from “n” in “_n_badban$”. Not found, so add an internal node and add the $. Follow the suffix link from the “n” to root.

ba
  n
   badban$
   $
  dban$
a
 n
  badban$
  $
 dban$
n
 badban$
 $
dban$

a: Look from root. Not found. Add to root.

ba
  n
   badban$
   $
  dban$
a
 n
  badban$
  $
 dban$
n
 badban$
 $
dban$
$

Example Code

Here is some example C++ code. It’s also in github, with far more comments.

const auto key_start = key.start_;
const auto key_end = str_end(key);

ActivePoint active;
active.node = &root_;
active.edge_valid = false; // Instead of -1.
active.length = 0;

std::size_t remaining = 0;
auto end_ptr = std::make_shared<KeyIterator>(key_start);
KeyIterator& end = *end_ptr;

// The "phases"
for (auto i = key_start; i != key_end; ++i) {
  ++remaining;
  ++end;

  Node* prev_created_internal_node = nullptr;

  // The "extensions".
  while(remaining) {
    const auto edge_match = (active.edge_valid && active.length) ?
      find_partial_edge(active, i) :
      find_partial_edge(active.node, i);
    const auto edge = edge_match.edge_;
    const auto part_len_used = edge_match.edge_part_used_;

    if (!edge_match.char_found_) {
      KeyInternal prefix(i, end_ptr);

      // Rule 2 extension: There is no match:
      if (part_len_used == 0) {
        active.node->append_node(prefix, value);
      } else {
        auto extra_node = edge->split(part_len_used);
        extra_node->append_node(prefix, value);
        extra_node->suffix_link_ = &root_;

        if (prev_created_internal_node) {
          prev_created_internal_node->suffix_link_ = extra_node;
        }
        prev_created_internal_node = extra_node;

        if (active.node != &root_) {
          active.node = active.node->suffix_link_;
        } else {
          --active.length;
          ++active.edge;
        }
      }

      --remaining;
      continue;
    }

    // Rule 3 extension:
    active.node = edge_match.parent_node_;
    active.edge = edge->part_.start_;
    active.edge_valid = true;
    active.length = part_len_used;

    break;
  }
}

Generic Suffix Tree in C++

However, the basic Ukkonen’s algorithm just stores one string. Before trying Ukkonen’s algorithm, I wrote a templated C++ suffix tree that can contain multiple strings (see the test code), with a simple O(m^2), or possibly O(m^3) construction time.

I created a version that uses Ukkonen’s algorithm, reducing construction to O(m) time, but it doesn’t support multiple inserts. In this implementation, I’ve avoided the need for extra “$” leaf nodes. I’ve also stored the suffix links in a hash table, to avoid wasting space per node even after construction, though that replaces a true constant-time read/write with an amortized constant time hash table insert/lookup.

I’m trying to adapt the Ukkonen-based suffix tree to support multiple inserts. Usually, to create such a “generalized” suffix tree with Ukkonen’s algorithm, you concatenate the strings together, separated by unique out-of-alphabet characters (“$”, “#”, etc), insert the whole thing, then prune the paths you don’t need, but I’d like to avoid that.

I’m not suggesting that anyone use this code for serious work. For now, it’s littered with asserts() and debug output and it’s in need of optimization because it’s generalized for any standard C++ container, such as std::string. For instance, it doesn’t seem wise to store std::string::iterators, instead of pointers, for the start/end ranges on edges. I’m also bothered by the use of std::shared_pointer<>s to share the “global end”, which still take space per edge even after construction.

Alternative Algorithms

Update: I’ve just watched Erik Demaine’s 2012 MIT “Strings” lecture about suffix trees (I needed to watch his previous “Static Strings” lecture first). He doesn’t mention Ukkonen’s algorithm, but does mention a “DC3 (Difference Cover 3)” algorithm by Kärkkäinen-Sanders-Burkhardt (at least the lecture’s student notes call it that) for building suffix arrays in linear time. He shows that a suffix tree can then be constructed from a  suffix array by using the LCP (Longest Common Prefix) of the suffix array to build a cartesian tree, again in linear time. So the whole suffix array construction would be linear time. Suffix arrays themselves anyway offer better data locality than suffix trees and this DC3 algorithm apparently allows parallelization. I hope I find time to explore all that sometime in code.

He also mentions the use of a “suffix tray” to store the edges in each node, to get  O(P  + lg(Σ))) query time, much the same as the O(P) time you’d get with an array, while reducing the overall space complexity to O(T) instead of O(TΣ ). Here P is the size of the query string, T is the size of the text being searched, and Σ is the size of the alphabet. He also mentioned instead using a combination of hash tables and van Emde Boas trees to get O(P + lglg(Σ)) query time. Those parts of the lecture felt quite rushed to me, so I’d like to investigate that sometime too if I find time.

C++17: constsexpr if

Yesterday I played with constexpr if a little, using clang++ built from the latest LLVM sources. C++17 will not have many of the features I was looking forward to, but this is a really nice convenience.

It lets us clean up some of those little templates-plus-specializations that only exist to execute different code in a parent template depending on the template’s type, particularly when the specialization is on a bool or integer value. It won’t help if the template specializations need different member variables or need to provide different type aliases or typedefs.

For instance, libsigc++-3.0 still had a little with_type template with partial template specializations for true and false. Replacing it with constexpr if makes the intention clearer.

I made many similar simplifications by using constexpr if in my murrayc-tuple-utils, which is used by libsigc++-3.0 via copy/paste/rename.

In my murrayc-dp-algorithms project, some code became much clearer and more efficient with constexpr if, allowing me to remove some template specializations that only existed to satisfy the compiler.

Galaxy Zoo iPhone/iPad app

A few weeks ago I finally released my iOS app for Galaxy Zoo on the Apple App Store. Now that it’s been used a bit and I’ve had the chance to fix a couple of crashes, I’m announcing it properly. I am a little afraid of harsh reviews but please try it out and  be kind.

I’m still not quite as comfortable with it as I am with my Android app for Galaxy Zoo, partly because I just don’t feel at home with Objective-C and because iOS lacks something like SyncProvider. But it seems to work.

I actually finished writing the app about eight months ago, but I had to wait until Apple gave me permission to use the old “Galaxy Zoo” app name, allowing this app to appear as just a new version of the old (withdrawn) Galaxy Zoo app from 2014 or so. Joe Zuntz, who wrote the old app, was very helpful with that, pestering Apple until it happened. Unlike that old app, it should be possible to keep this one working even if the server changes, because it’s open source on github.

Here are some screenshots:

screenshot_iphone_4p7inch_classify
screenshot_iphone_4p7inch_list
screenshot_iphone_4p7inch_help

Big-O Quiz: Graph Theory

I’ve been studying more graph theory recently. To take notes that I might remember, I added a Graphs quiz to my Big-O Quiz site. It covers the graph theory mentioned in The Algorithm Design Manual and in the Stanford/Coursera “Algorithms: Design and Analysis Parts 1 & 2”  courses by Tim Roughgarden. I’m adding to it as I work through Tim Roughgarden’s follow-up “Second course in Algorithms“.

It’s often a slightly silly quiz, offering you algorithm descriptions that make no sense for the question’s problem, along with the correct one, but the aim is to just to reinforce my memory by exercising my memory. So I’ve tried to use concise descriptions of problems and algorithms, without trying to be exhaustively correct.

Boost Graph Library and C++ >17

I’ve recently been looking at the Boost Graph Library (BGL), by reading through the excellent BGL book and playing with the BGL examples, which are mostly from the book.  Although it  was written 15 years ago, in 2001, it doesn’t feel dated because it uses techniques being actively discussed now for C++17 or later. And there’s a clear lineage from then to now.

For instance, I’ve been watching the charming Programming Conversations videos by Alexander Stepanov, who brought generic programming to C++ by designing the STL. At some early point in the videos, he mentioned the idea of concepts, which he expected to be in C++17, and showed how he mimicked the simpler concepts syntax (without the checking) with some #defines. Unfortunately, that proposal, by Andrew Sutton, has recently been rejected for C++17, though it seems likely to succeed after C++17. Andrew Sutton demonstrates the proposed syntax wonderfully in this C++Now 2015 Keynote video. He has implemented C++ concepts in gcc 6, and I’ve played with it very superficially for libsigc++.

I’ve written and maintained lots of C++ template code and I’ve never liked how each template makes only implicit requirements of its types. I would feel more comfortable if I could express those requirements in a concept and have the compiler tell me if my template’s types don’t “model” the concept. Eventually compiler errors will then mention problems at that higher level instead of spewing details about exactly how compilation failed. And eventually, C++ might allow checking of semantics as well as just syntax. I can even imagine tools that would analyze template code and suggest that it should require certain concepts for its types, a little like how the latest compilers can suggest that I use the override keyword on virtual method overloads. This means more checking at compile time, and that makes me happy. However, I understand why it would need multiple compilers to implement it before it would be accepted into C++17.

Anyway, I started reading that BGL book and immediately noticed the foreword by the same Alexandar Stepanov, which mentions generic programming ideas such as concepts. The BGL uses concepts, though with minimal checking, and the book uses these to show the structure of the API. Furthermore, as I tried to get some simple changes into the BGL, I noticed that the same Andrew Sutton had been a maintainer of the BGL.

I began playing with the BGL by converting its example code to modern C++, replacing as many of those verbose traits typedefs and awkward tie() calls, with the auto keyword and range-based for loops. The result looks far clearer to me, letting us see how the API could be improved further. For instance, the BGL’s use of generic free-standing functions can seem a little unconstrained to people used to knowing exactly what method they can call on what object, particularly as the BGL puts everything in the boost namespace instead of boost::graph). But Bjarne Stroustrup’s Unified Call proposal (apparently rejected for C++17 too) would improve that. For instance, num_vertices(graph), could be written as graph.num_vertices() and Concepts would let the compiler know if that should be allowed.

So, though the BGL source code seems to have had very little attention over the last 15 years, and now looks almost abandoned, it’s clearly been an inspiration for the most current trends in C++, such as Concepts and Unified Calling. All the work on C++11 and C++14 has drained the swamp so much that these old ideas are now more obviously necessary.

libsigc++ 2.99.5: Even less code

I managed to rip out another chunk of slightly-incomprehensible code from libsigc++, and those improvements are in libsigc++-3.0’s 2.99.5 version.

Now there’s no common functor_base base class and I removed all those interdependent result_type typedefs that were scattered throughout the code. When I had unravelled the whole result_type chain, I could replace it with just a handful of decltype(auto) uses, realising that that’s what the whole mess was trying to achieve. I love decltype(auto).

BigOQuiz: C++ Standard Library Algorithms

I’ve reworked bigoquiz.com, which I created a few weeks ago. to allow multiple quizzes instead of just the default quiz about the Big-O complexity of algorithms and data structures. For some quizzes it also now offers reversed questions, asking you for the question that matches the answer.

For instance, there is now a C++ Standard Algorithms quiz. You could, for instance, check that you know the descriptions of the Sort Operations functions. Or you could check that you know the names of those Sort Operations functions based on their descriptions (or as multiple-cdon’t always hoice). Maybe this is one way to get more familiar with the many standard C++ algorithms that we often don’t realize are available. I hope this helps me as I strive to use “no raw loops“.

As before, you can track your progress in each section of the quiz, and identify problem questions that you keep getting wrong.

screenshot_bigoquiz_cpp_std_algorithms_sections

I’ve also added a Design Patterns quiz, based on the patterns mentioned in the Head-First Design Patterns book.

The user profile page also now shows overall statistics for all the quizzes that you’ve tried.

screenshot_bigoquiz_userprofile