Tag Archives: C++14

Suffix Tree: Suffix Array and LCP Array, C++

I played a bit more with my C++ Suffix Tree experiment, gradually exploring how Suffix Trees are equivalent to a Suffix Array and its associated LCP Array, allowing us to do anything with Suffix Arrays (and LCP arrays – Longest Common Prefix) that we can do with Suffix Trees, but with less memory (but only by a constant factor) and, I think, more cache-friendly use of memory. This code is still far from suitable for any real use.

Converting between Suffix Trees and Suffix Arrays

I first moved my previous code for Ukkonen’s algorithm from a branch into master, so we can now construct a SuffixTree from a string, in O(n) linear time. Subsequent insert()s still use the naive O(n^2) code. Traditionally, you’d concatenate all your strings with unique separator characters, but I’m trying to avoid that. And, really, I’m just trying to explore the use of the relevant algorithms – it’s a bonus if I can wrap it up into something useful.

I then added a get_suffix_array_and_lcp_array() method to get the Suffix Array and associated LCP Array, in O(n) linear time. This works by doing a lexicographically-ordered depth first search.

Finally, I added a SuffixTree constructor that takes such a Suffix Array and LCP Array to build the suffix tree, again in O(n) linear time. This iterates over the suffixes in the Suffix Array adding a node for each, walking back up the tree to split at the point specified by the LCP array. I kept the path in a stack for that walk up. This constructor let me test that I could use a Suffix Array (and LCP Array) obtained from the SuffixTree to recreate the SuffixTree correctly.

This was based on the ideas I learned from Dan Gusfield’s Fundamentals of Stringology lectures, specifically part 3. Incidentally, his later discussion of the Burrows Wheeler Transform led me to Colt McAnlis’ wonderful Compressor Head series, which far more people should watch.

Towards a usable SuffixArray API.

I’ve already created a SuffixArray class to do much of what the SuffixTree class can do, such as finding matches for a substring. It currently does an O(log(n)) binary search over the suffix array. But this should be possible in O(m) time, where m is the length of the substring, hopefully with far less string comparisons, just as it is possible with a Suffix Array. This can apparently be done with the help of the LCP Array. Hopefully I’ll get to that.

I also want to implement Kärkkäinen & Sanders DC3 algorithm to construct the Suffix Array directly from the string in O(n) linear time. And, if I understand Dan Gusfield’s lecture properly, we can use the Burrows-Wheeler transform to construct the associated LCP Array also in linear time. Hopefully I’ll get to that too.

gtkmm 4 started

We (the gtkmm developers) have started work on an ABI-breaking gtkmm-4.0, as well as an ABI-breaking glibmm, target GTK+ 4, and letting us clean up some cruft that has gathered over the years. These install in parallel with the existing gtkmm-3.0 and glibmm-2.4 APIs/ABIs.

A couple of days ago I released first versions of glibmm (2.51.1) and gtkmm (3.89.1), as well as accompanying pangomm and atkmm releases.

This also lets us use my rewrite of libsigc++ for libsigc++-3.0, bringing us more fully into the world of “modern C++”. We might use this opportunity to make other fundamental changes, so now is the time to make suggestions.

We did a parallel-installing ABI-breaking glibmm even though glib isn’t doing that. That’s because, now that GTK+ is forcing us to do it for gtkmm, this seems like as good a time as any to do it for glibmm too. It’s generally harder to maintain C++ ABIs than C ABIs, largely because C++ is just more complicated and its types are more exactly specified. For instance we’d never have been able to switch to libsigc++-3.0 without breaking ABI.

 

Boost Graph Library and C++ >17

I’ve recently been looking at the Boost Graph Library (BGL), by reading through the excellent BGL book and playing with the BGL examples, which are mostly from the book.  Although it  was written 15 years ago, in 2001, it doesn’t feel dated because it uses techniques being actively discussed now for C++17 or later. And there’s a clear lineage from then to now.

For instance, I’ve been watching the charming Programming Conversations videos by Alexander Stepanov, who brought generic programming to C++ by designing the STL. At some early point in the videos, he mentioned the idea of concepts, which he expected to be in C++17, and showed how he mimicked the simpler concepts syntax (without the checking) with some #defines. Unfortunately, that proposal, by Andrew Sutton, has recently been rejected for C++17, though it seems likely to succeed after C++17. Andrew Sutton demonstrates the proposed syntax wonderfully in this C++Now 2015 Keynote video. He has implemented C++ concepts in gcc 6, and I’ve played with it very superficially for libsigc++.

I’ve written and maintained lots of C++ template code and I’ve never liked how each template makes only implicit requirements of its types. I would feel more comfortable if I could express those requirements in a concept and have the compiler tell me if my template’s types don’t “model” the concept. Eventually compiler errors will then mention problems at that higher level instead of spewing details about exactly how compilation failed. And eventually, C++ might allow checking of semantics as well as just syntax. I can even imagine tools that would analyze template code and suggest that it should require certain concepts for its types, a little like how the latest compilers can suggest that I use the override keyword on virtual method overloads. This means more checking at compile time, and that makes me happy. However, I understand why it would need multiple compilers to implement it before it would be accepted into C++17.

Anyway, I started reading that BGL book and immediately noticed the foreword by the same Alexandar Stepanov, which mentions generic programming ideas such as concepts. The BGL uses concepts, though with minimal checking, and the book uses these to show the structure of the API. Furthermore, as I tried to get some simple changes into the BGL, I noticed that the same Andrew Sutton had been a maintainer of the BGL.

I began playing with the BGL by converting its example code to modern C++, replacing as many of those verbose traits typedefs and awkward tie() calls, with the auto keyword and range-based for loops. The result looks far clearer to me, letting us see how the API could be improved further. For instance, the BGL’s use of generic free-standing functions can seem a little unconstrained to people used to knowing exactly what method they can call on what object, particularly as the BGL puts everything in the boost namespace instead of boost::graph). But Bjarne Stroustrup’s Unified Call proposal (apparently rejected for C++17 too) would improve that. For instance, num_vertices(graph), could be written as graph.num_vertices() and Concepts would let the compiler know if that should be allowed.

So, though the BGL source code seems to have had very little attention over the last 15 years, and now looks almost abandoned, it’s clearly been an inspiration for the most current trends in C++, such as Concepts and Unified Calling. All the work on C++11 and C++14 has drained the swamp so much that these old ideas are now more obviously necessary.

libsigc++: std::function-style syntax.

Yesterday I released version 2.99.2 of libsigc++-3.0. This changes the declaration syntax for sigc::slot and sigc::signal to use the same style as std::function, which was added in C++11. We don’t want to be arbitrarily and unnecessarily different.

We’ve also simplified sigc::mem_fun() to always take a reference, instead of either a reference or a pointer.

So, where you might do this for libsigc++-2.0:

sigc::slot<void, int, A> slot = sigc::mem_fun(this, &Thing::on_something);
signal<void, int, A> m_signal;

You would now do this for libsigc++-3.0.

sigc::slot<void(int, A)> slot = sigc::mem_fun(*this, &Thing::on_something);
signal<void(int, A)> m_signal;

Actually, as of version 2.9.1 of libsigc++-2.0, you can use both syntaxes with libsigc++-2.0, allowing you some time to adapt to the new API before one day moving to libsigc++-3.0.

libsigc++ 3.0: Very variadic

I have just released libsigc++ 2.99.1, the first release of the libsigc++-3.0 API, which installs in parallel with the existing libsigc++-2.0 API. This is libsigc++ using much more modern C++, specifically C++14. The API itself is almost completely unchanged, but the implementation is now clearer and easier to contribute to. I’m rather proud of my work, but I’m quite sure that there’s still room for improvement.

Unfortunately, glibmm and gtkmm can’t use libsigc++-3.0 until they have their own parallel-installable versions. That’s not likely to happen until the mythical GTK+ 4 happens. Even though libsigc++ is mostly all templates in headers, its symbols do appear in the linker symbols for gtkmm method signatures, so changes to libsigc++ can break gtkmm ABI. I do have a sigc3 branch of glibmm in git just to show that it works.

With around 150 commits, I gradually refactored libsigc++ to use variadic templates with C++14 instead of generating code from nasty m4 files. Over the years, we have built up quite a large set of regression tests, run during “make check” and “make distcheck”, including various corner cases. Without these tests, the refactoring would have been much harder. With the tests, I could make small commits, knowing that each commit had not broken anything. When something did seem wrong, I could add a test and go back through the git history to find the problem, sometimes splitting commits into even smaller changes. I did a lot of that, rebasing several times, sometimes stopping and starting again after help from Jonathan Wakeley and Marcin Kolny. Those tests give me much more confidence in the end result than I could have if I had chosen to just reimplement the entire API from scratch.

The API is almost all .h files and according to wc, there are 24,145 lines of code in those files in libsigc++ 2.7.1 (after make), and  6,507 in libsigc++ 2.99.1. So there is now only 27% as much code.

This is possible because:

  • C++ variadic templates allow us to have one class or function where we previously had to generate multiple versions for 1 to 6 function parameters, sometimes with additional versions for const and non-const parameters or const and non-const (and volatile and non-volatile) member function pointers.
  • decltype(auto) lets us avoid lots of templated type traits just to correctly specify the correct type for methods.
  • The standard C++ type traits, such as std::conditional<>, std::result_of<>, std::is_base_of<>, std::remove_volatile<> and std::is_const<> let us write very generic code, sometimes replacing our own type traits. I added some more to libsigc++ to get compile-time type traits for member method pointers.
  • template aliases (like typedefs for templates) avoided the need for multiple functor classes deriving from a common base, even though I ended up not needing most of these aliases either.
  • I replaced our sigc::ref() and sigc::reference_wrapper() with std::ref() and std::reference_wrapper<>. Presumably these share a common history.
  • I removed some configure checks and ifdef-ed workarounds for older MSVC++ and Sun Forte compilers. Hopefully they aren’t necessary now, but we will see.

For some adaptors  I used the tuple utilities that I’ve been working on recently, for instance in sigc::bind(). These are copied into the libsigc++ source code, and I’d particularly welcome improvements to them in the form of patches or github pull requests.

I’m still not completely happy with all the overloads we have for sigc::mem_fun(), to take member functions that are non-const, const, non-volatile and/or volatile, but I have some things still to try. We might also remove several by not allowing both mem_fun(pointer, func) and mem_fun(reference, func).

Please do suggest ways to simplify the code yet further.

Playing with C++ Variadic Templates and Dynamic Programming

Dynamic Programming C++ base classes

Over the last few months I’ve been exploring various dynamic programming (DP) algorithms, trying to get a feel for how to choose suitable sub-problems and the dimensions of those sub-problems, and when a top-down or bottom-up strategy is best. In principle, dynamic programming lets you break a problem down into suitable sub-problems, as long as they obey the “principle of optimality“, meaning you can use some broad state (result) of the sub-problem calculation, without needing the actual specifics of each solution.

I’ve also been playing lots with variadic templates in modern C++.

My little murrayc-dp-algorithms project is the result. It has base classes for both bottom-up DP and top-down DP and several examples that use them. The end result is only slightly useful, and is currently far from optimal for performance. But I still think it’s an interesting exercise that shows how these various DP algorithms really relate to each other. It should also explain how I came up with those examples in my earlier blog post.

Top-Down DP

Let’s start with top-down dynamic-programming, because it’s generally simpler. Calculating Fibonacci numbers is the classic example. This is clearly inefficient because every recursive call recalculates almost the same lesser Fibonacci numbers, twice. This is exponential, with time complexity of roughly O(2^n).

int f(int i) {
  if (n == 0)
    return 0;
  else if (n == 1)
    return 1;

  return f(i - 1) + f(i -2);
}

One simple way to avoid the wasteful recalculation is to cache the results and reuse them. That’s generally called Memoization. For Fibonacci number calculation, this is still not the best way, but it’s a useful example.

DpTopDownBase provides a framework for dynamic programming with memoization. You provide a calc_subproblem() implementation, which should use get_subproblem(). That get_subproblem() will return a previously-calculated result if it exists, or recursively call calc_subproblem().

1-Dimensional Top-Down DP

For instance, the top-down DpFibonacci class is declared like so, deriving from DpTopDownBase:

class DpFibonacci
  : public murraycdp::DpTopDownBase< unsigned long, // subproblem type unsigned int // i > {
public:
  explicit DpFibonacci(unsigned int n)
  : n_(n) {}

This specifies that its calc_subproblem() (and the base class’ get_subproblem())will take an unsigned int and return an unsigned long:

unsigned long
calc_subproblem(
  type_base::type_level level,
  unsigned int i) const override {
  // Base cases:
  if (i == 0) {
    return 0;
  } else if (i == 1) {
    return 1;
  }

  return get_subproblem(level, i - 1)
    + get_subproblem(level, i - 2);
}

The level parameter is just for optional debug output, to show how deep we are in the recursive call stack.

It also implements get_goal_cell() so that the base class knows which value of n to use to get the result f(n). This just returns the n provided to DpFibonacci’s constructor.

We can then instantiate the class and call calc():

DpFibonacci dp(n);
auto result = dp.calc();

The base DpTopDownBase::calc() method calls get_goal_cell() and calls calc_subproblem() for that cell. Because our calc_subproblem() implementation uses get_subproblem(), it will automatically avoid recalculating already-calculated sub-problems.

That DpTopDownBase::calc() method is a fairly simple example of the usefulness of C++ variadic template parameters to call one variadic method with the result of another variadic method, using tuples and std::experimental::apply(). And it’s type-safe at compile time all the way.

2-Dimensional Top-Down DP

But DpTopDownBase isn’t just good for a 1-dimensional cache, based on one parameter. The wonder of C++ variadic templates lets us specify multiple parameters for calc_subproblem and get_goal_cell(). Those overridden methods take just the right parameter types and just the right number of parameters.

And this is where top-down DP can be useful, because it might not be necessary to calculate all possible sub-problems for all possible values of these parameters.

For instance, in the Knapsack Problem, we have a knapsack with a certain maximum weight capacity, and we have several items of various weights and values. We want to find the set of items to put in the knapsack that maximises the value. We give each item an index number.

In the classic Knapsack DP algorithm, we then have 2 dimensions (or parameters) for each sub-problem: A count of items and a maximum weight. So if we have 5 items and total capacity of 100Kg, we can calculate the max value, f(5, 100Kg) by using other sub-problems such as f(4, 80Kg) along the way. It’s not likely that we’ll need to calculate all 5 * 100 possible subproblems, though it is likely that we’ll reuse some sub-problem calculations.

For instance, the top-down DpKnapsack class derives from the base DpTopDownBase like so:

class DpKnapsack
: public murraycdp::DpTopDownBase<
    SubSolution,
    SubSolution::type_vec_items::size_type,
    Item::type_weight
  > {

Again it overrides calc_subproblem() and overrides get_goal_cell(). This time it takes 2 parameters instead of just i:

SubSolution
calc_subproblem(
  type_level level,
  type_size items_count,
  type_weight weight_capacity) const override {

And it can call get_subproblem() with those 2 parameters, letting the base class worry about how to cache the sub-problems and when to recalculate them. (This currently uses a rather inefficient generic hashing function for std::tuple<>s. )

Notice that we return a custom SubSolution class for the sub-problem result instead of just a number. This lets us store both the maximum value for that sub-problem and the items used in that sub-problem solution. That helps us to reconstruct the complete set of items for the sub-solutions along the way. However, that is inefficient because we won’t need all sub-solution item sets . It should instead be possible to write some generic code to  reconstruct the solution by working backwards from the result of the goal. I’ll try that out sometime.

Again we can just instantiate the class and call calc():

//Items with their values and weights:
DpKnapsack::type_vec_items items{
  {13, 15}, {12, 14}, {2, 13}, {10, 12},
  {1, 11}, {7, 10}, {6, 9}, {5, 8},
  {11, 7}, {14, 6}, {9, 5}, {4, 4}, {3, 3},
  {22, 2}, {8, 1}};
DpKnapsack::type_value weight_capacity = 45;

DpKnapsack dp(items, weight_capacity);
const auto result = dp.calc();
std::cout << "solution: value: " << result.value << std::endl
  << "with items: ";
print_vec(result.solution);
std::cout << std::endl;

The top-down DpContextFreeGrammarParser , top-down DpMakeChange , top-down DpEditDistance , top-down DpParenthesization do much the same thing. The top-down DpTsp uses 3 dimensions and overrides calc() to choose the minimum result over all the sub-problems at the top level instead of just letting the default base calc() use the result of one goal sub-problem as the overall result.

I think it’s nice to separate the real heart of each DP algorithm from the code that does the boring caching and checking.

Bottom-Up DP

In bottom-up DP, we calculate every single possible sub-problem result, starting from our base case. Even if we won’t need every single possible result, this is often more efficient because we sometimes know when we can discard whole sets of sub-problems. For instance, looking again at the classic Fibonacci Numbers example, we only really need the last 2 sub-problem results:

int f(int n) {
  if (n == 0)
    return 0;
  else if (n == 1)
    return 1;

  int result = 0;
  int prev1 = 1; // f(1) = 0;
  int prev2 = 0; // f(0) = 0;
  for (int i = 1; i < n; ++i) {
    result = (prev1 + prev2);
    prev2 = prev1;
    prev1 = result;
  }

  return result;
}

So we don’t need to cache all i results along the way. For a multi-dimensional DP algorithm, that can save us a lot of space.

1-Dimensional Bottom-Up DP

The bottom-up DpFibonacci class derives from DpBottomUpBase, specifying:

  • the number of sub-problems to keep along the way: 2 to keep f(n-1) and f(n-2).
  • the type of each sub-problem’s (and the final) calculation result: long for the result of f(5), for instance.
  • the type of the input: int for 5, for instance, in f(5).
class DpFibonacci
: public murraycdp::DpBottomUpBase<
    2, // count of sub-problems to keep
    unsigned long, // sub problem type
    unsigned int // i
  > {

DpFibonacci again overrides calc_subproblem() and get_goal_cell(), with exactly the same implementation as the top-down version.

The base class’ calc() method then just iterates from 0 to n, calling the virtual calc_subproblem() method and then returning the result for the goal cell, based on calling the virtual get_goal_cell() method.

DpBottomUpBase keeps just the right number of previously-calculated sub-problems, while reusing the discarded sub-problem storage, by using my (funky) circular_vector.

DpTripleStep is very similar, though it needs to keep the 4 previous sub-problems instead of 2, and of course it does a different calculation per sub-problem.

2-Dimensional Bottom-Up DP

The bottom-up DpKnapsack class is again much like the top-down DpKnapsack implementation, but it again specifies that it only needs the previous 2 sets of sub-problems.

class DpKnapsack
  : public murraycdp::DpBottomUpBase<
    2, // count of sub-problems to keep.
    SubSolution,
    SubSolution::type_vec_items::size_type,
    Item::type_weight
  > {

Where the bottom-up DpFibonacci passed one count to the base class’ constructor, the bottom-up DpKnapsack passes two, one for each calc_subproblem() parameter:

DpKnapsack(
  const type_vec_items& items,
  type_weight weight_capacity)
  : DpBottomUpBase(
      items.size() + 1,
      weight_capacity + 1),
    items_(items),
    weight_capacity_(weight_capacity) {}

The base DpBottomUpBase class uses these constructor parameters as ends of ranges, assuming that each range starts at 0. It will then iterate over the first range, iterating over the second range each time, like a nested for loop:

for (int items_count = 0; items_count < MAX_ITEMS; ++items_count) {
  for (int weight_capacity = 0; weight_capacity < MAX_WEIGHT_CAPACITY; ++ weight_capacity) {
    ...
}

There is no way in C++ to nest an arbitrary number of for loops. But we can use variadic templates to declare an arbitrarily-nested vector. And thanks to my funky variadic templates for vectors of vector (of vectors of vectors, etc), the base class can generically call calc_subproblem() as if it had nested for loops, regardless of how much nesting there should be. My for_vector_of_vectors() takes the (nested) vector, and a series of start and end parameters for each level and eventually uses std::experimental::apply() to call the functor (a lambda, in this case) for each combination. To get that series of start and end parameters, it uses my funky tuple_interlace() to alternate starts (0) with ends.

The bottom-up DpEditDistance and bottom-up DpSubstringMatching do much the same thing.

Problems with DpBottomUpBase

The bottom-up base class has some serious limitations:

It only lets us discard older sub-problems in the first dimension, such as the Knapsack sub-problems for items less than n-2. But you might understand enough about your algorithm to know other sets of sub-problems that can be discarded at various times, based on less simple criteria. This would make the best use of space while avoiding recalculation. Maybe some recalculation would be OK as long as you could minimize it.

Also, it is limited to a simple iteration, using a range from start to end, using a simple operator++ to move to the next iteration. But sometimes you want to do something a little more complicated than an iterating for loop. For instance, my top-down DpTsp class finds a solution for the Travelling Salesman Problem by manipulating sub-sets in its calc_subproblem(). I could use gosper’s hack to create a suitable n-choose-k subset iterator, but it still needs to iterate over a further set of subsets missing each item from the parent subset, and I suppose that would be doable with another custom iterator. But the start and end iterators at each level would then have to be instantiated based on the values of the parent level, and cannot just be supplied before the whole calc() starts. Maybe I could do something with lambda functions to let these iterators be instantiated along the way.