A C++ developer looks at Go (the programming language), Part 1: Simple Features

I’m reading “The Go Programming Language” by Brian Kernighan and Alan Donovan. It is an excellent introduction to the language, clearly written and well structured, with nicely chosen examples. It contains no hand-waving: it’s aware of other languages, and briefly acknowledges the choices made in the language design without lengthy discussion.

As an enthusiastic C++ developer, and a Java developer, I’m not a big fan of the overall language. It seems like an incremental improvement on C, and I’d rather use it than C, but I still yearn for the expressiveness of C++. I also suspect that Go cannot achieve the raw performance of C or C++ due to its safety features, though that may depend on compiler optimizations. But it’s perfectly valid to knowingly choose safety over performance, particularly if you get more safety and more performance than with Java.

I would choose Go over C++ for a simple proof-of-concept program using concurrency and networking. Goroutines and channels, which I’ll mention in a later post, are convenient abstractions, and Go has a standard API for HTTP requests. Concurrency is hard, and it’s particularly easy to choose safety over performance when writing network code.

Here are some of my superficial observations about the simpler features, which mostly seem like straightforward improvements on C. In part 2 I’ll mention the higher-level features and I’ll hopefully do a part 3 about concurrency. I strongly recommend that you read the book to understand these issues properly.

I welcome friendly corrections and clarifications. There are surely several mistakes here, hopefully none major.

No semicolons at the end of lines

Let’s start with the most superficial thing. Unlike C, C++, or Java, Go doesn’t need semicolons at the end of lines of code. So this is normal:

a = b
c = d

This is nicer for people learning their first programming language. It can take a while for those semicolons to become a natural habit.

No () parentheses with if and for

Here’s another superficial difference. Unlike C or Java, Go doesn’t put its conditions inside parentheses with if and for. That’s another small change that feels arbitrary and makes C coders feel less comfortable.

For instance, in Go we might write this:

for i := 0; i < 100; i++ {
  ...
}

if a == 2 {
  ...
}

Which in C would look like this:

for (int i = 0; i < 100; i++) {
  ...
}

if (a == 2) {
  ...
}

Type inference

Go has type inference, from literal values or from function return values, so you don’t need to restate types that the compiler should know about. This is a bit like C++’s auto keyword (since C++11). For instance:

var a = 1 // An int.
var b = 1.0 // A float64.
var c = getThing()

There’s also a := syntax that avoids the need for var, though I don’t see the need for both in the language:

a := 1 // An int.
b := 1.0 // A float64.
d := getThing()

I love type inference via auto in modern C++, and find it really painful to use any language that doesn’t have this. Java feels increasingly verbose in comparison, but maybe Java will get there. I don’t see why C can’t have this. After all, they eventually allowed variables to be declared not just at the start of functions, so change is possible.

Types after names

Go has types after the variable/parameter/function names, which feels rather arbitrary, though I guess there are reasons, and personally I can adapt. So, in C you’d have

Foo foo = 2;

but in Go you’d have

var foo Foo = 2

Keeping a more C-like syntax would have eased C developers into the language. These are often not people who embrace even small changes in the language.

No implicit conversions

Go doesn’t have implicit conversions between types, such as int and uint, or float64 and int. This also applies to comparison via == and !=.

So, these won’t compile:

var a int = -2
var b uint = a
var c int = b

var d float64 = 1.345
var e int = d

C compiler warnings can catch some of these, but a) people generally don’t turn on all those warnings, or turn warnings into errors, and b) the warnings are not this strict.

Notice that, unlike Java, Go still has unsigned integers. Unlike C++’s standard library, Go uses signed integers for sizes and lengths. Hopefully C++ will do that too one day.

No implicit conversions because of underlying types

Go doesn’t even allow implicit conversions between types that, in C, would just be typedefs of the same underlying type. So, this won’t compile:

type Meters int
type Feet int
var a Meters = 100
var b Feet = a

I think I’d like to see this as a warning in C and C++ compilers when using typedef.

However, you are allowed to implicitly assign a literal (untyped) value, which looks like the underlying type, to a typed variable, but you can’t assign from an actual typed variable of the underlying type:

type Meters int
var a Meters = 100 // No problem.

var i int = 100
var b Meters = i // Will not compile.

No enums

Go has no enums. You should instead use const values with the iota keyword. So, while C++ code might have this:

enum class Continent {
  NORTH_AMERICA,
  SOUTH_AMERICA,
  EUROPE,
  AFRICA,
  ...
};

Continent c = Continent::EUROPE;
Continent d = 2; // Will not compile

in Go, you’d have this:

type continent int

const (
  CONTINENT_NORTH_AMERICA continent = iota
  CONTINENT_SOUTH_AMERICA // Also a continent, with the next value via iota.
  CONTINENT_EUROPE // Also a continent, with the next value via iota.
  CONTINENT_AFRICA // Also a continent, with the next value via iota.
)

var c continent = CONTINENT_EUROPE
var d continent = 2 // But this works too.

Notice how, compared to C++ enums, particularly C++11 scoped enums, each value’s name must have an explicit prefix, and the compiler won’t stop you from assigning a literal number to a variable of the enum type. Also, the Go compiler doesn’t treat these as a group of associated values, so it can’t warn you, for instance, if you forget to mention one in a switch/case block.

Switch/Case: No fallthrough by default

In C and C++, you almost always need a break statement at the end of each case block. Otherwise, the code in the following case block will run too. This can be useful, particularly when you want the same code to run in response to multiple values, but it’s not the common case. In Go, you have to add an explicit fallthrough keyword to get this behaviour, so the code is more concise in the general case.
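
For comparison, here is a minimal C++ sketch of the behaviour described above (the function is hypothetical): without a break, control falls through, and C++17’s [[fallthrough]] attribute can mark the rare intentional cases:

#include <iostream>

void describe(int n) {
  switch (n) {
  case 0:
    std::cout << "zero\n";
    [[fallthrough]]; // Intentional: zero is also described as small.
  case 1:
    std::cout << "small\n";
    break; // Without this break, "other" would also be printed.
  default:
    std::cout << "other\n";
  }
}

int main() {
  describe(0); // Prints "zero" and then "small".
}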

Switch/Case: Not just basic types

In Go, unlike in C and C++, you can switch on any comparable value, not just values known at compile time such as ints, enums, or other constexpr values. So you can switch on strings, for instance:

switch str {
case "foo":
  doFoo()
case "bar":
  doBar()
}

This is convenient, and I guess that it is still compiled to efficient machine code when it uses compile-time values. C++ seems to have resisted this convenience because it couldn’t always be as efficient as a standard switch/case, but I think that unnecessarily ties the switch/case syntax to its original meaning in C, from a time when people expected to be more aware of the mapping from C code to machine code.

Pointers, but no ->, and no pointer arithmetic

Go has normal types and pointer types, and uses * and & as in C and C++. For instance:

var a thing = getThing()
var p *thing = &a
var b thing = *p // Copy a by value, via the p pointer.

As in C++, the new keyword returns a pointer to a new instance:

var a *thing = new(thing)
var b thing = new(thing) // Compilation error.

This is like C++, but unlike Java, in which any instance of a non-fundamental type (not an int or a boolean, for instance) is effectively used via a reference that just looks like a value, which can confuse people at first by allowing inadvertent sharing.

Unlike C++, you can call a method on a value or a pointer using the same dot operator:

var a *Thing = new(Thing) // You wouldn't normally specify the type.
var b Thing = *a
a.foo()
b.foo()

I like this. After all, the compiler knows whether the type is a pointer or a value, so why should it bother me with complaints about a . where there should be a -> or vice-versa? However, along with type inference, this can slightly obscure whether your code is dealing with a pointer (maybe sharing the value with other code) or a value. I’d like to see this in C++, though it would be awkward with smart pointers.

You cannot do pointer arithmetic in Go. For instance, if you have an array, you can’t step through that array by repeatedly adding 1 to a pointer value and dereferencing it. You have to access the array elements by index, which I think involves bounds checking. This avoids some mistakes that can happen in C and C++ code, leading to security vulnerabilities when your code accesses unexpected parts of your application’s memory.

Go functions can take parameters by value or by pointer. This is like C++, but unlike Java, which always takes non-fundamental types by (non const) reference, though it can look to beginner programmers as if they are being copied by value. I’d rather have both options with the code showing clearly what is happening via the function signature, as in C++ or Go.

Like Java, Go has no notion of const pointers or const references. So if your function takes a parameter as a pointer, for efficiency, your compiler can’t stop you from changing the value that it points to. In Java, this is often done by creating an immutable type, and many Java types, such as String, are immutable, so you can’t change them even if you want to. But I prefer language support for constness as in C++, for pointer/reference parameters and for values initialized at runtime. Which leads us to const in Go.

References, sometimes

Go does seem to have references (roughly, pointers that look like values), but only for the built-in slice, map, and channel types. (See below about slices and maps.) So, for instance, this function can change its input slice parameter, and that change will be visible to the caller, even though the parameter is not declared as a pointer:

func doThing(someSlice []int) {
  someSlice[2] = 3
}

In C++, this would be more obviously a reference:

void doThing(Thing& someSlice) {
  someSlice[2] = 3;
}

I’m not sure if this is a fundamental feature of the language or just something about how those types are implemented. It seems confusing for just some types to act differently, and I find the explanation a bit hand-wavy. Convenience is nice, but so is consistency.

const

Go’s const keyword is not like const in C (rarely useful) or C++, where it indicates that a variable’s value should not be changed after initialization. It is more like C++’s constexpr keyword (since C++11), which defines values at compile time. So it’s a bit like a replacement for macros via #define in C, but with type safety. For instance:

const pi = 3.14

Notice that we don’t specify a type for the const value, so the value can be used with various types depending on the syntax of the value, a bit like a C macro #define. But we can restrict it by specifying a type:

const pi float64 = 3.14

Unlike constexpr in C++, there is no concept of constexpr functions or types that can be evaluated at compile time, so you can’t do this:

const pi = calculate_pi()

and you can’t do this

type Point struct {
  X int
  Y int
}

const point = Point{1, 2}

though you can do this with a simple type whose underlying type can be const:

type Yards int
const length Yards = 100

Only for loops

All loops in Go are for loops – there are no while or do-while loops. This simplifies the language in one way compared, for instance, to C, C++, or Java, though there are now multiple forms of for loop.

For instance:

for i := 0; i < 100; i++ {
  ...
}

or, like a while loop in C:

for keepGoing {
  ...
}

And for loops have a range-based syntax for containers such as strings, slices, or maps, which I’ll mention later:

for i, c := range things {
  ...
}

C++ has range-based for loops too, since C++11, but I like that Go can (optionally) give you the index as well as the value.

A native (Unicode) String type

Go has a built-in string type, with built-in comparison operators such as ==, !=, and < (as does Java). Like Java strings, Go strings are immutable, so you can’t change them after you’ve created them, though you can create new strings by concatenating other strings with the built-in + operator. For instance:

str1 := "foo"
str2 := str1 + "bar"

Go source code is always UTF-8 encoded, and string literals may contain non-ASCII UTF-8 code points. Go calls Unicode code points “runes”.

Although the built-in len() function returns the number of bytes, and the built-in [] operator for strings operates on bytes, there is a utf8 package for dealing with strings as runes (Unicode code points). For instance:

str := "foo"
l := utf8.RuneCountInString(str)

And the range-based for loop deals in runes, not bytes:

str := "foo"
for _, r := range str {
  fmt.Printf("rune: %q\n", r)
}

C++ still has no standard equivalent.

Slices

Go’s slices are a bit like dynamically-allocated arrays in C, though they are really views of an underlying array, and two slices can be views into different parts of the same underlying array. They feel a bit like std::string_view from C++17, or gsl::span, but they can be resized easily, like std::vector in C++ or ArrayList in Java.

We can declare a slice like so, and append to it:

a := []int{5, 4, 3, 2, 1} // A slice
a = append(a, 0)

Arrays (whose size cannot change, unlike slices) have a very similar syntax:

a := [...]int{5, 4, 3, 2, 1} // An array.
b := [5]int{5, 4, 3, 2, 1} // Another array.

You must be careful to pass arrays to functions by pointer, or they will be (deep) copied by value.

Slices are not (deep) comparable, or copyable, unlike std::array or std::vector in C++, which feels rather inconvenient.

Slices don’t grow in place beyond their capacity (which can be more than their current length) when you append values. Growing beyond the capacity means allocating a new underlying array and copying the old slice’s elements into it.

The built-in append() function does that for you, allocating a bigger underlying array if it needs more than the existing capacity. So you should always assign the result of append() like so:

a = append(a, 123)

You can take the address of an element in a slice (really of the element in the underlying array), but if an append() has since had to allocate a new underlying array, that pointer still refers to the old array, which the garbage collector keeps around until you stop using the pointer.

Unlike C or C++ arrays, and unlike operator [] with std::vector, attempting to access an invalid index of a slice will result in a panic (effectively a crash) rather than just undefined behaviour. I prefer this, though I imagine that the bounds checking has some small performance cost.
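
For contrast, a minimal C++ sketch: std::vector’s operator[] is unchecked, while at() throws, which is the closest standard C++ gets to Go’s checked indexing:

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
  std::vector<int> v = {5, 4, 3};

  // int a = v[10]; // Undefined behaviour: no bounds check at all.

  // at() is checked, throwing instead of corrupting memory.
  try {
    int b = v.at(10);
    (void)b;
  } catch (const std::out_of_range& e) {
    std::cout << "out of range: " << e.what() << '\n';
  }
}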

Maps

Go has a built-in map type. This is roughly equivalent to C++’s std::map (balanced binary trees), or std::unordered_map (hash tables). Go maps are apparently hash tables but I don’t know if they are separate-chaining hash tables (like std::unordered_map) or open-addressing hash tables (like nothing in standard C++ yet, unfortunately).

Obviously, keys in hash tables have to be hashable and comparable. The book mentions comparability; so few things are comparable in Go that they are all easily hashable too. Only basic types (int, float64, string, etc., but not slices), or structs made up only of basic types, are comparable, so that’s all you can use as a key. You can get around this by using a basic type (such as an int or string) that is (or can be made into) a hash of your value. I prefer C++’s need for a std::hash<> specialization, though I wish it was easier to write one.
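
For comparison, using a custom key type with C++’s std::unordered_map means providing equality and a std::hash<> specialization. A minimal sketch, with a hypothetical Point type:

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

struct Point {
  int x = 0;
  int y = 0;

  bool operator==(const Point& other) const {
    return x == other.x && y == other.y;
  }
};

namespace std {
template <>
struct hash<Point> {
  std::size_t operator()(const Point& p) const noexcept {
    const auto h1 = std::hash<int>()(p.x);
    const auto h2 = std::hash<int>()(p.y);
    return h1 ^ (h2 << 1); // A simple, not ideal, way to combine hashes.
  }
};
} // namespace std

int main() {
  std::unordered_map<Point, std::string> names;
  names[Point{1, 2}] = "somewhere";
}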

Unlike C++, you can’t keep a pointer to an element in a map, so changing one part of the value means copying the whole value back into the map, presumably with another lookup. Go apparently does this to completely avoid the problem of invalid pointers when the map has to grow. C++ instead lets you take the risk, specifying when your pointer could become invalid.
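
For instance, this small C++ sketch shows the kind of in-place modification via a pointer that Go’s maps rule out:

#include <cassert>
#include <string>
#include <unordered_map>

int main() {
  std::unordered_map<int, std::string> m;
  m[3] = "three";

  // C++ lets us keep a pointer to the mapped value and change it in
  // place. For std::unordered_map, pointers and references to elements
  // survive a rehash (only iterators are invalidated); erasing the
  // element is what would make this pointer dangle.
  std::string* p = &m[3];
  p->append(" (3)");
  assert(m[3] == "three (3)");
}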

Go maps are clearly a big advantage over C, where you otherwise have to use some third-party data structure or write your own, typically with very little type safety.

They look like this:

m := make(map[int]string)
m[3] = "three"
m[4] = "four"

Multiple return values

Functions in Go can have multiple return values, which I find more obvious than output parameters. For instance:

func getThings() (int, Foo) {
  return 2, getFoo()
}

a, b := getThings()

This is a bit like returning tuples in modern C++, particularly with structured bindings in C++17:

std::tuple<int, Foo> get_things() {
  return std::make_tuple(2, get_foo());
}

auto [i, f] = get_things();

Garbage Collection

Like Java, Go has automatic memory management, so you can trust that instances will not be released until you have finished using them, and you don’t need to explicitly release them. So you can happily do this, without worrying about releasing the instance later:

func getThing() *Thing {
  a := new(Thing)
  ...
  return a
}

b := getThing()
b.foo()

And you can even do this, not caring, and not easily even knowing, whether the instance was created on the stack or the heap:

func getThing() *Thing {
  var a Thing
  ...
  return &a
}

b := getThing()
b.foo()

I don’t know how Go avoids circular references or unwanted “leak” references, as Java or C++ would with weak references.

I wonder how, or if, Go avoids Java’s problem with intermittent slowdowns due to garbage collection. Go seems to be aimed at system-level code, so I guess it must do better somehow.

However, also like Java, and probably like all garbage collection, this is only useful for managing memory, not resources in general. The programmer is usually happy to have memory released some time after the code has finished using it, not necessarily immediately. But other resources, such as file descriptors and database connections, need to be released immediately. Some things, such as mutex locks, often need to be released at the end of an obvious scope. Destructors make this possible. For instance, in C++:

void Something::do_something() {
  do_something_harmless();

  {
    std::lock_guard<std::mutex> lock(our_mutex);
    change_some_shared_state();
  }
  
  do_something_else_harmless();
}

Go can’t do this, so it has defer instead, letting you specify something to happen when the enclosing function ends. It’s annoying that defer is associated with functions, not with scopes in general.

func something() {
  doSomethingHarmless()

  ourMutex.Lock()
  defer ourMutex.Unlock()
  changeSomeSharedState()

  // The mutex has not been released yet when this remaining code runs,
  // so you'd want to restrict the use of the resource (a mutex here) to
  // another small function, and just call it in this function.
  doSomethingElseHarmless()
}

This feels like an awkward hack, like Java’s try-with-resources.

I would prefer to see a language that somehow gives me all of scoped resource management (with destructors), reference counting (like std::shared_ptr<>), and garbage collection, in a concise syntax, so I can have predictable, obvious, but reliable resource releasing when necessary, and garbage collection when I don’t care.

Of course, I’m not pretending that memory management is easy in C++. When it’s difficult it can be very difficult. So I do understand the choice of garbage collection. I just expect a system level language to offer more.

Things I don’t like in Go

As well as the minor syntactic annoyances mentioned above, and the lack of simple generic resource (not just memory) management, I have a couple of other frustrations with the language.

(I’m not loving the support for object orientation either, but I’ll mention that in a later article when I’ve studied it more.)

No generics

Go’s focus on type safety, particularly for numeric types, makes the lack of generics surprising. I can remember how frustrating it was to use Java before generics, and this feels almost that awkward. Without generics I soon find myself having to choose between lack of type safety or repeatedly reimplementing code for each type, feeling like I’m fighting the language.

I understand that generics are difficult to implement, and they’d have to make a choice about how far to take them (probably further than Java, but not as far as C++), and I understand that Go would then be much more than a better C. But I think generics are inevitable once, like Go, you pursue static type safety.

Somehow Go’s slice and map containers are generic, probably because they are built-in types.

Lack of standard containers

Go has no queue or stack in its standard library. In C++, I use std::queue and std::stack regularly. I think these would need generics. People can use Go’s slice (a dynamically-allocated array) to achieve the same things, and you can wrap that up in your own type, but your type can only contain specific types, so you’ll be reimplementing it for every element type. Or your container can hold interface{} values (apparently a bit like a Java Object or a C++ void*), giving up (static) type safety.
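
For contrast, a tiny C++ sketch of the kind of type-safe reuse that generics provide:

#include <stack>
#include <string>

int main() {
  // One generic implementation, many element types, all checked at
  // compile time. This is the kind of reuse that the lack of generics
  // makes awkward in Go.
  std::stack<int> ints;
  ints.push(1);

  std::stack<std::string> strings;
  strings.push("one");
  // strings.push(2); // Would not compile: wrong element type.
}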

C++: std::string_view not so useful when calling C functions

string_view does not own string data

C++17 adds std::string_view, which is a thin view of a character array, holding just a pointer and a length. This makes it easy to provide just one method that can efficiently take either a const char*, or a std::string, without unnecessary copying of the underlying array. For instance:

void use_string(std::string_view str);

You can then call that function like so:

use_string("abc");

or

std::string str("abc");
use_string(str);

This involves no deep copying of the character array until the function’s implementation needs to do that. Most obviously, it involves no copying when you are just passing a string literal to the function. For instance it doesn’t create a temporary std::string just to call the function, as would be necessary if the function took std::string.

string_view knows nothing of null-termination

However, though the string literal (“abc”) is null-terminated, and the std::string’s buffer is null-terminated too (guaranteed since C++11), our use_string() function cannot know for sure that the underlying array is null-terminated. It could have been called like so:

const char* str = "abc"; // null-terminated
use_string(std::string_view(str, 2));  // not 3.

or even like so:

const char str[] = {'a', 'b', 'c'}; // not null-terminated
use_string(std::string_view(str, 3));

or as a part of a much larger string that we are parsing.

Unlike std::string, there is no std::string_view::c_str() that will give you a null-terminated character array. There is std::string_view::data(), but that makes no guarantee of null termination: since C++11, std::string::data() is guaranteed to return a null-terminated array, but std::string_view::data() in C++17 is not.

So if you call a typical C function, such as gtk_label_set_text(), you have to construct a temporary std::string, like so:

void use_string(std::string_view str) {
  gtk_label_set_text(label, std::string(str).c_str());
}

But that creates a copy of the array inside the std::string, even if that wasn’t really necessary. std::string_view has no way to know if the original array is null-terminated, so it can’t copy only when necessary.

This is understandable, and certainly useful for pure C++ code bases, or when using C APIs that deal with lengths instead of just null termination. I do like that it’s in the standard library now. But it’s a little disappointing in my real world of integrating with typical C APIs, for instance when implementing gtkmm.

Implementation affecting the interface

Of course, any C function that is expected to take a large string would have a version that takes a length. For instance, gtk_text_buffer_set_text() does, so we can (and gtkmm will) use std::string_view as the parameter for any C++ function that uses that C function. But it’s a shame that we can’t have a uniform API that uses the same type for all string parameters. I don’t like it when the implementation details dictate the types used in our API.

There is probably no significant performance issue for small strings, even when using the temporary std::string() technique, but it wouldn’t be nice to make the typical cases worse, even just theoretically.

In gtkmm, we could create our own string_view type, which is aware of null termination, but we are trying to be as standard as possible so our API is as obvious as possible.
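
Still, for illustration, here is a hedged sketch of what such a null-termination-aware view might look like. It’s hypothetical, not anything that exists in gtkmm:

#include <cstddef>
#include <cstring>
#include <string>

// A hypothetical sketch, not a real gtkmm type: a string view that
// remembers whether the underlying array is null-terminated, so a
// caller can copy only when really necessary.
class maybe_terminated_view {
public:
  maybe_terminated_view(const char* str) // From a C string: terminated.
      : data_(str), size_(std::strlen(str)), terminated_(true) {}

  maybe_terminated_view(const std::string& str) // Terminated since C++11.
      : data_(str.data()), size_(str.size()), terminated_(true) {}

  maybe_terminated_view(const char* str, std::size_t size) // Unknown.
      : data_(str), size_(size), terminated_(false) {}

  // Returns a null-terminated pointer, copying into storage only when
  // the viewed array is not already null-terminated.
  const char* c_str(std::string& storage) const {
    if (terminated_)
      return data_;
    storage.assign(data_, size_);
    return storage.c_str();
  }

  const char* data() const { return data_; }
  std::size_t size() const { return size_; }

private:
  const char* data_;
  std::size_t size_;
  bool terminated_;
};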

C++ Implementations of “Engineering a Compiler” Pseudocode

Implementing is understanding

I’m gradually reading through the Engineering a Compiler book. I’m enjoying it but after a while I started wondering if I really understood the algorithms and I wanted to make sure of that before going further. So I implemented the algorithms in C++ (C++17, because it’s nice), testing them against the example inputs and grammars from the book. For instance, here is my code to construct, and use, the action and goto tables which define a DFA for a bottom-up table-driven LR(1) parser, along with some test code to use it with a simple expression grammar.

So far I’ve done this for Chapter 2 (Scanners) and Chapter 3 (Parsers) and I’ll try to keep going. It is a useful exercise for me, and maybe it’s useful to someone else reading the book. Please note that the code will probably be meaningless to you if you haven’t read the book or something similar. On the other hand, the code will probably seem childlike if you’ve studied compilers properly.

Trying to get the code to work often showed me that I had had only the illusion of fully understanding. Much of the pseudocode in the book is not very clear, even when you adjust your mind to its vaguely mathematical syntax, and to know what it really means you have to read the text descriptions closely, sometimes finding clues several pages back. For instance, it’s not always clear what is meant by a particular symbol, and I found at least one conditional check that appeared in the description but not in the pseudocode. I would much rather see descriptions inline in the code.

Pseudocode is vague

I’m not a fan of pseudocode either, though I understand the difficulty in choosing a real programming language for examples. It means putting some readers off. But there could at least be some real code in an appendix. The book has some detailed walkthroughs that show that the authors must have implemented the algorithms somehow, so it’s a shame that I couldn’t just look at that code. I wonder what real programming language might be the most compact for manipulating sets of states when dealing with finite state automata states and grammar symbols.

For the parsers chapter this wasn’t helped by the, presumably traditional, ambiguity of the “FIRST” sets terminology. FIRST(symbol), FIRST(symbols), FIRST(production), and FIRST+(production) are all things.
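
To make that concrete, here is my own illustrative sketch (not the book’s code) of the classic fixed-point computation of FIRST sets. For brevity it does not add the empty string to FIRST(A) when a whole right-hand side is nullable:

#include <map>
#include <set>
#include <string>
#include <vector>

// Symbols are strings, and a nullable symbol is marked by having the
// empty string in its FIRST set.
using Symbol = std::string;
using Production = std::vector<Symbol>; // A right-hand side.
using Grammar = std::map<Symbol, std::vector<Production>>;
using FirstSets = std::map<Symbol, std::set<Symbol>>;

FirstSets build_first_sets(const Grammar& grammar,
                           const std::set<Symbol>& terminals) {
  FirstSets first;
  for (const auto& t : terminals)
    first[t] = {t}; // A terminal's FIRST set is just itself.

  bool changed = true;
  while (changed) {
    changed = false;
    for (const auto& [nonterminal, productions] : grammar) {
      auto& f = first[nonterminal];
      for (const auto& rhs : productions) {
        for (const auto& symbol : rhs) {
          for (const auto& s : first[symbol]) {
            if (!s.empty() && f.insert(s).second)
              changed = true;
          }
          // Only continue to the next symbol if this one is nullable.
          if (first[symbol].count("") == 0)
            break;
        }
      }
    }
  }

  return first;
}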

My code isn’t meant to be efficient. For instance, my LR(1) parser table-building code has some awfully inefficient use of std::set as a key in a std::map. The code is already lengthier than I’d like, so I would prefer not to make it more efficient at the cost of readability. I’ve also probably overdone it with the half-constexpr and vaguely-concepty-generic Grammar classes. But I am tempted by the idea of making these fully constexpr with some kind of constexpr map, and then one day having the compiler build the generated (DFA) tables at compile time, so I wanted to explore that just a little.

But the book is good

Despite my complaints about the presentation of the algorithms, so far I’d still recommend the book. I feel like it’s getting the ideas into my head, it’s not really that bad, and as far as I know there’s nothing better. Of course, given that I haven’t read the alternatives, my recommendation shouldn’t mean much.

Also, these days I always write up my notes as a bigoquiz.com quiz, so here are my quizzable compiler notes so far.

Boost Graph Library: modernization

Modern C++ (C++11 and later) can greatly simplify generic templated C++ code. I’ve recently been playing around with modernizing the Boost Graph Library code. The experience is similar to how I modernized the libsigc++ code, though I have not gone nearly into that much depth yet. The BGL is currently a big messy jumble of code that isn’t getting much love, and modernizing it could start to let its accumulated wisdom shine through, while also freeing it of other boost dependencies.

Please note that this is just an experiment in my own fork that even I am not pursuing particularly seriously. These changes are not likely to ever be accepted into the BGL because it would mean requiring modern C++ compilers. Personally, I think that’s the only way to move forward. I also think that Boost’s monolithic release process, and lack of a real versioning strategy, holds its libraries back from evolving. At the least, I think only a small set of generally-useful libraries should be released together, and I think that set should drop anything that’s now been moved into the standard library. BGL seems to have been stagnant for the last decade, so there doesn’t seem to be much to lose.

I’ve modernized the example code (and also tried to remove the “using namespace boost” lines), and done some work to modernize the boost graph code itself.

At the least, liberal use of auto can let you delete lots of ugly type declarations that really exist just to make things compile rather than to provide useful clues to the reader. For instance, auto makes the example code less cluttered with magic incantations. Range-based for loops simplify more code – for instance, in the examples. The simpler code is then easier to understand, letting you see the useful work underneath the boilerplate.

I’ve jumped right into C++17 and used structured bindings (and in the examples) because they are particularly helpful with the BGL API, which has many methods that return the begin and end of a range inside a std::pair. This lets us avoid some more type declarations. However, in the examples, I used a make_range_pair() utility function in many places instead, so I could use a simple range-based for. I guess that the range library would provide a type to use instead, maybe as part of the standard library one day.
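
A hedged sketch of the kind of utility I mean (the one in my fork may differ in its details):

#include <utility>

// Wraps the std::pair<iterator, iterator> returned by many BGL
// functions in a type with begin() and end(), so it can be used with
// range-based for.
template <typename Iterator>
class range_pair {
public:
  explicit range_pair(std::pair<Iterator, Iterator> pair)
      : pair_(std::move(pair)) {}

  Iterator begin() const { return pair_.first; }
  Iterator end() const { return pair_.second; }

private:
  std::pair<Iterator, Iterator> pair_;
};

template <typename Iterator>
range_pair<Iterator> make_range_pair(std::pair<Iterator, Iterator> pair) {
  return range_pair<Iterator>(std::move(pair));
}

// Usage with the BGL would then look something like:
//   for (auto vertex : make_range_pair(boost::vertices(graph))) {
//     ...
//   }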

I also replaced most uses of the boost type traits (some from boost::mpl) with the std:: equivalents. For instance, std::is_same instead of boost::is_same. It should now be possible to remove the boost::mpl dependency completely with a little work.

I’ve also tried putting all of BGL in the boost::graph namespace, instead of just in the boost namespace, but the API currently expects application code to use “using namespace boost”, to make its generic API work, and this makes that even more obvious.

As a next step, I guess that the boost graph implementation code could be simplified much more by use of decltype(auto). As I was modernizing libsigc++, I sometimes found templates that were used only by specializations that were used only by typedefs that were no longer needed, because they could be replaced by decltype(auto) return types. You have to pull at the threads one by one.

gtkmm 4 progress

gtkmm 4 is alive

We (mostly Kjell Ahlstedt and myself) have been quietly working away on gtkmm 4 and an associated ABI-breaking version of glibmm. We’ve been tracking GTK+ 4 from git master, making sure that gtkmm builds against it, and making various API-breaking or ABI-breaking changes that had been left in the code as TODO comments waiting for a chance like this.

This includes simple ABI-breaking (but not API-breaking) stuff, such as adding a base class to a widget or changing the types of a method’s parameters. Many other changes are about updating the code to be more modern C++, trying to do things the right way and avoiding duplication with new API in the C++ Standard Library.

These changes please us as purists but, honestly, as an application developer they aren’t going to give you interesting new behaviour in your applications. If you too are enthusiastic about the C++ renaissance, porting to the new APIs might feel rewarding and correct, and you’ll have to do it someday anyway to get whatever new features arrive later, but I cannot claim there is a more compelling reason to do the work. You might get shiny new features via GTK+ 4, but so far it feels like a similar exercise in internal cleanup and removal of unloved API.

There are still people stuck on GTK+ 2, because most people only port code when they need to. I don’t think the transition to GTK+ 4 or gtkmm 4 will be any faster. Certainly not until the GTK+ 4 porting advice gets a lot better. Porting gtkmm and Glom to GTK+ 4 has not been fun, particularly as the trend for API changes without explanation has only increased.

Anyway, here are some of the more significant changes so far in glibmm-2.54 (a horrible ABI name, but there are reasons) and gtkmm-4.0, both still thoroughly unstable and subject to API change:

(Lots of this is currently only in git master, but it will be in tarball releases soonish, when there are glib and gtk+ releases we can depend on.)

Deprecated API is gone

Anything that was deprecated in gtkmm 3 (or glibmm-2.4) is now completely removed from gtkmm 4 (or glibmm-2.54). You’ll want to build your application with gtkmm 3 with all deprecated API disabled before attempting to build against gtkmm 4.

In some cases this included deprecated base classes that we couldn’t let you optionally disable without breaking ABI, but now they are really really gone.

This is a perfect example of API changes that make us feel better but which are really not of much direct benefit to you as an application developer. It’s for your own good, really.

Glib::RefPtr is now std::shared_ptr

In gtkmm 3, Glib::RefPtr<> is a reference-counting smart pointer, mostly used to hide the manual GObject reference counting that’s needed in C. It worked well, but C++11 introduced std::shared_ptr, so we saw a chance to delete some code and make our API even more “standard”. So, Glib::RefPtr is now just an alias for std::shared_ptr, and we will probably change our APIs to use std::shared_ptr instead of Glib::RefPtr.

Glib::RefPtr is an intrusive smart pointer, meaning that it needs, and uses, a reference count in the underlying object rather than in the smart pointer itself. But std::shared_ptr is non-intrusive. So we now just take one reference when we instantiate the std::shared_ptr and release that one reference when the last std::shared_ptr is destroyed, via its deleter callback. That means we always need to instantiate the std::shared_ptr via Glib::make_refptr_for_instance(), but we hide that inside glibmm/gtkmm code, and application developers would rarely need to do this anyway.
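
Here is a minimal sketch of the general technique, using a stand-in C type rather than the actual glibmm code:

#include <memory>

// The wrapped object already holds one reference, so we adopt it into
// a std::shared_ptr whose custom deleter drops exactly that one
// reference when the last copy goes away.
struct CObject {
  int ref_count = 1;
};

void c_object_unref(CObject* object) { // Stand-in for g_object_unref().
  if (--object->ref_count == 0)
    delete object;
}

std::shared_ptr<CObject> make_refptr_for_instance(CObject* object) {
  return std::shared_ptr<CObject>(object, &c_object_unref);
}

int main() {
  auto ref = make_refptr_for_instance(new CObject); // Adopts the reference.
  auto shared = ref; // Copies share the single underlying reference.
}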

std::shared_ptr<> is a familiar type, so using it in our API should make it easier for the average person to reason about our API. And using it makes us part of the wider ongoing conversation in the C++ community about how to indicate and enforce ownership. For instance, we might start taking Thing* parameters instead of std::shared_ptr<Thing> parameters when we know that the called method will not need to share ownership. I mentioned this idea in an earlier post. However, we often cannot assume much about what the underlying C function really does.

This is a big change. Hopefully it will work.

Now uses libsigc++-3.0 instead of libsigc++-2.0

I rewrote libsigc++ for modern C++, using variadic templates and some other modern C++ techniques. It feels like libsigc++-2.0, but the code is much simpler. Compiler errors might be slightly less cryptic. This requires C++14.

Enums are inside related classes

Enums that were only used with a particular class are now inside that class. For instance, Gio::ApplicationFlags is now Gio::Application::Flags. This makes the API slightly clearer. This didn’t need C++11, but it did need an API break.

Enums are now C++11 enum classes

Enums are now declared as “enum class” (scoped enumerations) instead of “enum” (unscoped enumerations), so they cannot be implicitly converted to other types such as bool or int. That lets the compiler find some subtle programmer errors.
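
For instance, a small sketch (with hypothetical names) of what the change prevents:

enum Legacy { LEGACY_A, LEGACY_B }; // Unscoped: converts implicitly.
enum class Scoped { A, B }; // Scoped: no implicit conversions.

int main() {
  int i = LEGACY_A; // Compiles: subtle bugs can hide here.
  // int j = Scoped::A; // Does not compile.
  // if (Scoped::A) {} // Does not compile either.
  int k = static_cast<int>(Scoped::A); // The conversion must be explicit.
  (void)i;
  (void)k;
}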

The enum values must now be prefixed by the enum name rather than having a prefix as part of the value name. For instance, we now use Gio::Application::Flags::HANDLES_OPEN instead of Gio::Application::FLAGS_HANDLES_OPEN (actually Gio::APPLICATION_FLAGS_HANDLES_OPEN before we put enums inside classes).

Gtk::TreeView and Gtk::TextView now have real const_iterators

This lets us make the API more const-correct, requiring fewer arbitrary const_cast<>s in application code.

Removed the old intermediate ListHandle/SListHandle/ArrayHandle/SArrayHandle types

Long ago, the gtkmm API used these intermediate types, wrapping glib types such as GList, GSList, and C arrays, so you didn’t need to choose between using a std::list, std::vector, or other standard container. In gtkmm 3 we decided that this was more trouble than it was worth, and decided to just use std::vector everywhere, but it’s only now that we’ve been able to remove these types from the glibmm, pangomm, and atkmm APIs.

A possible change: Not using Glib::ustring

We are still considering whether to replace uses of Glib::ustring in our API with std::string, maybe just keeping Glib::ustring for when people really want to manipulate UTF-8 “characters”. I would much prefer standard C++ to have real UTF-8 support, for instance letting us step through and insert characters in a UTF-8 string, but that doesn’t look like it will happen in the foreseeable future.

Glib::ustring still wraps useful UTF-8 APIs in glib, in a std::string-like API, so we wouldn’t want to remove it.

Also, currently it’s useful to know that, for instance, a gtkmm method that returns a Glib::ustring is giving us a UTF-8 string (such as Gtk::FileChooser::get_current_name()), rather than something of unknown encoding (such as Gtk::FileChooser::get_filename()). We allow implicit conversions, for convenience, so we can’t use the compiler to check for awareness of these encoding differences, but having it in the method signature still feels nicer than having to read a method’s documentation.

Trying googletest

Recently I’ve played a bit with googletest in a couple of small projects: in murrayc-tuple-utils (and some later commits) and in prefixsuffix. It’s pretty straightforward and improved my test code and test result output (test-suite.log with autotools). Here are some notes.

Code Changes

I usually just use the standard autotools and CMake test features, sprinkling assert()s and “return EXIT_FAILURE;” in my C++ test code. However, googletest lets me replace a simple assert():

assert(foo.get_thing() == "hello");

with an EXPECT_EQ() call that will give me a clue when the test fails:

EXPECT_EQ("hello", foo.get_thing());

There are some other simple assertions.

googletest also wants me to declare test functions, previously simply declared like so,

void test_thing_something() {

now like so:

TEST("TestThing", "Something") {

which is all a bit macrotastic, unfortunately, but it works.

I chose to use the standard main() implementation, so I just removed my main() function that called these tests and compiled gtest_main.cc into the gtest library, as suggested in the googletest documentation.

Build Changes

googletest is meant to be built in your project, not built separately and just linked to as a library. For performance testing, I can see the benefit of building the test framework with exactly the same environment and options as the code being tested, but this does seem rather awkward. I guess it’s just what Google do in their monolithic builds so there’s nobody making an effort to support any other build system.

Anyway, it’s fairly easy to add googletest as a git submodule and build the library in your autotools or CMake project. For autotools, I did this with libtool in murrayc-tuple-utils and without libtool in prefixsuffix.

Unfortunately, I had to list the individual googletest source files in EXTRA_DIST to make this work with autotools’ “make distcheck”.

This is easier with CMake, because googletest has a CMakeLists.txt file, so you can use “add_subdirectory(googletest)”. Of course, CMake doesn’t have an equivalent for “make distcheck”.

Also unfortunately, I had to stop using the wonderful -Wsuggest-override warning with warnings as errors, because googletest doesn’t use the override keyword. I think it hasn’t really caught up with C++11 yet, which seems odd, as I guess Google is using at least C++11 in all its code.

Conclusion

The end result is not overwhelming so far, and it’s arguable if it’s worth having to deal with the git submodule awkwardness. But in theory, when the tests fail, I now won’t have to add so many printfs to find out what happened. This makes it a bit more like using JUnit with Java projects.

Also in theory, the test output would integrate nicely with a proper continuous integration system, such as Jenkins, or whatever Google use internally. But I’m not using any of those on these projects. Travis-CI is free with github projects, but it seems to be all about builds, with no support for test results.

Suffix Tree: Suffix Array and LCP Array, C++

I played a bit more with my C++ Suffix Tree experiment, gradually exploring how Suffix Trees are equivalent to a Suffix Array and its associated LCP Array, allowing us to do anything with Suffix Arrays (and LCP arrays – Longest Common Prefix) that we can do with Suffix Trees, but with less memory (but only by a constant factor) and, I think, more cache-friendly use of memory. This code is still far from suitable for any real use.

Converting between Suffix Trees and Suffix Arrays

I first moved my previous code for Ukkonen’s algorithm from a branch into master, so we can now construct a SuffixTree from a string, in O(n) linear time. Subsequent insert()s still use the naive O(n^2) code. Traditionally, you’d concatenate all your strings with unique separator characters, but I’m trying to avoid that. And, really, I’m just trying to explore the use of the relevant algorithms – it’s a bonus if I can wrap it up into something useful.

I then added a get_suffix_array_and_lcp_array() method to get the Suffix Array and associated LCP Array, in O(n) linear time. This works by doing a lexicographically-ordered depth first search.
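
As a reminder of what those two arrays contain, here is a naive sketch of my own (not the post’s linear-time tree walk) that builds them directly from a string:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Building the suffix array by sorting suffix start positions is
// O(n^2 log(n)) at worst, but it shows exactly what the array holds.
std::vector<std::size_t> build_suffix_array(const std::string& str) {
  std::vector<std::size_t> sa(str.size());
  std::iota(sa.begin(), sa.end(), 0);
  std::sort(sa.begin(), sa.end(), [&str](std::size_t a, std::size_t b) {
    return str.compare(a, std::string::npos, str, b, std::string::npos) < 0;
  });
  return sa;
}

// lcp[i] is the length of the longest common prefix of the suffixes
// starting at sa[i - 1] and sa[i]. lcp[0] is conventionally 0.
std::vector<std::size_t> build_lcp_array(const std::string& str,
                                         const std::vector<std::size_t>& sa) {
  std::vector<std::size_t> lcp(sa.size(), 0);
  for (std::size_t i = 1; i < sa.size(); ++i) {
    std::size_t a = sa[i - 1];
    std::size_t b = sa[i];
    std::size_t len = 0;
    while (a + len < str.size() && b + len < str.size() &&
           str[a + len] == str[b + len]) {
      ++len;
    }
    lcp[i] = len;
  }
  return lcp;
}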

Finally, I added a SuffixTree constructor that takes such a Suffix Array and LCP Array to build the suffix tree, again in O(n) linear time. This iterates over the suffixes in the Suffix Array adding a node for each, walking back up the tree to split at the point specified by the LCP array. I kept the path in a stack for that walk up. This constructor let me test that I could use a Suffix Array (and LCP Array) obtained from the SuffixTree to recreate the SuffixTree correctly.

This was based on the ideas I learned from Dan Gusfield’s Fundamentals of Stringology lectures, specifically part 3. Incidentally, his later discussion of the Burrows Wheeler Transform led me to Colt McAnlis’ wonderful Compressor Head series, which far more people should watch.

Towards a usable SuffixArray API

I’ve already created a SuffixArray class to do much of what the SuffixTree class can do, such as finding matches for a substring. It currently does an O(log(n)) binary search over the suffix array. But this should be possible in O(m) time, where m is the length of the substring, hopefully with far fewer string comparisons, just as it is possible with a Suffix Tree. This can apparently be done with the help of the LCP Array. Hopefully I’ll get to that.

I also want to implement the Kärkkäinen & Sanders DC3 algorithm to construct the Suffix Array directly from the string in O(n) linear time. And, if I understand Dan Gusfield’s lecture properly, we can use the Burrows-Wheeler transform to construct the associated LCP Array, also in linear time. Hopefully I’ll get to that too.

gtkmm 4 started

We (the gtkmm developers) have started work on an ABI-breaking gtkmm-4.0, as well as an ABI-breaking glibmm, targeting GTK+ 4, and letting us clean up some cruft that has gathered over the years. These install in parallel with the existing gtkmm-3.0 and glibmm-2.4 APIs/ABIs.

A couple of days ago I released first versions of glibmm (2.51.1) and gtkmm (3.89.1), as well as accompanying pangomm and atkmm releases.

This also lets us use my rewrite of libsigc++ for libsigc++-3.0, bringing us more fully into the world of “modern C++”. We might use this opportunity to make other fundamental changes, so now is the time to make suggestions.

We did a parallel-installing ABI-breaking glibmm even though glib isn’t doing that. That’s because, now that GTK+ is forcing us to do it for gtkmm, this seems like as good a time as any to do it for glibmm too. It’s generally harder to maintain C++ ABIs than C ABIs, largely because C++ is just more complicated and its types are more exactly specified. For instance we’d never have been able to switch to libsigc++-3.0 without breaking ABI.

Suffix Tree, Ukkonen, C++

I’ve been playing with suffix trees recently, in particular trying to get my head around Ukkonen’s algorithm for suffix tree construction. I’ve got the hang of it now after trying out some code and checking it against this excellent video walk through of Ukkonen’s algorithm by Tushar Roy.

Tries, Radix Trees, Suffix Trees

A suffix tree stores every suffix of a string, or of several strings, and makes it easy to find prefixes of those suffixes, in linear O(m) time, where m is the length of the substring. So it lets you find substrings in O(m) time. So it can be used for full text search, and it’s often used for DNA comparison.

A suffix tree is actually a radix tree (also known as a radix trie), with suffixes inserted instead of just the whole string. A radix tree is itself a compacted trie (also known as a prefix tree).

A trie (pronounced “try”) is a tree with characters on the edges, letting you discover the whole string as you traverse the edges, starting with the edge from the root for the first character of the string. For instance, this lets you offer text completion if the user gives you the first few correct characters of a word.

A radix tree compacts the trie by letting each edge have several characters, splitting an edge only when necessary to distinguish different strings. So, while a trie containing “abc” would have a chain of root->a->b->c, a radix tree would have just root->abc. Adding “abd” would change that to root->ab->c, with another edge for root->ab->d.

A suffix tree, or radix tree, can save space by storing just the start and end indices of its strings instead of the strings themselves. Then each edge has constant size, and the space for the tree is linear, O(m), where m is the length of one inserted string. Or maybe you could say it’s O(nk), for n strings of length k, treating k as effectively constant.

For the string “banana” a suffix tree would look like this, with both leaf nodes and intermediate end markers marked with *.

root
    banana*
    a*
     na*
       na*
    na*
      na*

We can add a second string in the same suffix tree, associating values with the leaf at the end of each path, letting us know which items have a particular substring. For instance, if I add “banana” (value 0) and “bandana” (value 1) to a suffix tree, I can ask about “ana” and discover that it identifies both items (value 0 and value 1), via the a->n->a path. The tree looks like this:

ban
   ana (0)
   dana (1)
a (0, 1)
 n
  a (0, 1)
   na (0)
  dana (1)
n
 a (0, 1)
  na (0)
 dana (1)
dana (1)

Suffix Tree Construction

The hard part is constructing the suffix tree. When we insert one string of length m, we actually insert m suffixes. If each suffix insertion takes O(m) time, then the whole construction time is at least O(m^2).

Ukkonen’s algorithm reduces this to O(m) by iterating through the string just once from start to end and using the ideas of an implicit “global end”, the “active point” and “suffix links” to jump to wherever the rest of a smaller suffix should be inserted.

Each “phase” deals with one character from the string, in sequence, performing “extensions” within that “phase” to add suffixes that start with that character. This is useful because of the repeating sub-structure of suffix trees – each sub-tree can appear again as part of a smaller suffix.

Global End

Whenever the algorithm adds a node, that’s going to be a leaf node, meaning the edge to the leaf node will end at the current end of the inserted string. So the algorithm assigns a “global end” to the edge. We increment that end index each time we process the next character, so these edges get longer automatically.

Active point

The active point identifies a point on an edge in the tree by specifying a node, an edge from that node (by specifying a first character), and a length along that edge. Within every “extension”, the algorithm starts looking for the next character at the active point. Sometimes it will find the next character only because an edge got longer automatically when we incremented our global end.

Whenever it finds the next character from that active point, maybe even after stepping across an intermediate node, it sets that newly-found position as the active point. This is called Rule 3.

Whenever it doesn’t find the next character from that active point, it adds a node, splitting the edge if necessary. This is called Rule 2. Whenever it splits an edge and creates an internal node, it changes the active point to the same character in a smaller edge, and looks again for the same next character. This lets it do the same split in the smaller suffix. It keeps doing this until all the smaller suffixes have been split in the same way.

This example, with the string “banana”, shows the use of the global end and of Rule 3 and Rule 2. It doesn’t show the use of suffix links yet, because it always splits edges from the root. Note that Ukkonen’s algorithm typically appends a special out-of-alphabet “$” character to the string so it can insert leaf nodes.

b: Look from root. Add to root.

b

a: Look from root. Add to root.

b
ba (because we incremented the global end).

n: Look from root. Add to root.

ban (because we incremented the global end).
an (because we incremented the global end).
n

a: Look from root. Find it (“ana”). Set it as the active point.

bana (because we incremented the global end).
ana (because we incremented the global end).
na

n: Look from previously-found “a”, in “_a_na”. Find it. Set it as the active point.

banan (because we incremented the global end).
anan (because we incremented the global end).
nan

a: Look from previously-found “n” in “a_n_an”. Find it. Set it as the active point.

banana (because we incremented the global end).
anana (because we incremented the global end).
nana

$: Look from previously-found “a” in “an_a_na”. Not found, so add an internal node and add the $. Change the active point to the same character in a smaller (suffix) edge: “a” in “n_ana”.

banana$ (because we incremented the global end).
ana
   na$ (because we incremented the global end).
   $
nana$

$: Look from “a” in “n_a_na$”. Not found, so add an internal node and the $.
Change the active point to the same character in a smaller (suffix) edge: “a” in “_a_na$”.

banana$
ana
   na$
   $
na
  na$
  $

$: Look from “a” in “_a_na$”. Not found, so add an internal node and the $. Change the active point to root.

banana$
a
 na
   na$
   $
 $
na
  na$
  $

$: Look from root. Add to root.

banana$
a
 na
   na$
   $
 $
na
  na$
  $
$

Suffix Links

In Rule 2, when the algorithm splits an edge, adding an internal node, it sets a suffix link for that internal node, pointing to root. If it has previously set a suffix link for any other internal node while looking for the same character (in the same “phase”), it updates the link for that previously-created internal node to point to the new node.

If the active node is not root after adding the internal node, it also changes the active node via the active node’s suffix link, taking us to a previously-created internal node. As before, this lets us jump to the smaller suffix of the same substring, to make the same split there, but suffix links let us do it even when we are not on an edge from root.

We can see this when using “banbadban” (with the usual appended “$”) as an example. We reach this point:

$: Look from “n” in “ba->_n_badban$”. Not found, so add an internal node and add the $. Follow the suffix link from the “n” to the “n” in “a->_n_badban$”.

ba
  n
   badban$
   $
  dban$
a
 nbadban$
 dban$
nbadban$
dban$

$: Look from “n” in “a->_n_badban$”. Not found, so add an internal node and add the $. Follow the suffix link from the “n” to the “n” in “_n_badban$”.

ba
  n
   badban$
   $
  dban$
a
 n
  badban$
  $
 dban$
nbadban$
dban$

$: Look from “n” in “_n_badban$”. Not found, so add an internal node and add the $. Follow the suffix link from the “n” to root.

ba
  n
   badban$
   $
  dban$
a
 n
  badban$
  $
 dban$
n
 badban$
 $
dban$

$: Look from root. Not found. Add to root.

ba
  n
   badban$
   $
  dban$
a
 n
  badban$
  $
 dban$
n
 badban$
 $
dban$
$

Example Code

Here is some example C++ code. It’s also on github, with far more comments.

const auto key_start = key.start_;
const auto key_end = str_end(key);

ActivePoint active;
active.node = &root_;
active.edge_valid = false; // Instead of -1.
active.length = 0;

std::size_t remaining = 0;
auto end_ptr = std::make_shared<KeyIterator>(key_start);
KeyIterator& end = *end_ptr;

// The "phases"
for (auto i = key_start; i != key_end; ++i) {
  ++remaining;
  ++end;

  Node* prev_created_internal_node = nullptr;

  // The "extensions".
  while(remaining) {
    const auto edge_match = (active.edge_valid && active.length) ?
      find_partial_edge(active, i) :
      find_partial_edge(active.node, i);
    const auto edge = edge_match.edge_;
    const auto part_len_used = edge_match.edge_part_used_;

    if (!edge_match.char_found_) {
      KeyInternal prefix(i, end_ptr);

      // Rule 2 extension: There is no match:
      if (part_len_used == 0) {
        active.node->append_node(prefix, value);
      } else {
        auto extra_node = edge->split(part_len_used);
        extra_node->append_node(prefix, value);
        extra_node->suffix_link_ = &root_;

        if (prev_created_internal_node) {
          prev_created_internal_node->suffix_link_ = extra_node;
        }
        prev_created_internal_node = extra_node;

        if (active.node != &root_) {
          active.node = active.node->suffix_link_;
        } else {
          --active.length;
          ++active.edge;
        }
      }

      --remaining;
      continue;
    }

    // Rule 3 extension:
    active.node = edge_match.parent_node_;
    active.edge = edge->part_.start_;
    active.edge_valid = true;
    active.length = part_len_used;

    break;
  }
}

Generic Suffix Tree in C++

However, the basic Ukkonen’s algorithm just stores one string. Before trying Ukkonen’s algorithm, I wrote a templated C++ suffix tree that can contain multiple strings (see the test code), with a simple O(m^2), or possibly O(m^3) construction time.

I created a version that uses Ukkonen’s algorithm, reducing construction to O(m) time, but it doesn’t support multiple inserts. In this implementation, I’ve avoided the need for extra “$” leaf nodes. I’ve also stored the suffix links in a hash table, to avoid wasting space per node even after construction, though that replaces a true constant-time read/write with an amortized constant-time hash table insert/lookup.

I’m trying to adapt the Ukkonen-based suffix tree to support multiple inserts. Usually, to create such a “generalized” suffix tree with Ukkonen’s algorithm, you concatenate the strings together, separated by unique out-of-alphabet characters (“$”, “#”, etc), insert the whole thing, then prune the paths you don’t need, but I’d like to avoid that.

I’m not suggesting that anyone use this code for serious work. For now, it’s littered with asserts() and debug output, and it’s in need of optimization because it’s generalized for any standard C++ container, such as std::string. For instance, it doesn’t seem wise to store std::string::iterators, instead of pointers, for the start/end ranges on edges. I’m also bothered by the use of std::shared_ptr<>s to share the “global end”, which still takes space per edge even after construction.

Alternative Algorithms

Update: I’ve just watched Erik Demaine’s 2012 MIT “Strings” lecture about suffix trees (I needed to watch his previous “Static Strings” lecture first). He doesn’t mention Ukkonen’s algorithm, but does mention a “DC3 (Difference Cover 3)” algorithm by Kärkkäinen-Sanders-Burkhardt (at least the lecture’s student notes call it that) for building suffix arrays in linear time. He shows that a suffix tree can then be constructed from a suffix array by using the LCP (Longest Common Prefix) array of the suffix array to build a cartesian tree, again in linear time. So the whole suffix tree construction would be linear time. Suffix arrays themselves offer better data locality than suffix trees anyway, and this DC3 algorithm apparently allows parallelization. I hope I find time to explore all that sometime in code.

He also mentions the use of a “suffix tray” to store the edges in each node, to get O(P + lg(Σ)) query time, much the same as the O(P) time you’d get with an array, while reducing the overall space complexity to O(T) instead of O(TΣ). Here P is the size of the query string, T is the size of the text being searched, and Σ is the size of the alphabet. He also mentioned instead using a combination of hash tables and van Emde Boas trees to get O(P + lglg(Σ)) query time. Those parts of the lecture felt quite rushed to me, so I’d like to investigate that sometime too if I find time.

C++17: constexpr if

Yesterday I played with constexpr if a little, using clang++ built from the latest LLVM sources. C++17 will not have many of the features I was looking forward to, but this is a really nice convenience.

It lets us clean up some of those little templates-plus-specializations that only exist to execute different code in a parent template depending on the template’s type, particularly when the specialization is on a bool or integer value. It won’t help if the template specializations need different member variables or need to provide different type aliases or typedefs.

For instance, libsigc++-3.0 still had a little with_type template with partial template specializations for true and false. Replacing it with constexpr if makes the intention clearer.
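
For instance, here is a small hypothetical example (not libsigc++’s actual with_type) of the pattern and its constexpr if replacement:

#include <iostream>
#include <type_traits>

// Get a printable value from either a pointer or a plain value.

// Before C++17, this needed a helper template plus a specialization:
template <typename T, bool is_pointer>
struct value_getter {
  static const T& get(const T& value) { return value; }
};

template <typename T>
struct value_getter<T, true> {
  static auto get(const T& value) { return *value; }
};

// With constexpr if, one function does the job, and the discarded
// branch is not instantiated for types where it would not compile:
template <typename T>
auto get_value(const T& value) {
  if constexpr (std::is_pointer_v<T>) {
    return *value;
  } else {
    return value;
  }
}

int main() {
  int i = 42;
  std::cout << get_value(i) << '\n'; // Prints 42.
  std::cout << get_value(&i) << '\n'; // Also prints 42.
}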

I made many similar simplifications by using constexpr if in my murrayc-tuple-utils, which is used by libsigc++-3.0 via copy/paste/rename.

In my murrayc-dp-algorithms project, some code became much clearer and more efficient with constexpr if, allowing me to remove some template specializations that only existed to satisfy the compiler.