Tag Archives: C++17

C++: std::string_view not so useful when calling C functions

string_view does not own string data

C++17 adds std::string_view, which is a thin view of a character array, holding just a pointer and a length. This makes it easy to provide just one method that can efficiently take either a const char*, or a std::string, without unnecessary copying of the underlying array. For instance:

void use_string(std::string_view str);

You can then call that function like so:

use_string("abc");

or

std::string str("abc");
use_string(str);

This involves no deep copying of the character array until the function’s implementation needs to do that. Most obviously, it involves no copying when you are just passing a string literal to the function. For instance it doesn’t create a temporary std::string just to call the function, as would be necessary if the function took std::string.

string_view knows nothing of null-termination

However, though the string literal (“abc”) is null-terminated, and the std::string is almost-certainly  null-terminated (but implementation defined), our use_string() function cannot know for sure that the underlying array is null terminated. It could have been called liked so:

const char* str = "abc"; // null-terminated
use_string(std::string_view(str, 2));  // not 3.

or even like so:

const char str[] = {'a', 'b', 'c'}; //not null-terminated
use_string(std::string_view(str, 3));

or as a part of a much larger string that we are parsing.

Unlike std::string, there is no std::string_view::c_str() which will give you a null-terminated character array. There is std::string_view::data() but, like std::string::data(), that doesn’t guarantee the the character array will be null-terminated. (update: since C++11, std::string::data() is guaranteed to be null-terminated, but std::string_view::data() in C++17 is not.)

So if you call a typical C function, such as gtk_label_set_text(), you have to construct a temporary std::string, like so:

void use_string(std::string_view str) {
  gtk_label_set_text(label, std::string(str).c_str());
}

But that creates a copy of the array inside the std::string, even if that wasn’t really necessary. std::string_view has no way to know if the original array is null-terminated, so it can’t copy only when necessary.

This is understandable, and certainly useful for pure C++ code bases, or when using C APIs that deal with lengths instead of just null termination. I do like that it’s in the standard library now. But it’s a little disappointing in my real world of integrating with typical C APIs, for instance when implementing gtkmm.

Implementation affecting the interface

Of course, any C function that is expected to take a large string would have a version that takes a length. For instance, gtk_text_buffer_set_text(), so we can (and gtkmm will) use std::string_view as the parameter for any C++ function that uses that C function. But it’s a shame that we can’t have a uniform API that uses the same type for all string parameters. I don’t like when the implementation details dictate the types used in our API.

There is probably no significant performance issue for small strings, even when using the temporary std::string() technique, but it wouldn’t be nice to make the typical cases worse, even just theoretically.

In gtkmm, we could create our own string_view type, which is aware of null termination, but we are trying to be as standard as possible so our API is as obvious as possible.

C++ Implementations of “Engineering the Compiler” Pseudocode

Implementing is understanding

I’m gradually reading through the Engineering a Compiler book. I’m enjoying it but after a while I started wondering if I really understood the algorithms and I wanted to make sure of that before going further. So I implemented the algorithms in C++ (C++17, because it’s nice), testing them against the example inputs and grammars from the book. For instance, here is my code to construct, and use, the action and goto tables which define a DFA for a bottom-up table-driven LR(1) parser, along with some test code to use it with a simple expression grammar.

So far I’ve done this for Chapter 2 (Scanners) and Chapter 3 (Parsers) and I’ll try to keep going. It is a useful exercise for me, and maybe it’s useful to someone else reading the book.  Please note that the code will probably be meaningless to you if you haven’t read the book or something similar. On the other hand, the code will probably seem childlike if you’ve studied compilers properly.

Trying to get  the code to work often showed me that I had had only the illusion of fully understanding. Much of the pseudocode in the book is not very clear, even when you adjust your mind to its vaguely mathematical syntax, and to know what it really means you have to closely read the text descriptions, sometimes finding clues several pages back. For instance it’s not always clear what is meant by a particular symbol, and I found at least one conditional check that appeared in the description but not in the code. I would much rather see descriptions inline in the code.

Pseudocode is vague

I’m not a fan of pseudocode either, though I understand the difficulty in choosing a real programming language for examples. It means putting some readers off. But there could at least be some real code in an appendix. The book has some detailed walkthroughs that show that the authors must have implemented the algorithms somehow, so it’s a shame that I couldn’t just look at that code. I wonder what real programming language might be the most compact for manipulating sets of states when dealing with finite state automata states and grammar symbols.

For the parsers chapter this wasn’t helped by the, presumably traditional, ambiguity of the “FIRST” sets terminology. FIRST(symbol), FIRST(symbols), FIRST(production), and FIRST+(production) are all things.

My code isn’t meant to be efficient. For instance, my LR(1) parser table-building code has some awfully inefficient use of std::set as a key in a std::map. The code is already lengthier than I’d like, so I would prefer not to make it more efficient at the cost of readability. I’ve also probably overdone it with the half-constexpr and vaguely-concepty-generic Grammar classes. But I am tempted by the idea of making these fully constexpr with some kind of constexpr map, and then one day having the compiler build the generated (DFA) tables at compile time, so I wanted to explore that just a little.

But the book is good

Despite my complaints about the presentation of the algorithms, so far I’d still recommend the book. I feel like it’s getting the ideas into my head , it’s not really that bad, and as far as I know there’s nothing better. Of course, given that I haven’t read the alternatives, my recommendation shouldn’t mean much.

Also, these days I always write up my notes as a bigoquiz.com quiz, so here are my quizzable compiler notes so far.

Boost Graph Library: modernization

Modern C++ (C++11 and later) can greatly simplify generic templated C++ code. I’ve recently been playing around with modernizing the Boost Graph Library code. The experience is similar to how I modernized the libsigc++ code, though I have not gone nearly into that much depth yet. The BGL is currently a big messy jumble of code that isn’t getting much love, and modernizing it could start to let its accumulated wisdom shine through, while also freeing it of other boost dependencies.

Please note that this is just an experiment in my own fork that even I am not pursuing particularly seriously. These changes are not likely to ever be accepted into the BGL because it would mean requiring modern C++ compilers. Personally, I think that’s the only way to move forward. I also think that Boost’s monolithic release process, and lack of a real versioning strategy, holds its libraries back from evolving. At the least, I think only a small set of generally-useful libraries should be released together, and I think that set should drop anything that’s now been moved into the standard library. BGL seems to have been stagnant for the last decade, so there doesn’t seem to be much to lose.

I’ve modernized the example code (and also tried to remove the “using namespace boost” lines), and done some work to modernize the boost graph code itself.

At the least, liberal use of auto can let you delete lots of ugly type declarations that really exist just to make things compile rather than to provide useful clues to the reader. For instance, auto makes the example code less cluttered with magic incantations. Range-based for loops simplify more code – for instance, in the examples.The simpler code is then easier to understand, letting you see the useful work underneath the boiler plate.

I’ve jumped right into C++17 and used structured bindings (and in the examples) because they are particularly helpful with the BGL API, which has many methods that return the begin and end of a range inside a std::pair. This lets us avoid some more type declarations. However, in the examples, I used a make_range_pair() utility function in many places instead, so I could use a simple range-based for. I guess that the range library would provide a type to use instead, maybe as part of the standard library one day.

I also replaced most uses of the boost type traits (some from boost::mpl) with the std:: equivalents. For instance, std::is_same instead of boost::is_same. It should now be possible to remove the boost::mpl dependency completely with a little work.

I’ve also tried putting all of BGL in the boost::graph namespace, instead of just in the boost namespace, but the API currently expects application code to use “using namespace boost”, to make its generic API work, and this makes that even more obvious.

As a next step, I guess that the boost graph implementation code could be simplified much more by use of decltype(auto). As I was modernizing libsigc++, I sometimes found templates that were used only by specializations that were used only by typedefs that were no longer needed, because they could be replaced by decltype(auto) return types. You have to pull at the threads one by one.

C++17: constexpr if

Yesterday I played with constexpr if a little, using clang++ built from the latest LLVM sources. C++17 will not have many of the features I was looking forward to, but this is a really nice convenience.

It lets us clean up some of those little templates-plus-specializations that only exist to execute different code in a parent template depending on the template’s type, particularly when the specialization is on a bool or integer value. It won’t help if the template specializations need different member variables or need to provide different type aliases or typedefs.

For instance, libsigc++-3.0 still had a little with_type template with partial template specializations for true and false. Replacing it with constexpr if makes the intention clearer.

I made many similar simplifications by using constexpr if in my murrayc-tuple-utils, which is used by libsigc++-3.0 via copy/paste/rename.

In my murrayc-dp-algorithms project, some code became much clearer and more efficient with constexpr if, allowing me to remove some template specializations that only existed to satisfy the compiler.