µUBSan: clean-room reimplementation of the Undefined Behavior Sanitizer runtime

alexhutcheson · on Aug 8, 2018

Why was a clean-room approach necessary here? The UIUC License used by UBSan is extremely permissive, so going through extra effort to avoid creating a derivative work doesn't make much sense to me.

ajross · on Aug 8, 2018

It's not. They're just using the term as a synonym for "rewrite". There's no documentation of any actual IP isolation in the linked article. They just want people to know it's new and not based on the existing LLVM or Linux runtimes.

alexhutcheson · on Aug 8, 2018

This seems correct. Apologies for the unintentionally pedantic comment.

jmfisch · on Aug 8, 2018

Speculating:

- The author wanted the personal challenge for the task

- The author wanted the code licensed under a BSD 2-clause and felt the existing license didn't fit her/his ideals.

floatboth · on Aug 8, 2018

> The original Clang/LLVM runtime is written in C++ with features that are not available in libc and in the NetBSD kernel

alexhutcheson · on Aug 8, 2018

Yeah, I understand why they would write a clone, but "clean-room" has a specific meaning[1], and it's not clear why you would want or need that extra effort here.

Of course, I could just be misunderstanding, and they could be using "clean-room" as a synonym for "from scratch", rather than the meaning I linked to.

[1] https://en.wikipedia.org/wiki/Clean_room_design

floatboth · on Aug 8, 2018

Oh, right, I didn't even think about that meaning since we're not in the context of proprietary software. I'm 99.99% sure it's just "from scratch" here.

liuliu · on Aug 8, 2018

My reading is that they didn't look at the Linux implementation (as "clean-room").

tux1968 · on Aug 8, 2018

This is an optional runtime used to improve error reporting with the Clang sanitizer:

https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

brandmeyer · on Aug 8, 2018

Darned Good Stuff!

Is the runtime API shared between both GCC and LLVM/Clang?

berti · on Aug 9, 2018

It's sad to see so much duplicated effort. What does this accomplish that a few patches to the llvm implementation couldn't?

ddevault · on Aug 8, 2018

>I've decided to write the whole µUBSan runtime as a single self-contained .c soure-code file, as it makes it easier for it to be reused by every interested party.

I don't really get why people do this. Linking is one of the easiest and most broadly supported features of C environments on every platform.

cryptonector · on Aug 8, 2018

You're not going to like how PuTTY is written then either. The entire initial portion of the SSHv2 protocol, from version string exchange to the end of user authentication, is coded as one huge (10Kloc) C function that uses a macro-driven Duff's device to implement co-routine behavior (it's all async I/O under the covers).

To me, one huge file or many small ones doesn't matter all that much because I have cscope on my side. What matters is being able to find things quickly. Still, one huge file means I can asterisk on a symbol in vim to quickly find all references to it without having to switch to cscope -- that's not nothing.

Also, https://github.com/NetBSD/src/blob/trunk/common/lib/libc/mis... is just not that big... only 1639 lines, 1310 sloc. Compare to, let's say, OpenSSL, where 29 C source files have more lines than ubsan.c, with several being more than twice the size. Maybe you think OpenSSL is not a fair example because there's lots of tables and what not? Even if you look at files outside crypto/, you'll find lots of big ones. Just for fun I looked at a variety of other open source projects: Heimdal, MIT Kerberos, glibc, PostgreSQL -- you'll be shocked when you look at their file sizes, even when you elide files that are obviously mostly-data.

Source file size is not that interesting. The contents is. When a set of sources is small enough, organizing it into multiple files is not necessarily a win. 1.3kloc doesn't seem like that big a file.

ddevault · on Aug 8, 2018

1639 lines of code actually isn't bad at all. I think that's less of a selling point and more happenstance, though. I certainly wouldn't want to reject changes which split it up in the future as it gets more unweildy on the basis of "but being in one file is a feature!".

cryptonector · on Aug 8, 2018

You could have looked at the file size first. What if it had been 500sloc? Would you still have made the same comment? What is the point at which you'd absolutely insist on splitting it up if you were doing a code review? Surely 1.3ksloc is smaller than that. The commentary on splitting this up strikes me as so much bikeshedding.

abenedic · on Aug 8, 2018

I think, not to put words in his mouth, that he is objecting to the idea of single file libraries as inherently good or better than a multi-file library. I think the objection is more about future design choices the maintainer will make. If you want to keep it single file, it may be necessary to avoid adding some features which are too complex.

Complex features usually necessitate modularization, which is against the idea of the single file library. Modularization in c and c++ is very poor and based on having multiple files, some of which represent the interface and others that represent the implementation. I at least partly think this is his objection.

cryptonector · on Aug 8, 2018

OK, sure, https://news.ycombinator.com/user?id=Sir_Cmpwn's comment was:

> > I've decided to write the whole µUBSan runtime as a single self-contained .c soure-code file, as it makes it easier for it to be reused by every interested party.

> I don't really get why people do this. Linking is one of the easiest and most broadly supported features of C environments on every platform.

Sure, this is true, and users already have to know how to link, unless they are #include'ing this, but still, a single file is much easier to share/distribute and use, and in any case, splitting up such a small file (by the standards of.. a number of open source projects I looked at, it's small) seems unnecessary.

abenedic · on Aug 8, 2018

I totally agree with you, I personally love single file implementations.

I was just trying to explain what I thought his argument was against them, since I don't think he really explicitly stated it outside of generally speaking of maintenance issues. He just said he would prefer to contribute patches to a project that had multiple files and a makefile based build system.

I was assuming his argument was about the future directions the project could go, which I can see as being a valid criticism. As I said the main issue with a single file implementation is that potential users may end up using parts of the implementation instead of just the public facing interface you would like them to use.

cryptonector · on Aug 9, 2018

Fair enough.

blattimwind · on Aug 8, 2018

https://github.com/qt/qtbase/blob/5.11/src/widgets/kernel/qw... is the largest no-nonsense C++ file I've ever seen https://github.com/borgbackup/borg/blob/master/src/borg/arch... ditto for Python

Personally files this large become just very annoying overall. Code review tools tend to become slow on them (e.g. Gerrit is super-slow with files that are quite a bit smaller than qwidget.cpp), navigation is only possible via symbols, the import / include area becomes a huge mess (which also tends to generate constant merge conflicts, because imports are a common hot area for writes) etc.

cryptonector · on Aug 9, 2018

Oh yes, past some size things get slow and obnoxious, no doubt.

pebers · on Aug 8, 2018

It's not the reason the author gives, but one advantage is that you don't need LTO for some optimisations. Sqlite has a single-file distribution for that reason. On the downside, it is quite slow to compile and not readily parallelisable - but that probably is less of a concern here.

com2kid · on Aug 8, 2018

> Linking is one of the easiest and most broadly supported features of C environments on every platform.

And build systems are one of the most varied features of C environments.

I've seen large projects (e.g. OpenSSL) take weeks of effort to integrate into build systems. Suffice to say, no one wants to try and merge patches in afterwords.

Nevermind the number of source trees I've seen with copies of some old version of OpenSSL source, ugh.

abenedic · on Aug 8, 2018

There seems to be several different discussions which are taking place as a response to your comment. It might be good to have a more consolidated reference since different people are attacking different aspects of your argument, so of which are a little unfair.[I am not a native speaker so I may have misunderstood some points or will misrepresent them]

Reasons why people like single file libs:

* Easy for a beginner to use.

* No non-obvious dependencies. (check includes to see them) * Easy to repackage for multiple OS's

* Easy to understand since each reference is in the same file.

* Typically simpler and have fewer features(which can help one focus on the main issues.)

* No need for either static or dynamic linking, although it is basically the same as static linking in some ways.

* Eliminates the need for dependency management.

Reasons against single file libraries:

* No isolation of components. (The interface and implementation are in the same file)

* Adding new feature may make the library too complicated(which can lead to some features being denied)

* A bit harder to maintain. Since the interface and implementation are the same, some users may depend on internal interfaces. Which can inhibit changes needed for performance.

* May encourage bad behaviour, since the user need never learn to link against a third party library. If the culture changes enough there may be many people contributing who are not able to use standard tools in the standard way. (The idea that beginners should learn the culture that is used by the toolmakers.)

Hopefully this is useful for someone, and I hope I did not misrepresent any opinions.[Also huge fan of Drew]

vultour · on Aug 8, 2018

As someone with a very casual knowledge of C, I am thankful when people do this. All you need to do is #include "x" and it just works, having to configure the compiler and linking can be painful. Experienced developers might see it as a disadvatage for some reason, but it is a godsend for beginners.

ddevault · on Aug 8, 2018

I think it just trains beginners on how to do things wrong. The basics of linking two pieces of code together aren't terribly complicated and are essential C knowledge.

MaulingMonkey · on Aug 8, 2018

> The basics of linking two pieces of code together aren't terribly complicated

The basics aren't complicated, but doing it correctly requires more than the basics IME.

Each platform does it slightly differently, and I support multiple platforms. I've repeatedly seen static libs linked into multiple dynamic libs and cause all kinds of trouble because there's multiple copies of global data, of which only one was probably initialized correctly. I have seen multiple different stdlibs successfully linked into the same program, causing ABI mismatches that compiled, linked, and "usually" ran at runtime - but had some nasty crashes to be debugged, because std::vector<int> was an entirely different type at the call site vs the callee's implementation. Version mismatches of other SDKs are also common. I've had all kinds of weird constraints on what compilation and link flags I could use based on what static libs I've linked (e.g. being unable to link with exception handling enabled because one of my dependencies was built with it disabled)

Additionally you don't have libs for all my platforms (or sometimes any of them), so I have to build from source anyways, so I have to write build rules for your lib because the defaults probably don't work, and because build rules are never documented I have to mostly reverse engineer intent to do so - possibly without a viable build in the first place if your only working libs are for platforms I don't use.

Or if I'm lucky enough that reusing your build rules is viable, I have to try and sync various flags between my projects written in one build system and your projects written in yet another build system, and prevent these from getting out of sync for the remainder of the lifetime of the project. I likely have to jump through all kinds of stupid hoops like installing the exact right version of python and installing undocumented but required packages just to run the configuration scripts that drive the build.

...or for single-source libs, I can probably just drop a single file into one of my existing projects and have everything work. Okay, so some of that was just because they actually cared about minimizing how painful it is to integrate their library at all, but having a single .c file helped too.

I'd only write a single .c library for the simplest of libraries - or as an automatically generated file - but it can help.

drb91 · on Aug 8, 2018

Linking is only one step in package management.

Besides, in this scenario, it may not be viewed as a library a user should manage. The only benefit I can think of from linking is getting security patches without recompiling. That might easily not be a priority. Why is that a problem?

ddevault · on Aug 8, 2018

I'm not talking about package management.

microcolonel · on Aug 8, 2018

> I think it just trains beginners on how to do things wrong.

What gives you the impression that this is the "wrong" way to do this? How many beginners are going to be looking to μUBSan for guidance on how to structure their generic library? And it's not like linking isn't involved, you still end up linking the one object that is created from that source file (unless you #include it, which would also probably work).

ddevault · on Aug 8, 2018

>What gives you the impression that this is the "wrong" way to do this?

It's far less maintainable and goes against the grain of how people expect libraries to behave.

>How many beginners are going to be looking to μUBSan for guidance on how to structure their generic library?

Any beginner who uses it. You make a good point, though, this isn't exactly a beginner-tier tool so why is it being distributed like one?

>unless you #include it, which would also probably work

I think that's what they're expecting you to do. And if you're linking anyway, what does it matter if it's one object or several? Or more realistically, a single archive (built from many objects which are themselves built from many source files)?

microcolonel · on Aug 8, 2018

> It's far less maintainable...

How so? What form of maintenance does it prevent or hurt?

> ...and goes against the grain of how people expect libraries to behave.

You could dynamically link it if you wanted, it's just that there is little reason to. This approach is inclusive: it enables all three major ways people use libraries.

Honestly, I don't see what the problem is. You're entitled to prefer dynamic linking or static linking, but how does it hurt you that it can also be used another way?

ddevault · on Aug 8, 2018

I'm not arguing about dynamic versus static linking. In fact, I think dynamic linking is pretty bad. I'm just arguing against distributing libraries all in one big C file. You can statically link against an archive built from several sources.

microcolonel · on Aug 8, 2018

But why does it bother you so much? Clearly this is a feature for somebody else, and it's not an issue for you.

ddevault · on Aug 8, 2018

As I said elsewhere:

>I don't consider open source projects a black box, I evaluate every project under the lens of someone who expects to someday have to work with the code myself and send patches upstream.

And I think I'm squarely in the target audience for this tool anyway. What makes you think it's a feature for someone else?

microcolonel · on Aug 8, 2018

> What makes you think it's a feature for someone else.

The fact that they mention it specifically and proudly on the front page indicates to me that the author, and the NetBSD maintainers who accepted it, consider it good that it is a single, largely self-contained source file. It is only ~1300 source lines, so I don't see why it should necessarily be split out. Something tells me the original authors and the ongoing maintainers of this file have a better idea what value this choice represents than you do (since you haven't said anything particularly compelling in favour of splitting this into several files and objects).

pebers · on Aug 8, 2018

You need to do more than just #include it - it is implemented in one .c file but that is not a header, so you'd still need to compile that separately and link it in somehow.

mnarayan01 · on Aug 8, 2018

If you include it (possibly indirectly) in a file which you compile, then no, you do not need to compile it separately.

TeMPOraL · on Aug 8, 2018

Not really. All you need to do is:

  #ifndef __onefilelib__
  #define __onefilelib__
  #include "onefilelib.c"
  #endif

(Also potentially defining "main" as something else, if it happens that the "onefilelib.c" has an entry point for some reason.)

abenedic · on Aug 8, 2018

You should really be careful about c preprocessor macros when you do things like this with .c files. You should always undefine any macros defined in a file which are not properly namespaced(as much as C allows for pseudo namespacing).

slrz · on Aug 8, 2018

This only does what you want if you mush everything together into a single translation unit. Otherwise, you will get duplicate definitions for everything in onefilelib.c.

sigjuice · on Aug 8, 2018

Is there a prominent C code base that does this sort of thing? I’m sorry this is all very questionable advice IMHO. What would be one word in a makefile (onefilelib.c) is a 4-line jumble of preprocessor macros, names starting with __ are reserved, and the suggestion to redefine main and entry points smells like more #ifdef spaghetti.

mnarayan01 · on Aug 8, 2018

The #ifndef rigamarole is https://en.wikipedia.org/wiki/Include_guard and at least used to be fairly common. I also used to see the __FOO_BAR_H__ naming convention for these defines all over the place. I'm not sure if __ identifiers being reserved is a (not very) new thing or if it's always been around and people are just now more generally knowledgeable about the fact that they shouldn't be used.

sigjuice · on Aug 8, 2018

Yes, the include guard is a very widespread technique for header files. My objection is against #include’ing a .c file to support the questionable trend of ‘single file libs’.

TeMPOraL · on Aug 8, 2018

I wasn't suggesting this as a serious method of organizing your project - merely pointing out that this quick hack can be done and is rather straightforward. And I did actually see it once or twice in the wild.

abenedic · on Aug 8, 2018

I am not sure of a prominent code base that does this exactly, but some people do similar things with "amalagmated builds". Like https://www.sqlite.org/amalgamation.html

abenedic · on Aug 8, 2018

It makes it easier to relink against a different standard library like musl. Or for cross compilation. The fewer files to deal with the fewer places I have to look for issues in porting.

ddevault · on Aug 8, 2018

I mean, you're already going to have to recompile stuff for either of those cases. Once you're compiling things, compiling several files isn't much harder.

abenedic · on Aug 8, 2018

Right, it is almost always possible to do anything. In practice I would rather have to change the includes on one file to get it to compile with musl rather than change the includes on 10+ files each with a different change. It is a matter of degree not kind of difficulty. But in general I always will prefer the issues that arise from a single file implementation rather than the issues from many files.[Sorry if this is phrased poorly, I am not a native speaker.]

drb91 · on Aug 8, 2018

And what dependency management or package system would you use to install that library? Distributing the source, especially for self contained libraries, is easily managed with the rest of your source code.

ddevault · on Aug 8, 2018

Sure, I'm all for distributing the source. Doesn't mean it needs to be a single file. It should probably have a makefile which spits out an archive file, which you link to in your application.

abenedic · on Aug 8, 2018

I would like if when people did this they automatically had it compile the library with -ffunction-sections and -fdata-sections. This would make my life way easier, and if anyone is reading this, please add these flags. Combined with stripping the executable afterwards you can make things way smaller for embedded systems.

namibj · on Aug 9, 2018

Please allow that mess of using ar to work in LTO mode, i.e., allow bypassing the ar or replacing it with the compiler in linking mode, or something. Basically don't use ar in a way that breaks LTO.

LTO is awesome, as some code does benefit from optimizing for a minute to get 5% more speed for the next week (and sometimes even more aggressive parameters for the optimizer, if you run the code for long enough to matter).

drb91 · on Aug 8, 2018

Ok, I’m in agreement there.

microcolonel · on Aug 8, 2018

This way there are no barriers to entry, such as a long list of objects to link. You just include one object and you're golden. If you don't mind linking longer lists of objects, great, you can have a list of length 1 instead.

What harm is it to you that it's a single source file and a single object?

ddevault · on Aug 8, 2018

>What harm is it to you that it's a single source file and a single object?

I don't consider open source projects a black box, I evaluate every project under the lens of someone who expects to someday have to work with the code myself and send patches upstream.

zzzcpan · on Aug 8, 2018

> Linking is one of the easiest and most broadly supported features of C environments on every platform.

No, it's absolutely not easier. And the more you deviate from amd64 linux and gcc, the harder it gets.

ddevault · on Aug 8, 2018

I've worked pretty damn far away from amd64 linux and gcc, and I can assure you that it's really not that hard.

zzzcpan · on Aug 8, 2018

I've seen plenty of times how libraries compile just fine on less popular platforms and environments but then cause crashes at start or not link at all and how much time it wastes dealing with this. I thought these problems were widely known, that's why anyone who cares about porting tries to make self-contained single-sourced libraries.

ddevault · on Aug 8, 2018

I'm not talking about dynamic linking or distributing shared objects or even archives. I'm talking about static linking or simply slinging together .o files.