Open Source at Google: Introducing CityHash

nbpoole · on April 11, 2011

I'm not seeing any benchmarks: since they tout speed as a major selling point, that surprises me.

Edit: For instance, the blog post mentions being inspirsed by MurmurHash, which touts its performance on its site along with benchmarks (http://sites.google.com/site/murmurhash/):

    OneAtATime - 354.163715 mb/sec
    FNV - 443.668038 mb/sec
    SuperFastHash - 985.335173 mb/sec
    lookup3 - 988.080652 mb/sec
    MurmurHash 1.0 - 1363.293480 mb/sec
    MurmurHash 2.0 - 2056.885653 mb/sec

In fact, it looks like the author of MurmurHash also developed a test suite for hash functions which includes performance testing: http://code.google.com/p/smhasher/wiki/SMHasher

elq · on April 11, 2011

looking at the code, they've optimized for speed. And they also said -

> We decided to optimize for speed rather than simplicity and even included special cases for short inputs.

nbpoole · on April 11, 2011

Right, I understand that. I'm just surprised that they aren't showing off benchmarks or statistics to quantify the performance gains they're seeing.

andrewcooke · on April 11, 2011

and i'm not seeing any definition of what the hash is, or why it should be any good, apart from the code...

turbodog · on April 11, 2011

From http://code.google.com/p/cityhash/source/browse/trunk/src/ci...

// WARNING: This code has not been tested on big-endian platforms!

// It is known to work well on little-endian platforms that have a small penalty

// for unaligned reads, such as current Intel and AMD moderate-to-high-end CPUs.

//

// By the way, for some hash functions, given strings a and b, the hash

// of a+b is easily derived from the hashes of a and b. This property

// doesn't hold for any hash functions in this file.

jws · on April 11, 2011

City64 at least fails to produce the same values on an x86 and big endian PPC (Debian squeeze on both):

  x86
  ========================
  641f1cd0505d1ff9 : I
  e394c6831e9d6e71 : do
  651c1a4b0b3f2d88 : not
  9fc671af2d051786 : know
  1fa385757f0016ec : about
  f2407bd9ce678f7f : making
  73bd225c6b6b8163 : awkward
  ea78562e35777eb1 : rhopalic
  3c854a925e6e4a9e : sentences
  84a04e9aa8dae9f5 : .
  
  PPC
  ========================
  641f1cd0505d1ff9 : I
  e394c6831e9d6e71 : do
  651c1a4b0b3f2d88 : not
  147d90cc620fc6b2 : know
  23af6044b2394703 : about
  4f80227202845190 : making
  4b8b4b3cd98a1de9 : awkward
  61b38def2ee9465f : rhopalic
  29016ee1c7c61e93 : sentences
  84a04e9aa8dae9f5 : .

For anyone that cares, the test program was…

  #include <stdio.h>
  #include <string.h>
  #include "city.h"

  const char *word[] = { "I", "do", "not", "know", "about",
                         "making", "awkward", "rhopalic", 
                         "sentences", ".", 0 };

  int main(int argc, char **argv)
  {
      for ( const char **p = word ; *p != 0; p++) {
        uint64 v = CityHash64( *p, strlen(*p));
        printf("%Lx : %s\n", v, *p);
      }
  }

DarkShikari · on April 11, 2011

    #define UNALIGNED_LOAD64(p) (*(const uint64*)(p))

    #define UNALIGNED_LOAD32(p) (*(const uint32*)(p))

Any function that uses both of these violates strict aliasing rules and might miscompile on recent gcc versions. Only one of them isn't enough, because char* is allowed to alias anything in C/C++. But if you use both, then you have a uint32_t* and a uint64_t* pointing to the same memory, violating the language spec.

ffmpeg has a header of macros to avoid this problem. They have names like AV_RN32A (aligned, native-endian 32-bit read), AV_RL32 (unaligned, little-endian 32-bit read) and so forth.

haberman · on April 12, 2011

> Any function that uses both of these violates strict aliasing rules

More specifically, any function that uses both of these to access the same data violates strict aliasing rules. But this would imply that the data is being loaded redundantly, which seems unlikely in an implementation where speed is a top priority.

For example, I do not believe the following function violates strict aliasing rules:

    void foo(char *p) {
      bar(UNALIAGNED_LOAD64(p), UNALIGNED_LOAD32(p+8));
    }

cpeterso · on April 12, 2011

I am surprised at how half-baked this code sounds. Why would Google release a new hash library that does not support big endian platforms, uses unaligned memory access on little endian platforms, is strict-aliasing unsafe, and is implemented in C++?

sparky · on April 12, 2011

For what it's worth, Snappy (the compression library Google released a couple weeks ago) has many of these same limitations. I wanted to port both of these to the RISCy, big-endian architecture I work on as part of my research, so finding out how unportable they were was kind of a bummer for me, but honestly it's a pretty reasonable tradeoff for Google; x86 is the name of the game in cheap (sorry Power 7/SPARC) commodity servers. If it were me, I might have put a bit of thought/work into portability before open-sourcing it, but I'd rather the code be out there than not, despite its limitations.

andrewcooke · on April 11, 2011

what does this do: #define LIKELY(x) (__builtin_expect(!!(x), 1))

oh, to answer myself, is to help predict the branch direction - it indicates which is likely the common case in the test. http://kerneltrap.org/node/4705

edge17 · on April 11, 2011

haven't seen this stuff in a while... given the complexity and large history-tracking tables of modern branch predictors, is this stuff still useful?

or is the code written in this manner just to make it more platform independent (i.e. cycles still matter for embedded applications, etc)?

pmr_ · on April 11, 2011

There is really no connection between the likely macro and branch prediction here.

The gcc built-in simply will organize the code in a manner that makes the likely statement faster to execute (e.g. branching the unlikely block).

Branch predictors are a runtime mechanism so the processor guesses which outcome of the "if" is more likely.

One is a static optimization, the other is dynamic; if you like to think about it that way.

haberman · on April 12, 2011

A lot of instruction sets (including x86) let you encode branch prediction hints into the actual instructions to help out the branch predictors. But for x86 at least these hints didn't turn out to be too useful, so they're ignored (except on P4, I believe).

ecaron · on April 13, 2011

I just finished created a PHP PECL extension for CityHash. Email me if you'd like to beta-test it before I submit it to the PECL dev list.

mikelward · on April 11, 2011

Middle click not working as expected? http://code.google.com/p/chromium/issues/detail?id=67844