Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

From http://code.google.com/p/cityhash/source/browse/trunk/src/ci...

// WARNING: This code has not been tested on big-endian platforms!

// It is known to work well on little-endian platforms that have a small penalty

// for unaligned reads, such as current Intel and AMD moderate-to-high-end CPUs.

//

// By the way, for some hash functions, given strings a and b, the hash

// of a+b is easily derived from the hashes of a and b. This property

// doesn't hold for any hash functions in this file.



City64 at least fails to produce the same values on an x86 and big endian PPC (Debian squeeze on both):

  x86
  ========================
  641f1cd0505d1ff9 : I
  e394c6831e9d6e71 : do
  651c1a4b0b3f2d88 : not
  9fc671af2d051786 : know
  1fa385757f0016ec : about
  f2407bd9ce678f7f : making
  73bd225c6b6b8163 : awkward
  ea78562e35777eb1 : rhopalic
  3c854a925e6e4a9e : sentences
  84a04e9aa8dae9f5 : .
  
  PPC
  ========================
  641f1cd0505d1ff9 : I
  e394c6831e9d6e71 : do
  651c1a4b0b3f2d88 : not
  147d90cc620fc6b2 : know
  23af6044b2394703 : about
  4f80227202845190 : making
  4b8b4b3cd98a1de9 : awkward
  61b38def2ee9465f : rhopalic
  29016ee1c7c61e93 : sentences
  84a04e9aa8dae9f5 : .
For anyone that cares, the test program was…

  #include <stdio.h>
  #include <string.h>
  #include "city.h"

  const char *word[] = { "I", "do", "not", "know", "about",
                         "making", "awkward", "rhopalic", 
                         "sentences", ".", 0 };

  int main(int argc, char **argv)
  {
      for ( const char **p = word ; *p != 0; p++) {
        uint64 v = CityHash64( *p, strlen(*p));
        printf("%Lx : %s\n", v, *p);
      }
  }


    #define UNALIGNED_LOAD64(p) (*(const uint64*)(p))

    #define UNALIGNED_LOAD32(p) (*(const uint32*)(p))
Any function that uses both of these violates strict aliasing rules and might miscompile on recent gcc versions. Only one of them isn't enough, because char* is allowed to alias anything in C/C++. But if you use both, then you have a uint32_t* and a uint64_t* pointing to the same memory, violating the language spec.

ffmpeg has a header of macros to avoid this problem. They have names like AV_RN32A (aligned, native-endian 32-bit read), AV_RL32 (unaligned, little-endian 32-bit read) and so forth.


> Any function that uses both of these violates strict aliasing rules

More specifically, any function that uses both of these to access the same data violates strict aliasing rules. But this would imply that the data is being loaded redundantly, which seems unlikely in an implementation where speed is a top priority.

For example, I do not believe the following function violates strict aliasing rules:

    void foo(char *p) {
      bar(UNALIAGNED_LOAD64(p), UNALIGNED_LOAD32(p+8));
    }


I am surprised at how half-baked this code sounds. Why would Google release a new hash library that does not support big endian platforms, uses unaligned memory access on little endian platforms, is strict-aliasing unsafe, and is implemented in C++?


For what it's worth, Snappy (the compression library Google released a couple weeks ago) has many of these same limitations. I wanted to port both of these to the RISCy, big-endian architecture I work on as part of my research, so finding out how unportable they were was kind of a bummer for me, but honestly it's a pretty reasonable tradeoff for Google; x86 is the name of the game in cheap (sorry Power 7/SPARC) commodity servers. If it were me, I might have put a bit of thought/work into portability before open-sourcing it, but I'd rather the code be out there than not, despite its limitations.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: