I'm confused about why a block can't exist in both S3 and Glacier at the same time, if the deduplication code decides that the block is needed in a new archive.
Why couldn't you simply have a rule that each file is stored in either S3 or Glacier, so that S3 lists of blocks can only reference other S3 blocks, while Glacier lists of blocks can only reference other Glacier blocks?
In the worst case, where every block was in both archives, this would only increase costs by 10% if Glacier costs a tenth what S3 costs.
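To spell out the arithmetic behind that 10% figure, here is a quick sketch. It assumes the baseline is "everything stored in S3 only" and that the worst case adds a full second copy in Glacier; the price of 1.0 is a made-up unit, not a real AWS rate.

```python
def combined_cost_increase(s3_price: float, glacier_price: float) -> float:
    """Fractional cost increase over an S3-only baseline when every block
    is additionally stored once in Glacier (the worst case described above)."""
    baseline = s3_price                   # cost per unit of data, S3 only
    worst_case = s3_price + glacier_price # same data held in both stores
    return (worst_case - baseline) / baseline


s3_price = 1.0                 # hypothetical price unit, not a real rate
glacier_price = s3_price / 10  # "Glacier costs a tenth what S3 costs"
print(f"{combined_cost_increase(s3_price, glacier_price):.0%}")
```

Note the figure is relative to the S3 bill; measured against a Glacier-only baseline the same duplication would look far more expensive.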
From that I assume that if a block's hash matches something already in the archives, you retrieve the archived block(s) with the same hash ID to verify that they are exactly the same, byte for byte? That wouldn't be possible with Glacier, since you can't just retrieve a block from storage on the spot to check it.
Do you have any stats on the number of collisions you've seen?
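For concreteness, the check I'm imagining looks roughly like the sketch below. This is only my guess at the idea, not a claim about how the real code works: the SHA-256 hash, the `is_duplicate` name, and the in-memory `store` dict standing in for the archive are all assumptions.

```python
import hashlib


def is_duplicate(new_block: bytes, store: dict) -> bool:
    """Return True only if a stored block has the same hash AND the same bytes.

    `store` maps hex digest -> block bytes and stands in for the archive;
    this is a hypothetical model, not the actual storage layer.
    """
    digest = hashlib.sha256(new_block).hexdigest()
    existing = store.get(digest)  # with Glacier this fetch could not happen on the spot
    if existing is None:
        return False              # no block with this hash is stored
    return existing == new_block  # byte-for-byte comparison guards against collisions


store = {hashlib.sha256(b"abc").hexdigest(): b"abc"}
print(is_duplicate(b"abc", store))
print(is_duplicate(b"xyz", store))
```

The `store.get` line is the step that needs immediate read access, which is where I'd expect Glacier to be a problem.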