hash collisions with tuples #5257

Closed
@thestinger

Description

@thestinger
Contributor

At the moment tuples have an IterBytes implementation that simply concatenates the IterBytes output of the contained elements, with nothing marking where one element ends and the next begins. This means that ("aaa", "", "").to_bytes() is equal to ("a", "a", "a").to_bytes(), so their hashes collide.

This makes it easy to generate any number of hash collisions:

value: ("aaa", "bbb", "ccc"), hash: 10071128569180783593
value: ("aaab", "bb", "ccc"), hash: 10071128569180783593
value: ("aaabb", "b", "ccc"), hash: 10071128569180783593
value: ("aaabbb", "", "ccc"), hash: 10071128569180783593
value: ("aaabbbc", "", "cc"), hash: 10071128569180783593
value: ("aaabbbcc", "c", ""), hash: 10071128569180783593
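For anyone reading this later, here is a minimal sketch of why those values collide. It does not use the 2013 IterBytes API; `naive_tuple_bytes` and `all_three_way_splits` are hypothetical helpers that only mimic the behaviour described above (element bytes emitted back to back), written in current Rust.

```rust
// Hypothetical stand-in for the tuple behaviour described in this issue:
// each element's bytes are emitted back to back, with no length or delimiter.
fn naive_tuple_bytes(t: (&str, &str, &str)) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(t.0.as_bytes());
    out.extend_from_slice(t.1.as_bytes());
    out.extend_from_slice(t.2.as_bytes());
    out
}

// Enumerates every way of cutting `s` into three (possibly empty) pieces.
// Assumes `s` is ASCII, so every byte index is a valid slice boundary.
fn all_three_way_splits(s: &str) -> Vec<(&str, &str, &str)> {
    let mut splits = Vec::new();
    for i in 0..=s.len() {
        for j in i..=s.len() {
            splits.push((&s[..i], &s[i..j], &s[j..]));
        }
    }
    splits
}
```

Every tuple returned by `all_three_way_splits("aaabbbccc")` produces the same byte stream under `naive_tuple_bytes`, so any hash computed over that stream is identical for all of them.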

I don't really know how to do this properly (implementing Hash on containers).

Activity

brson commented on Mar 7, 2013

@brson
Contributor

That's pretty funny. Probably there are other combinations of types with this problem as well. For strings and vectors we could hash the length along with the bytes.

Enums could have similar problems if they don't hash the discriminant.
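A rough sketch of both suggestions, in current Rust rather than the 2013 trait. The 8-byte little-endian length prefix, the `Token` enum, and all function names are assumptions made up for illustration, not what the standard library does.

```rust
// Length-prefixed string: an 8-byte little-endian length, then the bytes.
fn write_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u64).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

// Tuple encoding built from length-prefixed strings.
fn tuple_bytes(t: (&str, &str, &str)) -> Vec<u8> {
    let mut out = Vec::new();
    write_str(&mut out, t.0);
    write_str(&mut out, t.1);
    write_str(&mut out, t.2);
    out
}

// Hypothetical enum whose variants carry the same payload type.
enum Token {
    Ident(u32),
    Number(u32),
}

// Without the discriminant, both variants emit identical bytes.
fn token_bytes_no_tag(t: &Token) -> Vec<u8> {
    match t {
        Token::Ident(n) | Token::Number(n) => n.to_le_bytes().to_vec(),
    }
}

// With a one-byte variant tag in front, the two variants are distinguishable.
fn token_bytes_tagged(t: &Token) -> Vec<u8> {
    let (tag, n) = match t {
        Token::Ident(n) => (0u8, n),
        Token::Number(n) => (1u8, n),
    };
    let mut out = vec![tag];
    out.extend_from_slice(&n.to_le_bytes());
    out
}
```

With the length prefix, ("aaa", "", "") and ("a", "a", "a") no longer produce the same bytes, and `Token::Ident(7)` differs from `Token::Number(7)` only once the tag is included.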

brson commented on Mar 7, 2013

@brson
Contributor

@graydon what do you think?

graydon commented on Mar 7, 2013

@graydon
Contributor

Yeah, maybe redo the trait as just hash-specific and include lengths

nikomatsakis commented on Mar 7, 2013

@nikomatsakis
Contributor

I think there should be a rule: When you implement the trait for a type T, you must emit enough information to discover where your byte stream ends without any outside help except knowledge of your static type. This means that for example strings and vectors must include their length. Tuples do not need to, because the length is implicit in the type. Enums must include their variant id. If everyone follows this rule, everything is fine. Deriving iter bytes will help here, of course, because it will follow this rule implicitly.
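One way to read the rule: the byte stream should be decodable back into its parts by someone who knows only the static type. The sketch below illustrates that for a pair of strings; `encode_pair`, `read_str`, and `decode_pair` are hypothetical names, and the 8-byte little-endian length is an assumed format.

```rust
// Encodes a pair of strings; each string carries its own length, the pair does not.
fn encode_pair(a: &str, b: &str) -> Vec<u8> {
    let mut out = Vec::new();
    for s in [a, b] {
        out.extend_from_slice(&(s.len() as u64).to_le_bytes());
        out.extend_from_slice(s.as_bytes());
    }
    out
}

// Reads one length-prefixed string from the front of `input`,
// returning it together with the unread remainder.
fn read_str(input: &[u8]) -> Option<(&str, &[u8])> {
    if input.len() < 8 {
        return None;
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&input[..8]);
    let len = u64::from_le_bytes(len_bytes) as usize;
    let rest = &input[8..];
    if rest.len() < len {
        return None;
    }
    let (body, rest) = rest.split_at(len);
    Some((std::str::from_utf8(body).ok()?, rest))
}

// The pair's arity (two) comes from the static type, so no count is stored.
fn decode_pair(input: &[u8]) -> Option<(&str, &str)> {
    let (a, rest) = read_str(input)?;
    let (b, rest) = read_str(rest)?;
    if rest.is_empty() { Some((a, b)) } else { None }
}
```

Because every encoding decodes to exactly one pair, two different pairs cannot share a byte stream, which is the property the hash input needs.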

bstrie commented on Jun 5, 2013

@bstrie
Contributor

Nominating for Maturity 5, Production Ready.

bstrie commented on Jun 5, 2013

@bstrie
Contributor

Er, this is already milestoned. My mistake. :P

pnkfelix commented on Aug 14, 2013

@pnkfelix
Member

Visiting for bug triage, email 2013-08-05.

It seems like we could put in the change suggested by @brson and @graydon (perhaps just for strings and vectors, or perhaps include enum's discriminants as well), just for Hash. That would be the, mm, most direct way to address this ticket, I think.

But @nikomatsakis has posted a more general principle, I think it was meant for IterBytes as a whole. It sounded rather like what you would need to implement a type-based value serialization mechanism. (Don't we already have one or more serialization traits?) The IterBytes trait is solely documented as being in place to support Hash, which to me means that there is a lot of freedom in how one implements it. Maybe I am mistaken in drawing a connection between serialization and IterBytes.

Anyway, does anyone have feedback on niko's suggestion?

nikomatsakis commented on Aug 14, 2013

@nikomatsakis
Contributor

@pnkfelix I tend to agree that iterbytes and serialization are deeply connected. I originally wanted to remove iterbytes, but in the discussion on #8038, I think we sort of settled on the idea that iter-bytes is basically a specialized serialization for the purposes of hashing, which is usually the same thing but not always. @erickt pointed out that the serialization API includes some higher-level methods for things like maps and so forth that are not particularly well-suited to hashing; basically, in general-purpose serialization we might allow more license than we would want for hashing. *shrug* I guess there is no reason to shoehorn everything into one trait, so long as we have deriving modes.

nikomatsakis commented on Aug 14, 2013

@nikomatsakis
Contributor

That said I think iterbytes should nonetheless always ensure that the bytes iterated over are sufficient to reconstruct the value up to the point of Eq comparisons (that is, if two distinct values would nonetheless be considered Eq, then of course they can hash together). (This is, incidentally, a reason to distinguish serialization and hashing: one might want to define a newtyped tuple that is symmetric with respect to equality or whatever)
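To make the parenthetical concrete, here is a sketch of a symmetric newtyped pair in current Rust, using today's `std::hash::Hash` rather than IterBytes. `UnorderedPair` and `hash_of` are hypothetical names chosen for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A pair that compares equal regardless of element order.
#[derive(Debug, Clone, Copy)]
struct UnorderedPair(u32, u32);

impl PartialEq for UnorderedPair {
    fn eq(&self, other: &Self) -> bool {
        (self.0, self.1) == (other.0, other.1) || (self.0, self.1) == (other.1, other.0)
    }
}

impl Eq for UnorderedPair {}

impl Hash for UnorderedPair {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Feed the elements in a canonical order so that values which are
        // Eq always feed the hasher the same input.
        let (lo, hi) = if self.0 <= self.1 {
            (self.0, self.1)
        } else {
            (self.1, self.0)
        };
        lo.hash(state);
        hi.hash(state);
    }
}

// Convenience helper: hashes a value with a freshly constructed DefaultHasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Here `UnorderedPair(1, 2)` and `UnorderedPair(2, 1)` are distinct in memory but Eq, so they are allowed (and in fact required) to hash together. A serializer, by contrast, would normally preserve the original order.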

bluss commented on Aug 15, 2013

@bluss
Member

IterBytes for &[A] needs to hash in the length, anything else? I can't find any explicit IterBytes impl that would not be fixed by that (str uses vec).

For short strings, hashing the length might have a noticeable cost: that's 8 more bytes to hash. One way around it might be to use a terminator instead, like f(self.as_bytes()) && f([0xFF]). A number of byte values can never appear in valid UTF-8, and therefore never in a str.
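A sketch of the terminator idea in current Rust. The helper names are hypothetical and this is not the patch itself; it only relies on the fact that the byte 0xFF cannot occur in valid UTF-8.

```rust
// 0xFF never occurs in valid UTF-8, so it cannot be confused with string content.
const STR_TERMINATOR: u8 = 0xFF;

// Emits the string bytes followed by a single terminator byte.
fn write_str_terminated(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(s.as_bytes());
    out.push(STR_TERMINATOR);
}

// Tuple encoding built from terminated strings.
fn tuple_bytes_terminated(t: (&str, &str, &str)) -> Vec<u8> {
    let mut out = Vec::new();
    write_str_terminated(&mut out, t.0);
    write_str_terminated(&mut out, t.1);
    write_str_terminated(&mut out, t.2);
    out
}
```

This costs one extra byte per string instead of eight. It only works for str, though: a general &[u8] can contain 0xFF, so vectors still need the length.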

bluss commented on Aug 15, 2013

@bluss
Member

Here's a patch that hashes the length for vectors, and uses a terminating byte for str.

https://gist.github.com/anonymous/34dedfa9f4b8d32134fd

nikomatsakis commented on Aug 15, 2013

@nikomatsakis
Contributor

That looks about right to me.

added a commit that references this issue on Aug 18, 2013

auto merge of #8545 : blake2-ppc/rust/iterbytes, r=alexcrichton

alexcrichton commented on Aug 19, 2013

@alexcrichton
Member

This was closed by #8545

Participants: @graydon, @alexcrichton, @brson, @nikomatsakis, @pnkfelix

hash collisions with tuples · Issue #5257 · rust-lang/rust