WIP: Robin Hood hashing #1429
Changes from 2 commits
@@ -68,7 +68,7 @@ static int hash_sizes[] = {
 };

 #define HASH_MIN_LOAD .25
-#define HASH_MAX_LOAD .75 /* don't go higher than 0.75, otherwise performance severely suffers! */
+#define HASH_MAX_LOAD 0.95 /* don't go higher than 0.75, otherwise performance severely suffers! */

 #define MAX(x, y) ((x) > (y) ? (x): (y))
 #define NELEMS(x) (sizeof(x) / sizeof((x)[0]))
@@ -111,7 +111,7 @@ hashindex_index(HashIndex *index, const void *key)
 static int
 hashindex_lookup(HashIndex *index, const void *key)

This function could also be optimized. Currently, for the worst case scenario (not finding a key in the index) we scan all buckets until we find an empty one. At high fill ratios this might get close to O(N). However, if we track the maximum offset in the entire hashmap, we could bail after at most max_offset iterations.

As the PR currently stands, we could just load an old hashmap and start operating on it with the new collision handling code and it would just work, and hashmaps created by this would still be usable by older borg versions. One potential idea would be to use the MAGIC string in the header to also encode a version. For example, if we turn BORG_IDX into BORG_I plus 2 bytes for versioning, we could determine whether this version of the index is fully robin-hood compliant and, if not, convert it on load from disk. @ThomasWaldmann I'd like to hear your thoughts on this.

Actually, if the offset of the key is smaller than the number of buckets we've looked at, we can bail. There's no way the next bucket will contain our desired key, since it would have been swapped on insert.

Scanning a big part of the hashindex sounds evil. Guess at 95% fill rate we would run into having to always scan about 20% of all buckets. Maybe that was the real reason for the perf breakdown I was sometimes seeing? Maybe first keep the code compatible until it is accepted / merged, but we'll keep the idea for later.

Well, one way to keep it compatible and still speed up
 {
-    int didx = -1;
+    int didx = -1; // deleted index
     int start = hashindex_index(index, key);
     int idx = start;
     for(;;) {
@@ -126,6 +126,7 @@ hashindex_lookup(HashIndex *index, const void *key)
         }
         else if(BUCKET_MATCHES_KEY(index, idx, key)) {
             if (didx != -1) {
+                /* we found a toombstone earlier, so we can move this key on top of it */

Thanks a lot for adding all these comments, very helpful! Typo: tombstone.

                 memcpy(BUCKET_ADDR(index, didx), BUCKET_ADDR(index, idx), index->bucket_size);
                 BUCKET_MARK_DELETED(index, idx);
                 idx = didx;
@@ -376,31 +377,69 @@ hashindex_get(HashIndex *index, const void *key) | |
return BUCKET_ADDR(index, idx) + index->key_size; | ||
} | ||
|
||
inline | ||
int | ||
distance(HashIndex *index, int current_idx, int ideal_idx) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nitpick: only 1 blank before *index. |
||
{ | ||
/* If the current index is smaller than the ideal index we've wrapped | ||
around the end of the bucket array and need to compensate for that. */ | ||
return current_idx - ideal_idx + ( (current_idx < ideal_idx) ? index->num_buckets : 0 ); | ||
} | ||
|
 static int
 hashindex_set(HashIndex *index, const void *key, const void *value)
 {
     int idx = hashindex_lookup(index, key);
     uint8_t *ptr;
+    uint8_t *bucket_ptr;
+    int offset = 0;
+    int other_offset;
+    void *bucket = malloc(index->key_size + index->value_size);
+    void *buffer = malloc(index->key_size + index->value_size);

It would be cool if we could somehow not allocate/free bucket and buffer once per hashindex_set operation. Maybe the index data structure could have 2 pointers to such buffers that are just allocated once?
     if(idx < 0)
     {
+        /* we don't have the key in the index
+           we need to find an appropriate address */

I am not sure whether the C code follows some specific coding style, but the max line length for Python code in borg is 120, thus such a comment would be just a single line.

         if(index->num_entries > index->upper_limit) {
             /* we need to grow the hashindex */
             if(!hashindex_resize(index, grow_size(index->num_buckets))) {
                 return 0;
             }
         }
         idx = hashindex_index(index, key);
+        memcpy(bucket, key, index->key_size);
+        memcpy(bucket + index->key_size, value, index->value_size);
+        bucket_ptr = BUCKET_ADDR(index, idx);
+        while(!BUCKET_IS_EMPTY(index, idx) && !BUCKET_IS_DELETED(index, idx)) {
+            /* we have a collision */

This loop skips over some number of entries, then continues by displacing other entries. Our inserted entry will take the bucket of the first displaced entry. Instead, you can forward-shift all buckets from the first displaced entry until the end of the chunk. Tell me if something in my explanation is unclear. I'm using different names; my word for "distance" is "displacement".

So basically take the block (the entire contiguous section of buckets up to the first empty bucket) and

Not sure what you mean by absolute value. The offsets are the relative distance from the ideal bucket a key would be in vs which bucket it is now, so I think it's the same as what you call "displacement".

Indeed, displacement is a better name for it. Thanks for the feedback!
+            other_offset = distance(
+                index, idx, hashindex_index(index, bucket_ptr));
+            if ( other_offset < offset) {

Nitpick: no blank after (.

+                /* Swap the bucket at idx with the current key/value pair.
+                   This is the gist of hobin-hood hashing, we rob from
+                   the key with the lower distance to it's optimal address
+                   by swaping places with it.
+                */

Typo: ... its ... swapping ...

Also, hobbin-hood... seems I should actually use my spell checker :)

Hmm, just noticed: is it robin' hood because he is robbing? :D Now a native speaker must explain!

+                memcpy(buffer, bucket_ptr, (index->key_size + index->value_size));
+                memcpy(bucket_ptr, bucket, (index->key_size + index->value_size));
+                memcpy(bucket , buffer, (index->key_size + index->value_size));

Nitpick: no blank after bucket.

Move this into an (inline) memxchg(a, b, tmp, size) function?

I went for memswap, since it felt like a better name.
+                offset = other_offset;
+            } else {
+                offset++;

Hmm, you increment offset in this case, as you also increment idx and bucket_ptr below. But shouldn't the offset derived from other_offset (425) also get incremented for the same reason?

Good catch!

+            }

Instead of the last 4 lines, you could do:
+            idx = (idx + 1) % index->num_buckets;
+            bucket_ptr = BUCKET_ADDR(index, idx);
+        }
-        ptr = BUCKET_ADDR(index, idx);
-        memcpy(ptr, key, index->key_size);
-        memcpy(ptr + index->key_size, value, index->value_size);
+        memcpy(bucket_ptr, bucket, (index->key_size + index->value_size));
         index->num_entries += 1;
     }
     else
     {
+        /* we already have the key in the index
+           we just need to update it's value */

Typo: its.
         memcpy(BUCKET_ADDR(index, idx) + index->key_size, value, index->value_size);
     }
+    free(buffer);
+    free(bucket);
     return 1;

I read through the rest of this function and it looks correct.

As I mentioned previously, deleting would be the tricky bit. If we are deleting a key with maximum
The only requirement for doing this, as I said in a previous comment, would

UPDATED: "The search can also be stopped if during the linear probing, a bucket is encountered for which the distance to the initial bucket in the linear probing is smaller than the DIB of the entry it contains. Indeed, if the entry being searched were in the hash table, then it would have been inserted at that location, since the DIB of the entry in that location is smaller."

On Sep 11, 2016 18:02, "TW" [email protected] wrote:

The article's author seems unimpressed by Robin Hood in his first article, but he was able to improve it: http://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/ has the same trick as in the other article.
 }
@@ -30,7 +30,7 @@
     ChunkerTestCase,
 ]

-SELFTEST_COUNT = 29
+SELFTEST_COUNT = 27


 class SelfTestResult(TestResult):
Why is this changed? Also, make sure to update the comment.
Minor nitpicking: don't put a leading zero if the definition above doesn't have it.
The point of Robin Hood hashing is to minimize the worst case for collisions by spreading the pain across all addresses. This should allow high loads in the hash table without performance degrading much. Also, I should add that the idea for this change isn't mine: @ThomasWaldmann suggested it as something interesting to do at the EuroPython sprints.
I intentionally didn't update the comments until I've run some benchmarks to find an appropriate value.
Will do. BTW, the code style for C in this project isn't 100% clear to me, so if there are any other style no-nos in my PR, please let me know.
Thanks for working on this! :)
If you change HASH_MAX_LOAD, do a full text search for it; there is another place depending on its value.
Had a look. All I can see is the comment next to the value and docs/internals.rst. I would update both once I identify a good value for this constant. Let me know if there are any places I've missed.
search for 1.35 in the source.