Bounds on probe lengths and number of rehashes with universal hashing

Say I have a family of hash functions that is weakly universal, i.e. the probability that two distinct keys $x \neq y$ are mapped to the same hash value is bounded by $k/m$ when I have $m$ bins, where $k$ is some constant (I am mainly interested in $k=1$ and $k=2$):

$$\Pr_h\left[h(x)=h(y)\right] \leq \frac{k}{m}$$

I have $n$ keys mapped to $m$ bins, and I keep a constant bound on the load factor: $n/m \leq \alpha$.

With this setup, I am interested in putting a bound, $K$, on how many elements I allow in any bin, and rehashing whenever I have to insert a key into a bin that already holds $K$ elements. For a given bound on the load factor, $\alpha$, and a given bound on the maximum number of elements in any bin, $K$, I am interested in knowing the probability of picking a hash function that puts at most $K$ elements in each bin.

This is a setup I am interested in for teaching purposes, and I am aware that there are smarter solutions to hash tables, but I would like to handle this setup before moving on to them. In my lecture notes, I have already described rehashing and just want to take this to a probabilistic analysis place, so I want to know the expected number of rehashes needed in this setup.

I would actually also love to know if I could put a bound on the probe length in open addressing hashing and rehash when I reach that, but I guess this is a much harder question.

I guess that if I can fix the probability of hitting any given bin more than $K$ times at a constant $p$, I have a strategy for sampling my way out of the problem, where I expect to try $1/(1-p)$ hash functions before succeeding if the number of keys, $n$, and the number of bins, $m$, are fixed. But if I add keys one at a time, resize when $n/m > \alpha$, and rehash when I exceed the bound $K$ in any bin, do I have any guarantee that I will not have to rehash for every new element added after number $K+1$? I mean, a rehash could put $K$ elements in the largest bin, and the next key could then trigger another rehash. An adversary, of course, cannot pick the keys so that this happens if I sample random functions, but do I have any probabilistic bounds that guarantee that I can rehash to keep probe lengths bounded by a constant and still not rehash so often that I break the amortised constant running time of the table operations?
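To get a feel for the numbers involved, here is a minimal simulation sketch (my own illustration, not a known result): it estimates, for fixed $n$, $m$ and $K$, the probability that an idealised fully random hash function overflows some bin, which is exactly the event that forces a rehash in the setup above.

```python
import random

def rehash_probability(n, m, K, trials=1000):
    # Fraction of sampled "hash functions" (modelled as fully random
    # assignments of n keys to m bins) that put more than K keys into
    # at least one bin, i.e. that would force a rehash.
    bad = 0
    for _ in range(trials):
        loads = [0] * m
        for _ in range(n):
            b = random.randrange(m)   # idealised random hash value
            loads[b] += 1
            if loads[b] > K:          # bin already held K elements
                bad += 1
                break
    return bad / trials

# Example: load factor alpha = 1/2 and bound K = 8.
p = rehash_probability(n=512, m=1024, K=8)
# If p stays a constant below 1, the expected number of functions
# sampled before finding a good one is 1/(1 - p).
```

A real universal family is not fully random, so this only gives the idealised baseline; the $k/m$ collision bound is what a formal analysis would use instead.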

Hashing form fields to prevent tampering

I have a front-end form for entry creation. I have a few options I don’t want tampered with. The docs for |hash say: “Prefixes the given string with a keyed-hash message authentication code (HMAC), for securely passing data in forms that should not be tampered with.”

I’ve tried a dropdown field:


When |hash is added, the page just reloads instead of going to the correct redirect URL. So I tried a plain text field with |hash, and a hashed string is submitted instead of the original string. I’m guessing this is why the page reloads for the dropdown field.

What am I doing wrong? Am I misunderstanding how |hash works?

Also, it says “Prefixes the given string” — what if you need to protect an integer?

Intercept password prior to wp_update_user committing and hashing?

I’m using a plugin that uses wp_update_user instead of password_reset and I need to use the password prior to hashing to sync the change to a secondary authentication system.

Is there any way to do this? I suspect not, since there is no wp_before_update_user hook, but I’d really like to avoid writing a bunch of custom code when four lines would do if I could preempt wp_update_user.

Does hashing an encrypted message suffer from any drawbacks?

I am working on this problem where the set up is like this:

There are three hosts: A, B and C. A wants to send a message to C, which passes through B. At A the message is hashed, the hash is appended to the message, and then the message + hash is encrypted and sent to B. B wants to check message integrity, for which it decrypts the message, extracts the hash and then verifies it.

I have proposed an alternative. My point is that we should encrypt the message at A and then compute the hash of this encrypted message. Then we should send the encrypted message + hash. The advantage is that at B the message need not be decrypted: B can verify the hash on the encrypted message itself.

Is there some problem with my proposed strategy?
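To make the two orderings concrete, here is a toy Python sketch (the XOR "cipher" and SHA-256 are stand-ins chosen purely for illustration; the question does not fix any algorithms). It also highlights the catch with the proposed scheme: an unkeyed hash of the ciphertext can be recomputed by anyone, so B's check detects transmission errors but not deliberate tampering.

```python
import hashlib

def toy_encrypt(key: bytes, msg: bytes) -> bytes:
    # Toy XOR stream "cipher", purely for illustration -- not secure.
    # XOR is its own inverse, so this also decrypts.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(msg))

key = b"shared-key"
msg = b"pay 100 to C"

# Original scheme: hash, append, then encrypt everything.
inner = toy_encrypt(key, msg + hashlib.sha256(msg).digest())
# B must decrypt before it can check (SHA-256 digests are 32 bytes):
plain = toy_encrypt(key, inner)
assert hashlib.sha256(plain[:-32]).digest() == plain[-32:]

# Proposed scheme: encrypt first, then hash the ciphertext.
ct = toy_encrypt(key, msg)
tag = hashlib.sha256(ct).digest()
# B can verify without the key...
assert hashlib.sha256(ct).digest() == tag
# ...but so can an attacker who flips bits in ct and recomputes the tag:
tampered = ct[:-1] + bytes([ct[-1] ^ 1])
forged = (tampered, hashlib.sha256(tampered).digest())  # B would accept this pair
```

Replacing the plain hash with a keyed MAC over the ciphertext (encrypt-then-MAC) restores the integrity property while keeping the advantage that B need not decrypt.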


Markov’s Inequality and Hashing

Let $H$ be a universal family of hash functions from $U = \{0, 1, \dots, u-1\}$ into a table of size $m$. Let $S \subseteq U$ be a set of $m$ elements we wish to hash. Prove that with probability at least $3/4$, no list in the table has more than $1 + 2\sqrt{m}$ elements.

[Hint: To solve this question, you should use “Markov’s inequality”. Markov’s inequality is a fancy name for a pretty obvious fact: if you have a non-negative random variable $X$ with expectation $E[X]$, then for any $k > 0$, $\Pr(X > k\,E[X]) \leq 1/k$. For instance, the chance that $X$ is more than 100 times its expectation is at most $1/100$.]
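One way to start, following the hint (a sketch of the setup, not a full solution): let $X$ count the colliding pairs, i.e. pairs $\{x, y\} \subseteq S$ with $x \neq y$ and $h(x) = h(y)$. Universality and linearity of expectation give

$$E[X] \;=\; \sum_{\{x,y\} \subseteq S,\, x \neq y} \Pr_h\left[h(x)=h(y)\right] \;\leq\; \binom{m}{2}\frac{1}{m} \;<\; \frac{m}{2}.$$

Applying Markov's inequality to $X$, and observing that a single list with more than $1 + 2\sqrt{m}$ elements already contributes many colliding pairs on its own, leads to the claimed $3/4$ bound.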

Hook into all password resets in WordPress and get password before hashing?

I’m syncing my WordPress authentication system with a secondary/external authentication system, and my site has at least two ways of resetting the password, including:

  1. Password reset email
  2. User account screen password reset

There may be some third way I’m not recalling, as I’ve disallowed password resets thus far due to my inability to sync the systems.

The crux of my question: How can I hook into the password reset prior to hashing so that I can simultaneously set the new password on the secondary/external authentication system?


Non-essential additional info:

  • User resets password when using password reset email, not auto-generated
  • User account screen is not the WordPress dashboard, but a plugin implementation

The two systems use different hashing mechanisms, so a post-hashing hook won’t help. I know it’s inconvenient, but it’s what I’ve got.

I can see ways to perhaps do it on a per form/method basis, but this would be really miserable to maintain as the platform grows.

How to determine hashing bit length for multiple categorical features?

Say I have $N$ categorical features $f_i$, $i \in \{1,\dots,N\}$, each with a different alphabet size $n_i$. How can I efficiently optimize the hashing trick on that feature vector? Should I tune the hash bit length for each feature independently (greedily)? Should I assume no hashing is necessary for features with small alphabets (small $n_i$)? What is the best-practice strategy?
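There is no single canonical answer, but a common back-of-the-envelope approach is a per-feature birthday-bound budget. Here is a small sketch (the helper name and the collision budget `eps` are my own choices for illustration, not an established API):

```python
def bits_needed(n_i, eps=0.01):
    # Smallest hash width b (in bits) such that the expected number of
    # colliding pairs among n_i distinct values, C(n_i, 2) / 2^b, stays
    # below the budget eps (birthday-bound heuristic).
    pairs = n_i * (n_i - 1) / 2
    b = 1
    while pairs / 2**b > eps:
        b += 1
    return b

# Greedy per-feature choice for three hypothetical alphabet sizes:
alphabets = [3, 50, 10000]
widths = [bits_needed(n) for n in alphabets]
```

Note what the small-alphabet case shows: for $n_i = 3$ the heuristic demands more bits than $\lceil \log_2 n_i \rceil = 2$, i.e. hashing saves nothing there, and such features can simply be encoded directly without the hashing trick.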

Arbitrary Length Hashing

Suppose you have a hash function $\mathcal{H}$ which takes strings of length $2n$ and returns strings of length $n$, and which has the nice property of being collision resistant, i.e. it is hard to find two different strings $s \neq s'$ with the same hash $\mathcal{H}(s) = \mathcal{H}(s')$.

You would now like to build a new hash function $\mathcal{H}'$ which takes strings of arbitrary length and maps them to strings of length $n$, while still being collision resistant.

Luckily for you, a method which achieves exactly this, now known as the Merkle–Damgård construction, was published back in 1979.

The task of this challenge will be to implement this algorithm, so we’ll first have a look at a formal description of the Merkle–Damgård construction before going through a step-by-step example, which should show that the approach is simpler than it might appear at first.

Given some integer $n > 0$, a hash function $\mathcal{H}$ as described above and an input string $s$ of arbitrary length, the new hash function $\mathcal{H}'$ does the following:

  • Set $l = |s|$, the length of $s$, and split $s$ into chunks of length $n$, filling up the last chunk with trailing zeros if necessary. This yields $m = \lceil \frac{l}{n} \rceil$ many chunks, which are labeled $c_1, c_2, \dots, c_m$.
  • Add a leading and a trailing chunk $c_0$ and $c_{m+1}$, where $c_0$ is a string consisting of $n$ zeros and $c_{m+1}$ is $n$ in binary, padded with leading zeros to length $n$.
  • Now iteratively apply $\mathcal{H}$ to the current chunk $c_i$ appended to the previous result $r_{i-1}$: $r_i = \mathcal{H}(r_{i-1}c_i)$, where $r_0 = c_0$. (This step might be clearer after looking at the example below.)
  • The output of $\mathcal{H}'$ is the final result $r_{m+1}$.

The Task

Write a program or function which takes as input a positive integer $n$, a hash function $\mathcal{H}$ as a black box and a non-empty string $s$, and returns the same result as $\mathcal{H}'$ on the same inputs.

This is code-golf, so the shortest answer in each language wins.

Example

Let’s say $n = 5$, so our given hash function $\mathcal{H}$ takes strings of length 10 and returns strings of length 5.

  • Given an input of $s = \texttt{"Programming Puzzles"}$, we get the following chunks: $c_1 = \texttt{"Progr"}$, $c_2 = \texttt{"ammin"}$, $c_3 = \texttt{"g Puz"}$ and $c_4 = \texttt{"zles0"}$. Note that $c_4$ needed to be padded to length 5 with one trailing zero.
  • $c_0 = \texttt{"00000"}$ is just a string of five zeros and $c_5 = \texttt{"00101"}$ is five in binary ($\texttt{101}$), padded with two leading zeros.
  • Now the chunks are combined with $\mathcal{H}$:
    $r_0 = c_0 = \texttt{"00000"}$
    $r_1 = \mathcal{H}(r_0c_1) = \mathcal{H}(\texttt{"00000Progr"})$
    $r_2 = \mathcal{H}(r_1c_2) = \mathcal{H}(\mathcal{H}(\texttt{"00000Progr"})\texttt{"ammin"})$
    $r_3 = \mathcal{H}(r_2c_3) = \mathcal{H}(\mathcal{H}(\mathcal{H}(\texttt{"00000Progr"})\texttt{"ammin"})\texttt{"g Puz"})$
    $r_4 = \mathcal{H}(r_3c_4) = \mathcal{H}(\mathcal{H}(\mathcal{H}(\mathcal{H}(\texttt{"00000Progr"})\texttt{"ammin"})\texttt{"g Puz"})\texttt{"zles0"})$
    $r_5 = \mathcal{H}(r_4c_5) = \mathcal{H}(\mathcal{H}(\mathcal{H}(\mathcal{H}(\mathcal{H}(\texttt{"00000Progr"})\texttt{"ammin"})\texttt{"g Puz"})\texttt{"zles0"})\texttt{"00101"})$
  • $r_5$ is our output.

Let’s have a look at how this output would look depending on some choices1 for $\mathcal{H}$:

  • If $\mathcal{H}(\texttt{"0123456789"}) = \texttt{"13579"}$, i.e. $\mathcal{H}$ just returns every second character, we get:
    $r_1 = \mathcal{H}(\texttt{"00000Progr"}) = \texttt{"00Por"}$
    $r_2 = \mathcal{H}(\texttt{"00Porammin"}) = \texttt{"0oamn"}$
    $r_3 = \mathcal{H}(\texttt{"0oamng Puz"}) = \texttt{"omgPz"}$
    $r_4 = \mathcal{H}(\texttt{"omgPzzles0"}) = \texttt{"mPze0"}$
    $r_5 = \mathcal{H}(\texttt{"mPze000101"}) = \texttt{"Pe011"}$
    So $\texttt{"Pe011"}$ needs to be the output if such a $\mathcal{H}$ is given as black-box function.
  • If $\mathcal{H}$ simply returns the first 5 chars of its input, the output of $\mathcal{H}'$ is $\texttt{"00000"}$. Similarly, if $\mathcal{H}$ returns the last 5 chars, the output is $\texttt{"00101"}$.
  • If $\mathcal{H}$ multiplies the character codes of its input and returns the first five digits of this number, e.g. $\mathcal{H}(\texttt{"PPCG123456"}) = \texttt{"56613"}$, then $\mathcal{H}'(\texttt{"Programming Puzzles"}) = \texttt{"91579"}$.

1 For simplicity, those $\mathcal{H}$ are actually not collision resistant, though this does not matter for testing your submission.
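A straightforward (ungolfed) Python reference implementation of $\mathcal{H}'$, which reproduces the worked example above:

```python
from math import ceil

def md_hash(n, H, s):
    # Split s into ceil(len(s)/n) chunks of length n,
    # padding the last chunk with trailing zeros.
    m = ceil(len(s) / n)
    chunks = [s[i * n:(i + 1) * n].ljust(n, "0") for i in range(m)]
    # Trailing chunk c_{m+1}: n in binary, zero-padded to length n.
    chunks.append(format(n, "b").zfill(n))
    r = "0" * n            # r_0 = c_0, a string of n zeros
    for c in chunks:       # r_i = H(r_{i-1} + c_i)
        r = H(r + c)
    return r

# The "every second character" example hash from above:
H = lambda t: t[1::2]
print(md_hash(5, H, "Programming Puzzles"))  # -> Pe011
```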

Computing expected run time of a hashing algorithm

This is a homework question, I don’t want an actual answer, but rather guidance on how to obtain the correct answer. The question is as follows:

In class we saw universal hashing as the solution to processing N elements in the range {0,1,…u-1}, arriving
one at a time. Using universal hashing we can get O(1) processing time for each request, but in expectation. Hence
sometimes we might get unlucky and processing time might be longer. Consider another scenario where we are given
the N requests in advance. Can we process them together and store in a hash table so that any future lookup request for
any of these items can be processed in worst case O(1) time (so no uncertainty or expectation here)? In this problem
you will design a solution for this, again using universal hashing.

You are given a set S of N items in the range {0,1,…,u-1} in advance. Consider the following algorithm that stores these items into a table of size N*N.

SquareHash(S, u)
Input: A set S of N items, an integer u such that
each item is in {0,1, ... , u-1}
Output: A hash table of size N*N and a corresponding hash function
1. Initialize a table T of size N*N
2. While true
        3. Pick a function h uniformly at random from a universal family H
        that maps {0,1, ... , u-1} to {0,1, ... , N*N-1}
        // Assume that this can be done in O(1) time
        4. Insert all the elements of S into T using h as the hash function
        5. If T has no collisions, return (T,h). Otherwise empty T

Clearly if the above algorithm terminates then it will produce a table with no collisions and hence any future
lookup will be in O(1) time guaranteed. However in the worst case the algorithm might run forever. Show that
this is unlikely. In particular, compute the expected running time of the algorithm above.
[Hint: Each time a function h is picked, compute the probability (or a lower bound on it) that the resulting table
will have no collisions. Then use the fact that if a coin of bias p is flipped until it turns up Heads, the expected
number of flips is 1/p]

I understand intuitively that this is unlikely to run forever: if a hash function causes a collision, the algorithm picks another one until it finds one that does not cause any collisions. However, I am unsure how exactly to prove this formally and how to compute the expected running time. I know that the probability of two elements colliding is 1/m, where m is the size of the output range.
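Without giving away the analysis, the SquareHash pseudocode itself can be made concrete, which may help with building intuition. Here is a sketch (the $(ax + b) \bmod p \bmod M$ family and the naive prime search are my own choices for a concrete universal family; the homework allows any universal family):

```python
import random

def is_prime(x):
    # Naive trial division; fine for a sketch.
    if x < 2:
        return False
    d = 2
    while d * d <= x:
        if x % d == 0:
            return False
        d += 1
    return True

def square_hash(S, u):
    N = len(S)
    M = N * N
    p = u                              # smallest prime >= u
    while not is_prime(p):
        p += 1
    while True:                        # step 2: retry until collision-free
        a = random.randrange(1, p)     # step 3: h(x) = ((a*x + b) mod p) mod M
        b = random.randrange(p)
        h = lambda x, a=a, b=b: ((a * x + b) % p) % M
        T = [None] * M                 # steps 1 and 4: table of size N*N
        for x in S:
            i = h(x)
            if T[i] is not None:       # collision: discard T, pick a new h
                break
            T[i] = x
        else:
            return T, h                # step 5: no collisions

S = random.sample(range(10**6), 50)
T, h = square_hash(S, 10**6)
assert all(T[h(x)] == x for x in S)    # every future lookup is one probe
```

Running this a few times and counting the retries is a good sanity check on whatever bound the hint's coin-flip argument produces.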