
Finding a Random Seed That Solves a LeetCode Problem

A side hobby of mine is solving LeetCode questions in unintended ways – most often via convoluted one-liners. This style of self-imposed constraint makes the questions more fun and forces me to think outside the box for solutions, invoking the feeling that "necessity is the mother of invention".

A recent LeetCode daily challenge was as follows (slightly simplified for clarity):

Given a list of $k$ unique bitstrings, each of length $k$, generate a new length-$k$ bitstring that is not in the list.

For example, given a list ["010", "110", "111"], a possible solution could be "001". The LeetCode problem has a large test suite (of 183 test cases), with $1 \le k \le 16$, and the exact problem statement can be found here.

I solved this by finding a random seed such that randomly generated bitstrings solved all the test cases. Here's the code:

class Solution:
    def findDifferentBinaryString(self, nums: List[str]) -> str:
        random.seed((69299878 + sum(ord(c)*(i*j+111) for (i, n) in enumerate(nums) for (j, c) in enumerate(n))) % 999999999)
        return ''.join(random.choice('01') for _ in nums)

Feel free to try this solution yourself (it should work unless LeetCode has updated their test suite; please let me know if that is the case).

Here's how I did it.

Background

There are a number of ways to solve this problem, with the most elegant solution being derived from Cantor's diagonalization argument. Roughly: for the $i$-th bitstring, take the opposite of its $i$-th bit as the $i$-th bit of the output, so the output differs from every bitstring in the list in at least one position. In code, it would be implemented as:

def find_new_string_cantor(bitstrings):
    return ''.join('0' if n[i] == '1' else '1' for (i, n) in enumerate(bitstrings))
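
For the example input from earlier, this flips the diagonal bits ('0', '1', '1') and returns "100", which indeed does not appear in the list:

print(find_new_string_cantor(["010", "110", "111"]))  # -> "100"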

As for random generation, the first thing to address is: how did I know this would even be remotely possible? Surely the odds of randomly getting each case correct are so low that I would never find a solution before getting banned by LeetCode for too many repeated submissions. Solving it this way is akin to flipping 183 (biased) coins and getting 183 heads (the chance of that with a fair coin being $\left(\frac{1}{2}\right)^{183} \approx 8.16\times 10^{-56}$). Fortunately for us, the word biased is doing a lot of work in that sentence, and the actual odds for a realistic test suite are much better.

Suppose we have chosen some $k$. There are $2^k$ possible bitstrings of length $k$, while our list of bitstrings only contains $k$ elements. That means picking a length-$k$ bitstring at random has only a $\frac{k}{2^k}$ chance of already being in the list. This number goes to zero very quickly, as evidenced in the following table:

k    2^k    k / 2^k        k     2^k     k / 2^k
1      2    0.5            9     512     0.01758
2      4    0.5           10    1024     0.00977
3      8    0.375         11    2048     0.00537
4     16    0.25          12    4096     0.00293
5     32    0.15625       13    8192     0.00159
6     64    0.09375       14   16384     0.00085
7    128    0.05469       15   32768     0.00045
8    256    0.03125       16   65536     0.00024
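
These values are easy to recompute yourself; here is a quick sketch (mine, not part of the solution):

# collision probability of one random length-k bitstring against a
# worst-case list of k distinct bitstrings
for k in range(1, 17):
    print(k, 2**k, k / 2**k)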

And, for each $k$, there are ${2^k \choose k}$ possible ways to select $k$ bitstrings. For large $k$, this number grows very quickly, but for smaller $k$ it is a little more manageable. For example, for $k = 2$ there are just ${4 \choose 2} = 6$ possible inputs. Since the probability of randomly generating a valid answer for a $k=2$ input is $\frac{1}{2}$, randomly selecting a bitstring for each of the six length-$2$ inputs would solve all of them with probability $\frac{1}{2^6} = \frac{1}{64}$.

As $k$ gets larger, even including a large number of test cases doesn't significantly reduce the probability of randomly succeeding in all trials. Even given 100 examples of $k = 10$ test cases, you would expect to solve all of them via random chance more than $30\%$ of the time, since $(1 - \frac{10}{2^{10}})^{100} \approx 0.375$.
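
Both of those numbers can be checked with a couple of lines. The helper below is a sketch of mine (the name and argument layout are not from the original post):

def prob_pass_all(k, num_cases):
    # chance that one uniformly random length-k bitstring avoids all k
    # strings in a test case, raised to the number of such test cases
    return (1 - k / 2**k) ** num_cases

print(prob_pass_all(2, 6))     # 1/64 ≈ 0.0156: all six k = 2 inputs
print(prob_pass_all(10, 100))  # ≈ 0.375: one hundred k = 10 inputs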

This understanding of the likelihood of randomly solving large test sets is what gave me confidence that an appropriate seed could be found in a reasonable amount of time.

Implementation

One important aspect of the implementation is that the random seed must depend on the input somehow. The reason is that, for example, if all $6$ of the $k=2$ inputs were present in the test suite, then there is no static seed that could possibly solve them all simultaneously, as a static seed would produce the same output for every input, and this output must be present in some of the input lists.
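
As a concrete illustration of that (my own check, with an arbitrary fixed seed): every length-2 bitstring appears in exactly 3 of the 6 possible $k=2$ input lists, so whatever string a static seed produces, it is wrong for 3 of them.

import itertools
import random

all_strings = ['00', '01', '10', '11']
# all (4 choose 2) = 6 possible k = 2 inputs, as unordered pairs
inputs = [list(pair) for pair in itertools.combinations(all_strings, 2)]

random.seed(12345)  # an arbitrary fixed, input-independent seed
output = ''.join(random.choice('01') for _ in range(2))
# the same output is produced for every input, and it appears in 3 of them
print(sum(output in bitstrings for bitstrings in inputs))  # -> 3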

My first attempt to construct a seed from the input was to use Python's builtin hash to map the input to a scalar, like sum(hash(b) for b in bitstrings). However, in Python 3, string hashing is randomized across interpreter runs, so I had to come up with a different hash function that I knew would be deterministic. Fortunately, the quality of the hash doesn't matter much in this case, so I chose a simple one that forms a hash from each bitstring's characters and their position in the list: sum(ord(c)*j*i for (i, b) in enumerate(bitstrings) for (j, c) in enumerate(b)).
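
Written out as a standalone function (the name is mine), the hash is just a position-weighted sum of character codes:

def input_hash(bitstrings):
    # Deterministic across interpreter runs, unlike hash() on strings.
    # Note that every term with i == 0 or j == 0 contributes nothing,
    # which is what the +111 tweak described below addresses.
    return sum(ord(c) * j * i
               for (i, b) in enumerate(bitstrings)
               for (j, c) in enumerate(b))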

The documentation on how Python's random seed is used internally was a little vague, and since I didn't know the exact details but was certain that it could take at least a 32-bit seed, I picked the modulus 999999999, which is under $2^{32}$ but still large enough that I expected to find a valid seed.

The reason I was confident that this modulus was large enough that I was likely to find a valid seed stems from the following heuristic. I was fairly confident that all possible $k \in \{1, 2\}$ cases and a large number of $k \in \{3, 4\}$ cases would be present in the test suite, and I figured that the rest would be randomly populated with larger-$k$ test cases. Since I know the probability of randomly solving any given test case, I just calculated the probability of randomly solving a random size-183 test suite made up of test cases as described above. This came out to $p \approx 2.9\times 10^{-8}$, which means we should expect there to be about 30 suitable seeds in the range $[0, 999999999)$.
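
The arithmetic behind that estimate looks roughly like this (a sketch; the composition of the real test suite is my assumption above, not something I know):

def prob_pass_suite(case_counts):
    # case_counts maps k -> number of test cases of that size in the suite;
    # returns the probability that per-input random strings pass all of them
    p = 1.0
    for k, count in case_counts.items():
        p *= (1 - k / 2**k) ** count
    return p

# With the assumed composition this worked out to p ≈ 2.9e-8, so the expected
# number of valid seeds in [0, 999999999) is roughly:
print(2.9e-8 * 999_999_999)  # ≈ 29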

Lastly, I made a small tweak to the general hash function. I added an additive term of +111 so that the per-character multiplier is never 0 for any part of the input. In theory this was to prevent collisions (otherwise, the entire first bitstring, and the first bit of every other bitstring, would be ignored), but I discovered later that it really didn't do a great job and there were some easy collisions that could have been avoided.

Then, I added a variable additive term to the hash so that I could easily vary the hash function while brute forcing.

At the end, I had:

def find_new_string(bitstrings, seed_value):
    random.seed((seed_value + sum(ord(c)*(i*j+111) for (i, n) in enumerate(bitstrings) for (j, c) in enumerate(n))) % 999999999)
    return ''.join(random.choice('01') for _ in bitstrings)

I then simply collected a series of test cases and ran the following script:

good_seeds = []
for i in range(100_000_000):
    if all(find_new_string(bitstrings, i) not in bitstrings for bitstrings in test_suite):
        good_seeds.append(i)

When I had a collection of good seeds, I tested them on LeetCode and, if one failed, I added the failing test to my test suite and resumed my search. I eventually found an additive factor that allowed me to pass all the tests. In total, it took 42 tries for me to get the right additive factor once I had nailed down the hash function structure.

I think this search could have been sped up in several ways. The main one is that I am pretty sure there were some duplicates in the test suite with their order permuted (e.g., ["00", "11"] vs. ["11", "00"]). These inputs hash to different seed values, so they made the problem a bit harder (especially for small $k$); sorting the bitstrings before applying the hash would have resolved this. I also think a better hash function could have been used. For example, Python's seed can take a string, so I could have simply seeded the generator with the concatenation of all of the bitstrings. I think this is the biggest area that could have prevented me from solving the problem: since we expected only ~30 valid seeds in the interval used by the hash function, any hash collisions could easily have made it so that we would not find any valid seeds unless we increased the modulus.
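
A minimal sketch of those two improvements (my own code, not what I submitted): sort the bitstrings so that permuted duplicates map to the same seed, and seed directly with the resulting string, which random.seed accepts.

import random

def find_new_string_string_seed(bitstrings, salt=''):
    # sorting makes permuted duplicates (["00", "11"] vs ["11", "00"]) identical;
    # the optional salt plays the role of the variable additive term for brute forcing
    random.seed(salt + ''.join(sorted(bitstrings)))
    return ''.join(random.choice('01') for _ in bitstrings)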