Find a substring in a stream without storing the substring as plain text

Question

Find a substring in a stream without storing the substring as plain text

Suppose I have a large data stream (for example, incoming packets from the network) and I want to determine if this data contains a certain substring. There are several string search algorithms , but they require the algorithm to know the plain text string they are looking for.

Let's say the string you're looking for is a password and you don't want to store it as plain text in this search app. However, it will appear as plain text in the stream. For example, you can store the hash and length of the password. Then, for each byte in the stream, check if the next length byte data from stream hash to password hash is, you have a likely match.

This way, you can determine if the password was in the stream without knowing the password. However, hashing once per byte is not fast / efficient.

Maybe there is a clever algorithm that could find a plain text password in a stream without knowing the plain text password directly (and instead a non-reversible equivalent). Alternatively, could a low quality version of the password be used, with the risk of false positives? For example, if the search application knew only half of the password (in plain text), it might, with some error, discover the full password without knowing it.

thank

PS This question comes from a hypothetical discussion I had with some friends about warning you if your password was found in plain text on the net.

+3

string algorithm

bramp June 24. 15 at 22:02

source to share

2 answers

SAT is an NP-hard problem. Suppose your password is n characters long. If you could find a way to make a large enough SAT instance that

used a contiguous sequence of m> = n bytes from the data stream as its 8th input bits, and
output pin 1 if and only if the bits present on its inputs contain your password starting at an offset that is a multiple of 8 bits

and then, "working" with that SAT instance as a schema, you have a password detector that is (at least potentially) very difficult to "invert".

In a sense, what you want is the opposite of boolean logic minimization . You want the largest, hairy path (ideal for some theoretically grounded notions of size and hairiness :)) that computes the truth table. It's easy enough to think of ways to keep the original CNF boolean formula, for example, if you have two clauses A and B, you can always safely add a new clause consisting of all literals in either A or B, but it's probably much more difficult to come up with ways to express the formula in such a way as to confuse the modern SAT solver, as much research has begun to make these programs super efficient at finding and using all kinds of structure in a problem.

One possible way to inject "complications" is to make circuit computation functions that are difficult to compute, such as divisions or square roots, and then check the results for equality in addition to the original inputs. For example, instead of having the schema just check what X[1 .. 8n] = YOUR_PASSWORD

, check what X[1 .. 8n] = YOUR_PASSWORD AND sqrt(X[1 .. 8n]) = sqrt(YOUR_PASSWORD)

. If the SAT solver is smart enough to "see" that the first test implies the second, then it can immediately discard all sentences matching the second - but since everything is presented at a very low level with sentences, this ratio (Hopefully, as I said , modern SAT solvers are pretty amazing) are well hidden. I guess it is better to choose functions like sqrt () that are not one-to-one for integers: this could potentially trigger a SAT solution to waste time exploring seemingly promising (but ultimately wrong) solutions.

0

j_random_hacker June 25. '15 at 11:00

source to share

David Eisenstat · Accepted Answer · 2015-06-24T23:11:25+0000

You can use a low-entropy rolling hash to preview each byte, so for a lg cost of k bits of entropy, you reduce the number of calls to the cryptographic hash by a factor of k.

Find a substring in a stream without storing the substring as plain text

More articles: