How to find "equivalent" texts?

Question

I want to find (not generate) 2 text strings such that, after removing all non-letters and uppercasing, one can be translated into the other by a simple substitution.

The motivation comes from a project I know of that tests methods for attacking ciphers via probability distributions. I would like to find a large, coherent plaintext that, once encrypted with a simple substitution cipher, can be decrypted to something else that is also coherent.

This ends up as 2 parts: finding the longest such strings in a corpus, and getting that corpus.

The first part seems to me amenable to some sort of B-tree keyed on the string after a substitution that makes the sequence of first occurrences sequential.

HELLOWORLDTHISISIT
1233454637819a9a98

A small optimization is possible from knowing the maximum symbol value and the string length at each tree depth; the rest is just coding.
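To make the first part concrete, here is a minimal sketch of the first-occurrence encoding, plus a brute-force search over fixed-length windows. It is not the B-tree approach described above: a plain dictionary keyed on the pattern stands in for the tree, and the names `normalize`, `pattern` and `equivalent_windows` are made up for illustration.

```python
import string
from collections import defaultdict

# Symbols used for the pattern: 1..9, then a..z (35 symbols, more than the 26 needed).
SYMBOLS = "123456789" + string.ascii_lowercase


def normalize(text):
    """Uppercase the text and keep only the letters A-Z."""
    return "".join(ch for ch in text.upper() if ch in string.ascii_uppercase)


def pattern(letters):
    """Replace each letter by a symbol for the order in which it first appeared."""
    first_seen = {}
    out = []
    for ch in letters:
        if ch not in first_seen:
            first_seen[ch] = SYMBOLS[len(first_seen)]
        out.append(first_seen[ch])
    return "".join(out)


def equivalent_windows(corpus, length):
    """Group all windows of `length` letters by pattern.

    Returns only the patterns shared by at least two distinct windows,
    i.e. windows that map onto each other under some letter substitution.
    """
    letters = normalize(corpus)
    groups = defaultdict(set)
    for i in range(len(letters) - length + 1):
        window = letters[i:i + length]
        groups[pattern(window)].add(window)
    return {p: sorted(ws) for p, ws in groups.items() if len(ws) > 1}


print(equivalent_windows("hello jelly", 5))
```

Two strings are equivalent exactly when their patterns match: HELLO and JELLY both encode to 12334, so the call above reports them as a pair. Finding the longest match would mean repeating the search with increasing `length` until no group has more than one member.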

The other part is a little more complex: how do I build a large corpus of text to search? Some kind of web spider would seem to be the perfect approach, as it would have access to the largest amount of text, but how do I strip the pages down to just the text?
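One possible answer to the stripping step, as a rough sketch only: Python's standard `html.parser` can collect the text nodes of a page while skipping `script` and `style` contents. The class name `TextExtractor` and the helper `html_to_letters` are hypothetical, and real crawled pages would likely need more care (navigation menus, encodings, non-English text).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_letters(html):
    """Return the page text reduced to uppercase A-Z only."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return "".join(ch for ch in text.upper() if "A" <= ch <= "Z")


page = "<html><body><p>Hello, <b>world</b>!</p><script>var x = 1;</script></body></html>"
print(html_to_letters(page))
```

The result is already in the letters-only, uppercased form that the substitution search needs.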

So the question is: any ideas on how to do this better?

Edit: The cipher in use is an insanely basic 26-letter substitution cipher.

P.S. This is more of a thought experiment than a likely real project for me.

+1

data-structures web-crawler data-mining

BCS 06 dec. '08 at 20:53

2 answers

I think you are asking a bit much in wanting a substitution whose result is also "coherent". Deciding which text is coherent is an AI problem for the encryption algorithm. Also, the longer your text is, the harder it becomes to produce a "coherent" result, quickly approaching the point where you need a "key" as long as the text you are encrypting, which defeats the purpose of encrypting it.

0

SoapBox 07 dec. '08 at 21:52

Darius Bacon · Accepted Answer · 2008-12-08T01:16:15+0000

There are 26! different substitution ciphers. That works out to a little over 88 bits of choice:

>>> import math
>>> math.log(math.factorial(26), 2)
88.381953327016262

The entropy of English text is at least 2 bits per character. So it seems to me that you cannot reasonably expect to find passages of more than 45-50 characters that are accidentally equivalent under substitution.
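The arithmetic behind that estimate can be sketched as follows, taking the 2 bits per character figure above as an assumed input rather than a measured value. Dividing the bits of key choice by the bits each character carries gives the length at which the key is used up.

```python
import math

# Bits of choice in picking one of the 26! substitution alphabets.
key_bits = math.log2(math.factorial(26))

# Assumption: entropy of English in bits per character (the lower bound quoted above).
bits_per_char = 2.0

# Passage length at which the text carries as many bits as the key.
max_chars = key_bits / bits_per_char

print(round(key_bits, 2), round(max_chars, 1))
```

This comes out at about 44 characters, in the same range as the 45-50 quoted above.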

For a large corpus, there are Project Gutenberg and Wikipedia, for a start. You can download all of the English Wikipedia articles from the site.
