Finding patterns in these numbers

I am currently working on a project in which I have a dataset produced by a specific algorithm. I have to find the pattern behind it.

 1    355138022809833    RUPQ730562P    247001    20578330    70175500    
 2    355138022809841    RUPQ730563D    247001    72754950    71957850    
 3    355138023475287    RVSQ831978E    247001    39374170    25101090    
 4    355138023475295    RVSQ831979F    247001    06260280    87190670    
 5    355138023475303    RVSQ831980L    247001    05025410    26440510    
 6    355138023475352    RVSQ831985Y    247001    96637700    48209200    
 7    355138023475360    RVSQ831986A    247001    27362620    70790740    
 8    355138023475378    RVSQ831987P    247001    16576600    30002180    
 9    355138023475386    RVSQ831988D    247001    74778020    98010580      
10    355138023475402    RVSQ831990M    247001    25716170    97946520    

      

The first column is the serial number. The next 3 columns are the input to be given. The next 2 are the output of the algorithm.

So basically

I have 3 variables x, y, z (the 2nd, 3rd and 4th columns of the data above)

AND

y1 = f1(x, y, z)

y2 = f2(x, y, z)

      

y1 - the 5th column of the data above

y2 - the 6th column of the data above

I have the data above. Now I need to find the functions f1 and f2.

What procedure should I follow? What steps need to be taken?

EDIT 1 Krishna Kant Sharma

I did not post this question to ask for the algorithm as an answer. I only asked for the steps needed to solve this kind of problem when the variables also contain alphabets. For the first time in my experience, a small part of the Stack Overflow community has behaved like introverts. What is the point of Stack Overflow? We are here to help each other understand and solve problems, and to lend a hand when one of us needs it. So why don't we stop arguing over technical purity (for example, that alphabets are not alphabetic characters) and solve the main problem instead?

Additional information

11   355138023475436  RVSQ831993L   247001   07481830   49057990 
12   355138023475444  RVSQ831994T   247001   65090950   87729430 
13   355138023475451  RVSQ831995B   247001   06689330   60021180 
14   355138023475469  RVSQ831996K   247001   05784310   69836640 
15   355138023475477  RVSQ831997Z   247001   13157740   35850670 
16   355138023475485  RVSQ831998Y   247001   68658020   77311320 
17   355138023475501  RVSQ832000N   247001   01567780   26994970 
18   355138023475519  RVSQ832001E   247001   43775370   58120770 
19   355138023475527  RVSQ832002F   247001   42463550   55145190 
20   355138023475535  RVSQ832003R   247001   85766840   15491950    

      

+2




9 replies


Sorry, but I think what you are asking is not possible (from a computational point of view).

The system this data comes from could be doing, say,

SELECT Y1, Y2 FROM my_secret_data WHERE Col1 = x AND Col2 = y AND Col3 = z;

      

where my_secret_data contains values that are not derived from any calculation.

So, if you don't have the underlying table, you can never find an algorithm that reproduces it (unless you had every combination of inputs and outputs, which would mean rebuilding the entire table).



Short of that, all I think you can do is look for patterns, try to figure out what the input and output values represent, and see where that gets you.

Edit:

All is not lost in certain situations: things would be different if the inputs, the outputs and every function used by the algorithm were continuous (given that the inputs are alphanumeric, that does not look like the case here).

If they were, you could (probably) approximate the algorithm using interpolation (perhaps a neural network), but even then I think you would need a lot more sample data.

+4




The first step you should take is familiarizing yourself with the context of this input. Then you will have the opportunity to make assumptions about what the result columns might be and what algorithms / functions are commonly used in this context.

The next step is to analyze the input data itself, looking for patterns and for similarities to real-world things (like zip codes, serial numbers, dates, etc.). To do that, look at the different parts of each input, but also at blocks that are similar across inputs.

If the previous steps get you nowhere, you will have no choice but trial and error. You can rule out some functions or algorithms by looking at the input (for example, the letters make typical math functions useless, so maybe some hash function is involved).



To shed some light on your input:

  • the last character of x (and also of y) looks like a check character, so if you are looking for patterns, examine the number / text without its last character
  • the letters in y could be a common abbreviation for currencies, business processes, or something else.
  • the trailing 0 in the result columns could also be a check digit, or could depend on the z column (not enough data to tell)
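One cheap way to test the check-character idea for x is the Luhn algorithm, a common check-digit scheme for numeric identifiers. This is only a sketch under the assumption that x might use Luhn; nothing in the question confirms it, and the name luhn_valid is made up for illustration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    total = 0
    # Walk the digits from right to left and double every second one.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0

# First three x values from the question.
xs = ["355138022809833", "355138022809841", "355138023475287"]
print([luhn_valid(x) for x in xs])
```

If every x passes, the last digit carries no independent information and can be dropped before looking for patterns.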

I would try some (general-purpose) hash functions on various combinations of the inputs, reduce each result to 8 digits, and look for matches.
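Such a trial could look roughly like this. The choice of hash functions, the input combinations and the reduction to 8 digits (integer value modulo 10**8) are all guesses for illustration, not anything known about the real algorithm.

```python
import hashlib

def candidates(x: str, y: str, z: str):
    """Yield (label, 8-digit string) for a few hash / input combinations."""
    inputs = {"x": x, "y": y, "x+y": x + y, "x+y+z": x + y + z}
    for algo in ("md5", "sha1", "sha256"):
        for name, text in inputs.items():
            digest = hashlib.new(algo, text.encode("ascii")).hexdigest()
            # One arbitrary reduction to 8 decimal digits (an assumption).
            yield f"{algo}({name})", f"{int(digest, 16) % 10**8:08d}"

# Row 1 of the question: x, y, z, y1, y2.
row = ("355138022809833", "RUPQ730562P", "247001", "20578330", "70175500")
hits = [label for label, value in candidates(*row[:3]) if value in row[3:]]
print(hits)  # labels of candidates that reproduce y1 or y2, if any
```

An empty list only rules out these particular combinations; a hit on one row would still have to be confirmed on the other rows.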

+5




Look at lines 5 and 6. All 3 inputs are the same and yet the result is different. I don't think this can be solved with only the data you gave us.

+3




Where do you get them and how are they used?

If the entity that uses the real f1 and f2 uses them for authentication, and is reasonably competent, f1 and f2 will be cryptographically secure functions and you have little or no chance of breaking them. If they are just checksums, you can try generic checksum algorithms. I would start with Wikipedia.

How much data can you get? Can you get the f1 and f2 values for any input, or are you limited to what you can observe? If you can observe the values for inputs that differ by one character, you can see how much the output changes. If the results are mostly the same, it is not a cryptographic hash and you have a better chance.

How important is this to you and your company? I would say that you have very little chance of success, unless there is more to this than I can see now. It is very likely that any solution will require a lot of brute-force searching.

By the way, don't use all the data in your derivation process. You can always come up with some function that matches any given set of data points. Hold some data back for testing, to see whether your derived function works on data it has not seen.

Finally, is it even ethical? You haven't told us where these numbers come from, and it seems plausible that they belong to someone else and were designed for security, and your intentions may be good or bad. This is something to think about, if only because a company that behaves unethically towards others may well behave unethically towards its employees.
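The sample has no two inputs that differ by exactly one character, so the nearest available check is to compare outputs for consecutive serial numbers. A minimal sketch (the helper name digit_changes is invented here):

```python
def digit_changes(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(1 for p, q in zip(a, b) if p != q)

# y1 for rows 7 and 8 of the question; their x and y values are
# consecutive apart from the trailing check character.
y1_row7, y1_row8 = "27362620", "16576600"
print(digit_changes(y1_row7, y1_row8), "of", len(y1_row7), "digits differ")
```

If nearly every digit changes for neighbouring inputs, the function mixes its input thoroughly; if only a few change, a simpler structure is more likely.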
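Holding data back can be as simple as the following sketch. The function names, the 30% test fraction and the row layout (x, y, z, y1) are assumptions made for the example.

```python
import random

def split_rows(rows, test_fraction=0.3, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(f, rows):
    """Fraction of rows (x, y, z, y1) for which f(x, y, z) == y1."""
    return sum(f(x, y, z) == y1 for x, y, z, y1 in rows) / len(rows)
```

Derive the candidate function from the train part only, then call accuracy on the test part; a function that scores well on train but poorly on test has merely memorised the data.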

+2




You can always define a piecewise function:

f1 (355138022809833, RUPQ730562P, 247001) = 20578330
f1 (355138022809841, RUPQ730563D, 247001) = 72754950

etc., since you don't need continuity.
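In code, such a piecewise definition is just a lookup table. A minimal sketch using the first two rows of the question:

```python
# Known (x, y, z) -> y1 pairs, taken from rows 1 and 2 of the question.
TABLE = {
    ("355138022809833", "RUPQ730562P", "247001"): "20578330",
    ("355138022809841", "RUPQ730563D", "247001"): "72754950",
}

def f1(x: str, y: str, z: str) -> str:
    """Lookup-table 'function': defined only for inputs already seen."""
    return TABLE[(x, y, z)]

print(f1("355138022809833", "RUPQ730562P", "247001"))
```

It raises KeyError for any input that is not in the table, which is exactly why this kind of definition has no predictive value.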

0




Given that the z-value does not change in your sample data, it can be removed. There is really no general approach to solving this question if you only have data. On the other hand, if you have the ability to test a function with arbitrary inputs of your own design, you can use techniques similar to differential cryptanalysis.
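Finding columns that never change is easy to automate. A small sketch over the first three rows of the question (the helper name is invented for illustration):

```python
def constant_columns(rows):
    """Return the indices of columns whose value is the same in every row."""
    return [i for i, col in enumerate(zip(*rows)) if len(set(col)) == 1]

# Columns: x, y, z, y1, y2 (rows 1 to 3 of the question).
rows = [
    ("355138022809833", "RUPQ730562P", "247001", "20578330", "70175500"),
    ("355138022809841", "RUPQ730563D", "247001", "72754950", "71957850"),
    ("355138023475287", "RVSQ831978E", "247001", "39374170", "25101090"),
]
print(constant_columns(rows))
```

A constant input column cannot be told apart from a constant baked into the function, so it tells you nothing until samples with a different z turn up.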

0




This looks like a problem that would be best solved by a neural network. Hopefully you can get a broader set of training data though!

0




If you are sure there is a pattern, you can try machine learning. But your dataset for setting up and training the "machine" is pretty small (only 10 rows). Moreover, you need to predict values, so several kinds of algorithm, such as clustering, classification and smart integration, won't work for you. Neural networks would be one technique that does predict, so this is an option you can try. Unfortunately, I am not an expert in machine learning and data mining and cannot tell you how. For Java, see WEKA.

0




Look at the data from different angles:

  • Which values occur, and which digits.

  • Look at the differences. They show that x and y are sequential and that the two are closely related once you remove the last digit / letter.

  • Look for patterns. y1 starts with 06 and then 05 on lines 4, 5 and again on lines 13, 14. The difference between the two pairs of serial numbers, without the check digit, is 16. This might be a coincidence or it might not.

  • Run statistical tests (not much data here).

  • Look at the data in different number systems (hex, binary).

  • Have a look at the prime factorization of the numbers.

  • Look at the effect of small differences in data.

  • You may want to exclude the first two lines at first because their serial numbers are far from the others, which might obscure a possible pattern.
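The difference check from the list above can be reproduced with a few lines. The layout assumed for y (4 letters, 6 digits, 1 check letter) is inferred from the sample only, and the helper names are invented.

```python
def x_serial(x: str) -> int:
    """x without its last (check) digit, as an integer."""
    return int(x[:-1])

def y_serial(y: str) -> int:
    """Numeric part of y: assumes 4 letters, 6 digits, 1 check letter."""
    return int(y[4:-1])

# Rows 4, 5, 13 and 14 of the question: (x, y, y1).
rows = {
    4: ("355138023475295", "RVSQ831979F", "06260280"),
    5: ("355138023475303", "RVSQ831980L", "05025410"),
    13: ("355138023475451", "RVSQ831995B", "06689330"),
    14: ("355138023475469", "RVSQ831996K", "05784310"),
}

for a, b in [(4, 13), (5, 14)]:
    dx = x_serial(rows[b][0]) - x_serial(rows[a][0])
    dy = y_serial(rows[b][1]) - y_serial(rows[a][1])
    # Also show the serial in hex, as one of the alternative number systems.
    print(a, b, dx, dy, hex(x_serial(rows[a][0])))
```

Both pairs are 16 apart in x and in y, matching the observation about the 06 / 05 prefixes; four rows are far too few to say whether that is meaningful.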

Try to learn as much background as possible about how the values are computed.

Some knowledge of cryptanalysis won't hurt either.

Then, form a working hypothesis about how the y1 and y2 values are calculated, and test it. For example, the first thing I would check is some bit mixing with shift and xor (possibly a CRC), or some linear function of the serial number mod 10,000,000, disregarding the trailing zeros.
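As one example of testing such a hypothesis: if y1 without its trailing 0 were (a * s + b) mod 10,000,000, where s is x without its check digit, then moving from s to s + 1 would always change y1 by the same amount mod 10,000,000. The sketch below checks that on three pairs of consecutive rows; the exact form of the hypothesis is my own reading of the idea, not something known about the data.

```python
M = 10**7

def core(value: str) -> int:
    """Drop the trailing 0 from an 8-digit output and return the rest as an int."""
    return int(value[:-1])

# (y1 at serial s, y1 at serial s + 1) for rows 1-2, 4-5 and 7-8 of the question.
pairs = [
    ("20578330", "72754950"),
    ("06260280", "05025410"),
    ("27362620", "16576600"),
]

# Under the linear hypothesis every step would equal a mod M.
steps = [(core(after) - core(before)) % M for before, after in pairs]
print(steps, "consistent" if len(set(steps)) == 1 else "not a single linear step")
```

On these three pairs the steps differ, so this exact form does not fit; variants (another modulus, y as the serial, both outputs combined) can be tested the same way.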

Rinse and repeat. If you have enough patience and it's not that hard, you can find it.

0








