Finding patterns in these numbers
I am working on a project in which I have a dataset produced by a specific algorithm, and I need to find the pattern behind it.
1 355138022809833 RUPQ730562P 247001 20578330 70175500
2 355138022809841 RUPQ730563D 247001 72754950 71957850
3 355138023475287 RVSQ831978E 247001 39374170 25101090
4 355138023475295 RVSQ831979F 247001 06260280 87190670
5 355138023475303 RVSQ831980L 247001 05025410 26440510
6 355138023475352 RVSQ831985Y 247001 96637700 48209200
7 355138023475360 RVSQ831986A 247001 27362620 70790740
8 355138023475378 RVSQ831987P 247001 16576600 30002180
9 355138023475386 RVSQ831988D 247001 74778020 98010580
10 355138023475402 RVSQ831990M 247001 25716170 97946520
The first column is the serial number. The next three columns are the inputs. The last two are the outputs of the algorithm.
So basically I have three variables x, y, z (the 2nd, 3rd and 4th columns of the data above), and
y1 = f1(x, y, z)
y2 = f2(x, y, z)
where y1 is the 5th column and y2 is the 6th column of the data above.
Given this data, I need to find the functions f1 and f2.
What procedure should I follow? What steps need to be taken?
EDIT 1 Krishna Kant Sharma
I did not post this question to ask for the algorithm itself. I only asked for the steps needed to solve this kind of problem when the variables also contain letters. For the first time in my experience, a small part of the Stack Overflow community acted like introverts. What is the point of Stack Overflow? We are here to help each other understand and solve problems, and to lend a hand when one of us needs it. So why don't we stop insisting on technical purity (for example, that "alphabets" are not "alphabetic characters") and address the main problem instead?
Additional information
11 355138023475436 RVSQ831993L 247001 07481830 49057990
12 355138023475444 RVSQ831994T 247001 65090950 87729430
13 355138023475451 RVSQ831995B 247001 06689330 60021180
14 355138023475469 RVSQ831996K 247001 05784310 69836640
15 355138023475477 RVSQ831997Z 247001 13157740 35850670
16 355138023475485 RVSQ831998Y 247001 68658020 77311320
17 355138023475501 RVSQ832000N 247001 01567780 26994970
18 355138023475519 RVSQ832001E 247001 43775370 58120770
19 355138023475527 RVSQ832002F 247001 42463550 55145190
20 355138023475535 RVSQ832003R 247001 85766840 15491950
Sorry, but I think what you are asking is not possible (from a computational point of view).
The system this data comes from could be doing, say,
SELECT Y1, Y2 FROM my_secret_data WHERE Col1 = x AND Col2 = y AND Col3 = z;
where my_secret_data contains values that are not derived from any calculation.
So if you don't have the underlying table, you can never find an algorithm that reproduces it (unless you had every combination of inputs and outputs, which would mean rebuilding the entire table).
Computation aside, I think all you can do is look for patterns, try to figure out what the input and output values represent, and see where that gets you.
Edit:
All is not lost in certain situations: things would be different if the inputs, the outputs and any functions used by the algorithm were continuous (given that the inputs are alphanumeric, that does not look like the case here).
If they were, you could (probably) approximate the algorithm using interpolation (perhaps with a neural network), but even then, given the range of the values, I think you would need a lot more sample data.
The first step is to familiarize yourself with the context of this input. That lets you make assumptions about what the result columns might be and which algorithms / functions are commonly used in that context.
The next step is to analyze the input data itself, looking for patterns and for similarities with real-world things (like zip codes, serial numbers, dates, etc.). To do that, look at the different parts of each input, and also compare similar input blocks with each other.
If neither of the previous steps succeeds, you have no choice but trial and error. You can rule out some functions or algorithms just by looking at the input (for example, the letters make typical math functions useless, so perhaps try some hash functions).
To shed some light on your input:
- the last character of x (and also of y) looks like a check character, so when looking for patterns, work with the number / text without its last character
- the letters in y could be a common abbreviation for currencies, business processes, or something else
- the trailing 0 in the result columns could also be a check character, or could depend on the z column (there is not enough data to tell)
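One way to test the check-character idea for x is to try a well-known check-digit scheme on it. The sketch below uses the Luhn algorithm purely as a candidate; that the data uses Luhn is an assumption being tested here, not something stated in the question.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


# x values from rows 1-3 of the question
samples = ["355138022809833", "355138022809841", "355138023475287"]
for x in samples:
    print(x, luhn_valid(x))
```

All three of these x values pass, which supports treating the last digit of x as a check digit. It says nothing about the letter at the end of y, which would need its own hypothesis.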
I would try some common hash functions on various combinations of the inputs, reduce the results to 8 digits, and look for matches.
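A minimal sketch of such a search, under several assumptions of mine: the inputs are joined as ASCII text with a simple separator, the digest is reduced to 8 digits by taking it modulo 10^8, and only three algorithms from Python's hashlib are tried. The real system may do none of these things, so an empty result rules out only these exact combinations.

```python
import hashlib
from itertools import permutations


def eight_digits(data: bytes, algo: str) -> str:
    """Reduce a hash digest to 8 decimal digits (one arbitrary reduction)."""
    digest = hashlib.new(algo, data).digest()
    return f"{int.from_bytes(digest, 'big') % 10**8:08d}"


def search(row, targets, algos=("md5", "sha1", "sha256"), seps=("", " ", ",")):
    """Yield (algo, separator, fields) for every combination that hits a target."""
    for r in range(1, len(row) + 1):
        for fields in permutations(row, r):
            for sep in seps:
                data = sep.join(fields).encode("ascii")
                for algo in algos:
                    if eight_digits(data, algo) in targets:
                        yield algo, sep, fields


row = ("355138022809833", "RUPQ730562P", "247001")  # inputs of row 1
targets = {"20578330", "70175500"}                  # y1, y2 of row 1
print(list(search(row, targets)))
```

Any hit found on one row should then be confirmed on the other rows, since with 8-digit outputs a single match can happen by chance.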
Where do you get these values, and how are they used?
If whoever uses the real f1 and f2 uses them for authentication and is reasonably competent, f1 and f2 will be cryptographically secure functions, and you have little or no chance of breaking them. If they are just checksums, you can try common checksum algorithms. I would start with Wikipedia.
How much data can you get? Can you get the f1 and f2 values for any input, or are you limited to what you can observe? If you can observe the values for inputs that differ by one character, you can see how much the output changes. If the results are mostly the same, it is not a cryptographic hash and you have a better chance.
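That comparison can be sketched as a simple count of differing digit positions. Rows 3 and 4 of the question are used as an approximation of "inputs that differ by one step", since their x and y values are consecutive once the last character is ignored; the helper name is mine.

```python
def differing_positions(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(1 for ca, cb in zip(a, b) if ca != cb)


# Outputs of rows 3 and 4 of the question.
row3 = {"y1": "39374170", "y2": "25101090"}
row4 = {"y1": "06260280", "y2": "87190670"}
for key in ("y1", "y2"):
    print(key, differing_positions(row3[key], row4[key]), "of 8 digits differ")
```

Here 7 of 8 digits differ for y1 and 6 of 8 for y2, and the unchanged final digit is the constant trailing 0. That is roughly what unrelated random digits would give, so on this pair the outputs do not look "mostly the same".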
How important is this to you and your company? I would say you have very little chance of success unless there is more to it than I can see now. Very likely, any solution will require a lot of brute-force searching.
By the way, don't use all the data in your derivation process. You can always come up with some function that fits any given set of data points. Hold back some data for testing, to see whether your derived function works on data it was not fitted to.
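A small sketch of holding data back, assuming rows are stored as (x, y, z, y1) tuples. The function names, the 30% split and the placeholder candidate are all illustrative choices of mine.

```python
import random


def split_rows(rows, test_fraction=0.3, seed=0):
    """Shuffle rows reproducibly and split them into (fit, held_out)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(candidate, rows):
    """Fraction of rows (x, y, z, y1) for which candidate(x, y, z) == y1."""
    return sum(candidate(x, y, z) == y1 for x, y, z, y1 in rows) / len(rows)


rows = [
    ("355138023475287", "RVSQ831978E", "247001", "39374170"),
    ("355138023475295", "RVSQ831979F", "247001", "06260280"),
    ("355138023475303", "RVSQ831980L", "247001", "05025410"),
    ("355138023475352", "RVSQ831985Y", "247001", "96637700"),
]
fit, held_out = split_rows(rows)
# Develop the candidate using `fit` only, then score it once on `held_out`.
placeholder = lambda x, y, z: "00000000"  # hypothetical stand-in for f1
print(accuracy(placeholder, held_out))
```

A candidate that scores perfectly on `fit` but poorly on `held_out` has only memorized the rows it was built from.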
Finally, is this even ethical? You haven't told us where these numbers come from, and it seems plausible that they are something designed for security, and your intentions may be good or bad. This is something to think about, if only because a company that behaves unethically toward others may well behave unethically toward its employees.
Given that the z value does not change in your sample data, it can be left out of the analysis. There is really no general approach to solving this question if all you have is data. On the other hand, if you are able to test the function with arbitrary inputs of your own design, you can use techniques similar to differential cryptanalysis.
If you are sure there is a pattern, you can try machine learning. But your dataset for setting up and training the "machine" is pretty small (only 10 rows each). Moreover, you need to predict values, so many algorithms such as clustering and classification will not work for you. Neural networks would be a suitable technique, and that is an option you can try. Unfortunately, I am not an expert in machine learning and data mining and cannot tell you how. For Java, see WEKA.
Look at the data from different angles:
- Look at which values and which digits occur.
- Look at the differences. They show that x and y are sequential, and that the two are closely related once you remove the last digit / letter.
- Look for patterns. y1 starts with 06, then 05, on lines 4, 5 and again on lines 13, 14. The difference between those serial numbers, ignoring the check digit, is 16. This might be a coincidence or it might not.
- Run statistical tests (there is not much data here).
- Look at the data in different number systems (hex, binary).
- Look at simple prime factorizations.
- Look at the effect of small differences in the data.
- You may want to exclude the first two lines at first, because their serial numbers are far from the others, which might obscure a possible pattern.
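A few of these views can be produced in a handful of lines. The sketch below assumes, as suggested above, that the last digit of x is a check digit and that the trailing 0 of y1 carries no information; both are guesses, and the choice of rows 3-7 is arbitrary.

```python
xs = ["355138023475287", "355138023475295", "355138023475303",
      "355138023475352", "355138023475360"]   # x, rows 3-7
y1s = ["39374170", "06260280", "05025410", "96637700", "27362620"]

# Drop the suspected check digit from x and the trailing 0 from y1.
stems = [int(x[:-1]) for x in xs]
outs = [int(y[:-1]) for y in y1s]

# Differences between consecutive x stems show how the inputs advance.
steps = [b - a for a, b in zip(stems, stems[1:])]
print("x stem differences:", steps)

for stem, out in zip(stems, outs):
    # Low digits of the input next to the output in decimal, hex and binary.
    print(f"{stem % 10**4:04d}  {out:07d}  {out:#08x}  {out:024b}")
```

Printing the columns side by side makes it easier to spot bits or digits that stay fixed, or that move in step with the input.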
Try to learn as much background as possible about the computation.
Some knowledge of cryptanalysis won't hurt either.
Then form a working hypothesis about how the y1 and y2 values are calculated, and test it. For example, the first things I would check are some bit mixing with shift and xor (possibly a CRC), or some linear function of the serial number mod 10,000,000, disregarding the trailing zeros.
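As an illustration of testing one such hypothesis, here is a sketch for the linear case, read as y1 = (a*s + b) mod 10,000,000, where s is x without its last digit and y1 is taken without its trailing zero. That exact reading is my assumption. Two rows determine a and b, and a third row checks them. It needs Python 3.8+ for the modular inverse via `pow`.

```python
M = 10**7  # assumed modulus, from "mod 10,000,000"


def fit_affine(s1, t1, s2, t2, m=M):
    """Solve t = (a*s + b) mod m from two points; needs gcd(s1 - s2, m) == 1."""
    a = (t1 - t2) * pow((s1 - s2) % m, -1, m) % m
    b = (t1 - a * s1) % m
    return a, b


def predict(a, b, s, m=M):
    """Evaluate the fitted hypothesis at s."""
    return (a * s + b) % m


# (x without its last digit, y1 without its trailing zero) for rows 3-5
points = [(35513802347528, 3937417), (35513802347529, 626028),
          (35513802347530, 502541)]

a, b = fit_affine(*points[0], *points[1])
s, t = points[2]
print("hypothesis holds on row 5:", predict(a, b, s) == t)
```

For these three rows the check prints False, so this particular form of the linear hypothesis can be dropped for y1. Other readings (a different modulus, a different part of the input) would each need the same kind of test.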
Rinse and repeat. If you have enough patience and the function is not too hard, you may find it.