Generate N "random" string of length K using the probability table

How do I create N

"random" length strings K

using a probability table? K

there will be some even number.

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}

      

Let's say, there K = 6

would be a higher probability 'acacab'

than 'aaaaaa'

.

This is a subproblem of a larger problem that Im using to generate synthetic sequences based on a probability table. I'm not sure how to use the probability table to generate "random" rows?

What I have so far:

def seq_prob(fprob_table,K= 6, N= 10):
    #fprob_table is the probability dictionary that you input
    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    #possibly using itertools or random to generate the semi-"random" strings based on the probabilities 
    return seq_list

      

+3


source to share


4 answers


There are some good approaches to choosing weighted random options at the end of the documentation for an inline modulerandom

:

A common goal is to create random.choice () with weighted probabilities.

If the weights are small integer ratios, a simple technique is to construct a sample population with repetitions:

>>> weighted_choices = [('Red', 3), ('Blue', 2), ('Yellow', 1), ('Green', 4)]
>>> population = [val for val, cnt in weighted_choices for i in range(cnt)]
>>> random.choice(population)
'Green'

      

A more general approach is to arrange the weights in a cumulative distribution with itertools.accumulate (), and then find the random value with bisect.bisect ():



>>> choices, weights = zip(*weighted_choices)
>>> cumdist = list(itertools.accumulate(weights))
>>> x = random.random() * cumdist[-1]
>>> choices[bisect.bisect(cumdist, x)]
'Blue'

      

To adapt this last approach to your specific problem, I would do:

import random
import itertools
import bisect

def seq_prob(fprob_table, K=6, N=10):
    choices, weights = fprob_table.items()
    cumdist = list(itertools.accumulate(weights))

    results = []
    for _ in range(N):
        s = ""
        while len(s) < K:
            x = random.random() * cumdist[-1]
            s += choices[bisect.bisect(cumdist, x)]
        results.append(s)

    return results

      

This assumes that the key rows in your probability table are the same length. If they are of several different lengths, this code sometimes (probably most of the time!) Gives answers that are longer than K

characters. I suppose it also assumes that it K

is an exact multiple of the key length, although it will indeed work if it is not (it will just give result strings that are longer than length K

) since there is no way to get K

exactly).

+3


source


You can use random.random

:

from random import random
def seq_prob(fprob_table, K=6, N=10):
    #fprob_table is the probability dictionary that you input
    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    s = ""
    while len(seq_list) < N:
        for k, v in fprob_table.items():
            if len(s) == K:
                seq_list.append(s)
                s = ""
                break
            rn = random()
            if rn <=  v:
                s += k
    return seq_list

      



This can no doubt be improved, but random.random

is useful when dealing with probability.

+2


source


I'm sure there is a cleaner / better way, but here's one easy way to do it.

Here we are filling with pick_list

100 distinct values ​​of a pair of characters, the number of values ​​determined by the probability. In this case, there are 20 'aa'

, 30 'ab'

and 50 'ac'

entries inside pick_list

. Then it random.choice(pick_list)

evenly pulls a random entry from the list.

import random

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}


def seq_prob(fprob_table, K=6, N=10):
    #fprob_table is the probability dictionary that you input

    # fill list with number of items based on the probabilities
    pick_list = []
    for key, prob in fprob_table.items():
        pick_list.extend([key] * int((prob * 100)))    

    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    for i in range(N):
        sub_seq = "".join(random.choice(pick_list) for _ in range(int(K/2)))
        seq_list.append(sub_seq)
    return seq_list

      

With the results:

 seq_prob(prob_table)
['ababac',
 'aaacab',
 'aaaaac',
 'acacac',
 'abacac',
 'acaaac',
 'abaaab',
 'abaaab',
 'aaabaa',
 'aaabaa']

      

+1


source


If your tables or sequences are large, using numpy can be helpful as it is likely to be significantly faster. Also, numpy is created for this problem and this approach is easy to understand and only 3 or 4 lines.

The idea would be to convert probabilities to cumulative probabilities, i.e. when displayed (.2, .5, .3)

in (.2, .7, 1.)

, and then random numbers generated by a flat distribution from 0

to 1

, fall into the cells the total sum with a frequency corresponding to the weights. Numpy searchsorted

can be used to quickly find which bin the random values ​​are in. I.e

import numpy as np

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
N = 10
k = 3   # number of strings (not number of characters)

rvals = np.random.random((N, k))         # generate a bunch of random values
string_indices = np.searchsorted(np.cumsum(prob_table.values()), rvals)   # weighted indices
x = np.array(prob_table.keys())[string_indices]     # get the strings associated with the indices
y = ["".join(x[i,:]) for i in range(x.shape[0])]    # convert this to a list of strings

# y = ['acabab', 'acacab', 'acabac', 'aaacaa', 'acabac', 'acacab', 'acabaa', 'aaabab', 'abacac', 'aaabab']

      

Here I have used k

as the number of lines you will need, not k

as the number of characters, since the problem statement is ambiguous about strings / characters.

0


source







All Articles