Create (not so) random string with specific occurrences of string

I have a requirement where I have the alphabet "ACGT" and I need to create a string of about 20,000 characters. This string must contain 100+ occurrences of the "CCGT" pattern. Most of the time, the generated string contains about 20-30 instances.

    int N = 20000;
    std::string alphabet("ACGT");
    std::string str;
    str.reserve(N);
    for (int index = 0; index < N; index++)
    {
        str += alphabet[rand() % (alphabet.length())];
    }

      

How can I customize my code so that the pattern appears more often?

Edit: is there a way to change the alphabet, i.e. to use "A", "C", "G", "T" and "CCGT" as the alphabet symbols?

Thanks.

+3




6 answers


Create an array of integers containing 100 zeros and 4900 each of 1s, 2s, 3s and 4s [000000 .... 111111 .... 2222, etc.]. That makes 19,700 entries, which expand to exactly 20,000 characters.

Then shuffle it randomly (std::random_shuffle).

Then write out the string, translating every 0 to "CCGT", every 1 to 'A', and so on for 'C', 'G' and 'T'.

I think this gives you what you want, and by changing the original array of ints you can also change the number of 'A' characters in the output.

Edit: If that isn't random enough, put 100 zeros at the beginning, then fill the rest with random values from 1 to 4.
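A minimal sketch of the token-shuffle idea, assuming exact counts (100 pattern tokens plus 4900 of each letter). The names makeSequence, tokens and ids are made up for illustration, and std::shuffle with std::mt19937 stands in for std::random_shuffle, which was deprecated in C++14 and removed in C++17.

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <vector>

// Build a string of exactly 20,000 characters that contains the
// pattern "CCGT" at least 100 times by shuffling tokens.
// (makeSequence is a hypothetical name, not from the answer.)
std::string makeSequence(std::mt19937& gen)
{
    const std::vector<std::string> tokens = {"CCGT", "A", "C", "G", "T"};

    // 100 pattern tokens (400 characters) plus 4900 of each single
    // letter (19,600 characters) gives 19,700 tokens in total.
    std::vector<int> ids(100, 0);
    for (int letter = 1; letter <= 4; ++letter)
        ids.insert(ids.end(), 4900, letter);

    std::shuffle(ids.begin(), ids.end(), gen);

    // Expand each token id into its text.
    std::string result;
    result.reserve(20000);
    for (int id : ids)
        result += tokens[id];
    return result;
}
```

Calling it with an engine such as std::mt19937 gen(std::random_device{}()) should give a different string on each run. Single letters that happen to land next to each other can form extra "CCGT" runs, so 100 is a lower bound rather than an exact count.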

+2




The only solution I can think of that would meet the "100+" criterion:

create 20000 character string
number of instances (call it n) = 100 + some random value
for (i = 0 ; i < n ; ++i)
{
   pick random start position
   write CCGT
}

      



Of course, you will need to make sure that the characters you overwrite are not already part of a previously written "CCGT".
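One way the overwrite loop and the overlap check might look, as a sketch only. plantPattern and the taken bookkeeping vector are hypothetical names, and the code assumes the text is long enough that the requested number of separate windows fits comfortably.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Overwrite `count` non-overlapping windows of `text` with `pattern`.
// Assumes text is much longer than count * pattern.size(), so that
// random retries find a free window quickly.
void plantPattern(std::string& text, const std::string& pattern,
                  int count, std::mt19937& gen)
{
    std::vector<bool> taken(text.size(), false);
    std::uniform_int_distribution<std::size_t> pickStart(
        0, text.size() - pattern.size());

    int placed = 0;
    while (placed < count)
    {
        const std::size_t start = pickStart(gen);

        // Reject the position if any character in the window already
        // belongs to a previously written pattern.
        bool windowFree = true;
        for (std::size_t i = 0; i < pattern.size(); ++i)
            if (taken[start + i])
                windowFree = false;
        if (!windowFree)
            continue;

        text.replace(start, pattern.size(), pattern);
        for (std::size_t i = 0; i < pattern.size(); ++i)
            taken[start + i] = true;
        ++placed;
    }
}
```

The caller would first fill text with random letters and then pass a count of 100 plus some random extra, as in the pseudocode above.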

+1




My first thought was to generate a list of 100 positions at which the special string is guaranteed to appear. Then, while generating the random string, insert the special string at each of these positions as you reach it.

The code below leaves out the check that the positions are spaced appropriately (no position may be within 4 of another). That check is necessary for this to work, as is sorting the positions in ascending order.

// Needs <algorithm>, <cstdlib> and <string>
int N = 20000;
std::string alphabet("ACGT");
std::string pattern("CCGT");
int intervals[100];
for (int index = 0; index < 100; index++)
{
    // Leave room for the whole pattern at the end of the string
    intervals[index] = rand() % (N - 3);
    // Do some sort of check to make sure each element of intervals is not
    // within 4 of another element and that no elements are repeated
}
// Sort the intervals array in ascending order
std::sort(intervals, intervals + 100);
int current_interval_index = 0;
std::string str;
str.reserve(N);
for (int index = 0; index < N; index++)
{
    // The bound check stops us reading past the end of intervals
    // once all 100 patterns have been written
    if (current_interval_index < 100 && index == intervals[current_interval_index])
    {
        str += pattern;
        current_interval_index++;
        index += 3; // the pattern fills 4 positions
    }
    else
    {
        str += alphabet[rand() % (alphabet.length())];
    }
}
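The spacing check and the sort that this answer calls for could be sketched as follows. This is not the answer's own method: instead of rejecting bad candidates, it draws distinct values from a shortened range and then spreads them out, which yields sorted, non-overlapping start positions directly. pickStarts and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Return `count` sorted start positions in [0, length - width] such
// that windows of `width` characters starting there never overlap.
// Assumes length >= count * width.
std::vector<int> pickStarts(int length, int width, int count,
                            std::mt19937& gen)
{
    // Choose `count` distinct values from a range shortened by
    // (width - 1) per window...
    std::vector<int> pool(length - count * (width - 1));
    std::iota(pool.begin(), pool.end(), 0);
    std::shuffle(pool.begin(), pool.end(), gen);

    std::vector<int> starts(pool.begin(), pool.begin() + count);
    std::sort(starts.begin(), starts.end());

    // ...then push the i-th value right by i * (width - 1), which
    // forces consecutive starts to be at least `width` apart.
    for (int i = 0; i < count; ++i)
        starts[i] += i * (width - 1);
    return starts;
}
```

With pickStarts(20000, 4, 100, gen) the result could replace the intervals array in the loop above, since the values are already sorted and at least 4 apart.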

      

+1




The solution I came up with uses a std::vector to hold all the random sets of 4 characters, including the 100 special sequences. I then shuffle this vector to distribute the 100 special sequences randomly across the entire string.

To keep the distribution of letters even, I create an alternate alphabet string called weighted, which contains the characters of alphabet in proportions that account for the letters already contributed by the 100 special sequences.

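#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

// Note: std::random_shuffle was removed in C++17, so build this as
// C++11 or C++14, or switch to std::shuffle.
int main()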
{
    std::srand(std::time(0));

    using std::size_t;

    const size_t N = 20000;

    std::string alphabet("ACGT");

    // stuff the ballot
    std::vector<std::string> v(100, "CCGT");

    // build a properly weighted alphabet string
    // to give each letter equal chance of appearing
    // in the final string

    std::string weighted;

    // This could be scaled down to make the weighted string much smaller

    for(size_t i = 0; i < N / 4 - 200; ++i) // already have 200 Cs
        weighted += "C";

    for(size_t i = 0; i < N / 4 - 100; ++i) // already have 100 Gs & Ts
        weighted += "GT";

    for(size_t i = 0; i < N / 4; ++i)
        weighted += "A";

    size_t remaining = N - (v.size() * 4);

    // add the remaining characters to the weighted string
    std::string s;
    for(size_t i = 0; i < remaining; ++i)
        s += weighted[std::rand() % weighted.size()];

    // add the random "4 char" sequences to the vector so
    // we can randomly distribute the pre-loaded special "4 char"
    // sequence among them.
    for(std::size_t i = 0; i < s.size(); i += 4)
        v.push_back(s.substr(i, 4));

    // distribute the "4 char" sequences randomly
    std::random_shuffle(v.begin(), v.end());

    // rebuild string s from vector
    s.clear();
    for(auto&& seq: v)
        s += seq;

    std::cout << s.size() << '\n'; // should be N
}

      

+1




I love @Andy Newman's answer and think it is probably the best approach. The code below is a compilable example of what they suggested.

#include <string>
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    srand(time(0));
    int N = 20000;
    std::string alphabet("ACGT");
    std::string str;
    str.reserve(N);
    int int_array[19700];
    // Populate int array
    for (int index = 0; index < 19700; index++)
    {
        if (index < 100)
        {
            int_array[index] = 0;
        }
        else
        {
            int_array[index] = (rand() % 4) + 1;
        }
    }
    // Want the array in a random order
    std::random_shuffle(&int_array[0], &int_array[19700]);
    // Now populate string from the int array
    for (int index = 0; index < 19700; index++)
    {
        switch(int_array[index])
        {
            case 0:
                // 0 stands for the special pattern, not the alphabet "ACGT"
                str += "CCGT";
                break;
            case 1:
                str += 'A';
                break;
            case 2:
                str += 'C';
                break;
            case 3:
                str += 'G';
                break;
            case 4:
                str += 'T';
                break;
            default:
                break;
        }
    }
    // Print out to check what it looks like
    std::cout << str << std::endl;
}

      

+1




You have to make N larger.

I take this liberty because you say "create a string of about 20,000 characters"; but there is more to it than that.

If you are only finding 20-30 instances in a 20,000 character string, then something is wrong. As a ballpark figure, there are 20,000 character positions to check, and each one starts a four-letter string over a four-letter alphabet, giving a 1/256 chance of matching a specific string. The average should be (roughly, because I'm simplifying) 20,000/256, or 78 hits.

Perhaps your string was not randomized properly (possibly because of the modulo idiom), or perhaps you are only testing every fourth character position, as if the string were a list of non-overlapping four-letter words.

If you can get your average hit rate back up to 78, then you can reach the requirement of just over 100 simply by increasing N proportionately.
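To check the estimate, here is a small sketch that counts matches at every start position and computes the expected number of hits for a uniformly random string. countOccurrences and expectedHits are hypothetical helper names, and the formula assumes each letter is independent and equally likely.

```cpp
#include <cstddef>
#include <string>

// Count occurrences of `pattern` in `text`, checking every start
// position (so overlapping matches are counted too).
std::size_t countOccurrences(const std::string& text,
                             const std::string& pattern)
{
    std::size_t hits = 0;
    for (std::size_t pos = text.find(pattern); pos != std::string::npos;
         pos = text.find(pattern, pos + 1))
        ++hits;
    return hits;
}

// Expected number of hits in a uniformly random text: one chance in
// alphabetSize^patternLength at each of the
// (length - patternLength + 1) start positions.
double expectedHits(std::size_t length, std::size_t patternLength,
                    std::size_t alphabetSize)
{
    double perPosition = 1.0;
    for (std::size_t i = 0; i < patternLength; ++i)
        perPosition /= static_cast<double>(alphabetSize);
    return static_cast<double>(length - patternLength + 1) * perPosition;
}
```

expectedHits(20000, 4, 4) comes to 19,997/256, about 78.1, and an N of roughly 25,600 brings the expectation to 100. That is only an average, so individual runs can still fall below 100.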

+1

