Create (not so) random string with specific occurrences of string
I have a requirement where I have the alphabet "ACGT" and I need to create a string of about 20,000 characters. This string must contain 100+ occurrences of the "CCGT" pattern. Most of the time, the generated string contains about 20-30 instances.
int N = 20000;
std::string alphabet("ACGT");
std::string str;
str.reserve(N);
for (int index = 0; index < N; index++)
{
    str += alphabet[rand() % (alphabet.length())];
}
How can I customize my code so that the pattern appears more often?
Edit: is there a way to change the alphabet, i.e. use "A", "C", "G", "T", "CCGT" as the alphabet characters?
Thanks.
Create an array of integers containing 100 0s and 4,900 each of 1s, 2s, 3s and 4s [000000 .... 111111 .... 2222, etc.]. That makes 19,700 entries, which expand to exactly 20,000 characters.
Then shuffle it randomly (std::random_shuffle).
Then build the string, where every 0 translates to "CCGT", every 1 translates to 'A', every 2 .... etc.
I think this gives you what you want, and by changing the original array of ints you can also change the number of 'A' characters in the output.
Edit: If that's not random enough, put 100 0s at the beginning, then fill the rest with random values from 1 to 4 before shuffling.
The only solution I can think of that would meet the "100+" criterion is:
create 20000 character string
number of instances (call it n) = 100 + some random value
for (i = 0 ; i < n ; ++i)
{
pick random start position
write CCGT
}
Of course, you will need to make sure that the characters you overwrite are not already part of an earlier "CCGT".
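The pseudocode above could be sketched in C++ roughly as follows. This is only an illustration under some assumptions: the function name plant_pattern, its parameters, and the use of std::mt19937 instead of rand() are not from the answer. It fills the string with random letters, then overwrites randomly chosen slots with the pattern, rejecting any slot that would touch a copy planted earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper (name and signature are assumptions, not from the answer).
// Builds a random ACGT string of `length` characters, then overwrites `count`
// non-overlapping slots with `pattern`.
std::string plant_pattern(std::size_t length, const std::string& pattern,
                          std::size_t count, std::mt19937& rng)
{
    // Assumption: the planted copies take up at most half of the string,
    // which guarantees the rejection loop below can always find a free slot.
    assert(!pattern.empty() && count * pattern.size() * 2 <= length);

    const std::string alphabet("ACGT");
    std::uniform_int_distribution<std::size_t> letter(0, alphabet.size() - 1);
    std::uniform_int_distribution<std::size_t> start(0, length - pattern.size());

    // Step 1: plain random string
    std::string str(length, 'A');
    for (char& c : str)
        c = alphabet[letter(rng)];

    // Step 2: plant the pattern, never overwriting a previously planted copy
    std::vector<bool> taken(length, false);
    std::size_t placed = 0;
    while (placed < count)
    {
        const std::size_t pos = start(rng);
        bool slot_free = true;
        for (std::size_t i = 0; i < pattern.size(); ++i)
        {
            if (taken[pos + i])
            {
                slot_free = false;
                break;
            }
        }
        if (!slot_free)
            continue; // pick another start position

        for (std::size_t i = 0; i < pattern.size(); ++i)
        {
            str[pos + i] = pattern[i];
            taken[pos + i] = true;
        }
        ++placed;
    }
    return str;
}
```

Calling plant_pattern(20000, "CCGT", 100 + extra, rng) with some random extra would match the "100 + some random value" idea. The planted copies are a lower bound: the random filler may add further occurrences by chance.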
My first thought would be to generate a list of 100 indices at which you will definitely insert the special string. Then, as you generate the random string, insert the special string at each of these indices as you reach them.
I have left out the check that the intervals are spaced appropriately (no index may be within 4 of another) and the sort into ascending order; both are necessary for this to work.
int N = 20000;
std::string alphabet("ACGT");
int intervals[100];
for (int index = 0; index < 100; index++)
{
    // N - 3 keeps the 4-character special string inside the first N characters
    intervals[index] = rand() % (N - 3);
    // Do some sort of check to make sure each element of intervals is not
    // within 4 of another element and that no elements are repeated
}
// Sort the intervals array in ascending order
int current_interval_index = 0;
std::string str;
str.reserve(N);
for (int index = 0; index < N; index++)
{
    if (current_interval_index < 100 && index == intervals[current_interval_index])
    {
        // Insert the special string and skip the 3 extra characters it adds
        str += "CCGT";
        current_interval_index++;
        index += 3;
    }
    else
    {
        str += alphabet[rand() % (alphabet.length())];
    }
}
The solution I came up with uses a std::vector to hold all the random sets of 4 characters, including the 100 special sequences. I then shuffle this vector to distribute the 100 special sequences randomly throughout the string.
To even out the distribution of letters, I create an alternative alphabet string called weighted that contains the alphabet characters in proportions adjusted for what the 100 special sequences have already contributed.
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>
int main()
{
    std::srand(std::time(0));
    using std::size_t;
    const size_t N = 20000;
    std::string alphabet("ACGT");
    // stuff the ballot
    std::vector<std::string> v(100, "CCGT");
    // build a properly weighted alphabet string
    // to give each letter equal chance of appearing
    // in the final string (target: N / 4 of each letter)
    std::string weighted;
    // This could be scaled down to make the weighted string much smaller
    for(size_t i = 0; i < N / 4 - 200; ++i) // already have 200 Cs
        weighted += "C";
    for(size_t i = 0; i < N / 4 - 100; ++i) // already have 100 Gs & Ts
        weighted += "GT";
    for(size_t i = 0; i < N / 4; ++i)
        weighted += "A";
    size_t remaining = N - (v.size() * 4);
    // draw the remaining characters from the weighted string
    std::string s;
    for(size_t i = 0; i < remaining; ++i)
        s += weighted[std::rand() % weighted.size()];
    // add the random "4 char" sequences to the vector so
    // we can randomly distribute the pre-loaded special "4 char"
    // sequence among them.
    for(std::size_t i = 0; i < s.size(); i += 4)
        v.push_back(s.substr(i, 4));
    // distribute the "4 char" sequences randomly
    // (std::random_shuffle was removed in C++17; use std::shuffle there)
    std::random_shuffle(v.begin(), v.end());
    // rebuild string s from vector
    s.clear();
    for(auto&& seq: v)
        s += seq;
    std::cout << s.size() << '\n'; // should be N
}
I like @Andy Newman's answer and think it is probably the best way; the code below is a compilable example of what they suggested.
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
int main()
{
    srand(time(0));
    int N = 20000;
    std::string str;
    str.reserve(N);
    // 100 special entries + 19,600 single letters = 19,700 entries
    int int_array[19700];
    // Populate int array
    for (int index = 0; index < 19700; index++)
    {
        if (index < 100)
        {
            int_array[index] = 0;
        }
        else
        {
            int_array[index] = (rand() % 4) + 1;
        }
    }
    // Want the array in a random order
    std::random_shuffle(&int_array[0], &int_array[19700]);
    // Now populate string from the int array
    for (int index = 0; index < 19700; index++)
    {
        switch(int_array[index])
        {
        case 0:
            // 0 stands for the special 4-character pattern
            str += "CCGT";
            break;
        case 1:
            str += 'A';
            break;
        case 2:
            str += 'C';
            break;
        case 3:
            str += 'G';
            break;
        case 4:
            str += 'T';
            break;
        default:
            break;
        }
    }
    // Print out to check what it looks like
    std::cout << str << std::endl;
}
You should make N larger.
I take this liberty because you say "create a string of about 20,000 characters"; but there's more to it than that.
If you only find 20-30 instances in a 20,000 character string, then something is wrong. A ballpark estimate is that there are 20,000 character positions to check, and each one starts a four-letter string over a four-letter alphabet, giving a 1/256 chance of matching a specific string. The average should be (roughly, because I'm simplifying) 20,000/256, or about 78 hits.
Perhaps your string was not randomized properly (possibly because of the modulo idiom), or perhaps you are only testing every fourth character position, as if the string were a list of non-overlapping four-letter words.
If you can get your average hit rate back up to 78, then you can reach the requirement of just over 100 simply by increasing N proportionately.
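To check the estimate above against a generated string, something like the following sketch might help. The names count_occurrences and expected_hits are hypothetical, and the expected value assumes letters drawn independently and uniformly, which rand() % 4 only approximates. The counter tests every start position, not every fourth one.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>

// Hypothetical helper: counts matches at every start position, so
// overlapping occurrences are included.
std::size_t count_occurrences(const std::string& text, const std::string& pattern)
{
    assert(!pattern.empty());
    std::size_t hits = 0;
    for (std::size_t pos = text.find(pattern); pos != std::string::npos;
         pos = text.find(pattern, pos + 1))
    {
        ++hits;
    }
    return hits;
}

// Expected number of matches in a string of `length` letters drawn
// independently and uniformly from `alphabet_size` symbols:
// (length - pattern_length + 1) / alphabet_size^pattern_length
double expected_hits(std::size_t length, std::size_t alphabet_size,
                     std::size_t pattern_length)
{
    if (pattern_length == 0 || pattern_length > length)
        return 0.0;
    const double positions = static_cast<double>(length - pattern_length + 1);
    return positions / std::pow(static_cast<double>(alphabet_size),
                                static_cast<double>(pattern_length));
}
```

Under these assumptions, expected_hits(20000, 4, 4) comes out just above 78, and a length of about 25,600 brings the average to roughly 100. That is only an average: individual strings will still land on either side of it, so an N somewhat above that would be needed to make 100+ likely.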