# Create (not so) random string with specific occurrences of string

I have a requirement where I have the alphabet "ACGT" and I need to create a string of about 20,000 characters. This string must contain 100+ occurrences of the "CCGT" pattern. Most of the time, the generated string contains about 20-30 instances.

```cpp
int N = 20000;
std::string alphabet("ACGT");
std::string str;
str.reserve(N);
for (int index = 0; index < N; index++)
{
    str += alphabet[rand() % (alphabet.length())];
}
```

How can I change my code so that the pattern appears more often?

Edit: is there a way to change the alphabet, i.e. to use "A", "C", "G", "T" and "CCGT" as the alphabet symbols?

Thanks.

+3


Create an array of integers containing 100 0s and 4,900 each of 1s, 2s, 3s and 4s [000000 .... 111111 .... 2222, etc.]. That makes almost 20,000 entries (19,700), which expand to exactly 20,000 characters.

Then shuffle it randomly (`std::random_shuffle`).

Then write out the string, where every 0 translates to "CCGT", every 1 translates to 'A', every 2 to 'C', and so on.

I think this gives you what you want, and by changing the original array of ints you can also change the number of 'A' characters in the output.

Edit: If that's not random enough, put 100 0s at the beginning, then random values from 1 to 4 for the rest.

+2


The only solution I can think of that would meet the "100+" criterion:

```
create 20000 character string
number of instances (call it n) = 100 + some random value
for (i = 0 ; i < n ; ++i)
{
    pick random start position
    write CCGT
}
```

Of course, you will need to make sure that the characters you overwrite are not already part of a "CCGT".

+1


My first thought was to generate a list of 100 indices at which you make sure to insert the special string. Then, as you generate the random string, insert the special string at each of these indices as you reach it.

The indices must be spaced appropriately (none within 4 of another) and sorted in ascending order; both are necessary for this to work.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>

int main()
{
    const int N = 20000;
    const int COUNT = 100;
    const std::string alphabet("ACGT");
    const std::string pattern("CCGT");
    int intervals[COUNT];
    for (int index = 0; index < COUNT; )
    {
        // Leave room for the 4-character pattern at the end of the string
        int candidate = rand() % (N - 3);
        // Reject candidates within 4 of an already chosen position
        // (this also rules out repeated positions)
        bool ok = true;
        for (int j = 0; j < index; j++)
        {
            if (std::abs(candidate - intervals[j]) < 4) ok = false;
        }
        if (ok) intervals[index++] = candidate;
    }
    // Sort the intervals array in ascending order
    std::sort(intervals, intervals + COUNT);
    int current_interval_index = 0;
    std::string str;
    str.reserve(N);
    for (int index = 0; index < N; index++)
    {
        if (current_interval_index < COUNT && index == intervals[current_interval_index])
        {
            str += pattern; // insert "CCGT", not the whole alphabet
            current_interval_index++;
            index += 3;     // the loop increment skips the fourth character
        }
        else
        {
            str += alphabet[rand() % (alphabet.length())];
        }
    }
}
```
+1


The solution I came up with uses a `std::vector` to hold all the random sets of 4 characters, including the 100 special sequences. I then shuffle this vector to distribute the 100 special sequences randomly across the whole string.

To even out the distribution of letters, I create an alternate `alphabet` string called `weighted` that contains the relative abundance of each `alphabet` character, adjusted for what has already been included via the 100 special sequences.

```cpp
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int main()
{
    std::srand(std::time(0));

    using std::size_t;

    const size_t N = 20000;

    // stuff the ballot
    std::vector<std::string> v(100, "CCGT");

    // build a properly weighted alphabet string
    // to give each letter equal chance of appearing
    // in the final string

    std::string weighted;

    // This could be scaled down to make the weighted string much smaller

    for(size_t i = 0; i < N / 4 - 200; ++i) // already have 200 Cs
        weighted += "C";

    for(size_t i = 0; i < N / 4 - 100; ++i) // already have 100 Gs & Ts
        weighted += "GT";

    for(size_t i = 0; i < N / 4; ++i)
        weighted += "A";

    size_t remaining = N - (v.size() * 4);

    // add the remaining characters to the weighted string
    std::string s;
    for(size_t i = 0; i < remaining; ++i)
        s += weighted[std::rand() % weighted.size()];

    // add the random "4 char" sequences to the vector so
    // we can randomly distribute the pre-loaded special "4 char"
    // sequence among them.
    for(std::size_t i = 0; i < s.size(); i += 4)
        v.push_back(s.substr(i, 4));

    // distribute the "4 char" sequences randomly
    // (std::shuffle is used because std::random_shuffle was removed in C++17)
    std::shuffle(v.begin(), v.end(), std::mt19937(std::random_device{}()));

    // rebuild string s from vector
    s.clear();
    for(auto&& seq: v)
        s += seq;

    std::cout << s.size() << '\n'; // should be N
}
```
+1


I love @Andy Newman's answer and think it is probably the best approach. The code below is a compilable example of what they suggested.

```cpp
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <random>
#include <string>

int main()
{
    std::srand(std::time(0));
    const int N = 20000;
    // 100 pattern entries (4 characters each) + 19600 single letters
    const int ENTRIES = 19700;
    std::string str;
    str.reserve(N);
    int int_array[ENTRIES];
    // Populate int array
    for (int index = 0; index < ENTRIES; index++)
    {
        if (index < 100)
        {
            int_array[index] = 0;
        }
        else
        {
            int_array[index] = (std::rand() % 4) + 1;
        }
    }
    // Want the array in a random order
    // (std::shuffle is used because std::random_shuffle was removed in C++17)
    std::shuffle(int_array, int_array + ENTRIES, std::mt19937(std::random_device{}()));
    // Now populate string from the int array
    for (int index = 0; index < ENTRIES; index++)
    {
        switch(int_array[index])
        {
            case 0:
                str += "CCGT";
                break;
            case 1:
                str += 'A';
                break;
            case 2:
                str += 'C';
                break;
            case 3:
                str += 'G';
                break;
            case 4:
                str += 'T';
                break;
            default:
                break;
        }
    }
    // Print out to check what it looks like
    std::cout << str << std::endl;
}
```
+1


You have to make `N` larger.

I take this liberty because you say "create a string of about 20,000 characters"; but there's more to it than that.

If you only find 20-30 instances in a 20,000 character string, then something is wrong. A ballpark estimate: there are 20,000 character positions to check, and each one starts a four-letter string over a four-letter alphabet, giving a 1/256 chance of matching a specific string. The average should be (roughly, because I'm simplifying) 20,000/256, or 78 hits.

Perhaps your string was not randomized properly (probably due to the use of the `rand() % n` modulo idiom), or perhaps you are only testing every fourth character position, as if the string were a list of non-overlapping four-letter words.

If you can get your average hit rate back to 78, then you can reach the requirement of just over 100 simply by increasing `N` proportionately.

+1
