Number of longest common subsequences

I need to find the number of distinct longest common subsequences between two strings A and B. I am currently using a normal dynamic programming approach and then generate all the different substrings using a backtracking array and then doing the first depth search from the starting index.

However, since there are so many possible answers, my code is too slow. Is there a way to count the number of such individual longest common subsequences without generating them at all?

My code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Stack;

class Node
{
String res = "";
int i;
int j;

public Node( int _i, int _j, String s )
{
    i = _i;
    j = _j;
    res = s;
}
}

public class LCSRevisited
{
static String a;
static String b;
static int m,n;
static int[][] memo;
static int[][] bt; // 1 means [i+1][j], 2 means [i][j+1], 3 means [i+1][j+1]
// 4  - means both

static HashSet <String> filter;

static void printAllStrings( )
{
    Iterator i = filter.iterator();

    while( i.hasNext())
    {
        System.out.println( i.next() );
    }
} 

 static void printSol()
 {
   System.out.print( memo[ 0 ][ 0 ]);

   // check how many UNIQUE such strings exist

   filter = new HashSet();
   Stack<Node> s = new Stack();
   Node start = new Node( 0, 0, "" );
   s.push( start );
   Node curr;
   String res;

   // use backtrack array to do a DFS

   while( !s.isEmpty() )
   {
        curr = s.pop();
        res = curr.res;

        if( ( curr.i>=m) || ( curr.j >=n ) )
        {
            filter.add( curr.res);
            continue;
       }

        // check backtrack value
        int i = curr.i;
        int j = curr.j;
        int back = bt[ i ][ j];

        if( back == 1 )
        {
            s.push( new Node( i+1, j, res ));
        }
        if( back == 2 )
        {
            s.push( new Node( i, j+1, res ));
        }
        if( back == 3 )
        {
            s.push( new Node( i+1, j+1, res+a.charAt(i) ));
        }
        if( back == 4 )
        {
            s.push( new Node( i, j+1, res ));
            s.push( new Node( i+1, j, res ));
        }
   }
   //printAllStrings();
   System.out.println(" " + filter.size() );
}

static void solve()
{
   // fill base cases
   m = a.length();
   n = b.length();
   memo = new int[ m+1 ][ n+1 ];
   Arrays.fill( memo[m], 0 );

   bt = new int[ m+1 ][ n+1 ];

   for( int i=0; i<m; i++ )
   {
       memo[ i ][ n ] = 0;    
   }

   // Now compute memo values
   for( int i=m-1; i>=0; i-- )
   {
       for( int j=n-1; j>=0; j-- )
       {
           if( a.charAt(i) == b.charAt(j))
           {
               memo[ i ][ j ] = 1 + memo[ i+1 ][ j+1 ];
               bt[ i ][ j ] = 3;
           }
           else
           {
               int r1 = memo[ i+1 ][ j ];
               int r2 = memo[ i ][ j+1 ];

               if( r1==r2 )
               {
                    memo[ i ][ j ] = r1;
                    bt[ i ][ j ] = 4;
               }
               else if( r1 > r2 )
               {
                   memo[ i ][ j ] = r1;
                   bt[ i ][ j ] = 1;
               }
               else
               {
                   memo[ i ][ j ] = r2;
                   bt[ i ][ j ] = 2;
               }
           }
       }
   }

   printSol();
 }

public static void main( String[] args ) throws IOException
{
 BufferedReader br = new BufferedReader( new InputStreamReader( System.in ));

int T= Integer.parseInt( br.readLine() );

while( T--> 0 )
{
    a = br.readLine();
    b = br.readLine();

    if( T>=1 )
    br.readLine();

    solve();
    // printArr( bt );
}
}
}

      

+3


source to share


2 answers


I think you can use a hash roll function like Rabin Karp's. This way, you can calculate new hash values ​​of longer common subsequences without having to re-create the entire string and hash.

In fact, I think you can use pure DP to find your answer. Let's memo[][]

say you have already calculated the DP table values ​​for LCS ( in your code, I think). Then you can calculate the number of different LCS instances like



for j ← 0 to n do
    for i ← 0 to m do
        if i = 0 or j = 0 then
            D[i, j] ← 1
        else
            D[i, j] ← 0
            if ai  = bj  then
                D[i, j] ← D[i − 1, j − 1]
            else if L[i − 1, j] = L[i, j] then
                D[i, j] ← D[i, j] + D[i − 1, j]
            endif
            if L[i, j − 1] = L[i, j] then
                D[i, j] ← D[i, j] + D[i, j − 1]
            endif
            if L[i − 1, j − 1] = L[i, j] then
                D[i, j] ← D[i, j] − D[i − 1, j − 1]
            endif
        end if
    endfor
endfor

      

Your answer is D [n, m]. Hope my ideas helped!

+1


source


Perhaps using a kind of trie will help you generate the actual sequences when calculating the length and after calculating the total with a traversal in trie (linear time).

Memo [i] [j] now represents the length of the common subsequence of A [i ... m] and B [j..n]. I suggest that you also have lp [i] [j] representing a list of pointers to a node in a trie, so that the path from that node to the root of the trie gives you one of the longest common subseqs for A [i ... m] and B [j..n]. To build this lp, for cases 1 and 2, you just copy the lists from lp [i + 1] [j] or lp [i] [j + 1] as for case 4, and for 3 you add a new node in the tree with the value A [i] = B [j], and all the nodes pointed to by lp [i + 1] [j + 1] are specified in the tree as sons of the new node. These operations would be linear (or maybe even faster with some advanced data structures for working with sets). Note that what I described is not actually a trie / tree (a node can have multiple parents).



Finally, for counting, I think a workaround would be good, with some extra handling - spreading counts: count [node] [level] = sum (count [sons (node)] [level- 1]) or if node is count leaves [node] [1] = 1, count [node] [l! = 1] = 0. And your answer will sum the counts for those "root" nodes (to the power of 0) that satisfy the "longest" constraint (\ sum_x {count [x] [l_max]}).

I'm not 100% sure of my solution, but it might be a good start to improve the answer to your problem.

0


source







All Articles