Pattern matching for URL classification

As part of a project, a few others and I are working on a URL classifier. What we want to implement is actually quite simple: we look through the URL, find the relevant keywords in it, and classify the page accordingly.

For example: if the URL is http://cnnworld/sports/abcd , we would classify it under the category "sports".

For this we have a database of mappings of the form: Keyword -> Category

What we do now is, for each URL, read every entry in the database and use the String.find() method to check whether that keyword appears in the URL. Once a match is found, we stop.
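In rough Java terms, the loop described above looks something like this (the keyword rows here are invented stand-ins for the database table):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NaiveClassifier {
    // One linear scan over all keyword rows, stopping at the first hit.
    // The in-memory map stands in for the Keyword -> Category database table.
    static String classify(Map<String, String> keywordToCategory, String url) {
        for (Map.Entry<String, String> row : keywordToCategory.entrySet()) {
            if (url.indexOf(row.getKey()) >= 0) { // the String.find() equivalent
                return row.getValue();            // first match wins; other categories are never seen
            }
        }
        return null; // no keyword matched
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("finance", "Finance");
        rows.put("sports", "Sports");
        rows.put("cnn", "News");
        System.out.println(classify(rows, "http://cnnworld/sports/abcd")); // prints Sports
    }
}
```

This makes both problems visible: every URL pays for a scan over the whole table, and the early `return` hides any second category.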

But this approach has several problems, the main ones being:

(i) Our database is very large, and these repeated queries are extremely slow

(ii) A page may belong to more than one category, and our approach does not handle such cases. Of course, one simple fix would be to keep querying the database even after a category match is found, but that would only make things slower.

I thought about alternatives and wondered whether the opposite could be done: parse the URL, look for words in it, and then query the database for just those words.

The naive algorithm for this would be O(n^2): query the database for every substring that occurs in the URL.
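To make the O(n^2) concrete: a URL of length n has n(n+1)/2 non-empty substrings, and each one would become a separate query. A tiny sketch:

```java
public class SubstringCount {
    // Counts every non-empty substring (i, j) of the URL -- each would be one DB query
    static int countSubstrings(String url) {
        int count = 0;
        for (int i = 0; i < url.length(); i++) {
            for (int j = i + 1; j <= url.length(); j++) {
                count++;
            }
        }
        return count; // always n*(n+1)/2 for a string of length n
    }

    public static void main(String[] args) {
        String url = "cnn/sports";                // length 10
        System.out.println(countSubstrings(url)); // 10*11/2 = 55 candidate queries
    }
}
```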

I was wondering whether there is a better approach for this. Any ideas? Thanks in advance :)



4 answers


In our commercial classifier, we have a database of 4 keywords :) and we also scan the HTML body. There are several ways to solve this problem:



  • Use Aho-Corasick. We used a version of the algorithm modified for web content: for example, it treats tab, space, \r and \n all as a single space, so two spaces are handled as one, and it ignores lower/upper case.
  • Another option is to put all your keywords into a tree (like std::map), so lookups become very fast. The downside is memory consumption, possibly a lot of it, but if this runs on a server you won't feel it.
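A minimal Java sketch of the second option, using a TreeMap as the analogue of std::map (the keywords and categories are invented): the mapping table is loaded into memory once, and each URL token becomes an O(log n) in-memory lookup instead of a database round trip.

```java
import java.util.HashSet;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

public class KeywordLookup {
    // In-memory keyword -> categories index; a keyword may map to several categories
    private final NavigableMap<String, Set<String>> index = new TreeMap<>();

    public void add(String keyword, String category) {
        index.computeIfAbsent(keyword, k -> new HashSet<>()).add(category);
    }

    // Classify a URL by looking up each token; O(t log n) for t tokens.
    // Note this is exact-token matching: "cnnworld" will not match the keyword "cnn".
    public Set<String> classify(String url) {
        Set<String> categories = new HashSet<>();
        for (String token : url.toLowerCase().split("[^a-z0-9]+")) {
            Set<String> hit = index.get(token);
            if (hit != null) categories.addAll(hit);
        }
        return categories;
    }

    public static void main(String[] args) {
        KeywordLookup lookup = new KeywordLookup();
        lookup.add("sports", "Sports");
        lookup.add("cnn", "News");
        System.out.println(lookup.classify("http://cnnworld/sports/abcd")); // [Sports]
    }
}
```

A HashMap would give O(1) lookups instead; TreeMap is used here only to mirror the std::map suggestion.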


If you have (many) fewer categories than keywords, you can create one regex per category that matches any of that category's keywords. Then you run your URL against each category regex. This also addresses the issue of matching multiple categories.
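A rough Java sketch of this idea, with invented category and keyword data: one compiled Pattern per category, each an alternation of that category's keywords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

public class CategoryRegexes {
    // Compile one alternation regex per category, e.g. "\b(?:sports|football|tennis)\b"
    static Map<String, Pattern> buildPatterns(Map<String, List<String>> categoryKeywords) {
        Map<String, Pattern> patterns = new HashMap<>();
        for (Map.Entry<String, List<String>> e : categoryKeywords.entrySet()) {
            String alternation = String.join("|", e.getValue());
            patterns.put(e.getKey(),
                    Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE));
        }
        return patterns;
    }

    // A URL may match several category regexes, so all matches are returned
    static Set<String> classify(String url, Map<String, Pattern> patterns) {
        Set<String> matches = new HashSet<>();
        for (Map.Entry<String, Pattern> e : patterns.entrySet()) {
            if (e.getValue().matcher(url).find()) matches.add(e.getKey());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Invented example data: category -> its keywords
        Map<String, List<String>> categoryKeywords = new HashMap<>();
        categoryKeywords.put("Sports", Arrays.asList("sports", "football", "tennis"));
        categoryKeywords.put("News", Arrays.asList("cnn", "headlines"));

        Map<String, Pattern> patterns = buildPatterns(categoryKeywords);
        System.out.println(classify("http://cnn.com/sports/abcd", patterns));
    }
}
```

The \b word boundaries keep "cnn" from matching inside "cnnworld"; drop them if substring matches are wanted.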





I think your suggestion to split the URL to find the useful bits, and then query only for those elements, is a decent approach.

I threw together some Java that might help illustrate what I think it would entail. The most valuable parts are probably the regexes, but I hope the general algorithm helps as well:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;

public class CategoryParser 
{
    /** The db field that keywords should be checked against */
    private static final String DB_KEYWORD_FIELD_NAME = "keyword";

    /** The db field that categories should be pulled from */
    private static final String DB_CATEGORY_FIELD_NAME = "category";

    /** The name of the table to query */
    private static final String DB_TABLE_NAME = "KeywordCategoryMap";

    /**
     * This method takes a URL and from that text alone determines what categories that URL belongs in.
     * @param url - String URL to categorize
     * @return categories - A List<String> of categories the URL seemingly belongs in
     */
    public static List<String> getCategoriesFromUrl(String url) {

        // Clean the URL to remove useless bits and encoding artifacts
        String normalizedUrl = normalizeURL(url);

        // Break the url apart and get the good stuff
        String[] keywords = tokenizeURL(normalizedUrl);

        // Construct the query we can query the database with
        String query = constructKeywordCategoryQuery(keywords);

        System.out.println("Generated Query: " + query);

        // At this point, you'd need to fire this query off to your database,
        // and the results you'd get back should each be a valid category
        // for your URL. This code is not provided because it is very implementation specific,
        // and you already know how to deal with databases.


        // Returning null to make this compile, even though you'd obviously want to return the
        // actual List of Strings
        return null;
    }

    /**
     * Removes the protocol, if it exists, from the front and
     * removes any random encoding characters
     * Extend this to do other url cleaning/pre-processing
     * @param url - The String URL to normalize
     * @return normalizedUrl - The String URL that has no junk or surprises
     */
    private static String normalizeURL(String url)
    {
        // Decode URL to remove any %20 type stuff
        String normalizedUrl = url;
        try {
            // I've used a URLDecoder that's part of Java here,
            // but this functionality exists in most modern languages
            // and is universally called url decoding
            normalizedUrl = URLDecoder.decode(url, "UTF-8");
        }
        catch(UnsupportedEncodingException uee)
        {
            System.err.println("Unable to Decode URL. Decoding skipped.");
            uee.printStackTrace();
        }

        // Remove the protocol, http:// ftp:// or similar from the front
        if (normalizedUrl.contains("://"))
        {
            normalizedUrl = normalizedUrl.split(":\\/\\/")[1];
        }

        // Room here to do more pre-processing

        return normalizedUrl;
    }

    /**
     * Takes the url apart into the pieces that make at least some sense.
     * This doesn't guarantee that each token is a potentially valid keyword, however,
     * because that would require iterating over them all again, which might be
     * seen as a waste.
     * @param url - Url to be tokenized
     * @return tokens - A String array of all the tokens
     */
    private static String[] tokenizeURL(String url)
    {
        // I assume that we're going to use the whole URL to find tokens in
        // If you want to just look in the GET parameters, or you want to ignore the domain
        // or you want to use the domain as a token itself, that would have to be
        // processed above the next line, and only the remaining parts split
        String[] tokens = url.split("\\b|_");

        // One could alternatively use a more complex regex to remove more invalid matches
        // but this is subject to your (?:in)?ability to actually write the regex you want

        // These next two get rid of tokens that are too short, also.

        // Destroys anything that's not alphanumeric, and anything that is
        // alphanumeric but only 1 character long
        //String[] tokens = url.split("(?:[\\W_]+\\w)*[\\W_]+");

        // Destroys anything that's not alphanumeric, and anything that is
        // alphanumeric but only 1 or 2 characters long
        //String[] tokens = url.split("(?:[\\W_]+\\w{1,2})*[\\W_]+");

        return tokens;
    }

    private static String constructKeywordCategoryQuery(String[] keywords)
    {
        // This will hold our WHERE body, keyword OR keyword2 OR keyword3
        StringBuilder whereItems = new StringBuilder();

        // Potential query, if we find anything valid
        String query = null;

        // Iterate over every found token
        for (String keyword : keywords)
        {
            // Reject invalid keywords
            if (isKeywordValid(keyword))
            {
                // If we need an OR
                if (whereItems.length() > 0)
                {
                    whereItems.append(" OR ");
                }

                // Simply append this item to the query
                // Yields something like "keyword='thisKeyword'"
                whereItems.append(DB_KEYWORD_FIELD_NAME);
                whereItems.append("='");
                whereItems.append(keyword);
                whereItems.append("'");
            }
        }

        // If a valid keyword actually made it into the query
        if (whereItems.length() > 0)
        {
            query = "SELECT DISTINCT(" + DB_CATEGORY_FIELD_NAME + ") FROM " + DB_TABLE_NAME
                    + " WHERE " + whereItems.toString() + ";";
        }

        return query;
    }

    private static boolean isKeywordValid(String keyword)
    {
        // Keywords better be at least 2 characters long
        return keyword.length() > 1
                // And they better be only composed of letters and numbers
                && keyword.matches("\\w+")
                // And they better not be *just* numbers
                // && !keyword.matches("\\d+") // If you want this
                ;
    }

    // How this would be used
    public static void main(String[] args)
    {
        List<String> soQuestionUrlClassifications = getCategoriesFromUrl("http://stackoverflow.com/questions/10046178/pattern-matching-for-url-classification");
        List<String> googleQueryURLClassifications = getCategoriesFromUrl("https://www.google.com/search?sugexp=chrome,mod=18&sourceid=chrome&ie=UTF-8&q=spring+is+a+new+service+instance+created#hl=en&sugexp=ciatsh&gs_nf=1&gs_mss=spring%20is%20a%20new%20bean%20instance%20created&tok=lnAt2g0iy8CWkY65Te75sg&pq=spring%20is%20a%20new%20bean%20instance%20created&cp=6&gs_id=1l&xhr=t&q=urlencode&pf=p&safe=off&sclient=psy-ab&oq=url+en&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=2176d1af1be1f17d&biw=1680&bih=965");
    }
}


The generated query for the SO link would look like this:

SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword='stackoverflow' OR keyword='com' OR keyword='questions' OR keyword='10046178' OR keyword='pattern' OR keyword='matching' OR keyword='for' OR keyword='url' OR keyword='classification'


Lots of room for optimization, but I think this is much faster than checking the URL for every possible keyword.



The Aho-Corasick algorithm is best suited for finding substrings in a single traversal. You can build a tree (the Aho-Corasick automaton) from your keywords, where each terminal node stores a number identifying a specific keyword.

Then you just run the URL string through the tree. When you reach one of those numbers (it acts like a flag in this scenario), you have found a match for a certain category. Look the number up in a hash map to find the corresponding category. A single URL can trigger several such flags, which covers the multiple-category case too.

I think this will help you. Follow this link: nice Aho-Corasick animation by Ivan
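For reference, here is a compact (and simplified) Java sketch of the idea described above: a trie over the keywords, failure links wired up with a BFS, then one pass over the URL. For simplicity it collects the matched keyword strings directly rather than numeric IDs; the keywords in main are invented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AhoCorasick {
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;                                    // failure link
        List<String> output = new ArrayList<>();      // keywords ending at this node
    }

    private final Node root = new Node();

    void add(String keyword) {
        Node n = root;
        for (char c : keyword.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.output.add(keyword);
    }

    // BFS to wire failure links; each node's link points to its longest proper suffix in the trie
    void build() {
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.remove();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                child.output.addAll(child.fail.output); // inherit matches from the suffix node
                queue.add(child);
            }
        }
    }

    // One traversal of the text, collecting every keyword that occurs in it
    Set<String> matches(String text) {
        Set<String> found = new HashSet<>();
        Node n = root;
        for (char c : text.toLowerCase().toCharArray()) {
            while (n != root && !n.next.containsKey(c)) n = n.fail;
            n = n.next.getOrDefault(c, root);
            found.addAll(n.output);
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick();
        for (String kw : new String[] {"sports", "cnn", "port"}) ac.add(kw);
        ac.build();
        System.out.println(ac.matches("http://cnnworld/sports/abcd"));
        // finds "cnn", "sports" and even the embedded "port", all in one pass
    }
}
```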







