Count all characters in an HTML string, markup included, up to the first 20 visible words

I am working on a WordPress site where one of the pages shows excerpts about corporate clients.

Let's say I have a web page where the visible text looks like this:

"SuperAmazing.com, a subsidiary of Amazing, the leading provider of
integrated messaging and collaboration services, today announced the
availability of an enhanced version of its Enterprise Messaging
Service (CMS) 2.0, a lower cost webmail alternative to other business
email solutions such as Microsoft Exchange, GroupWise and LotusNotes
offerings."

      

But let's say there might be an HTML link or an image in that text, so the raw HTML might look like this:

<img src="/images/corporate/logos/super_amazing.jpg" alt="Company
logo for SuperAmazing.com" /> SuperAmazing.com, a subsidiary of
<a href="http://www.amazing.com/">Amazing</a>, the leading
provider of integrated messaging and collaboration services, today
announced the availability of an enhanced version of its Enterprise
Messaging Service (CMS) 2.0, a lower cost webmail alternative to other
business email solutions such as Microsoft Exchange, GroupWise and
LotusNotes offerings."

      

Here's what I need to do: find out whether there is a link within the first 20 visible words.

These are the first 20 visible words:

"SuperAmazing.com, a subsidiary of Amazing, the leading provider of
integrated messaging and collaboration services, today announced the
availability of an"

      

I need the number of characters, including HTML, from the start of the string to the end of the 20th visible word, which in this case is "an", although of course it will be different for each excerpt on the page.

(I'm willing to count "SuperAmazing.com" as 2 words if that makes it easier.)

I tried a number of regexes for word counting, but they all count the HTML as well, not just the visible words.

So what would be the correct regex to find the full number of characters, including HTML, for the first 20 visible words?

+2




4 answers


Here's a good enough regex to match the first twenty visible words:

'~^(?:\s*+(?:(?:[^<>\s]++|</?\w[^<>]*+>)++)){1,20}~'

      

This matches one to twenty whitespace-separated tokens, where a token is one or more words or tags with no whitespace between them (a "word" being one or more characters other than whitespace or angle brackets). For example, this is one token:

<a href="http://www.amazing.com/">Amazing</a>

      



...but this is two tokens:

<a href="http://www.superduper.com/">Super Duper</a>

      

This treats a standalone tag (like the <img> tag in your example, or any tag that is surrounded by whitespace) as a separate token, which throws the count off: in your example it only matches up to the word "of". It also mishandles <br> tags and block-level tags like <p> and <table> if they have no whitespace around them. Only you can know how much of a problem that will be.
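As a rough sketch of how the pattern above could be used to get the character count, the snippet below runs it against the example excerpt with preg_match() and measures the match with strlen(). The helper name firstTokens() is made up for illustration, and strlen() counts bytes, so mb_strlen() would be the safer choice for multibyte text.

```php
<?php
// Pattern from the answer: one to twenty whitespace-separated tokens.
$pattern = '~^(?:\s*+(?:(?:[^<>\s]++|</?\w[^<>]*+>)++)){1,20}~';

// Hypothetical helper: returns the matched prefix, markup included,
// or an empty string when nothing matches.
function firstTokens(string $html, string $pattern): string {
    return preg_match($pattern, $html, $m) === 1 ? $m[0] : '';
}

// The example excerpt from the question.
$html = '<img src="/images/corporate/logos/super_amazing.jpg" alt="Company logo for SuperAmazing.com" /> '
      . 'SuperAmazing.com, a subsidiary of <a href="http://www.amazing.com/">Amazing</a>, the leading '
      . 'provider of integrated messaging and collaboration services, today announced the availability '
      . 'of an enhanced version of its Enterprise Messaging Service (CMS) 2.0';

$prefix = firstTokens($html, $pattern);

// Number of characters (bytes) in the match, HTML included.
echo strlen($prefix), "\n";

// Tail of the match: because the standalone <img> counts as a token,
// the match stops one visible word early, at "availability of".
echo substr($prefix, -15), "\n";
```

Checking for a link inside the matched prefix could then be as simple as testing whether it contains "<a ", again only as an approximation.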

EDIT: If a standalone <img> tag is something you see a lot, you could preprocess the text to remove the whitespace following it. That would effectively merge it with the first "real" token after it, resulting in a more accurate character count. I know it only changes the count by one or two characters in this case, but if the twentieth word happened to be "supercalifragilisticexpialidocious", you would probably notice the difference. :)
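The preprocessing idea from the EDIT might look something like the following. The helper glueStandaloneImages() is a hypothetical name, and its pattern only handles <img> tags; other standalone tags would need the same treatment.

```php
<?php
// Pattern from the answer: one to twenty whitespace-separated tokens.
$pattern = '~^(?:\s*+(?:(?:[^<>\s]++|</?\w[^<>]*+>)++)){1,20}~';

// Hypothetical helper: drop the whitespace after each <img> tag so the tag
// merges with the token that follows it instead of counting as its own token.
function glueStandaloneImages(string $html): string {
    return preg_replace('~(<img\b[^<>]*+>)\s++~i', '$1', $html);
}

// The example excerpt from the question.
$html = '<img src="/images/corporate/logos/super_amazing.jpg" alt="Company logo for SuperAmazing.com" /> '
      . 'SuperAmazing.com, a subsidiary of <a href="http://www.amazing.com/">Amazing</a>, the leading '
      . 'provider of integrated messaging and collaboration services, today announced the availability '
      . 'of an enhanced version of its Enterprise Messaging Service (CMS) 2.0';

$glued = glueStandaloneImages($html);
preg_match($pattern, $glued, $m);

// Character count of the first twenty tokens in the preprocessed string.
echo strlen($m[0]), "\n";

// The match now runs through the 20th visible word, "an".
echo substr($m[0], -5), "\n";
```

Note that the count is taken on the preprocessed string, so it is short by however many whitespace characters were removed after the tag.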

+2




I am not sure a PHP regex is the right tool for counting words.

Assuming you can isolate the visible words in a variable, my first approach would be to explode/split it on spaces (or on whatever delimits what you consider a word) and put the results in an array.

After splitting, slice the array down to its first 20 elements.

Then apply a regular expression to each of those elements to decide whether any of them contains a link.

To get the number of characters, join/implode the twenty-word array (without spaces) and take the length of the resulting string.
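A minimal sketch of the split, slice and length steps, assuming strip_tags() is an acceptable way to isolate the visible words. Because the tags are stripped first, this counts visible characters only and cannot detect links by itself; the variable names are illustrative.

```php
<?php
// The example excerpt from the question.
$html = '<img src="/images/corporate/logos/super_amazing.jpg" alt="Company logo for SuperAmazing.com" /> '
      . 'SuperAmazing.com, a subsidiary of <a href="http://www.amazing.com/">Amazing</a>, the leading '
      . 'provider of integrated messaging and collaboration services, today announced the availability '
      . 'of an enhanced version of its Enterprise Messaging Service (CMS) 2.0';

// Isolate the visible text (assumption: strip_tags() is good enough here).
$visible = trim(strip_tags($html));

// Split on runs of whitespace, discarding empty pieces.
$words = preg_split('/\s+/', $visible, -1, PREG_SPLIT_NO_EMPTY);

// Keep the first 20 words.
$first20 = array_slice($words, 0, 20);

// Join without spaces and measure the result.
$chars = strlen(implode('', $first20));

echo implode(' ', $first20), "\n";
echo $chars, "\n";
```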

+2




The "getTextFromNode" and "getTextFromDocument" functions extract the text content from the HTML. The "getFirstWords" function returns the first N words of that text.

function getTextFromNode($Node, $Text = "") {
    // Text nodes (and other non-element nodes) contribute their content directly.
    if (!($Node instanceof DOMElement))
        return $Text.$Node->textContent;

    // Walk the element's children in document order, accumulating their text.
    // This also copes with elements that have no children at all.
    for ($Child = $Node->firstChild; $Child !== null; $Child = $Child->nextSibling)
        $Text = getTextFromNode($Child, $Text);

    return $Text;
}

function getTextFromDocument($DOMDoc) {
    return getTextFromNode($DOMDoc->documentElement);
}

function getFirstWords($Text, $Count = 1) {
    if (!($Count > 0))
        $Count = 1;

    $Text = trim($Text);

    // split() was removed in PHP 7, so use preg_split() instead.
    // Ask for one extra part so the rest of the text lands in the last slot.
    $TextParts = preg_split('/\s+/', $Text, $Count + 1);

    // Keep only the first $Count words and drop the remainder.
    $NewText = join(" ", array_slice($TextParts, 0, $Count));
    return $NewText;
}

      

And you can use it like this:

$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");

$Text = getTextFromDocument($Doc);
echo "Text from HTML: ".$Text."\n";

$NewText = getFirstWords($Text, 20);
echo "First 20 words from HTML: ".$NewText."\n";

      

Hope it helps.

+2




Regex and HTML don't mix. Regexes are also a poor tool for counting. A regex is the wrong solution to your problem. Use an HTML parsing library to extract the text, then use some form of tokenizer to extract the words. You will save yourself a lot of headaches.

What are the headaches? Let's say you manage to create a monstrous regular expression that does what you want. Now suppose that two years later an edge case turns up that you missed, and you need to change this monster. At that point, you will wish you had a coded solution that you could easily change.
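As one illustration of the parser-based route, the sketch below uses PHP's DOM extension to check whether any of the first 20 visible words sits inside a link. The function name hasLinkInFirstWords() and the <div> wrapper are assumptions for the example, and the word count is approximate: punctuation that directly follows a closing tag (such as the comma after "Amazing") lands in its own text node and is counted as an extra word.

```php
<?php
// Hypothetical helper: true if a text node inside an <a> element starts
// within the first $limit visible words of the fragment.
function hasLinkInFirstWords(string $html, int $limit = 20): bool {
    $doc = new DOMDocument();
    // Collect parser warnings for imperfect markup instead of printing them.
    libxml_use_internal_errors(true);
    // Wrap the fragment so loose text has a parent element.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $seen = 0; // visible words counted so far

    // text() nodes come back in document order.
    foreach ($xpath->query('//body//text()') as $textNode) {
        $words = preg_split('/\s+/', $textNode->nodeValue, -1, PREG_SPLIT_NO_EMPTY);
        if (count($words) === 0) {
            continue; // whitespace-only node
        }
        // This node starts before the limit is reached; is it inside a link?
        if ($xpath->query('ancestor::a', $textNode)->length > 0) {
            return true;
        }
        $seen += count($words);
        if ($seen >= $limit) {
            break;
        }
    }
    return false;
}

// The example excerpt from the question.
$html = '<img src="/images/corporate/logos/super_amazing.jpg" alt="Company logo for SuperAmazing.com" /> '
      . 'SuperAmazing.com, a subsidiary of <a href="http://www.amazing.com/">Amazing</a>, the leading '
      . 'provider of integrated messaging and collaboration services, today announced the availability '
      . 'of an enhanced version of its Enterprise Messaging Service (CMS) 2.0';

var_dump(hasLinkInFirstWords($html));
```

This answers the link question without ever needing a character offset into the raw HTML, which is the part that is awkward to get out of a parsed document.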

+1








