Generating clean text using PHP

I am using a service that returns generated strings. The lines usually look like this:

Hello   Mr   John Doe, you are now registered \t.
Hello &nbsp; Mr   John Doe, your phone number is &nbsp; 555-555-555 &nbsp; \n

      

I need to remove all HTML entities and all \t, \n, etc.

I can use html_entity_decode to convert the non-breaking spaces and str_replace to remove \t or \n, but is there a more general way, one that guarantees the string contains nothing but alphabetic characters (a string that does not contain any codes)?


1 answer


If I understand your case correctly, you basically want to convert from HTML to plain text.

Depending on the complexity of your input and the required reliability and accuracy, you have several options:

  • Use strip_tags() to remove HTML tags, mb_convert_encoding() with HTML-ENTITIES as the source encoding to decode entities, and strtr() or preg_replace() to make any further substitutions:

    $html = "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
        Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
        Test: &euro;/&eacute;</p>";
    
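    $plain_text = $html;
    // Remove the markup, leaving only text and entities.
    $plain_text = strip_tags($plain_text);
    // Decode entities (&nbsp;, &euro;, ...) to UTF-8. The 'HTML-ENTITIES' encoding is
    // deprecated as of PHP 8.2; html_entity_decode() is the usual replacement.
    $plain_text = mb_convert_encoding($plain_text, 'UTF-8', 'HTML-ENTITIES');
    // Replace tabs and line breaks with plain spaces.
    $plain_text = strtr($plain_text, [
        "\t" => ' ',
        "\r" => ' ',
        "\n" => ' ',
    ]);
    // Collapse each run of whitespace into a single space.
    $plain_text = preg_replace('/\s+/u', ' ', $plain_text);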
    
    var_dump($html, $plain_text);
    
          

  • Use a suitable DOM parser, plus perhaps preg_replace() for further customization:

    $html = "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
        Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
        Test: &euro;/&eacute;</p>";
    
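    $dom = new DOMDocument();
    // Suppress libxml warnings about imperfect markup while parsing.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($dom);
    
    // Concatenate every text node; entities are already decoded by the parser.
    $plain_text = '';
    foreach ($xpath->query('//text()') as $textNode) {
        $plain_text .= $textNode->nodeValue;
    }
    // Collapse each run of whitespace into a single space.
    $plain_text = preg_replace('/\s+/u', ' ', $plain_text);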
    
    var_dump($html, $plain_text);
    
          



Both solutions should print something like this:

string(169) "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
    Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
    Test: &euro;/&eacute;</p>"
string(107) "Hello Mr John Doe, you are now registered. Hello Mr John Doe, your phone number is 555-555-555 Test: €/é"
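
If you need this in several places, the first option can be wrapped in a small function. The following is only a sketch under a few assumptions: the input is valid UTF-8, html_entity_decode() is used in place of mb_convert_encoding() (the HTML-ENTITIES encoding is deprecated as of PHP 8.2), and the names html_to_plain_text() and looks_clean() are hypothetical helpers, not part of PHP. The second function is a heuristic check for the "does the string still contain codes?" part of the question.

```php
<?php
// Hypothetical helpers for illustration; the names are not from the original answer.

/**
 * Convert an HTML fragment to single-line plain text.
 */
function html_to_plain_text(string $html): string
{
    // Remove tags, then decode entities such as &nbsp; or &euro; to UTF-8.
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace, including tabs, newlines and U+00A0
    // (the decoded &nbsp;), into a single space.
    // "?? ''" guards against preg_replace() returning null on invalid UTF-8.
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? '';

    // Drop any control characters that are still left.
    $text = preg_replace('/\p{Cc}+/u', '', $text) ?? '';

    return trim($text);
}

/**
 * Heuristic check: true when no control characters, non-breaking spaces
 * or entity-like sequences (&name; / &#123; / &#x1F;) remain.
 */
function looks_clean(string $text): bool
{
    $pattern = '/\p{Cc}|\x{00A0}|&(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);/iu';
    return preg_match($pattern, $text) === 0;
}

$raw = "<p>Hello &nbsp; Mr &nbsp; John Doe,\tyou are now registered.\n</p>";
$clean = html_to_plain_text($raw);
var_dump($clean, looks_clean($clean));
```

Note that strip_tags() joins the text of adjacent elements without inserting a space, so for markup with several block elements the DOM-based option is likely the safer starting point. The check is deliberately strict: a misspelled entity is left untouched by html_entity_decode() and will therefore be flagged by looks_clean().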

      
