PHP mb_split(), capturing delimiters

preg_split has an optional flag, PREG_SPLIT_DELIM_CAPTURE, that also returns all delimiters in the returned array. mb_split has no such option.

Is there a way to split a multibyte string (not just UTF-8, but all kinds) and grab the delimiters?

I am trying to create a multibyte-safe line splitter that keeps the line breaks, but would prefer a more general-purpose solution.
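For reference, this is the behaviour being asked for, shown with preg_split on a plain byte string. It is only a small illustration; the sample string and the line-break pattern are made up for the example.

```php
<?php
// The parentheses around the delimiter make it a capture group,
// which PREG_SPLIT_DELIM_CAPTURE then returns as its own element.
$parts = preg_split('/(\r\n|\r|\n)/', "one\ntwo\r\nthree", -1, PREG_SPLIT_DELIM_CAPTURE);

// $parts is now array("one", "\n", "two", "\r\n", "three")
```

Joining the returned elements with implode('', $parts) gives back the original string, which is the property a multibyte version should keep.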

Solution: Thanks to user Casimir et Hippolyte, I built a solution and posted it on GitHub ( https://github.com/vanderlee/PHP-multibyte-functions/blob/master/functions/mb_explode.php ), which allows all preg_split flags to be used:

/**
 * A cross between mb_split and preg_split, adding the preg_split flags
 * to mb_split.
 * @param string $pattern
 * @param string $string
 * @param int $limit
 * @param int $flags
 * @return array
 */
function mb_explode($pattern, $string, $limit = -1, $flags = 0) {       
    $strlen = strlen($string);      // bytes!   
    mb_ereg_search_init($string);

    $lengths = array();
    $position = 0;
    while (($array = mb_ereg_search_pos($pattern)) !== false) {
        // capture split
        $lengths[] = array($array[0] - $position, false, null);

        // move position
        $position = $array[0] + $array[1];

        // capture delimiter
        $regs = mb_ereg_search_getregs();           
        $lengths[] = array($array[1], true, isset($regs[1]) && $regs[1]);

        // Continue on?
        if ($position >= $strlen) {
            break;
        }           
    }

    // Add last bit, if not ending with split
    $lengths[] = array($strlen - $position, false, null);

    // Substrings
    $parts = array();
    $position = 0;      
    $count = 1;
    foreach ($lengths as $length) {
        $is_delimiter   = $length[1];
        $is_captured    = $length[2];

        if ($limit > 0 && !$is_delimiter && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY) && ++$count > $limit) {
            if ($length[0] > 0 || ~$flags & PREG_SPLIT_NO_EMPTY) {          
                $parts[]    = $flags & PREG_SPLIT_OFFSET_CAPTURE
                            ? array(mb_strcut($string, $position), $position)
                            : mb_strcut($string, $position);                
            }
            break;
        } elseif ((!$is_delimiter || ($flags & PREG_SPLIT_DELIM_CAPTURE && $is_captured))
               && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY)) {
            $parts[]    = $flags & PREG_SPLIT_OFFSET_CAPTURE
                        ? array(mb_strcut($string, $position, $length[0]), $position)
                        : mb_strcut($string, $position, $length[0]);
        }

        $position += $length[0];
    }

    return $parts;
}

      



1 answer


Capturing delimiters is only possible with preg_split; the other split functions do not offer it.

So there are three possibilities:

1) Convert your string to UTF-8, use preg_split with PREG_SPLIT_DELIM_CAPTURE, and use array_map to convert each item back to the original encoding.

This is the simplest way, which cannot be said of the second one. (Note that in general it is easier to always work in UTF-8 than to deal with exotic encodings.)
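A minimal sketch of the first option might look like the following. The helper name split_keep_delims is hypothetical (it is not part of PHP or mbstring), the Shift-JIS sample is only an illustration, and the sketch assumes the source file is saved as UTF-8 and that the pattern is written for UTF-8 input.

```php
<?php
// Hypothetical helper, assumed for illustration only.
function split_keep_delims($pattern, $string, $encoding)
{
    // Work in UTF-8 so that preg_split (with the /u modifier) sees whole characters.
    $utf8 = mb_convert_encoding($string, 'UTF-8', $encoding);

    $parts = preg_split($pattern, $utf8, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        throw new RuntimeException('preg_split failed');
    }

    // Convert every piece, delimiters included, back to the original encoding.
    return array_map(function ($part) use ($encoding) {
        return mb_convert_encoding($part, $encoding, 'UTF-8');
    }, $parts);
}

// Sample input: a two-line string converted to Shift-JIS.
$sjis  = mb_convert_encoding("日本語\nテキスト", 'SJIS', 'UTF-8');
$parts = split_keep_delims('/(\R)/u', $sjis, 'SJIS');
```

Here $parts should hold the first line, the line break, and the second line, all still in Shift-JIS. The cost of this approach is two conversions per call, and it only works for encodings that mb_convert_encoding can round-trip through UTF-8.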

2) Instead of a split-like function, use for example mb_ereg_search_regs to collect the matched parts, with a pattern built like this:



delimiter|all_that_is_not_the_delimiter

      

(Note that the two branches must be mutually exclusive, and must be written so that no gaps are possible between results: the first part must start at the beginning of the string, each following part must be adjacent to the previous one, and the last part must end at the end of the string.)
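As a rough sketch of the second option, the loop below collects every match of a two-branch pattern. The sample string and the line-break pattern are assumptions made for the example, and the regex encoding is set to UTF-8 here; for another encoding, mb_regex_encoding would have to be set accordingly, provided the regex engine supports it.

```php
<?php
mb_regex_encoding('UTF-8');

$string = "一行目\n二行目\r\n三行目";

// Branch 1: a line break (the delimiter).
// Branch 2: a run of characters that are not line breaks.
// Together the two branches cover every character, so no gaps can occur.
$pattern = '(\r\n|\r|\n)|[^\r\n]+';

$parts = array();
mb_ereg_search_init($string, $pattern);
while (($regs = mb_ereg_search_regs()) !== false) {
    // $regs[0] is the whole match: either a delimiter or a piece between delimiters.
    $parts[] = $regs[0];
}

// $parts is now array("一行目", "\n", "二行目", "\r\n", "三行目")
```

Unlike a real split, this never produces empty elements between two adjacent delimiters, because the second branch requires at least one character.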

3) Use mb_split with lookarounds. By definition, lookarounds are zero-width assertions: they do not match any characters, only positions in the string. So you can use a pattern like this, which matches the positions before or after the delimiter:

(?=delimiter)|(?<=delimiter)

      

(A limitation of this method is that the subpattern in a lookbehind cannot have a variable length, in other words you cannot use a quantifier inside it, but it may be an alternation of fixed-length subpatterns: (?<=subpat1|subpat2|subpat3))
