Why do mbstring functions incorrectly identify ISO-8859 strings?

Question

Despite listing each ISO-8859-* character set as an individual encoding, the mbstring functions appear to treat all of the ISO-8859 character sets interchangeably. To illustrate:

$strings = [ 
  'English'   => 'Ea vim decore sapientem repudiandae. Sea cu delenit gamu mutn, tic.',
  'Cyrillic'  => '    ,     ,     .',
  'Greek'     => 'Λορεμ ιπσθμ δολορ σιτ αμετ, ηασ γραεcο νθσqθαμ cθ, εστ θτ εσσε διcαμ qθαλισqθε cθ.',
  'Armenian'  => 'լոռեմ իպսում դոլոռ սիթ ամեթ, եամ նո թաթիոն ծոմպռեհենսամ, իուս ադ նիսլ ոմնիս մինիմ եսթ',
  'Georgian'  => 'ლორემ იფსუმ დოლორ სით ამეთ, ეხ ყუანდო ცოფიოსაე უსუ, იუს ეუ ჰინც ვერო დომინგ ჰის',
  'Hindi'     => 'वर्ष एसेएवं व्याख्यान संदेश होने लक्षण एसेएवं पहोचाना विचरविमर्श? वर्णन करती आशाआपस अन्तरराष्ट्रीयकरन. रहारुप कार्यसिधान्त',
  'Korean'    => '모든 국민은 보건에 관하여 국가의 보호를 받는다, 전직대통령의 신분과 예우에 관하여는 법',
  'Arabic'    => 'مع لهذه الهجوم عدم, فكان اتفاق الصفحات من أسر. وجزر عُقر أما بـ, عل دار بقسوة المتّبعة بالولايات. وإقامة والفرنسي كل لكل. أي',
  'Hebrew'    => 'עמוד מדינות, חפש ואלקטרוניקה אנתרופולוגיה דת, מה קהילה הקהילה טכנו'
];

$encodings = ['ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15' ];

foreach( $strings as $lang => $text ) {
    echo $lang . " is encoded as " . mb_detect_encoding( $text, $encodings ) . "\n";

    foreach( $encodings as $encoding ) {
        echo " - is " . (mb_check_encoding( $text, $encoding ) ? "" : "not ") . $encoding . "\n";
    }
}

This produces output along these lines:

Hindi is encoded as ISO-8859-1
  - is ISO-8859-1
  - is ISO-8859-2
  - is ISO-8859-3
  - is ISO-8859-4
  - is ISO-8859-5
  - is ISO-8859-6
  - is ISO-8859-7
  - is ISO-8859-8
  - is ISO-8859-9
  - is ISO-8859-10
  - is ISO-8859-13
  - is ISO-8859-14
  - is ISO-8859-15

with identical results for every listed language, which is clearly wrong.

Why does mbstring list each ISO-8859-* encoding separately, yet handle them interchangeably? Is there a way to reliably determine the correct encoding?

Or am I just using these functions incorrectly?

+3

php character-encoding iso-8859-1 mbstring

bosco March 25 17 at 10:54

1 answer

Paul crovella · Accepted Answer · 2017-03-25T12:08:27+0000

mb_detect_encoding makes a guess about what the encoding is. It is impossible for such a guess to be reliably accurate, and this function doesn't try very hard.
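To see why any detector can only guess, consider one byte decoded under three of the ISO-8859 parts from the question. This is a small sketch, not part of the original answer; the variable names are illustrative, and the only mbstring call used is mb_convert_encoding.

```php
<?php
// One byte, three equally legitimate readings.
$byte = "\xE4";
$decoded = [];

foreach (['ISO-8859-1', 'ISO-8859-5', 'ISO-8859-7'] as $encoding) {
    // Interpret the byte as $encoding and convert it to UTF-8 for display.
    $decoded[$encoding] = mb_convert_encoding($byte, 'UTF-8', $encoding);
    echo $encoding . ' => ' . $decoded[$encoding] . "\n";
}
// ISO-8859-1 => ä
// ISO-8859-5 => ф
// ISO-8859-7 => δ
```

Nothing in the byte itself says which reading was intended; only knowledge from outside the string (a header, a database column definition, a convention) can settle it.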

mb_check_encoding tells you whether a string is a sequence of bytes that is valid in a given encoding. Since every possible byte is valid in every ISO-8859-* encoding, checking against them is pointless (the call will always return true).
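By contrast, a validity check does carry information for an encoding with structural rules, such as UTF-8, which is what the strings in the question actually are. A minimal sketch, assuming the input is either UTF-8 or some single-byte encoding; the sample byte strings and array keys are made up for illustration:

```php
<?php
// UTF-8 bytes for the Greek capital lambda (U+039B), written as escapes
// so the result does not depend on how this file is saved.
$utf8Lambda = "\xCE\x9B";
// "cafe" with an accented e, as ISO-8859-1 bytes: the lone 0xE9 is not valid UTF-8.
$latin1Cafe = "caf\xE9";

$results = [
    'lambda as UTF-8'      => mb_check_encoding($utf8Lambda, 'UTF-8'),      // true
    'cafe as UTF-8'        => mb_check_encoding($latin1Cafe, 'UTF-8'),      // false
    'lambda as ISO-8859-1' => mb_check_encoding($utf8Lambda, 'ISO-8859-1'), // true
    'cafe as ISO-8859-1'   => mb_check_encoding($latin1Cafe, 'ISO-8859-1'), // true
];

var_dump($results);
```

The UTF-8 check can fail, so a pass tells you something; the ISO-8859-1 check passes for both inputs, so it tells you nothing about which encoding the bytes were written in.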

For related reading, I highly recommend: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
