What is the character set if default_charset is empty?
From PHP 5.6 onwards, the default_charset setting defaults to "UTF-8", as described e.g. in the php.ini documentation. It says the setting is empty in earlier versions.
Since I am creating a Java library to communicate with PHP, I need to know what values I should expect when a string is treated as raw bytes internally. What happens if default_charset is empty and a string (literal) contains characters outside the ASCII range? Should I expect the platform's default encoding, or the character encoding used for the source file?
Short answer
For literal strings, always the source file encoding. The default_charset value does nothing here.
Longer answer
PHP strings are "binary", meaning they have no internal string encoding. Basically, a string in PHP is just a byte buffer.
For literal strings, e.g. $s = "Ä", this means the string will contain whatever bytes were stored in the file between the quotes. If the file was saved in UTF-8, it is equivalent to $s = "\xc3\x84"; if the file was saved in ISO-8859-1 (latin1), it is equivalent to $s = "\xc4".
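Seen from the Java side of such a library, the same point can be sketched roughly as follows. This is only an illustration: the class name LiteralBytes and the hex helper are made up for the example. It prints the bytes that the character "Ä" becomes under each of the two file encodings mentioned above.

```java
import java.nio.charset.StandardCharsets;

public class LiteralBytes {
    // Hypothetical helper: lowercase hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Ä" written as a Unicode escape, so the encoding of this
        // Java source file does not influence the result.
        String text = "\u00c4";

        // Bytes a PHP literal would hold if the .php file was saved as UTF-8.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));      // c384

        // Bytes a PHP literal would hold if the .php file was saved as ISO-8859-1.
        System.out.println(hex(text.getBytes(StandardCharsets.ISO_8859_1))); // c4
    }
}
```

The Java string is the same in both cases; only the byte representation differs, which is exactly what the PHP side stores.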
The default_charset setting has no effect on the bytes stored in strings.
What does default_charset do?
Some functions, which must deal with strings as text and therefore need to know the encoding, take $encoding as an argument (usually optional). This argument tells the function which encoding the text in the string uses.
Before PHP 5.6, the default value for these optional $encoding arguments was either given in the function definition (for example htmlspecialchars()), or configured in separate php.ini settings for each extension (for example mbstring.internal_encoding, iconv.input_encoding).
PHP 5.6 gave the php.ini setting default_charset a new role (the setting itself already existed, with an empty default). The old per-extension settings were deprecated, and all functions that take an optional $encoding argument should now default to the default_charset value when no encoding is explicitly specified.
However, it is the developer's responsibility to ensure that the text in the string is actually encoded in the encoding that was specified.
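The same responsibility falls on a Java client that receives these bytes: it has to be told which encoding to use, because the bytes carry no label. A minimal sketch, assuming the bytes arrive as a byte[]; the class name StrictDecode and the method decodeStrict are hypothetical names for this example. It uses a strict CharsetDecoder so that a wrong guess fails loudly where that is detectable.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Hypothetical helper: decode bytes with the given charset and throw
    // instead of silently substituting replacement characters.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] utf8Bytes = {(byte) 0xc3, (byte) 0x84}; // "Ä" from a UTF-8 source file
        byte[] latin1Bytes = {(byte) 0xc4};            // "Ä" from an ISO-8859-1 source file

        // Correct charset for each byte sequence: both decode to "Ä".
        System.out.println(decodeStrict(utf8Bytes, StandardCharsets.UTF_8).equals("\u00c4"));        // true
        System.out.println(decodeStrict(latin1Bytes, StandardCharsets.ISO_8859_1).equals("\u00c4")); // true

        // Wrong guess, detectable: a lone 0xc4 is not valid UTF-8.
        try {
            decodeStrict(latin1Bytes, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8");
        }

        // Wrong guess, NOT detectable: every byte is valid ISO-8859-1,
        // so UTF-8 bytes decode without error into two wrong characters.
        System.out.println(decodeStrict(utf8Bytes, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```

Strict decoding only catches some mismatches, so the encoding still has to be agreed between the PHP side and the Java side.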
Links:
- String Type Details: more about the nature of PHP strings (does not mention default_charset at the time of writing).
- New Features in PHP 5.6: Default Character Encoding: brief introduction of the new role of the default_charset option in the PHP 5.6 release notes.
- Deprecated features in PHP 5.6: iconv and mbstring encoding settings: list of the php.ini options deprecated in favor of the default_charset option.
It seems you shouldn't rely on the internal encoding. The internal character encoding can be read or set using mb_internal_encoding().
phpinfo() example
- PHP version 5.5.9-1ubuntu4.5
- default_charset no value
file1.php
<?php
$string = "e";
echo mb_internal_encoding(); //ISO-8859-1
file2.php
<?php
$string = "Ä";
echo mb_internal_encoding(); //ISO-8859-1
Both files will output ISO-8859-1 unless you manually change the internal encoding.
<?php
echo bin2hex("ö"); //c3b6 (utf-8)
Getting the hex representation of that character shows its UTF-8 encoding. If you save the file as UTF-8, the string in this example will be 2 bytes long, even if the internal encoding is not set to UTF-8. Therefore, you must rely on the character encoding used for the source file.
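On the receiving Java side, the bin2hex() output from the example can be turned back into text, which makes the dependency on the source file encoding visible. This is a rough sketch; the class name HexRoundTrip and the fromHex helper are hypothetical, and the input is assumed to be a well-formed, even-length hex string.

```java
import java.nio.charset.StandardCharsets;

public class HexRoundTrip {
    // Hypothetical helper: parse an even-length hex string (as printed by
    // PHP's bin2hex()) back into the raw bytes.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = fromHex("c3b6"); // bin2hex output from the PHP example

        // Two bytes on the wire, as stated in the answer.
        System.out.println(bytes.length); // 2

        // Decoded with the encoding the .php file was saved in: one character, "ö".
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(asUtf8.equals("\u00f6")); // true

        // Decoded as ISO-8859-1 (what mb_internal_encoding() reported):
        // two characters, neither of which is "ö".
        String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1.length()); // 2
    }
}
```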