mb_detect_encoding() - mb functions (multibyte string conversion library)

是丫丫呀 · 1 year ago (2023-11-21) · 21 views · #Technical tips

Tags: characters

mb_detect_encoding()

(PHP 4 >= 4.0.6, PHP 5, PHP 7)

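Detect character encoding

Description

mb_detect_encoding ( string $str [, mixed $encoding_list = mb_detect_order() [, bool $strict = false ]] ) : string

Detects the character encoding of the string $str.

Parameters

$str

The string to be inspected.

$encoding_list

$encoding_list is a list of character encodings. The encoding order may be given either as an array or as a comma-separated list string.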


If $encoding_list is omitted, detect_order is used.

$strict

$strict specifies whether to use strict encoding detection. The default is FALSE.

Return Values

The detected character encoding, or FALSE if the encoding of the given string cannot be detected.

Examples

mb_detect_encoding() example
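
A minimal sketch of the three calling styles described under Parameters. The sample string and the candidate encodings are assumptions chosen for illustration, not the manual's original example.

```php
<?php
// "café" written as explicit UTF-8 bytes, so the example does not depend on
// how this source file itself is saved.
$str = "caf\xC3\xA9";

// 1. No list given: the current detect_order is used, so the result
//    depends on the configuration of the running PHP.
$byDetectOrder = mb_detect_encoding($str);

// 2. Candidate encodings as a comma-separated string, with strict checking.
$fromString = mb_detect_encoding($str, 'ASCII, UTF-8, ISO-8859-1', true);

// 3. The same candidates given as an array.
$fromArray = mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true);

var_dump($byDetectOrder, $fromString, $fromArray);
```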

See Also

mb_detect_order() - Set/get the character encoding detection order

If you use mb_detect_encoding() to check whether a string is valid UTF-8, use strict mode; the result is pretty worthless otherwise.
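
A minimal sketch of such a strict check. The helper name looks_like_utf8() and the sample strings are assumptions made for illustration.

```php
<?php
// Hypothetical helper: true only if every byte sequence in $s is valid UTF-8.
function looks_like_utf8(string $s): bool
{
    // The third argument (true) turns on strict detection.
    return mb_detect_encoding($s, 'UTF-8', true) !== false;
}

$utf8   = "accentu\xC3\xA9"; // "accentué" encoded as UTF-8
$latin1 = "accentu\xE9";     // "accentué" encoded as ISO-8859-1

var_dump(looks_like_utf8($utf8));   // valid UTF-8
var_dump(looks_like_utf8($latin1)); // a lone 0xE9 byte is not valid UTF-8
```

If a yes/no answer is all you need, mb_check_encoding($s, 'UTF-8') is a more direct way to ask the same question.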

If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');
If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.

Based on the snippet below that uses preg_match(), I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters, so that I could switch the character encoding from ISO-8859-1 to UTF-8.
I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range, and to stop once it finds at least one multibyte sequence. This is quite a lot faster.
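
A sketch of what such a function might look like. The name detectUTF8 is taken from a later note on this page; the body is a reconstruction based on the well-known W3C byte ranges for UTF-8, not necessarily the author's exact code.

```php
<?php
// Returns true as soon as one well-formed multibyte UTF-8 sequence is found.
// It does NOT prove that the whole string is valid UTF-8.
function detectUTF8(string $string): bool
{
    return preg_match('%(?:
          [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $string) === 1;
}
```

Because the pattern is unanchored and has no repetition, the regex engine can stop at the first multibyte sequence instead of scanning the whole input.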

Beware of a bug when detecting Russian encodings:
http://bugs.php.net/bug.php?id=38138

I used Chris's function "detectUTF8" to decide whether a conversion from UTF-8 to ISO-8859-1 is needed, which works fine. I did have a problem with the iconv conversion that follows it.
The problem is that the iconv conversion to ISO-8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although in practice \x80 is commonly used as the euro sign in text labelled ISO-8859-1 (strictly speaking, \x80 is the euro sign in Windows-1252; ISO-8859-1 itself has none).
I could not use ISO-8859-15 since that mangled some other characters, so I added two str_replace() calls:
if (detectUTF8($str)) {
  // protect the euro sign (UTF-8 bytes E2 82 AC) from //TRANSLIT
  $str = str_replace("\xE2\x82\xAC", "&euro;", $str);
  $str = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
  // put the euro sign back as the single byte \x80
  $str = str_replace("&euro;", "\x80", $str);
}
If HTML output is needed, the last line is not necessary (and is even unwanted).

A simple way to detect whether a file is UTF-8/16/32 is to look at its BOM (this does not work for strings or files without a BOM).
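
A minimal sketch of BOM sniffing. The function name detect_bom_encoding() is hypothetical; the byte signatures are the standard Unicode BOMs.

```php
<?php
// Returns the encoding implied by a leading BOM, or null if there is none.
function detect_bom_encoding(string $bytes): ?string
{
    // Longer signatures come first: the UTF-32LE BOM begins with the
    // UTF-16LE BOM, so the order matters.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}
```

For a file, only the first four bytes are needed, for example file_get_contents($path, false, null, 0, 4), where $path is whatever file you want to inspect. One ambiguity remains: a UTF-16LE file whose first character is U+0000 is reported as UTF-32LE.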

In my environment (PHP 7.1.12), mb_detect_encoding() doesn't work when mb_detect_order() is not set appropriately. To make mb_detect_encoding() work in such a case, simply call mb_detect_order('...') before mb_detect_encoding() in your script file.
Both ini_set('mbstring.language', '...'); and ini_set('mbstring.detect_order', '...'); DON'T work in script files for this purpose, whereas setting them in php.ini may work.
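
A short sketch of that ordering. The candidate list and the sample string are illustrative assumptions; use the encodings you actually expect.

```php
<?php
// Set the detection order at runtime, before any detection call.
// The list below is only an example.
mb_detect_order(['ASCII', 'UTF-8']);

$str = "caf\xC3\xA9"; // "café" as UTF-8 bytes

// With no encoding list given, the order set above is used.
$encoding = mb_detect_encoding($str);

var_dump(mb_detect_order(), $encoding);
```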

A function to detect UTF-8; it may be useful when mb_detect_encoding is not available.

Just a note: Instead of using the often recommended (rather complex) regular expression by W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:
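
A minimal sketch of that check, wrapped in a hypothetical helper named is_valid_utf8():

```php
<?php
// With the 'u' modifier PCRE first validates the subject as UTF-8.
// The empty pattern matches any valid string (result 1); a malformed
// subject makes preg_match() return false instead.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}
```

After a failed call, preg_last_error() returns PREG_BAD_UTF8_ERROR, which distinguishes bad input from other regex errors.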

Example usage of mb_detect_encoding():

$txtstr = str_to_utf8($txtstr);
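
The definition of str_to_utf8() is not shown here. A possible sketch follows; the body is an assumption, and in particular it assumes that anything which is not valid UTF-8 is ISO-8859-1.

```php
<?php
// Sketch only: convert to UTF-8 unless the string already is valid UTF-8.
function str_to_utf8(string $str): string
{
    if (mb_detect_encoding($str, 'UTF-8', true) === false) {
        // Assumption: non-UTF-8 input is ISO-8859-1.
        $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
    }
    return $str;
}
```

Strings that are already UTF-8 pass through unchanged, so calling the function twice does not double-encode.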

Much simpler UTF-8-ness checker using a regular expression created by the W3C:
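
A sketch of a checker built on the byte ranges from the W3C page linked above. The function name is_utf8_w3c() is hypothetical.

```php
<?php
// True if the whole string consists of well-formed UTF-8 sequences.
// Like the W3C pattern, this rejects ASCII control characters other
// than tab, LF and CR.
function is_utf8_w3c(string $s): bool
{
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $s) === 1;
}
```

On very long inputs a pattern of this shape can hit PCRE's backtracking limits, in which case preg_match() returns false and the string is reported as invalid.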

To detect UTF-8, you can use:
if (preg_match('!!u', $str)) { echo 'utf-8'; }
- Norihiori

a) if the FUNCTION mb_detect_encoding is not available: 
### mb_detect_encoding ... iconv ###

b) if the FUNCTION mb_convert_encoding is not available: 
### mb_convert_encoding ... iconv ###
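
For case b), a minimal sketch of an iconv-based fallback. The wrapper name convert_encoding_fallback() is hypothetical and is not the code from the original note.

```php
<?php
// Hypothetical wrapper with the same argument order as mb_convert_encoding().
function convert_encoding_fallback(string $string, string $to, string $from)
{
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($string, $to, $from);
    }
    // iconv() takes the source encoding first, then the target.
    return iconv($from, $to, $string);
}
```

The two functions do not recognise exactly the same set of encoding names, so this only works for names both libraries know, such as UTF-8 and ISO-8859-1.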

Beware: even if you only need to distinguish between UTF-8 and ISO-8859-1 and you use the following detection order (as chrigu suggests), the result can be wrong. With ISO-8859-1 encoded strings,
mb_detect_encoding('accentuée' , 'UTF-8, ISO-8859-1')
returns ISO-8859-1, while
mb_detect_encoding('accentué' , 'UTF-8, ISO-8859-1')
returns UTF-8.
Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding.

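// -----------------------------------------------------------
// Fallback for installations without mbstring: round-trip the string through
// iconv and accept the first candidate encoding that leaves it unchanged.
// Note: with the default $ret = true this returns true, not the encoding name;
// pass $ret = false to get the name. $enc is returned when nothing matches.
if (!function_exists('mb_detect_encoding')) {
  function mb_detect_encoding($string, $enc = null, $ret = true) {
    static $list = array('utf-8', 'iso-8859-1', 'iso-8859-15', 'windows-1251');
    foreach ($list as $item) {
      // @ suppresses the notice iconv raises on invalid input
      $sample = @iconv($item, $item, $string);
      if ($sample === $string) {
        // stop at the first match so that earlier entries take priority
        return ($ret !== false) ? true : $item;
      }
    }
    return $enc;
  }
}
// -----------------------------------------------------------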

seems to work

Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding reports some ISO-8859-1 encoded text as UTF-8.
To verify UTF-8, use the following:
//
//  utf8 encoding validation developed based on Wikipedia entry at:
//  http://en.wikipedia.org/wiki/UTF-8
//
//  Implemented as a recursive descent parser based on a simple state machine
//  copyright 2005 Maarten Meijer
//
//  This cries out for a C-implementation to be included in PHP core
//
  // Each helper takes the integer value of one byte, e.g. ord($string[$i]).

  // 0xxxxxxx: single-byte (ASCII) character
  function valid_1byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0x80) == 0x00;
  }

  // 110xxxxx: lead byte of a 2-byte sequence
  function valid_2byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xE0) == 0xC0;
  }

  // 1110xxxx: lead byte of a 3-byte sequence
  function valid_3byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xF0) == 0xE0;
  }

  // 11110xxx: lead byte of a 4-byte sequence
  function valid_4byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xF8) == 0xF0;
  }

  // 10xxxxxx: continuation byte
  function valid_nextbyte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xC0) == 0x80;
  }

  function valid_utf8($string) {
    $len = strlen($string);
    $i = 0;
    while( $i < $len ) {
      // ... (the rest of this function is missing from the source)

About mb_detect_encoding(): the manual page at http://php.net/manual/zh/function.mb-detect-encoding.php gives this example:
mb_detect_encoding('áéóú', 'UTF-8', true); // false
but the result I get now is not false. Can anyone tell me why? Thanks!

I seriously underestimated the importance of setlocale...

Produces the following: 
Array
(
  [0] => mais coisas a pensar sobre diario ou dois!
  [1] => plus de choses a penser a journalier ou a deux !
  [2] => ?mas cosas a pensar en diario o dos!
  [3] => piu cose da pensare circa giornaliere o due!
  [4] => flere ting aa tenke paa hver dag eller to!
  [5] => Dalsi veci, premyslet o kazdy den nebo dva!
  [6] => mehr ueber Spass spaet schoenen
  [7] => me vone gjate fun bukur
  [8] => toebb mint szorakozas keso csodalatos kenyer
)
whereas 

produces:
Array
(
  [0] => mais coisas a pensar sobre di?rio ou dois!
  [1] => plus de choses ? penser ? journalier ou ? deux !
  [2] => ?m?s cosas a pensar en diario o dos!
  [3] => pi? cose da pensare circa giornaliere o due!
  [4] => flere ting ? tenke p? hver dag eller to!
  [5] => Dal?? v?c?, p?em??let o ka?d? den nebo dva!
  [6] => mehr ?ber Spass sp?t sch?nen
  [7] => m? von? gjat? fun bukur
  [8] => t?bb mint sz?rakoz?s k?s? csod?latos keny?r
)
This might be of interest when trying to convert UTF-8 strings into ASCII suitable for URLs and the like. This was never obvious to me, since I had only ever used the us and nl locales.
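
The code behind the two listings in this note is not shown. A minimal sketch of the kind of comparison it describes follows, assuming transliteration with iconv(); the helper name to_ascii() and the locale names are assumptions, and the output depends on the platform's iconv implementation and on which locales are installed.

```php
<?php
// Hypothetical helper: transliterate UTF-8 text to ASCII while a given
// LC_CTYPE locale is active, then restore the previous locale.
function to_ascii(string $utf8, string $locale)
{
    $previous = setlocale(LC_CTYPE, '0'); // '0' only queries the current setting
    setlocale(LC_CTYPE, $locale);         // returns false if the locale is missing
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
    setlocale(LC_CTYPE, $previous);
    return $ascii;                        // string, or false if iconv failed
}

$text = 'mehr über Spass spät schönen';

// Results are platform dependent, so none are claimed here.
var_dump(to_ascii($text, 'de_DE.UTF-8'));
var_dump(to_ascii($text, 'C'));
```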
The last example for verifying UTF-8 has one small bug. If a 10xxxxxx byte occurs on its own, i.e. not as part of a multibyte character, it is accepted even though that is against the UTF-8 rules. Make the following replacement to fix it.
Replace
     } // goto next char
with
     } else {
      return false; // 10xxxxxx occurring alone
     } // goto next char
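
A sketch of what a complete valid_utf8() with this fix applied might look like. The helper functions mirror the ones in the earlier note; the loop body is a reconstruction, not the original author's code.

```php
<?php
// Byte classifiers as in the earlier note; each takes ord() of one byte.
function valid_1byte(int $c): bool    { return ($c & 0x80) === 0x00; } // 0xxxxxxx
function valid_2byte(int $c): bool    { return ($c & 0xE0) === 0xC0; } // 110xxxxx
function valid_3byte(int $c): bool    { return ($c & 0xF0) === 0xE0; } // 1110xxxx
function valid_4byte(int $c): bool    { return ($c & 0xF8) === 0xF0; } // 11110xxx
function valid_nextbyte(int $c): bool { return ($c & 0xC0) === 0x80; } // 10xxxxxx

function valid_utf8(string $string): bool
{
    $len = strlen($string);
    $i = 0;
    while ($i < $len) {
        $char = ord($string[$i++]);
        if (valid_1byte($char)) {
            continue;              // plain ASCII, go to next char
        } elseif (valid_2byte($char)) {
            $extra = 1;
        } elseif (valid_3byte($char)) {
            $extra = 2;
        } elseif (valid_4byte($char)) {
            $extra = 3;
        } else {
            return false;          // 10xxxxxx occurring alone, or an invalid lead byte
        }
        // The lead byte must be followed by exactly $extra continuation bytes.
        for (; $extra > 0; $extra--) {
            if ($i >= $len || !valid_nextbyte(ord($string[$i++]))) {
                return false;
            }
        }
    }
    return true;
}
```

This checks only the bit patterns of lead and continuation bytes. It does not reject overlong encodings or surrogate code points, so it is looser than a full UTF-8 validator.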
From PHPDIG:
  function isUTF8($str) {
    // A valid UTF-8 string survives a UTF-8 -> UTF-32 -> UTF-8 round trip unchanged.
    return $str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32");
  }
Another lightweight way to detect character encoding:
