mb_detect_encoding() - mb functions (Multibyte String conversion library)
mb_detect_encoding()
(PHP 4 >= 4.0.6, PHP 5, PHP 7)
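Detect character encoding
Description
mb_detect_encoding ( string $str [, mixed $encoding_list = mb_detect_order() [, bool $strict = false ]] ) : string
Detects the encoding of the string $str.
Parameters
$str
The string to be checked.
$encoding_list
$encoding_list is a list of character encodings. The order can be given as an array or as a comma-separated string.
If $encoding_list is omitted, detect_order is used.
$strict
$strict specifies whether to detect the encoding strictly. The default is FALSE.
Return Values
The detected character encoding, or FALSE if the encoding of the given string cannot be detected.
Examples
mb_detect_encoding() example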
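The example code for this section did not survive on this page. As a stand-in, here is a minimal sketch of the three call forms described under Parameters. The sample string and the candidate lists are assumptions chosen for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
$str = "caf\xC3\xA9"; // "café" encoded as UTF-8 (sample input, chosen for illustration)

// No $encoding_list: the current detect_order is used (see mb_detect_order()).
var_dump(mb_detect_encoding($str));

// Candidate list given as a comma-separated string.
var_dump(mb_detect_encoding($str, 'ASCII, UTF-8, ISO-8859-1'));

// Candidate list given as an array, with strict checking enabled.
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], true));
```

In the last call ASCII cannot match because the string contains bytes above 0x7F, so only UTF-8 remains as a candidate.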
See Also
mb_detect_order()
Set/get the character encoding detection order
If you want to use mb_detect_encoding() to check whether a string is valid UTF-8, use strict mode; without it the result is pretty much worthless.
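A minimal sketch of such a strict check follows. The helper name looks_like_valid_utf8() and the sample strings are assumptions made for illustration; they are not part of the manual.

```php
<?php
// Hypothetical helper: strict detection against UTF-8 as the only candidate.
function looks_like_valid_utf8(string $s): bool
{
    return mb_detect_encoding($s, 'UTF-8', true) === 'UTF-8';
}

$valid   = "na\xC3\xAFve"; // "naïve" encoded as UTF-8
$invalid = "na\xEFve";     // "naïve" encoded as ISO-8859-1; 0xEF followed by "v" is malformed UTF-8

var_dump(looks_like_valid_utf8($valid));
var_dump(looks_like_valid_utf8($invalid));
```

If the only goal is validation, mb_check_encoding($s, 'UTF-8') expresses the same intent more directly.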
If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list: mb_detect_encoding($string, 'UTF-8, ISO-8859-1'); if you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
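The ordering advice can be tried with the sketch below. The sample strings are assumptions, and strict mode is used so that malformed UTF-8 is rejected outright. The result for the ISO-8859-1-first call has varied between PHP versions, so no expected output is given for it.

```php
<?php
$utf8   = "d\xC3\xA9j\xC3\xA0"; // "déjà" encoded as UTF-8
$latin1 = "d\xE9j\xE0";         // "déjà" encoded as ISO-8859-1

// ISO-8859-1 assigns a character to every byte value, so no byte string is
// ever rejected as ISO-8859-1. Listing it first leaves UTF-8 little chance.
var_dump(mb_detect_encoding($utf8, 'ISO-8859-1, UTF-8', true));

// With UTF-8 first, ISO-8859-1 acts as the fallback for input that is not
// well-formed UTF-8.
var_dump(mb_detect_encoding($utf8, 'UTF-8, ISO-8859-1', true));
var_dump(mb_detect_encoding($latin1, 'UTF-8, ISO-8859-1', true));
```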
Based on the snippet below that uses preg_match(), I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters, so that I could switch the character encoding from ISO-8859-1 to UTF-8. I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range and to stop as soon as it finds at least one multibyte sequence. This is quite a lot faster.
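The pattern from this note was not preserved on this page. The following is a hypothetical reconstruction of the idea, not the author's code: search for a single multibyte UTF-8 sequence and stop at the first hit. The function name and the (deliberately loose) byte ranges are assumptions.

```php
<?php
// Hypothetical helper: true if the string contains at least one byte sequence
// shaped like a multibyte UTF-8 character. It does NOT validate the whole string.
function contains_utf8_multibyte(string $s): bool
{
    // No /u modifier: the pattern works on raw bytes. preg_match() returns
    // after the first match, so the rest of the string is not scanned.
    $pattern = '/[\xC2-\xDF][\x80-\xBF]'      // 2-byte sequence
             . '|[\xE0-\xEF][\x80-\xBF]{2}'   // 3-byte sequence (loose check)
             . '|[\xF0-\xF4][\x80-\xBF]{3}/'; // 4-byte sequence (loose check)
    return preg_match($pattern, $s) === 1;
}

var_dump(contains_utf8_multibyte("caf\xC3\xA9")); // UTF-8 "café"
var_dump(contains_utf8_multibyte("caf\xE9"));     // ISO-8859-1 "café"
var_dump(contains_utf8_multibyte("cafe"));        // plain ASCII
```

Because this is a heuristic, ISO-8859-1 text that happens to contain a byte pair such as 0xC3 0xA9 would also match.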
Beware of a bug when detecting Russian encodings: http://bugs.php.net/bug.php?id=38138
I used Chris's function "detectUTF8" to detect whether a conversion from UTF-8 to ISO-8859-1 is needed, which works fine. I did have a problem with the subsequent iconv conversion, though: converting to ISO-8859-1 with //TRANSLIT replaces the euro sign with "EUR", although it is common practice to use \x80 as the euro sign in the 8859-1 charset (strictly speaking, \x80 is the euro sign in Windows-1252, which is often used in place of ISO-8859-1). I could not use 8859-15 since that mangled some other characters, so I added two str_replace() calls:
    if (detectUTF8($str)) {
        $str = str_replace("\xE2\x82\xAC", "&euro;", $str);
        $str = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
        $str = str_replace("&euro;", "\x80", $str);
    }
If HTML output is needed, the last str_replace() line is not necessary (and is even unwanted).
A simple way to detect whether a file is UTF-8, UTF-16 or UTF-32 by its BOM (this does not work for strings or files without a BOM):
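The code that accompanied this note is missing here. A minimal sketch of BOM sniffing, under the assumption that the first four bytes of the file are available, could look like this. The function name detect_bom_encoding() is hypothetical.

```php
<?php
// Hypothetical helper: map a leading byte-order mark to an encoding name,
// or return null when no known BOM is present.
function detect_bom_encoding(string $bytes): ?string
{
    // Longer signatures come first because the UTF-32LE BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE).
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $name) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }
    return null;
}

var_dump(detect_bom_encoding("\xEF\xBB\xBFhello"));
var_dump(detect_bom_encoding("hello"));
```

For a file, the first bytes could be read with file_get_contents($path, false, null, 0, 4), where $path is a placeholder. Note that a UTF-16LE file whose first character is U+0000 is indistinguishable from UTF-32LE by this test.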
In my environment (PHP 7.1.12), mb_detect_encoding() doesn't work when mb_detect_order() is not set appropriately. To make mb_detect_encoding() work in that case, simply call mb_detect_order('...') before mb_detect_encoding() in your script. Neither ini_set('mbstring.language', '...'); nor ini_set('mbstring.detect_order', '...'); works in a script for this purpose, whereas setting them in php.ini may work.
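As a sketch of that workaround: set the order at runtime, then call mb_detect_encoding() without an explicit list. The order chosen here and the sample string are assumptions; the note elides the actual value it used.

```php
<?php
// Set the detection order for the current script only.
mb_detect_order(['ASCII', 'UTF-8', 'ISO-8859-1']);

// Called without arguments, mb_detect_order() returns the current order.
var_dump(mb_detect_order());

// With no $encoding_list, mb_detect_encoding() falls back to that order.
$enc = mb_detect_encoding("gr\xC3\xBC\xC3\x9F"); // "grüß" encoded as UTF-8
var_dump($enc);
```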
A function to detect UTF-8; it may be useful when mb_detect_encoding() is not available.
Just a note: instead of using the often-recommended (rather complex) regular expression by the W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity.
Example usage of mb_detect_encoding():
    $txtstr = str_to_utf8($txtstr);
Much simpler UTF-8-ness checker using a regular expression created by the W3C:
To detect UTF-8, you can use:
    if (preg_match('!!u', $str)) { echo 'utf-8'; }
- Norihiori
a) If the function mb_detect_encoding is not available:
### mb_detect_encoding ... iconv ###
b) If the function mb_convert_encoding is not available:
### mb_convert_encoding ... iconv ###
Beware: even if you need to distinguish between UTF-8 and ISO-8859-1 and you use the following detection order (as chrigu suggests), mb_detect_encoding('accentuée', 'UTF-8, ISO-8859-1') returns ISO-8859-1, while mb_detect_encoding('accentué', 'UTF-8, ISO-8859-1') returns UTF-8. Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding().
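The 'u' modifier trick can be wrapped in a small function, sketched below. The name is_valid_utf8() is an assumption. With /u, PCRE checks that the subject is well-formed UTF-8 before matching, and preg_match() returns false rather than 0 when that check fails.

```php
<?php
// Hypothetical helper built on the /u modifier.
function is_valid_utf8(string $s): bool
{
    // The empty pattern matches any valid subject, so the result is 1 for
    // well-formed UTF-8 and false for malformed input.
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // UTF-8 "café"
var_dump(is_valid_utf8("caf\xE9"));     // ISO-8859-1 "café"
```

Comparing against 1 with === keeps the "malformed input" case (false) distinct from a simple non-match (0).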
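// -----------------------------------------------------------
if (!function_exists('mb_detect_encoding')) {
    // Fallback used only when the mbstring extension is missing.
    // $ret !== false: return true if any candidate encoding matches.
    // $ret === false: return the name of the first matching candidate.
    // If nothing matches, $enc is returned unchanged.
    function mb_detect_encoding($string, $enc = null, $ret = true) {
        static $list = array('utf-8', 'iso-8859-1', 'iso-8859-15', 'windows-1251');
        foreach ($list as $item) {
            // A string that comes back unchanged from a same-encoding
            // iconv() call is treated as valid in that encoding.
            $sample = @iconv($item, $item, $string);
            if ($sample !== false && $sample === $string) {
                // Stop at the first match so later candidates cannot override it.
                return ($ret !== false) ? true : $item;
            }
        }
        return $enc;
    }
}
// -----------------------------------------------------------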
seems to work
Sometimes mb_detect_encoding() is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding() reports some ISO-8859-1 encoded text as UTF-8. To verify UTF-8 use the following:
    //
    // utf8 encoding validation developed based on Wikipedia entry at:
    // http://en.wikipedia.org/wiki/UTF-8
    //
    // Implemented as a recursive descent parser based on a simple state machine
    // copyright 2005 Maarten Meijer
    //
    // This cries out for a C-implementation to be included in PHP core
    //
    function valid_1byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0x80) == 0x00;
    }
    function valid_2byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xE0) == 0xC0;
    }
    function valid_3byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF0) == 0xE0;
    }
    function valid_4byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF8) == 0xF0;
    }
    function valid_nextbyte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xC0) == 0x80;
    }
    function valid_utf8($string) {
        $len = strlen($string);
        $i = 0;
        while( $i
About mb_detect_encoding(): the manual page at http://php.net/manual/zh/function.mb-detect-encoding.php shows this example:
    mb_detect_encoding('áéóú', 'UTF-8', true); // false
but the result I get now is not false. Can you give me the reason? Thanks!
I seriously underestimated the importance of setlocale...
Produces the following:
    Array
    (
        [0] => mais coisas a pensar sobre diario ou dois!
        [1] => plus de choses a penser a journalier ou a deux !
        [2] => ?mas cosas a pensar en diario o dos!
        [3] => piu cose da pensare circa giornaliere o due!
        [4] => flere ting aa tenke paa hver dag eller to!
        [5] => Dalsi veci, premyslet o kazdy den nebo dva!
        [6] => mehr ueber Spass spaet schoenen
        [7] => me vone gjate fun bukur
        [8] => toebb mint szorakozas keso csodalatos kenyer
    )
whereas
produces:
    Array
    (
        [0] => mais coisas a pensar sobre di?rio ou dois!
        [1] => plus de choses ? penser ? journalier ou ? deux !
        [2] => ?m?s cosas a pensar en diario o dos!
        [3] => pi? cose da pensare circa giornaliere o due!
        [4] => flere ting ? tenke p? hver dag eller to!
        [5] => Dal?? v?c?, p?em??let o ka?d? den nebo dva!
        [6] => mehr ?ber Spass sp?t sch?nen
        [7] => m? von? gjat? fun bukur
        [8] => t?bb mint sz?rakoz?s k?s? csod?latos keny?r
    )
This might be of interest when trying to convert UTF-8 strings into ASCII suitable for URLs and such. This was never obvious to me, since I had only used the us and nl locales.
The last example for verifying UTF-8 has one little bug. If a 10xxxxxx byte occurs alone, i.e. not as part of a multibyte character, it is accepted although that is against the UTF-8 rules. Make the following replacement to repair it. Replace
    } // goto next char
with
    } else {
        return false; // 10xxxxxx occurring alone
    } // goto next char
From PHPDIG:
    function isUTF8($str) {
        if ($str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32")) {
            return true;
        } else {
            return false;
        }
    }
Another light way to detect character encoding: