Using mb_convert_encoding() to convert UTF8 to UTF8
- Christopher
- Site Administrator
- Posts: 13592
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Using mb_convert_encoding() to convert UTF8 to UTF8
Should you use mb_convert_encoding() like you show for any character encoding you are using -- or only UTF8?
(#10850)
Re: Security Resources
It's the 21st century! One shouldn't be using any encoding *besides* Unicode (UTF-8 or UTF-16). It amuses me to no end how poorly supported Unicode is in PHP, and if I recall correctly, they pushed full-blown UTF-8 support to PHP 6.
Here's a good primer:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Btw, maybe you should split this discussion to a different topic so that more people would chime in.
- Christopher
Re: Using mb_convert_encoding() to convert UTF8 to UTF8
My question was whether this technique works when converting any encoding to itself, or whether you recommend always converting to UTF8. It seems like the point of this is to deal with multi-byte exploits.
(#10850)
Re: Using mb_convert_encoding() to convert UTF8 to UTF8
If you just want to check the encoding you can use, like, mb_detect_encoding as
Code: Select all
mb_detect_encoding($string, "UTF-8", true)
Or there's the regex method which is meh but works.
I'm trying to think why you might want to actually convert one encoding back into itself. It would never do anything. Maybe you're having it detect the encoding and are worried that if the original encoding is UTF-8 it might do something weird. But then that means you don't know what the original encoding is?
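A minimal sketch of the two checks mentioned in this post. The helper names are hypothetical, mb_check_encoding() is swapped in for mb_detect_encoding() as the validity test, and the thread doesn't say which regex was meant, so the empty pattern with the /u modifier is only one common way of doing it:

```php
<?php
// Hypothetical helper names; neither function appears in the thread.

// mb_check_encoding() returns true only when the whole string is
// well-formed in the given encoding.
function isValidUtf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

// Regex variant: the /u modifier makes PCRE validate the subject as UTF-8.
// On malformed input preg_match() returns false instead of 0 or 1.
function isValidUtf8Regex(string $s): bool
{
    return preg_match('//u', $s) === 1;
}
```

Both only report whether the input is valid; neither changes the string.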
Re: Using mb_convert_encoding() to convert UTF8 to UTF8
mb_detect_encoding, assuming it works (which I don't entirely trust), will only tell you if the input is valid. The conversion will force it to be valid. You can have the optional check if you want to (can't see how it would be useful, but hey - it's your script, do whatever), but the forced conversion is necessary to avoid malformed UTF-8.
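To make the difference between checking and forcing concrete, here's a rough sketch. The function name forceValidUtf8() is made up for illustration, and saving and restoring the substitute character is an extra that the original snippet doesn't do (mb_substitute_character() changes a global setting):

```php
<?php
// Hypothetical helper; the name is not from the thread.
function forceValidUtf8(string $s): string
{
    // Remember the current setting so the global state can be restored.
    $previous = mb_substitute_character();
    // "none" drops invalid sequences instead of replacing them with "?".
    mb_substitute_character('none');
    try {
        // UTF-8 to UTF-8: well-formed sequences pass through untouched,
        // malformed ones are removed.
        return mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    } finally {
        mb_substitute_character($previous);
    }
}

$dirty = "caf\xC3\xA9 \xFF ok";   // contains a byte (0xFF) that is never valid UTF-8
$clean = forceValidUtf8($dirty);

var_dump(mb_check_encoding($dirty, 'UTF-8')); // bool(false): the check only reports
var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true): the conversion repairs
```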
Quoting a bit more context from the original thread this was split from:
-----
Christopher wrote: For example, do you recommend using mb_convert_encoding() to convert everything to UTF8?
Mordred wrote: Oh yes, for <5.4.0 surely. Did you know about this?
You should use a wrapper function anyway, who wants to type so much code every time? Inside, something like:
Code: Select all
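function HtmlEscape($s)
{
    // Drop invalid byte sequences instead of substituting "?" for them.
    mb_substitute_character("none");
    // UTF-8 to UTF-8: strips anything that is not well-formed UTF-8.
    $s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}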
-----
Christopher wrote: Can you explain what is going on in these two lines?
Code: Select all
mb_substitute_character("none");
$s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
Mordred wrote:"Convert the string from utf-8 to utf-8 making sure you remove any character sequences that are not valid for utf-8"
I must add that this has to be accompanied by strictly enforcing UTF-8 as the encoding the client uses; otherwise legitimate clients may send you their weird Elbonian encoding and get their data mangled. This is not related to security, just to the proper functioning of the site. An attacker will not send you well-formed UTF-8 because he's a nice guy; that's why you don't trust him to, and that's why you force-clean his input.
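One way to enforce UTF-8 on the client side might look like the sketch below. Both helper names are hypothetical, and the post doesn't spell out how the enforcement should be done, so declaring the charset in the Content-Type header and on the form is an assumption about what is meant:

```php
<?php
// Hypothetical helpers; the post does not say how to enforce the encoding.

// Returns the header line instead of sending it, so it can be inspected.
// In a real page: header(utf8ContentTypeHeader()); before any output.
function utf8ContentTypeHeader(): string
{
    return 'Content-Type: text/html; charset=UTF-8';
}

// Builds a form tag that asks the browser to submit the data as UTF-8.
function utf8FormOpenTag(string $action): string
{
    $safeAction = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');
    return '<form method="post" action="' . $safeAction . '" accept-charset="UTF-8">';
}
```

Even with both in place, the server still can't assume the input is well-formed, so the forced conversion stays.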