Using mb_convert_encoding() to convert UTF8 to UTF8

Christopher
Site Administrator
Posts: 13592
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Using mb_convert_encoding() to convert UTF8 to UTF8

Post by Christopher »

Should you use mb_convert_encoding() like you show for any character encoding you are using -- or only UTF8?
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Re: Security Resources

Post by Mordred »

It's the 21st century! One shouldn't be using any encoding *besides* Unicode (UTF-8 or UTF-16). It amuses me to no end how poorly supported Unicode is in PHP, and if I recall correctly, they pushed full-blown UTF-8 support to PHP6.

Here's a good primer:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Btw, maybe you should split this discussion to a different topic so that more people would chime in.
Christopher
Site Administrator
Posts: 13592
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Using mb_convert_encoding() to convert UTF8 to UTF8

Post by Christopher »

My question was whether this technique works when converting any encoding to itself, or whether you recommend always converting to UTF-8. It seems like the point of this is to deal with multi-byte exploits.
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: Using mb_convert_encoding() to convert UTF8 to UTF8

Post by requinix »

If you just want to check the encoding, you can use mb_detect_encoding in strict mode:

Code: Select all

mb_detect_encoding($string, "UTF-8", true)
Or there's the regex method which is meh but works.

I'm trying to think why you might want to actually convert one encoding back into itself. It would never do anything. Maybe you're having it detect the encoding and are worried if the original encoding is UTF-8 then it might do something weird. But then that means you don't know what the original encoding is?
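As a rough sketch of the check described above: with the strict flag set and a single candidate encoding, mb_detect_encoding() returns either the encoding name or false. The helper name isValidUtf8() and the sample strings are made up for illustration, and mb_check_encoding() is shown only as another mbstring call that answers the same yes/no question.

```php
<?php
// Hypothetical helper: true if $string is well-formed UTF-8.
function isValidUtf8($string)
{
    // Strict mode with one candidate: the result is "UTF-8" or false.
    return mb_detect_encoding($string, 'UTF-8', true) === 'UTF-8';
}

$good = "caf\xC3\xA9";        // "café" encoded as UTF-8
$bad  = "caf\xE9 au lait";    // ISO-8859-1 bytes; \xE9 followed by a space is not valid UTF-8

var_dump(isValidUtf8($good));                 // bool(true)
var_dump(isValidUtf8($bad));                  // bool(false)
var_dump(mb_check_encoding($bad, 'UTF-8'));   // bool(false)
```

Note that this only reports whether the input is valid. It does not change the string.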
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Re: Using mb_convert_encoding() to convert UTF8 to UTF8

Post by Mordred »

mb_detect_encoding, assuming it works (which I don't entirely trust), will only tell you if the input is valid. The conversion will force it to be valid. You can have the optional check if you want to (can't see how it would be useful, but hey - it's your script, do whatever), but the forced conversion is necessary to avoid malformed UTF-8.
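To make the difference between checking and forcing concrete, here is a minimal sketch. The helper name forceUtf8() is hypothetical, and restoring the previous substitute character afterwards is an extra precaution assumed here, not something the thread prescribes.

```php
<?php
// Hypothetical helper: return $s with every invalid UTF-8 sequence dropped.
function forceUtf8($s)
{
    // Remember the current substitute setting so it can be restored afterwards.
    $previous = mb_substitute_character();
    // "none" means invalid sequences are removed rather than replaced with "?".
    mb_substitute_character('none');
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);
    return $clean;
}

$dirty = "abc\xFFdef";                          // \xFF can never appear in UTF-8

var_dump(mb_check_encoding($dirty, 'UTF-8'));   // bool(false): the check only reports
$clean = forceUtf8($dirty);
var_dump($clean);                               // string(6) "abcdef": the conversion repairs
var_dump(mb_check_encoding($clean, 'UTF-8'));   // bool(true)
```

The check tells you the input was bad. The conversion guarantees that whatever comes out is well-formed, which is what the later escaping step relies on.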

Quoting a bit more context from the original thread this was split from:
Mordred wrote:
Christopher wrote: For example, do you recommend using mb_convert_encoding() to convert everything to UTF8?
Oh yes, for <5.4.0 surely. Did you know about this?

You should use a wrapper function anyway, who wants to type so much code every time? Inside, something like:

Code: Select all

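function HtmlEscape($s) {
    // Drop invalid byte sequences instead of substituting a placeholder character.
    mb_substitute_character("none");
    // Converting UTF-8 to UTF-8 strips anything that is not well-formed UTF-8.
    $s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    // Escape for HTML output; the charset name must be spelled 'UTF-8'.
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}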
-----
Christopher wrote: Can you explain what is going on in these two lines?

Code: Select all

mb_substitute_character("none");
$s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
-----
Mordred wrote:"Convert the string from utf-8 to utf-8 making sure you remove any character sequences that are not valid for utf-8"
I must add that this must be accompanied by strict enforcement of utf-8 encoding to the client to avoid legitimate clients sending you their weird Elbonian encoding and getting their data mangled. This is not related to security, just to the proper functioning of the site. An attacker will not send you well-formed utf-8 because he's a nice guy, that's why you don't trust him to, and that's why you force clean his input.
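One way to enforce UTF-8 toward the client, sketched under assumptions: declare the charset in the Content-Type header and set accept-charset on forms. The helper name utf8FormOpen() and the path /comment.php are hypothetical, and this is only an illustration of the idea, not a complete setup.

```php
<?php
// Declare the response encoding so browsers render the page as UTF-8.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Hypothetical helper: opening tag of a form that asks the browser to submit UTF-8.
function utf8FormOpen($action)
{
    // Escape the action URL before placing it inside an attribute.
    $action = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');
    return '<form method="post" action="' . $action . '" accept-charset="UTF-8">';
}

echo utf8FormOpen('/comment.php'), "\n";
```

This only helps well-behaved clients send data the site can handle. Incoming data still has to be force-cleaned, since an attacker is free to ignore both hints.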