Monday, January 04, 2010
UTF-8 and Oracle Access Manager
OAM supports UTF-8 in incoming data, and can generate HTML pages encoded with UTF-8, but what about internally? Is UTF-8 data available in plugins? In HTTP header variables? We tested 10.1.4.3 on Windows and were surprised that our UTF-8 data was being interpretted incorrectly in our managed plugins (though exec ppp plugins worked as expected).
The character Û ( U with a circumflex) has a code point value of 219 (all numbers are decimal). In UTF-8 this is encoded as the bytes 195 & 155. However, when this text reaches our plugin it appears as Û (A with tilde & single right-pointing angle quotation mark). In .NET Strings are in unicode, so we know something is happening with the identity server to re-interpret the bytes 195 & 155 as some other encoding and then to provide us that String as unicode. That encoding turns out to be Windows-1252 - the default code page on our Windows system. 195 is Ã, while 155 is ›. Luckily there is a simple workaround – we get the Windows-1252 byte value of the string and then interpet those bytes at UTF-8.
Encoding encoding_1252 = Encoding.GetEncoding("Windows-1252");
string utf8String = Encoding.UTF8.GetString(encoding_1252.GetBytes(windows1252String))
Using Reflector I can see a few calls to StringToHGlobalAnsi in the managed library, and I would guess a similar call like PtrToStringAnsi is used for converting between unmanaged and managed memory, and this may be a cause of the issue.
This issue also exists in the Access Server. If you want to send a UTF-8 attribute value in a header, OAM is smart enough to base 64 encode it (according to RFC 2047 ). So our value should be encoded useing this format "=?UTF-8?B?" base64-encoded-text "?=". Unfortunately, the text to be encoded is incorrect – the access server is B64 encoding the Windows-1252 interpretation of the UTF-8 bytes. You'll need to B64 decode the header text and then use the re-encoding code shown earlier to get the real value.
One thing to note is that if your default code page is something other then Windows-1252, you'll proably have to interpret the string using that code page.
The character Û ( U with a circumflex) has a code point value of 219 (all numbers are decimal). In UTF-8 this is encoded as the bytes 195 & 155. However, when this text reaches our plugin it appears as Û (A with tilde & single right-pointing angle quotation mark). In .NET Strings are in unicode, so we know something is happening with the identity server to re-interpret the bytes 195 & 155 as some other encoding and then to provide us that String as unicode. That encoding turns out to be Windows-1252 - the default code page on our Windows system. 195 is Ã, while 155 is ›. Luckily there is a simple workaround – we get the Windows-1252 byte value of the string and then interpet those bytes at UTF-8.
Encoding encoding_1252 = Encoding.GetEncoding("Windows-
string utf8String = Encoding.UTF8.GetString(
Using Reflector I can see a few calls to StringToHGlobalAnsi in the managed library, and I would guess a similar call like PtrToStringAnsi is used for converting between unmanaged and managed memory, and this may be a cause of the issue.
This issue also exists in the Access Server. If you want to send a UTF-8 attribute value in a header, OAM is smart enough to base 64 encode it (according to RFC 2047 ). So our value should be encoded useing this format "=?UTF-8?B?" base64-encoded-text "?=". Unfortunately, the text to be encoded is incorrect – the access server is B64 encoding the Windows-1252 interpretation of the UTF-8 bytes. You'll need to B64 decode the header text and then use the re-encoding code shown earlier to get the real value.
One thing to note is that if your default code page is something other then Windows-1252, you'll proably have to interpret the string using that code page.