Merni wrote: If it helps, at least the decimal 133 (ellipsis … ) and 145-148 (‘ ’ “ ” smart quotes) appear to be from Windows-1252. Weirdly, there are also 39 (' straight apostrophe) and &quot; (" straight quotes).
This looks to be an issue with the contents of that particular resolution as submitted, not with the API in general. The API is correctly reporting the data on the server; that data just happens to be wrong, which is nominally the fault of the player who submitted the resolution rather than of NationStates (although really, it's more that neither of them properly sanitized text made by shoddy Microsoft products).
Imperium Anglorum wrote:Sadly, I can't find anything in Java which will decode HTML entities to Windows-1252 instead of Unicode.
HTML entities are supposed to be in Unicode, and usually are (even on NationStates). However, because using Windows-1252 characters as though they were Unicode characters is a common mistake, and because the actual Unicode codepoints that they would replace are used by basically no-one ever, most modern browsers recognize them anyway.
There's really only a few Windows-1252 characters in common use you need to worry about:
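&#133; / &#x85; -> &#8230; … (...)
&#145; / &#x91; -> &#8216; ‘ (')
&#146; / &#x92; -> &#8217; ’ (')
&#147; / &#x93; -> &#8220; “ (")
&#148; / &#x94; -> &#8221; ” (")
&#150; / &#x96; -> &#8211; – (-)
&#151; / &#x97; -> &#8212; — (-)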
So you can just implement a search-and-replace for these codepoints (it's best to perform this AFTER the parsing of XML/HTML entities into raw Unicode data). There are more than these, but they aren't seen as often and you can find them on Wikipedia if you need to.
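For what it's worth, that search-and-replace could be sketched in Java roughly like this. The class name Cp1252Fixer and its fix method are made up for illustration, and the sketch assumes the text has already had its XML/HTML entities decoded into a Java String, so the stray characters show up as codepoints 0x85 through 0x97. It only handles the seven characters listed above.

```java
// Hypothetical helper, not part of any library: replaces the handful of
// Windows-1252 byte values that were mistakenly used as Unicode codepoints
// with the characters the author actually meant.
final class Cp1252Fixer {

    // Map one mis-encoded codepoint to its intended Unicode character.
    // Anything not in the list is returned unchanged.
    static char fixChar(char c) {
        switch (c) {
            case 0x85: return '\u2026'; // horizontal ellipsis (U+2026)
            case 0x91: return '\u2018'; // left single quote (U+2018)
            case 0x92: return '\u2019'; // right single quote (U+2019)
            case 0x93: return '\u201C'; // left double quote (U+201C)
            case 0x94: return '\u201D'; // right double quote (U+201D)
            case 0x96: return '\u2013'; // en dash (U+2013)
            case 0x97: return '\u2014'; // em dash (U+2014)
            default:   return c;
        }
    }

    // Run this AFTER entity decoding, on the already-decoded string.
    static String fix(String decoded) {
        StringBuilder out = new StringBuilder(decoded.length());
        for (int i = 0; i < decoded.length(); i++) {
            out.append(fixChar(decoded.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input contains the wrong codepoints 0x93, 0x94, 0x96, 0x92 and 0x85.
        System.out.println(fix("\u0093Hello\u0094 \u0096 it\u0092s fine\u0085"));
    }
}
```

If you would rather have plain ASCII, the same switch can return the fallbacks in parentheses from the list instead (the ellipsis then needs a String rather than a char, since it becomes three dots).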
The basic reason behind this confusion is that the Windows-1252 encoding is a Microsoft-made extension of the ISO-8859-1 (AKA Latin-1) official international standard, which has the exact same meanings for codepoints 0xA0-0xFF, while only differing in the meanings of codepoints 0x80-0x9F, the ISO-8859-1 readings of which are obsolete junk which nobody ever used anyway. This means a lot of people mistakenly treat Windows-1252 basically as if it was actually the official ISO-8859-1 (blame Microsoft). The thing is, the Unicode standard is defined as sharing its codepoints up to 0xFF with ISO-8859-1, meaning that ISO-8859-1 data can be trivially converted to proper Unicode. But that's ISO-8859-1, not Windows-1252, and Unicode has its own different codepoints for those characters that Windows-1252 assigns to the 0x80-0x9F range.
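If you want to see that difference for yourself, here is a small Java sketch that decodes the same single byte under both encodings and reports the resulting Unicode codepoint. The class name CharsetDemo and the decodeByte helper are invented for this example. It also assumes your JVM ships the "windows-1252" charset, which common JDKs do, although the Java spec only guarantees a short list of charsets that does not include it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows where ISO-8859-1 and Windows-1252 agree
// and where they differ.
final class CharsetDemo {

    // Decode a single byte value with the given charset and return the
    // Unicode codepoint it turns into.
    static int decodeByte(int b, Charset cs) {
        return new String(new byte[] { (byte) b }, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this charset name is available on the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");

        // 0x85 and 0x93 sit in the disputed 0x80-0x9F range;
        // 0xE9 sits in the shared 0xA0-0xFF range.
        for (int b : new int[] { 0x85, 0x93, 0xE9 }) {
            System.out.printf(
                "byte 0x%02X: ISO-8859-1 -> U+%04X, windows-1252 -> U+%04X%n",
                b,
                decodeByte(b, StandardCharsets.ISO_8859_1),
                decodeByte(b, cp1252));
        }
    }
}
```

The ISO-8859-1 column always comes out equal to the byte value itself, which is the "trivial conversion" mentioned above. The Windows-1252 column only matches it outside the 0x80-0x9F range.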
I understand if that previous paragraph goes over your head. Most people don't know or care about such details, which is why browser makers eventually threw up their hands and just started recognizing the "wrong" formatting.