encoding : Java Glossary

2008-02-23 by Roedy Green ©1996-2008 Canadian Mind Products
encoding
Normally Readers translate from various 8-bit byte streams to standard 16-bit Unicode as they read. You can specify the sort of translation to use when you create the Reader. Similarly, Writers normally translate from internal 16-bit Unicode into various 8-bit byte streams.
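As a minimal sketch of naming the encoding when you create a Reader or Writer: the class name ExplicitEncodingIo and the temporary file are illustrative assumptions, not part of the original page.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical demo class: writes a line in a named encoding, then reads it back.
final class ExplicitEncodingIo {
    static String writeThenRead(String text, String encoding) {
        try {
            File f = File.createTempFile("encoding-demo", ".txt");
            f.deleteOnExit();
            // The Writer encodes 16-bit chars into bytes using the named encoding.
            try (Writer w = new OutputStreamWriter(new FileOutputStream(f), encoding)) {
                w.write(text);
            }
            // The Reader decodes the bytes back into chars using the same encoding.
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(new FileInputStream(f), encoding))) {
                return r.readLine();
            }
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException for an unknown encoding name.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeThenRead("na\u00efve caf\u00e9", "UTF-8"));
    }
}
```

If you leave out the encoding argument, both classes fall back to the platform default encoding, which differs from machine to machine.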

However encodings are more versatile than that. They also let you read and write big or little endian 16-bit Unicode character streams. In theory encodings could support complex encoding structures, translation, compression, or quoting. One letter may become many or vice versa. Letters may be suppressed.

Encodings are usually a one-way trap door. When you translate to 8-bit you lose information. When you translate it back to Unicode some characters will not come back the same way they were originally. Some may even be missing.

Encodings are not used in AWT or Swing. You use pure 16-bit Unicode chars and Strings. How the text displays depends on how clever the Font is at displaying Unicode. Normally it will display only some small subset of the characters properly. See FontShower to learn a bit about what your Font supports.

Possible Encodings
Encodings Supported in your Browser
Official Encoding Name Given Alias
Table Of Possible Encodings
Why So Many Encodings?
ISO
Roll Your own
Converting
native2ascii
Reversibility
Tracking
Encoding Identification
Choosing An Encoding
HEX
Java Source Code Encoding
Default File and Console Encoding
Learning More
Links

Possible Supported Encodings

The complete set of encodings supported anywhere/everywhere is not documented. However, starting with JDK 1.4 there is a way to find out just which encodings are supported in your particular JVM, using java.nio.charset.Charset.availableCharsets().
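A minimal sketch of that call, listing every charset the running JVM supports along with its aliases. The class name ListCharsets is an illustrative assumption.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical demo class: dumps the charsets this JVM knows about.
final class ListCharsets {
    /** Map from canonical charset name to Charset, sorted by name. */
    static SortedMap<String, Charset> supported() {
        return Charset.availableCharsets();
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Charset> e : supported().entrySet()) {
            System.out.println(e.getKey() + "  aliases=" + e.getValue().aliases());
        }
    }
}
```

The output varies with the JVM vendor, version and platform, so run it on the JVM you actually intend to deploy on.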

There are five sources of information:

  1. Sun’s JDK Technote Guide on nio encodings. It lists them, but does not tell you much about them. Note that java.nio uses different canonical names from java.io and java.lang.
  2. A place to look for supported character sets :
  3. Another place to look for supported character sets :
  4. The following Applet that lists the encodings supported on your particular browser/Java.
  5. The following table that lists the encodings supported on some Java, somewhere. I manually collect this lore.

Supported Encodings in this Browser

List of encodings supported in this browser and this Java. Source available.

The key to this Applet is java.nio.charset.Charset.availableCharsets().

You can find out the default encoding with:
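The original snippet did not survive here. As a hedged stand-in, this sketch shows two ways to ask a JVM for its default encoding: Charset.defaultCharset() (Java 1.5 or later) and the file.encoding system property mentioned above. The class name DefaultEncoding is an assumption.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: reports the platform default encoding.
final class DefaultEncoding {
    /** Canonical java.nio name of the default charset. */
    static String defaultName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Charset.defaultCharset(): " + defaultName());
        // Reading this property may be refused inside an unsigned Applet sandbox.
        System.out.println("file.encoding property:   " + System.getProperty("file.encoding"));
    }
}
```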

Finding Official Encoding Name Given an Alias
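Outside a browser, the same lookup can be sketched with Charset.forName, which accepts an alias and returns the Charset whose name() is the canonical java.nio name. The class and method names below are illustrative assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical demo class: maps an alias to the official java.nio name.
final class OfficialName {
    /** Returns the canonical name, or null if the alias is illegal or unsupported. */
    static String officialName(String alias) {
        try {
            return Charset.forName(alias).name();
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] aliases = { "8859_1", "latin1", "UTF8", "Cp1252" };
        for (String alias : aliases) {
            System.out.println(alias + " -> " + officialName(alias));
        }
    }
}
```

Charset.forName( alias ).aliases() gives the reverse view: every alternative name the JVM accepts for that charset.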




Table Of Possible Supported Encodings

Here are some encodings typically supported. You often see names with dash, underscore, and space variations, e.g. ISO 8859-1, ISO8859_1 and ISO-8859-1. The encodings you will encounter most often are: ISO-8859-1 (Latin-1), UTF-8 and windows-1250. These are the latest fashion in naming.
Java Encodings
Encoding name Supported? Official Name Description
8859_1 ISO-8859-1 Latin-1 ASCII (the USA default). This just takes the low order 8 bits and tacks on a high order 0 byte. Same as ISO-8859-1. Microsoft’s variant of Latin-1 is called Cp1252.
ASCII US-ASCII 7 bit ASCII, plus forms like \uxxxx for the exotic characters.
base64 base64 source code is available.
base64u base64u source code is available. A variant of Base64 also URL-encoded.
base85
Big5 Big5 Big5, Traditional Chinese
Big5-HKSCS Big5-HKSCS Big5 with Hong Kong extensions, Traditional Chinese
Big5-Solaris Not supported in Windows. Big5 with seven additional Hanzi ideograph character mappings for the Solaris zh_TW.BIG5 locale
Cp037 IBM037 USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia, EBCDIC, aka Cp1140
Cp038 International EBCDIC, aka IBM038
Cp273 IBM273 IBM Austria, Germany, aka Cp1141
Cp277 IBM277 IBM Denmark, Norway, EBCDIC, aka Cp1142
Cp278 IBM278 IBM Finland, Sweden, EBCDIC, aka Cp1143
Cp280 IBM280 IBM Italy, EBCDIC, aka Cp1144
Cp284 IBM284 IBM Catalan/Spain, Spanish Latin America, EBCDIC, aka Cp1145
Cp285 IBM285 IBM United Kingdom, Ireland, EBCDIC, aka Cp1146
Cp297 IBM297 IBM France, EBCDIC, aka Cp1147
Cp420 IBM420 IBM Arabic, EBCDIC, aka IBM420
Cp424 IBM424 IBM Hebrew, EBCDIC
Cp437 IBM437 Original IBM PC OEM DOS character set (with line drawing characters and some Greek and math), MS-DOS United States, Australia, New Zealand, South Africa. The rest of the world uses Cp850 for the DOS box.
Cp500 IBM500 IBM Belgium and Switzerland, EBCDIC, 500V1, aka Cp1148
Cp737 x-IBM737 PC Greek
Cp775 IBM775 PC Baltic
Cp838 IBM-Thai IBM Thailand extended SBCS, aka IBM838
Cp850 IBM850 Microsoft DOS Multilingual Latin-1 (with line drawing characters). For true Latin-1 see 8859-1. See Cp437.
Cp852 IBM852 Microsoft DOS Multilingual Latin-2 Slavic
Cp855 IBM855 IBM Cyrillic
Cp857 IBM857 IBM Turkish
Cp858 IBM00858 variant of Cp850 with the Euro. Microsoft DOS Multilingual Latin-1 (with line drawing characters). For true Latin-1 see 8859-1.
Cp860 IBM860 MS-DOS Portuguese
Cp861 IBM861 MS-DOS Icelandic
Cp862 IBM862 PC Hebrew
Cp863 IBM863 MS-DOS Canadian French
Cp864 IBM864 PC Arabic
Cp865 IBM865 MS-DOS Nordic
Cp866 IBM866 MS-DOS Russian
Cp868 IBM868 MS-DOS Pakistan
Cp869 IBM869 IBM Modern Greek
Cp870 IBM870 IBM Multilingual Latin-2, EBCDIC
Cp871 IBM871 IBM Iceland, EBCDIC, aka Cp1149
Cp874 x-IBM874 IBM Thai
Cp875 x-IBM875 IBM Greek
Cp918 IBM918 IBM Pakistan(Urdu), EBCDIC
Cp921 x-IBM921 IBM Latvia, Lithuania (AIX, DOS).
Cp922 x-IBM922 IBM Estonia (AIX, DOS).
Cp930 x-IBM930 Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
Cp933 x-IBM933 Korean Mixed with 1880 UDC, superset of 5029
Cp935 x-IBM935 Simplified Chinese Host mixed with 1880 UDC, superset of 5031
Cp937 x-IBM937 Traditional Chinese Host mixed with 6204 UDC, superset of 5033
Cp939 x-IBM939 Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
Cp942 x-IBM942 Japanese (OS/2) superset of 932
Cp942C x-IBM942C variant of Cp942. Japanese (OS/2) superset of Cp932
Cp943 x-IBM943 Japanese (OS/2) superset of Cp932 and Shift-JIS.
Cp943C x-IBM943C Variant of Cp943. Japanese (OS/2) superset of Cp932 and Shift-JIS.
Cp948 x-IBM948 OS/2 Chinese (Taiwan) superset of 938
Cp949 x-IBM949 PC Korean
Cp949C x-IBM949C variant of Cp949, PC Korean
Cp950 x-IBM950 PC Chinese (Hong Kong, Taiwan)
Cp964 x-IBM964 AIX Chinese (Taiwan)
Cp970 x-IBM970 AIX Korean
Cp1006 x-IBM1006 IBM AIX Pakistan (Urdu).
Cp1025 x-IBM1025 IBM Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovina, Macedonia (FYR).
Cp1026 IBM1026 IBM Latin-5, Turkey
Cp1046 x-IBM1046 IBM Open Edition US EBCDIC
Cp1047 IBM1047 IBM System 390 EBCDIC, Java 1.2+ only.
Cp1048 IBM EBCDIC. aka IBM1048.
Cp1097 x-IBM1097 IBM Iran(Farsi)/Persian
Cp1098 x-IBM1098 IBM Iran(Farsi)/Persian (PC)
Cp1112 x-IBM1112 IBM Latvia, Lithuania
Cp1122 x-IBM1122 IBM Estonia
Cp1123 x-IBM1123 IBM Ukraine
Cp1124 x-IBM1124 IBM AIX Ukraine
Cp1140 IBM01140 USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia, aka Cp037.
Cp1141 IBM01141 IBM Austria, Germany, aka Cp273.
Cp1142 IBM01142 IBM Denmark, Norway, aka Cp277.
Cp1143 IBM01143 IBM Finland, Sweden, aka Cp278.
Cp1144 IBM01144 IBM Italy, aka Cp280.
Cp1145 IBM01145 IBM Catalan/Spain, Spanish Latin America, aka Cp284.
Cp1146 IBM01146 IBM United Kingdom, Ireland, aka Cp285.
Cp1147 IBM01147 IBM France, aka Cp297.
Cp1148 IBM01148 EBCDIC 500V1.
Cp1149 IBM01149 IBM Iceland.
Cp1250 windows-1250 Windows Eastern European
Cp1251 windows-1251 Windows Cyrillic (Russian)
Cp1252 windows-1252 Microsoft Windows variant of Latin-1, NT default. Beware. Some unexpected translations occur when you read with this default encoding, e.g. codes 128..159 are translated to 16 bit chars with bits in the high order byte on. It does not just truncate the high byte on write and pad with 0 on read. For true Latin-1 see 8859-1.
Cp1253 windows-1253 Windows Greek
Cp1254 windows-1254 Windows Turkish
Cp1255 windows-1255 Windows Hebrew
Cp1256 windows-1256 Windows Arabic
Cp1257 windows-1257 Windows Baltic
Cp1258 windows-1258 Windows Vietnamese
Cp1381 x-IBM1381 IBM OS/2, DOS People’s Republic of China (PRC)
Cp1383 x-IBM1383 IBM AIX People’s Republic of China (PRC)
Cp33722 x-IBM33722 IBM-eucJP - Japanese (superset of 5050)
Default US-ASCII 7-bit ASCII (not the actual default!). Strips off the high order bit 7 and tacks on a high order 0 byte. The actual default is controlled in Windows 95/98/ME/NT/W2K/XP/W2K3 by the Control Panel national settings.
EBCDIC Not directly supported. EBCDIC comes in dozens of variants, most of which do not have Java support. Check out Cp037, Cp038, Cp278, Cp280, Cp284, Cp285, Cp297, Cp424, Cp500, Cp871, Cp918, Cp1046, Cp1047, Cp1048, Cp1148.
Filode n/a Used to encode filenames with fancy characters in them to make them usable on systems with ASCII-only filenames.
GB18030 GB18030 Simplified Chinese, PRC standard
GB2312 GB2312 Chinese. Popular in email.
GBK GBK GBK, Simplified Chinese
gzip gzip compressed, often used in HTML sent from a website.
IBMOEM
ISO-2022-CN ISO-2022-CN ISO 2022 CN, Chinese
ISO-2022-CN-CNS x-ISO-2022-CN-CNS CNS 11643 in ISO-2022-CN form, T. Chinese
ISO-2022-CN-GB x-ISO-2022-CN-GB GB 2312 in ISO-2022-CN form, S. Chinese
ISO-2022-JP ISO-2022-JP JIS0201, 0208, 0212, ISO-2022 Encoding, Japanese
ISO-2022-KR ISO-2022-KR ISO 2022 KR, Korean
ISO-8859-1 ISO-8859-1 ISO 8859-1, same as 8859_1, USA, Europe, Latin America, Caribbean, Canada, Africa, Latin-1, (Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish). Beware, for NT, the default is Cp1252 a variant of Latin-1, controlled by the control panel regional settings.
ISO-8859-2 ISO-8859-2 ISO 8859-2, Eastern Europe, Latin-2, (Albanian, Czech, English, German, Hungarian, Polish, Rumanian, (Serbo-)Croatian, Slovak, Slovene and Swedish)
ISO-8859-3 ISO-8859-3 ISO 8859-3, SE Europe/miscellaneous, Latin-3 (Afrikaans, Catalan, English, verdastelo Esperanto, French, Galician, German, Italian, Maltese and Turkish)
ISO-8859-4 ISO-8859-4 ISO 8859-4, Scandinavia/Baltic, Latin-4, (Danish, English, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, Lithuanian, Norwegian and Swedish)
ISO-8859-5 ISO-8859-5 ISO 8859-5, Cyrillic, (Bulgarian, Bielorussian, English, Macedonian, Russian, Serb(o-Croat)ian and Ukrainian)
ISO-8859-6 ISO-8859-6 ISO 8859-6, Arabic ASMO 449
ISO-8859-7 ISO-8859-7 ISO 8859-7, Greek ELOT-928
ISO-8859-8 ISO-8859-8 ISO 8859-8, Hebrew
ISO-8859-9 ISO-8859-9 ISO 8859-9, Turkish Latin-5, (English, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish and Turkish)
ISO-8859-10 ISO 8859-10, Lappish/Nordic/Eskimo languages, Latin-6. (Danish, English, Estonian, Faeroese, Finnish, German, Greenlandic, Icelandic, Lappish, Latvian, Lithuanian, Norwegian and Swedish)
ISO-8859-11 x-iso-8859-11 ISO 8859-11, Thai.
ISO-8859-12 ISO 8859-12, Devanagari.
ISO-8859-13 ISO 8859-13, Baltic Rim, Latin-7.
ISO-8859-14 ISO 8859-14, Celtic, Latin-8.
ISO-8859-15 ISO-8859-15 ISO 8859-15, Euro, including Euro currency sign, aka Latin-9, not Latin-15 as you would expect. Like Latin-1 with 8 replacements.
JIS ISO-2022-JP Japanese
JIS0201 JIS_X0201 JIS 0201, Japanese
JIS0212 JIS_X0212-1990 JIS 0212, Japanese
JISAutoDetect x-JISAutoDetect Detects and converts from Shift-JIS, EUC-JP, ISO- 2022 JP (conversion to Unicode only)
JIS_X0201 JIS_X0201 Japanese
JIS_X0212-1990 JIS_X0212-1990 Japanese
KOI8-R KOI8-R KOI8-R, Russian
ks_c_5601-1987 EUC-KR Korean standard often used in emails. See KSC5601.
KSC5601 EUC-KR Korean
Latin-1 see 8859-1 and Cp1252.
Latin-2   see 8859-2.
Latin-3   see 8859-3.
Latin-4   see 8859-4.
Latin Extended-A   MSWord
Latin Extended-B   MSWord
LocaleDefault   Mad as it sounds, the only way to get this is to look up the Locale default yourself and pass it explicitly, or use a variant method that does not specify the encoding. Passing the name default won’t do it! In my opinion, all methods that use a LocaleDefault without an encoding parameter should be deprecated.

You can also find out the encoding used on an InputStreamReader with InputStreamReader.getEncoding(). It will pick up the default, or the explicit encoding specified.

MacArabic x-MacArabic Macintosh Arabic
MacCentralEurope x-MacCentralEurope Macintosh Latin-2
MacCroatian x-MacCroatian Macintosh Croatian
MacCyrillic x-MacCyrillic Macintosh Cyrillic (Russian)
MacDingbat x-MacDingbat Macintosh Dingbat
MacGreek x-MacGreek Macintosh Greek
MacHebrew x-MacHebrew Macintosh Hebrew
MacIceland x-MacIceland Macintosh Iceland
MacRoman x-MacRoman Macintosh Roman
MacRomania x-MacRomania Macintosh Romania
MacSymbol x-MacSymbol Macintosh Symbol
MacThai x-MacThai Macintosh Thai
MacTurkish x-MacTurkish Macintosh Turkish
MacUkraine x-MacUkraine Macintosh Ukraine
MS874 x-windows-874 Windows Thai
MS932 windows-31j Windows Japanese. Microsoft JIS.
SingleByte This does not expand low order eight-bits with high order zero as its name implies. It looks to be a complex encoding for some Asian language.
Shift_JIS Shift_JIS Shift JIS. Japanese. A Microsoft code that extends csHalfWidthKatakana to include kanji by adding a second byte when the value of the first byte is in the ranges 81-9F or E0-EF.
TIS-620 TIS-620 TIS620, Thai
Transporter Transporter source code is available. A variant of Base64u also URL-encoded. It also optionally handles serialisation/reconstituting, compression/decompression, signing/verifying and heavy duty encryption/decryption.
truncation chop high byte, or 0-pad high byte.
UCS-2 Use UTF-16.
Unicode UTF-16 use UTF-16BE instead. Big endian, must be marked.
Unicode-8 see UTF-8.
Unicode-16 see UTF-16.
UnicodeBig UTF-16 use UTF-16BE instead. 16-bit UCS-2 Transformation Format, big endian byte order identified by an optional byte-order mark; FE FF. On read, defaults to big-endian. On write puts out a big-endian marker. Same as Unicode.
UnicodeBigUnmarked UTF-16BE 16-bit UCS-2 Transformation Format, big endian byte order, definitely without Byte Order Mark. Not written on write, ignored on read. Same as UTF-16BE.
UnicodeLittle x-UTF-16LE-BOM Use UTF-16LE instead. 16-bit UCS-2 Transformation Format, little endian byte order identified by an optional byte-order mark; FF FE. On read, defaults to little-endian. On write puts out a little-endian marker.
UnicodeLittleUnmarked UTF-16LE 16-bit UCS-2 Transformation Format, little endian byte order, definitely without Byte Order Mark. Not written on write, ignored on read.
URL For x-www-form-urlencoded use java.net.URLEncoder.encode and java.net.URLDecoder.decode instead. Used to encode CGI command lines. It encodes space as + and special characters as %xx hex. Don’t confuse it with BASE64 or BASE64u.
US-ASCII US-ASCII 7-bit American Standard Code for Information Interchange.
Uuencode Similar to base64.
UTF-7 7-bit encoded Unicode.
UTF-8 UTF-8 8-bit encoded Unicode. née UTF8. Optional marker on front of file: EF BB BF for reading. Unfortunately, OutputStreamWriter does not automatically insert the marker on writing. Notepad can’t read the file without this marker. Now the question is, how do you get that marker in there? You can’t just emit the bytes EF BB BF since they will be encoded and changed. However, the solution is quite simple. Put prw.write( '\ufeff' ); at the head of the file. This will be encoded as EF BB BF. RFC 3629 officially describes the UTF-8 format.

DataOutputStreams have a binary length count in front of each string. Endianness does not apply to 8-bit encodings. Java’s DataOutputStream and ObjectOutputStream use a slight variant of kosher UTF-8. To aid with compatibility with C in JNI, the null character '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls. Only the 1-byte, 2-byte, and 3-byte formats are used. Supplementary characters (above 0xffff) are represented in the form of surrogate pairs (a pair of encoded 16-bit characters in a special range), rather than directly encoding the character.

UTF-16 UTF-16 Same as Unicode. Default big endian, optionally marked. UTF-16 is officially defined in Annex Q of ISO/IEC 10646-1. (Copies of ISO standards are quite expensive.) It is also described in the Unicode Consortium’s Unicode Standard, as well as in the IETF’s RFC 2781. To put the byte order mark in at the head of the file use prw.write( '\ufeff' ); This will be encoded as FE FF.
UTF-16BE UTF-16BE 16-bit UCS-2 Transformation Format, big endian byte order. On read, assumes big-endian. On write, does not put out a byte-order mark. If you definitely have a BOM, use x-UTF-16BE-BOM.
UTF-16LE UTF-16LE 16-bit UCS-2 Transformation Format, little endian byte order. On read, assumes little-endian. On write, does not put out a byte-order mark. If you definitely have a BOM, use x-UTF-16LE-BOM.
UTF-32 UTF-32 32-bit UCS-4 Transformation Format, byte order identified by an optional byte-order mark: FF FE 00 00 for little endian, 00 00 FE FF for big endian.
UTF-32BE UTF-32BE 32-bit UCS-4 Transformation Format, big-endian byte order. If you definitely have a BOM, use X-UTF-32BE-BOM.
UTF-32LE UTF-32LE 32-bit UCS-4 Transformation Format, little-endian byte order. If you definitely have a BOM, use X-UTF-32LE-BOM.
windows-1250 windows-1250 Windows Eastern European
windows-1251 windows-1251 Windows Cyrillic (Russian)
windows-1252 windows-1252 Microsoft Windows variant of Latin-1, NT default. Beware. Some unexpected translations occur when you read with this default encoding, e.g. codes 128..159 are translated to 16 bit chars with bits in the high order byte on. It does not just truncate the high byte on write and pad with 0 on read. For true Latin-1 see 8859-1.
windows-1253 windows-1253 Windows Greek
windows-1254 windows-1254 Windows Turkish
windows-1255 windows-1255 Windows Hebrew
windows-1256 windows-1256 Windows Arabic
windows-1257 windows-1257 Windows Baltic
windows-1258 windows-1258 Windows Vietnamese
windows-31j windows-31j Windows 31j
x-EUC-CN GB2312 GB2312, EUC encoding, Simplified Chinese
x-EUC-JP EUC-JP JIS0201, 0208, 0212, EUC Encoding, Japanese
x-EUC-JP-LINUX x-euc-jp-linux JISX0201, 0208, EUC Encoding, Japanese for Linux
x-EUC-KR KS C 5601, EUC Encoding, Korean
x-EUC-TW x-EUC-TW CNS11643 (Plane 1-3), T. Chinese, EUC encoding
x-ISCII91 x-ISCII91 ISCII91 encoding of Indic scripts
x-JIS0208 x-JIS0208 JIS 0208, Japanese
x-Johab x-Johab Johab, Korean
x-MS950-HKSCS x-MS950-HKSCS Windows Traditional Chinese with Hong Kong extensions
x-mswin-936 x-mswin-936 Windows Simplified Chinese PRC
x-UTF-16BE-BOM Unicode 16 bit big ended, with a BOM definitely present.
x-UTF-16LE-BOM x-UTF-16LE-BOM Unicode 16 bit little ended, with a BOM definitely present.
X-UTF-32BE-BOM X-UTF-32BE-BOM Unicode 32 bit big ended, with a BOM definitely present.
X-UTF-32LE-BOM X-UTF-32LE-BOM Unicode 32 bit little ended, with a BOM definitely present.
x-windows-949 x-windows-949 Windows Korean
x-windows-950 x-windows-950 Windows Traditional Chinese
Where two fonts are shown separated by a /, the second one is the new version including the euro symbol. Adam Dingle did the research on how these encodings work.

Many new encodings were added in Java 1.4.1 and some were dropped. This list contains even the dropped items. Beware, this list is not complete. Mainly it is missing all the new Windows and IBM proprietary encodings. Before you use an encoding, make sure it is supported by your version of Java.

Note that what Java and the HTML 4.0 specification call a "character encoding" is actually called a "character set" at IANA and in the HTTP proposed standard.

I would like to do some experiments to find out for sure what happens with BOMs in various encodings. I discovered that native2ascii would not work with a BOM until I used x-UTF-16LE-BOM encoding.
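One way to run such an experiment is to encode a one-letter string and dump the resulting bytes in hex. This is only a sketch: the class name BomProbe and the choice of encodings to probe are assumptions, and any encoding the JVM lacks is reported as unsupported rather than guessed at.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class: shows which encodings emit a byte-order mark on write.
final class BomProbe {
    /** Encodes text with the named encoding and returns the bytes as spaced hex. */
    static String hex(String text, String encoding) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : text.getBytes(encoding)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02X", b & 0xFF));
            }
            return sb.toString();
        } catch (UnsupportedEncodingException e) {
            return "(unsupported)";
        }
    }

    public static void main(String[] args) {
        String[] encodings = { "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE", "x-UTF-16LE-BOM" };
        for (String enc : encodings) {
            System.out.println(enc + ": " + hex("A", enc));
        }
        // Writing U+FEFF explicitly is how a UTF-8 marker gets into the output.
        System.out.println("UTF-8 with explicit marker: " + hex("\uFEFFA", "UTF-8"));
    }
}
```

With Java's UTF-16 encoder the letter A comes out as FE FF 00 41, that is, a big-endian marker followed by the character, while UTF-16BE and UTF-16LE emit no marker at all.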

Why So Many Encodings?

You may wonder why there are so many encodings, or why there are any at all. The reason is historical chaos. In the beginning, computers used a 4-channel, 16 possible characters, paper tape, allowing for only the hexadecimal characters 0..9 and A..F. To allow for alphabetic messages, not just numbers, two more channels of holes on the paper tape were added. This 6-bit, 64-character encoding allows digits, upper case A..Z and some punctuation. Every university or major computer installation invented its own code, allowing for the punctuations and symbols of most local importance.

Universities started exchanging programs and data on magnetic tape. The 7-bit, 128-character ASCII code was invented to allow for a common character set and encoding. It allowed for both upper and lower case and a reasonably rich set of punctuation.

About that time, computers started to standardise on the 8-bit byte. Every national group then expanded the code to 8-bits, giving them 256 possible characters. They filled the extra slots with various accented letters, nationally important symbols and letters from non-Roman alphabets. The Chinese had a difficult problem. They needed thousands of symbols, not just 256 offered by 8-bit codes. So they invented various multi-byte encodings and 16-bit encodings. IBM invented EBCDIC, its own proprietary sets of codes to help lock customers into its equipment. There was very little document sharing, so the fact every country had its own way of encoding data, and sometimes dozens of ways, caused little trouble.

To allow for exchange, especially on the Internet, 16-bit, 65,536-character Unicode was invented. Surely this would be sufficient to handle all of earth’s languages! It had one big drawback. At least for English, its files were twice as fat as the old 8-bit encodings. The world was not prepared to abandon its hundreds of encodings, even for new files. Not only were they firmly entrenched in email, they were burned into hardware, such as printers and modems. Java needed tables called encodings to translate scores of these 8-bit encodings into Unicode. Java’s Readers and Writers automatically handle the translations.

Then UTF-8 was invented to give the benefit of compact 8-bit encoding, with the full 16-bit Unicode character set.

Then scholars complained that Unicode did not handle various dead languages and obscure musical notation. So Unicode was extended to 32 bits to shut them up. Java half-heartedly supports this with code points.

When you write Java programs, there are at least three encodings you will be forced to deal with:

  1. UTF-16 : how Java stores characters and codepoints (32-bit characters) internally in Strings and char[].
  2. UTF-8 : the usual compact way to store Unicode data on disk or text files.
  3. your local default : For me, this is windows-1252. It is the default encoding of *.bat files and notepad.

ISO

You can buy documentation on the ISO code sets from the ironically named Organisation for International Standards. They cost approximately 50.00 CHF.

Roll Your own

You can find out what is already supported with java.nio.charset.Charset.availableCharsets().

If you don’t see the character set encoding you need, you can write your own translate/encoding tables and insert them as part of the official set. See the java.nio.charset.spi.CharsetProvider, Charset, CharsetEncoder and CharsetDecoder classes.

To create a new character set, you extend CharsetProvider to provide one or more custom Charsets with look-up by name. To create the custom Charset, you extend the Charset class, mainly to flesh it out with methods for newEncoder and newDecoder which provide your own custom CharsetEncoder and CharsetDecoder respectively.

To write your custom CharsetEncoder you extend CharsetEncoder and write a custom encodeLoop method. To write your custom CharsetDecoder you extend CharsetDecoder and write a custom decodeLoop method. You can of course borrow these methods from some other Charset, and just code some exceptions to the rule. You can borrow either by extending or by delegation.

After all this is all ready, to include your Charsets as part of the official ones, you register your new CharsetProvider with a configuration file named java.nio.charset.spi.CharsetProvider in the resource directory META-INF/services. This file contains a list of your fully-qualified CharsetProvider class names, one per line. The file must be encoded in UTF-8.
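The steps above can be sketched as follows. Everything named here is hypothetical: the charset name x-demo-8bit, its alias demo8 and both class names are invented for illustration. The toy encoding simply maps chars 0..255 to one byte each and treats everything else as unmappable.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical charset: chars 0..255 become one byte each, anything else is unmappable.
final class DemoCharset extends Charset {
    DemoCharset() {
        super("x-demo-8bit", new String[] { "demo8" });
    }

    @Override
    public boolean contains(Charset cs) {
        return cs instanceof DemoCharset;
    }

    @Override
    public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    out.put((char) (in.get() & 0xFF)); // pad with a zero high byte
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override
    public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    char c = in.get(in.position()); // peek without consuming
                    if (c > 0xFF) {
                        return CoderResult.unmappableForLength(1);
                    }
                    in.get(); // now consume it
                    out.put((byte) c);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }
}

// Hypothetical provider. It must be declared public (with a public no-argument
// constructor) when it is registered through META-INF/services.
class DemoCharsetProvider extends CharsetProvider {
    private final Charset demo = new DemoCharset();

    @Override
    public Charset charsetForName(String name) {
        boolean match = name.equalsIgnoreCase(demo.name()) || demo.aliases().contains(name);
        return match ? demo : null;
    }

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(demo).iterator();
    }
}
```

Under these assumptions, the registration file META-INF/services/java.nio.charset.spi.CharsetProvider would contain the single line DemoCharsetProvider, or the fully-qualified name if the class lives in a package. Until it is registered, you can still exercise the charset directly, e.g. "abc".getBytes( new DemoCharsetProvider().charsetForName( "demo8" ) ).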

Converting

The key thing to keep uppermost in your mind when converting is that all encoded files are conceptually composed of 8-bit byte[], even UTF-16 encoded files. Java internally works with Unicode 16-bit chars. Don’t try to go from String to String or byte[] to byte[]. You are always encoding String to byte[] or decoding byte[] to String. There are four basic ways to do the conversions:
  1. With Reader and Writer file I/O. See the File I/O Amanuensis for details. Your files are byte-encoded and you read and write translating into Strings internally. Use a Reader to decode bytes to Strings and a Writer to encode Strings to bytes.
  2. Use the String constructor to decode bytes to Strings.
    // use String constructor to decode bytes to String
    // throws java.io.UnsupportedEncodingException if the encoding name is not supported
    byte[] someBytes = ...;
    String encodingName = "Shift_JIS";
    String s = new String( someBytes, encodingName );
    Use String.getBytes to encode Strings to bytes.
    // Using String.getBytes to encode String to bytes
    // throws java.io.UnsupportedEncodingException if the encoding name is not supported
    String s = ...;
    byte[] b = s.getBytes( "8859_1" /* encoding */ );
  3. If you have more than one conversion to do, use java.nio.charset.Charset. This saves the overhead of looking up the encoding class by name for each conversion. Decoding bytes to String.
    // decoding bytes to a String
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    
    byte[] b = ...;
    // pick one: the platform default encoding or a named encoding
    Charset def = Charset.defaultCharset() /* default encoding */;
    Charset cs = Charset.forName( "Shift_JIS" /* encoding */ );
    ByteBuffer bb = ByteBuffer.wrap( b );
    CharBuffer cb = cs.decode( bb );
    String s = cb.toString();
    Encoding a String to bytes works the same way in reverse: cs.encode( CharBuffer ) returns a ByteBuffer.
  4. If you want very fast conversions, you must avoid the hidden copies that are inherent in the above methods. You would work directly with CharBuffer and ByteBuffer.
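The list above shows code for the decoding half of the Charset approach only. As a sketch of both directions with a Charset you look up once and reuse, assuming the illustrative class name CharsetRoundTrip:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical demo class: encode and decode through a Charset object.
final class CharsetRoundTrip {
    /** Encodes a String to bytes, copying only the bytes actually produced. */
    static byte[] encode(String s, Charset cs) {
        ByteBuffer bb = cs.encode(s);
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    /** Decodes bytes to a String. */
    static String decode(byte[] b, Charset cs) {
        return cs.decode(ByteBuffer.wrap(b)).toString();
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8"); // look the charset up once, reuse it
        byte[] bytes = encode("caf\u00e9", utf8);
        System.out.println(bytes.length + " bytes, decoded back as " + decode(bytes, utf8));
    }
}
```

Copy only bb.remaining() bytes rather than calling bb.array(), since the backing array of the returned buffer may be longer than the encoded data.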

native2ascii

Sun has included a utility, misnamed native2ascii.exe, with the JDK.
It converts files from any encoding to a 7-bit printable form, and back. The printable form uses ASCII characters plus forms like \u95e8 for the exotic characters.
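To make the printable form concrete, here is a rough sketch of the forward direction only. It is not the tool's actual implementation, just an illustration of the escape format, and the class name AsciiEscaper is an assumption.

```java
// Hypothetical demo class: turns non-ASCII chars into backslash-u hex escapes.
final class AsciiEscaper {
    static String toAsciiEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c); // plain 7-bit ASCII passes through unchanged
            } else {
                sb.append(String.format("\\u%04x", (int) c)); // four hex digits per char
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiEscapes("caf\u00e9 \u95e8"));
    }
}
```

Because it works char by char, a supplementary character comes out as two escapes, one for each half of its surrogate pair.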

Reversibility

You won’t necessarily get exactly back to where you started if you encode then decode. If you choose a traditional single-byte, 8-bit encoding, say Cp437, as your target, there are only 256 codes to go round for all 64K Unicode characters. Obviously, some Unicode characters are going to have to collapse onto the same 8-bit character, and so won’t decode back to where they started. Further, some of these 8-bit encodings have a few strange characters that don’t exist in Unicode. UTF-8 does not suffer from this problem.

Further, the encode/decode routines are permitted to combine pairs such as 0x0055 (LATIN CAPITAL LETTER U) followed by 0x0308 (COMBINING DIAERESIS) to a single character 0x00DC (LATIN CAPITAL LETTER U WITH DIAERESIS), or vice versa.
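The same pair can be combined and split explicitly with java.text.Normalizer, available since Java 1.6. Note that this is a separate API from the encoders and decoders; it is used here only to show what the two equivalent forms look like. The class name ComposeSketch is hypothetical.

```java
import java.text.Normalizer;

// Hypothetical sketch class for illustration only.
class ComposeSketch
    {
    /** Combine base letter plus combining mark into a single precomposed char where one exists. */
    static String compose( String s )
        {
        return Normalizer.normalize( s, Normalizer.Form.NFC );
        }

    /** Split precomposed chars into base letter plus combining marks. */
    static String decompose( String s )
        {
        return Normalizer.normalize( s, Normalizer.Form.NFD );
        }

    public static void main( String[] args )
        {
        String pair = "U\u0308"; // U followed by COMBINING DIAERESIS
        System.out.println( compose( pair ).length() );         // prints 1
        System.out.println( decompose( "\u00dc" ).length() );   // prints 2
        }
    }
```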

Tracking Which Characters Get Translated Where

Be careful when translating between character sets using the encoding feature of Readers. Everything goes through the intermediate 16-bit Unicode, which may not have all the characters of the source and destination character sets. Some characters may be translated to codes with some high byte bits on. For more accurate translation, do it yourself with a one-step table. You can use the following program to discover what translations are being done with any particular encoding. You can then use that information to generate the source for your own translate table from the automatic encodings, so that you can see any inaccuracies and fix them.
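The original listing is not reproduced here, so what follows is only a minimal stand-in under stated assumptions: it handles single-byte encodings only, decodes each of the 256 byte values on its own, and marks anything that does not decode to exactly one char as U+FFFD. The class name TranslationTableSketch and the method decodeTable are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical sketch class; a stand-in, not the page's original program.
class TranslationTableSketch
    {
    /** Decode each of the 256 byte values separately and record the resulting char. */
    static char[] decodeTable( String encodingName )
        {
        Charset cs = Charset.forName( encodingName );
        char[] table = new char[ 256 ];
        for ( int i = 0; i < 256; i++ )
            {
            String s = new String( new byte[] { (byte) i }, cs );
            // anything that is not exactly one char is recorded as the replacement char
            table[ i ] = s.length() == 1 ? s.charAt( 0 ) : '\ufffd';
            }
        return table;
        }

    public static void main( String[] args )
        {
        char[] table = decodeTable( "ISO-8859-1" );
        // show a slice of the table; widen the loop to dump all 256 entries
        for ( int i = 0x80; i < 0x90; i++ )
            {
            System.out.printf( "0x%02x -> U+%04X%n", i, (int) table[ i ] );
            }
        }
    }
```

The printed lines can be pasted into a hand-maintained translate table and corrected where the automatic mapping is not what you want.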

Encoding Identification

Files are not marked with a signature to denote the encoding used. Further, the encoding is not recorded externally in some sort of resource fork. You are just supposed to know what sort of encoding was used or track it by some ad hoc means. There are three exceptions. You can make a guess by reading the text. The language gives a clue to the likely encoding used. The way common words are encoded gives a clue. Try looking at the document in various encodings and see which makes the most sense.

The Unicode little-endian or big-endian BOM (Byte Order Mark) is a strong clue you have 16-bit Unicode.
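Checking for the mark takes only a couple of comparisons. The sketch below looks at the first two bytes only; the class name BomSketch and the method guessFromBom are hypothetical, and the returned strings are simply the Java encoding names for the two byte orders.

```java
// Hypothetical sketch class for illustration only.
class BomSketch
    {
    /** Return "UTF-16BE" or "UTF-16LE" if the data starts with a 16-bit BOM, else null. */
    static String guessFromBom( byte[] b )
        {
        if ( b.length >= 2 )
            {
            int b0 = b[ 0 ] & 0xff;
            int b1 = b[ 1 ] & 0xff;
            if ( b0 == 0xfe && b1 == 0xff )
                {
                return "UTF-16BE"; // big-endian mark FE FF
                }
            if ( b0 == 0xff && b1 == 0xfe )
                {
                return "UTF-16LE"; // little-endian mark FF FE
                }
            }
        return null; // no 16-bit BOM found; the encoding is still unknown
        }

    public static void main( String[] args )
        {
        byte[] sample = { (byte) 0xff, (byte) 0xfe, 0x41, 0x00 };
        System.out.println( guessFromBom( sample ) ); // prints UTF-16LE
        }
    }
```

A missing BOM proves nothing, since many 16-bit files are written without one.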

To automate the guessing, you could look for common foreign words to see how they are encoded. You could compute letter frequencies and compare them against documents with known encodings.
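One crude way to automate the common-word approach is sketched below, under several assumptions: you supply a probe word that you expect to occur in the document and that contains at least one non-ASCII letter, and you supply the list of candidate encodings yourself. The word is encoded under each candidate and the raw bytes are searched for the result. The class name ProbeWordSketch and its methods are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch class for illustration only.
class ProbeWordSketch
    {
    /** Naive byte-sequence search; returns the first match offset, or -1. */
    static int indexOf( byte[] haystack, byte[] needle )
        {
        outer:
        for ( int i = 0; i <= haystack.length - needle.length; i++ )
            {
            for ( int j = 0; j < needle.length; j++ )
                {
                if ( haystack[ i + j ] != needle[ j ] )
                    {
                    continue outer;
                    }
                }
            return i;
            }
        return -1;
        }

    /** Return the candidate encodings under which the probe word's bytes occur in the data. */
    static List<String> plausibleEncodings( byte[] data, String probeWord, String... candidates )
        {
        List<String> hits = new ArrayList<String>();
        for ( String name : candidates )
            {
            byte[] needle = probeWord.getBytes( Charset.forName( name ) );
            if ( indexOf( data, needle ) >= 0 )
                {
                hits.add( name );
                }
            }
        return hits;
        }

    public static void main( String[] args )
        {
        byte[] data = "Das M\u00e4dchen".getBytes( Charset.forName( "UTF-8" ) );
        System.out.println( plausibleEncodings( data, "M\u00e4dchen", "ISO-8859-1", "UTF-8" ) );
        }
    }
```

The approach has obvious limits. A pure-ASCII probe word matches under almost every encoding, and a candidate that cannot represent the word at all encodes it with question marks, which could match by accident.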

You might want to tackle this student project to solve the problem.

The following Applet helps you determine the encoding of a file by displaying the beginning of it in hex and decoded characters in any of the supported Java encodings. If the file is made only of printable ASCII characters, then almost any encoding can be used to read it. If the display shows blanks between each character then chances are you have some variant of UTF-16 encoding.
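If the Applet is unavailable, the hex half of that display is easy to approximate. The following is a rough sketch, not the Applet's source: the class name HexPeekSketch and the method toHex are hypothetical, and the sample bytes are generated in memory rather than read from a file.

```java
import java.nio.charset.Charset;

// Hypothetical sketch class; not the Applet's source.
class HexPeekSketch
    {
    /** Format up to limit leading bytes as two-digit hex values separated by spaces. */
    static String toHex( byte[] b, int limit )
        {
        StringBuilder sb = new StringBuilder();
        int n = Math.min( limit, b.length );
        for ( int i = 0; i < n; i++ )
            {
            if ( i > 0 )
                {
                sb.append( ' ' );
                }
            sb.append( String.format( "%02x", b[ i ] & 0xff ) );
            }
        return sb.toString();
        }

    public static void main( String[] args )
        {
        byte[] sample = "Hi".getBytes( Charset.forName( "UTF-16LE" ) );
        System.out.println( toHex( sample, 16 ) ); // prints 48 00 69 00
        // decoding with the wrong single-byte encoding leaves a NUL after each letter,
        // which typically displays as a blank between characters
        String wrong = new String( sample, Charset.forName( "ISO-8859-1" ) );
        System.out.println( wrong.length() ); // prints 4
        }
    }
```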

You can fine tune your guesses by entering them in the Official Encoding Applet above to see which sample character set looks most plausible for documents such as yours.

If the above Encoding Recogniser signed Java Applet does not work…

  1. This signed Java Applet needs Java 1.5 or later, best version 1.5.0_16 or later, version 1.6.0_10 recommended and a recent browser.
  2. You should see the Applet above looking much like the screenshot. If you don’t, the following should help you get it working:
  3. If you are using Microsoft Internet Explorer, try another browser. Seriously. Microsoft has taken great pains, over and over, to screw up Java and every other multi-platform standardisation.
  4. If you are using Internet Explorer 7 or 8, you must allow blocked content permission for Active X to run. This also gives permission to Java to run. Click the Information bar, and then click Allow blocked content. Unfortunately, this also allows dangerous ActiveX code to run. However, you must do this in order to get access to perfectly-safe Java Applets running in a sandbox. This is part of Microsoft’s war on Java. Don’t put up with it! Use a different browser.
  5. For this Applet to work, you must click grant/accept to give it permission to read the file whose encoding you want to determine.
  6. Optionally, you may permanently install the Canadian Mind Products code-signing certificate so you don’t have to grant each time.
  7. If the above Applet appears to freeze up, press Alt-Esc repeatedly to check for any buried permission dialog box.
  8. If you have certificate troubles, check the installed certificates and remove or update any obsolete or suspected defective certificates. The only certificate used by this program is mindprodcert2008dsa.
  9. Especially if this Applet has worked before, try clearing the browser cache and rebooting.
  10. To ensure your Java is up to date, check with Wassup. First, download it and run it as an application independent of your browser, then run it online as an Applet to add the complication of your browser.
  11. If the above Applet does not work, check the Java console for error messages.
  12. If the above Applet does not work, you might have better luck with the downloadable version.
  13. If you still can’t get the program working click HELP for more detail.
  14. If you can’t get the above Applet working after trying the advice above and from the HELP button below, or if you have bugs to report or ideas to improve the program or its documentation, please email Roedy Green.
Package: encodingrecogniser (Encoding Recognizer)
Version: 1.1
Released: 2008-04-01
Licence: free
Language: Java
Notes: summary / PAD description / screenshot for the current version of Encoding Recognizer. Helps determine a file’s encoding by displaying it presuming all the different supported encodings.
download Encoding Recognizer source and compiled class files to run on your own machine as an application or Applet. First install the most recent Java. To install, extract the zip download with WinZip (or a similar unzip utility) into any directory you please, often J:\, ticking the “use folder names” option. To run as an application, type:
java -jar J:\com\mindprod\encodingrecogniser\encodingrecogniser.jar
adjusting as necessary to account for where the jar file is.
download ASP PAD XML program description for the current version of Encoding Recognizer.
Encoding Recognizer is free. Full source included. You may even include the source code, modified or unmodified in commercial programs that you write and distribute. Non-military use only.
   
 

Choosing an Encoding

The times you would send people special encodings are when:
  1. They are using old text-based software that supports only one particular encoding.
  2. You are sending large volumes of data, and want the efficiency of a national encoding perhaps combined with compression.
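The efficiency argument in point 2 can be checked by counting bytes. This sketch uses ISO-8859-1 as a stand-in for a national encoding, purely because every JVM supports it; the class name EncodedSizeSketch and the method encodedSize are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical sketch class for illustration only.
class EncodedSizeSketch
    {
    /** Number of bytes the String occupies when encoded with the named encoding. */
    static int encodedSize( String s, String encodingName )
        {
        return s.getBytes( Charset.forName( encodingName ) ).length;
        }

    public static void main( String[] args )
        {
        String word = "caf\u00e9";
        System.out.println( encodedSize( word, "ISO-8859-1" ) ); // prints 4
        System.out.println( encodedSize( word, "UTF-8" ) );      // prints 5
        System.out.println( encodedSize( word, "UTF-16BE" ) );   // prints 8
        }
    }
```

The gap grows with the proportion of non-ASCII letters in the text, so it is worth measuring on a sample of your real data before deciding.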

Java Source Code Encoding

If your Java source code contains awkward characters encoded with \uxxxx, then there is nothing special you need to do. However, if they are encoded as naked UTF-8 characters, then you need to code:
Rem invoking the compiler when your source code contains naked UTF-8 characters.
javac.exe -encoding UTF-8  MyClass.java

Default File and Console Encoding

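If you want to change the default encoding for files you read and write, including the console, you need to set the file.encoding system property. You can attempt this programmatically, though the JVM usually reads the property once at startup, so a change made later may not take effect: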
// setting the default encoding programmatically
System.setProperty( "file.encoding", "UTF-8" );
You can also do it on the java.exe command line like this:
Rem setting the default encoding on the command line
java.exe "-Dfile.encoding=UTF-8" -jar myprog.jar
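To see which default is actually in force, you can print both the property and the Charset the JVM settled on. This is a small diagnostic sketch; the class name DefaultEncodingSketch and the method describeDefault are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical sketch class for illustration only.
class DefaultEncodingSketch
    {
    /** Report the file.encoding property alongside the JVM's effective default Charset. */
    static String describeDefault()
        {
        return "file.encoding=" + System.getProperty( "file.encoding" )
               + ", defaultCharset=" + Charset.defaultCharset().name();
        }

    public static void main( String[] args )
        {
        System.out.println( describeDefault() );
        }
    }
```

Running it once with and once without the -Dfile.encoding option shows whether the setting took effect on your JVM.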

Learning More

Sun’s JDK Technote Guide on Locales and Encoding
Sun’s JDK Technote Guide on nio encodings
Sun’s Javadoc on the Charset class
Sun’s Javadoc on the CharsetEncoder class
Sun’s Javadoc on the CharsetDecoder class
Sun’s Javadoc on the CharsetProvider class
Sun’s Javadoc on the ByteBuffer class
Sun’s Javadoc on the CharBuffer class
Sun’s Javadoc on System.getProperty
Sun’s JDK Tool Guide to native2ascii
will encode/decode base64, quoted-printable, 7bit, 8bit, binary and uuencode.


feedback Please email your feedback for publication, errors, omissions, broken/redirected link reports
and suggestions to improve this page to Roedy Green : feedback email
The information on this page is for non-military use only. Military use includes use by defence contractors.
You can get a fresh copy of this page from http://mindprod.com/jgloss/encoding.html or possibly from your local J: drive (Java virtual drive/mindprod.com website mirror) at J:\mindprod\jgloss\encoding.html