A way of armouring awkward characters so they can be sent safely. Browsers use url-encoding on HTTP GET and POST
requests to the server. With GET, they embed the data in the URL itself. Url-encoding is also used by the
application/x-www-form-urlencoded MIME type.
You see url-encoding every time you do a Google search, e.g.
http://www.google.com/search?client=opera&rls=en&q=%22rabbits%22%2BEaster+eggs&sourceid=opera&ie=utf-8&oe=utf-8
The request url-encodes my query:
"rabbits"+Easter eggs
There are two flavours of url-encoding: one used in URLs and one used in forms.
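To check how the q= parameter of the search URL above maps back to the query, here is a minimal sketch that decodes it with java.net.URLDecoder. It assumes the value was encoded as UTF-8 (as the ie=utf-8 parameter suggests); the class name QueryDecodeDemo is made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class QueryDecodeDemo {
    /** Decodes one form-encoded value, assuming it was encoded as UTF-8. */
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // %22 is a double quote, %2B is a literal plus, and + is a space.
        System.out.println(decode("%22rabbits%22%2BEaster+eggs"));
        // prints: "rabbits"+Easter eggs
    }
}
```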
URL Encoding
Ironically, despite the name, you are not supposed to use java.net.URLEncoder.encode and
java.net.URLDecoder.decode to handle encoding URLs or GET parameters. They will work most of the time
however. Unfortunately, the URL class provides no escaping features. You must build the address with the URI
class, whose multi-argument constructors escape illegal characters, and then convert it to a URL with toURL().
The encoding algorithm is described in RFC 3986.
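As a rough illustration of the URI route, the sketch below builds an address from unescaped parts with the five-argument URI constructor and then converts it with toURL(). The host www.example.com, the path and the class name UriEncodeDemo are placeholders. Note that this constructor only quotes characters that are illegal in a URI; it leaves legal ones such as & and = in the query untouched.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class UriEncodeDemo {
    /** Builds an http URI from unescaped parts; illegal characters such as spaces get quoted. */
    static URI build(String host, String path, String query) {
        try {
            // Arguments: scheme, authority, path, query, fragment.
            return new URI("http", host, path, query, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        URI uri = build("www.example.com", "/my docs/file name.html", "q=Easter eggs");
        System.out.println(uri.toASCIIString());
        // prints: http://www.example.com/my%20docs/file%20name.html?q=Easter%20eggs

        // Convert to a URL only after the URI class has done the escaping.
        URL url = uri.toURL();
        System.out.println(url);
    }
}
```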
To decode a String, you just feed it to the single-argument URI
constructor, then extract the various fields with methods like URI.getPath().
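Decoding along these lines might look like the following sketch. The address and the class name UriDecodeDemo are invented for the example; getPath() and getQuery() return the decoded fields, while getRawPath() returns the text exactly as it was given.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriDecodeDemo {
    /** Parses an already-encoded address with the single-argument URI constructor. */
    static URI parse(String encoded) {
        try {
            return new URI(encoded);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = parse("http://www.example.com/my%20docs/file%20name.html?q=Easter%20eggs");
        System.out.println(uri.getHost());    // www.example.com
        System.out.println(uri.getPath());    // /my docs/file name.html
        System.out.println(uri.getRawPath()); // /my%20docs/file%20name.html
        System.out.println(uri.getQuery());   // q=Easter eggs
    }
}
```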
Properly speaking, you should not see a bare & in URLs embedded in HTML; each one should be pre-encoded as &amp;.
I wrote a utility called Amper that processes *.html
files to make this correction.
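The correction itself is simple when you generate the HTML yourself. The sketch below is not the Amper utility, just a hypothetical helper (AmpersandDemo.forHtmlAttribute) that assumes it is handed a raw URL that has not been HTML-escaped yet; applying it twice would produce &amp;amp;.

```java
class AmpersandDemo {
    /** Escapes every bare & in a raw URL so it can be placed in an HTML href attribute. */
    static String forHtmlAttribute(String rawUrl) {
        // Assumes rawUrl contains no &amp; already; this is a plain literal replacement.
        return rawUrl.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String raw = "http://www.example.com/search?a=1&b=2";
        System.out.println("<a href=\"" + forHtmlAttribute(raw) + "\">search</a>");
        // prints: <a href="http://www.example.com/search?a=1&amp;b=2">search</a>
    }
}
```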
Form Encoding
Form url-encoding/decoding is handled by java.net.URLEncoder.encode and java.net.URLDecoder.decode.
This is only intended for String data with a few awkward characters in it, not heavy-duty
binary. Encodings you will likely use in conjunction with URLEncoder include ISO-8859-1
(Latin-1), UTF-8 and windows-1250.
When you use URLEncoder.encode you must specify a byte-oriented
encoding such as UTF-8 or ISO-8859-1. The algorithm first
converts the characters to bytes in that encoding, then encodes the bytes. Thus the encoded string depends on the
encoding you choose. The encoding is not embedded in the output. You just have to know what it is when an incoming
url-encoded string arrives.
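A small experiment makes the dependence on the encoding visible. This is only a sketch; the class name CharsetDemo and the sample word are arbitrary, and the accented letter is written as a Unicode escape so the source file's own encoding does not matter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class CharsetDemo {
    /** Url-encodes text with the named encoding, wrapping the checked exception. */
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Url-decodes text with the named encoding, wrapping the checked exception. */
    static String decode(String text, String charsetName) {
        try {
            return URLDecoder.decode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "cafe" with an acute accent on the e

        System.out.println(encode(word, "UTF-8"));      // caf%C3%A9 (two bytes for the accented e)
        System.out.println(encode(word, "ISO-8859-1")); // caf%E9    (one byte for the accented e)

        // Decoding with the wrong encoding silently yields the wrong characters.
        String wrong = decode("caf%C3%A9", "ISO-8859-1");
        System.out.println(wrong.equals(word));          // false
    }
}
```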
java.net.URLEncoder uses the following set of characters to convert 8-bit data into
printable characters: a to z, A to Z, 0 to 9, -, ., *, and _. It works like this:
- The alphanumeric characters a to z, A to Z and 0 to 9 remain the same.
- The special characters ., -, * and _ remain the same.
- The space character is converted into a plus sign +.
- All other characters are unsafe and are first converted into one or more bytes using the chosen encoding
scheme. Then each byte is represented by the 3-character string %xy, where xy is the two-digit
hexadecimal representation of the byte, e.g. $ → %24, % → %25, & → %26, / → %2F, : → %3A, = → %3D, ? → %3F.
You must URLEncode only once. If you URLEncode something already URLEncoded, you will get gibberish.
In the best case, your message is the same size as the original. In a pathological case, your message can balloon up to
three times the original size.
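The rules above, the double-encoding trap and the worst-case growth can all be observed with a few calls. The following is a sketch assuming UTF-8; the class name FormEncodeRulesDemo and the sample strings are illustrative only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormEncodeRulesDemo {
    /** Form-encodes text as UTF-8, wrapping the checked exception. */
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("Az09.-*_"));    // Az09.-*_  (safe characters are unchanged)
        System.out.println(encode("Easter eggs")); // Easter+eggs  (space becomes +)
        System.out.println(encode("$%&/:=?"));     // %24%25%26%2F%3A%3D%3F

        // Encoding twice re-encodes the % and + produced by the first pass.
        String once = encode("50% off");
        String twice = encode(once);
        System.out.println(once);                  // 50%25+off
        System.out.println(twice);                 // 50%2525%2Boff

        // Pathological case: every character is unsafe, so the length triples.
        System.out.println(encode("$$$").length()); // 9
    }
}
```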
Learning More
- Sun’s Javadoc on the URLEncoder class
- Sun’s Javadoc on the URLDecoder class
- Sun’s Javadoc on the URL class
- Sun’s Javadoc on the URI class
- Sun’s Javadoc on URI.toString
- Sun’s Javadoc on URI.toASCIIString