Java String Encoding Internals (evanjones.ca)

[ 2010-May-31 16:15 ]

I have written about my experiments with Java string encoding performance before. However, someone asked me some questions about why my code can do better than the JDK. I did not have a good answer, so I dug into the JDK source code to find out. If you want to know what happens when you call String.getBytes("UTF-8"), read on. The short version for UTF-8 is:

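A temporary byte[] array is allocated with s.length() * 4 bytes. This is the JDK's worst-case estimate for the length of the UTF-8 output (although this estimate is larger than needed, since a single Java char never encodes to more than 3 bytes and a 4-byte sequence always consumes two chars; see the bug I filed on this issue).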
The string is encoded using a CharsetEncoder. The JDK does this by accessing the raw char[] array in the String object.
The UTF-8 bytes are copied from the temporary byte[] array into the final byte[] array with the exact right length.
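The three steps above can be approximated with public APIs. This is only a sketch, not the JDK's code: the class and method names (GetBytesSketch, encodeUtf8) are made up for illustration, and it wraps the String in a CharBuffer because code outside the JDK cannot reach the String's internal char[] array. The temporary buffer is sized from whatever maxBytesPerChar() reports on the running JDK.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class: imitates the allocate / encode / trim sequence described above.
final class GetBytesSketch {
    static byte[] encodeUtf8(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // Step 1: allocate a worst-case temporary array.
        int maxBytes = (int) (s.length() * (double) encoder.maxBytesPerChar());
        byte[] temp = new byte[maxBytes];

        // Step 2: encode into the temporary array with a CharsetEncoder.
        ByteBuffer out = ByteBuffer.wrap(temp);
        CharBuffer in = CharBuffer.wrap(s);
        try {
            CoderResult result = encoder.encode(in, out, true);
            if (!result.isUnderflow()) result.throwException();
            result = encoder.flush(out);
            if (!result.isUnderflow()) result.throwException();
        } catch (CharacterCodingException e) {
            // Should not happen: replacement is enabled and the buffer is worst-case sized.
            throw new IllegalStateException(e);
        }

        // Step 3: copy into a new array of exactly the right length.
        // The temporary array becomes garbage after this call.
        return Arrays.copyOf(temp, out.position());
    }
}
```

For ASCII input, most of the temporary array goes unused, which is the garbage discussed in the conclusion.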

Conclusion: This allocates s.length() * 4 bytes of garbage and performs an "extra" copy. This is what permits custom code to be slightly faster than the JDK: custom code produces less garbage, particularly for ASCII or mostly ASCII text. Significant wins are possible when the destination does not need to be a byte[] array with the exact length. For example, writing directly to the output buffer or a ByteBuffer with "unused" bytes at the end can be faster. See my StringEncoder class in the source code used for these benchmarks if you want to try to take advantage of this in your own code.
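As a rough illustration of the "write directly to a ByteBuffer" idea, here is a minimal sketch. It is not the StringEncoder class from the benchmarks; DirectEncodeSketch and encodeInto are hypothetical names. It reuses one CharsetEncoder and appends to a caller-supplied buffer, so there is no worst-case temporary array and no trimming copy. The instance is not thread-safe, because a CharsetEncoder holds state.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical class: encodes straight into a destination buffer owned by the caller.
final class DirectEncodeSketch {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);

    /**
     * Appends the UTF-8 encoding of s at out's current position.
     * Returns false, leaving out's position unchanged, if out is too small.
     */
    boolean encodeInto(String s, ByteBuffer out) {
        int start = out.position();
        encoder.reset();
        CoderResult result = encoder.encode(CharBuffer.wrap(s), out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            // Overflow: the destination ran out of space, so discard the partial output.
            out.position(start);
            return false;
        }
        return true;
    }
}
```

A caller that gets false back can flush or grow its buffer and retry, which only costs extra work in the uncommon case.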

The details, with links to the source code:

String.getBytes calls StringCoding.encode(charsetName, value, offset, count);
This method gets a cached StringCoding.StringEncoder object stored in a static ThreadLocal<SoftReference<StringEncoder>>. This is a good trick for thread-specific encoders, since it permits the JVM to garbage collect them if it is under memory pressure. This object wraps a Charset and a CharsetEncoder. It also sets an isTrusted boolean to true if the Charset is provided by the JDK (charset.getClass().getClassLoader0() == null).
StringCoding.encode then checks that the cached encoder's charset name matches the one passed in. If not, it creates a new StringEncoder after looking up the Charset by name.
Finally, it calls StringEncoder.encode(chars, offset, length);
StringEncoder.encode allocates a byte[] array of size length * encoder.maxBytesPerChar(). Note that for UTF-8, the JDK reports that maxBytesPerChar() == 4.
It checks whether the character set isTrusted: if it is not, it makes a defensive copy of the input chars. This is to prevent a user-supplied CharsetEncoder from being able to mutate the internals of a String. This does not happen for UTF-8.
If the CharsetEncoder is an instance of sun.nio.cs.ArrayEncoder, call .encode(char[], int offset, int length, byte[] byteArray). Note that the UTF-8 encoder does not implement the ArrayEncoder interface.
Set up the encoder: set REPLACE as default mode; call reset()
Wrap the input in a CharBuffer and the output in a ByteBuffer.
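Call .encode(charBuffer, byteBuffer, true) once, throwing an exception unless the result is underflow (underflow is the normal result, meaning all input was consumed; overflow would mean the output array was too small).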
If the output filled the array and the encoder isTrusted, return the array as is. Otherwise, call Arrays.copyOf to copy the bytes into a newly allocated array. This copy will happen every time for UTF-8, since the output will never fill the array.
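The per-thread encoder cache from the steps above can be sketched as follows. This is an illustration of the ThreadLocal<SoftReference<...>> pattern, not the JDK's StringCoding code: EncoderCacheSketch, Entry and encoderFor are hypothetical names, and the isTrusted check is omitted because getClassLoader0() is not accessible outside the JDK.

```java
import java.lang.ref.SoftReference;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical class: caches one encoder per thread behind a SoftReference.
final class EncoderCacheSketch {
    // Remembers the name the caller asked for, so a repeat request skips the Charset lookup.
    private static final class Entry {
        final String requestedName;
        final CharsetEncoder encoder;

        Entry(String requestedName, CharsetEncoder encoder) {
            this.requestedName = requestedName;
            this.encoder = encoder;
        }
    }

    // The SoftReference lets the garbage collector reclaim an idle encoder under memory pressure.
    private static final ThreadLocal<SoftReference<Entry>> CACHE =
            new ThreadLocal<SoftReference<Entry>>();

    static CharsetEncoder encoderFor(String charsetName) {
        SoftReference<Entry> ref = CACHE.get();
        Entry entry = (ref == null) ? null : ref.get();
        boolean matches = entry != null
                && (charsetName.equals(entry.requestedName)
                    || charsetName.equals(entry.encoder.charset().name()));
        if (!matches) {
            // Cache miss or different charset: look up by name and replace the cached entry.
            CharsetEncoder encoder = Charset.forName(charsetName).newEncoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            entry = new Entry(charsetName, encoder);
            CACHE.set(new SoftReference<Entry>(entry));
        }
        return entry.encoder;
    }
}
```

Because each thread holds only one entry, alternating between two charsets on the same thread creates a new encoder on every switch. Callers should call reset() on the returned encoder before each use.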