Java String Encoding Performance (evanjones.ca)

[ 2009-November-23 09:10 ]

When Java String objects are written to the external world, whether to a file or over the network, they must be converted to bytes in a specific encoding. Personally, I recommend UTF-8, but that is another issue. No matter which encoding is used, some API must be called to do the conversion. The simplest is String.getBytes(), which returns a new byte[] array containing the data in a given character set. Unfortunately, when buffering writes for high performance, this typically results in an extra copy and an array that promptly needs to be garbage collected, which is wasteful. CharsetEncoder provides a mechanism to encode into a ByteBuffer, which can be reused. This is more complicated, but should avoid the extra copies and memory allocations. Interestingly, it doesn't improve performance when used in a straightforward fashion.

The short version: if you are only encoding a few strings, use String.getBytes(). If you need to encode many strings into a separate buffer, copy each string into a char[] array and use that as the input to a CharsetEncoder. This article describes the details of these approaches and shows some performance numbers.

[Update: For more detail, read how the JDK encodes strings.]

[Update 2013-01-20: Nitsan Wakart re-ran my experiments and produced an improved implementation. He found that with JDK7, there is very little difference between String.getBytes() and my "faster" version. However, he created a version that uses the Unsafe class to get access to String internals, which is still faster.]

String.getBytes()

The standard approach is to call String.getBytes() to get a temporary byte[] array, then copy it into the destination buffer with System.arraycopy(). This works pretty well, particularly if there are only a few strings to encode. The JDK implementation uses some private APIs to make this fast. Unfortunately, the temporary byte[] array costs an extra memory allocation and an extra copy.
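As a minimal sketch of this approach (not the code from my benchmark), assuming UTF-8 output and a destination array that the caller has already sized: the class and method names here are made up for illustration, and StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the getBytes() + arraycopy() approach.
final class GetBytesEncoding {
    /**
     * Encodes s as UTF-8 into dest, starting at offset.
     * Returns the number of bytes written.
     */
    static int encode(String s, byte[] dest, int offset) {
        // Allocates a temporary array that becomes garbage right away.
        byte[] temp = s.getBytes(StandardCharsets.UTF_8);
        // The extra copy; throws IndexOutOfBoundsException if dest is too small.
        System.arraycopy(temp, 0, dest, offset, temp.length);
        return temp.length;
    }
}
```

The caller advances its own write position by the returned byte count before encoding the next string.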

CharsetEncoder

CharsetEncoder takes a CharBuffer and encodes it into a ByteBuffer. A CharBuffer backed by the string can be created by calling CharBuffer.wrap(), and a ByteBuffer backed by the destination byte[] array can be created with ByteBuffer.wrap(). This seems perfect: it encodes the String directly into an existing buffer. Unfortunately, it turns out to be quite slow. It seems that accessing the characters of the String via the CharBuffer, which in turn goes through the CharSequence interface, is slow. According to my benchmarks, this approach is always slower than using String.getBytes(), so you should never use it.
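For reference, the straightforward (slow) version looks roughly like the sketch below. It assumes UTF-8 and a reused encoder; the class name and the choice to report failures as unchecked exceptions are my own illustrative assumptions, not part of the benchmark code.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: encodes straight from a String-backed CharBuffer.
final class WrappedStringEncoding {
    // Reused across calls; a CharsetEncoder is not thread-safe.
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

    /** Encodes s into dest starting at offset; returns the bytes written. */
    int encode(String s, byte[] dest, int offset) {
        // Read-only view of the String; characters are read via CharSequence.
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer out = ByteBuffer.wrap(dest, offset, dest.length - offset);

        encoder.reset();
        CoderResult result = encoder.encode(in, out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (result.isOverflow()) {
            throw new BufferOverflowException(); // dest is too small
        }
        if (result.isError()) {
            // For example, an unpaired surrogate in the input.
            throw new IllegalArgumentException("cannot encode input: " + result);
        }
        return out.position() - offset;
    }
}
```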

CharsetEncoder With a char[] Buffer

To avoid the slow access to the individual characters of the String, we can copy them in bulk using String.getChars(), into a char[] array which is wrapped by a CharBuffer. Then we can use the CharsetEncoder to encode the characters. Amazingly, despite the copy from the String into the char[] array, this is faster than String.getBytes(), as long as the encoder and temporary arrays are reused. If the encoder is only used once, then the overhead of allocating the temporary objects outweighs the advantages.
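A sketch of the reuse variant, under the same assumptions as before (UTF-8, a caller-sized destination array). The class name, the initial 256-char scratch size, and the doubling growth policy are illustrative choices, not taken from the benchmark code.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: bulk-copies the String into a reused char[] first.
final class CharArrayEncoding {
    // All three are reused across calls; this object is not thread-safe.
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private char[] chars = new char[256];
    private CharBuffer charBuffer = CharBuffer.wrap(chars);

    /** Encodes s into dest starting at offset; returns the bytes written. */
    int encode(String s, byte[] dest, int offset) {
        int length = s.length();
        if (length > chars.length) {
            // Grow the scratch array only when a longer string shows up.
            chars = new char[Math.max(length, chars.length * 2)];
            charBuffer = CharBuffer.wrap(chars);
        }

        // One bulk copy instead of per-character CharSequence access.
        s.getChars(0, length, chars, 0);
        charBuffer.clear();
        charBuffer.limit(length);

        // This wrapper is still allocated per call; only the arrays are reused.
        ByteBuffer out = ByteBuffer.wrap(dest, offset, dest.length - offset);

        encoder.reset();
        CoderResult result = encoder.encode(charBuffer, out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (result.isOverflow()) {
            throw new BufferOverflowException(); // dest is too small
        }
        if (result.isError()) {
            throw new IllegalArgumentException("cannot encode input: " + result);
        }
        return out.position() - offset;
    }
}
```

The benefit only appears when one instance encodes many strings. Creating a new instance per string corresponds to the "once" case, where allocating the temporary objects dominates.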

Performance Results

All tests were performed on a 2.53 GHz Intel Xeon E5540 (Core i7/Nehalem architecture) Linux system, with Sun's Java 1.6.0_17. I also tried a JDK7 beta (1.7.0-ea-b76) and the results were basically the same. The test encoded 1399 short strings in a variety of languages (40,300 bytes of UTF-8 in total), taken from the Gnome translations collected in the pango-profile benchmark. The results below are the average of 10 runs, discarding the first run so that JIT compilation warm-up is not counted.

In the test description column, "once" means the encoder was used once and then discarded, while "reuse" means one encoder was used for the entire test. "array" means each output was written to a separate, new byte[] array, while "buffer" means the output was written to a single large byte[] array that was reused.

Test description	Time (milliseconds)
bytebuffer once array	176 ms
bytebuffer once buffer	174 ms
bytebuffer reuse array	155 ms
bytebuffer reuse buffer	142 ms
string once array	129 ms
string once buffer	143 ms
string reuse array	126 ms
string reuse buffer	146 ms
chars once array	417 ms
chars once buffer	435 ms
chars reuse array	89.8 ms
chars reuse buffer	85.9 ms

Code

The code I used to test this is part of: javanetperf.tar.bz2