Posted to dev@parquet.apache.org by "Piyush Narang (JIRA)" <ji...@apache.org> on 2016/06/24 22:08:16 UTC
[jira] [Commented] (PARQUET-642) Improve performance of ByteBuffer based read / write paths
[ https://issues.apache.org/jira/browse/PARQUET-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348769#comment-15348769 ]
Piyush Narang commented on PARQUET-642:
---------------------------------------
Basic microbenchmark of just String encoding / decoding:
{code}
./parquet-benchmarks/run_main.sh -wi 5 -i 10 -f 5 -bm thrpt
Benchmark                                                      Mode   Samples        Score        Error  Units
o.a.p.b.StringEncodingBenchmarks.readBytesAsString_charset     thrpt       50  4745459.365 ± 154135.792  ops/s
o.a.p.b.StringEncodingBenchmarks.readBytesAsString_newString   thrpt       50  7447380.382 ±  97667.509  ops/s
o.a.p.b.StringEncodingBenchmarks.readStrAsByteBuffer_charset   thrpt       50  1699074.087 ±  34233.840  ops/s
o.a.p.b.StringEncodingBenchmarks.readStrAsByteBuffer_getBytes  thrpt       50  5411374.351 ± 346074.359  ops/s
{code}
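To put the scores above in relative terms, here is a small sketch that works out the ratios. The class and method names are made up for illustration and are not part of parquet-benchmarks; the inputs are the scores from the run above.

```java
import java.util.Locale;

// Hypothetical helper (not part of parquet-benchmarks): relative throughput
// of two benchmark scores taken from the run above.
final class ThroughputRatios {

    // Ratio of the faster score to the slower one, both in ops/s.
    static double speedup(double fasterOpsPerSec, double slowerOpsPerSec) {
        return fasterOpsPerSec / slowerOpsPerSec;
    }

    public static void main(String[] args) {
        // Decoding: new String(...) vs Charset.decode(...)
        System.out.printf(Locale.ROOT, "decode: %.2fx%n",
            speedup(7447380.382, 4745459.365));
        // Encoding: String.getBytes(...) vs CharsetEncoder.encode(...)
        System.out.printf(Locale.ROOT, "encode: %.2fx%n",
            speedup(5411374.351, 1699074.087));
    }
}
```

On these numbers the String-based decode is roughly 1.6x the Charset-based one, and the String-based encode is roughly 3.2x. The error bars on the getBytes run are fairly wide, so the encode ratio should be read as approximate.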
{code}
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

import com.google.common.base.Throwables;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class StringEncodingBenchmarks {
  private static final String testString = "helloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworld";
  private static final String testString2 = "helloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworldhelloworld";
  private static final ByteBuffer testBytes = getTestBytes();
  private static final ByteBuffer testBytes2 = getTestBytes();

  // CharsetEncoder is not thread-safe, so keep one instance per thread.
  private static final ThreadLocal<CharsetEncoder> ENCODER =
      new ThreadLocal<CharsetEncoder>() {
        @Override
        protected CharsetEncoder initialValue() {
          return StandardCharsets.UTF_8.newEncoder();
        }
      };

  private static ByteBuffer getTestBytes() {
    try {
      return ByteBuffer.wrap(testString.getBytes("UTF-8"));
    } catch (Exception e) {
      throw Throwables.propagate(e);
    }
  }

  // Encode via the nio CharsetEncoder.
  @Benchmark
  public ByteBuffer readStrAsByteBuffer_charset(Blackhole blackhole) throws Exception {
    return ENCODER.get().encode(CharBuffer.wrap(testString));
  }

  // Encode via String.getBytes and wrap the resulting array.
  @Benchmark
  public ByteBuffer readStrAsByteBuffer_getBytes(Blackhole blackhole) throws Exception {
    return ByteBuffer.wrap(testString2.getBytes("UTF-8"));
  }

  // Decode via the nio Charset; the position is reset because decode() consumes the buffer.
  @Benchmark
  public String readBytesAsString_charset(Blackhole blackhole) throws Exception {
    testBytes.position(0);
    return StandardCharsets.UTF_8.decode(testBytes).toString();
  }

  // Decode via the String constructor, reading the backing array directly.
  @Benchmark
  public String readBytesAsString_newString(Blackhole blackhole) throws Exception {
    testBytes2.position(0);
    return new String(testBytes2.array(), testBytes2.arrayOffset(), testBytes2.remaining(), "UTF-8");
  }
}
{code}
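A quick sanity check that the two pairs of code paths being compared are interchangeable on well-formed input may also be useful. The sketch below is standalone and not part of the benchmark; the class and method names are invented for illustration, and it assumes the `Charset` overloads of `getBytes` / `new String` rather than the `"UTF-8"` name overloads used above. It also shows one case where the encoders differ: a fresh `CharsetEncoder` reports malformed input (such as a lone surrogate) as a `CharacterCodingException`, whereas `String.getBytes(Charset)` substitutes the replacement byte instead.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical standalone check, not part of parquet-mr or parquet-benchmarks.
final class Utf8PathCheck {

    // nio path: a fresh encoder reports malformed input by throwing.
    static byte[] encodeWithEncoder(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // String path: malformed input is replaced rather than reported.
    static byte[] encodeWithGetBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // nio path for decoding.
    static String decodeWithCharset(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // String path for decoding.
    static String decodeWithNewString(byte[] bytes) {
        return new String(bytes, 0, bytes.length, StandardCharsets.UTF_8);
    }

    // True if the nio encoder refuses the input.
    static boolean encoderRejects(String s) {
        try {
            encodeWithEncoder(s);
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        String sample = "h\u00e9llo w\u00f6rld";
        byte[] viaEncoder = encodeWithEncoder(sample);
        byte[] viaGetBytes = encodeWithGetBytes(sample);
        System.out.println("same bytes: " + Arrays.equals(viaEncoder, viaGetBytes));
        System.out.println("same string: "
            + decodeWithCharset(viaGetBytes).equals(decodeWithNewString(viaGetBytes)));
        System.out.println("encoder rejects lone surrogate: " + encoderRejects("\uD800"));
    }
}
```

Whether the difference on malformed input matters depends on whether any callers rely on the exception; that is not something the benchmark above measures.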
> Improve performance of ByteBuffer based read / write paths
> ----------------------------------------------------------
>
> Key: PARQUET-642
> URL: https://issues.apache.org/jira/browse/PARQUET-642
> Project: Parquet
> Issue Type: Bug
> Reporter: Piyush Narang
>
> While trying out the newest Parquet version, we noticed that the changes to start using ByteBuffers: https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8 and https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8 (mostly avro but a couple of ByteBuffer changes) caused our jobs to slow down a bit.
> Read overhead: 4-6% (in MB_Millis)
> Write overhead: 6-10% (in MB_Millis)
> This seems to be due to the encoding / decoding of Strings in the Binary class (https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java): toStringUsingUTF8() for reads and encodeUTF8() for writes.
> In those methods we're using the nio Charsets for encode / decode:
> {code}
> private static ByteBuffer encodeUTF8(CharSequence value) {
>   try {
>     return ENCODER.get().encode(CharBuffer.wrap(value));
>   } catch (CharacterCodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> }
> ...
> @Override
> public String toStringUsingUTF8() {
>   int limit = value.limit();
>   value.limit(offset+length);
>   int position = value.position();
>   value.position(offset);
>   // no corresponding interface to read a subset of a buffer, would have to slice it
>   // which creates another ByteBuffer object or do what is done here to adjust the
>   // limit/offset and set them back after
>   String ret = UTF8.decode(value).toString();
>   value.limit(limit);
>   value.position(position);
>   return ret;
> }
> {code}
> We tried out some micro / macro benchmarks, and switching those methods to use the String class for the encoding / decoding appears to improve performance:
> {code}
> @Override
> public String toStringUsingUTF8() {
>   String ret;
>   if (value.hasArray()) {
>     // heap buffer: decode straight from the backing array
>     try {
>       ret = new String(value.array(), value.arrayOffset() + offset, length, "UTF-8");
>     } catch (UnsupportedEncodingException e) {
>       throw new ParquetDecodingException("UTF-8 not supported", e);
>     }
>   } else {
>     int limit = value.limit();
>     value.limit(offset+length);
>     int position = value.position();
>     value.position(offset);
>     // no corresponding interface to read a subset of a buffer, would have to slice it
>     // which creates another ByteBuffer object or do what is done here to adjust the
>     // limit/offset and set them back after
>     ret = UTF8.decode(value).toString();
>     value.limit(limit);
>     value.position(position);
>   }
>   return ret;
> }
> ...
> private static ByteBuffer encodeUTF8(String value) {
>   try {
>     return ByteBuffer.wrap(value.getBytes("UTF-8"));
>   } catch (UnsupportedEncodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> {code}