Posted to user@avro.apache.org by Nishanth S <ni...@gmail.com> on 2018/01/26 17:04:54 UTC
GenericData.deepCopy() HotSpot
Hello Everyone,

We have a process that reads data from a local file share, serializes it, and
writes it to HDFS in Avro format. I am wondering whether I am building the
Avro objects correctly. For every record read from the binary file, we create
an equivalent Avro object in the following form:
Parent p = new Parent();
LOGHDR hdr = LOGHDR.newBuilder().build();
MSGHDR msg = MSGHDR.newBuilder().build();
p.setHdr(hdr);
p.setMsg(msg);
// ... remaining setters ...
datumFileWriter.write(p);
The Avro schema has around 1800 fields, including 26 nested types. In load
testing I found that serializing the same object to disk repeatedly is about
6x faster than constructing a new object each time. When a new Avro object is
constructed for every record using RecordBuilder.build(), much of the time is
spent in GenericData.deepCopy(). Has anyone run into a similar problem? We are
using Avro 1.8.2.
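
As a rough illustration, a minimal harness along these lines (a sketch only,
assuming the generated Parent class above and writing to a local file instead
of HDFS) shows where the cost comes from:

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.specific.SpecificDatumWriter;

public class BuildVsReuse {
  public static void main(String[] args) throws IOException {
    try (DataFileWriter<Parent> writer =
        new DataFileWriter<>(new SpecificDatumWriter<>(Parent.class))
            .create(Parent.getClassSchema(), new File("/tmp/build.avro"))) {
      long start = System.nanoTime();
      for (int i = 0; i < 100000; i++) {
        // build() deep-copies the default value of every unset field in
        // the ~1800-field schema; this is the GenericData.deepCopy() cost.
        Parent p = Parent.newBuilder().build();
        writer.append(p);
      }
      // Hoisting the build() above the loop gives the "same object" case.
      System.out.println("build per record: "
          + (System.nanoTime() - start) / 1000000 + " ms");
    }
  }
}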
Thanks,
Nishanth
Re: GenericData.deepCopy() HotSpot
Posted by Nishanth S <ni...@gmail.com>.
Thanks for taking a look, Doug. That was a cooked-up schema I used for
testing; it only has 4 fields. I ran a simple test writing 1M records, with
close to 1 GB written to disk, and TPS was consistent at 44K. In that case I
did not see much difference between reusing the object and calling build() on
every iteration. However, with the actual schema, which has close to 2K
fields, I can achieve only 5K TPS without reuse and 9K TPS with reuse (again
writing 1 GB to disk). I added a bytes field to both the test schema and the
actual schema to increase the data volume for the test; all other fields are
left at their default values. Is there any other way to improve performance?
We do not use Avro's sorting capabilities, so I also tried setting
order=ignore on a major chunk of the fields, but that had no impact.
Appreciate you taking a look.
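
For reference, a minimal sketch (with a hypothetical two-field schema) of how
order=ignore is declared. The order attribute only affects how records are
compared during sorting and schema resolution, not how they are serialized,
so it would not be expected to change write throughput:

import org.apache.avro.Schema;

public class OrderIgnoreExample {
  public static void main(String[] args) {
    // Hypothetical schema: "order": "ignore" on the payload field.
    String schemaJson = "{\"type\":\"record\",\"name\":\"Example\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"payload\",\"type\":\"bytes\",\"order\":\"ignore\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);
    System.out.println(schema.getField("payload").order());  // prints IGNORE
  }
}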
Thanks,
Nishanth
On Mon, Feb 5, 2018 at 9:34 AM, Doug Cutting <cu...@gmail.com> wrote:
> Your code builds a new builder and instance each time through the loop:
>
> for (int i=0;i<1000000;i++) {
> user = User.newBuilder().build();
> ...
>
> How does it perform if you move that second line outside the loop?
>
> Thanks,
>
> Doug
Re: GenericData.deepCopy() HotSpot
Posted by Doug Cutting <cu...@gmail.com>.
Your code builds a new builder and instance each time through the loop:

for (int i = 0; i < 1000000; i++) {
  user = User.newBuilder().build();
  ...

How does it perform if you move that second line outside the loop?
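
A minimal sketch of that change, assuming the same generated User class as in
the posted benchmark. DataFileWriter.append() serializes the record before
returning, so mutating and reusing a single instance across iterations is
safe:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Random;
import org.apache.avro.file.DataFileWriter;

public class ReuseLoop {
  // One builder call, one instance: the default-value deep copy in
  // build() now happens exactly once instead of once per record.
  static void loadReusingInstance(DataFileWriter<User> writer, int numRecords)
      throws IOException {
    User user = User.newBuilder().build();  // hoisted out of the loop
    Random random = new Random();
    for (int i = 0; i < numRecords; i++) {
      user.setFirstName("testName" + random.nextLong());
      user.setFavoriteNumber(random.nextInt());
      user.setFavoriteColor("blue" + random.nextFloat());
      user.setData(ByteBuffer.wrap(new byte[15000]));
      writer.append(user);  // serialized on append, so reuse is safe
    }
  }
}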
Thanks,
Doug
Re: GenericData.deepCopy() HotSpot
Posted by Nishanth S <ni...@gmail.com>.
Thanks, Doug. Here is a comparison.

Load Avro record size: roughly 15 KB.

I used the same payload with a schema that has around 2K fields and with
another schema that has 5 fields, reusing the Avro object in both cases
(built once with a builder). Each test wrote 1M records, the same amount of
data (1 GB), to a local drive, and was run a few times single-threaded.
Average TPS with the smaller schema is 40K, whereas with the bigger schema it
drops to 10K even though both write the same amount of data. Since I am only
creating the Avro object once in both cases, it looks like there is overhead
in the DataFileWriter too for bigger schemas.

import com.google.common.base.Stopwatch;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadGenerator {

  public static void main(String[] args) {
    try {
      new LoadGenerator().load();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  public void load() throws IOException {
    Configuration conf = new Configuration();
    FileSystem hdfsFileSystem = FileSystem.get(conf);
    DatumWriter<User> datumWriter = new SpecificDatumWriter<>(User.class);
    DataFileWriter<User> dataFileWriter = new DataFileWriter<>(datumWriter);
    dataFileWriter.setCodec(CodecFactory.snappyCodec());
    Path path = new Path("/projects/tmp/load.avro");
    OutputStream outStream = hdfsFileSystem.create(path, true);
    dataFileWriter.create(User.getClassSchema(), outStream);
    dataFileWriter.setFlushOnEveryBlock(false);

    int numRecords = 1000000;
    Random random = new Random();
    // Guava Stopwatch (pre-15 constructor API, as in the original post).
    Stopwatch stopwatch = new Stopwatch().start();
    // A new instance is built on every iteration; this per-record build()
    // is the pattern under discussion.
    for (int i = 0; i < numRecords; i++) {
      User user = User.newBuilder().build();
      user.setFirstName("testName" + random.nextLong());
      user.setFavoriteNumber(random.nextInt());
      user.setFavoriteColor("blue" + random.nextFloat());
      user.setData(ByteBuffer.wrap(new byte[15000]));
      dataFileWriter.append(user);
    }
    dataFileWriter.close();
    stopwatch.stop();
    long elapsedTime = stopwatch.elapsedTime(TimeUnit.SECONDS);
    System.out.println("Time elapsed for load() is " + elapsedTime + " s");
  }
}
Re: GenericData.deepCopy() HotSpot
Posted by Doug Cutting <cu...@gmail.com>.
Builders have some inherent overheads. Things could be optimized to better
minimize this, but it will likely always be faster to reuse a single
instance when writing.

The deepCopy calls are probably copying the default values of each field
you're not setting. If you're only setting a few fields then you might use a
builder to create a single instance so its defaults are set, then reuse that
instance as you write, setting only those few fields you need to differ
from the default. (This only works if you're setting the same fields every
time. Otherwise you'd need to restore the default value.)

An optimization for Avro here might be to inline default values for
immutable types when generating the build() method.
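
A sketch of that reuse-and-restore pattern, assuming a hypothetical optional
favoriteColor field on the generated User class from the benchmark:

import java.io.IOException;
import org.apache.avro.file.DataFileWriter;

class ReuseWithRestore {
  // Built once so every default is populated by the builder.
  private final User user = User.newBuilder().build();
  // Remembered so it can be restored for records that omit the field.
  private final CharSequence colorDefault = user.getFavoriteColor();

  void write(DataFileWriter<User> writer, String name, String colorOrNull)
      throws IOException {
    user.setFirstName(name);
    // A previous record may have overwritten the field, so restore the
    // default whenever this record carries no value of its own.
    user.setFavoriteColor(colorOrNull != null ? colorOrNull : colorDefault);
    writer.append(user);
  }
}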
Doug