Posted to user@avro.apache.org by Nishanth S <ni...@gmail.com> on 2018/01/26 17:04:54 UTC

GenericData.deepCopy() HotSpot

Hello Everyone,

We have a process that reads data from a local file share, serializes the
records, and writes them to HDFS in Avro format. I am wondering whether I am
building the Avro objects correctly. For every record read from the binary
file we create an equivalent Avro object in the format below.

Parent p = new Parent();
LOGHDR hdr = LOGHDR.newBuilder().build();
MSGHDR msg = MSGHDR.newBuilder().build();
p.setHdr(hdr);
p.setMsg(msg);
// ... setters for the remaining fields ...
dataFileWriter.append(p);

This Avro schema has around 1,800 fields, including 26 nested types. I did
some load testing and found that serializing the same object to disk
repeatedly is about 6x faster than constructing a new object for every
record. When a new Avro object is constructed each time using
RecordBuilder.build(), much of the time is spent in GenericData.deepCopy().
Has anyone run into a similar problem? We are using Avro 1.8.2.
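
To make the comparison concrete, a rough sketch of the two variants (class
and writer names as in the snippet above; per-field setters elided):

// Variant A: a fresh build() per record -- every build() deep-copies the
// default value of each unset field, which is where GenericData.deepCopy()
// shows up in the profile.
while (reader.hasNext()) {                 // "reader" stands in for the binary-file reader
    Parent p = Parent.newBuilder().build();
    // ... set hdr, msg and the other fields from the input record ...
    dataFileWriter.append(p);
}

// Variant B: build once and append the same instance -- the roughly 6x
// faster case measured above.
Parent p = Parent.newBuilder().build();    // defaults deep-copied once
while (reader.hasNext()) {
    // ... overwrite only the fields that change per input record ...
    dataFileWriter.append(p);
}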

Thanks,
Nishanth

Re: GenericData.deepCopy() HotSpot

Posted by Nishanth S <ni...@gmail.com>.
Thanks for taking a look, Doug. That was a cooked-up schema I used for
testing; it only has 4 fields. I did a simple test writing 1M records, with
close to 1 GB of data written to disk, and TPS has been consistent at 44K.
In this case I did not see much of a difference between reusing the object
and calling build() for every iteration. However, with the actual schema,
which has close to 2K fields, I can achieve only 5K TPS with no reuse and 9K
TPS with reuse (data size written to disk is 1 GB). I just added a bytes
field to both my test schema and the actual schema to increase the data
volume for the test; the rest are all default values. Is there any other way
to improve performance? We do not use the Avro sorting capabilities, so I
also tried setting order=ignore for a major chunk of the fields, but that
did not have an impact. Appreciate you taking a look.
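
For reference, the sort-order setting mentioned above is the per-field
"order" attribute from the Avro spec (values "ascending", the default,
"descending", or "ignore"). A minimal illustrative schema fragment (field
names are made up, not from the real schema):

{
  "type": "record",
  "name": "Example",
  "fields": [
    {"name": "id", "type": "long", "order": "ignore"},
    {"name": "payload", "type": "bytes", "order": "ignore"}
  ]
}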

Thanks,
Nishanth


Re: GenericData.deepCopy() HotSpot

Posted by Doug Cutting <cu...@gmail.com>.
Your code builds a new builder and instance each time through the loop:

  for (int i=0;i<1000000;i++) {
  user = User.newBuilder().build();
  ...

How does it perform if you move that second line outside the loop?
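
For illustration, a rough sketch of that change (same setters as in the
posted load() method):

  user = User.newBuilder().build();   // hoisted: defaults are deep-copied once
  for (int i = 0; i < numRecords; i++) {
      user.setFirstName("testName" + new Random().nextLong());
      // ... remaining setters unchanged ...
      dataFileWriter.append(user);
  }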

Thanks,

Doug



Re: GenericData.deepCopy() HotSpot

Posted by Nishanth S <ni...@gmail.com>.
Thanks, Doug. Here is a comparison.

Load Avro record size: roughly 15 KB.

I have used the same payload with a schema that has around 2K fields and
also with another schema that has 5 fields. I reused the Avro object in both
cases, using a builder once. The test was run for 1M records, writing the
same amount of data (1 GB) to a local drive, and I ran it a few times,
single threaded. Average TPS with the smaller schema is 40K, whereas with
the bigger schema it drops to 10K, even though both write the same amount of
data. Since I am only creating the Avro object once in both cases, it looks
like there is overhead in the DataFileWriter too for bigger schemas.



import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadGenerator {

    DataFileWriter<User> dataFileWriter;
    DatumWriter<User> datumWriter;
    FileSystem hdfsFileSystem;
    Configuration conf;
    Path path;
    OutputStream outStream;
    User user;
    com.google.common.base.Stopwatch stopwatch =
            new com.google.common.base.Stopwatch().start();

    public static void main(String[] args) {
        try {
            new LoadGenerator().load();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void load() throws IOException {
        conf = new Configuration();
        hdfsFileSystem = FileSystem.get(conf);
        datumWriter = new SpecificDatumWriter<User>(User.class);
        dataFileWriter = new DataFileWriter<User>(datumWriter);
        dataFileWriter.setCodec(CodecFactory.snappyCodec());
        path = new Path("/projects/tmp/load.avro");
        outStream = hdfsFileSystem.create(path, true);
        dataFileWriter.create(User.getClassSchema(), outStream);
        dataFileWriter.setFlushOnEveryBlock(false);

        // Create and load User records.
        int numRecords = 1000000;
        for (int i = 0; i < numRecords; i++) {
            user = User.newBuilder().build();   // fresh build() each iteration
            user.setFirstName("testName" + new Random().nextLong());
            user.setFavoriteNumber(Integer.valueOf(new Random().nextInt()));
            user.setFavoriteColor("blue" + new Random().nextFloat());
            user.setData(ByteBuffer.wrap(new byte[15000]));   // ~15 KB payload
            dataFileWriter.append(user);
        }
        dataFileWriter.close();

        stopwatch.stop();
        long elapsedTime = stopwatch.elapsedTime(TimeUnit.SECONDS);
        System.out.println("Time elapsed for load() is " + elapsedTime);
    }
}


Re: GenericData.deepCopy() HotSpot

Posted by Doug Cutting <cu...@gmail.com>.
Builders have some inherent overheads.  Things could be optimized to better
minimize this, but it will likely always be faster to reuse a single
instance when writing.

The deepCopy calls are probably copying the default values of each field you're
not setting.  If you're only setting a few fields then you might use a builder
to create a single instance so its defaults are set, then reuse that
instance as you write, setting only those few fields you need to differ
from the default.  (This only works if you're setting the same fields every
time.  Otherwise you'd need to restore the default value.)
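
Applied to the schema from the original post, that pattern would look roughly
like this (the setters and the "reader" are hypothetical, since the real
field names aren't shown in this thread):

  Parent p = Parent.newBuilder().build();   // one build(): defaults deep-copied once
  while (reader.hasNext()) {                // "reader" stands in for the binary-file reader
      // Overwrite only the fields that vary per input record; everything else
      // keeps the defaults put in place by the single build() above.
      p.setHdr(readHdr());                  // hypothetical helpers, for illustration only
      p.setMsg(readMsg());
      dataFileWriter.append(p);
  }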

An optimization for Avro here might be to inline default values for
immutable types when generating the build() method.

Doug
