You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Mohammad Islam <mi...@yahoo.com> on 2013/08/07 04:13:07 UTC

Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/
-----------------------------------------------------------

(Updated Aug. 7, 2013, 2:13 a.m.)


Review request for hive, Ashutosh Chauhan and Jakob Homan.


Changes
-------

Add logic to avoid excessive logging for each record.


Bugs: HIVE-4732
    https://issues.apache.org/jira/browse/HIVE-4732


Repository: hive-git


Description
-------

>From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible.

At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time.
 
   


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java ed2a9af 
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java e994411 
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 3828940 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 

Diff: https://reviews.apache.org/r/12480/diff/


Testing
-------


Thanks,

Mohammad Islam


Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

Posted by Jakob Homan <jg...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review26066
-----------------------------------------------------------

Ship it!


Ship It!

- Jakob Homan


On Aug. 30, 2013, 11:49 a.m., Mohammad Islam wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/12480/
> -----------------------------------------------------------
> 
> (Updated Aug. 30, 2013, 11:49 a.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Jakob Homan.
> 
> 
> Bugs: HIVE-4732
>     https://issues.apache.org/jira/browse/HIVE-4732
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible.
> 
> At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time.
>  
>    
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java ed2a9af 
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java e994411 
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 3828940 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 
> 
> Diff: https://reviews.apache.org/r/12480/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Mohammad Islam
> 
>


Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

Posted by Mohammad Islam <mi...@yahoo.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/
-----------------------------------------------------------

(Updated Aug. 30, 2013, 6:49 p.m.)


Review request for hive, Ashutosh Chauhan and Jakob Homan.


Changes
-------

Updated with Jakob's comments


Bugs: HIVE-4732
    https://issues.apache.org/jira/browse/HIVE-4732


Repository: hive-git


Description
-------

>From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible.

At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time.
 
   


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java ed2a9af 
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java e994411 
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 3828940 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b 
  serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 

Diff: https://reviews.apache.org/r/12480/diff/


Testing
-------


Thanks,

Mohammad Islam


Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

Posted by Mohammad Islam <mi...@yahoo.com>.

> On Aug. 26, 2013, 5:35 a.m., Jakob Homan wrote:
> > serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java, line 529
> > <https://reviews.apache.org/r/12480/diff/3/?file=338097#file338097line529>
> >
> >     Weird spacing... 2x below as well.

Done


> On Aug. 26, 2013, 5:35 a.m., Jakob Homan wrote:
> > serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java, line 49
> > <https://reviews.apache.org/r/12480/diff/3/?file=338099#file338099line49>
> >
> >     And this would indicate a bug.

Done.


> On Aug. 26, 2013, 5:35 a.m., Jakob Homan wrote:
> > serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java, line 38
> > <https://reviews.apache.org/r/12480/diff/3/?file=338099#file338099line38>
> >
> >     These should never be null, not even in testing.  It's better to change the tests to correctly populate the data structure.

AvroGenericRecordWriteable needs the record reader ID. Since we are instantiating one here. We will need to set it w/o any checking.
Remove unnecessary null checkings.


- Mohammad


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review25537
-----------------------------------------------------------


On Aug. 30, 2013, 6:49 p.m., Mohammad Islam wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/12480/
> -----------------------------------------------------------
> 
> (Updated Aug. 30, 2013, 6:49 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Jakob Homan.
> 
> 
> Bugs: HIVE-4732
>     https://issues.apache.org/jira/browse/HIVE-4732
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible.
> 
> At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time.
>  
>    
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java ed2a9af 
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java e994411 
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 3828940 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 
> 
> Diff: https://reviews.apache.org/r/12480/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Mohammad Islam
> 
>


Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

Posted by Jakob Homan <jg...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review25537
-----------------------------------------------------------


One issue in the testing and a few formatting issues.  Otherwise looks good.


serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java
<https://reviews.apache.org/r/12480/#comment49986>

    Weird spacing... 2x below as well.



serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java
<https://reviews.apache.org/r/12480/#comment49984>

    These should never be null, not even in testing.  It's better to change the tests to correctly populate the data structure.



serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java
<https://reviews.apache.org/r/12480/#comment49985>

    And this would indicate a bug.


- Jakob Homan


On Aug. 6, 2013, 7:13 p.m., Mohammad Islam wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/12480/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2013, 7:13 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Jakob Homan.
> 
> 
> Bugs: HIVE-4732
>     https://issues.apache.org/jira/browse/HIVE-4732
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> From our performance analysis, we found AvroSerde's schema.equals() call consumed a substantial amount ( nearly 40%) of time. This patch intends to minimize the number schema.equals() calls by pushing the check as late/fewer as possible.
> 
> At first, we added a unique id for each record reader which is then included in every AvroGenericRecordWritable. Then, we introduce two new data structures (one hashset and one hashmap) to store intermediate data to avoid duplicates checkings. Hashset contains all the record readers' IDs that don't need any re-encoding. On the other hand, HashMap contains the already used re-encoders. It works as cache and allows re-encoders reuse. With this change, our test shows nearly 40% reduction in Avro record reading time.
>  
>    
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java ed2a9af 
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java e994411 
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 3828940 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 
> 
> Diff: https://reviews.apache.org/r/12480/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Mohammad Islam
> 
>