Posted to dev@hive.apache.org by Sergio Pena <se...@cloudera.com> on 2017/04/21 16:35:06 UTC

Re: Review Request 57353: Intern Properties objects referenced from PartitionDesc to reduce memory pressure.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57353/#review172668
-----------------------------------------------------------



Misha, where is CopyOnFirstWriteProperties used? The patch looks pretty good, but I don't see where CopyOnFirstWriteProperties is instantiated.


common/src/java/org/apache/hadoop/hive/common/CopyOnFirstWriteProperties.java
Lines 314 (patched)
<https://reviews.apache.org/r/57353/#comment245727>

    If Interners.newWeakInterner() returns a thread-safe interner, why do we have to lock the INTERNER only when updating it?


- Sergio Pena


On March 7, 2017, 1:22 a.m., Misha Dmitriev wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/57353/
> -----------------------------------------------------------
> 
> (Updated March 7, 2017, 1:22 a.m.)
> 
> 
> Review request for hive, Chaozhong Yang, Alan Gates, Rui Li, Prasanth_J, Sergio Pena, Sahil Takiar, Vihang Karajgaonkar, and Xuefu Zhang.
> 
> 
> Bugs: HIVE-16079
>     https://issues.apache.org/jira/browse/HIVE-16079
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> When multiple concurrent Hive queries run, a separate copy of
> org.apache.hadoop.hive.ql.metadata.Partition and
> ql.plan.PartitionDesc is created for each table partition
> in each query instance. So in my benchmark, explained in
> HIVE-16079, with 2000 partitions and 50 concurrent queries running
> over them, we end up, in the worst case, with 2000*50=100,000 instances
> of Partition and PartitionDesc in memory. These objects themselves
> collectively take just ~2% of memory. However, the other data structures
> that each of them references take a lot more. In particular, Properties
> objects take more than 20% of memory. When we have 50 concurrent
> read-only queries, there are 50 identical copies of Properties for
> each partition. That's a huge waste of memory.
> 
> This change introduces a new class that extends Properties, called
> CopyOnFirstWriteProperties. It uses a unique interned copy of
> Properties whenever possible. However, when one of the methods that
> modify properties is called, a private copy is created. When this class
> is used, memory consumption by Properties falls from 20% to 5-6%.
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/common/CopyOnFirstWriteProperties.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/SerializationUtilities.java 247d5890ea8131404b9543d22876ca4c052578e0 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/PartitionDesc.java d05c1c68fdb7296c0346d73967071da1ebe7bb72 
> 
> 
> Diff: https://reviews.apache.org/r/57353/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Misha Dmitriev
> 
>
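
The copy-on-first-write idea in the description above can be sketched roughly as follows. This is a simplified, hypothetical illustration and not the class from the patch: the name CopyOnFirstWriteSketch, the POOL map and the sharesStorageWith() helper are assumptions made for this example, only two methods are overridden, and the pool keeps strong references instead of using an interner.

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch, not the class from the patch. Only getProperty and
// setProperty are overridden; a complete version would have to cover every
// read and write method inherited from Properties/Hashtable.
class CopyOnFirstWriteSketch extends Properties {
    // Canonical read-only copies keyed by content. This pool holds strong
    // references for simplicity, so it never releases entries.
    private static final ConcurrentMap<Properties, Properties> POOL =
        new ConcurrentHashMap<>();

    private Properties delegate;   // shared canonical copy until the first write
    private boolean copied;        // true once this instance owns a private copy

    CopyOnFirstWriteSketch(Properties source) {
        // Snapshot the source so later changes to it cannot alter the pool.
        Properties snapshot = new Properties();
        snapshot.putAll(source);
        Properties existing = POOL.putIfAbsent(snapshot, snapshot);
        delegate = (existing != null) ? existing : snapshot;
    }

    @Override
    public synchronized String getProperty(String key) {
        return delegate.getProperty(key);
    }

    @Override
    public synchronized Object setProperty(String key, String value) {
        if (!copied) {
            // First write: stop sharing and switch to a private copy.
            Properties own = new Properties();
            own.putAll(delegate);
            delegate = own;
            copied = true;
        }
        return delegate.setProperty(key, value);
    }

    // True while both instances still read from the same canonical object.
    synchronized boolean sharesStorageWith(CopyOnFirstWriteSketch other) {
        return delegate == other.delegate;
    }
}
```

Under these assumptions, two instances built from equal Properties share one canonical object until one of them calls setProperty(), at which point only the writer pays for a private copy.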


Re: Review Request 57353: Intern Properties objects referenced from PartitionDesc to reduce memory pressure.

Posted by Misha Dmitriev <mi...@cloudera.com>.

> On April 21, 2017, 4:35 p.m., Sergio Pena wrote:
> > Misha, where is CopyOnFirstWriteProperties used? The patch looks pretty good, but I don't see where CopyOnFirstWriteProperties is instantiated.

It's not instantiated directly. Rather, see the serialization/deserialization code in SerializationUtilities.java, where this class is indirectly instantiated. My understanding is that this is how Partitions and their child data structures are created, by transferring data from HMS.


> On April 21, 2017, 4:35 p.m., Sergio Pena wrote:
> > common/src/java/org/apache/hadoop/hive/common/CopyOnFirstWriteProperties.java
> > Lines 314 (patched)
> > <https://reviews.apache.org/r/57353/diff/1/?file=1656856#file1656856line314>
> >
> >     If Interners.newWeakInterner() returns a thread-safe interner, why do we have to lock the INTERNER only when updating it?

I don't think newWeakInterner() returns a thread-safe interner, at least from inspecting its code?


- Misha




Re: Review Request 57353: Intern Properties objects referenced from PartitionDesc to reduce memory pressure.

Posted by Misha Dmitriev <mi...@cloudera.com>.

> On April 21, 2017, 4:35 p.m., Sergio Pena wrote:
> > Misha, where is CopyOnFirstWriteProperties used? The patch looks pretty good, but I don't see where CopyOnFirstWriteProperties is instantiated.
> 
> Misha Dmitriev wrote:
>     It's not instantiated directly. Rather, see the serialization/deserialization code in SerializationUtilities.java, where this class is indirectly instantiated. My understanding is that this is how Partitions and their child data structures are created, by transferring data from HMS.
> 
> Sergio Pena wrote:
>     I still haven't found how this happens. Could you describe how you understand it works? Maybe I can follow your explanation better than the code.

Right, now I understand what you mean. I made a mistake when making some final edits of this code. A new CopyOnFirstWriteProperties instance should be created in the setProperties() method of PartitionDesc. I'll make a fix and post a new patch.


- Misha




Re: Review Request 57353: Intern Properties objects referenced from PartitionDesc to reduce memory pressure.

Posted by Sergio Pena <se...@cloudera.com>.

> On April 21, 2017, 4:35 p.m., Sergio Pena wrote:
> > Misha, where is CopyOnFirstWriteProperties used? The patch looks pretty good, but I don't see where CopyOnFirstWriteProperties is instantiated.
> 
> Misha Dmitriev wrote:
>     It's not instantiated directly. Rather, see the serialization/deserialization code in SerializationUtilities.java, where this class is indirectly instantiated. My understanding is that this is how Partitions and their child data structures are created, by transferring data from HMS.

I still haven't found how this happens. Could you describe how you understand it works? Maybe I can follow your explanation better than the code.


> On April 21, 2017, 4:35 p.m., Sergio Pena wrote:
> > common/src/java/org/apache/hadoop/hive/common/CopyOnFirstWriteProperties.java
> > Lines 314 (patched)
> > <https://reviews.apache.org/r/57353/diff/1/?file=1656856#file1656856line314>
> >
> >     If Interners.newWeakInterner() returns a thread-safe interner, why do we have to lock the INTERNER only when updating it?
> 
> Misha Dmitriev wrote:
>     I don't think newWeakInterner() returns a thread-safe interner, at least from inspecting its code?

I read the Javadoc, which states:

newWeakInterner()
Returns a new thread-safe interner which retains a weak reference to each instance it has interned, and so does not prevent these instances from being garbage-collected.


- Sergio
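
To illustrate the point under discussion, here is a minimal JDK-only sketch of an interner that is safe to call from many threads without an external lock. It is a hypothetical stand-in, not Guava's implementation: ConcurrentInternerSketch and ConcurrentInternerDemo are names invented for this example, and unlike the weak interner described in the quoted Javadoc, this version keeps strong references and never releases entries.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a thread-safe interner (strong references only).
final class ConcurrentInternerSketch<T> {
    private final ConcurrentMap<T, T> pool = new ConcurrentHashMap<>();

    // putIfAbsent is atomic, so callers do not need to hold a lock.
    T intern(T sample) {
        T existing = pool.putIfAbsent(sample, sample);
        return (existing != null) ? existing : sample;
    }
}

final class ConcurrentInternerDemo {
    // Interns `tasks` equal-but-distinct Properties objects on `threads`
    // threads and counts how many distinct canonical instances come back.
    static int distinctCanonicalInstances(int threads, int tasks) {
        ConcurrentInternerSketch<Properties> interner = new ConcurrentInternerSketch<>();
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Properties>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(executor.submit(() -> {
                    Properties p = new Properties();
                    p.setProperty("name", "part");
                    return interner.intern(p);   // no synchronized block here
                }));
            }
            // Identity-based set: counts object instances, not equal contents.
            Set<Properties> identities =
                Collections.newSetFromMap(new IdentityHashMap<Properties, Boolean>());
            for (Future<Properties> future : futures) {
                identities.add(future.get());
            }
            return identities.size();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```

If the interner in the patch is thread-safe in the same sense, every concurrent caller gets the same canonical instance for equal contents, and an extra lock around intern() would only add contention.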

