Posted to dev@hive.apache.org by Sergio Pena <se...@cloudera.com> on 2017/04/21 16:35:06 UTC
Re: Review Request 57353: Intern Properties objects referenced from
PartitionDesc to reduce memory pressure.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57353/#review172668
-----------------------------------------------------------
Misha, where is CopyOnFirstWriteProperties used? The patch looks pretty good, but I don't see where CopyOnFirstWriteProperties is instantiated.
common/src/java/org/apache/hadoop/hive/common/CopyOnFirstWriteProperties.java
Lines 314 (patched)
<https://reviews.apache.org/r/57353/#comment245727>
If Interners.newWeakInterner() returns a thread-safe interner, why do we have to lock the INTERNER only when updating it?
- Sergio Pena
On March 7, 2017, 1:22 a.m., Misha Dmitriev wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/57353/
> -----------------------------------------------------------
>
> (Updated March 7, 2017, 1:22 a.m.)
>
>
> Review request for hive, Chaozhong Yang, Alan Gates, Rui Li, Prasanth_J, Sergio Pena, Sahil Takiar, Vihang Karajgaonkar, and Xuefu Zhang.
>
>
> Bugs: HIVE-16079
> https://issues.apache.org/jira/browse/HIVE-16079
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> When multiple concurrent Hive queries run, a separate copy of
> org.apache.hadoop.hive.ql.metadata.Partition and
> ql.plan.PartitionDesc is created for each table partition
> in each query instance. So when, in my benchmark explained in
> HIVE-16079, we have 2000 partitions and 50 concurrent queries running
> over them, we end up, in the worst case, with 2000*50=100,000 instances
> of Partition and PartitionDesc in memory. These objects themselves
> collectively take just ~2% of memory. However, the other data structures
> that each of them references take a lot more. In particular, Properties
> objects take more than 20% of memory. When we have 50 concurrent
> read-only queries, there are 50 identical copies of Properties for
> each partition. That's a huge waste of memory.
>
> This change introduces a new class that extends Properties, called
> CopyOnFirstWriteProperties. It utilizes a unique interned copy of
> Properties whenever possible. However, when one of the methods that
> modify properties is called, a copy is created. When this class is
> used, memory consumption by Properties falls from 20% to 5-6%.
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/common/CopyOnFirstWriteProperties.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/exec/SerializationUtilities.java 247d5890ea8131404b9543d22876ca4c052578e0
> ql/src/java/org/apache/hadoop/hive/ql/plan/PartitionDesc.java d05c1c68fdb7296c0346d73967071da1ebe7bb72
>
>
> Diff: https://reviews.apache.org/r/57353/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Misha Dmitriev
>
>
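The copy-on-first-write idea in the description above can be sketched roughly as follows. This is only an illustration and not the code from the patch: the class name CopyOnFirstWriteSketch and its methods are hypothetical, and it wraps a Properties instance instead of extending Properties as the real class is described as doing. Reads go to a shared instance, and the first write switches to a private copy.

```java
import java.util.Properties;

/**
 * Hypothetical sketch, not the class from the patch: reads go to a shared
 * Properties instance, and the first write switches to a private copy.
 */
final class CopyOnFirstWriteSketch {
    private Properties delegate;  // shared with other holders until the first write
    private boolean ownsCopy;     // true once this object holds a private copy

    CopyOnFirstWriteSketch(Properties shared) {
        this.delegate = shared;
    }

    synchronized String getProperty(String key) {
        return delegate.getProperty(key);
    }

    synchronized void setProperty(String key, String value) {
        if (!ownsCopy) {
            // First write: copy the shared entries so other holders are unaffected.
            Properties copy = new Properties();
            copy.putAll(delegate);
            delegate = copy;
            ownsCopy = true;
        }
        delegate.setProperty(key, value);
    }

    /** True while this object still reads from the shared instance. */
    synchronized boolean isSharing() {
        return !ownsCopy;
    }
}

final class CopyOnFirstWriteDemo {
    public static void main(String[] args) {
        Properties shared = new Properties();
        shared.setProperty("serialization.format", "1");

        CopyOnFirstWriteSketch a = new CopyOnFirstWriteSketch(shared);
        CopyOnFirstWriteSketch b = new CopyOnFirstWriteSketch(shared);

        a.setProperty("serialization.format", "2"); // only a pays for a copy
        System.out.println(a.getProperty("serialization.format"));
        System.out.println(b.getProperty("serialization.format"));
    }
}
```

In a read-only workload, every holder keeps pointing at the single shared instance, which is where the memory saving would come from. The property key used in the demo is only an example value.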
Re: Review Request 57353: Intern Properties objects referenced from
PartitionDesc to reduce memory pressure.
Posted by Misha Dmitriev <mi...@cloudera.com>.
> On April 21, 2017, 4:35 p.m., Sergio Pena wrote:
> > Misha, where is CopyOnFirstWriteProperties used? The patch looks pretty good, but I don't see where CopyOnFirstWriteProperties is instantiated.
It's not instantiated directly. Rather, see the serialization/deserialization code in SerializationUtilities.java, where this class is indirectly instantiated. My understanding is that this is how Partitions and their child data structures are created, by transferring data from HMS.
> On April 21, 2017, 4:35 p.m., Sergio Pena wrote:
> > common/src/java/org/apache/hadoop/hive/common/CopyOnFirstWriteProperties.java
> > Lines 314 (patched)
> > <https://reviews.apache.org/r/57353/diff/1/?file=1656856#file1656856line314>
> >
> > If Interners.newWeakInterner() returns a thread-safe interner, why do we have to lock the INTERNER only when updating it?
I don't think newWeakInterner() returns a thread-safe interner, at least from inspecting its code?
- Misha
Re: Review Request 57353: Intern Properties objects referenced from
PartitionDesc to reduce memory pressure.
Posted by Misha Dmitriev <mi...@cloudera.com>.
> On April 21, 2017, 4:35 p.m., Sergio Pena wrote:
> > Misha, where is CopyOnFirstWriteProperties used? The patch looks pretty good, but I don't see where CopyOnFirstWriteProperties is instantiated.
>
> Misha Dmitriev wrote:
> It's not instantiated directly. Rather, see the serialization/deserialization code in SerializationUtilities.java, where this class is indirectly instantiated. My understanding is that this is how Partitions and their child data structures are created, by transferring data from HMS.
>
> Sergio Pena wrote:
> I still have not found how this happens. Could you describe how you understand this happens? Maybe I can follow you better than the code.
Right, now I understand what you mean. I made a mistake when making some final edits of this code. A new CopyOnFirstWriteProperties instance should be created in the setProperties() method of PartitionDesc. I'll make a fix and post a new patch.
- Misha
Re: Review Request 57353: Intern Properties objects referenced from
PartitionDesc to reduce memory pressure.
Posted by Sergio Pena <se...@cloudera.com>.
> On April 21, 2017, 4:35 p.m., Sergio Pena wrote:
> > Misha, where is CopyOnFirstWriteProperties used? The patch looks pretty good, but I don't see where CopyOnFirstWriteProperties is instantiated.
>
> Misha Dmitriev wrote:
> It's not instantiated directly. Rather, see the serialization/deserialization code in SerializationUtilities.java, where this class is indirectly instantiated. My understanding is that this is how Partitions and their child data structures are created, by transferring data from HMS.
I still have not found how this happens. Could you describe how you understand this happens? Maybe I can follow you better than the code.
> On April 21, 2017, 4:35 p.m., Sergio Pena wrote:
> > common/src/java/org/apache/hadoop/hive/common/CopyOnFirstWriteProperties.java
> > Lines 314 (patched)
> > <https://reviews.apache.org/r/57353/diff/1/?file=1656856#file1656856line314>
> >
> > If Interners.newWeakInterner() returns a thread-safe interner, why do we have to lock the INTERNER only when updating it?
>
> Misha Dmitriev wrote:
> I don't think newWeakInterner() returns a thread-safe interner, at least from inspecting its code?
I read the javadoc, which states:
newWeakInterner()
Returns a new thread-safe interner which retains a weak reference to each instance it has interned, and so does not prevent these instances from being garbage-collected.
- Sergio
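To make the point about locking concrete, here is a minimal sketch of what a thread-safe interner allows: several threads call intern() with no external lock, and equal Properties objects still resolve to one canonical instance. It does not use Guava. PropertiesInternerSketch is a hypothetical stand-in built on ConcurrentHashMap so the example runs with the JDK alone. Unlike the weak interner described in the javadoc above, it holds strong references and never releases entries.

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Stand-in interner for illustration only. It holds strong references, so
 * unlike a weak interner it never lets entries be garbage-collected.
 */
final class PropertiesInternerSketch {
    private final ConcurrentMap<Properties, Properties> pool = new ConcurrentHashMap<>();

    /** Returns the canonical instance equal to p. Callers must not mutate p afterwards. */
    Properties intern(Properties p) {
        // putIfAbsent is atomic, so no external lock is needed around this call.
        Properties existing = pool.putIfAbsent(p, p);
        return existing != null ? existing : p;
    }

    int size() {
        return pool.size();
    }
}

final class InternerDemo {
    public static void main(String[] args) throws InterruptedException {
        PropertiesInternerSketch interner = new PropertiesInternerSketch();
        Properties[] results = new Properties[8];
        Thread[] threads = new Thread[results.length];

        for (int i = 0; i < threads.length; i++) {
            final int slot = i;
            threads[i] = new Thread(() -> {
                // Each thread builds its own equal-by-content Properties object.
                Properties p = new Properties();
                p.setProperty("columns", "id,name");
                results[slot] = interner.intern(p);
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }

        // Every thread should have received the same canonical instance.
        for (Properties r : results) {
            if (r != results[0]) {
                throw new AssertionError("expected a single canonical instance");
            }
        }
        System.out.println("distinct interned instances: " + interner.size());
    }
}
```

Whether the lock in the patch is still needed for other reasons, for example to guard state other than the interner itself, is not something this sketch can settle.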