Posted to user@phoenix.apache.org by Antonio Murgia <an...@eng.it> on 2016/10/13 16:16:22 UTC

sc.phoenixTableAsRDD number of initial partitions

Hello everyone,

I'm trying to read data from a Phoenix table using Apache Spark. I
actually use the suggested method, sc.phoenixTableAsRDD, without issuing
any query (i.e. reading the whole table), and I noticed that the number
of partitions that Spark creates is equal to the number of
region servers. Is there a way to use a custom number of regions?

The problem we actually face is that if a region is bigger than the
available memory of the Spark executor, the executor goes OOM. If we were
able to tune the number of regions, we could use a higher number of
partitions, reducing the memory footprint of the processing (and also
slowing it down, I know :( ).

Thank you in advance

#A.M.
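
For reference, a minimal sketch of the kind of whole-table read described
above. The table name, columns and ZooKeeper quorum are made-up
placeholders; sc.phoenixTableAsRDD is the implicit added by the
phoenix-spark module:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.phoenix.spark._   // adds sc.phoenixTableAsRDD

    val sc = new SparkContext(new SparkConf().setAppName("phoenix-full-table-read"))

    // No predicate is passed, so the whole table is scanned; the number of
    // partitions is whatever Phoenix decides to split the scan into.
    val rdd = sc.phoenixTableAsRDD(
      "MY_TABLE",                      // placeholder table name
      Seq("ID", "COL1", "COL2"),       // columns to project
      zkUrl = Some("zk-host:2181"))    // placeholder ZooKeeper quorum

    println(s"initial partitions = ${rdd.partitions.length}")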


Re: sc.phoenixTableAsRDD number of initial partitions

Posted by Ciureanu Constantin <ci...@gmail.com>.
Then please post a small part of your code (the part that reads from Phoenix
and processes the RDD contents).

2016-10-14 11:12 GMT+02:00 Antonio Murgia <an...@eng.it>:

> For the record, autocommit was set to true.
>
> On 10/14/2016 10:08 AM, James Taylor wrote:
>
>
>
> On Fri, Oct 14, 2016 at 12:37 AM, Antonio Murgia <an...@eng.it>
> wrote:
>
>> We tried with an UPSERT SELECT, but we ran into some memory issues
>> on the Phoenix side.
>>
>> Do you have any suggestion to perform something like that?
>>
> You can try setting auto commit to true on the connection before you
> perform the upsert select. That'll prevent memory problems on the client
> due to buffering.
>
> Thanks,
> James
>
>
>

Re: sc.phoenixTableAsRDD number of initial partitions

Posted by Antonio Murgia <an...@eng.it>.
For the record, autocommit was set to true.


On 10/14/2016 10:08 AM, James Taylor wrote:
>
>
> On Fri, Oct 14, 2016 at 12:37 AM, Antonio Murgia 
> <antonio.murgia@eng.it <ma...@eng.it>> wrote:
>
>     We tried with an UPSERT SELECT, but we ran into some memory
>     issues on the Phoenix side.
>
>     Do you have any suggestion to perform something like that?
>
> You can try setting auto commit to true on the connection before you 
> perform the upsert select. That'll prevent memory problems on the 
> client due to buffering.
>
> Thanks,
> James


Re: sc.phoenixTableAsRDD number of initial partitions

Posted by James Taylor <ja...@apache.org>.
On Fri, Oct 14, 2016 at 12:37 AM, Antonio Murgia <an...@eng.it>
wrote:

> We tried with an UPSERT SELECT, but we ran into some memory issues
> on the Phoenix side.
>
> Do you have any suggestion to perform something like that?
>
You can try setting auto commit to true on the connection before you
perform the upsert select. That'll prevent memory problems on the client
due to buffering.

Thanks,
James
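
A minimal sketch of what that looks like over plain JDBC; the connection
string, table names and aggregation are made-up placeholders:

    import java.sql.DriverManager

    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    try {
      // With auto commit on, mutations are committed in batches as the
      // UPSERT SELECT runs, instead of being buffered on the client until
      // an explicit commit().
      conn.setAutoCommit(true)
      val stmt = conn.createStatement()
      stmt.executeUpdate(
        "UPSERT INTO TARGET_TABLE (ID, TOTAL) " +
        "SELECT ID, SUM(VAL) FROM SOURCE_TABLE GROUP BY ID")
      stmt.close()
    } finally {
      conn.close()
    }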

Re: sc.phoenixTableAsRDD number of initial partitions

Posted by Antonio Murgia <an...@eng.it>.
I know the Spark doc is really comprehensive; I have read it many times
over the last 2 years, and I know how to check how Spark uses its memory
and how to tweak it (e.g. using more or less memory for caching). I'll try
telling Spark not to reserve any memory for caching the RDD, since I'm not
caching at all. Please don't reply with general Spark knowledge, because I
know fairly well how Spark works.

Thank you in advance.

#A.M.


On 10/14/2016 09:54 AM, Mich Talebzadeh wrote:
>
> "I do know how Spark in general works, and how it stores data in 
> memory etc. It's been almost 2 years that I work on it. So I'm 
> definetely not collecting the whole rdd in memory ;)"
>
>
> Spark doc is a good start.
>
> To see how spark memory is utilised look at Spark UI on <HOST>:4040 by 
> default under storage tab. It will tell you what is stored.
>
> Spark uses execution memory for result set on operation (RDD + DF) and 
> storage memory for anything cached with cache() or persist(). You can 
> verify all this in Spark UI.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which 
> may arise from relying on this email's technical content is explicitly 
> disclaimed. The author will in no case be liable for any monetary 
> damages arising from such loss, damage or destruction.
>
>
> On 14 October 2016 at 08:37, Antonio Murgia <antonio.murgia@eng.it 
> <ma...@eng.it>> wrote:
>
>     Hi Constantin,
>
>     thank you for your reply. I do know how Spark in general works,
>     and how it stores data in memory etc. It's been almost 2 years
>     that I work on it. So I'm definitely not collecting the whole rdd
>     in memory ;)
>
>     Our "mantainance use case" is the following:
>
>     Copying the whole content of a table to another table applying a
>     simple transformation (e.g. aggregating some columns). We tried
>     with an UPSERT SELECT, but we ran into some memory issues on
>     the Phoenix side.
>
>     Do you have any suggestion to perform something like that?
>
>     Thank you in advance
>
>     #A.M.
>
>
>     On 10/14/2016 08:10 AM, Ciureanu Constantin wrote:
>>
>>     Hi Antonio,
>>     Reading the whole table is not a good use case for Phoenix /
>>     HBase or any DB.
>>     You should never store the whole content read from a DB / disk
>>     in memory; that's definitely wrong.
>>     Spark doesn't do that by itself, no matter what "they" told you
>>     it's going to do in order to be faster. Review your algorithm
>>     and see what can be improved. After all, I hope you are just
>>     using collect(), so that the OOM is on the driver (that's easier
>>     to fix, :p by not using it).
>>     Back to the OOM: after reading an RDD you can easily shuffle /
>>     repartition it into any number of partitions (but that sends
>>     data over the network, so it's expensive):
>>     repartition(numPartitions)
>>     http://spark.apache.org/docs/latest/programming-guide.html
>>     <http://spark.apache.org/docs/latest/programming-guide.html>
>>     I recommend reading this plus a few articles on Spark best practices.
>>
>>     Kind regards,
>>     Constantin
>>
>>
>>     On Thu, 13 Oct 2016 at 18:16, Antonio Murgia
>>     <antonio.murgia@eng.it <ma...@eng.it>> wrote:
>>
>>         Hello everyone,
>>
>>         I'm trying to read data from a Phoenix table using Apache
>>         Spark. I actually use the suggested method, sc.phoenixTableAsRDD,
>>         without issuing any query (i.e. reading the whole table), and I
>>         noticed that the number of partitions that Spark creates is equal
>>         to the number of region servers. Is there a way to use a custom
>>         number of regions?
>>
>>         The problem we actually face is that if a region is bigger than
>>         the available memory of the Spark executor, the executor goes OOM.
>>         If we were able to tune the number of regions, we could use a
>>         higher number of partitions, reducing the memory footprint of the
>>         processing (and also slowing it down, I know :( ).
>>
>>         Thank you in advance
>>
>>         #A.M.
>>
>
>


Re: sc.phoenixTableAsRDD number of initial partitions

Posted by Mich Talebzadeh <mi...@gmail.com>.
"I do know how Spark in general works, and how it stores data in memory
etc. It's been almost 2 years that I work on it. So I'm definitely not
collecting the whole rdd in memory ;)"

Spark doc is a good start.

To see how Spark memory is utilised, look at the Spark UI (on <HOST>:4040 by
default) under the Storage tab. It will tell you what is stored.

Spark uses execution memory for the results of operations (RDD and DF) and
storage memory for anything cached with cache() or persist(). You can
verify all this in the Spark UI.

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 October 2016 at 08:37, Antonio Murgia <an...@eng.it> wrote:

> Hi Constantin,
>
> thank you for your reply. I do know how Spark in general works, and how it
> stores data in memory etc. It's been almost 2 years that I work on it. So
> I'm definitely not collecting the whole rdd in memory ;)
>
> Our "mantainance use case" is the following:
>
> Copying the whole content of a table to another table applying a simple
> transformation (e.g. aggregating some columns). We tried with an UPSERT
> SELECT, but we ran into some memory issues on the Phoenix side.
>
> Do you have any suggestion to perform something like that?
>
> Thank you in advance
>
> #A.M.
>
> On 10/14/2016 08:10 AM, Ciureanu Constantin wrote:
>
> Hi Antonio,
> Reading the whole table is not a good use case for Phoenix / HBase or any
> DB.
> You should never store the whole content read from a DB / disk in memory;
> that's definitely wrong.
> Spark doesn't do that by itself, no matter what "they" told you it's going
> to do in order to be faster. Review your algorithm and see what can be
> improved. After all, I hope you are just using collect(), so that the OOM
> is on the driver (that's easier to fix, :p by not using it).
> Back to the OOM: after reading an RDD you can easily shuffle / repartition
> it into any number of partitions (but that sends data over the network, so
> it's expensive):
> repartition(numPartitions)
> http://spark.apache.org/docs/latest/programming-guide.html
> I recommend reading this plus a few articles on Spark best practices.
>
> Kind regards,
> Constantin
>
> On Thu, 13 Oct 2016 at 18:16, Antonio Murgia <an...@eng.it>
> wrote:
>
>> Hello everyone,
>>
>> I'm trying to read data from a Phoenix table using Apache Spark. I
>> actually use the suggested method, sc.phoenixTableAsRDD, without issuing
>> any query (i.e. reading the whole table), and I noticed that the number
>> of partitions that Spark creates is equal to the number of
>> region servers. Is there a way to use a custom number of regions?
>>
>> The problem we actually face is that if a region is bigger than the
>> available memory of the Spark executor, the executor goes OOM. If we were
>> able to tune the number of regions, we could use a higher number of
>> partitions, reducing the memory footprint of the processing (and also
>> slowing it down, I know :( ).
>>
>> Thank you in advance
>>
>> #A.M.
>>
>>
>

Re: sc.phoenixTableAsRDD number of initial partitions

Posted by Antonio Murgia <an...@eng.it>.
Hi Constantin,

thank you for your reply. I do know how Spark in general works, and how 
it stores data in memory etc. It's been almost 2 years that I work on 
it. So I'm definitely not collecting the whole rdd in memory ;)

Our "mantainance use case" is the following:

Copying the whole content of a table to another table, applying a simple
transformation (e.g. aggregating some columns). We tried with an UPSERT
SELECT, but we ran into some memory issues on the Phoenix side.

Do you have any suggestion to perform something like that?

Thank you in advance

#A.M.
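
A rough sketch of that kind of copy job with phoenix-spark, assuming
made-up table names, columns and ZooKeeper quorum (saveToPhoenix is the
companion write method provided by the same module):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.phoenix.spark._

    val sc = new SparkContext(new SparkConf().setAppName("phoenix-table-copy"))

    // Full scan of the source table; each record comes back as a
    // Map[String, AnyRef] keyed by column name.
    val source = sc.phoenixTableAsRDD(
      "SOURCE_TABLE", Seq("ID", "VAL1", "VAL2"), zkUrl = Some("zk-host:2181"))

    // The simple transformation, e.g. aggregating two columns into one.
    // The column types here are only illustrative.
    val transformed = source.map { row =>
      val id  = row("ID").asInstanceOf[String]
      val sum = row("VAL1").asInstanceOf[Long] + row("VAL2").asInstanceOf[Long]
      (id, sum)
    }

    // Write the (ID, TOTAL) tuples into the target table.
    transformed.saveToPhoenix(
      "TARGET_TABLE", Seq("ID", "TOTAL"), zkUrl = Some("zk-host:2181"))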


On 10/14/2016 08:10 AM, Ciureanu Constantin wrote:
>
> Hi Antonio,
> Reading the whole table is not a good use case for Phoenix / HBase or
> any DB.
> You should never store the whole content read from a DB / disk in
> memory; that's definitely wrong.
> Spark doesn't do that by itself, no matter what "they" told you it's
> going to do in order to be faster. Review your algorithm and see what
> can be improved. After all, I hope you are just using collect(), so
> that the OOM is on the driver (that's easier to fix, :p by not using it).
> Back to the OOM: after reading an RDD you can easily shuffle /
> repartition it into any number of partitions (but that sends data
> over the network, so it's expensive):
> repartition(numPartitions)
> http://spark.apache.org/docs/latest/programming-guide.html
> I recommend reading this plus a few articles on Spark best practices.
>
> Kind regards,
> Constantin
>
>
> On Thu, 13 Oct 2016 at 18:16, Antonio Murgia <antonio.murgia@eng.it
> <ma...@eng.it>> wrote:
>
>     Hello everyone,
>
>     I'm trying to read data from a Phoenix table using Apache Spark. I
>     actually use the suggested method, sc.phoenixTableAsRDD, without
>     issuing any query (i.e. reading the whole table), and I noticed that
>     the number of partitions that Spark creates is equal to the number of
>     region servers. Is there a way to use a custom number of regions?
>
>     The problem we actually face is that if a region is bigger than the
>     available memory of the Spark executor, the executor goes OOM. If we
>     were able to tune the number of regions, we could use a higher number
>     of partitions, reducing the memory footprint of the processing (and
>     also slowing it down, I know :( ).
>
>     Thank you in advance
>
>     #A.M.
>


Re: sc.phoenixTableAsRDD number of initial partitions

Posted by Ciureanu Constantin <ci...@gmail.com>.
Hi Antonio,
Reading the whole table is not a good use case for Phoenix / HBase or any
DB.
You should never store the whole content read from a DB / disk in memory;
that's definitely wrong.
Spark doesn't do that by itself, no matter what "they" told you it's going
to do in order to be faster. Review your algorithm and see what can be
improved. After all, I hope you are just using collect(), so that the OOM
is on the driver (that's easier to fix, :p by not using it).
Back to the OOM: after reading an RDD you can easily shuffle / repartition
it into any number of partitions (but that sends data over the network, so
it's expensive):
repartition(numPartitions)
http://spark.apache.org/docs/latest/programming-guide.html
I recommend reading this plus a few articles on Spark best practices.

Kind regards,
Constantin
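
For example, a minimal sketch of that, with a made-up table, columns,
ZooKeeper quorum and partition count:

    import org.apache.phoenix.spark._

    val rdd = sc.phoenixTableAsRDD(
      "MY_TABLE", Seq("ID", "COL1"), zkUrl = Some("zk-host:2181"))

    // Spread the rows over more, smaller partitions before the heavy
    // processing; the repartition shuffles data over the network.
    val repartitioned = rdd.repartition(200)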

On Thu, 13 Oct 2016 at 18:16, Antonio Murgia <an...@eng.it> wrote:

> Hello everyone,
>
> I'm trying to read data from a Phoenix table using Apache Spark. I
> actually use the suggested method, sc.phoenixTableAsRDD, without issuing
> any query (i.e. reading the whole table), and I noticed that the number
> of partitions that Spark creates is equal to the number of
> region servers. Is there a way to use a custom number of regions?
>
> The problem we actually face is that if a region is bigger than the
> available memory of the Spark executor, the executor goes OOM. If we were
> able to tune the number of regions, we could use a higher number of
> partitions, reducing the memory footprint of the processing (and also
> slowing it down, I know :( ).
>
> Thank you in advance
>
> #A.M.
>
>