Posted to user@phoenix.apache.org by "Fustes, Diego" <Di...@ndt-global.com> on 2016/04/18 10:37:13 UTC

Phoenix-Spark: Number of partitions in PhoenixRDD

Hi all,

I'm working with the Phoenix Spark plugin to process a HUGE table. The table is salted into 100 buckets and split into 400 regions. When I read it with phoenixTableAsRDD, I get an RDD with 150 partitions. These partitions are too big, and I am getting OutOfMemory problems, so I would like smaller partitions. I could just call repartition, but that would shuffle the whole dataset... So, my question is: is there a way to modify PhoenixInputFormat to get more partitions in the resulting RDD?
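
For concreteness, here is a minimal sketch of the read path I'm describing (assuming the phoenix-spark 4.x Scala API; the table name, columns, and ZooKeeper quorum are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.phoenix.spark._  // adds phoenixTableAsRDD to SparkContext

  object ReadHugeTable {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("phoenix-read"))

      // One RDD partition is created per split the Phoenix query planner
      // returns, so the partition count cannot be raised from Spark alone.
      val rdd = sc.phoenixTableAsRDD(
        "MY_HUGE_TABLE",               // placeholder table name
        Seq("ID", "PAYLOAD"),          // placeholder columns
        zkUrl = Some("zk-host:2181"))  // placeholder ZooKeeper quorum

      println(s"partitions = ${rdd.partitions.length}")  // 150 in my case

      // rdd.repartition(600) would give smaller partitions, but only at the
      // cost of a full shuffle of the dataset, which I want to avoid.
      sc.stop()
    }
  }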

Thanks and regards,

Diego





NDT GDAC Spain S.L.
Diego Fustes, Big Data and Machine Learning Expert
Gran Vía de les Corts Catalanes 130, 11th floor
08038 Barcelona, Spain
Phone: +34 93 43 255 27
diego.fustes@ndt-global.com
www.ndt-global.com








RE: Phoenix-Spark: Number of partitions in PhoenixRDD

Posted by "Fustes, Diego" <Di...@ndt-global.com>.
Hi Josh,

Thanks for the information. For the moment, I have been able to control the partitioning by switching to JdbcRDD, creating 6000 partitions based on a numeric column. I think this approach is slower, but at least it scales well.
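
For reference, a rough sketch of this workaround (using Spark's org.apache.spark.rdd.JdbcRDD; the bounds, column, and table names below are placeholders):

  import java.sql.{DriverManager, ResultSet}
  import org.apache.spark.rdd.JdbcRDD
  import org.apache.spark.{SparkConf, SparkContext}

  object JdbcRddWorkaround {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("phoenix-jdbc"))

      // Spark fills the two '?' placeholders with per-partition bounds from
      // splitting [lowerBound, upperBound] into numPartitions ranges, so
      // partitioning is driven by the numeric column rather than by the
      // Phoenix query planner's splits.
      val rdd = new JdbcRDD(
        sc,
        () => DriverManager.getConnection("jdbc:phoenix:zk-host:2181"),
        "SELECT ID, PAYLOAD FROM MY_HUGE_TABLE WHERE ID >= ? AND ID <= ?",
        lowerBound = 0L,
        upperBound = 600000000L,
        numPartitions = 6000,
        mapRow = (rs: ResultSet) => (rs.getLong("ID"), rs.getString("PAYLOAD")))

      println(rdd.count())
      sc.stop()
    }
  }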

Raising the number of HBase regions that much would be counterproductive, as it would hurt HBase performance... It would be nice if you could provide an alternative to PhoenixInputFormat that creates more splits. In any case, thanks for the help.

Diego



Re: Phoenix-Spark: Number of partitions in PhoenixRDD

Posted by Josh Mahonin <jm...@gmail.com>.
Hi Diego,

The phoenix-spark RDD partition count is equal to the number of splits that the query planner returns. Adjusting the HBase region splits, table salting [1], as well as the guidepost width [2] should help with the parallelization here.

Using 'EXPLAIN' for the generated query in sqlline might be helpful for debugging here as well.

It would be great if you could update this thread with any lessons learned as well.

Good luck,

Josh

[1] https://phoenix.apache.org/salted.html
[2] https://phoenix.apache.org/update_statistics.html
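
For example, a quick sketch of that kind of check, runnable from any JDBC client (the table name and ZooKeeper quorum are placeholders; the same two statements can be issued from sqlline directly, and guidepost width itself is governed by the server-side phoenix.stats.guidepost.width setting, in bytes):

  import java.sql.DriverManager

  object InspectPlan {
    def main(args: Array[String]): Unit = {
      // jdbc:phoenix:<zookeeper-quorum> is the standard Phoenix JDBC URL form.
      val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
      val stmt = conn.createStatement()

      // The plan shows how many parallel chunks/scans the planner produces;
      // the phoenix-spark RDD gets one partition per split.
      val rs = stmt.executeQuery("EXPLAIN SELECT * FROM MY_HUGE_TABLE")
      while (rs.next()) println(rs.getString(1))

      // Refreshing statistics regenerates the guideposts; a smaller guidepost
      // width yields more, finer-grained splits.
      stmt.execute("UPDATE STATISTICS MY_HUGE_TABLE")
      conn.close()
    }
  }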
