Posted to dev@spark.apache.org by Feynman Liang <fl...@databricks.com> on 2015/09/19 01:06:01 UTC

Re: One element per node

rdd.mapPartitions(it => it.take(1))

On Fri, Sep 18, 2015 at 3:57 PM, Ulanov, Alexander <alexander.ulanov@hpe.com> wrote:

> Dear Spark developers,
>
>
>
> Is it possible (and, if so, how) to pick one element per
> physical node from an RDD? Let’s say the first element of any partition on
> that node. The result would be an RDD[element], with the count of elements
> equal to the number of nodes that have partitions of the initial RDD.
>
>
>
> Best regards, Alexander
>
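[Editor's note: a compilable sketch of the one-liner above, which also handles empty partitions. The helper name is illustrative; this yields one element per *partition*, not per node, which is the distinction the rest of the thread turns on.]

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Take at most one element from each partition's iterator;
// empty partitions simply contribute nothing.
def firstPerPartition[T: ClassTag](rdd: RDD[T]): RDD[T] =
  rdd.mapPartitions(it => it.take(1))
```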

Re: One element per node

Posted by Reynold Xin <rx...@databricks.com>.
The reason it is nondeterministic is that tasks are not always scheduled
on the same nodes, so I don't think you can make this deterministic.

If you assume no failures and tasks take a while to run (so they run slower
than the scheduler can schedule them), then I think you can make it
deterministic by setting spark.locality.wait to a really high number and
coalescing everything into just N partitions, where N is the number of
machines.
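[Editor's note: a hedged sketch of that setup. The app name, locality value, and machine count are illustrative placeholders, not recommendations, and the one-partition-per-machine outcome is only a best-effort under the assumptions Reynold states.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Make tasks wait (almost) indefinitely for their preferred node rather
// than fall back to a less local one.
val conf = new SparkConf()
  .setAppName("one-element-per-node")
  .set("spark.locality.wait", "1000000s") // illustrative "really high" value

val sc = new SparkContext(conf)
val numMachines = 8 // assumed number of machines in the cluster

val rdd = sc.parallelize(1 to 100000)
// Coalesce to one partition per machine, then keep one element from each.
val onePerNode = rdd.coalesce(numMachines).mapPartitions(it => it.take(1))
```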





RE: One element per node

Posted by "Ulanov, Alexander" <al...@hpe.com>.
Sounds interesting! Is it possible to make it deterministic by using a global long value and getting the element on a partition only if someFunction(partitionId, globalLong) == true? Or by using some specific partitioner that creates partitionIds that can be decomposed into a nodeId and the number of partitions per node?
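[Editor's note: one way to read that idea, sketched with mapPartitionsWithIndex. partitionsPerNode is hypothetical, and rdd is assumed to exist; Spark does not expose a node id, so this only works if you arranged the partitioning yourself so each node owns a contiguous block of partition ids.]

```scala
// Assumption: uniform layout, partitionsPerNode partitions on each node.
val partitionsPerNode = 4

val picked = rdd.mapPartitionsWithIndex { (pid, it) =>
  // Keep an element only from the first partition of each node's block.
  if (pid % partitionsPerNode == 0) it.take(1) else Iterator.empty
}
```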



Re: One element per node

Posted by Reynold Xin <rx...@databricks.com>.
Use a global atomic boolean and return nothing from that partition if the
boolean is true.

Note that your result won't be deterministic.
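[Editor's note: a sketch of that trick. The flag must live in a JVM-wide singleton so all tasks on the same executor share it; note the scope is per executor JVM, not per physical node, and the result is nondeterministic, as stated. rdd is assumed to exist; the object name is illustrative.]

```scala
import java.util.concurrent.atomic.AtomicBoolean

// A Scala object is initialized once per executor JVM, so every task
// running on that executor shares this one flag.
object TakenFlag {
  val taken = new AtomicBoolean(false)
}

// The first partition processed on each executor flips the flag and emits
// one element; every later partition on that executor emits nothing.
val onePerExecutor = rdd.mapPartitions { it =>
  if (TakenFlag.taken.compareAndSet(false, true)) it.take(1)
  else Iterator.empty
}
```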


Re: One element per node

Posted by Feynman Liang <fl...@databricks.com>.
AFAIK the physical distribution is not exposed in the public API; the
closest I can think of is `rdd.coalesce(numPhysicalNodes).mapPartitions(...)`,
but this assumes that one partition exists per node.
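[Editor's note: filled out, that suggestion looks roughly like the following; numPhysicalNodes is an assumed, user-supplied count and rdd is assumed to exist.]

```scala
val numPhysicalNodes = 8 // assumption: you know (or query) the cluster size

// Shrink to at most one partition per node, then keep one element from
// each. As noted, coalesce gives no guarantee that exactly one partition
// lands on each physical node.
val result = rdd.coalesce(numPhysicalNodes).mapPartitions(it => it.take(1))
```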


RE: One element per node

Posted by "Ulanov, Alexander" <al...@hpe.com>.
Thank you! How can I guarantee that I have only one element per executor (per worker, or per physical node)?
