Posted to dev@spark.apache.org by Fei Hu <hu...@gmail.com> on 2017/01/14 23:58:54 UTC

Equally split a RDD partition into two partition at the same node

Dear all,

I want to divide an RDD partition equally into two partitions. That is, the
first half of the elements in the partition will form one new partition, and
the second half will form the other. The two new partitions must stay on the
same node as their parent partition, which helps preserve data locality.

Does anyone know how to implement this, or have any hints?

Thanks in advance,
Fei

Re: Equally split a RDD partition into two partition at the same node

Posted by Fei Hu <hu...@gmail.com>.
Hi Anastasios,

Thanks for your information. I will look into the CoalescedRDD code.

Thanks,
Fei

On Sun, Jan 15, 2017 at 12:21 PM, Anastasios Zouzias <zo...@gmail.com>
wrote:

> Hi Fei,
>
> I looked at the code of CoalescedRDD and probably what I suggested will
> not work.
>
> Speaking of which, CoalescedRDD is private[spark]. If this was not the
> case, you could set balanceSlack to 1, and get what you requested, see
>
> https://github.com/apache/spark/blob/branch-1.6/core/
> src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L75
>
> Maybe you could try to use the CoalescedRDD code to implement your
> requirement.
>
> Good luck!
> Cheers,
> Anastasios
>
>
> On Sun, Jan 15, 2017 at 5:39 PM, Fei Hu <hu...@gmail.com> wrote:
>
>> Hi Anastasios,
>>
>> Thanks for your reply. If I just increase the numPartitions to be twice
>> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
>> the data locality? Do I need to define my own Partitioner?
>>
>> Thanks,
>> Fei
>>
>> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zo...@gmail.com>
>> wrote:
>>
>>> Hi Fei,
>>>
>>> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>>>
>>> https://github.com/apache/spark/blob/branch-1.6/core/src/mai
>>> n/scala/org/apache/spark/rdd/RDD.scala#L395
>>>
>>> coalesce is mostly used for reducing the number of partitions before
>>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>>> requirements) if you increase the # of partitions.
>>>
>>> Best,
>>> Anastasios
>>>
>>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hu...@gmail.com> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I want to equally divide a RDD partition into two partitions. That
>>>> means, the first half of elements in the partition will create a new
>>>> partition, and the second half of elements in the partition will generate
>>>> another new partition. But the two new partitions are required to be at the
>>>> same node with their parent partition, which can help get high data
>>>> locality.
>>>>
>>>> Is there anyone who knows how to implement it or any hints for it?
>>>>
>>>> Thanks in advance,
>>>> Fei
>>>>
>>>>
>>>
>>>
>>> --
>>> -- Anastasios Zouzias
>>> <az...@zurich.ibm.com>
>>>
>>
>>
>
>
> --
> -- Anastasios Zouzias
> <az...@zurich.ibm.com>
>

Re: Equally split a RDD partition into two partition at the same node

Posted by Anastasios Zouzias <zo...@gmail.com>.
Hi Fei,

I looked at the code of CoalescedRDD, and what I suggested will probably not
work.

Speaking of which, CoalescedRDD is private[spark]. If it were not, you could
set balanceSlack to 1 and get what you requested; see

https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L75

Maybe you could try to use the CoalescedRDD code to implement your
requirement.

Good luck!
Cheers,
Anastasios


On Sun, Jan 15, 2017 at 5:39 PM, Fei Hu <hu...@gmail.com> wrote:

> Hi Anastasios,
>
> Thanks for your reply. If I just increase the numPartitions to be twice
> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
> the data locality? Do I need to define my own Partitioner?
>
> Thanks,
> Fei
>
> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zo...@gmail.com>
> wrote:
>
>> Hi Fei,
>>
>> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>>
>> https://github.com/apache/spark/blob/branch-1.6/core/src/
>> main/scala/org/apache/spark/rdd/RDD.scala#L395
>>
>> coalesce is mostly used for reducing the number of partitions before
>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>> requirements) if you increase the # of partitions.
>>
>> Best,
>> Anastasios
>>
>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hu...@gmail.com> wrote:
>>
>>> Dear all,
>>>
>>> I want to equally divide a RDD partition into two partitions. That
>>> means, the first half of elements in the partition will create a new
>>> partition, and the second half of elements in the partition will generate
>>> another new partition. But the two new partitions are required to be at the
>>> same node with their parent partition, which can help get high data
>>> locality.
>>>
>>> Is there anyone who knows how to implement it or any hints for it?
>>>
>>> Thanks in advance,
>>> Fei
>>>
>>>
>>
>>
>> --
>> -- Anastasios Zouzias
>> <az...@zurich.ibm.com>
>>
>
>


-- 
-- Anastasios Zouzias
<az...@zurich.ibm.com>

Re: Equally split a RDD partition into two partition at the same node

Posted by Fei Hu <hu...@gmail.com>.
Hi Pradeep,

That is a good idea. My customized RDDs are similar to NewHadoopRDD. If we
have billions of InputSplits, will that become a performance bottleneck?
That is, will too much data need to be transferred from the master node to
the compute nodes over the network?

Thanks,
Fei

On Mon, Jan 16, 2017 at 2:07 PM, Pradeep Gollakota <pr...@gmail.com>
wrote:

> Usually this kind of thing can be done at a lower level in the InputFormat
> usually by specifying the max split size. Have you looked into that
> possibility with your InputFormat?
>
> On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu <hu...@gmail.com> wrote:
>
>> Hi Jasbir,
>>
>> Yes, you are right. Do you have any idea about my question?
>>
>> Thanks,
>> Fei
>>
>> On Mon, Jan 16, 2017 at 12:37 AM, <ja...@accenture.com> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> Coalesce is used to decrease the number of partitions. If you give the
>>> value of numPartitions greater than the current partition, I don’t think
>>> RDD number of partitions will be increased.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Jasbir
>>>
>>>
>>>
>>> *From:* Fei Hu [mailto:hufei68@gmail.com]
>>> *Sent:* Sunday, January 15, 2017 10:10 PM
>>> *To:* zouzias@cs.toronto.edu
>>> *Cc:* user @spark <us...@spark.apache.org>; dev@spark.apache.org
>>> *Subject:* Re: Equally split a RDD partition into two partition at the
>>> same node
>>>
>>>
>>>
>>> Hi Anastasios,
>>>
>>>
>>>
>>> Thanks for your reply. If I just increase the numPartitions to be twice
>>> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false)
>>> keeps the data locality? Do I need to define my own Partitioner?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Fei
>>>
>>>
>>>
>>> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zo...@gmail.com>
>>> wrote:
>>>
>>> Hi Fei,
>>>
>>>
>>>
>>> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>>>
>>>
>>>
>>> https://github.com/apache/spark/blob/branch-1.6/core/src/mai
>>> n/scala/org/apache/spark/rdd/RDD.scala#L395
>>>
>>>
>>>
>>> coalesce is mostly used for reducing the number of partitions before
>>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>>> requirements) if you increase the # of partitions.
>>>
>>>
>>>
>>> Best,
>>>
>>> Anastasios
>>>
>>>
>>>
>>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hu...@gmail.com> wrote:
>>>
>>> Dear all,
>>>
>>>
>>>
>>> I want to equally divide a RDD partition into two partitions. That
>>> means, the first half of elements in the partition will create a new
>>> partition, and the second half of elements in the partition will generate
>>> another new partition. But the two new partitions are required to be at the
>>> same node with their parent partition, which can help get high data
>>> locality.
>>>
>>>
>>>
>>> Is there anyone who knows how to implement it or any hints for it?
>>>
>>>
>>>
>>> Thanks in advance,
>>>
>>> Fei
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> -- Anastasios Zouzias
>>>
>>>
>>>
>>>
>>
>>
>

Re: Equally split a RDD partition into two partition at the same node

Posted by Pradeep Gollakota <pr...@gmail.com>.
Usually this kind of thing can be done at a lower level, in the InputFormat,
by specifying the max split size. Have you looked into that possibility with
your InputFormat?
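
For example, with a FileInputFormat-based source, capping the maximum split
size makes the InputFormat produce more (and smaller) splits, and hence more
partitions. A rough sketch, assuming the new Hadoop API; the path and the
64 MB cap are arbitrary examples:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Cap each input split at 64 MB so the InputFormat produces more splits,
// and therefore Spark creates more (smaller) partitions.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)

val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/input",            // example path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf)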

On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu <hu...@gmail.com> wrote:

> Hi Jasbir,
>
> Yes, you are right. Do you have any idea about my question?
>
> Thanks,
> Fei
>
> On Mon, Jan 16, 2017 at 12:37 AM, <ja...@accenture.com> wrote:
>
>> Hi,
>>
>>
>>
>> Coalesce is used to decrease the number of partitions. If you give the
>> value of numPartitions greater than the current partition, I don’t think
>> RDD number of partitions will be increased.
>>
>>
>>
>> Thanks,
>>
>> Jasbir
>>
>>
>>
>> *From:* Fei Hu [mailto:hufei68@gmail.com]
>> *Sent:* Sunday, January 15, 2017 10:10 PM
>> *To:* zouzias@cs.toronto.edu
>> *Cc:* user @spark <us...@spark.apache.org>; dev@spark.apache.org
>> *Subject:* Re: Equally split a RDD partition into two partition at the
>> same node
>>
>>
>>
>> Hi Anastasios,
>>
>>
>>
>> Thanks for your reply. If I just increase the numPartitions to be twice
>> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
>> the data locality? Do I need to define my own Partitioner?
>>
>>
>>
>> Thanks,
>>
>> Fei
>>
>>
>>
>> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zo...@gmail.com>
>> wrote:
>>
>> Hi Fei,
>>
>>
>>
>> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>>
>>
>>
>> https://github.com/apache/spark/blob/branch-1.6/core/src/
>> main/scala/org/apache/spark/rdd/RDD.scala#L395
>>
>>
>>
>> coalesce is mostly used for reducing the number of partitions before
>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>> requirements) if you increase the # of partitions.
>>
>>
>>
>> Best,
>>
>> Anastasios
>>
>>
>>
>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hu...@gmail.com> wrote:
>>
>> Dear all,
>>
>>
>>
>> I want to equally divide a RDD partition into two partitions. That means,
>> the first half of elements in the partition will create a new partition,
>> and the second half of elements in the partition will generate another new
>> partition. But the two new partitions are required to be at the same node
>> with their parent partition, which can help get high data locality.
>>
>>
>>
>> Is there anyone who knows how to implement it or any hints for it?
>>
>>
>>
>> Thanks in advance,
>>
>> Fei
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> -- Anastasios Zouzias
>>
>>
>>
>>
>
>

Re: Equally split a RDD partition into two partition at the same node

Posted by Fei Hu <hu...@gmail.com>.
Hi Jasbir,

Yes, you are right. Do you have any idea about my question?

Thanks,
Fei

On Mon, Jan 16, 2017 at 12:37 AM, <ja...@accenture.com> wrote:

> Hi,
>
>
>
> Coalesce is used to decrease the number of partitions. If you give the
> value of numPartitions greater than the current partition, I don’t think
> RDD number of partitions will be increased.
>
>
>
> Thanks,
>
> Jasbir
>
>
>
> *From:* Fei Hu [mailto:hufei68@gmail.com]
> *Sent:* Sunday, January 15, 2017 10:10 PM
> *To:* zouzias@cs.toronto.edu
> *Cc:* user @spark <us...@spark.apache.org>; dev@spark.apache.org
> *Subject:* Re: Equally split a RDD partition into two partition at the
> same node
>
>
>
> Hi Anastasios,
>
>
>
> Thanks for your reply. If I just increase the numPartitions to be twice
> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
> the data locality? Do I need to define my own Partitioner?
>
>
>
> Thanks,
>
> Fei
>
>
>
> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zo...@gmail.com>
> wrote:
>
> Hi Fei,
>
>
>
> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>
>
>
> https://github.com/apache/spark/blob/branch-1.6/core/
> src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>
>
>
> coalesce is mostly used for reducing the number of partitions before
> writing to HDFS, but it might still be a narrow dependency (satisfying your
> requirements) if you increase the # of partitions.
>
>
>
> Best,
>
> Anastasios
>
>
>
> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hu...@gmail.com> wrote:
>
> Dear all,
>
>
>
> I want to equally divide a RDD partition into two partitions. That means,
> the first half of elements in the partition will create a new partition,
> and the second half of elements in the partition will generate another new
> partition. But the two new partitions are required to be at the same node
> with their parent partition, which can help get high data locality.
>
>
>
> Is there anyone who knows how to implement it or any hints for it?
>
>
>
> Thanks in advance,
>
> Fei
>
>
>
>
>
>
>
> --
>
> -- Anastasios Zouzias
>
>
>
>

RE: Equally split a RDD partition into two partition at the same node

Posted by ja...@accenture.com.
Hi,

Coalesce is used to decrease the number of partitions. If you pass a numPartitions value greater than the current number of partitions, I don’t think the RDD’s number of partitions will be increased.

Thanks,
Jasbir

From: Fei Hu [mailto:hufei68@gmail.com]
Sent: Sunday, January 15, 2017 10:10 PM
To: zouzias@cs.toronto.edu
Cc: user @spark <us...@spark.apache.org>; dev@spark.apache.org
Subject: Re: Equally split a RDD partition into two partition at the same node

Hi Anastasios,

Thanks for your reply. If I just increase the numPartitions to be twice larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps the data locality? Do I need to define my own Partitioner?

Thanks,
Fei

On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zo...@gmail.com>> wrote:
Hi Fei,

How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?

https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395

coalesce is mostly used for reducing the number of partitions before writing to HDFS, but it might still be a narrow dependency (satisfying your requirements) if you increase the # of partitions.

Best,
Anastasios

On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hu...@gmail.com>> wrote:
Dear all,

I want to equally divide a RDD partition into two partitions. That means, the first half of elements in the partition will create a new partition, and the second half of elements in the partition will generate another new partition. But the two new partitions are required to be at the same node with their parent partition, which can help get high data locality.

Is there anyone who knows how to implement it or any hints for it?

Thanks in advance,
Fei




--
-- Anastasios Zouzias



Re: Equally split a RDD partition into two partition at the same node

Posted by Fei Hu <hu...@gmail.com>.
Hi Anastasios,

Thanks for your reply. If I just double numPartitions, how does
coalesce(numPartitions: Int, shuffle: Boolean = false) keep the data
locality? Do I need to define my own Partitioner?

Thanks,
Fei

On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zo...@gmail.com>
wrote:

> Hi Fei,
>
> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>
> https://github.com/apache/spark/blob/branch-1.6/core/
> src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>
> coalesce is mostly used for reducing the number of partitions before
> writing to HDFS, but it might still be a narrow dependency (satisfying your
> requirements) if you increase the # of partitions.
>
> Best,
> Anastasios
>
> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hu...@gmail.com> wrote:
>
>> Dear all,
>>
>> I want to equally divide a RDD partition into two partitions. That means,
>> the first half of elements in the partition will create a new partition,
>> and the second half of elements in the partition will generate another new
>> partition. But the two new partitions are required to be at the same node
>> with their parent partition, which can help get high data locality.
>>
>> Is there anyone who knows how to implement it or any hints for it?
>>
>> Thanks in advance,
>> Fei
>>
>>
>
>
> --
> -- Anastasios Zouzias
> <az...@zurich.ibm.com>
>

Re: Equally split a RDD partition into two partition at the same node

Posted by Fei Hu <hu...@gmail.com>.
Hi Liang-Chi,

Yes, the split logic is needed in compute(). The preferred locations can be
derived from the customized Partition class.

Thanks for your help!

Cheers,
Fei


On Mon, Jan 16, 2017 at 3:00 AM, Liang-Chi Hsieh <vi...@gmail.com> wrote:

>
> Hi Fei,
>
> I think it should work. But you may need to add few logic in compute() to
> decide which half of the parent partition is needed to output. And you need
> to get the correct preferred locations for the partitions sharing the same
> parent partition.
>
>
> Fei Hu wrote
> > Hi Liang-Chi,
> >
> > Yes, you are right. I implement the following solution for this problem,
> > and it works. But I am not sure if it is efficient:
> >
> > I double the partitions of the parent RDD, and then use the new
> partitions
> > and parent RDD to construct the target RDD. In the compute() function of
> > the target RDD, I use the input partition to get the corresponding parent
> > partition, and get the half elements in the parent partitions as the
> > output
> > of the computing function.
> >
> > Thanks,
> > Fei
> >
> > On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh &lt;
>
> > viirya@
>
> > &gt; wrote:
> >
> >>
> >> Hi,
> >>
> >> When calling `coalesce` with `shuffle = false`, it is going to produce
> at
> >> most min(numPartitions, previous RDD's number of partitions). So I think
> >> it
> >> can't be used to double the number of partitions.
> >>
> >>
> >> Anastasios Zouzias wrote
> >> > Hi Fei,
> >> >
> >> > How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
> >> >
> >> > https://github.com/apache/spark/blob/branch-1.6/core/
> >> src/main/scala/org/apache/spark/rdd/RDD.scala#L395
> >> >
> >> > coalesce is mostly used for reducing the number of partitions before
> >> > writing to HDFS, but it might still be a narrow dependency (satisfying
> >> > your
> >> > requirements) if you increase the # of partitions.
> >> >
> >> > Best,
> >> > Anastasios
> >> >
> >> > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu &lt;
> >>
> >> > hufei68@
> >>
> >> > &gt; wrote:
> >> >
> >> >> Dear all,
> >> >>
> >> >> I want to equally divide a RDD partition into two partitions. That
> >> means,
> >> >> the first half of elements in the partition will create a new
> >> partition,
> >> >> and the second half of elements in the partition will generate
> another
> >> >> new
> >> >> partition. But the two new partitions are required to be at the same
> >> node
> >> >> with their parent partition, which can help get high data locality.
> >> >>
> >> >> Is there anyone who knows how to implement it or any hints for it?
> >> >>
> >> >> Thanks in advance,
> >> >> Fei
> >> >>
> >> >>
> >> >
> >> >
> >> > --
> >> > -- Anastasios Zouzias
> >> > &lt;
> >>
> >> > azo@.ibm
> >>
> >> > &gt;
> >>
> >>
> >>
> >>
> >>
> >> -----
> >> Liang-Chi Hsieh | @viirya
> >> Spark Technology Center
> >> http://www.spark.tc/
> >> --
> >> View this message in context: http://apache-spark-
> >> developers-list.1001551.n3.nabble.com/Equally-split-a-
> >> RDD-partition-into-two-partition-at-the-same-node-tp20597p20608.html
> >> Sent from the Apache Spark Developers List mailing list archive at
> >> Nabble.com.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
> >>
> >>
>
>
>
>
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Equally-split-a-
> RDD-partition-into-two-partition-at-the-same-node-tp20597p20613.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Equally split a RDD partition into two partition at the same node

Posted by Liang-Chi Hsieh <vi...@gmail.com>.
Hi Fei,

I think it should work. But you may need to add some logic in compute() to
decide which half of the parent partition to output. And you need to return
the correct preferred locations for the partitions that share the same
parent partition.


Fei Hu wrote
> Hi Liang-Chi,
> 
> Yes, you are right. I implement the following solution for this problem,
> and it works. But I am not sure if it is efficient:
> 
> I double the partitions of the parent RDD, and then use the new partitions
> and parent RDD to construct the target RDD. In the compute() function of
> the target RDD, I use the input partition to get the corresponding parent
> partition, and get the half elements in the parent partitions as the
> output
> of the computing function.
> 
> Thanks,
> Fei
> 
> On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh &lt;

> viirya@

> &gt; wrote:
> 
>>
>> Hi,
>>
>> When calling `coalesce` with `shuffle = false`, it is going to produce at
>> most min(numPartitions, previous RDD's number of partitions). So I think
>> it
>> can't be used to double the number of partitions.
>>
>>
>> Anastasios Zouzias wrote
>> > Hi Fei,
>> >
>> > How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>> >
>> > https://github.com/apache/spark/blob/branch-1.6/core/
>> src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>> >
>> > coalesce is mostly used for reducing the number of partitions before
>> > writing to HDFS, but it might still be a narrow dependency (satisfying
>> > your
>> > requirements) if you increase the # of partitions.
>> >
>> > Best,
>> > Anastasios
>> >
>> > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu &lt;
>>
>> > hufei68@
>>
>> > &gt; wrote:
>> >
>> >> Dear all,
>> >>
>> >> I want to equally divide a RDD partition into two partitions. That
>> means,
>> >> the first half of elements in the partition will create a new
>> partition,
>> >> and the second half of elements in the partition will generate another
>> >> new
>> >> partition. But the two new partitions are required to be at the same
>> node
>> >> with their parent partition, which can help get high data locality.
>> >>
>> >> Is there anyone who knows how to implement it or any hints for it?
>> >>
>> >> Thanks in advance,
>> >> Fei
>> >>
>> >>
>> >
>> >
>> > --
>> > -- Anastasios Zouzias
>> > &lt;
>>
>> > azo@.ibm
>>
>> > &gt;
>>
>>
>>
>>
>>
>> -----
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/
>> --
>> View this message in context: http://apache-spark-
>> developers-list.1001551.n3.nabble.com/Equally-split-a-
>> RDD-partition-into-two-partition-at-the-same-node-tp20597p20608.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 


Re: Equally split a RDD partition into two partition at the same node

Posted by Fei Hu <hu...@gmail.com>.
Hi Liang-Chi,

Yes, you are right. I implemented the following solution for this problem,
and it works, but I am not sure whether it is efficient:

I double the number of partitions of the parent RDD, and then use the new
partitions and the parent RDD to construct the target RDD. In the compute()
function of the target RDD, I use the input partition to find the
corresponding parent partition and return the appropriate half of that
parent partition's elements as the output of the compute function (a rough
sketch of this approach follows below).

Thanks,
Fei
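
A minimal sketch of this approach, assuming the Spark 1.6-era RDD API; the
SplitRDD and HalfPartition names are made up for illustration, and compute()
buffers the whole parent partition in memory to find its midpoint:

import scala.reflect.ClassTag
import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Child partitions 2*i and 2*i+1 both point at parent partition i.
class HalfPartition(override val index: Int, val parent: Partition) extends Partition

class SplitRDD[T: ClassTag](prev: RDD[T])
  extends RDD[T](prev.context, Seq(new NarrowDependency[T](prev) {
    // Each child partition depends on exactly one parent partition,
    // so the dependency stays narrow (no shuffle).
    override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
  })) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](prev.partitions.length * 2) { i =>
      new HalfPartition(i, prev.partitions(i / 2))
    }

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val hp = split.asInstanceOf[HalfPartition]
    // Materialize the parent partition to locate its midpoint; this costs
    // memory proportional to the size of the parent partition.
    val elems = prev.iterator(hp.parent, context).toArray
    val mid = elems.length / 2
    if (hp.index % 2 == 0) elems.iterator.take(mid) else elems.iterator.drop(mid)
  }

  // Reuse the parent partition's preferred locations so that both halves are
  // scheduled on the same node as their parent.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    prev.preferredLocations(split.asInstanceOf[HalfPartition].parent)
}

Used as new SplitRDD(parentRdd), this yields twice as many partitions, each
holding half of a parent partition and preferring the parent's locations.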

On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh <vi...@gmail.com> wrote:

>
> Hi,
>
> When calling `coalesce` with `shuffle = false`, it is going to produce at
> most min(numPartitions, previous RDD's number of partitions). So I think it
> can't be used to double the number of partitions.
>
>
> Anastasios Zouzias wrote
> > Hi Fei,
> >
> > How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
> >
> > https://github.com/apache/spark/blob/branch-1.6/core/
> src/main/scala/org/apache/spark/rdd/RDD.scala#L395
> >
> > coalesce is mostly used for reducing the number of partitions before
> > writing to HDFS, but it might still be a narrow dependency (satisfying
> > your
> > requirements) if you increase the # of partitions.
> >
> > Best,
> > Anastasios
> >
> > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu &lt;
>
> > hufei68@
>
> > &gt; wrote:
> >
> >> Dear all,
> >>
> >> I want to equally divide a RDD partition into two partitions. That
> means,
> >> the first half of elements in the partition will create a new partition,
> >> and the second half of elements in the partition will generate another
> >> new
> >> partition. But the two new partitions are required to be at the same
> node
> >> with their parent partition, which can help get high data locality.
> >>
> >> Is there anyone who knows how to implement it or any hints for it?
> >>
> >> Thanks in advance,
> >> Fei
> >>
> >>
> >
> >
> > --
> > -- Anastasios Zouzias
> > &lt;
>
> > azo@.ibm
>
> > &gt;
>
>
>
>
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Equally-split-a-
> RDD-partition-into-two-partition-at-the-same-node-tp20597p20608.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Equally split a RDD partition into two partition at the same node

Posted by Liang-Chi Hsieh <vi...@gmail.com>.
Hi,

When calling `coalesce` with `shuffle = false`, it produces at most
min(numPartitions, the previous RDD's number of partitions) partitions. So I
think it can't be used to double the number of partitions.
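
A quick way to see this in spark-shell, sketched with arbitrary toy numbers:

val rdd = sc.parallelize(1 to 1000, 24)

// A narrow coalesce cannot increase the partition count: still 24.
rdd.coalesce(48, shuffle = false).getNumPartitions

// With shuffle = true you do get 48 partitions, but the shuffle breaks the
// parent/child co-location that the original question asks for.
rdd.coalesce(48, shuffle = true).getNumPartitions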


Anastasios Zouzias wrote
> Hi Fei,
> 
> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
> 
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
> 
> coalesce is mostly used for reducing the number of partitions before
> writing to HDFS, but it might still be a narrow dependency (satisfying
> your
> requirements) if you increase the # of partitions.
> 
> Best,
> Anastasios
> 
> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu &lt;

> hufei68@

> &gt; wrote:
> 
>> Dear all,
>>
>> I want to equally divide a RDD partition into two partitions. That means,
>> the first half of elements in the partition will create a new partition,
>> and the second half of elements in the partition will generate another
>> new
>> partition. But the two new partitions are required to be at the same node
>> with their parent partition, which can help get high data locality.
>>
>> Is there anyone who knows how to implement it or any hints for it?
>>
>> Thanks in advance,
>> Fei
>>
>>
> 
> 
> -- 
> -- Anastasios Zouzias
> &lt;

> azo@.ibm

> &gt;





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 


Re: Equally split a RDD partition into two partition at the same node

Posted by Anastasios Zouzias <zo...@gmail.com>.
Hi Fei,

Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?

https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395

coalesce is mostly used for reducing the number of partitions before
writing to HDFS, but it might still be a narrow dependency (satisfying your
requirements) if you increase the # of partitions.

Best,
Anastasios

On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hu...@gmail.com> wrote:

> Dear all,
>
> I want to equally divide a RDD partition into two partitions. That means,
> the first half of elements in the partition will create a new partition,
> and the second half of elements in the partition will generate another new
> partition. But the two new partitions are required to be at the same node
> with their parent partition, which can help get high data locality.
>
> Is there anyone who knows how to implement it or any hints for it?
>
> Thanks in advance,
> Fei
>
>


-- 
-- Anastasios Zouzias
<az...@zurich.ibm.com>

Re: Equally split a RDD partition into two partition at the same node

Posted by Fei Hu <hu...@gmail.com>.
Hi Rishi,

Thanks for your reply! The RDD has 24 partitions, and the cluster has a
master node plus 24 computing nodes (12 cores per node). Each node will hold
one partition, and I want to split each partition into two sub-partitions on
the same node to improve parallelism while keeping high data locality.

Thanks,
Fei


On Sun, Jan 15, 2017 at 2:33 AM, Rishi Yadav <ri...@infoobjects.com> wrote:

> Can you provide some more details:
> 1. How many partitions does RDD have
> 2. How big is the cluster
> On Sat, Jan 14, 2017 at 3:59 PM Fei Hu <hu...@gmail.com> wrote:
>
>> Dear all,
>>
>> I want to equally divide a RDD partition into two partitions. That means,
>> the first half of elements in the partition will create a new partition,
>> and the second half of elements in the partition will generate another new
>> partition. But the two new partitions are required to be at the same node
>> with their parent partition, which can help get high data locality.
>>
>> Is there anyone who knows how to implement it or any hints for it?
>>
>> Thanks in advance,
>> Fei
>>
>>

Re: Equally split a RDD partition into two partition at the same node

Posted by Rishi Yadav <ri...@infoobjects.com>.
Can you provide some more details:
1. How many partitions does the RDD have?
2. How big is the cluster?
On Sat, Jan 14, 2017 at 3:59 PM Fei Hu <hu...@gmail.com> wrote:

> Dear all,
>
> I want to equally divide a RDD partition into two partitions. That means,
> the first half of elements in the partition will create a new partition,
> and the second half of elements in the partition will generate another new
> partition. But the two new partitions are required to be at the same node
> with their parent partition, which can help get high data locality.
>
> Is there anyone who knows how to implement it or any hints for it?
>
> Thanks in advance,
> Fei
>
>