Posted to user@spark.apache.org by SRK <sw...@gmail.com> on 2016/02/16 19:21:55 UTC

How to use a custom partitioner in a dataframe in Spark

Hi,

How do I use a custom partitioner when I do a saveAsTable on a DataFrame?


Thanks,
Swetha



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How to use a custom partitioner in a dataframe in Spark

Posted by Rishi Mishra <rm...@snappydata.io>.
Unfortunately there isn't one, at least up to 1.5. I have not gone through
the new Dataset API in 1.6. There is some basic support for Parquet, like
partitioning by column (partitionBy).
If you want to partition your dataset in a certain way, you have to use an
RDD to partition the data and convert that into a DataFrame before storing it in a table.
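A minimal sketch of that RDD-first approach (this assumes a SparkContext `sc` and a Hive-enabled `sqlContext` in scope; the partitioner class, bucket count, and table name are illustrative, not from the thread):

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.SaveMode

// A custom partitioner that buckets keys by hash code.
// The double-mod keeps the result non-negative even for Int.MinValue hash codes.
class UserIdPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int =
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions
}

val pairs = sc.parallelize(Seq("u1" -> 10, "u2" -> 20, "u3" -> 30))

// Repartition the pair RDD with the custom partitioner (this shuffles once).
val partitioned = pairs.partitionBy(new UserIdPartitioner(8))

// Convert back to a DataFrame and store it; the rows reach the writer
// already grouped by the custom partitioner.
import sqlContext.implicits._
val df = partitioned.toDF("userId", "value")
df.write.mode(SaveMode.Append).saveAsTable("userRecords")
```

Note that `saveAsTable` persists the rows but not the partitioner object itself; the physical grouping is preserved, the metadata is not.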

Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Tue, Feb 16, 2016 at 11:51 PM, SRK <sw...@gmail.com> wrote:

> Hi,
>
> How do I use a custom partitioner when I do a saveAsTable in a dataframe.
>
>
> Thanks,
> Swetha
>
>
>

Re: How to use a custom partitioner in a dataframe in Spark

Posted by Koert Kuipers <ko...@tresata.com>.
Although it is not a bad idea to write data out partitioned and then use a
merge join when reading it back in, this currently isn't easily doable even
with RDDs, because when you read an RDD from disk the partitioning info is
lost. Re-introducing a partitioner at that point causes a shuffle, defeating
the purpose.
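That loss is easy to see in a small round-trip sketch (assumes a SparkContext `sc`; the path is illustrative):

```scala
import org.apache.spark.HashPartitioner

// Build a pair RDD with a known partitioner.
val rdd = sc.parallelize(1 to 100).map(i => (i, i))
  .partitionBy(new HashPartitioner(8))
println(rdd.partitioner)   // Some(HashPartitioner) -- partitioner is known

// Write it out and read it back.
rdd.saveAsObjectFile("/tmp/partitioner-demo")
val back = sc.objectFile[(Int, Int)]("/tmp/partitioner-demo")
println(back.partitioner)  // None -- the partitioning info did not survive

// Joining `back` on the key now triggers a full shuffle again.
```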

On Thu, Feb 18, 2016 at 1:49 PM, Rishi Mishra <rm...@snappydata.io> wrote:

> Michael,
> Is there any specific reason why DataFrames does not have partitioners
> like RDDs ? This will be very useful if one is writing custom datasources ,
> which keeps data in partitions. While storing data one can pre-partition
> the data at Spark level rather than at the datasource.
>
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>
> On Thu, Feb 18, 2016 at 3:50 AM, swetha kasireddy <
> swethakasireddy@gmail.com> wrote:
>
>> So suppose I have a bunch of userIds and I need to save them as parquet
>> in database. I also need to load them back and need to be able to do a join
>> on userId. My idea is to partition by userId hashcode first and then on
>> userId. So that I don't have to deal with any performance issues because of
>> a number of small files and also to be able to scan faster.
>>
>>
>> Something like ...df.write.format("parquet").partitionBy( "userIdHash"
>> , "userId").mode(SaveMode.Append).save("userRecords");
>>
>> On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy <
>> swethakasireddy@gmail.com> wrote:
>>
>>> So suppose I have a bunch of userIds and I need to save them as parquet
>>> in database. I also need to load them back and need to be able to do a join
>>> on userId. My idea is to partition by userId hashcode first and then on
>>> userId.
>>>
>>>
>>>
>>> On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>>> Can you describe what you are trying to accomplish?  What would the
>>>> custom partitioner be?
>>>>
>>>> On Tue, Feb 16, 2016 at 1:21 PM, SRK <sw...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How do I use a custom partitioner when I do a saveAsTable in a
>>>>> dataframe.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Swetha
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to use a custom partitioner in a dataframe in Spark

Posted by Rishi Mishra <rm...@snappydata.io>.
Michael,
Is there any specific reason why DataFrames do not have partitioners like
RDDs? This would be very useful if one is writing custom datasources that
keep data in partitions: while storing data, one can pre-partition the data
at the Spark level rather than at the datasource.

Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Thu, Feb 18, 2016 at 3:50 AM, swetha kasireddy <swethakasireddy@gmail.com
> wrote:

> So suppose I have a bunch of userIds and I need to save them as parquet in
> database. I also need to load them back and need to be able to do a join
> on userId. My idea is to partition by userId hashcode first and then on
> userId. So that I don't have to deal with any performance issues because of
> a number of small files and also to be able to scan faster.
>
>
> Something like ...df.write.format("parquet").partitionBy( "userIdHash"
> , "userId").mode(SaveMode.Append).save("userRecords");
>
> On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy <
> swethakasireddy@gmail.com> wrote:
>
>> So suppose I have a bunch of userIds and I need to save them as parquet
>> in database. I also need to load them back and need to be able to do a join
>> on userId. My idea is to partition by userId hashcode first and then on
>> userId.
>>
>>
>>
>> On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust <
>> michael@databricks.com> wrote:
>>
>>> Can you describe what you are trying to accomplish?  What would the
>>> custom partitioner be?
>>>
>>> On Tue, Feb 16, 2016 at 1:21 PM, SRK <sw...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> How do I use a custom partitioner when I do a saveAsTable in a
>>>> dataframe.
>>>>
>>>>
>>>> Thanks,
>>>> Swetha
>>>>
>>>>
>>>>
>>>
>>
>

Re: How to use a custom partitioner in a dataframe in Spark

Posted by swetha kasireddy <sw...@gmail.com>.
So suppose I have a bunch of userIds that I need to save as Parquet in a
database. I also need to load them back and be able to do a join on userId.
My idea is to partition by the userId hashcode first and then on userId, so
that I don't have to deal with the performance issues of a large number of
small files and can also scan faster.

Something like:

df.write.format("parquet").partitionBy("userIdHash", "userId").mode(SaveMode.Append).save("userRecords");
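The hash column keeps the number of top-level partitionBy directories bounded. A minimal sketch of such a bucketing function (the name `userIdHash`, the default of 16 buckets, and the sample ids are illustrative):

```scala
// Derive a bounded bucket id from a userId, so that
// partitionBy("userIdHash", "userId") yields at most `buckets`
// top-level directories. The double-mod keeps the result
// non-negative even when hashCode is negative.
def userIdHash(userId: String, buckets: Int = 16): Int =
  ((userId.hashCode % buckets) + buckets) % buckets

// The bucket is deterministic and always falls in [0, buckets).
val b = userIdHash("user-42")
assert(b >= 0 && b < 16)
assert(userIdHash("user-42") == b)
```

In a DataFrame this value would be added as a column (e.g. via a UDF over userId) before the partitionBy call, and the same function must be applied on the read side to prune buckets when joining.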

On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy <swethakasireddy@gmail.com
> wrote:

> So suppose I have a bunch of userIds and I need to save them as parquet in
> database. I also need to load them back and need to be able to do a join
> on userId. My idea is to partition by userId hashcode first and then on
> userId.
>
>
>
> On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust <michael@databricks.com
> > wrote:
>
>> Can you describe what you are trying to accomplish?  What would the
>> custom partitioner be?
>>
>> On Tue, Feb 16, 2016 at 1:21 PM, SRK <sw...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> How do I use a custom partitioner when I do a saveAsTable in a dataframe.
>>>
>>>
>>> Thanks,
>>> Swetha
>>>
>>>
>>>
>>
>

Re: How to use a custom partitioner in a dataframe in Spark

Posted by swetha kasireddy <sw...@gmail.com>.
So suppose I have a bunch of userIds that I need to save as Parquet in a
database. I also need to load them back and be able to do a join on userId.
My idea is to partition by the userId hashcode first and then on userId.



On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust <mi...@databricks.com>
wrote:

> Can you describe what you are trying to accomplish?  What would the custom
> partitioner be?
>
> On Tue, Feb 16, 2016 at 1:21 PM, SRK <sw...@gmail.com> wrote:
>
>> Hi,
>>
>> How do I use a custom partitioner when I do a saveAsTable in a dataframe.
>>
>>
>> Thanks,
>> Swetha
>>
>>
>>
>

Re: How to use a custom partitioner in a dataframe in Spark

Posted by Michael Armbrust <mi...@databricks.com>.
Can you describe what you are trying to accomplish?  What would the custom
partitioner be?

On Tue, Feb 16, 2016 at 1:21 PM, SRK <sw...@gmail.com> wrote:

> Hi,
>
> How do I use a custom partitioner when I do a saveAsTable in a dataframe.
>
>
> Thanks,
> Swetha
>
>
>