Posted to user@spark.apache.org by Samy Dindane <sa...@dindane.com> on 2016/11/17 19:05:38 UTC

How to load only the data of the last partition

Hi,

I have some data partitioned this way:

/data/year=2016/month=9/version=0
/data/year=2016/month=10/version=0
/data/year=2016/month=10/version=1
/data/year=2016/month=10/version=2
/data/year=2016/month=10/version=3
/data/year=2016/month=11/version=0
/data/year=2016/month=11/version=1

When using this data, I'd like to load only the latest version of each month.

A simple way to do this is to call `load("/data/year=2016/month=11/version=1")` instead of `load("/data")`.
The drawback of this solution is the loss of partitioning information such as `year` and `month`, which means it would not be possible to apply operations based on the year or the month anymore.

Is it possible to ask Spark to load only the latest version of each month? How would you go about this?

Thank you,

Samy
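
The selection described above, keeping only the highest version under each year/month, can be sketched independently of Spark. This is a minimal sketch, assuming the directory names have already been listed into a Python list; the helper name `latest_version_paths` and the variable `listing` are hypothetical and not part of any Spark API.

```python
import re

# Matches the layout from the listing above: .../year=Y/month=M/version=V
_PARTITION_RE = re.compile(r"year=(\d+)/month=(\d+)/version=(\d+)/?$")


def latest_version_paths(paths):
    """Return, for each (year, month), the path with the highest version."""
    latest = {}  # (year, month) -> (version, path)
    for path in paths:
        match = _PARTITION_RE.search(path)
        if match is None:
            continue  # not a version directory, skip it
        year, month, version = (int(group) for group in match.groups())
        key = (year, month)
        # Compare versions as integers so that version=10 beats version=9.
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, path)
    # Sort by (year, month) for a stable, chronological result.
    return [latest[key][1] for key in sorted(latest)]


# The directories from the question (hypothetical variable name).
listing = [
    "/data/year=2016/month=9/version=0",
    "/data/year=2016/month=10/version=0",
    "/data/year=2016/month=10/version=1",
    "/data/year=2016/month=10/version=2",
    "/data/year=2016/month=10/version=3",
    "/data/year=2016/month=11/version=0",
    "/data/year=2016/month=11/version=1",
]

print(latest_version_paths(listing))
```

For the listing in the question this selects month=9/version=0, month=10/version=3 and month=11/version=1. Months and versions are compared as integers because a plain string sort would place month=10 before month=9.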

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: How to load only the data of the last partition

Posted by Rabin Banerjee <de...@gmail.com>.

Hi,

To do that, you can write code that first lists an HDFS directory and
then its sub-directories. With that custom logic, first identify the
latest year/month/version, then read the Avro files in that directory into
a DataFrame, and finally add year/month/version to that DataFrame using
withColumn.

Regards,
R Banerjee
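
The last two steps of this approach (read each directory, then re-attach year/month/version with withColumn) could look roughly like the PySpark sketch below. It assumes the latest directories have already been identified and are passed in as `paths`. The helper names are hypothetical, the format name "avro" is an assumption that depends on the Spark version and Avro package in use, and `unionByName` requires Spark 2.3 or later.

```python
def partition_values(path):
    """Parse 'key=value' path segments into a dict of partition values."""
    values = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only segments of the form key=value are partition segments
            # Assumes integer partition values, as in this thread.
            values[key] = int(value)
    return values


def load_with_partition_columns(spark, paths, fmt="avro"):
    """Read each directory and re-attach its partition values as columns.

    `spark` is an existing SparkSession; `paths` must not be empty.
    """
    # Imported lazily so the parsing helper above works without PySpark.
    from functools import reduce
    from pyspark.sql.functions import lit

    frames = []
    for path in paths:
        df = spark.read.format(fmt).load(path)
        for name, value in partition_values(path).items():
            df = df.withColumn(name, lit(value))
        frames.append(df)
    # unionByName matches columns by name; all directories must share a schema.
    return reduce(lambda left, right: left.unionByName(right), frames)
```

Spark's file-based sources also document a `basePath` option that tells partition discovery where to start. Passing it together with the selected directories may keep `year`, `month` and `version` as columns without any withColumn calls, although whether that applies to the Avro reader in use would need to be checked.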

On Fri, Nov 18, 2016 at 2:41 PM, Samy Dindane <sa...@dindane.com> wrote:

> Thank you Daniel. Unfortunately, we don't use Hive but bare (Avro) files.

Re: How to load only the data of the last partition

Posted by Samy Dindane <sa...@dindane.com>.

Thank you Daniel. Unfortunately, we don't use Hive but bare (Avro) files.


On 11/17/2016 08:47 PM, Daniel Haviv wrote:
> Hi Samy,
> If you're working with Hive, you could create a partitioned table and update its partitions' locations to the latest version, so that when you query it using Spark you always get the latest version.
>
> Daniel

Re: How to load only the data of the last partition

Posted by Daniel Haviv <da...@veracity-group.com>.

Hi Samy,
If you're working with Hive, you could create a partitioned table and update
its partitions' locations to the latest version, so that when you query it
using Spark you always get the latest version.

Daniel
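
As a rough illustration of this suggestion, the helper below builds one Hive `ALTER TABLE ... PARTITION ... SET LOCATION` statement per month. It is a sketch under stated assumptions: the table name `events` and the helper name are hypothetical, the table is assumed to be partitioned by `year` and `month` with those partitions already added, and Hive may require fully qualified URIs (for example with an hdfs:// prefix) rather than the bare paths shown here.

```python
def partition_location_statements(table, latest):
    """Build one ALTER TABLE statement per (year, month) partition.

    `latest` maps (year, month) to the directory holding its newest version.
    """
    statements = []
    for (year, month), location in sorted(latest.items()):
        statements.append(
            f"ALTER TABLE {table} PARTITION (year={year}, month={month}) "
            f"SET LOCATION '{location}'"
        )
    return statements


# Hypothetical table name; locations taken from the listing in the question.
for statement in partition_location_statements(
    "events",
    {
        (2016, 10): "/data/year=2016/month=10/version=3",
        (2016, 11): "/data/year=2016/month=11/version=1",
    },
):
    print(statement)
```

The statements would have to be re-run whenever a new version directory is written, so that each partition keeps pointing at its latest version.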
