Posted to user@spark.apache.org by unk1102 <um...@gmail.com> on 2016/01/07 16:34:41 UTC

How to load specific Hive partition in DataFrame Spark 1.6?

Hi, from Spark 1.6 onwards, as per this doc
<http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery>
we can't load a specific Hive partition into a DataFrame.

In Spark 1.5 the following used to work, and the resulting DataFrame would
have the entity column:

DataFrame df = hiveContext.read().format("orc").load("path/to/table/entity=xyz");

But in Spark 1.6 the above does not work, and I have to give the base path
as follows; the resulting DataFrame then does not contain the entity column
that I want:

DataFrame df = hiveContext.read().format("orc").load("path/to/table/");

How do I load a specific Hive partition into a DataFrame? And what was the
motivation behind removing this feature? It was efficient. Now the Spark 1.6
code above loads all partitions, and if I then filter for the specific
partition it is not efficient: it hits memory limits and throws GC errors,
because thousands of partitions get loaded into memory instead of just the
one I need.
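
The filtering I mean looks roughly like the following (just a sketch; the
column name entity and the value xyz are from my example above):

// Load the whole table, then filter down to one partition. Partition
// discovery still processes every partition first, which is what blows up.
DataFrame df = hiveContext.read().format("orc").load("path/to/table/");
DataFrame xyzOnly = df.filter(df.col("entity").equalTo("xyz"));

Please guide.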




---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How to load specific Hive partition in DataFrame Spark 1.6?

Posted by Yin Huai <yh...@databricks.com>.
No problem! Glad it helped!


Re: How to load specific Hive partition in DataFrame Spark 1.6?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Yin, thanks so much. Your answer solved my problem. Really appreciate it!

Regards



Re: How to load specific Hive partition in DataFrame Spark 1.6?

Posted by Yin Huai <yh...@databricks.com>.
Hi, we made the change because the partitioning discovery logic was too
flexible and it introduced problems that were very confusing to users. To
make your case work, we have introduced a new data source option called
basePath. You can use:

DataFrame df = hiveContext.read().format("orc")
    .option("basePath", "path/to/table/")
    .load("path/to/table/entity=xyz");

So, the partitioning discovery logic will understand that the base path is
path/to/table/ and your DataFrame will have the column "entity".
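
As a quick sanity check (just a sketch, reusing the example path above), you
can confirm the partition column was picked up:

df.printSchema();                       // schema should now include "entity"
df.select("entity").distinct().show();  // should show the single value xyz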

You can find the doc at the end of the partition discovery section of the
SQL programming guide
(http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery).
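
If your table is also registered in the Hive metastore, querying it with SQL
should likewise read only the matching partition, because the partition
predicate is pruned against the metastore (a sketch; my_table is a
hypothetical table name):

// The WHERE clause on the partition column prunes partitions before any
// data files are read.
DataFrame df2 = hiveContext.sql("SELECT * FROM my_table WHERE entity = 'xyz'");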

Thanks,

Yin
