Posted to issues@spark.apache.org by "Umesh Kacha (JIRA)" <ji...@apache.org> on 2016/01/07 19:56:40 UTC

[jira] [Updated] (SPARK-12698) How to load specific Hive partition in DataFrame Spark 1.6?

     [ https://issues.apache.org/jira/browse/SPARK-12698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Umesh Kacha updated SPARK-12698:
--------------------------------
    Description: 
From Spark 1.6 onwards, as per the official docs, we can't load a specific Hive partition into a DataFrame.

In Spark 1.5 the following used to work, and the resulting DataFrame contained the entity column:

DataFrame df = hiveContext.read().format("orc").load("path/to/table/entity=xyz");

In Spark 1.6 the above no longer works: the resulting DataFrame does not contain the entity column that I want. Instead I have to give the base path as follows, but that loads all partitions into the DataFrame, which is highly inefficient:

DataFrame df = hiveContext.read().format("orc").load("path/to/table/");

How do I load a specific Hive partition into a DataFrame? What was the rationale for removing this feature? It was efficient. The Spark 1.6 code above loads all partitions, and even if I then filter for a specific partition it is not efficient: it hits memory limits and throws GC errors because thousands of partitions are loaded into memory instead of just the one I need. Please guide. Thanks in advance.
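One possible workaround in Spark 1.6 is sketched below, hedged rather than definitive: the "basePath" option for partitioned data sources tells partition discovery where the table root is, so loading a single partition directory should keep the partition column. The paths and the "entity" partition column are taken from the snippets above; whether this applies to a given 1.6.x build should be checked against the Spark SQL partition-discovery docs.

```java
// Sketch only: assumes a working HiveContext and an ORC table partitioned by "entity".
// Setting "basePath" to the table root lets partition discovery recognize
// "entity=xyz" as a partition directory, so the "entity" column is retained
// while only that one partition's files are read.
DataFrame df = hiveContext.read()
    .format("orc")
    .option("basePath", "path/to/table/")
    .load("path/to/table/entity=xyz");

// df should now contain only the entity=xyz partition, including the "entity" column.
```

An alternative is to load the base path and filter, e.g. `.filter("entity = 'xyz'")`, relying on the optimizer to prune to one partition; but as the report notes, partition discovery over thousands of directories can itself be costly.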

  was:
From Spark 1.6 onwards, as per the official docs, we can't load a specific Hive partition into a DataFrame.

In Spark 1.5 the following used to work, and the resulting DataFrame contained the entity column:

DataFrame df = hiveContext.read().format("orc").load("path/to/table/entity=xyz");

In Spark 1.6 the above does not work, and I have to give the base path as follows, but then the DataFrame does not contain the entity column that I want:

DataFrame df = hiveContext.read().format("orc").load("path/to/table/");

How do I load a specific Hive partition into a DataFrame? What was the rationale for removing this feature? It was efficient. The Spark 1.6 code above loads all partitions, and even if I then filter for a specific partition it is not efficient: it hits memory limits and throws GC errors because thousands of partitions are loaded into memory instead of just the one I need. Please guide. Thanks in advance.


> How to load specific Hive partition in DataFrame Spark 1.6?
> -----------------------------------------------------------
>
>                 Key: SPARK-12698
>                 URL: https://issues.apache.org/jira/browse/SPARK-12698
>             Project: Spark
>          Issue Type: Question
>          Components: Java API, SQL
>    Affects Versions: 1.6.0
>         Environment: YARN, Hive, Hadoop 2.6
>            Reporter: Umesh Kacha
>            Priority: Blocker
>
> From Spark 1.6 onwards, as per the official docs, we can't load a specific Hive partition into a DataFrame.
> In Spark 1.5 the following used to work, and the resulting DataFrame contained the entity column:
> DataFrame df = hiveContext.read().format("orc").load("path/to/table/entity=xyz");
> In Spark 1.6 the above no longer works: the resulting DataFrame does not contain the entity column that I want. Instead I have to give the base path as follows, but that loads all partitions into the DataFrame, which is highly inefficient:
> DataFrame df = hiveContext.read().format("orc").load("path/to/table/");
> How do I load a specific Hive partition into a DataFrame? What was the rationale for removing this feature? It was efficient. The Spark 1.6 code above loads all partitions, and even if I then filter for a specific partition it is not efficient: it hits memory limits and throws GC errors because thousands of partitions are loaded into memory instead of just the one I need. Please guide. Thanks in advance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org