Posted to user@spark.apache.org by Ahmed Kamal Abdelfatah <ah...@careem.com> on 2017/02/15 12:47:29 UTC

Query data in subdirectories in Hive Partitions using Spark SQL

Hi folks,

How can I force Spark SQL to recursively read data stored in Parquet format from subdirectories? In Hive, I could achieve this by setting a few Hive configs.

set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;

I tried setting these configs through Spark SQL queries, but I always get 0 records, whereas Hive returns the expected results. I also put these settings in the hive-site.xml file, but nothing changed. How can I handle this issue?

Spark version: 2.1.0
Hive version: 2.1.1 on emr-5.3.1

Regards,


Ahmed Kamal
MTS in Data Science
Email: ahmed.abdelfatah@careem.com



Re: Query data in subdirectories in Hive Partitions using Spark SQL

Posted by Jon Gregg <co...@gmail.com>.
Spark has partition discovery if your data is laid out in a
parquet-friendly directory structure:
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery

You can also use wildcards to get subdirectories (I'm using Spark 1.6 here):
>>
data2 = sqlContext.read.load("/my/data/parquetTable/*", "parquet")  # gets all subdirectories
>>
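
As a rough sketch of the wildcard approach when the data sits more than one level deep: a glob only matches the number of levels it spells out, so the pattern needs one "*" per directory level. The helper names nested_glob and read_nested_parquet below are made up for illustration, and the reader argument is assumed to be an existing SparkSession (Spark 2.x) or SQLContext (Spark 1.6), both of which expose .read.parquet().

```python
def nested_glob(base, depth):
    """Build a glob that matches directories exactly `depth` levels below base.

    Hypothetical helper: nested_glob("/my/data/parquetTable", 2)
    gives "/my/data/parquetTable/*/*".
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return "/".join([base.rstrip("/")] + ["*"] * depth)


def read_nested_parquet(reader, base, depth):
    """Load Parquet files found `depth` levels below base.

    `reader` is assumed to be a SparkSession or SQLContext created elsewhere;
    nothing here starts Spark itself.
    """
    return reader.read.parquet(nested_glob(base, depth))
```

If the files sit at mixed depths, one pattern will not cover them all; read.parquet accepts several paths, so one pattern per depth can be passed in a single call.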

Another option would be to CREATE a Hive table on top of your data that uses
PARTITIONED BY to identify the subdirectories, and then use Spark SQL to
query that Hive table.  There might be a cleaner way to do this in Spark
2.0+, but that is a common pattern for me in Spark 1.6 when I know the
directory structure but don't have "=" signs in the paths.
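
A minimal sketch of that Hive-table pattern, assuming a Hive-enabled session and directories whose names carry no "=" signs, so each subdirectory has to be registered as a partition by hand. The table name, columns, partition column, and paths used in the usage note are placeholders, not taken from the original setup, and the function names are made up for illustration.

```python
def create_table_ddl(table, columns, partition_col, location):
    """Build a CREATE EXTERNAL TABLE statement for Parquet data.

    `columns` is a list of (name, type) pairs; the partition column is
    assumed to be a STRING for simplicity.
    """
    cols = ", ".join("{} {}".format(name, typ) for name, typ in columns)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {t} ({c}) "
        "PARTITIONED BY ({p} STRING) "
        "STORED AS PARQUET LOCATION '{l}'"
    ).format(t=table, c=cols, p=partition_col, l=location)


def add_partition_ddl(table, partition_col, value, subdir):
    """Build an ALTER TABLE statement that maps one subdirectory to a partition."""
    return (
        "ALTER TABLE {t} ADD IF NOT EXISTS "
        "PARTITION ({p}='{v}') LOCATION '{s}'"
    ).format(t=table, p=partition_col, v=value, s=subdir)


def register_subdirs(spark, table, columns, partition_col, location, subdirs):
    """Create the table, then register each {partition value: subdirectory}.

    `spark` is assumed to be a session with Hive support that exposes .sql().
    """
    spark.sql(create_table_ddl(table, columns, partition_col, location))
    for value, subdir in sorted(subdirs.items()):
        spark.sql(add_partition_ddl(table, partition_col, value, subdir))
```

For example, register_subdirs(spark, "events", [("id", "BIGINT")], "dt", "/my/data/parquetTable", {"2017-02-15": "/my/data/parquetTable/20170215"}) would leave a table that can then be queried with an ordinary SELECT filtered on dt.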

Jon Gregg

On Fri, Feb 17, 2017 at 7:02 PM, 颜发才(Yan Facai) <fa...@gmail.com> wrote:

> Hi, Abdelfatah,
> How do you read these files? With spark.read.parquet or spark.sql?
> Could you show some code?

Re: Query data in subdirectories in Hive Partitions using Spark SQL

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
Hi, Abdelfatah,
How do you read these files? With spark.read.parquet or spark.sql?
Could you show some code?

