Posted to user@hive.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2019/08/06 07:33:38 UTC

Re: Hive external table not working in sparkSQL when subdirectories are present

Which versions of Spark and Hive are you using?

What will happen if you use Parquet tables instead?

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade <ri...@gmail.com>
wrote:

> Hi.
> I have built a Hive external table on top of a directory 'A' which has
> data stored in ORC format. This directory has several subdirectories inside
> it, each of which contains the actual ORC files.
> These subdirectories are actually created by Spark jobs which ingest data
> from other sources and write it into this directory.
> I tried creating a table and setting its table properties to
> hive.mapred.supports.subdirectories=TRUE and
> mapred.input.dir.recursive=TRUE.
> As a result, when I fire the simplest query, select count(*)
> from ExtTable, via the Hive CLI, it successfully gives me the expected
> count of records in the table.
> However, when I fire the same query via sparkSQL, I get count = 0.
>
> I think sparkSQL isn't able to descend into the subdirectories to
> get the data, while Hive is able to do so.
> Are there any configurations that need to be set on the Spark side so that
> this works as it does via the Hive CLI?
> I am using Spark on YARN.
>
> Thanks,
> Rishikesh
>
> Tags: subdirectories, subdirectory, recursive, recursion, hive external
> table, orc, sparksql, yarn
>
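
For reference, a minimal sketch of the kind of setup described in the quoted
message above (column and path names are hypothetical, not taken from the
thread); the same DDL could equally be issued from the Hive CLI, here shown
as it would be run from spark-shell:

    // Hypothetical external table over a directory whose subdirectories hold the ORC files.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS ExtTable (id BIGINT, payload STRING)
      STORED AS ORC
      LOCATION '/data/A'
      TBLPROPERTIES (
        'hive.mapred.supports.subdirectories'='TRUE',
        'mapred.input.dir.recursive'='TRUE'
      )
    """)

    // The count that returns the expected number via the Hive CLI but 0 via sparkSQL.
    spark.sql("SELECT COUNT(*) FROM ExtTable").show()

Note that these two switches are also commonly applied as session-level
settings via SET rather than as table properties; the thread only says
"table properties".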

Re: Hive external table not working in sparkSQL when subdirectories are present

Posted by Mich Talebzadeh <mi...@gmail.com>.
Have you updated partition statistics by any chance?

I assume you can access the table and data through Hive itself?
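
If statistics are the suspect, a minimal sketch of how they could be
refreshed for a non-partitioned external table (shown from spark-shell here;
the equivalent statement also works from the Hive CLI):

    // Recompute table-level statistics for the table named in the thread.
    spark.sql("ANALYZE TABLE ExtTable COMPUTE STATISTICS")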

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 7 Aug 2019 at 21:07, Patrick McCarthy <pm...@dstillery.com>
wrote:

> Do the permissions on the hive table files on HDFS correspond with what
> the spark user is able to read? This might arise from spark being run as
> different users.
>
> On Wed, Aug 7, 2019 at 3:15 PM Rishikesh Gawade <ri...@gmail.com>
> wrote:
>
>> Hi,
>> I did not explicitly create a Hive Context. I have been using the
>> spark.sqlContext that gets created upon launching the spark-shell.
>> Isn't this sqlContext same as the hiveContext?
>> Thanks,
>> Rishikesh
>>
>> On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke <jo...@gmail.com> wrote:
>>
>>> Do you use the HiveContext in Spark? Do you configure the same options
>>> there? Can you share some code?
>>>
>>> On 07.08.2019 at 08:50, Rishikesh Gawade <rishikeshg1996@gmail.com> wrote:
>>>
>>> Hi.
>>> I am using Spark 2.3.2 and Hive 3.1.0.
>>> Even if i use parquet files the result would be same, because after all
>>> sparkSQL isn't able to descend into the subdirectories over which the table
>>> is created. Could there be any other way?
>>> Thanks,
>>> Rishikesh
>>>
>>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>>> which versions of Spark and Hive are you using.
>>>>
>>>> what will happen if you use parquet tables instead?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade <ri...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi.
>>>>> I have built a Hive external table on top of a directory 'A' which has
>>>>> data stored in ORC format. This directory has several subdirectories inside
>>>>> it, each of which contains the actual ORC files.
>>>>> These subdirectories are actually created by spark jobs which ingest
>>>>> data from other sources and write it into this directory.
>>>>> I tried creating a table and setting the table properties of the same
>>>>> as *hive.mapred.supports.subdirectories=TRUE* and
>>>>> *mapred.input.dir.recursive**=TRUE*.
>>>>> As a result of this, when i fire the simplest query of *select
>>>>> count(*) from ExtTable* via the Hive CLI, it successfully gives me
>>>>> the expected count of records in the table.
>>>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>>>>
>>>>> I think the sparkSQL isn't able to descend into the subdirectories for
>>>>> getting the data while hive is able to do so.
>>>>> Are there any configurations needed to be set on the spark side so
>>>>> that this works as it does via hive cli?
>>>>> I am using Spark on YARN.
>>>>>
>>>>> Thanks,
>>>>> Rishikesh
>>>>>
>>>>> Tags: subdirectories, subdirectory, recursive, recursion, hive
>>>>> external table, orc, sparksql, yarn
>>>>>
>>>>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>

Re: Hive external table not working in sparkSQL when subdirectories are present

Posted by Patrick McCarthy <pm...@dstillery.com>.
Do the permissions on the Hive table files on HDFS correspond with what the
Spark user is able to read? This might arise from Spark being run as
different users.
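
A minimal sketch of how the ownership and permissions could be checked from
the same spark-shell session (the path is hypothetical; use the LOCATION of
the external table), to compare against the user the Spark application runs
as on YARN:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // List permission bits, owner and group of the table directory's children.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.listStatus(new Path("/data/A")).foreach { st =>
      println(s"${st.getPermission} ${st.getOwner}:${st.getGroup} ${st.getPath}")
    }

The same information is visible with hdfs dfs -ls -R on that path.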

On Wed, Aug 7, 2019 at 3:15 PM Rishikesh Gawade <ri...@gmail.com>
wrote:

> Hi,
> I did not explicitly create a Hive Context. I have been using the
> spark.sqlContext that gets created upon launching the spark-shell.
> Isn't this sqlContext same as the hiveContext?
> Thanks,
> Rishikesh
>
> On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke <jo...@gmail.com> wrote:
>
>> Do you use the HiveContext in Spark? Do you configure the same options
>> there? Can you share some code?
>>
>> On 07.08.2019 at 08:50, Rishikesh Gawade <rishikeshg1996@gmail.com> wrote:
>>
>> Hi.
>> I am using Spark 2.3.2 and Hive 3.1.0.
>> Even if i use parquet files the result would be same, because after all
>> sparkSQL isn't able to descend into the subdirectories over which the table
>> is created. Could there be any other way?
>> Thanks,
>> Rishikesh
>>
>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>>> which versions of Spark and Hive are you using.
>>>
>>> what will happen if you use parquet tables instead?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade <ri...@gmail.com>
>>> wrote:
>>>
>>>> Hi.
>>>> I have built a Hive external table on top of a directory 'A' which has
>>>> data stored in ORC format. This directory has several subdirectories inside
>>>> it, each of which contains the actual ORC files.
>>>> These subdirectories are actually created by spark jobs which ingest
>>>> data from other sources and write it into this directory.
>>>> I tried creating a table and setting the table properties of the same
>>>> as *hive.mapred.supports.subdirectories=TRUE* and
>>>> *mapred.input.dir.recursive**=TRUE*.
>>>> As a result of this, when i fire the simplest query of *select
>>>> count(*) from ExtTable* via the Hive CLI, it successfully gives me the
>>>> expected count of records in the table.
>>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>>>
>>>> I think the sparkSQL isn't able to descend into the subdirectories for
>>>> getting the data while hive is able to do so.
>>>> Are there any configurations needed to be set on the spark side so that
>>>> this works as it does via hive cli?
>>>> I am using Spark on YARN.
>>>>
>>>> Thanks,
>>>> Rishikesh
>>>>
>>>> Tags: subdirectories, subdirectory, recursive, recursion, hive external
>>>> table, orc, sparksql, yarn
>>>>
>>>

-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016

Re: Hive external table not working in sparkSQL when subdirectories are present

Posted by Rishikesh Gawade <ri...@gmail.com>.
Hi,
I did not explicitly create a HiveContext. I have been using the
spark.sqlContext that gets created upon launching the spark-shell.
Isn't this sqlContext the same as the HiveContext?
Thanks,
Rishikesh
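
In Spark 2.x the old HiveContext is folded into SparkSession, so one way to
confirm that the spark-shell session really has Hive support is sketched
below (the config lookup is an assumption worth verifying on your build):

    // "hive" means the session talks to the Hive metastore; "in-memory" means it does not.
    println(spark.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory"))

    // When building a session outside spark-shell, Hive support must be requested explicitly.
    import org.apache.spark.sql.SparkSession
    val sparkWithHive = SparkSession.builder()
      .appName("ext-table-subdir-check")
      .enableHiveSupport()
      .getOrCreate()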

On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke <jo...@gmail.com> wrote:

> Do you use the HiveContext in Spark? Do you configure the same options
> there? Can you share some code?
>
> On 07.08.2019 at 08:50, Rishikesh Gawade <rishikeshg1996@gmail.com> wrote:
>
> Hi.
> I am using Spark 2.3.2 and Hive 3.1.0.
> Even if i use parquet files the result would be same, because after all
> sparkSQL isn't able to descend into the subdirectories over which the table
> is created. Could there be any other way?
> Thanks,
> Rishikesh
>
> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> which versions of Spark and Hive are you using.
>>
>> what will happen if you use parquet tables instead?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade <ri...@gmail.com>
>> wrote:
>>
>>> Hi.
>>> I have built a Hive external table on top of a directory 'A' which has
>>> data stored in ORC format. This directory has several subdirectories inside
>>> it, each of which contains the actual ORC files.
>>> These subdirectories are actually created by spark jobs which ingest
>>> data from other sources and write it into this directory.
>>> I tried creating a table and setting the table properties of the same as
>>> *hive.mapred.supports.subdirectories=TRUE* and
>>> *mapred.input.dir.recursive**=TRUE*.
>>> As a result of this, when i fire the simplest query of *select count(*)
>>> from ExtTable* via the Hive CLI, it successfully gives me the expected
>>> count of records in the table.
>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>>
>>> I think the sparkSQL isn't able to descend into the subdirectories for
>>> getting the data while hive is able to do so.
>>> Are there any configurations needed to be set on the spark side so that
>>> this works as it does via hive cli?
>>> I am using Spark on YARN.
>>>
>>> Thanks,
>>> Rishikesh
>>>
>>> Tags: subdirectories, subdirectory, recursive, recursion, hive external
>>> table, orc, sparksql, yarn
>>>
>>

Re: Hive external table not working in sparkSQL when subdirectories are present

Posted by Jörn Franke <jo...@gmail.com>.
Do you use the HiveContext in Spark? Do you configure the same options there? Can you share some code?
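
Along those lines, a sketch of what configuring the same options from
spark-shell could look like (not a confirmed fix; the last setting,
spark.sql.hive.convertMetastoreOrc, controls whether Spark uses its native
ORC reader or the Hive SerDe for metastore ORC tables and is only an
assumption about what may be relevant here):

    spark.sql("SET hive.mapred.supports.subdirectories=true")
    spark.sql("SET mapred.input.dir.recursive=true")
    // Newer Hadoop releases spell the same switch like this:
    spark.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
    // Fall back to the Hive SerDe path instead of Spark's built-in ORC reader.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")

    spark.sql("SELECT COUNT(*) FROM ExtTable").show()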

> On 07.08.2019 at 08:50, Rishikesh Gawade <ri...@gmail.com> wrote:
> 
> Hi.
> I am using Spark 2.3.2 and Hive 3.1.0. 
> Even if i use parquet files the result would be same, because after all sparkSQL isn't able to descend into the subdirectories over which the table is created. Could there be any other way?
> Thanks,
> Rishikesh
> 
>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh <mi...@gmail.com> wrote:
>> which versions of Spark and Hive are you using.
>> 
>> what will happen if you use parquet tables instead?
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>  
>> 
>> 
>>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade <ri...@gmail.com> wrote:
>>> Hi.
>>> I have built a Hive external table on top of a directory 'A' which has data stored in ORC format. This directory has several subdirectories inside it, each of which contains the actual ORC files.
>>> These subdirectories are actually created by spark jobs which ingest data from other sources and write it into this directory.
>>> I tried creating a table and setting the table properties of the same as hive.mapred.supports.subdirectories=TRUE and mapred.input.dir.recursive=TRUE.
>>> As a result of this, when i fire the simplest query of select count(*) from ExtTable via the Hive CLI, it successfully gives me the expected count of records in the table.
>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>> 
>>> I think the sparkSQL isn't able to descend into the subdirectories for getting the data while hive is able to do so.
>>> Are there any configurations needed to be set on the spark side so that this works as it does via hive cli? 
>>> I am using Spark on YARN.
>>> 
>>> Thanks,
>>> Rishikesh
>>> 
>>> Tags: subdirectories, subdirectory, recursive, recursion, hive external table, orc, sparksql, yarn

Re: Hive external table not working in sparkSQL when subdirectories are present

Posted by Rishikesh Gawade <ri...@gmail.com>.
Hi.
I am using Spark 2.3.2 and Hive 3.1.0.
Even if I use Parquet files the result would be the same, because after all
sparkSQL isn't able to descend into the subdirectories over which the table
is created. Could there be any other way?
Thanks,
Rishikesh
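
A workaround that may be worth trying while the table-level behaviour is
sorted out, sketched under the assumption that the ORC files sit one level
below the table directory (path hypothetical):

    // Bypass the metastore table and read the ORC files directly with a wildcard.
    val df = spark.read.orc("/data/A/*")
    println(df.count())

    // Optionally expose it to SQL under a temporary name.
    df.createOrReplaceTempView("ExtTableDirect")
    spark.sql("SELECT COUNT(*) FROM ExtTableDirect").show()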

On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh <mi...@gmail.com>
wrote:

> which versions of Spark and Hive are you using.
>
> what will happen if you use parquet tables instead?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade <ri...@gmail.com>
> wrote:
>
>> Hi.
>> I have built a Hive external table on top of a directory 'A' which has
>> data stored in ORC format. This directory has several subdirectories inside
>> it, each of which contains the actual ORC files.
>> These subdirectories are actually created by spark jobs which ingest data
>> from other sources and write it into this directory.
>> I tried creating a table and setting the table properties of the same as
>> *hive.mapred.supports.subdirectories=TRUE* and
>> *mapred.input.dir.recursive**=TRUE*.
>> As a result of this, when i fire the simplest query of *select count(*)
>> from ExtTable* via the Hive CLI, it successfully gives me the expected
>> count of records in the table.
>> However, when i fire the same query via sparkSQL, i get count = 0.
>>
>> I think the sparkSQL isn't able to descend into the subdirectories for
>> getting the data while hive is able to do so.
>> Are there any configurations needed to be set on the spark side so that
>> this works as it does via hive cli?
>> I am using Spark on YARN.
>>
>> Thanks,
>> Rishikesh
>>
>> Tags: subdirectories, subdirectory, recursive, recursion, hive external
>> table, orc, sparksql, yarn
>>
>