Posted to user@hive.apache.org by "@Sanjiv Singh" <sa...@gmail.com> on 2016/02/22 04:27:20 UTC

Spark SQL is not returning records for hive bucketed tables on HDP

Hi,


I have observed that Spark SQL is not returning records for Hive bucketed
ORC tables on HDP.


In Spark SQL, I am able to list all tables, but queries on the Hive bucketed
tables return no records.

I have also tried the same queries on non-bucketed Hive tables; they work fine.



The same queries work fine on a plain Apache Hive setup.
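
For reference, this is roughly how I am running the queries from the Spark shell
(just a sketch; the database and table names below are placeholders for the real
bucketed ORC tables):

scala> // Spark shell with Hive support; sqlContext here is a HiveContext
scala> sqlContext.sql("show tables in mydb").show()                        // tables are listed fine
scala> sqlContext.sql("select count(*) from mydb.my_bucketed_orc").show()  // comes back as 0
scala> sqlContext.table("mydb.my_bucketed_orc").count                      // same result via the table API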

Let me know if you need any other details.

Regards
Sanjiv Singh
Mob :  +091 9990-447-339

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by "@Sanjiv Singh" <sa...@gmail.com>.
Yes, it is very strange, and quite the opposite of what I believed about Spark SQL
on Hive tables.

I am facing this issue on the HDP setup, where compaction is required only
once.
The Apache setup, on the other hand, doesn't require compaction even once.

Maybe something got triggered in the metastore after that compaction, and Spark SQL
started recognizing the delta files.
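
If it helps with finding the root cause, this is how I am comparing the two setups
(just a sketch; these are the standard Hive ACID statements, nothing HDP-specific,
and the path is the same warehouse directory shown later in the thread):

hive> SHOW COMPACTIONS;    -- compaction requests and their state (initiated / working / ready for cleaning)
hive> SHOW TRANSACTIONS;   -- open and aborted transactions known to the metastore

> hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable
(to check whether a base_N directory exists alongside the delta_N_N directories)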

Let me know if you need other details to get to the root cause.



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Feb 23, 2016 at 2:28 PM, Varadharajan Mukundan <srinathsmn@gmail.com
> wrote:

> That's interesting. I'm not sure why the first compaction is needed but not one
> for the subsequent inserts. Maybe it's just there to create some metadata. Thanks
> for clarifying this :)
>
> On Tue, Feb 23, 2016 at 2:15 PM, @Sanjiv Singh <sa...@gmail.com>
> wrote:
>
>> Try this,
>>
>>
>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>> hive> insert into default.foo values(10);
>>
>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>> because data is still in delta files
>>
>> Now run major compaction:
>>
>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>
>> scala> sqlContext.table("default.foo").count // Gives 1
>>
>> hive> insert into foo values(20);
>>
>> scala> sqlContext.table("default.foo").count // Gives 2, no compaction
>> required.
>>
>>
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan <
>> srinathsmn@gmail.com> wrote:
>>
>>> This is the scenario I'm mentioning. I'm not using Spark JDBC; not sure
>>> if that makes a difference.
>>>
>>> Please walk through the commands below in the same order to understand
>>> the sequence.
>>>
>>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>>> hive> insert into foo values(10);
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>>> because data is still in delta files
>>>
>>> Now run major compaction:
>>>
>>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 1
>>>
>>>
>>> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Varadharajan,
>>>>
>>>>
>>>> That is the point: Spark SQL is able to recognize delta files. See the
>>>> directory structure below, ONE BASE (43 records) and one DELTA (created
>>>> after the last insert). And I am able to see that last insert through Spark SQL.
>>>>
>>>>
>>>> See the complete scenario below.
>>>>
>>>> Steps:
>>>>
>>>>    - Inserted 43 records into the table.
>>>>    - Ran a major compaction on the table:
>>>>      alter table mytable COMPACT 'major';
>>>>    - Disabled auto compaction on the table:
>>>>      alter table mytable set
>>>>      TBLPROPERTIES("NO_AUTO_COMPACTION"="true");
>>>>    - Inserted 1 record into the table.
>>>>
>>>>
>>>> > hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable
>>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:43
>>>> /apps/hive/warehouse/mydb.db/mytable/base_0000087
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 12:02
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>>>>
>>>> SPARK JDBC:
>>>>
>>>> 0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
>>>> +------+
>>>> | _c0  |
>>>> +------+
>>>> | 44   |
>>>> +------+
>>>> 1 row selected (1.196 seconds)
>>>>
>>>> HIVE JDBC:
>>>>
>>>> 1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
>>>> +------+--+
>>>> | _c0  |
>>>> +------+--+
>>>> | 44   |
>>>> +------+--+
>>>> 1 row selected (0.121 seconds)
>>>>
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob :  +091 9990-447-339
>>>>
>>>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
>>>> srinathsmn@gmail.com> wrote:
>>>>
>>>>> Hi Sanjiv,
>>>>>
>>>>> Yes. If we make use of Hive JDBC we should be able to retrieve all
>>>>> the rows, since it is Hive that processes the query. But I think the
>>>>> problem with Hive JDBC is that there are two layers of processing, Hive and
>>>>> then Spark working on the result set. Another issue is that performance is
>>>>> limited to that single HiveServer2 node and its network.
>>>>>
>>>>> But if we make use of the sqlContext.table function in Spark to access
>>>>> Hive tables, it is supposed to read files directly from HDFS, skipping the
>>>>> Hive layer. But it doesn't read delta files and only reads the contents
>>>>> of the base folder. Only after a major compaction are the delta files
>>>>> merged into the base folder and become visible to Spark SQL.
>>>>>
>>>>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <
>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>
>>>>>> Hi Varadharajan,
>>>>>>
>>>>>> Can you elaborate on this (quoted from your previous mail):
>>>>>> "I observed that hive transaction storage structure do not work with
>>>>>> spark yet"
>>>>>>
>>>>>>
>>>>>> If this refers to the delta files created after each transaction, which
>>>>>> Spark would not be able to recognize, then note that I have a table
>>>>>> mytable (ORC, BUCKETED, NON-SORTED) that has already seen lots of
>>>>>> inserts, updates and deletes. I can see delta files created in HDFS (see
>>>>>> below), yet I am still able to fetch consistent records through Spark JDBC
>>>>>> and Hive JDBC.
>>>>>>
>>>>>> No compaction has been triggered for that table.
>>>>>>
>>>>>> > hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable
>>>>>>
>>>>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Sanjiv Singh
>>>>>> Mob :  +091 9990-447-339
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <
>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>
>>>>>>> Actually, auto compaction, if enabled, is triggered based on the
>>>>>>> volume of changes. It doesn't automatically run after every insert. I think
>>>>>>> it's possible to reduce the thresholds, but that might reduce performance by
>>>>>>> a big margin. As of now, we run compaction after the batch insert completes.
>>>>>>>
>>>>>>> The only other way to solve this problem as of now is to use the Hive
>>>>>>> JDBC API.
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <
>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>
>>>>>>>> Compaction should have been triggered automatically, since the following
>>>>>>>> properties are already set in hive-site.xml and the
>>>>>>>> NO_AUTO_COMPACTION property has not been set for these tables.
>>>>>>>>
>>>>>>>>
>>>>>>>>     <property>
>>>>>>>>
>>>>>>>>       <name>hive.compactor.initiator.on</name>
>>>>>>>>
>>>>>>>>       <value>true</value>
>>>>>>>>
>>>>>>>>     </property>
>>>>>>>>
>>>>>>>>     <property>
>>>>>>>>
>>>>>>>>       <name>hive.compactor.worker.threads</name>
>>>>>>>>
>>>>>>>>       <value>1</value>
>>>>>>>>
>>>>>>>>     </property>
>>>>>>>>
>>>>>>>>
>>>>>>>> The documentation is disappointing sometimes.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Sanjiv Singh
>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>>>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes, I was burned by this issue a couple of weeks back. This
>>>>>>>>> also means that after every insert job, a compaction has to be run before the
>>>>>>>>> new rows are visible from Spark. Sad that this issue is not documented /
>>>>>>>>> mentioned anywhere.
>>>>>>>>>
>>>>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <
>>>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Varadharajan,
>>>>>>>>>>
>>>>>>>>>> Thanks for your response.
>>>>>>>>>>
>>>>>>>>>> Yes, it is a transactional table; see the show create table output below.
>>>>>>>>>>
>>>>>>>>>> The table has barely 3 records, and after triggering a major
>>>>>>>>>> compaction on it, it started showing results in Spark SQL.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> > ALTER TABLE hivespark COMPACT 'major';
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> > show create table hivespark;
>>>>>>>>>>
>>>>>>>>>>   CREATE TABLE `hivespark`(
>>>>>>>>>>
>>>>>>>>>>     `id` int,
>>>>>>>>>>
>>>>>>>>>>     `name` string)
>>>>>>>>>>
>>>>>>>>>>   CLUSTERED BY (
>>>>>>>>>>
>>>>>>>>>>     id)
>>>>>>>>>>
>>>>>>>>>>   INTO 32 BUCKETS
>>>>>>>>>>
>>>>>>>>>>   ROW FORMAT SERDE
>>>>>>>>>>
>>>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>>>>>
>>>>>>>>>>   STORED AS INPUTFORMAT
>>>>>>>>>>
>>>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>>>>>
>>>>>>>>>>   OUTPUTFORMAT
>>>>>>>>>>
>>>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>>>>>
>>>>>>>>>>   LOCATION
>>>>>>>>>>
>>>>>>>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>>>>>   TBLPROPERTIES (
>>>>>>>>>>
>>>>>>>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>>>>>>>
>>>>>>>>>>     'last_modified_by'='root',
>>>>>>>>>>
>>>>>>>>>>     'last_modified_time'='1455859079',
>>>>>>>>>>
>>>>>>>>>>     'numFiles'='37',
>>>>>>>>>>
>>>>>>>>>>     'numRows'='3',
>>>>>>>>>>
>>>>>>>>>>     'rawDataSize'='0',
>>>>>>>>>>
>>>>>>>>>>     'totalSize'='11383',
>>>>>>>>>>
>>>>>>>>>>     'transactional'='true',
>>>>>>>>>>
>>>>>>>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Sanjiv Singh
>>>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>>>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Is the transactional attribute set on your table? I observed that
>>>>>>>>>>> the Hive transactional storage structure does not work with Spark yet. You can
>>>>>>>>>>> confirm this by looking at the transactional attribute in the output of
>>>>>>>>>>> "desc extended <tablename>" in the Hive console.
>>>>>>>>>>>
>>>>>>>>>>> If you need to access a transactional table, consider doing a
>>>>>>>>>>> major compaction and then try accessing the table again.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <
>>>>>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I have observed that Spark SQL is not returning records for
>>>>>>>>>>>> hive bucketed ORC tables on HDP.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On spark SQL , I am able to list all tables , but queries on
>>>>>>>>>>>> hive bucketed tables are not returning records.
>>>>>>>>>>>>
>>>>>>>>>>>> I have also tried the same for non-bucketed hive tables. it is
>>>>>>>>>>>> working fine.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Same is working on plain Apache setup.
>>>>>>>>>>>>
>>>>>>>>>>>> Let me know if needs other details.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Sanjiv Singh
>>>>>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks,
>>>>>>>>>>> M. Varadharajan
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>>>>>
>>>>>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks,
>>>>>>>>> M. Varadharajan
>>>>>>>>>
>>>>>>>>> ------------------------------------------------
>>>>>>>>>
>>>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>>>
>>>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> M. Varadharajan
>>>>>>>
>>>>>>> ------------------------------------------------
>>>>>>>
>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>
>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> M. Varadharajan
>>>>>
>>>>> ------------------------------------------------
>>>>>
>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>
>>>>> My Journal :- http://varadharajan.in
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> ------------------------------------------------
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by Varadharajan Mukundan <sr...@gmail.com>.
That's interesting. I'm not sure why the first compaction is needed but not one
for the subsequent inserts. Maybe it's just there to create some metadata. Thanks for
clarifying this :)
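
One way to poke at that guess (just a sketch, reusing the default.foo table from the
example quoted below; the exact listing will differ per cluster) is to watch what the
first compaction leaves behind:

hive> desc formatted default.foo;   -- note numFiles and the table properties before and after the compaction

> hadoop fs -ls /apps/hive/warehouse/foo
(before the first compaction only delta_* directories exist; after it a base_N
directory appears, and later inserts add new delta_* directories on top of it)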

On Tue, Feb 23, 2016 at 2:15 PM, @Sanjiv Singh <sa...@gmail.com>
wrote:

> Try this,
>
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets
> STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into default.foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
> because data is still in delta files
>
> Now run major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
> hive> insert into foo values(20);
>
> scala> sqlContext.table("default.foo").count* // Gives 2 , no compaction
> required.*
>
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan <
> srinathsmn@gmail.com> wrote:
>
>> This is the scenario i'm mentioning.. I'm not using Spark JDBC. Not sure
>> if its different.
>>
>> Please walkthrough the below commands in the same order to understand the
>> sequence.
>>
>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>> hive> insert into foo values(10);
>>
>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>> because data is still in delta files
>>
>> Now run major compaction:
>>
>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>
>> scala> sqlContext.table("default.foo").count // Gives 1
>>
>>
>> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sa...@gmail.com>
>> wrote:
>>
>>> Hi Varadharajan,
>>>
>>>
>>> That is the point, Spark SQL is able to recognize delta files. See below
>>> directory structure, ONE BASE (43 records) and one DELTA (created after
>>> last insert). And I am able see last insert through Spark SQL.
>>>
>>>
>>> *See below complete scenario :*
>>>
>>> *Steps:*
>>>
>>>    - Inserted 43 records in table.
>>>    - Run major compaction on table.
>>>    - *alter table mytable COMPACT 'major';*
>>>    - Disabled auto compaction on table.
>>>    - *alter table mytable set
>>>       TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
>>>    - Inserted 1 record in table.
>>>
>>>
>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:43
>>> /apps/hive/warehouse/mydb.db/mytable/base_0000087
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 12:02
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>>>
>>> *SPARK JDBC :*
>>>
>>> 0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
>>> +------+
>>> | _c0  |
>>> +------+
>>> | 44   |
>>> +------+
>>> 1 row selected (1.196 seconds)
>>>
>>> *HIVE JDBC :*
>>>
>>> 1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
>>> +------+--+
>>> | _c0  |
>>> +------+--+
>>> | 44   |
>>> +------+--+
>>> 1 row selected (0.121 seconds)
>>>
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob :  +091 9990-447-339
>>>
>>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
>>> srinathsmn@gmail.com> wrote:
>>>
>>>> Hi Sanjiv,
>>>>
>>>> Yes.. If we make use of Hive JDBC we should be able to retrieve all the
>>>> rows since it is hive which processes the query. But i think the problem
>>>> with Hive JDBC is that there are two layers of processing, hive and then at
>>>> spark with the result set. And another one is performance is limited to
>>>> that single HiveServer2 node and network.
>>>>
>>>> But If we make use of sqlContext.table function in spark to access hive
>>>> tables, it is supposed to read files directly from HDFS skipping the hive
>>>> layer. But it doesn't read delta files and just reads the contents from
>>>> base folder. Only after Major compaction, the delta files would be merged
>>>> with based folder and be visible for Spark SQL
>>>>
>>>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sanjiv.is.on@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Varadharajan,
>>>>>
>>>>> Can you elaborate on (you quoted on previous mail) :
>>>>> "I observed that hive transaction storage structure do not work with
>>>>> spark yet"
>>>>>
>>>>>
>>>>> If it is related to delta files created after each transaction and
>>>>> spark would not be able recognize them. then I have a table *mytable *(ORC
>>>>> , BUCKETED , NON-SORTED) , already done lots on insert , update and
>>>>> deletes. I can see delta files created in HDFS (see below), Still able to
>>>>> fetch consistent records through Spark JDBC and HIVE JDBC.
>>>>>
>>>>> Not compaction triggered for that table.
>>>>>
>>>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>>>>
>>>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>>>
>>>>>
>>>>>
>>>>> Regards
>>>>> Sanjiv Singh
>>>>> Mob :  +091 9990-447-339
>>>>>
>>>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <
>>>>> srinathsmn@gmail.com> wrote:
>>>>>
>>>>>> Actually the auto compaction if enabled is triggered based on the
>>>>>> volume of changes. It doesn't automatically run after every insert. I think
>>>>>> its possible to reduce the thresholds but that might reduce performance by
>>>>>> a big margin. As of now, we do compaction after the batch insert completes.
>>>>>>
>>>>>> The only other way to solve this problem as of now is to use Hive
>>>>>> JDBC API.
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <
>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>
>>>>>>> Compaction would have been triggered automatically as following
>>>>>>> properties already set in *hive-site.xml*. and also
>>>>>>> *NO_AUTO_COMPACTION* property not been set for these tables.
>>>>>>>
>>>>>>>
>>>>>>>     <property>
>>>>>>>
>>>>>>>       <name>hive.compactor.initiator.on</name>
>>>>>>>
>>>>>>>       <value>true</value>
>>>>>>>
>>>>>>>     </property>
>>>>>>>
>>>>>>>     <property>
>>>>>>>
>>>>>>>       <name>hive.compactor.worker.threads</name>
>>>>>>>
>>>>>>>       <value>1</value>
>>>>>>>
>>>>>>>     </property>
>>>>>>>
>>>>>>>
>>>>>>> Documentation is upset sometimes.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Regards
>>>>>>> Sanjiv Singh
>>>>>>> Mob :  +091 9990-447-339
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, I was burned down by this issue couple of weeks back. This
>>>>>>>> also means that after every insert job, compaction should be run to access
>>>>>>>> new rows from Spark. Sad that this issue is not documented / mentioned
>>>>>>>> anywhere.
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <
>>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Varadharajan,
>>>>>>>>>
>>>>>>>>> Thanks for your response.
>>>>>>>>>
>>>>>>>>> Yes it is transnational table; See below *show create table. *
>>>>>>>>>
>>>>>>>>> Table hardly have 3 records , and after triggering minor
>>>>>>>>> compaction on tables , it start showing results on spark SQL.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > *show create table hivespark;*
>>>>>>>>>
>>>>>>>>>   CREATE TABLE `hivespark`(
>>>>>>>>>
>>>>>>>>>     `id` int,
>>>>>>>>>
>>>>>>>>>     `name` string)
>>>>>>>>>
>>>>>>>>>   CLUSTERED BY (
>>>>>>>>>
>>>>>>>>>     id)
>>>>>>>>>
>>>>>>>>>   INTO 32 BUCKETS
>>>>>>>>>
>>>>>>>>>   ROW FORMAT SERDE
>>>>>>>>>
>>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>>>>
>>>>>>>>>   STORED AS INPUTFORMAT
>>>>>>>>>
>>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>>>>
>>>>>>>>>   OUTPUTFORMAT
>>>>>>>>>
>>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>>>>
>>>>>>>>>   LOCATION
>>>>>>>>>
>>>>>>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>>>>   TBLPROPERTIES (
>>>>>>>>>
>>>>>>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>>>>>>
>>>>>>>>>     'last_modified_by'='root',
>>>>>>>>>
>>>>>>>>>     'last_modified_time'='1455859079',
>>>>>>>>>
>>>>>>>>>     'numFiles'='37',
>>>>>>>>>
>>>>>>>>>     'numRows'='3',
>>>>>>>>>
>>>>>>>>>     'rawDataSize'='0',
>>>>>>>>>
>>>>>>>>>     'totalSize'='11383',
>>>>>>>>>
>>>>>>>>>     'transactional'='true',
>>>>>>>>>
>>>>>>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Sanjiv Singh
>>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>>
>>>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Is the transaction attribute set on your table? I observed that
>>>>>>>>>> hive transaction storage structure do not work with spark yet. You can
>>>>>>>>>> confirm this by looking at the transactional attribute in the output of
>>>>>>>>>> "desc extended <tablename>" in hive console.
>>>>>>>>>>
>>>>>>>>>> If you'd need to access transactional table, consider doing a
>>>>>>>>>> major compaction and then try accessing the tables
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <
>>>>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I have observed that Spark SQL is not returning records for hive
>>>>>>>>>>> bucketed ORC tables on HDP.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On spark SQL , I am able to list all tables , but queries on
>>>>>>>>>>> hive bucketed tables are not returning records.
>>>>>>>>>>>
>>>>>>>>>>> I have also tried the same for non-bucketed hive tables. it is
>>>>>>>>>>> working fine.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Same is working on plain Apache setup.
>>>>>>>>>>>
>>>>>>>>>>> Let me know if needs other details.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Sanjiv Singh
>>>>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks,
>>>>>>>>>> M. Varadharajan
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>>>>
>>>>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks,
>>>>>>>> M. Varadharajan
>>>>>>>>
>>>>>>>> ------------------------------------------------
>>>>>>>>
>>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>>
>>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> M. Varadharajan
>>>>>>
>>>>>> ------------------------------------------------
>>>>>>
>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>
>>>>>> My Journal :- http://varadharajan.in
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> M. Varadharajan
>>>>
>>>> ------------------------------------------------
>>>>
>>>> "Experience is what you get when you didn't get what you wanted"
>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>
>>>> My Journal :- http://varadharajan.in
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>> M. Varadharajan
>>
>> ------------------------------------------------
>>
>> "Experience is what you get when you didn't get what you wanted"
>>                -By Prof. Randy Pausch in "The Last Lecture"
>>
>> My Journal :- http://varadharajan.in
>>
>
>


-- 
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
               -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by "@Sanjiv Singh" <sa...@gmail.com>.
Try this,


hive> create table default.foo(id int) clustered by (id) into 2 buckets
STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
because data is still in delta files

Now run major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1

hive> insert into foo values(20);

scala> sqlContext.table("default.foo").count // Gives 2, no compaction
required.
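
If reading through Spark directly still misses the deltas on some setup, the fallback
mentioned earlier in the thread is to let Hive do the read. A rough sketch of going
through HiveServer2 with plain JDBC from the Spark shell (host, port and user are
placeholders, and it assumes the hive-jdbc driver jar is on the classpath):

scala> Class.forName("org.apache.hive.jdbc.HiveDriver")
scala> val conn = java.sql.DriverManager.getConnection("jdbc:hive2://myhost:10000/default", "hive", "")
scala> val rs = conn.createStatement().executeQuery("select count(*) from default.foo")
scala> rs.next(); rs.getLong(1)   // HiveServer2 merges base and delta files, so this sees every row
scala> conn.close()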




Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan <srinathsmn@gmail.com
> wrote:

> This is the scenario i'm mentioning.. I'm not using Spark JDBC. Not sure
> if its different.
>
> Please walkthrough the below commands in the same order to understand the
> sequence.
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets
> STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
> because data is still in delta files
>
> Now run major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
>
> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sa...@gmail.com>
> wrote:
>
>> Hi Varadharajan,
>>
>>
>> That is the point, Spark SQL is able to recognize delta files. See below
>> directory structure, ONE BASE (43 records) and one DELTA (created after
>> last insert). And I am able see last insert through Spark SQL.
>>
>>
>> *See below complete scenario :*
>>
>> *Steps:*
>>
>>    - Inserted 43 records in table.
>>    - Run major compaction on table.
>>    - *alter table mytable COMPACT 'major';*
>>    - Disabled auto compaction on table.
>>    - *alter table mytable set
>>       TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
>>    - Inserted 1 record in table.
>>
>>
>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:43
>> /apps/hive/warehouse/mydb.db/mytable/base_0000087
>> drwxr-xr-x   - root hdfs          0 2016-02-23 12:02
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>>
>> *SPARK JDBC :*
>>
>> 0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
>> +------+
>> | _c0  |
>> +------+
>> | 44   |
>> +------+
>> 1 row selected (1.196 seconds)
>>
>> *HIVE JDBC :*
>>
>> 1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
>> +------+--+
>> | _c0  |
>> +------+--+
>> | 44   |
>> +------+--+
>> 1 row selected (0.121 seconds)
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
>> srinathsmn@gmail.com> wrote:
>>
>>> Hi Sanjiv,
>>>
>>> Yes.. If we make use of Hive JDBC we should be able to retrieve all the
>>> rows since it is hive which processes the query. But i think the problem
>>> with Hive JDBC is that there are two layers of processing, hive and then at
>>> spark with the result set. And another one is performance is limited to
>>> that single HiveServer2 node and network.
>>>
>>> But If we make use of sqlContext.table function in spark to access hive
>>> tables, it is supposed to read files directly from HDFS skipping the hive
>>> layer. But it doesn't read delta files and just reads the contents from
>>> base folder. Only after Major compaction, the delta files would be merged
>>> with based folder and be visible for Spark SQL
>>>
>>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Varadharajan,
>>>>
>>>> Can you elaborate on (you quoted on previous mail) :
>>>> "I observed that hive transaction storage structure do not work with
>>>> spark yet"
>>>>
>>>>
>>>> If it is related to delta files created after each transaction and
>>>> spark would not be able recognize them. then I have a table *mytable *(ORC
>>>> , BUCKETED , NON-SORTED) , already done lots on insert , update and
>>>> deletes. I can see delta files created in HDFS (see below), Still able to
>>>> fetch consistent records through Spark JDBC and HIVE JDBC.
>>>>
>>>> Not compaction triggered for that table.
>>>>
>>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>>>
>>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>>
>>>>
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob :  +091 9990-447-339
>>>>
>>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <
>>>> srinathsmn@gmail.com> wrote:
>>>>
>>>>> Actually the auto compaction if enabled is triggered based on the
>>>>> volume of changes. It doesn't automatically run after every insert. I think
>>>>> its possible to reduce the thresholds but that might reduce performance by
>>>>> a big margin. As of now, we do compaction after the batch insert completes.
>>>>>
>>>>> The only other way to solve this problem as of now is to use Hive JDBC
>>>>> API.
>>>>>
>>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <
>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>
>>>>>> Compaction would have been triggered automatically as following
>>>>>> properties already set in *hive-site.xml*. and also
>>>>>> *NO_AUTO_COMPACTION* property not been set for these tables.
>>>>>>
>>>>>>
>>>>>>     <property>
>>>>>>
>>>>>>       <name>hive.compactor.initiator.on</name>
>>>>>>
>>>>>>       <value>true</value>
>>>>>>
>>>>>>     </property>
>>>>>>
>>>>>>     <property>
>>>>>>
>>>>>>       <name>hive.compactor.worker.threads</name>
>>>>>>
>>>>>>       <value>1</value>
>>>>>>
>>>>>>     </property>
>>>>>>
>>>>>>
>>>>>> Documentation is upset sometimes.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Sanjiv Singh
>>>>>> Mob :  +091 9990-447-339
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, I was burned down by this issue couple of weeks back. This also
>>>>>>> means that after every insert job, compaction should be run to access new
>>>>>>> rows from Spark. Sad that this issue is not documented / mentioned anywhere.
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <
>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Varadharajan,
>>>>>>>>
>>>>>>>> Thanks for your response.
>>>>>>>>
>>>>>>>> Yes it is transnational table; See below *show create table. *
>>>>>>>>
>>>>>>>> Table hardly have 3 records , and after triggering minor compaction
>>>>>>>> on tables , it start showing results on spark SQL.
>>>>>>>>
>>>>>>>>
>>>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>>>
>>>>>>>>
>>>>>>>> > *show create table hivespark;*
>>>>>>>>
>>>>>>>>   CREATE TABLE `hivespark`(
>>>>>>>>
>>>>>>>>     `id` int,
>>>>>>>>
>>>>>>>>     `name` string)
>>>>>>>>
>>>>>>>>   CLUSTERED BY (
>>>>>>>>
>>>>>>>>     id)
>>>>>>>>
>>>>>>>>   INTO 32 BUCKETS
>>>>>>>>
>>>>>>>>   ROW FORMAT SERDE
>>>>>>>>
>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>>>
>>>>>>>>   STORED AS INPUTFORMAT
>>>>>>>>
>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>>>
>>>>>>>>   OUTPUTFORMAT
>>>>>>>>
>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>>>
>>>>>>>>   LOCATION
>>>>>>>>
>>>>>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>>>   TBLPROPERTIES (
>>>>>>>>
>>>>>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>>>>>
>>>>>>>>     'last_modified_by'='root',
>>>>>>>>
>>>>>>>>     'last_modified_time'='1455859079',
>>>>>>>>
>>>>>>>>     'numFiles'='37',
>>>>>>>>
>>>>>>>>     'numRows'='3',
>>>>>>>>
>>>>>>>>     'rawDataSize'='0',
>>>>>>>>
>>>>>>>>     'totalSize'='11383',
>>>>>>>>
>>>>>>>>     'transactional'='true',
>>>>>>>>
>>>>>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Sanjiv Singh
>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Is the transaction attribute set on your table? I observed that
>>>>>>>>> hive transaction storage structure do not work with spark yet. You can
>>>>>>>>> confirm this by looking at the transactional attribute in the output of
>>>>>>>>> "desc extended <tablename>" in hive console.
>>>>>>>>>
>>>>>>>>> If you'd need to access transactional table, consider doing a
>>>>>>>>> major compaction and then try accessing the tables
>>>>>>>>>
>>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <
>>>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have observed that Spark SQL is not returning records for hive
>>>>>>>>>> bucketed ORC tables on HDP.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On spark SQL , I am able to list all tables , but queries on hive
>>>>>>>>>> bucketed tables are not returning records.
>>>>>>>>>>
>>>>>>>>>> I have also tried the same for non-bucketed hive tables. it is
>>>>>>>>>> working fine.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Same is working on plain Apache setup.
>>>>>>>>>>
>>>>>>>>>> Let me know if needs other details.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Sanjiv Singh
>>>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks,
>>>>>>>>> M. Varadharajan
>>>>>>>>>
>>>>>>>>> ------------------------------------------------
>>>>>>>>>
>>>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>>>
>>>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> M. Varadharajan
>>>>>>>
>>>>>>> ------------------------------------------------
>>>>>>>
>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>
>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> M. Varadharajan
>>>>>
>>>>> ------------------------------------------------
>>>>>
>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>
>>>>> My Journal :- http://varadharajan.in
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> ------------------------------------------------
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by "@Sanjiv Singh" <sa...@gmail.com>.
Try this,


hive> create table default.foo(id int) clustered by (id) into 2 buckets
STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
because data is still in delta files

Now run major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1

hive> insert into foo values(20);

scala> sqlContext.table("default.foo").count* // Gives 2 , no compaction
required.*
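
One caveat, in case the count still shows the old value right after the ALTER
TABLE: the compaction request is only queued and runs asynchronously, so it may
not have finished yet. A rough sketch for checking, assuming a standard Hive 1.x
shell and that sqlContext is a HiveContext (the refreshTable call is only an
assumption, in case Spark has cached the file listing):

hive> SHOW COMPACTIONS;
-- wait until the request for default.foo is no longer 'initiated' or 'working'
scala> sqlContext.refreshTable("default.foo")
scala> sqlContext.table("default.foo").count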




Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan <srinathsmn@gmail.com
> wrote:

> This is the scenario i'm mentioning.. I'm not using Spark JDBC. Not sure
> if its different.
>
> Please walkthrough the below commands in the same order to understand the
> sequence.
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets
> STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
> because data is still in delta files
>
> Now run major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
>
> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sa...@gmail.com>
> wrote:
>
>> Hi Varadharajan,
>>
>>
>> That is the point, Spark SQL is able to recognize delta files. See below
>> directory structure, ONE BASE (43 records) and one DELTA (created after
>> last insert). And I am able see last insert through Spark SQL.
>>
>>
>> *See below complete scenario :*
>>
>> *Steps:*
>>
>>    - Inserted 43 records in table.
>>    - Run major compaction on table.
>>    - *alter table mytable COMPACT 'major';*
>>    - Disabled auto compaction on table.
>>    - *alter table mytable set
>>       TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
>>    - Inserted 1 record in table.
>>
>>
>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:43
>> /apps/hive/warehouse/mydb.db/mytable/base_0000087
>> drwxr-xr-x   - root hdfs          0 2016-02-23 12:02
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>>
>> *SPARK JDBC :*
>>
>> 0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
>> +------+
>> | _c0  |
>> +------+
>> | 44   |
>> +------+
>> 1 row selected (1.196 seconds)
>>
>> *HIVE JDBC :*
>>
>> 1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
>> +------+--+
>> | _c0  |
>> +------+--+
>> | 44   |
>> +------+--+
>> 1 row selected (0.121 seconds)
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
>> srinathsmn@gmail.com> wrote:
>>
>>> Hi Sanjiv,
>>>
>>> Yes.. If we make use of Hive JDBC we should be able to retrieve all the
>>> rows since it is hive which processes the query. But i think the problem
>>> with Hive JDBC is that there are two layers of processing, hive and then at
>>> spark with the result set. And another one is performance is limited to
>>> that single HiveServer2 node and network.
>>>
>>> But If we make use of sqlContext.table function in spark to access hive
>>> tables, it is supposed to read files directly from HDFS skipping the hive
>>> layer. But it doesn't read delta files and just reads the contents from
>>> base folder. Only after Major compaction, the delta files would be merged
>>> with based folder and be visible for Spark SQL
>>>
>>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Varadharajan,
>>>>
>>>> Can you elaborate on (you quoted on previous mail) :
>>>> "I observed that hive transaction storage structure do not work with
>>>> spark yet"
>>>>
>>>>
>>>> If it is related to delta files created after each transaction and
>>>> spark would not be able recognize them. then I have a table *mytable *(ORC
>>>> , BUCKETED , NON-SORTED) , already done lots on insert , update and
>>>> deletes. I can see delta files created in HDFS (see below), Still able to
>>>> fetch consistent records through Spark JDBC and HIVE JDBC.
>>>>
>>>> Not compaction triggered for that table.
>>>>
>>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>>>
>>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>>
>>>>
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob :  +091 9990-447-339
>>>>
>>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <
>>>> srinathsmn@gmail.com> wrote:
>>>>
>>>>> Actually the auto compaction if enabled is triggered based on the
>>>>> volume of changes. It doesn't automatically run after every insert. I think
>>>>> its possible to reduce the thresholds but that might reduce performance by
>>>>> a big margin. As of now, we do compaction after the batch insert completes.
>>>>>
>>>>> The only other way to solve this problem as of now is to use Hive JDBC
>>>>> API.
>>>>>
>>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <
>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>
>>>>>> Compaction would have been triggered automatically as following
>>>>>> properties already set in *hive-site.xml*. and also
>>>>>> *NO_AUTO_COMPACTION* property not been set for these tables.
>>>>>>
>>>>>>
>>>>>>     <property>
>>>>>>
>>>>>>       <name>hive.compactor.initiator.on</name>
>>>>>>
>>>>>>       <value>true</value>
>>>>>>
>>>>>>     </property>
>>>>>>
>>>>>>     <property>
>>>>>>
>>>>>>       <name>hive.compactor.worker.threads</name>
>>>>>>
>>>>>>       <value>1</value>
>>>>>>
>>>>>>     </property>
>>>>>>
>>>>>>
>>>>>> Documentation is upset sometimes.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Sanjiv Singh
>>>>>> Mob :  +091 9990-447-339
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, I was burned down by this issue couple of weeks back. This also
>>>>>>> means that after every insert job, compaction should be run to access new
>>>>>>> rows from Spark. Sad that this issue is not documented / mentioned anywhere.
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <
>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Varadharajan,
>>>>>>>>
>>>>>>>> Thanks for your response.
>>>>>>>>
>>>>>>>> Yes it is transnational table; See below *show create table. *
>>>>>>>>
>>>>>>>> Table hardly have 3 records , and after triggering minor compaction
>>>>>>>> on tables , it start showing results on spark SQL.
>>>>>>>>
>>>>>>>>
>>>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>>>
>>>>>>>>
>>>>>>>> > *show create table hivespark;*
>>>>>>>>
>>>>>>>>   CREATE TABLE `hivespark`(
>>>>>>>>
>>>>>>>>     `id` int,
>>>>>>>>
>>>>>>>>     `name` string)
>>>>>>>>
>>>>>>>>   CLUSTERED BY (
>>>>>>>>
>>>>>>>>     id)
>>>>>>>>
>>>>>>>>   INTO 32 BUCKETS
>>>>>>>>
>>>>>>>>   ROW FORMAT SERDE
>>>>>>>>
>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>>>
>>>>>>>>   STORED AS INPUTFORMAT
>>>>>>>>
>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>>>
>>>>>>>>   OUTPUTFORMAT
>>>>>>>>
>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>>>
>>>>>>>>   LOCATION
>>>>>>>>
>>>>>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>>>   TBLPROPERTIES (
>>>>>>>>
>>>>>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>>>>>
>>>>>>>>     'last_modified_by'='root',
>>>>>>>>
>>>>>>>>     'last_modified_time'='1455859079',
>>>>>>>>
>>>>>>>>     'numFiles'='37',
>>>>>>>>
>>>>>>>>     'numRows'='3',
>>>>>>>>
>>>>>>>>     'rawDataSize'='0',
>>>>>>>>
>>>>>>>>     'totalSize'='11383',
>>>>>>>>
>>>>>>>>     'transactional'='true',
>>>>>>>>
>>>>>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Sanjiv Singh
>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Is the transaction attribute set on your table? I observed that
>>>>>>>>> hive transaction storage structure do not work with spark yet. You can
>>>>>>>>> confirm this by looking at the transactional attribute in the output of
>>>>>>>>> "desc extended <tablename>" in hive console.
>>>>>>>>>
>>>>>>>>> If you'd need to access transactional table, consider doing a
>>>>>>>>> major compaction and then try accessing the tables
>>>>>>>>>
>>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <
>>>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have observed that Spark SQL is not returning records for hive
>>>>>>>>>> bucketed ORC tables on HDP.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On spark SQL , I am able to list all tables , but queries on hive
>>>>>>>>>> bucketed tables are not returning records.
>>>>>>>>>>
>>>>>>>>>> I have also tried the same for non-bucketed hive tables. it is
>>>>>>>>>> working fine.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Same is working on plain Apache setup.
>>>>>>>>>>
>>>>>>>>>> Let me know if needs other details.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Sanjiv Singh
>>>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks,
>>>>>>>>> M. Varadharajan
>>>>>>>>>
>>>>>>>>> ------------------------------------------------
>>>>>>>>>
>>>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>>>
>>>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> M. Varadharajan
>>>>>>>
>>>>>>> ------------------------------------------------
>>>>>>>
>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>
>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> M. Varadharajan
>>>>>
>>>>> ------------------------------------------------
>>>>>
>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>
>>>>> My Journal :- http://varadharajan.in
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> ------------------------------------------------
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by Varadharajan Mukundan <sr...@gmail.com>.
This is the scenario I'm mentioning. I'm not using Spark JDBC; not sure if
it's different.

Please walk through the commands below in the same order to understand the
sequence.

hive> create table default.foo(id int) clustered by (id) into 2 buckets
STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
because data is still in delta files

Now run major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1


On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sa...@gmail.com>
wrote:

> Hi Varadharajan,
>
>
> That is the point, Spark SQL is able to recognize delta files. See below
> directory structure, ONE BASE (43 records) and one DELTA (created after
> last insert). And I am able see last insert through Spark SQL.
>
>
> *See below complete scenario :*
>
> *Steps:*
>
>    - Inserted 43 records in table.
>    - Run major compaction on table.
>    - *alter table mytable COMPACT 'major';*
>    - Disabled auto compaction on table.
>    - *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
>    - Inserted 1 record in table.
>
>
> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
> drwxrwxrwx   - root hdfs          0 2016-02-23 11:43
> /apps/hive/warehouse/mydb.db/mytable/base_0000087
> drwxr-xr-x   - root hdfs          0 2016-02-23 12:02
> /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>
> *SPARK JDBC :*
>
> 0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
> +------+
> | _c0  |
> +------+
> | 44   |
> +------+
> 1 row selected (1.196 seconds)
>
> *HIVE JDBC :*
>
> 1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
> +------+--+
> | _c0  |
> +------+--+
> | 44   |
> +------+--+
> 1 row selected (0.121 seconds)
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
> srinathsmn@gmail.com> wrote:
>
>> Hi Sanjiv,
>>
>> Yes.. If we make use of Hive JDBC we should be able to retrieve all the
>> rows since it is hive which processes the query. But i think the problem
>> with Hive JDBC is that there are two layers of processing, hive and then at
>> spark with the result set. And another one is performance is limited to
>> that single HiveServer2 node and network.
>>
>> But If we make use of sqlContext.table function in spark to access hive
>> tables, it is supposed to read files directly from HDFS skipping the hive
>> layer. But it doesn't read delta files and just reads the contents from
>> base folder. Only after Major compaction, the delta files would be merged
>> with based folder and be visible for Spark SQL
>>
>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sa...@gmail.com>
>> wrote:
>>
>>> Hi Varadharajan,
>>>
>>> Can you elaborate on (you quoted on previous mail) :
>>> "I observed that hive transaction storage structure do not work with
>>> spark yet"
>>>
>>>
>>> If it is related to delta files created after each transaction and spark
>>> would not be able recognize them. then I have a table *mytable *(ORC ,
>>> BUCKETED , NON-SORTED) , already done lots on insert , update and deletes.
>>> I can see delta files created in HDFS (see below), Still able to fetch
>>> consistent records through Spark JDBC and HIVE JDBC.
>>>
>>> Not compaction triggered for that table.
>>>
>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>>
>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>
>>>
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob :  +091 9990-447-339
>>>
>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <
>>> srinathsmn@gmail.com> wrote:
>>>
>>>> Actually the auto compaction if enabled is triggered based on the
>>>> volume of changes. It doesn't automatically run after every insert. I think
>>>> its possible to reduce the thresholds but that might reduce performance by
>>>> a big margin. As of now, we do compaction after the batch insert completes.
>>>>
>>>> The only other way to solve this problem as of now is to use Hive JDBC
>>>> API.
>>>>
>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sanjiv.is.on@gmail.com
>>>> > wrote:
>>>>
>>>>> Compaction would have been triggered automatically as following
>>>>> properties already set in *hive-site.xml*. and also
>>>>> *NO_AUTO_COMPACTION* property not been set for these tables.
>>>>>
>>>>>
>>>>>     <property>
>>>>>
>>>>>       <name>hive.compactor.initiator.on</name>
>>>>>
>>>>>       <value>true</value>
>>>>>
>>>>>     </property>
>>>>>
>>>>>     <property>
>>>>>
>>>>>       <name>hive.compactor.worker.threads</name>
>>>>>
>>>>>       <value>1</value>
>>>>>
>>>>>     </property>
>>>>>
>>>>>
>>>>> Documentation is upset sometimes.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Regards
>>>>> Sanjiv Singh
>>>>> Mob :  +091 9990-447-339
>>>>>
>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>>>>> srinathsmn@gmail.com> wrote:
>>>>>
>>>>>> Yes, I was burned down by this issue couple of weeks back. This also
>>>>>> means that after every insert job, compaction should be run to access new
>>>>>> rows from Spark. Sad that this issue is not documented / mentioned anywhere.
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <
>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Varadharajan,
>>>>>>>
>>>>>>> Thanks for your response.
>>>>>>>
>>>>>>> Yes it is transnational table; See below *show create table. *
>>>>>>>
>>>>>>> Table hardly have 3 records , and after triggering minor compaction
>>>>>>> on tables , it start showing results on spark SQL.
>>>>>>>
>>>>>>>
>>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>>
>>>>>>>
>>>>>>> > *show create table hivespark;*
>>>>>>>
>>>>>>>   CREATE TABLE `hivespark`(
>>>>>>>
>>>>>>>     `id` int,
>>>>>>>
>>>>>>>     `name` string)
>>>>>>>
>>>>>>>   CLUSTERED BY (
>>>>>>>
>>>>>>>     id)
>>>>>>>
>>>>>>>   INTO 32 BUCKETS
>>>>>>>
>>>>>>>   ROW FORMAT SERDE
>>>>>>>
>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>>
>>>>>>>   STORED AS INPUTFORMAT
>>>>>>>
>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>>
>>>>>>>   OUTPUTFORMAT
>>>>>>>
>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>>
>>>>>>>   LOCATION
>>>>>>>
>>>>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>>   TBLPROPERTIES (
>>>>>>>
>>>>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>>>>
>>>>>>>     'last_modified_by'='root',
>>>>>>>
>>>>>>>     'last_modified_time'='1455859079',
>>>>>>>
>>>>>>>     'numFiles'='37',
>>>>>>>
>>>>>>>     'numRows'='3',
>>>>>>>
>>>>>>>     'rawDataSize'='0',
>>>>>>>
>>>>>>>     'totalSize'='11383',
>>>>>>>
>>>>>>>     'transactional'='true',
>>>>>>>
>>>>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>>>>
>>>>>>>
>>>>>>> Regards
>>>>>>> Sanjiv Singh
>>>>>>> Mob :  +091 9990-447-339
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is the transaction attribute set on your table? I observed that
>>>>>>>> hive transaction storage structure do not work with spark yet. You can
>>>>>>>> confirm this by looking at the transactional attribute in the output of
>>>>>>>> "desc extended <tablename>" in hive console.
>>>>>>>>
>>>>>>>> If you'd need to access transactional table, consider doing a major
>>>>>>>> compaction and then try accessing the tables
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <
>>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have observed that Spark SQL is not returning records for hive
>>>>>>>>> bucketed ORC tables on HDP.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On spark SQL , I am able to list all tables , but queries on hive
>>>>>>>>> bucketed tables are not returning records.
>>>>>>>>>
>>>>>>>>> I have also tried the same for non-bucketed hive tables. it is
>>>>>>>>> working fine.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Same is working on plain Apache setup.
>>>>>>>>>
>>>>>>>>> Let me know if needs other details.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Sanjiv Singh
>>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks,
>>>>>>>> M. Varadharajan
>>>>>>>>
>>>>>>>> ------------------------------------------------
>>>>>>>>
>>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>>
>>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> M. Varadharajan
>>>>>>
>>>>>> ------------------------------------------------
>>>>>>
>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>
>>>>>> My Journal :- http://varadharajan.in
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> M. Varadharajan
>>>>
>>>> ------------------------------------------------
>>>>
>>>> "Experience is what you get when you didn't get what you wanted"
>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>
>>>> My Journal :- http://varadharajan.in
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>> M. Varadharajan
>>
>> ------------------------------------------------
>>
>> "Experience is what you get when you didn't get what you wanted"
>>                -By Prof. Randy Pausch in "The Last Lecture"
>>
>> My Journal :- http://varadharajan.in
>>
>
>


-- 
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
               -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by "@Sanjiv Singh" <sa...@gmail.com>.
Hi Varadharajan,


That is the point: Spark SQL is able to recognize delta files. See the
directory structure below, one BASE (43 records) and one DELTA (created after
the last insert). And I am able to see the last insert through Spark SQL.


*See the complete scenario below:*

*Steps:*

   - Inserted 43 records in table.
   - Run major compaction on table.
   - *alter table mytable COMPACT 'major';*
   - Disabled auto compaction on table.
   - *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
   - Inserted 1 record in table.


> *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
drwxrwxrwx   - root hdfs          0 2016-02-23 11:43
/apps/hive/warehouse/mydb.db/mytable/base_0000087
drwxr-xr-x   - root hdfs          0 2016-02-23 12:02
/apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088

*SPARK JDBC :*

0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
+------+
| _c0  |
+------+
| 44   |
+------+
1 row selected (1.196 seconds)

*HIVE JDBC :*

1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
+------+--+
| _c0  |
+------+--+
| 44   |
+------+--+
1 row selected (0.121 seconds)
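
(For anyone reproducing this: the two prompts above are beeline sessions
against what appear to be the Spark Thrift Server on port 9999 and HiveServer2
on port 10000. Both speak the HiveServer2 protocol, so the same client and the
same jdbc:hive2:// URL scheme work for either; the host and ports below are
just the placeholders from the output above.)

beeline -u jdbc:hive2://myhost:9999/mydb     # Spark SQL executes the query
beeline -u jdbc:hive2://myhost:10000/mydb    # Hive executes the query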


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
srinathsmn@gmail.com> wrote:

> Hi Sanjiv,
>
> Yes.. If we make use of Hive JDBC we should be able to retrieve all the
> rows since it is hive which processes the query. But i think the problem
> with Hive JDBC is that there are two layers of processing, hive and then at
> spark with the result set. And another one is performance is limited to
> that single HiveServer2 node and network.
>
> But If we make use of sqlContext.table function in spark to access hive
> tables, it is supposed to read files directly from HDFS skipping the hive
> layer. But it doesn't read delta files and just reads the contents from
> base folder. Only after Major compaction, the delta files would be merged
> with based folder and be visible for Spark SQL
>
> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sa...@gmail.com>
> wrote:
>
>> Hi Varadharajan,
>>
>> Can you elaborate on (you quoted on previous mail) :
>> "I observed that hive transaction storage structure do not work with
>> spark yet"
>>
>>
>> If it is related to delta files created after each transaction and spark
>> would not be able recognize them. then I have a table *mytable *(ORC ,
>> BUCKETED , NON-SORTED) , already done lots on insert , update and deletes.
>> I can see delta files created in HDFS (see below), Still able to fetch
>> consistent records through Spark JDBC and HIVE JDBC.
>>
>> Not compaction triggered for that table.
>>
>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>
>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/base_0000060
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <
>> srinathsmn@gmail.com> wrote:
>>
>>> Actually the auto compaction if enabled is triggered based on the volume
>>> of changes. It doesn't automatically run after every insert. I think its
>>> possible to reduce the thresholds but that might reduce performance by a
>>> big margin. As of now, we do compaction after the batch insert completes.
>>>
>>> The only other way to solve this problem as of now is to use Hive JDBC
>>> API.
>>>
>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sa...@gmail.com>
>>> wrote:
>>>
>>>> Compaction would have been triggered automatically as following
>>>> properties already set in *hive-site.xml*. and also
>>>> *NO_AUTO_COMPACTION* property not been set for these tables.
>>>>
>>>>
>>>>     <property>
>>>>
>>>>       <name>hive.compactor.initiator.on</name>
>>>>
>>>>       <value>true</value>
>>>>
>>>>     </property>
>>>>
>>>>     <property>
>>>>
>>>>       <name>hive.compactor.worker.threads</name>
>>>>
>>>>       <value>1</value>
>>>>
>>>>     </property>
>>>>
>>>>
>>>> Documentation is upset sometimes.
>>>>
>>>>
>>>>
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob :  +091 9990-447-339
>>>>
>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>>>> srinathsmn@gmail.com> wrote:
>>>>
>>>>> Yes, I was burned down by this issue couple of weeks back. This also
>>>>> means that after every insert job, compaction should be run to access new
>>>>> rows from Spark. Sad that this issue is not documented / mentioned anywhere.
>>>>>
>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sanjiv.is.on@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi Varadharajan,
>>>>>>
>>>>>> Thanks for your response.
>>>>>>
>>>>>> Yes it is transnational table; See below *show create table. *
>>>>>>
>>>>>> Table hardly have 3 records , and after triggering minor compaction
>>>>>> on tables , it start showing results on spark SQL.
>>>>>>
>>>>>>
>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>
>>>>>>
>>>>>> > *show create table hivespark;*
>>>>>>
>>>>>>   CREATE TABLE `hivespark`(
>>>>>>
>>>>>>     `id` int,
>>>>>>
>>>>>>     `name` string)
>>>>>>
>>>>>>   CLUSTERED BY (
>>>>>>
>>>>>>     id)
>>>>>>
>>>>>>   INTO 32 BUCKETS
>>>>>>
>>>>>>   ROW FORMAT SERDE
>>>>>>
>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>
>>>>>>   STORED AS INPUTFORMAT
>>>>>>
>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>
>>>>>>   OUTPUTFORMAT
>>>>>>
>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>
>>>>>>   LOCATION
>>>>>>
>>>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>   TBLPROPERTIES (
>>>>>>
>>>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>>>
>>>>>>     'last_modified_by'='root',
>>>>>>
>>>>>>     'last_modified_time'='1455859079',
>>>>>>
>>>>>>     'numFiles'='37',
>>>>>>
>>>>>>     'numRows'='3',
>>>>>>
>>>>>>     'rawDataSize'='0',
>>>>>>
>>>>>>     'totalSize'='11383',
>>>>>>
>>>>>>     'transactional'='true',
>>>>>>
>>>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Sanjiv Singh
>>>>>> Mob :  +091 9990-447-339
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is the transaction attribute set on your table? I observed that hive
>>>>>>> transaction storage structure do not work with spark yet. You can confirm
>>>>>>> this by looking at the transactional attribute in the output of "desc
>>>>>>> extended <tablename>" in hive console.
>>>>>>>
>>>>>>> If you'd need to access transactional table, consider doing a major
>>>>>>> compaction and then try accessing the tables
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <
>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> I have observed that Spark SQL is not returning records for hive
>>>>>>>> bucketed ORC tables on HDP.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On spark SQL , I am able to list all tables , but queries on hive
>>>>>>>> bucketed tables are not returning records.
>>>>>>>>
>>>>>>>> I have also tried the same for non-bucketed hive tables. it is
>>>>>>>> working fine.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Same is working on plain Apache setup.
>>>>>>>>
>>>>>>>> Let me know if needs other details.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Sanjiv Singh
>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> M. Varadharajan
>>>>>>>
>>>>>>> ------------------------------------------------
>>>>>>>
>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>
>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> M. Varadharajan
>>>>>
>>>>> ------------------------------------------------
>>>>>
>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>
>>>>> My Journal :- http://varadharajan.in
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> ------------------------------------------------
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by Varadharajan Mukundan <sr...@gmail.com>.
Hi Sanjiv,

Yes. If we make use of Hive JDBC we should be able to retrieve all the
rows, since it is Hive that processes the query. But I think the problem
with Hive JDBC is that there are two layers of processing: Hive first, and
then Spark on the result set. Another drawback is that performance is
limited to that single HiveServer2 node and its network.

But if we make use of the sqlContext.table function in Spark to access Hive
tables, it reads the files directly from HDFS, skipping the Hive layer.
However, it doesn't read the delta files and only picks up the contents of
the base folder. Only after a major compaction are the deltas merged into
the base folder and thus visible to Spark SQL.
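
To make the two paths concrete, here is a minimal sketch; the host, port,
database and table names are placeholders, and it assumes the Hive JDBC
driver is on the Spark shell's classpath:

// Path 1: Hive JDBC -- HiveServer2 runs the query, so ACID delta files
// are visible, but every row comes back through that single node.
import java.sql.DriverManager
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/mydb", "hive", "")
val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM mytable")
rs.next(); println(rs.getLong(1))
conn.close()

// Path 2: direct metastore/HDFS read -- distributed and fast, but only the
// base_* directories are seen until a major compaction folds in the deltas.
println(sqlContext.table("mydb.mytable").count())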

On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sa...@gmail.com>
wrote:

> Hi Varadharajan,
>
> Can you elaborate on (you quoted on previous mail) :
> "I observed that hive transaction storage structure do not work with
> spark yet"
>
>
> If it is related to delta files created after each transaction and spark
> would not be able recognize them. then I have a table *mytable *(ORC ,
> BUCKETED , NON-SORTED) , already done lots on insert , update and deletes.
> I can see delta files created in HDFS (see below), Still able to fetch
> consistent records through Spark JDBC and HIVE JDBC.
>
> Not compaction triggered for that table.
>
> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>
> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/base_0000060
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
> /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
> /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
> /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
> /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <
> srinathsmn@gmail.com> wrote:
>
>> Actually the auto compaction if enabled is triggered based on the volume
>> of changes. It doesn't automatically run after every insert. I think its
>> possible to reduce the thresholds but that might reduce performance by a
>> big margin. As of now, we do compaction after the batch insert completes.
>>
>> The only other way to solve this problem as of now is to use Hive JDBC
>> API.
>>
>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sa...@gmail.com>
>> wrote:
>>
>>> Compaction would have been triggered automatically as following
>>> properties already set in *hive-site.xml*. and also *NO_AUTO_COMPACTION* property
>>> not been set for these tables.
>>>
>>>
>>>     <property>
>>>
>>>       <name>hive.compactor.initiator.on</name>
>>>
>>>       <value>true</value>
>>>
>>>     </property>
>>>
>>>     <property>
>>>
>>>       <name>hive.compactor.worker.threads</name>
>>>
>>>       <value>1</value>
>>>
>>>     </property>
>>>
>>>
>>> Documentation is upset sometimes.
>>>
>>>
>>>
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob :  +091 9990-447-339
>>>
>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>>> srinathsmn@gmail.com> wrote:
>>>
>>>> Yes, I was burned down by this issue couple of weeks back. This also
>>>> means that after every insert job, compaction should be run to access new
>>>> rows from Spark. Sad that this issue is not documented / mentioned anywhere.
>>>>
>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Varadharajan,
>>>>>
>>>>> Thanks for your response.
>>>>>
>>>>> Yes it is transnational table; See below *show create table. *
>>>>>
>>>>> Table hardly have 3 records , and after triggering minor compaction on
>>>>> tables , it start showing results on spark SQL.
>>>>>
>>>>>
>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>
>>>>>
>>>>> > *show create table hivespark;*
>>>>>
>>>>>   CREATE TABLE `hivespark`(
>>>>>
>>>>>     `id` int,
>>>>>
>>>>>     `name` string)
>>>>>
>>>>>   CLUSTERED BY (
>>>>>
>>>>>     id)
>>>>>
>>>>>   INTO 32 BUCKETS
>>>>>
>>>>>   ROW FORMAT SERDE
>>>>>
>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>
>>>>>   STORED AS INPUTFORMAT
>>>>>
>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>
>>>>>   OUTPUTFORMAT
>>>>>
>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>
>>>>>   LOCATION
>>>>>
>>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>   TBLPROPERTIES (
>>>>>
>>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>>
>>>>>     'last_modified_by'='root',
>>>>>
>>>>>     'last_modified_time'='1455859079',
>>>>>
>>>>>     'numFiles'='37',
>>>>>
>>>>>     'numRows'='3',
>>>>>
>>>>>     'rawDataSize'='0',
>>>>>
>>>>>     'totalSize'='11383',
>>>>>
>>>>>     'transactional'='true',
>>>>>
>>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>>
>>>>>
>>>>> Regards
>>>>> Sanjiv Singh
>>>>> Mob :  +091 9990-447-339
>>>>>
>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>>> srinathsmn@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Is the transaction attribute set on your table? I observed that hive
>>>>>> transaction storage structure do not work with spark yet. You can confirm
>>>>>> this by looking at the transactional attribute in the output of "desc
>>>>>> extended <tablename>" in hive console.
>>>>>>
>>>>>> If you'd need to access transactional table, consider doing a major
>>>>>> compaction and then try accessing the tables
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <
>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> I have observed that Spark SQL is not returning records for hive
>>>>>>> bucketed ORC tables on HDP.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On spark SQL , I am able to list all tables , but queries on hive
>>>>>>> bucketed tables are not returning records.
>>>>>>>
>>>>>>> I have also tried the same for non-bucketed hive tables. it is
>>>>>>> working fine.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Same is working on plain Apache setup.
>>>>>>>
>>>>>>> Let me know if needs other details.
>>>>>>>
>>>>>>> Regards
>>>>>>> Sanjiv Singh
>>>>>>> Mob :  +091 9990-447-339
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> M. Varadharajan
>>>>>>
>>>>>> ------------------------------------------------
>>>>>>
>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>
>>>>>> My Journal :- http://varadharajan.in
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> M. Varadharajan
>>>>
>>>> ------------------------------------------------
>>>>
>>>> "Experience is what you get when you didn't get what you wanted"
>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>
>>>> My Journal :- http://varadharajan.in
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>> M. Varadharajan
>>
>> ------------------------------------------------
>>
>> "Experience is what you get when you didn't get what you wanted"
>>                -By Prof. Randy Pausch in "The Last Lecture"
>>
>> My Journal :- http://varadharajan.in
>>
>
>


-- 
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
               -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by "@Sanjiv Singh" <sa...@gmail.com>.
Hi Varadharajan,

Can you elaborate on what you mentioned in your previous mail:
"I observed that hive transaction storage structure do not work with spark
yet"


If this is about the delta files created after each transaction, and Spark
not being able to recognize them: I have a table *mytable* (ORC, BUCKETED,
NON-SORTED) that has already gone through lots of inserts, updates and
deletes. I can see the delta files created in HDFS (see below), yet I am
still able to fetch consistent records through both Spark JDBC and Hive JDBC.

No compaction has been triggered for that table (a quick way to verify this is sketched after the listing).

> *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*

drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/base_0000060
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
/apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
/apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
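
As a cross-check that nothing was ever queued or run for this table, Hive's
compaction listing can be consulted (the exact columns vary by Hive version):

hive> SHOW COMPACTIONS;
(one row per compaction request, with database, table, partition, type
MAJOR/MINOR and state such as initiated, working or ready for cleaning)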



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <srinathsmn@gmail.com
> wrote:

> Actually the auto compaction if enabled is triggered based on the volume
> of changes. It doesn't automatically run after every insert. I think its
> possible to reduce the thresholds but that might reduce performance by a
> big margin. As of now, we do compaction after the batch insert completes.
>
> The only other way to solve this problem as of now is to use Hive JDBC API.
>
> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sa...@gmail.com>
> wrote:
>
>> Compaction would have been triggered automatically as following
>> properties already set in *hive-site.xml*. and also *NO_AUTO_COMPACTION* property
>> not been set for these tables.
>>
>>
>>     <property>
>>
>>       <name>hive.compactor.initiator.on</name>
>>
>>       <value>true</value>
>>
>>     </property>
>>
>>     <property>
>>
>>       <name>hive.compactor.worker.threads</name>
>>
>>       <value>1</value>
>>
>>     </property>
>>
>>
>> Documentation is upset sometimes.
>>
>>
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>> srinathsmn@gmail.com> wrote:
>>
>>> Yes, I was burned down by this issue couple of weeks back. This also
>>> means that after every insert job, compaction should be run to access new
>>> rows from Spark. Sad that this issue is not documented / mentioned anywhere.
>>>
>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Varadharajan,
>>>>
>>>> Thanks for your response.
>>>>
>>>> Yes it is transnational table; See below *show create table. *
>>>>
>>>> Table hardly have 3 records , and after triggering minor compaction on
>>>> tables , it start showing results on spark SQL.
>>>>
>>>>
>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>
>>>>
>>>> > *show create table hivespark;*
>>>>
>>>>   CREATE TABLE `hivespark`(
>>>>
>>>>     `id` int,
>>>>
>>>>     `name` string)
>>>>
>>>>   CLUSTERED BY (
>>>>
>>>>     id)
>>>>
>>>>   INTO 32 BUCKETS
>>>>
>>>>   ROW FORMAT SERDE
>>>>
>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>
>>>>   STORED AS INPUTFORMAT
>>>>
>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>
>>>>   OUTPUTFORMAT
>>>>
>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>
>>>>   LOCATION
>>>>
>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>   TBLPROPERTIES (
>>>>
>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>
>>>>     'last_modified_by'='root',
>>>>
>>>>     'last_modified_time'='1455859079',
>>>>
>>>>     'numFiles'='37',
>>>>
>>>>     'numRows'='3',
>>>>
>>>>     'rawDataSize'='0',
>>>>
>>>>     'totalSize'='11383',
>>>>
>>>>     'transactional'='true',
>>>>
>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob :  +091 9990-447-339
>>>>
>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>> srinathsmn@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Is the transaction attribute set on your table? I observed that hive
>>>>> transaction storage structure do not work with spark yet. You can confirm
>>>>> this by looking at the transactional attribute in the output of "desc
>>>>> extended <tablename>" in hive console.
>>>>>
>>>>> If you'd need to access transactional table, consider doing a major
>>>>> compaction and then try accessing the tables
>>>>>
>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sanjiv.is.on@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I have observed that Spark SQL is not returning records for hive
>>>>>> bucketed ORC tables on HDP.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On spark SQL , I am able to list all tables , but queries on hive
>>>>>> bucketed tables are not returning records.
>>>>>>
>>>>>> I have also tried the same for non-bucketed hive tables. it is
>>>>>> working fine.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Same is working on plain Apache setup.
>>>>>>
>>>>>> Let me know if needs other details.
>>>>>>
>>>>>> Regards
>>>>>> Sanjiv Singh
>>>>>> Mob :  +091 9990-447-339
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> M. Varadharajan
>>>>>
>>>>> ------------------------------------------------
>>>>>
>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>
>>>>> My Journal :- http://varadharajan.in
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> ------------------------------------------------
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by Varadharajan Mukundan <sr...@gmail.com>.
Actually, auto compaction, if enabled, is triggered based on the volume of
changes; it doesn't automatically run after every insert. I think it's
possible to lower the thresholds, but that might reduce performance by a
big margin. As of now, we run a compaction after the batch insert completes.

The only other way to solve this problem as of now is to use the Hive JDBC API.
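
A minimal sketch of that flow, issuing the compaction through the Hive JDBC
driver once the batch insert has finished; the host, database and
credentials below are placeholders:

import java.sql.DriverManager

// Queue a major compaction so the delta files get merged into the base
// and become visible to Spark SQL's direct readers.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/mydb", "hive", "")
val stmt = conn.createStatement()
stmt.execute("ALTER TABLE mytable COMPACT 'major'")
stmt.close()
conn.close()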

On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sa...@gmail.com>
wrote:

> Compaction would have been triggered automatically as following properties
> already set in *hive-site.xml*. and also *NO_AUTO_COMPACTION* property
> not been set for these tables.
>
>
>     <property>
>
>       <name>hive.compactor.initiator.on</name>
>
>       <value>true</value>
>
>     </property>
>
>     <property>
>
>       <name>hive.compactor.worker.threads</name>
>
>       <value>1</value>
>
>     </property>
>
>
> Documentation is upset sometimes.
>
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
> srinathsmn@gmail.com> wrote:
>
>> Yes, I was burned down by this issue couple of weeks back. This also
>> means that after every insert job, compaction should be run to access new
>> rows from Spark. Sad that this issue is not documented / mentioned anywhere.
>>
>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sa...@gmail.com>
>> wrote:
>>
>>> Hi Varadharajan,
>>>
>>> Thanks for your response.
>>>
>>> Yes it is transnational table; See below *show create table. *
>>>
>>> Table hardly have 3 records , and after triggering minor compaction on
>>> tables , it start showing results on spark SQL.
>>>
>>>
>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>
>>>
>>> > *show create table hivespark;*
>>>
>>>   CREATE TABLE `hivespark`(
>>>
>>>     `id` int,
>>>
>>>     `name` string)
>>>
>>>   CLUSTERED BY (
>>>
>>>     id)
>>>
>>>   INTO 32 BUCKETS
>>>
>>>   ROW FORMAT SERDE
>>>
>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>
>>>   STORED AS INPUTFORMAT
>>>
>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>
>>>   OUTPUTFORMAT
>>>
>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>
>>>   LOCATION
>>>
>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>   TBLPROPERTIES (
>>>
>>>     'COLUMN_STATS_ACCURATE'='true',
>>>
>>>     'last_modified_by'='root',
>>>
>>>     'last_modified_time'='1455859079',
>>>
>>>     'numFiles'='37',
>>>
>>>     'numRows'='3',
>>>
>>>     'rawDataSize'='0',
>>>
>>>     'totalSize'='11383',
>>>
>>>     'transactional'='true',
>>>
>>>     'transient_lastDdlTime'='1455864121') ;
>>>
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob :  +091 9990-447-339
>>>
>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>> srinathsmn@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is the transaction attribute set on your table? I observed that hive
>>>> transaction storage structure do not work with spark yet. You can confirm
>>>> this by looking at the transactional attribute in the output of "desc
>>>> extended <tablename>" in hive console.
>>>>
>>>> If you'd need to access transactional table, consider doing a major
>>>> compaction and then try accessing the tables
>>>>
>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> I have observed that Spark SQL is not returning records for hive
>>>>> bucketed ORC tables on HDP.
>>>>>
>>>>>
>>>>>
>>>>> On spark SQL , I am able to list all tables , but queries on hive
>>>>> bucketed tables are not returning records.
>>>>>
>>>>> I have also tried the same for non-bucketed hive tables. it is working
>>>>> fine.
>>>>>
>>>>>
>>>>>
>>>>> Same is working on plain Apache setup.
>>>>>
>>>>> Let me know if needs other details.
>>>>>
>>>>> Regards
>>>>> Sanjiv Singh
>>>>> Mob :  +091 9990-447-339
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> M. Varadharajan
>>>>
>>>> ------------------------------------------------
>>>>
>>>> "Experience is what you get when you didn't get what you wanted"
>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>
>>>> My Journal :- http://varadharajan.in
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>> M. Varadharajan
>>
>> ------------------------------------------------
>>
>> "Experience is what you get when you didn't get what you wanted"
>>                -By Prof. Randy Pausch in "The Last Lecture"
>>
>> My Journal :- http://varadharajan.in
>>
>
>


-- 
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
               -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by "@Sanjiv Singh" <sa...@gmail.com>.
Compaction should have been triggered automatically, as the following
properties are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION*
property has not been set for these tables.


    <property>

      <name>hive.compactor.initiator.on</name>

      <value>true</value>

    </property>

    <property>

      <name>hive.compactor.worker.threads</name>

      <value>1</value>

    </property>
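
Note that those two properties only switch on the initiator and one worker
thread; whether a compaction actually gets queued depends on separate
thresholds. The values below are only the commonly documented defaults and
are not taken from this cluster's hive-site.xml:

    <property>
      <name>hive.compactor.delta.num.threshold</name>
      <!-- minor compaction is requested once this many delta directories exist -->
      <value>10</value>
    </property>

    <property>
      <name>hive.compactor.delta.pct.threshold</name>
      <!-- major compaction is requested once total delta size reaches this fraction of the base size -->
      <value>0.1</value>
    </property>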


The documentation is misleading sometimes.




Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <srinathsmn@gmail.com
> wrote:

> Yes, I was burned down by this issue couple of weeks back. This also means
> that after every insert job, compaction should be run to access new rows
> from Spark. Sad that this issue is not documented / mentioned anywhere.
>
> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sa...@gmail.com>
> wrote:
>
>> Hi Varadharajan,
>>
>> Thanks for your response.
>>
>> Yes it is transnational table; See below *show create table. *
>>
>> Table hardly have 3 records , and after triggering minor compaction on
>> tables , it start showing results on spark SQL.
>>
>>
>> > *ALTER TABLE hivespark COMPACT 'major';*
>>
>>
>> > *show create table hivespark;*
>>
>>   CREATE TABLE `hivespark`(
>>
>>     `id` int,
>>
>>     `name` string)
>>
>>   CLUSTERED BY (
>>
>>     id)
>>
>>   INTO 32 BUCKETS
>>
>>   ROW FORMAT SERDE
>>
>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>
>>   STORED AS INPUTFORMAT
>>
>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>
>>   OUTPUTFORMAT
>>
>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>
>>   LOCATION
>>
>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>   TBLPROPERTIES (
>>
>>     'COLUMN_STATS_ACCURATE'='true',
>>
>>     'last_modified_by'='root',
>>
>>     'last_modified_time'='1455859079',
>>
>>     'numFiles'='37',
>>
>>     'numRows'='3',
>>
>>     'rawDataSize'='0',
>>
>>     'totalSize'='11383',
>>
>>     'transactional'='true',
>>
>>     'transient_lastDdlTime'='1455864121') ;
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>> srinathsmn@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Is the transaction attribute set on your table? I observed that hive
>>> transaction storage structure do not work with spark yet. You can confirm
>>> this by looking at the transactional attribute in the output of "desc
>>> extended <tablename>" in hive console.
>>>
>>> If you'd need to access transactional table, consider doing a major
>>> compaction and then try accessing the tables
>>>
>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sa...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> I have observed that Spark SQL is not returning records for hive
>>>> bucketed ORC tables on HDP.
>>>>
>>>>
>>>>
>>>> On spark SQL , I am able to list all tables , but queries on hive
>>>> bucketed tables are not returning records.
>>>>
>>>> I have also tried the same for non-bucketed hive tables. it is working
>>>> fine.
>>>>
>>>>
>>>>
>>>> Same is working on plain Apache setup.
>>>>
>>>> Let me know if needs other details.
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob :  +091 9990-447-339
>>>>
>>>
>>>
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> ------------------------------------------------
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by Varadharajan Mukundan <sr...@gmail.com>.
Yes, I was burned by this issue a couple of weeks back. It also means that
after every insert job, a compaction has to be run before the new rows are
accessible from Spark. It's sad that this issue is not documented or
mentioned anywhere.

On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sa...@gmail.com>
wrote:

> Hi Varadharajan,
>
> Thanks for your response.
>
> Yes it is transnational table; See below *show create table. *
>
> Table hardly have 3 records , and after triggering minor compaction on
> tables , it start showing results on spark SQL.
>
>
> > *ALTER TABLE hivespark COMPACT 'major';*
>
>
> > *show create table hivespark;*
>
>   CREATE TABLE `hivespark`(
>
>     `id` int,
>
>     `name` string)
>
>   CLUSTERED BY (
>
>     id)
>
>   INTO 32 BUCKETS
>
>   ROW FORMAT SERDE
>
>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>
>   STORED AS INPUTFORMAT
>
>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>
>   OUTPUTFORMAT
>
>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>
>   LOCATION
>
>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>   TBLPROPERTIES (
>
>     'COLUMN_STATS_ACCURATE'='true',
>
>     'last_modified_by'='root',
>
>     'last_modified_time'='1455859079',
>
>     'numFiles'='37',
>
>     'numRows'='3',
>
>     'rawDataSize'='0',
>
>     'totalSize'='11383',
>
>     'transactional'='true',
>
>     'transient_lastDdlTime'='1455864121') ;
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
> srinathsmn@gmail.com> wrote:
>
>> Hi,
>>
>> Is the transaction attribute set on your table? I observed that hive
>> transaction storage structure do not work with spark yet. You can confirm
>> this by looking at the transactional attribute in the output of "desc
>> extended <tablename>" in hive console.
>>
>> If you'd need to access transactional table, consider doing a major
>> compaction and then try accessing the tables
>>
>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sa...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>>
>>> I have observed that Spark SQL is not returning records for hive
>>> bucketed ORC tables on HDP.
>>>
>>>
>>>
>>> On spark SQL , I am able to list all tables , but queries on hive
>>> bucketed tables are not returning records.
>>>
>>> I have also tried the same for non-bucketed hive tables. it is working
>>> fine.
>>>
>>>
>>>
>>> Same is working on plain Apache setup.
>>>
>>> Let me know if needs other details.
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob :  +091 9990-447-339
>>>
>>
>>
>>
>> --
>> Thanks,
>> M. Varadharajan
>>
>> ------------------------------------------------
>>
>> "Experience is what you get when you didn't get what you wanted"
>>                -By Prof. Randy Pausch in "The Last Lecture"
>>
>> My Journal :- http://varadharajan.in
>>
>
>


-- 
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
               -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by "@Sanjiv Singh" <sa...@gmail.com>.
Hi Varadharajan,

Thanks for your response.

Yes, it is a transactional table; see the *show create table* output below.

The table has hardly 3 records, and after triggering the major compaction
below, it started showing results in Spark SQL.


> *ALTER TABLE hivespark COMPACT 'major';*


> *show create table hivespark;*

  CREATE TABLE `hivespark`(

    `id` int,

    `name` string)

  CLUSTERED BY (

    id)

  INTO 32 BUCKETS

  ROW FORMAT SERDE

    'org.apache.hadoop.hive.ql.io.orc.OrcSerde'

  STORED AS INPUTFORMAT

    'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'

  OUTPUTFORMAT

    'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

  LOCATION

    'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
  TBLPROPERTIES (

    'COLUMN_STATS_ACCURATE'='true',

    'last_modified_by'='root',

    'last_modified_time'='1455859079',

    'numFiles'='37',

    'numRows'='3',

    'rawDataSize'='0',

    'totalSize'='11383',

    'transactional'='true',

    'transient_lastDdlTime'='1455864121') ;


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <srinathsmn@gmail.com
> wrote:

> Hi,
>
> Is the transaction attribute set on your table? I observed that hive
> transaction storage structure do not work with spark yet. You can confirm
> this by looking at the transactional attribute in the output of "desc
> extended <tablename>" in hive console.
>
> If you'd need to access transactional table, consider doing a major
> compaction and then try accessing the tables
>
> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sa...@gmail.com>
> wrote:
>
>> Hi,
>>
>>
>> I have observed that Spark SQL is not returning records for hive bucketed
>> ORC tables on HDP.
>>
>>
>>
>> On spark SQL , I am able to list all tables , but queries on hive
>> bucketed tables are not returning records.
>>
>> I have also tried the same for non-bucketed hive tables. it is working
>> fine.
>>
>>
>>
>> Same is working on plain Apache setup.
>>
>> Let me know if needs other details.
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>
>
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>

Re: Spark SQL is not returning records for hive bucketed tables on HDP

Posted by Varadharajan Mukundan <sr...@gmail.com>.
Hi,

Is the transactional attribute set on your table? I have observed that
Hive's transactional storage structure does not work with Spark yet. You
can confirm this by looking at the transactional attribute in the output of
"desc extended <tablename>" in the Hive console.

If you need to access a transactional table, consider doing a major
compaction first and then try accessing the table.
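
As an illustration of that check, with a placeholder table name and the
output trimmed (the exact layout differs between Hive versions):

hive> desc extended mytable;
...
Detailed Table Information  Table(tableName:mytable, dbName:mydb, ...,
    parameters:{..., transactional=true, ...}, ...)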

On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sa...@gmail.com>
wrote:

> Hi,
>
>
> I have observed that Spark SQL is not returning records for hive bucketed
> ORC tables on HDP.
>
>
>
> On spark SQL , I am able to list all tables , but queries on hive bucketed
> tables are not returning records.
>
> I have also tried the same for non-bucketed hive tables. it is working
> fine.
>
>
>
> Same is working on plain Apache setup.
>
> Let me know if needs other details.
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>



-- 
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
               -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in