Posted to dev@spark.apache.org by Suryansh Agnihotri <sa...@gmail.com> on 2021/06/14 13:32:49 UTC

Spark ACID compatibility

Hi

Does Spark support querying Hive tables that are transactional? I am using
Spark 3.0.2 with Hive metastore 3.1.2 and trying to query such a table, but I
cannot see its data: "show tables" does list the table from the Hive
metastore and "desc table" works fine, but "select * from table" gives an
empty result.
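Concretely, the behaviour can be captured in a short spark-sql session (the
table and database names below are placeholders, not the actual ones):

```sql
-- spark-sql, pointed at the Hive 3.1.2 metastore
SHOW TABLES IN default;                 -- the transactional table is listed
DESC default.some_acid_table;           -- the schema is shown correctly
SELECT * FROM default.some_acid_table;  -- completes without error, returns no rows
```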
Does a later version of Spark have a fix, or is there another way to query
these tables?

Thanks

Re: Spark ACID compatibility

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
On Mon, 14 Jun 2021 at 19:07, Mich Talebzadeh <mi...@gmail.com>
wrote:

>
>
> Now I am trying to read it in Hive
>
> 0: jdbc:hive2://rhes75:10099/default> desc test.randomDataDelta;
> +----------------+--------------+----------+
> |    col_name    |  data_type   | comment  |
> +----------------+--------------+----------+
> | id             | int          |          |
> | clustered      | int          |          |
> | scattered      | int          |          |
> | randomised     | int          |          |
> | random_string  | varchar(50)  |          |
> | small_vc       | varchar(50)  |          |
> | padding        | varchar(40)  |          |
> +----------------+--------------+----------+
> 7 rows selected (0.169 seconds)
> 0: jdbc:hive2://rhes75:10099/default>
>
> select count(1) from test.randomDataDelta;
> Error: Error while processing statement: FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed
> with exception: java.lang.NoSuchMethodError:
> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
> (state=08S01,code=1)
>
> A Google search turned up the same error I raised three years ago:
>
>
> https://user.hive.apache.narkive.com/Td3He6Vj/failed-execution-error-return-code-1-from-org-apache-hadoop-hive-ql-exec-mr-mapredtask-orc-split
>
> So it has not been fixed yet!
>

Looking at the commit log for FileStatus shows HADOOP-14683 touching
compareTo, which cross-references
https://issues.apache.org/jira/browse/HIVE-17133 , which fixes a regression
introduced in https://issues.apache.org/jira/browse/HADOOP-12209 which was
committed by, er, one stevel@apache.org, whoever they are (*).

Try building Hive with the patch, drop in the modified JAR and verify it
works, then confirm this on the Hive JIRA. That will reassure reviewers that
the patch is needed and correct.

steve


(*) I hadn't seen that regression; maybe we should have fixed it by
reinstating the old compareTo(Object) as an overload, but that may have been
impossible given the expectations of the Comparable contract.
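As an aside, the constraint behind that footnote can be shown with a small,
self-contained sketch (the Fs class below is a stand-in, not the real
org.apache.hadoop.fs.FileStatus): once a class implements Comparable with a
typed compareTo, javac already emits a synthetic bridge compareTo(Object), so
hand-reinstating an explicit Object-parameter overload runs into the
compiler's bridge/erasure rules.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

// Illustrative sketch only: Fs stands in for org.apache.hadoop.fs.FileStatus.
public class BridgeDemo {
    static class Fs implements Comparable<Fs> {
        final String path;
        Fs(String path) { this.path = path; }
        @Override public int compareTo(Fs o) { return path.compareTo(o.path); }
    }

    public static void main(String[] args) {
        // Look for the compiler-generated bridge compareTo(Object) in the bytecode.
        boolean hasBridge = Arrays.stream(Fs.class.getDeclaredMethods())
            .anyMatch(m -> m.getName().equals("compareTo")
                        && m.getParameterTypes().length == 1
                        && m.getParameterTypes()[0] == Object.class
                        && m.isBridge());
        System.out.println("bridge compareTo(Object) present: " + hasBridge);

        // Callers linked against the typed compareTo(Lorg/.../FileStatus;)I need
        // that method to exist in the JAR actually on the classpath, hence the
        // NoSuchMethodError when mismatched Hadoop JARs are deployed together.
        if (!hasBridge) throw new AssertionError("expected a bridge method");
    }
}
```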

Re: Spark ACID compatibility

Posted by Mich Talebzadeh <mi...@gmail.com>.
I think we are hitting an old bug.

I tried it with:

Hadoop 3.1.1
Hive 3.1.1
Spark 3.1.1

First, create an ORC transactional table in Hive (via PySpark):

  CREATE TABLE if not exists test.randomDataDelta(
       ID INT
     , CLUSTERED INT
     , SCATTERED INT
     , RANDOMISED INT
     , RANDOM_STRING VARCHAR(50)
     , SMALL_VC VARCHAR(50)
     , PADDING  VARCHAR(40)
    )
  STORED AS ORC
  TBLPROPERTIES (
      "transactional" = "true",
      "orc.create.index" = "true",
      "orc.bloom.filter.columns" = "ID",
      "orc.bloom.filter.fpp" = "0.05",
      "orc.compress" = "SNAPPY",
      "orc.stripe.size" = "16777216",
      "orc.row.index.stride" = "10000"
  )


Then populate it through Spark with random data. It works, and the data can
be read back through Spark:

starting at ID = 218, ending at ID = 236
Schema of the delta table:
root
 |-- ID: long (nullable = true)
 |-- CLUSTERED: double (nullable = true)
 |-- SCATTERED: double (nullable = true)
 |-- RANDOMISED: double (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)

+-----+-----+
|minID|maxID|
+-----+-----+
|    1|  236|
+-----+-----+

Finished at
14/06/2021 19:02:43.43


Now I am trying to read it in Hive

0: jdbc:hive2://rhes75:10099/default> desc test.randomDataDelta;
+----------------+--------------+----------+
|    col_name    |  data_type   | comment  |
+----------------+--------------+----------+
| id             | int          |          |
| clustered      | int          |          |
| scattered      | int          |          |
| randomised     | int          |          |
| random_string  | varchar(50)  |          |
| small_vc       | varchar(50)  |          |
| padding        | varchar(40)  |          |
+----------------+--------------+----------+
7 rows selected (0.169 seconds)
0: jdbc:hive2://rhes75:10099/default>

select count(1) from test.randomDataDelta;
Error: Error while processing statement: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed
with exception: java.lang.NoSuchMethodError:
org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
(state=08S01,code=1)

A Google search turned up the same error I raised three years ago:

https://user.hive.apache.narkive.com/Td3He6Vj/failed-execution-error-return-code-1-from-org-apache-hadoop-hive-ql-exec-mr-mapredtask-orc-split

So it has not been fixed yet!

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Spark ACID compatibility

Posted by Suryansh Agnihotri <sa...@gmail.com>.
No, this also does not work. Steps I followed:

spark-sql:
CREATE TABLE students (id int, name string, marks int) STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

hive-cli:
Created a students_copy table, inserted some values into it, and ran
"INSERT OVERWRITE TABLE students SELECT * FROM default.students_copy;"

I can query both tables from hive-cli, but not from Spark (the table
students was created using Spark).
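Spelled out, the hive-cli side of those steps looks roughly like this (the
rows inserted into students_copy are illustrative, not the actual values):

```sql
-- hive-cli
CREATE TABLE students_copy (id int, name string, marks int);
INSERT INTO students_copy VALUES (1, 'alice', 90), (2, 'bob', 85);
INSERT OVERWRITE TABLE students SELECT * FROM default.students_copy;
-- both tables return rows here in hive-cli, but the transactional
-- students table still comes back empty from spark-sql / spark-shell
SELECT * FROM students;
```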

Thanks


Re: Spark ACID compatibility

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK, there were issues in the past with reading ORC tables through Spark.

If the ORC table is created through Spark, I believe it will work.

Do a test. Create the ORC table through Spark first.

Then do an insert overwrite into that table through the Hive CLI, from your
Hive-created ORC table, and see if you can access the data in the new table
through Spark.
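That test, as a concrete sequence (the table names below are illustrative,
not from the thread):

```sql
-- spark-sql: create the ORC table through Spark
CREATE TABLE test_orc_spark (id int, name string) STORED AS ORC;

-- hive-cli: insert overwrite from the Hive-created ORC table
INSERT OVERWRITE TABLE test_orc_spark SELECT id, name FROM hive_created_orc;

-- spark-sql: check whether the copied rows are now visible
SELECT count(*) FROM test_orc_spark;
```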

HTH






Re: Spark ACID compatibility

Posted by Suryansh Agnihotri <sa...@gmail.com>.
The table was created by Hive (hive-cli); the format is ORC. I am able to get
data from hive-cli (Hive returns rows).
But spark-sql / spark-shell does not return any rows.


Re: Spark ACID compatibility

Posted by Mich Talebzadeh <mi...@gmail.com>.
How was the table created in the first place, Spark or Hive?

Is this an ORC table, and does Spark or Hive return rows?

HTH


