Posted to user@kudu.apache.org by Frank Heimerzheim <fh...@gmail.com> on 2016/12/13 11:12:03 UTC

Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Hello,

Within the impala-shell I can create an external table and thereafter
select and insert data from an underlying Kudu table. In the CREATE TABLE
statement a 'StorageHandler' is set to
'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, so
apparently a *.jar containing the referenced class exists somewhere.

When I try to select from the hive-shell, I get an error that the handler
is not available. When I try 'rdd.collect()' from a hiveCtx within a
Spark session, I also get a java.lang.ClassNotFoundException because
the KuduStorageHandler is not available.

I then tried to find the jar on my system, intending to copy it to
all my data nodes. Sadly, I couldn't find it. I think it exists on the
system, as Impala is apparently using it. As a test I changed the
'StorageHandler' in the creation statement to
'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The CREATE statement
worked, and so did the SELECT from Impala, but it didn't return any data,
and there was no error as I had expected. The test was just for the case
that Impala would in some magic way select data from Kudu without a
correct 'StorageHandler'. Apparently that is not the case, and Impala does
have access to a 'com.cloudera.kudu.hive.KuduStorageHandler'.

Long story, short questions:
In which *.jar can I find 'com.cloudera.kudu.hive.KuduStorageHandler'?
Is copying the jar by hand to all nodes an appropriate way to put Spark
in a position to work with Kudu?
What about the beeline shell from Hive and the possibility of reading from
Kudu?

My environment: Cloudera 5.7 with kudu and impala-kudu installed from
parcels. I also built a working python-kudu library from source (git).

Thanks a lot!
Frank

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Feb 14, 2017 at 11:51 PM, Frank Heimerzheim <fh...@gmail.com>
wrote:

> Hello Todd,
>
> I was naive enough to assume decent automatic type assignment. With
> explicit type assignment via a schema, everything works fine.
>
> From a pythonic viewpoint, having to declare data types explicitly is
> not what I want to do all day. But that is a philosophical discussion,
> and I just got caught out by that way of thinking.
>
> I had expected that the attempt to store 42 in an int8 would work and
> the attempt to store 4242 would raise an error. But the driver is not
> "intelligent" enough to test every individual value; it checks once for
> a data type. Not pythonic, but now that I'm really aware of the topic:
> no further problem.
>

Yea, I see that the "static typing" we are doing isn't very pythonic.

Our thinking here (which really comes from the thinking on the C++ side) is
that, given the underlying Kudu data has static column typing, we didn't
want to have a scenario where someone uses a too-small column, and then
tests on data that happens to fit in range. Then, they get a surprise one
day when all of their inserts start failing with "data out of range for
int32" errors or whatever. Forcing people to evaluate the column sizes up
front avoids nasty surprises later.
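The trade-off Todd describes can be sketched in plain Python (no Kudu required; the function names are illustrative, not Kudu client APIs):

```python
# Signed 8-bit range, as for Kudu's int8 column type.
INT8_MIN, INT8_MAX = -128, 127

def per_value_check(value):
    """Per-value range check: passes on small test data, then fails in
    production the first time a value falls out of range."""
    return INT8_MIN <= value <= INT8_MAX

def up_front_check(writer_type, column_type):
    """Static check: the writer's declared type must match the column type
    exactly, so a too-narrow column is rejected before any data is written."""
    return writer_type == column_type

# 42 happens to fit an int8; 4242 does not -- with per-value checks the
# surprise only shows up once real data arrives.
assert per_value_check(42)
assert not per_value_check(4242)

# The static check rejects the mismatch immediately, independent of data.
assert not up_front_check("int64", "int8")
assert up_front_check("int8", "int8")
```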

But, maybe you can see my biases towards static-typed languages leaking
through here ;-)

-Todd


>>> 2017-02-13 6:23 GMT+01:00 Todd Lipcon <to...@cloudera.com>:
>>>
>>>> On Tue, Feb 7, 2017 at 6:17 AM, Frank Heimerzheim <fh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> For quite a while I have worked successfully with
>>>>> https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar
>>>>>
>>>>> For a while I ignored a problem with the Kudu datatype int8. With the
>>>>> connector I can't write int8, as an int in Python always brings up
>>>>> errors like
>>>>>
>>>>> "java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8,
>>>>> Type: unixtime_micros, size: 8], it's int8"
>>>>>
>>>>> As Python isn't statically typed, the connector has to find a
>>>>> suitable type for a Python int in Java/Kudu. Apparently the Python
>>>>> int is matched to int64/unixtime_micros and not to the int8 that
>>>>> Kudu expects here.
>>>>>
>>>>> As a quick solution, all my integers in Kudu are int64 at the moment.
>>>>>
>>>>> In the long run I can't accept this waste of disk space or, even
>>>>> worse, I/O. Any idea when I will be able to store int8 from
>>>>> Python/Spark in Kudu?
>>>>>
>>>>> With the "normal" Python API everything works fine; only the
>>>>> Spark/Kudu/Python connector brings up the problem.
>>>>>
>>>>
>>>> Not 100% sure I'm following. You're using pyspark here? Can you post a
>>>> bit of sample code that reproduces the issue?
>>>>
>>>> -Todd
>>>>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Frank Heimerzheim <fh...@gmail.com>.
Hello Todd,

I was naive enough to assume decent automatic type assignment. With
explicit type assignment via a schema, everything works fine.

From a pythonic viewpoint, having to declare data types explicitly is not
what I want to do all day. But that is a philosophical discussion, and I
just got caught out by that way of thinking.

I had expected that the attempt to store 42 in an int8 would work and the
attempt to store 4242 would raise an error. But the driver is not
"intelligent" enough to test every individual value; it checks once for a
data type. Not pythonic, but now that I'm really aware of the topic: no
further problem.

Thanks again
Frank


Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Frank,

Could you try something like:

data = [(42, 2017, 'John')]
schema = StructType([
    StructField("id", ByteType(), True),
    StructField("year", ByteType(), True),
    StructField("name", StringType(), True)])
df = sqlContext.createDataFrame(data, schema)

That should explicitly set the types (based on my reading of the pyspark
docs for createDataFrame)

-Todd
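One caveat on the sketch above (an assumption worth checking against the pyspark docs, not part of the original mail): Spark's ByteType is a signed 8-bit integer, so it can hold 42 but not the year 2017; for the schema Frank posted (id int64, year int32) LongType and IntegerType would be the matching widths. The ranges in plain Python:

```python
# Signed ranges of the Spark/Kudu integer types discussed in this thread.
RANGES = {
    "ByteType/int8":     (-2**7,  2**7 - 1),
    "ShortType/int16":   (-2**15, 2**15 - 1),
    "IntegerType/int32": (-2**31, 2**31 - 1),
    "LongType/int64":    (-2**63, 2**63 - 1),
}

def fits(value, type_name):
    lo, hi = RANGES[type_name]
    return lo <= value <= hi

assert fits(42, "ByteType/int8")        # the 'id' sample value fits int8
assert not fits(2017, "ByteType/int8")  # but the year 2017 does not
assert fits(2017, "IntegerType/int32")  # IntegerType matches Kudu's int32
```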




-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Frank Heimerzheim <fh...@gmail.com>.
Hello,

Here is a snippet that produces the error.

Call from the shell:
spark-submit --jars /opt/storage/data_nfs/cloudera/pyspark/libs/kudu-spark_2.10-1.2.0.jar test.py


Snippet from the Python code test.py:

(..)
builder = kudu.schema_builder()
builder.add_column('id', kudu.int64, nullable=False)
builder.add_column('year', kudu.int32)
builder.add_column('name', kudu.string)
(..)

(..)
data = [(42, 2017, 'John')]
df = sqlContext.createDataFrame(data, ['id', 'year', 'name'])
df.write.format('org.apache.kudu.spark.kudu').option('kudu.master', kudu_master)\
                                             .option('kudu.table', kudu_table)\
                                             .mode('append')\
                                             .save()
(..)

Error:
17/02/13 12:59:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, ls00152y.xxx.com, partition 1,PROCESS_LOCAL, 2096 bytes)
17/02/13 12:59:24 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 5) in 113 ms on ls00152y.xxx.com (1/2)
17/02/13 12:59:24 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 6, ls00152y.xx.com): java.lang.IllegalArgumentException: year isn't [Type: int64, size: 8, Type: unixtime_micros, size: 8], it's int32
	at org.apache.kudu.client.PartialRow.checkColumn(PartialRow.java:462)
	at org.apache.kudu.client.PartialRow.addLong(PartialRow.java:217)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:215)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:205)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:205)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:203)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.kudu.spark.kudu.KuduContext.org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows(KuduContext.scala:203)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:181)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:180)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


Same result with kudu.int8 and kudu.int16; only kudu.int64 works for me.
The problem persists whether the attribute is part of the key or not.
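The error is consistent with PySpark's schema inference: createDataFrame without an explicit schema maps every Python int to LongType (int64), so the connector calls addLong() regardless of the Kudu column's width, and the client rejects the mismatch for the int32 'year' column. A plain-Python illustration (the inference function is a simplified stand-in, not Spark's actual code):

```python
def infer_spark_type(value):
    # Simplified stand-in for PySpark's schema inference: Python ints are
    # always mapped to LongType, never to a narrower integer type.
    if isinstance(value, bool):        # bool is a subclass of int in Python
        return "BooleanType"
    if isinstance(value, int):
        return "LongType"
    if isinstance(value, str):
        return "StringType"
    raise TypeError("unsupported: %r" % (value,))

row = (42, 2017, 'John')
inferred = [infer_spark_type(v) for v in row]
kudu_columns = ["int64", "int32", "string"]  # from the schema_builder above

# 'id' matches (LongType -> int64), but 'year' does not: LongType vs int32.
assert inferred == ["LongType", "LongType", "StringType"]
```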

Greetings,

Frank



Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Feb 7, 2017 at 6:17 AM, Frank Heimerzheim <fh...@gmail.com>
wrote:

> Hello,
>
> quite a while i´ve worked successfully with
> https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar
>
> For a bit i ignored a problem with kudu datatype int8. With the connector
> i can´t write int8 as int in python will always bring up errors like
>
> "java.lang.IllegalArgumentException: id isn´t [Type: int64, size: 8, Type:
> unixtime_micros, size: 8], it´s int8"
>
> As python isn´t hard typed the connector is trying to find a suitable type
> for python int in java/kudu. Apparently the python int is matched to
> int64/unixtime_micros and not int8 as kudu is expecting at this place.
>
> As a quick solution all my int in kudu are int64 at the moment
>
> In the long run i can´t accept this waste of hdd space or even worse I/O.
> Any idea when i can store int8 from python/spark to kudu?
>
> With the "normal" python api everything works fine, only the spark/kudu/python
> connector brings up the problem.
>

Not 100% sure I'm following. You're using pyspark here? Can you post a bit
of sample code that reproduces the issue?

-Todd




-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Frank Heimerzheim <fh...@gmail.com>.
Hello,

For quite a while I've worked successfully with
https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar

For a while I ignored a problem with the Kudu datatype int8. With the
connector I can't write int8, as an int in Python always brings up errors
like

"java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8, Type:
unixtime_micros, size: 8], it's int8"

As Python isn't statically typed, the connector tries to find a suitable
type for the Python int in Java/Kudu. Apparently the Python int is mapped
to int64/unixtime_micros and not to the int8 that Kudu expects at this
position.

As a quick fix, all my ints in Kudu are int64 at the moment.

In the long run I can't accept this waste of disk space, or even worse,
I/O. Any idea when I will be able to store int8 from Python/Spark to Kudu?

With the "normal" Python API everything works fine; only the
Spark/Kudu/Python connector shows the problem.

As so often: Thanks in advance for your excellent help!

Frank


Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Jan 9, 2017 at 2:54 AM, Frank Heimerzheim <fh...@gmail.com>
wrote:

> Hello Todd,
>
> one additional question:
>
> There exists a KuduContext in org.apache.kudu.spark.kudu._ which provides
> read/write/update to be used with Scala and Spark. I'm now looking for a
> similar solution for Python and Spark. I've found
> https://github.com/bkvarda/iot_demo which looks fine at first glance, but
> I would much prefer an "official" solution. Is there anything to be
> expected in the near future? Or a way - I don't know yet - to use the
> Scala library from Python?
>

I'm not a real Spark expert (especially not pyspark) so I don't have a
great answer to this question. The github demo you linked above looks like
a reasonable approach, though.

Jordan Birdsell is our primary Python expert, and he filed
https://issues.apache.org/jira/browse/KUDU-1603 a while back. Hopefully he
will chime in with a better answer than I can give :)

-Todd



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Frank Heimerzheim <fh...@gmail.com>.
Hello Todd,

one additional question:

There exists a KuduContext in org.apache.kudu.spark.kudu._ which provides
read/write/update operations to be used with Scala and Spark. I'm now
looking for a similar solution for Python and Spark. I've found
https://github.com/bkvarda/iot_demo which looks fine at first glance, but I
would much prefer an "official" solution. Is there anything to be expected
in the near future? Or is there a way, which I don't know yet, to use the
Scala library from Python?
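One way to use the Scala library from Python, without any Python-specific
binding, is to go through the generic DataFrame reader: pyspark only passes
the format string and options to the JVM, where the Scala DataSource is
resolved. A sketch, assuming the kudu-spark jar is on the classpath (e.g.
via --jars or --packages) and with a made-up master address and table name:

```python
# Fully-qualified name of the Scala kudu-spark DataSource, resolved JVM-side.
KUDU_FORMAT = "org.apache.kudu.spark.kudu"

def read_kudu_table(spark, master, table):
    """Load a Kudu table as a DataFrame through the Scala kudu-spark connector."""
    return (spark.read
            .format(KUDU_FORMAT)
            .option("kudu.master", master)
            .option("kudu.table", table)
            .load())

# Usage (hypothetical cluster):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# df = read_kudu_table(spark, "kudu-master:7051", "impala::default.my_table")
# df.show()
```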

Thanks
Frank


Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Frank Heimerzheim <fh...@gmail.com>.
Hello Todd,

thanks a lot for the clarification.

Greetings
Frank


Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Frank,

I'm sorry to say that the Java storage handler implementation you're
looking for doesn't exist. The Hive metastore requires that non-HDFS
storage engines set some value for the 'storage handler' property, so
Impala uses that special string to denote a Kudu table in the HMS. However,
there is no such Java implementation; Impala detects this class name and
uses its own implementation to plan and execute queries against Kudu.
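To illustrate, the storage handler only ever appears as a string in the
table properties of the metastore entry. A CREATE TABLE statement of this
era, as issued from the impala-shell, looked roughly like the following
sketch (the table name, columns, and master address are made up):

```sql
-- Sketch of CDH 5.x Impala/Kudu DDL; names and addresses are examples only.
CREATE EXTERNAL TABLE my_kudu_table (
  id BIGINT,
  name STRING
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'my_kudu_table',
  'kudu.master_addresses' = 'kudu-master:7051',
  'kudu.key_columns' = 'id'
);
```

Hive rejects the table at query time precisely because it tries to load
that class from the classpath, while Impala only matches the string.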

The Hive support for Kudu is tracked here:
https://issues.apache.org/jira/browse/HIVE-12971
This work isn't committed to the Hive project but there is a prototype on
github that you could try. Note that it's not being actively developed by
the Kudu dev community at this point in time, but if you get it working,
please report back with your experiences.

Thanks
-Todd




-- 
Todd Lipcon
Software Engineer, Cloudera