Posted to user@kudu.apache.org by Frank Heimerzheim <fh...@gmail.com> on 2017/02/07 14:17:50 UTC

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Hello,

For quite a while I've worked successfully with
https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar

For a while I ignored a problem with the Kudu datatype int8. With the
connector I can't write int8, as an int in Python will always bring up
errors like

"java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8, Type:
unixtime_micros, size: 8], it's int8"

As Python isn't strictly typed, the connector tries to find a suitable
type for a Python int in Java/Kudu. Apparently the Python int is matched
to int64/unixtime_micros and not to the int8 that Kudu expects at this
position.

As a quick fix, all my ints in Kudu are int64 at the moment.

In the long run I can't accept this waste of disk space or, even worse,
I/O. Any idea when I will be able to store int8 from Python/Spark to Kudu?

With the "normal" Python API everything works fine; only the
Spark/Kudu/Python connector shows the problem.
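
To illustrate: without an explicit schema, pyspark infers a 64-bit type
for a plain Python int (a minimal sketch, the one-column frame is just
for show):

df = sqlContext.createDataFrame([(42,)], ['id'])
df.printSchema()
# root
#  |-- id: long (nullable = true)   <- inferred as 64-bit, never int8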

As so often: Thanks in advance for your excellent help!

Frank

2016-12-13 12:12 GMT+01:00 Frank Heimerzheim <fh...@gmail.com>:

> Hello,
>
> Within the impala-shell I can create an external table and thereafter
> select and insert data from an underlying Kudu table. In the statement
> that creates the table, a 'StorageHandler' is set to
> 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, so
> apparently there exists a *.jar containing the referenced class.
>
> When trying to select from a hive-shell, there is an error that the
> handler is not available. Trying to 'rdd.collect()' from a hiveCtx within
> a sparkSession, I also get a Java ClassNotFoundException, as the
> KuduStorageHandler is not available.
>
> I then tried to find the jar on my system with the intention of copying
> it to all my data nodes. Sadly, I couldn't find the specific jar. I think
> it exists on the system, as Impala apparently is using it. For a test I
> changed the 'StorageHandler' in the creation statement to
> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create statement
> worked, and so did the select from Impala, but it didn't return any data.
> There was no error, although I had expected one. The test was just for
> the case that Impala would in some magic way select data from Kudu
> without a correct 'StorageHandler'. Apparently this is not the case, and
> Impala does have access to a 'com.cloudera.kudu.hive.KuduStorageHandler'.
>
> Long story, short questions:
> In which *.jar can I find 'com.cloudera.kudu.hive.KuduStorageHandler'?
> Is copying the jar by hand to all nodes an appropriate way to put Spark
> in a position to work with Kudu?
> What about the beeline shell from Hive and the possibility of reading
> from Kudu?
>
> My environment: Cloudera 5.7 with Kudu and impala-kudu installed from
> parcels. I built a working python-kudu library successfully from source
> (git).
>
> Thanks a lot!
> Frank
>

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Feb 14, 2017 at 11:51 PM, Frank Heimerzheim <fh...@gmail.com>
wrote:

> Hello Todd,
>
> I was naive enough to assume a decent automatic type assignment. With
> explicit type assignment via a schema, everything works fine.
>
> From a Pythonic viewpoint, having to spell out data types is not what I
> want to do all day. But this is a philosophical discussion, and I got
> caught out by this way of thinking just now.
>
> I had expected that the attempt to store 42 in an int8 would work and the
> attempt to store 4242 would raise an error. But the driver is not
> "intelligent" enough to test every individual value; it checks once for a
> data type. Not Pythonic, but now that I'm really aware of the topic: no
> further problem.
>

Yeah, I see that the "static typing" we are doing isn't very Pythonic.

Our thinking here (which really comes from the thinking on the C++ side) is
that, given the underlying Kudu data has static column typing, we didn't
want to have a scenario where someone uses a too-small column, and then
tests on data that happens to fit in range. Then, they get a surprise one
day when all of their inserts start failing with "data out of range for
int32" errors or whatever. Forcing people to evaluate the column sizes up
front avoids nasty surprises later.
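
As a toy illustration of the failure mode (just the shape of the check,
not Kudu's actual code):

# Illustrative only: why a too-small column bites later, not in testing.
INT8_MIN, INT8_MAX = -128, 127

for value in (42, 99, 4242):   # the first two would "pass testing"
    if not (INT8_MIN <= value <= INT8_MAX):
        raise ValueError("data out of range for int8: %d" % value)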

But, maybe you can see my biases towards static-typed languages leaking
through here ;-)

-Todd


> 2017-02-14 19:44 GMT+01:00 Todd Lipcon <to...@cloudera.com>:
>
>> Hi Frank,
>>
>> Could you try something like:
>>
>> from pyspark.sql.types import StructType, StructField, LongType, \
>>     IntegerType, StringType
>>
>> data = [(42, 2017, 'John')]
>> schema = StructType([
>>     StructField("id", LongType(), False),      # kudu.int64, nullable=False
>>     StructField("year", IntegerType(), True),  # kudu.int32
>>     StructField("name", StringType(), True)])  # for an int8 column: ByteType()
>> df = sqlContext.createDataFrame(data, schema)
>>
>> That should explicitly set the types (based on my reading of the pyspark
>> docs for createDataFrame).
>>
>> -Todd
>>
>>
>> On Tue, Feb 14, 2017 at 1:11 AM, Frank Heimerzheim <fh...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> Here is a snippet which produces the error.
>>>
>>> Call from the shell:
>>> spark-submit --jars \
>>>     /opt/storage/data_nfs/cloudera/pyspark/libs/kudu-spark_2.10-1.2.0.jar \
>>>     test.py
>>>
>>>
>>> Snippet from the Python code test.py:
>>>
>>> (..)
>>> builder = kudu.schema_builder()
>>> builder.add_column('id', kudu.int64, nullable=False)
>>> builder.add_column('year', kudu.int32)
>>> builder.add_column('name', kudu.string)
>>> (..)
>>>
>>> (..)
>>> data = [(42, 2017, 'John')]
>>> df = sqlContext.createDataFrame(data, ['id', 'year', 'name'])
>>> df.write.format('org.apache.kudu.spark.kudu').option('kudu.master', kudu_master)\
>>>                                              .option('kudu.table', kudu_table)\
>>>                                              .mode('append')\
>>>                                              .save()
>>> (..)
>>>
>>> Error:
>>> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, ls00152y.xxx.com, partition 1,PROCESS_LOCAL, 2096 bytes)
>>> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 5) in 113 ms on ls00152y.xxx.com (1/2)
>>> 17/02/13 12:59:24 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 6, ls00152y.xx.com): java.lang.IllegalArgumentException: year isn't [Type: int64, size: 8, Type: unixtime_micros, size: 8], it's int32
>>> 	at org.apache.kudu.client.PartialRow.checkColumn(PartialRow.java:462)
>>> 	at org.apache.kudu.client.PartialRow.addLong(PartialRow.java:217)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:215)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:205)
>>> 	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>> 	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>> 	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>> 	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:205)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:203)
>>> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> 	at org.apache.kudu.spark.kudu.KuduContext.org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows(KuduContext.scala:203)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:181)
>>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:180)
>>> 	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>>> 	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>>> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>>> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>>> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>> 	at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> 	at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> Same result with kudu.int8 and kudu.int16. Only kudu.int64 works for me. The problem persists whether the attribute is part of the key or not.
>>>
>>> Greetings
>>>
>>> Frank
>>>
>>>
>>> 2017-02-13 6:23 GMT+01:00 Todd Lipcon <to...@cloudera.com>:
>>>
>>>> On Tue, Feb 7, 2017 at 6:17 AM, Frank Heimerzheim <fh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> For quite a while I've worked successfully with
>>>>> https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar
>>>>>
>>>>> For a while I ignored a problem with the Kudu datatype int8. With the
>>>>> connector I can't write int8, as an int in Python will always bring up
>>>>> errors like
>>>>>
>>>>> "java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8, Type:
>>>>> unixtime_micros, size: 8], it's int8"
>>>>>
>>>>> As Python isn't strictly typed, the connector tries to find a suitable
>>>>> type for a Python int in Java/Kudu. Apparently the Python int is matched
>>>>> to int64/unixtime_micros and not to the int8 that Kudu expects at this
>>>>> position.
>>>>>
>>>>> As a quick fix, all my ints in Kudu are int64 at the moment.
>>>>>
>>>>> In the long run I can't accept this waste of disk space or, even worse,
>>>>> I/O. Any idea when I will be able to store int8 from Python/Spark to Kudu?
>>>>>
>>>>> With the "normal" Python API everything works fine; only the
>>>>> Spark/Kudu/Python connector shows the problem.
>>>>>
>>>>
>>>> Not 100% sure I'm following. You're using pyspark here? Can you post a
>>>> bit of sample code that reproduces the issue?
>>>>
>>>> -Todd
>>>>
>>>>
>>>>> 2016-12-13 12:12 GMT+01:00 Frank Heimerzheim <fh...@gmail.com>:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Within the impala-shell I can create an external table and thereafter
>>>>>> select and insert data from an underlying Kudu table. In the statement
>>>>>> that creates the table, a 'StorageHandler' is set to
>>>>>> 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, so
>>>>>> apparently there exists a *.jar containing the referenced class.
>>>>>>
>>>>>> When trying to select from a hive-shell, there is an error that the
>>>>>> handler is not available. Trying to 'rdd.collect()' from a hiveCtx within
>>>>>> a sparkSession, I also get a Java ClassNotFoundException, as the
>>>>>> KuduStorageHandler is not available.
>>>>>>
>>>>>> I then tried to find the jar on my system with the intention of copying
>>>>>> it to all my data nodes. Sadly, I couldn't find the specific jar. I think
>>>>>> it exists on the system, as Impala apparently is using it. For a test I
>>>>>> changed the 'StorageHandler' in the creation statement to
>>>>>> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create statement
>>>>>> worked, and so did the select from Impala, but it didn't return any data.
>>>>>> There was no error, although I had expected one. The test was just for
>>>>>> the case that Impala would in some magic way select data from Kudu
>>>>>> without a correct 'StorageHandler'. Apparently this is not the case, and
>>>>>> Impala does have access to a 'com.cloudera.kudu.hive.KuduStorageHandler'.
>>>>>>
>>>>>> Long story, short questions:
>>>>>> In which *.jar can I find 'com.cloudera.kudu.hive.KuduStorageHandler'?
>>>>>> Is copying the jar by hand to all nodes an appropriate way to put Spark
>>>>>> in a position to work with Kudu?
>>>>>> What about the beeline shell from Hive and the possibility of reading
>>>>>> from Kudu?
>>>>>>
>>>>>> My environment: Cloudera 5.7 with Kudu and impala-kudu installed from
>>>>>> parcels. I built a working python-kudu library successfully from source
>>>>>> (git).
>>>>>>
>>>>>> Thanks a lot!
>>>>>> Frank
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Frank Heimerzheim <fh...@gmail.com>.
Hello Todd,

I was naive enough to assume a decent automatic type assignment. With
explicit type assignment via a schema, everything works fine.

From a Pythonic viewpoint, having to spell out data types is not what I
want to do all day. But this is a philosophical discussion, and I got
caught out by this way of thinking just now.

I had expected that the attempt to store 42 in an int8 would work and the
attempt to store 4242 would raise an error. But the driver is not
"intelligent" enough to test every individual value; it checks once for a
data type. Not Pythonic, but now that I'm really aware of the topic: no
further problem.
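
My mental model now, as a small runnable sketch (illustrative only, not
the connector's real code; addLong is the method from my stack trace,
addByte would be its int8 counterpart):

from pyspark.sql.types import ByteType, LongType

def writer_for(spark_type):
    # The branch is chosen from the DataFrame schema, once per column, so
    # a LongType column always takes the addLong path, whatever the value.
    if isinstance(spark_type, ByteType):
        return "addByte"   # the int8 path
    if isinstance(spark_type, LongType):
        return "addLong"   # fails on a kudu int8 column, even for 42
    raise TypeError(spark_type)

print(writer_for(LongType()))   # addLong, even when the value is 42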

Thanks again
Frank

2017-02-14 19:44 GMT+01:00 Todd Lipcon <to...@cloudera.com>:

> Hi Frank,
>
> Could you try something like:
>
> from pyspark.sql.types import StructType, StructField, LongType, \
>     IntegerType, StringType
>
> data = [(42, 2017, 'John')]
> schema = StructType([
>     StructField("id", LongType(), False),      # kudu.int64, nullable=False
>     StructField("year", IntegerType(), True),  # kudu.int32
>     StructField("name", StringType(), True)])  # for an int8 column: ByteType()
> df = sqlContext.createDataFrame(data, schema)
>
> That should explicitly set the types (based on my reading of the pyspark
> docs for createDataFrame).
>
> -Todd
>
>
> On Tue, Feb 14, 2017 at 1:11 AM, Frank Heimerzheim <fh...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Here is a snippet which produces the error.
>>
>> Call from the shell:
>> spark-submit --jars \
>>     /opt/storage/data_nfs/cloudera/pyspark/libs/kudu-spark_2.10-1.2.0.jar \
>>     test.py
>>
>>
>> Snippet from the Python code test.py:
>>
>> (..)
>> builder = kudu.schema_builder()
>> builder.add_column('id', kudu.int64, nullable=False)
>> builder.add_column('year', kudu.int32)
>> builder.add_column('name', kudu.string)
>> (..)
>>
>> (..)
>> data = [(42, 2017, 'John')]
>> df = sqlContext.createDataFrame(data, ['id', 'year', 'name'])
>> df.write.format('org.apache.kudu.spark.kudu').option('kudu.master', kudu_master)\
>>                                              .option('kudu.table', kudu_table)\
>>                                              .mode('append')\
>>                                              .save()
>> (..)
>>
>> Error:
>> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, ls00152y.xxx.com, partition 1,PROCESS_LOCAL, 2096 bytes)
>> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 5) in 113 ms on ls00152y.xxx.com (1/2)
>> 17/02/13 12:59:24 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 6, ls00152y.xx.com): java.lang.IllegalArgumentException: year isn't [Type: int64, size: 8, Type: unixtime_micros, size: 8], it's int32
>> 	at org.apache.kudu.client.PartialRow.checkColumn(PartialRow.java:462)
>> 	at org.apache.kudu.client.PartialRow.addLong(PartialRow.java:217)
>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:215)
>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:205)
>> 	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>> 	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> 	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> 	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:205)
>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:203)
>> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> 	at org.apache.kudu.spark.kudu.KuduContext.org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows(KuduContext.scala:203)
>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:181)
>> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:180)
>> 	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>> 	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> 	at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> 	at java.lang.Thread.run(Thread.java:745)
>>
>>
>> Same result with kudu.int8 and kudu.int16. Only kudu.int64 works for me. The problem persists whether the attribute is part of the key or not.
>>
>> Greetings
>>
>> Frank
>>
>>
>> 2017-02-13 6:23 GMT+01:00 Todd Lipcon <to...@cloudera.com>:
>>
>>> On Tue, Feb 7, 2017 at 6:17 AM, Frank Heimerzheim <fh...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> For quite a while I've worked successfully with
>>>> https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar
>>>>
>>>> For a while I ignored a problem with the Kudu datatype int8. With the
>>>> connector I can't write int8, as an int in Python will always bring up
>>>> errors like
>>>>
>>>> "java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8, Type:
>>>> unixtime_micros, size: 8], it's int8"
>>>>
>>>> As Python isn't strictly typed, the connector tries to find a suitable
>>>> type for a Python int in Java/Kudu. Apparently the Python int is matched
>>>> to int64/unixtime_micros and not to the int8 that Kudu expects at this
>>>> position.
>>>>
>>>> As a quick fix, all my ints in Kudu are int64 at the moment.
>>>>
>>>> In the long run I can't accept this waste of disk space or, even worse,
>>>> I/O. Any idea when I will be able to store int8 from Python/Spark to Kudu?
>>>>
>>>> With the "normal" Python API everything works fine; only the
>>>> Spark/Kudu/Python connector shows the problem.
>>>>
>>>
>>> Not 100% sure I'm following. You're using pyspark here? Can you post a
>>> bit of sample code that reproduces the issue?
>>>
>>> -Todd
>>>
>>>
>>>> 2016-12-13 12:12 GMT+01:00 Frank Heimerzheim <fh...@gmail.com>:
>>>>
>>>>> Hello,
>>>>>
>>>>> Within the impala-shell I can create an external table and thereafter
>>>>> select and insert data from an underlying Kudu table. In the statement
>>>>> that creates the table, a 'StorageHandler' is set to
>>>>> 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, so
>>>>> apparently there exists a *.jar containing the referenced class.
>>>>>
>>>>> When trying to select from a hive-shell, there is an error that the
>>>>> handler is not available. Trying to 'rdd.collect()' from a hiveCtx within
>>>>> a sparkSession, I also get a Java ClassNotFoundException, as the
>>>>> KuduStorageHandler is not available.
>>>>>
>>>>> I then tried to find the jar on my system with the intention of copying
>>>>> it to all my data nodes. Sadly, I couldn't find the specific jar. I think
>>>>> it exists on the system, as Impala apparently is using it. For a test I
>>>>> changed the 'StorageHandler' in the creation statement to
>>>>> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create statement
>>>>> worked, and so did the select from Impala, but it didn't return any data.
>>>>> There was no error, although I had expected one. The test was just for
>>>>> the case that Impala would in some magic way select data from Kudu
>>>>> without a correct 'StorageHandler'. Apparently this is not the case, and
>>>>> Impala does have access to a 'com.cloudera.kudu.hive.KuduStorageHandler'.
>>>>>
>>>>> Long story, short questions:
>>>>> In which *.jar can I find 'com.cloudera.kudu.hive.KuduStorageHandler'?
>>>>> Is copying the jar by hand to all nodes an appropriate way to put Spark
>>>>> in a position to work with Kudu?
>>>>> What about the beeline shell from Hive and the possibility of reading
>>>>> from Kudu?
>>>>>
>>>>> My environment: Cloudera 5.7 with Kudu and impala-kudu installed from
>>>>> parcels. I built a working python-kudu library successfully from source
>>>>> (git).
>>>>>
>>>>> Thanks a lot!
>>>>> Frank
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Frank,

Could you try something like:

from pyspark.sql.types import StructType, StructField, LongType, \
    IntegerType, StringType

data = [(42, 2017, 'John')]
schema = StructType([
    StructField("id", LongType(), False),      # kudu.int64, nullable=False
    StructField("year", IntegerType(), True),  # kudu.int32
    StructField("name", StringType(), True)])  # for an int8 column: ByteType()
df = sqlContext.createDataFrame(data, schema)

That should explicitly set the types (based on my reading of the pyspark
docs for createDataFrame).
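
For reference, my understanding of how the Spark SQL types line up with
Kudu column types (from my reading of the connector, so treat this as a
sketch):

# Spark SQL type   ->  Kudu column type
# ByteType()       ->  int8
# ShortType()      ->  int16
# IntegerType()    ->  int32
# LongType()       ->  int64
# FloatType()      ->  float
# DoubleType()     ->  double
# BooleanType()    ->  bool
# StringType()     ->  string
# TimestampType()  ->  unixtime_micros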

-Todd


On Tue, Feb 14, 2017 at 1:11 AM, Frank Heimerzheim <fh...@gmail.com>
wrote:

> Hello,
>
> Here is a snippet which produces the error.
>
> Call from the shell:
> spark-submit --jars \
>     /opt/storage/data_nfs/cloudera/pyspark/libs/kudu-spark_2.10-1.2.0.jar \
>     test.py
>
>
> Snippet from the Python code test.py:
>
> (..)
> builder = kudu.schema_builder()
> builder.add_column('id', kudu.int64, nullable=False)
> builder.add_column('year', kudu.int32)
> builder.add_column('name', kudu.string)
> (..)
>
> (..)
> data = [(42, 2017, 'John')]
> df = sqlContext.createDataFrame(data, ['id', 'year', 'name'])
> df.write.format('org.apache.kudu.spark.kudu').option('kudu.master', kudu_master)\
>                                              .option('kudu.table', kudu_table)\
>                                              .mode('append')\
>                                              .save()
> (..)
>
> Error:
> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, ls00152y.xxx.com, partition 1,PROCESS_LOCAL, 2096 bytes)
> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 5) in 113 ms on ls00152y.xxx.com (1/2)
> 17/02/13 12:59:24 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 6, ls00152y.xx.com): java.lang.IllegalArgumentException: year isn't [Type: int64, size: 8, Type: unixtime_micros, size: 8], it's int32
> 	at org.apache.kudu.client.PartialRow.checkColumn(PartialRow.java:462)
> 	at org.apache.kudu.client.PartialRow.addLong(PartialRow.java:217)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:215)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:205)
> 	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> 	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:205)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:203)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at org.apache.kudu.spark.kudu.KuduContext.org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows(KuduContext.scala:203)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:181)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:180)
> 	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
> 	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:89)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
>
>
> Same result with kudu.int8 and kudu.int16. Only kudu.int64 works for me. The problem persists whether the attribute is part of the key or not.
>
> Greetings
>
> Frank
>
>
> 2017-02-13 6:23 GMT+01:00 Todd Lipcon <to...@cloudera.com>:
>
>> On Tue, Feb 7, 2017 at 6:17 AM, Frank Heimerzheim <fh...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> For quite a while I've worked successfully with
>>> https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar
>>>
>>> For a while I ignored a problem with the Kudu datatype int8. With the
>>> connector I can't write int8, as an int in Python will always bring up
>>> errors like
>>>
>>> "java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8, Type:
>>> unixtime_micros, size: 8], it's int8"
>>>
>>> As Python isn't strictly typed, the connector tries to find a suitable
>>> type for a Python int in Java/Kudu. Apparently the Python int is matched
>>> to int64/unixtime_micros and not to the int8 that Kudu expects at this
>>> position.
>>>
>>> As a quick fix, all my ints in Kudu are int64 at the moment.
>>>
>>> In the long run I can't accept this waste of disk space or, even worse,
>>> I/O. Any idea when I will be able to store int8 from Python/Spark to Kudu?
>>>
>>> With the "normal" Python API everything works fine; only the
>>> Spark/Kudu/Python connector shows the problem.
>>>
>>
>> Not 100% sure I'm following. You're using pyspark here? Can you post a
>> bit of sample code that reproduces the issue?
>>
>> -Todd
>>
>>
>>> 2016-12-13 12:12 GMT+01:00 Frank Heimerzheim <fh...@gmail.com>:
>>>
>>>> Hello,
>>>>
>>>> Within the impala-shell I can create an external table and thereafter
>>>> select and insert data from an underlying Kudu table. In the statement
>>>> that creates the table, a 'StorageHandler' is set to
>>>> 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, so
>>>> apparently there exists a *.jar containing the referenced class.
>>>>
>>>> When trying to select from a hive-shell, there is an error that the
>>>> handler is not available. Trying to 'rdd.collect()' from a hiveCtx within
>>>> a sparkSession, I also get a Java ClassNotFoundException, as the
>>>> KuduStorageHandler is not available.
>>>>
>>>> I then tried to find the jar on my system with the intention of copying
>>>> it to all my data nodes. Sadly, I couldn't find the specific jar. I think
>>>> it exists on the system, as Impala apparently is using it. For a test I
>>>> changed the 'StorageHandler' in the creation statement to
>>>> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create statement
>>>> worked, and so did the select from Impala, but it didn't return any data.
>>>> There was no error, although I had expected one. The test was just for
>>>> the case that Impala would in some magic way select data from Kudu
>>>> without a correct 'StorageHandler'. Apparently this is not the case, and
>>>> Impala does have access to a 'com.cloudera.kudu.hive.KuduStorageHandler'.
>>>>
>>>> Long story, short questions:
>>>> In which *.jar can I find 'com.cloudera.kudu.hive.KuduStorageHandler'?
>>>> Is copying the jar by hand to all nodes an appropriate way to put Spark
>>>> in a position to work with Kudu?
>>>> What about the beeline shell from Hive and the possibility of reading
>>>> from Kudu?
>>>>
>>>> My environment: Cloudera 5.7 with Kudu and impala-kudu installed from
>>>> parcels. I built a working python-kudu library successfully from source
>>>> (git).
>>>>
>>>> Thanks a lot!
>>>> Frank
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Frank Heimerzheim <fh...@gmail.com>.
Hello,

Here is a snippet which produces the error.

Call from the shell:
spark-submit --jars \
    /opt/storage/data_nfs/cloudera/pyspark/libs/kudu-spark_2.10-1.2.0.jar \
    test.py


Snippet from the Python code test.py:

(..)
builder = kudu.schema_builder()
builder.add_column('id', kudu.int64, nullable=False)
builder.add_column('year', kudu.int32)
builder.add_column('name', kudu.string)
(..)

(..)
data = [(42, 2017, 'John')]
df = sqlContext.createDataFrame(data, ['id', 'year', 'name'])
df.write.format('org.apache.kudu.spark.kudu').option('kudu.master',
kudu_master)\
                                             .option('kudu.table', kudu_table)\
                                             .mode('append')\
                                             .save()
(..)

Error:
17/02/13 12:59:24 INFO scheduler.TaskSetManager: Starting task 1.0 in
stage 4.0 (TID 6, ls00152y.xxx.com, partition 1,PROCESS_LOCAL, 2096
bytes)
17/02/13 12:59:24 INFO scheduler.TaskSetManager: Finished task 0.0 in
stage 4.0 (TID 5) in 113 ms on ls00152y.xxx.com (1/2)
17/02/13 12:59:24 WARN scheduler.TaskSetManager: Lost task 1.0 in
stage 4.0 (TID 6, ls00152y.xx.com):
java.lang.IllegalArgumentException: year isn't [Type: int64, size: 8,
Type: unixtime_micros, size: 8], it's int32
	at org.apache.kudu.client.PartialRow.checkColumn(PartialRow.java:462)
	at org.apache.kudu.client.PartialRow.addLong(PartialRow.java:217)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:215)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:205)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:205)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:203)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.kudu.spark.kudu.KuduContext.org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows(KuduContext.scala:203)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:181)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:180)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


Same result with kudu.int8 and kudu.int16. Only kudu.int64 works for
me. The problem persists whether the attribute is part of the key or not.
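
For what it's worth, the schema Spark infers for this DataFrame shows the
mismatch directly:

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- year: long (nullable = true)   <- long, but the kudu column is int32
#  |-- name: string (nullable = true)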

Greetings

Frank


2017-02-13 6:23 GMT+01:00 Todd Lipcon <to...@cloudera.com>:

> On Tue, Feb 7, 2017 at 6:17 AM, Frank Heimerzheim <fh...@gmail.com>
> wrote:
>
>> Hello,
>>
>> For quite a while I've worked successfully with
>> https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar
>>
>> For a while I ignored a problem with the Kudu datatype int8. With the
>> connector I can't write int8, as an int in Python will always bring up
>> errors like
>>
>> "java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8, Type:
>> unixtime_micros, size: 8], it's int8"
>>
>> As Python isn't strictly typed, the connector tries to find a suitable
>> type for a Python int in Java/Kudu. Apparently the Python int is matched
>> to int64/unixtime_micros and not to the int8 that Kudu expects at this
>> position.
>>
>> As a quick fix, all my ints in Kudu are int64 at the moment.
>>
>> In the long run I can't accept this waste of disk space or, even worse,
>> I/O. Any idea when I will be able to store int8 from Python/Spark to Kudu?
>>
>> With the "normal" Python API everything works fine; only the
>> Spark/Kudu/Python connector shows the problem.
>>
>
> Not 100% sure I'm following. You're using pyspark here? Can you post a bit
> of sample code that reproduces the issue?
>
> -Todd
>
>
>> 2016-12-13 12:12 GMT+01:00 Frank Heimerzheim <fh...@gmail.com>:
>>
>>> Hello,
>>>
>>> Within the impala-shell I can create an external table and thereafter
>>> select and insert data from an underlying Kudu table. In the statement
>>> that creates the table, a 'StorageHandler' is set to
>>> 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, so
>>> apparently there exists a *.jar containing the referenced class.
>>>
>>> When trying to select from a hive-shell, there is an error that the
>>> handler is not available. Trying to 'rdd.collect()' from a hiveCtx within
>>> a sparkSession, I also get a Java ClassNotFoundException, as the
>>> KuduStorageHandler is not available.
>>>
>>> I then tried to find the jar on my system with the intention of copying
>>> it to all my data nodes. Sadly, I couldn't find the specific jar. I think
>>> it exists on the system, as Impala apparently is using it. For a test I
>>> changed the 'StorageHandler' in the creation statement to
>>> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create statement
>>> worked, and so did the select from Impala, but it didn't return any data.
>>> There was no error, although I had expected one. The test was just for
>>> the case that Impala would in some magic way select data from Kudu
>>> without a correct 'StorageHandler'. Apparently this is not the case, and
>>> Impala does have access to a 'com.cloudera.kudu.hive.KuduStorageHandler'.
>>>
>>> Long story, short questions:
>>> In which *.jar can I find 'com.cloudera.kudu.hive.KuduStorageHandler'?
>>> Is copying the jar by hand to all nodes an appropriate way to put Spark
>>> in a position to work with Kudu?
>>> What about the beeline shell from Hive and the possibility of reading
>>> from Kudu?
>>>
>>> My environment: Cloudera 5.7 with Kudu and impala-kudu installed from
>>> parcels. I built a working python-kudu library successfully from source
>>> (git).
>>>
>>> Thanks a lot!
>>> Frank
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Feb 7, 2017 at 6:17 AM, Frank Heimerzheim <fh...@gmail.com>
wrote:

> Hello,
>
> For quite a while I've worked successfully with
> https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar
>
> For a while I ignored a problem with the Kudu datatype int8. With the
> connector I can't write int8, as an int in Python will always bring up
> errors like
>
> "java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8, Type:
> unixtime_micros, size: 8], it's int8"
>
> As Python isn't strictly typed, the connector tries to find a suitable
> type for a Python int in Java/Kudu. Apparently the Python int is matched
> to int64/unixtime_micros and not to the int8 that Kudu expects at this
> position.
>
> As a quick fix, all my ints in Kudu are int64 at the moment.
>
> In the long run I can't accept this waste of disk space or, even worse,
> I/O. Any idea when I will be able to store int8 from Python/Spark to Kudu?
>
> With the "normal" Python API everything works fine; only the
> Spark/Kudu/Python connector shows the problem.
>

Not 100% sure I'm following. You're using pyspark here? Can you post a bit
of sample code that reproduces the issue?

-Todd


> 2016-12-13 12:12 GMT+01:00 Frank Heimerzheim <fh...@gmail.com>:
>
>> Hello,
>>
>> Within the impala-shell I can create an external table and thereafter
>> select and insert data from an underlying Kudu table. In the statement
>> that creates the table, a 'StorageHandler' is set to
>> 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, so
>> apparently there exists a *.jar containing the referenced class.
>>
>> When trying to select from a hive-shell, there is an error that the
>> handler is not available. Trying to 'rdd.collect()' from a hiveCtx within
>> a sparkSession, I also get a Java ClassNotFoundException, as the
>> KuduStorageHandler is not available.
>>
>> I then tried to find the jar on my system with the intention of copying
>> it to all my data nodes. Sadly, I couldn't find the specific jar. I think
>> it exists on the system, as Impala apparently is using it. For a test I
>> changed the 'StorageHandler' in the creation statement to
>> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create statement
>> worked, and so did the select from Impala, but it didn't return any data.
>> There was no error, although I had expected one. The test was just for
>> the case that Impala would in some magic way select data from Kudu
>> without a correct 'StorageHandler'. Apparently this is not the case, and
>> Impala does have access to a 'com.cloudera.kudu.hive.KuduStorageHandler'.
>>
>> Long story, short questions:
>> In which *.jar can I find 'com.cloudera.kudu.hive.KuduStorageHandler'?
>> Is copying the jar by hand to all nodes an appropriate way to put Spark
>> in a position to work with Kudu?
>> What about the beeline shell from Hive and the possibility of reading
>> from Kudu?
>>
>> My environment: Cloudera 5.7 with Kudu and impala-kudu installed from
>> parcels. I built a working python-kudu library successfully from source
>> (git).
>>
>> Thanks a lot!
>> Frank
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera