Posted to user@hive.apache.org by Kevin Weiler <Ke...@imc-chicago.com> on 2014/07/24 17:52:03 UTC

python UDF and Avro tables

Hi All,

I hope I’m not duplicating a previous question, but I couldn’t find any search functionality for the user list archives.

I have written a relatively simple Python script that is meant to take a field from a Hive query and transform it (just some string processing through a dict) when certain conditions are met. After reading this guide:

http://blog.spryinc.com/2013/09/a-guide-to-user-defined-functions-in.html

it would appear that the Python script needs to read the table's native file format (Avro, in my case) from STDIN and write the results to STDOUT. I implemented this using the Python fastavro deserializer and cStringIO for the STDIN/STDOUT handling, and placed the appropriate Python modules on all the nodes (which I could probably do more cleanly by storing them in HDFS). Unfortunately, I'm still getting the errors appended below while trying to transform my field. I believe the problem is that the files can get split at arbitrary points, so a task can receive a chunk of an Avro file with no schema header at the top. Has anyone had any luck running a Python UDF on an Avro table? Cheers!
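
The relevant part of the script looks roughly like this (a simplified sketch; the actual transform logic is omitted):

import sys
import cStringIO
import fastavro as avro

# Buffer everything Hive pipes to STDIN and treat it as a complete Avro
# container file. This is the step that fails: a task only sees its own
# chunk of the input, which usually lacks the Avro header.
avrofile = cStringIO.StringIO(sys.stdin.read())
reader = avro.reader(avrofile)  # raises "cannot read header" on a headerless chunk

for record in reader:
    # ... transform the field and emit the record on STDOUT ...
    pass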


Traceback (most recent call last):
  File "coltoskip.py", line 33, in <module>
    reader = avro.reader(avrofile)
  File "_reader.py", line 368, in fastavro._reader.iter_avro.__init__ (fastavro/_reader.c:6438)
ValueError: cannot read header - is it an avro file?
org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error occurred when trying to close the Operator running your custom script.
        at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:514)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)

--
Kevin Weiler
IT

IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/
Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.weiler@imc-chicago.com


________________________________

The information in this e-mail is intended only for the person or entity to which it is addressed.

It may contain confidential and /or privileged material. If someone other than the intended recipient should receive this e-mail, he / she shall not be entitled to read, disseminate, disclose or duplicate it.

If you receive this e-mail unintentionally, please inform us immediately by "reply" and then delete it from your system. Although this information has been compiled with great care, neither IMC Financial Markets & Asset Management nor any of its related entities shall accept any responsibility for any errors, omissions or other inaccuracies in this information or for the consequences thereof, nor shall it be bound in any way by the contents of this e-mail or its attachments. In the event of incomplete or incorrect transmission, please return the e-mail to the sender and permanently delete this message and any attachments.

Messages and attachments are scanned for all known viruses. Always scan attachments before opening them.

Re: python UDF and Avro tables

Posted by Kevin Weiler <Ke...@imc-chicago.com>.
Thanks for the information. All of the instructions I've read mention splitting the fields of the table on a delimiter, and I figured it would be the delimiter of whatever the underlying file format was. It turns out that in the version of Hive I'm using (hive-0.12.0+cdh5.1.0+369), rows from Avro tables are passed through STDIN delimited by “\t”. I simply needed to split on this delimiter in my Python UDF and it worked. Thanks!
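
For anyone who finds this later, the working script boils down to something like this (a minimal sketch; the column position, the table and column names, and the lookup dict are all made up for illustration):

#!/usr/bin/env python
# Hive streams rows to the script as tab-delimited text on STDIN and
# reads tab-delimited rows back from STDOUT.
#
# Wired up from Hive roughly as:
#   add file coltoskip.py;
#   select transform(col1, col2) using 'python coltoskip.py'
#     as (col1, col2) from mytable;
import sys

LOOKUP = {'OLD': 'NEW'}  # hypothetical stand-in for the real string mapping

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    # rewrite the first column when it matches the dict; pass the rest through
    fields[0] = LOOKUP.get(fields[0], fields[0])
    print '\t'.join(fields)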

--
Kevin Weiler
IT

IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/
Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.weiler@imc-chicago.com


RE: python UDF and Avro tables

Posted by java8964 <ja...@hotmail.com>.
Are you trying to read the Avro file directly in your UDF? If so, that is not the correct way to do it in a UDF.

Hive supports Avro files natively. I don't know your exact UDF requirements, but here is what I would normally do:

Create the table in Hive using AvroContainerInputFormat:
create external table foo
row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
stored as
inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
location '/xxx.avro'
tblproperties (
    'avro.schema.url'='hdfs://xxxx.avsc'
);
In this case, Hive will map the table structure based on the Avro schema file.

Then you can register your UDF and start using it.

Remember that when your Python UDF is invoked, the Avro data will be wrapped as a JSON string and passed to your UDF through STDIN.

For example, if you do "select MYUDF(col1) from foo", the col1 data from Avro will be passed to your Python script as a JSON string, even if col1 is a nested structure. It is then up to your Python script to handle the JSON string and return its output through STDOUT.
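
A rough sketch of the script side (the field name here is hypothetical, and exactly how your Hive version renders a nested column to the script may differ, so treat this as illustrative):

import sys
import json

# Each line on STDIN is one row; a nested col1 arrives as a single
# JSON-like string per line. The 'status' field name is made up.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = json.loads(line)
    except ValueError:
        continue  # skip rows that do not parse as strict JSON
    print record.get('status', '')
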
Yong