Posted to user@hive.apache.org by John Omernik <jo...@omernik.com> on 2012/09/21 14:55:20 UTC

Hive Transform Scripts Ending Cleanly

Greetings All -

I have a transform script that does some awesome stuff (at least to my eyes).

Basically, here is the SQL


  SELECT TRANSFORM (filename)
  USING 'worker.sh' AS (col1, col2, col3, col4, col5)
  FROM mysource_filetable


worker.sh is actually a wrapper script that looks like this:

#!/bin/bash
# Hive writes one filename per line to this script's stdin;
# the parser writes delimited rows to stdout for Hive to pick up.
while read -r line; do
    filename="$line"
    python /mnt/node_scripts/parser.py -i "$filename" -o STDOUT
done

The reason for calling the Python script from a bash wrapper is so I can
read filenames off stdin, process the data, and then shoot it off to
standard out. There are some other reasons... but it works great, most of
the time.

Sometimes, for whatever reason, we have a situation where the Hive
"listener" (I don't know what else to call it) gets bored waiting for
data. The Python script can take a long time depending on the data being
sent to it. Hive gives up waiting on stdout, the task times out, and the
job retries that file somewhere else, where it succeeds. No big deal.
However, the Python script and the Java process that called it seem to
keep running, using up resources. If it doesn't exit cleanly, it kinda
wigs out and goes on to TRANSFORM THE WORLD (said in a loud, echoing,
booming voice). Anywho, just curious if there are ways I can monitor for
that. Perhaps check for things in my worker.sh, maybe run Python directly
from Hive? Settings in Hive that will force-kill the runaways? TRANSFORM
and its capabilities are AWESOME, but like much in Hive, the
documentation is all over the place.
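On the "check for things in my worker.sh" idea: one option (a sketch, not
something confirmed in this thread) is to wrap each parser invocation in
coreutils `timeout`, so a runaway run is killed from inside the wrapper
instead of being orphaned when the task dies. The per-file limit and the
parser path below are illustrative assumptions:

```shell
#!/bin/bash
# Hypothetical variant of worker.sh: bound each parser run with
# coreutils `timeout` so one stuck file can't hold the slot forever.
# The 3600-second budget is an assumed value, not from the thread.
PER_FILE_LIMIT=3600

while read -r line; do
    filename="$line"
    timeout "$PER_FILE_LIMIT" \
        python /mnt/node_scripts/parser.py -i "$filename" -o STDOUT
    if [ $? -eq 124 ]; then
        # timeout exits 124 when the limit was hit; stderr ends up
        # in the Hadoop task logs, so leave a trail there.
        echo "parser timed out after ${PER_FILE_LIMIT}s on: $filename" >&2
    fi
done
```

`timeout` sends SIGTERM by default; adding `--signal=KILL` (the exit code
then becomes 137) handles parsers that ignore SIGTERM.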

Re: Hive Transform Scripts Ending Cleanly

Posted by Edward Capriolo <ed...@gmail.com>.
There is a setting in hive-site which allows transform scripts to
continue even if they take a long time to return a single row.

Edward
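
For reference, the setting Edward may have in mind could be
`hive.script.auto.progress`, which makes Hive send progress heartbeats
while the transform script is still working, so a slow-but-healthy script
doesn't trip the task timeout. This is a hedged guess at the knobs, not
confirmed by the thread; check your hive-site.xml for the exact names and
defaults:

```sql
-- Keep the task alive while the transform script is still crunching,
-- instead of letting the task timeout kill a slow-but-healthy script.
SET hive.script.auto.progress=true;

-- Cap the heartbeating so a genuinely hung script is still killed
-- eventually (0 means heartbeat forever). Value here is illustrative.
SET hive.auto.progress.timeout=7200;
```

Note the flip side: auto-progress alone can keep a truly hung script
alive indefinitely, which is the opposite of what's wanted here, hence
the cap.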
