Posted to user@hive.apache.org by John Omernik <jo...@omernik.com> on 2013/01/17 04:32:59 UTC

Interaction between Java and Transform Scripts on Hive

I am perplexed. If I run a transform script on a file by itself, it runs
fine and outputs to standard out; life is good. If I run the transform
script on that same file through Hive (with the path and filename being
passed into the script via transform, so that the python script is doing
the exact same thing), I get a java heap space error. This process works
on 99% of files, and I just can't figure out why this file is different.
How does, say, a python transform script run "in" the java process (if
that is even what it is doing) so that it causes a heap error under Hive
but not when run without java around?

I am curious what steps I can take to troubleshoot or eliminate this
problem.

Re: Interaction between Java and Transform Scripts on Hive

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
The transform scripts (or executables) are run as separate processes, so it
sounds like Hive itself is blowing up. That would be consistent with your
script working fine outside Hive. The Hive or Hadoop logs might have clues.
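
As a rough illustration of what "separate process" means here (this is
not Hive's actual code, which is Java), the sketch below starts a child
process, writes tab-separated rows to its stdin, and reads rows back from
its stdout. The names run_as_child and CHILD are made up for the example,
and the one-line upper-casing child is only a stand-in for a real
transform script.

```python
import subprocess
import sys


def run_as_child(script_args, rows):
    """Feed tab-separated rows to a child process on stdin; return its stdout lines."""
    payload = "".join("\t".join(row) + "\n" for row in rows)
    result = subprocess.run(
        script_args,
        input=payload,
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with a non-zero status
    )
    return result.stdout.splitlines()


# Hypothetical stand-in for a transform script: upper-cases every line it reads.
CHILD = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]
```

The child's memory is its own, so a Java heap error points at the parent
side (the JVM reading or buffering the rows), not at the script's
interpreter.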

So, it happens consistently with this one file? I would check to be sure
that there isn't a subtle error in the file or the output from your script,
say an extra tab, other whitespace, or a malformed data value. If you can
find the line where it blows up, that would be good. You could have your
script dump debug data, like an index for each input and the corresponding
key-value pair, or modify the output of the script and the query results to
return information like this to Hive. It seems more likely that the problem
is downstream of the script, after its output passes back into the query.
So, you could try changing the Hive query to just dump the script results
and do nothing else afterwards.
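
A minimal sketch of the debug-dump idea, assuming the script reads
tab-separated rows on stdin and writes tab-separated rows on stdout. The
names transform and EXPECTED_FIELDS are hypothetical, and the field count
of 2 is a placeholder for whatever your table actually has. Debug records
go to stderr so they never mix with the data Hive reads back; they should
then show up in the task logs, though that is worth confirming on your
cluster.

```python
import sys

EXPECTED_FIELDS = 2  # hypothetical: set to the number of columns your script emits


def transform(lines, out=sys.stdout, err=sys.stderr):
    """Pass well-formed rows through; log every row's index and shape to err."""
    bad = 0
    for index, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # One debug record per input row: its index, field count, and length.
        err.write("row=%d fields=%d chars=%d\n" % (index, len(fields), len(line)))
        if len(fields) != EXPECTED_FIELDS:
            bad += 1
            # Show a truncated repr so stray tabs or odd whitespace are visible.
            err.write("row=%d MALFORMED: %r\n" % (index, line[:200]))
            continue
        out.write("\t".join(fields) + "\n")
    return bad


def main():
    transform(sys.stdin)
```

To use this as the transform script itself, call main() at the bottom of
the file. The last row index in the log before the failure, and any very
large chars value, would narrow down which input line to inspect.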

However, I wouldn't expect those problems to cause heap exhaustion, unless
they somehow trigger an infinite loop.

Can you share your python script, Hive query, table schema(s), and a sample
of the file?

dean



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330