Posted to user@hive.apache.org by Florin Diaconeasa <fl...@gmail.com> on 2012/10/01 09:19:53 UTC

Re: Issue uploading data to S3 with Hive

Hello,

I've run into this issue several times before.

The problem, from what I saw, is that Hive isn't actually aware of the
underlying storage system (which is rather normal), as Hadoop should handle
that. Also, Hadoop occasionally gets a 404 back from Amazon Web Services (I
guess as a way for them to throttle requests) and simply reports that the
resource does not exist (which is normal again). The only solution I found
was to modify the Hive code for my particular case, so that it retries
several times with a delay before concluding that the resource really isn't
there.
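
Roughly, the idea was something like the sketch below. This is not the
actual patch, just a minimal illustration against the plain Hadoop
FileSystem API; the class name, retry count, sleep interval and example
path are all placeholders.

// A minimal sketch of the retry idea, assuming the plain Hadoop FileSystem API.
// Not the actual patch; retry count, sleep interval and the example path are
// made up for illustration.
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3RetryCheck {

    // Ask the FileSystem for the key several times before concluding it is
    // really missing, since S3 can briefly answer 404 for a freshly written object.
    public static FileStatus statusWithRetries(FileSystem fs, Path path,
            int maxAttempts, long sleepMillis) throws IOException {
        FileNotFoundException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fs.getFileStatus(path); // throws FileNotFoundException on a 404
            } catch (FileNotFoundException e) {
                last = e;
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted while waiting for " + path, ie);
                }
            }
        }
        throw last; // still missing after all attempts: treat it as a genuine 404
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Hypothetical key, shaped like the one in the error below.
        Path p = new Path("s3n://mybucket/tmp/hive-keystone/_tmp.-ext-10000/000000_0.gz");
        FileSystem fs = p.getFileSystem(conf);
        System.out.println(statusWithRetries(fs, p, 5, 3000L));
    }
}

The point is simply to treat the first few 404s as potentially transient
rather than fatal, and only fail once the object still isn't visible after
the retries.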

Hope this helps,

Florin

On 29 September 2012 00:00, Charles Menguy <me...@gmail.com> wrote:

> Hi everyone,
>
> I'm using S3 regularly as a means of data storage and transfer, and I have
> some Hive jobs that run on data in HDFS but write their output to S3.
> I'm doing this with an "insert overwrite directory
> 's3n://myaccesskey:mysecretkey@mybucket/path/to/output'".
>
> This works fine 99% of the time, but I see some cases where the job fails
> during the upload even though the query itself is fine. It seems to have
> something to do with temporary storage, but I'm not sure what exactly:
>
> set mapred.output.compress=true;
> set hive.exec.compress.output=true;
> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
> insert overwrite directory 's3n://myaccesskey:mysecretkey@mybucket/path/to/output'
> select
>         whatever
> from
>         mytable
> where
>         day='2012-09-27'
> group by
>         something
>
> Hive history file=/tmp/keystone/hive_job_log_201209281712_1248082813.txt
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks not specified. Estimated from input data size: 1
> Starting Job = job_201209181846_3513, Tracking URL =
> http://myjobtracker/jobdetails.jsp?jobid=job_201209181846_3513
> Kill Command = /usr/lib/hadoop/bin/hadoop job
>  -Dmapred.job.tracker=hdfs://myjobtracker -kill job_201209181846_3513
> 2012-09-28 17:12:21,714 Stage-1 map = 0%,  reduce = 0%
> 2012-09-28 17:12:25,785 Stage-1 map = 100%,  reduce = 0%
> 2012-09-28 17:12:34,026 Stage-1 map = 100%,  reduce = 17%
> 2012-09-28 17:12:37,087 Stage-1 map = 100%,  reduce = 76%
> 2012-09-28 17:12:40,153 Stage-1 map = 100%,  reduce = 98%
> 2012-09-28 17:12:42,317 Stage-1 map = 100%,  reduce = 100%
> Ended Job = job_201209181846_3513
> Job Commit failed with exception
> 'org.apache.hadoop.hive.ql.metadata.HiveException(java.io.FileNotFoundException:
> Key
> 'tmp/hive-keystone/hive_2012-09-28_17-12-06_556_7326038366021318453/_tmp.-ext-10000/000000_0.gz'
> does not exist in S3)'
>
> This happens very rarely, but when it does, the job just fails without
> retrying, and nothing is uploaded to S3. If I rerun the exact same query
> afterwards, it works fine most of the time.
> It also doesn't seem to be related to the amount of data being uploaded:
> I've seen it happen on very small queries like the one above, and sometimes
> on ones with a large amount of data.
> It doesn't seem to be related to gzip compression either;
> I've seen it happen with and without compression.
> From what I can see this seems to be related to S3 specifically, but I'm
> not sure why, as it seems pretty random.
>
> If I look in the jobtracker, the job looks fine and is marked as
> successful, so this happens after the job has completed; I don't see any
> error in the logs other than the one above.
>
> Is there anything I could do to avoid this rare problem?
>
> Thanks,
> Charles
>



-- 


Florin