Posted to user@pig.apache.org by Meghana Narasimhan <Me...@demandmedia.com> on 2012/10/12 22:50:07 UTC

PigStorageSchema and S3 bug

Hello,

We are using PigStorageSchema to store our results on S3, with HDFS still as the default file system, and we are running into issues writing the schema file out to S3.

We are just loading a CSV file using PigStorage, running it through some basic Pig operations, and then storing the result on S3 using PigStorageSchema. We are on Hadoop 2.0.0-cdh4.1.0 and Apache Pig version 0.10.0-cdh4.1.0.


{code}
A = LOAD 'input' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS A1, $1 AS A2, $2 AS A3;
C = LIMIT B 3;
STORE C INTO 's3n://XXX:XXX@bucket/outPigStorageSchema1' USING org.apache.pig.piggybank.storage.PigStorageSchema();
{code}

Pig logs:

{code}
2012-10-11 21:00:56,193 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2012-10-11 21:00:56,209 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for A: $3, $4, $5, $6
2012-10-11 21:00:56,250 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/Meg/outPigStorageSchema1' - Unexpected response code 404, expected 200
2012-10-11 21:00:57,174 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/Meg/outPigStorageSchema1_%24folder%24' - Unexpected response code 404, expected 200
2012-10-11 21:00:57,212 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-10-11 21:00:57,218 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-10-11 21:00:57,218 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-10-11 21:00:57,221 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-10-11 21:00:57,221 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-10-11 21:00:57,222 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job7469072732967367765.jar
2012-10-11 21:01:02,810 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job7469072732967367765.jar created
2012-10-11 21:01:02,815 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-10-11 21:01:02,830 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-10-11 21:01:02,884 [Thread-64] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2012-10-11 21:01:03,256 [Thread-64] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/Meg/outPigStorageSchema1' - Unexpected response code 404, expected 200
2012-10-11 21:01:03,332 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-10-11 21:01:03,502 [Thread-64] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/Meg/outPigStorageSchema1_%24folder%24' - Unexpected response code 404, expected 200
2012-10-11 21:01:03,563 [Thread-64] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-10-11 21:01:03,563 [Thread-64] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total inputpaths to process : 1
2012-10-11 21:01:03,565 [Thread-64] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-10-11 21:01:04,488 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201210052302_0065
2012-10-11 21:01:04,489 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://ec2-184-72-197-101.compute-1.amazonaws.com:50030/jobdetails.jsp?jobid=job_201210052302_0065
2012-10-11 21:01:17,679 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-10-11 21:02:24,236 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201210052302_0065 has failed! Stop running all dependent jobs
2012-10-11 21:02:24,237 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-10-11 21:02:24,244 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2012-10-11 21:02:24,244 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion    PigVersion       UserId  StartedAt            FinishedAt           Features
2.0.0-cdh4.0.1   0.10.0-cdh4.1.0  whirr   2012-10-11 21:00:57  2012-10-11 21:02:24  LIMIT

Failed!

Failed Jobs:
JobId                  Alias  Feature  Message               Outputs
job_201210052302_0065  A,B,C           Message: Job failed!  s3n://XXX:XXX@Meg/outPigStorageSchema1,

Input(s):
Failed to read data from "hdfs://ec2-184-72-197-101.compute-1.amazonaws.com/user/whirr/incite/site_clicks_spend_by_hour/2012/07/10/20121008-205839320852/part-r-00000"

Output(s):
Failed to produce result in "s3n://XXX:XXX@Meg/outPigStorageSchema1"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201210052302_0065


2012-10-11 21:02:24,244 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2012-10-11 21:02:24,321 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/Meg/outPigStorageSchema1' - Unexpected response code 404, expected 200
2012-10-11 21:02:24,804 [main] WARN  org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/Meg/outPigStorageSchema1' - Unexpected response code 404, expected 200

{code}

And the Hadoop log files:

{code}

java.lang.IllegalArgumentException: Wrong FS: s3n://XXX:XXXX@bucket/outPigStorageSchema1/.pig_schema, expected: hdfs://xxxxx.compute-1.amazonaws.com
            at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:547)
            at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:169)
            at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:783)
            at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1257)
            at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:203)
            at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:131)
            at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:147)
            at org.apache.pig.piggybank.storage.JsonMetadata.storeSchema(JsonMetadata.java:249)
            at org.apache.pig.piggybank.storage.PigStorageSchema.storeSchema(PigStorageSchema.java:152)
            at org.apache.pig.backend.hadoop.ex
{code}
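
The "Wrong FS" exception above suggests that the .pig_schema path is being resolved against the cluster's default (HDFS) FileSystem instead of the FileSystem that matches the s3n:// output URI. Below is a minimal Java sketch of that difference. It is not the actual Piggybank code; the class name is made up, the bucket and credentials are placeholders, and resolving an s3n:// path requires the S3 connector and credentials to be configured.

{code}
// Minimal sketch (not the actual Piggybank code) of why the "Wrong FS"
// error above appears: a path with an s3n:// scheme is handed to the
// cluster's *default* FileSystem (HDFS here), whose checkPath() rejects it.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WrongFsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder output path; credentials and bucket are masked as in the script above.
        Path schemaPath = new Path("s3n://XXX:XXX@bucket/outPigStorageSchema1/.pig_schema");

        // Problematic pattern: on a cluster whose default FS is hdfs://..., this
        // exists() call fails with "Wrong FS: s3n://..., expected: hdfs://...".
        FileSystem defaultFs = FileSystem.get(conf);
        // defaultFs.exists(schemaPath);   // reproduces the stack trace above

        // Safer pattern: resolve the FileSystem from the path itself, so the
        // s3n:// scheme (and its embedded credentials) are honored when the
        // .pig_schema file is written.
        FileSystem pathFs = schemaPath.getFileSystem(conf);
        System.out.println("FS resolved from path: " + pathFs.getUri());
    }
}
{code}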


This seems to be a pretty big bug in Piggybank. Has anyone else seen this issue before?

Thanks,
Meg


Re: PigStorageSchema and S3 bug

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi Meghana,

Are you sure that you're using Apache Pig version 0.10.0-cdh4.1.0? Because of
a change made to PigStorageSchema in Pig 0.10 (
https://issues.apache.org/jira/browse/PIG-2143), it should not be possible to
get your call stack:

            at org.apache.pig.piggybank.storage.PigStorageSchema.storeSchema(PigStorageSchema.java:152)

In Pig 0.10, PigStorageSchema.java is only 45 lines long. I double-checked
that Pig version 0.10.0-cdh4.1.0 includes PIG-2143. It looks like you're
using an older version of Pig, such as 0.9.2-cdh4.0.1.
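
To confirm which piggybank jar is actually being picked up, you could run a
small Java snippet like the following (this is just a hypothetical helper, not
part of Pig; it only assumes the piggybank jar is on the classpath):

{code}
// Hypothetical helper (not part of Pig) that prints which jar the
// PigStorageSchema class was actually loaded from.
import org.apache.pig.piggybank.storage.PigStorageSchema;

public class WhichPiggybankJar {
    public static void main(String[] args) {
        System.out.println(PigStorageSchema.class
                .getProtectionDomain().getCodeSource().getLocation());
    }
}
{code}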

Thanks,
Cheolsoo
