Posted to user@pig.apache.org by Panshul Whisper <ou...@gmail.com> on 2013/04/07 19:11:46 UTC

pig script - failed reading input from s3

Hello

I am trying to run a Pig script which is supposed to read input from S3
and write back to S3. The cluster setup is as follows:
* Cluster installed on EC2 using Cloudera Manager 4.5 automatic installation
* Installed version: CDH4
* Script location: on one of the cluster nodes
* Running as: $ pig countGroups_daily.pig

*The Pig Script*:
set fs.s3.awsAccessKeyId xxxxxxxxxxxxxxxxxx
set fs.s3.awsSecretAccessKey xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
--load the sample input file
data = load 's3://steamdata/nysedata/NYSE_daily.txt' as
(exchange:chararray, symbol:chararray, date:chararray, open:float,
high:float, low:float, close:float, volume:int, adj_close:float);
--group data by symbols
symbolgrp = group data by symbol;
--count data in every group
symcount = foreach symbolgrp generate group,COUNT(data);
--order the counted list by count
symcountordered = order symcount by $1;
store symcountordered into 's3://steamdata/nyseoutput/daily';

*Error:*

Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118:
Input path does not exist: s3://steamdata/nysedata/NYSE_daily.txt

Input(s):
Failed to read data from "s3://steamdata/nysedata/NYSE_daily.txt"

Please help me; what am I doing wrong? I can assure you that the input
path/file exists on S3 and that the AWS access key and secret key entered
are correct.
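
A quick way to double-check what the cluster itself sees at that path
(just a rough sketch, passing the same credentials on the command line):

hadoop fs -D fs.s3.awsAccessKeyId=xxxxxxxxxxxxxxxxxx \
          -D fs.s3.awsSecretAccessKey=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
          -ls s3://steamdata/nysedata/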

Thanking You,


-- 
Regards,
Ouch Whisper
010101010101

Re: pig script - failed reading input from s3

Posted by Panshul Whisper <ou...@gmail.com>.
Hello Vitalii,

The 5TB limit only applies if you are using the EMR framework to run your
jobs in a job flow. I don't think I can use that in my case, as I have a
CDH4 cluster on EC2. But thanks for the tip.
Reference:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html


On Tue, Apr 9, 2013 at 9:09 AM, Vitalii Tymchyshyn <ti...@gmail.com> wrote:

> Have you tried it with native? AFAIR the limitation was raised to 5TB a few
> years ago.
> On 8 Apr 2013 18:30, "Panshul Whisper" <ou...@gmail.com> wrote:
>
> > Thank you for the advice, David.
> >
> > I tried this and it works with the native filesystem. But my problem is not
> > solved yet, because I have to work with files much bigger than 5GB. My
> test
> > data file is 9GB. How do I make it read from s3://
> >
> > Thanking You,
> >
> > Regards,
> >
> >
> > On Mon, Apr 8, 2013 at 3:27 PM, David LaBarbera <
> > davidlabarbera@localresponse.com> wrote:
> >
> > > Try
> > > fs.s3n.aws...
> > >
> > > and also load from s3
> > > data = load 's3n://...'
> > >
> > > The "n" stands for native. I believe S3 also supports block device
> > storage
> > > (s3://) which allows bigger files to be stored. I don't know how (if at
> > > all) the two types interact.
> > >
> > > David
> > >
> > > On Apr 7, 2013, at 1:11 PM, Panshul Whisper <ou...@gmail.com>
> > wrote:
> > >
> > > > Hello
> > > >
> > > > I am trying to run a Pig script which is supposed to read input from
> > s3
> > > > and write back to s3. The cluster
> > > > scenario is as follows:
> > > > * Cluster is installed on EC2 using Cloudera Manager 4.5 Automatic
> > > > Installation
> > > > * Installed version: CDH4
> > > > * Script location on - one of the nodes of cluster
> > > > * running as : $ pig countGroups_daily.pig
> > > >
> > > > *The Pig Script*:
> > > > set fs.s3.awsAccessKeyId xxxxxxxxxxxxxxxxxx
> > > > set fs.s3.awsSecretAccessKey xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > > > --load the sample input file
> > > > data = load 's3://steamdata/nysedata/NYSE_daily.txt' as
> > > > (exchange:chararray, symbol:chararray, date:chararray, open:float,
> > > > high:float, low:float, close:float, volume:int, adj_close:float);
> > > > --group data by symbols
> > > > symbolgrp = group data by symbol;
> > > > --count data in every group
> > > > symcount = foreach symbolgrp generate group,COUNT(data);
> > > > --order the counted list by count
> > > > symcountordered = order symcount by $1;
> > > > store symcountordered into 's3://steamdata/nyseoutput/daily';
> > > >
> > > > *Error:*
> > > >
> > > > Message: org.apache.pig.backend.executionengine.ExecException: ERROR
> > > 2118:
> > > > Input path does not exist: s3://steamdata/nysedata/NYSE_daily.txt
> > > >
> > > > Input(s):
> > > > Failed to read data from "s3://steamdata/nysedata/NYSE_daily.txt"
> > > >
> > > > Please help me; what am I doing wrong? I can assure you that the
> input
> > > > path/file exists on s3 and the AWS key and secret key entered are
> > > correct.
> > > >
> > > > Thanking You,
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Ouch Whisper
> > > > 010101010101
> > >
> > >
> >
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
> >
>



-- 
Regards,
Ouch Whisper
010101010101

Re: pig script - failed reading input from s3

Posted by Vitalii Tymchyshyn <ti...@gmail.com>.
Have you tried it with native? AFAIR the limitation was raised to 5TB a few
years ago.
On 8 Apr 2013 18:30, "Panshul Whisper" <ou...@gmail.com> wrote:

> Thank you for the advice, David.
>
> I tried this and it works with the native filesystem. But my problem is not
> solved yet, because I have to work with files much bigger than 5GB. My test
> data file is 9GB. How do I make it read from s3://
>
> Thanking You,
>
> Regards,
>
>
> On Mon, Apr 8, 2013 at 3:27 PM, David LaBarbera <
> davidlabarbera@localresponse.com> wrote:
>
> > Try
> > fs.s3n.aws...
> >
> > and also load from s3
> > data = load 's3n://...'
> >
> > The "n" stands for native. I believe S3 also supports block device
> storage
> > (s3://) which allows bigger files to be stored. I don't know how (if at
> > all) the two types interact.
> >
> > David
> >
> > On Apr 7, 2013, at 1:11 PM, Panshul Whisper <ou...@gmail.com>
> wrote:
> >
> > > Hello
> > >
> > > I am trying to run a Pig script which is supposed to read input from
> s3
> > > and write back to s3. The cluster
> > > scenario is as follows:
> > > * Cluster is installed on EC2 using Cloudera Manager 4.5 Automatic
> > > Installation
> > > * Installed version: CDH4
> > > * Script location on - one of the nodes of cluster
> > > * running as : $ pig countGroups_daily.pig
> > >
> > > *The Pig Script*:
> > > set fs.s3.awsAccessKeyId xxxxxxxxxxxxxxxxxx
> > > set fs.s3.awsSecretAccessKey xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > > --load the sample input file
> > > data = load 's3://steamdata/nysedata/NYSE_daily.txt' as
> > > (exchange:chararray, symbol:chararray, date:chararray, open:float,
> > > high:float, low:float, close:float, volume:int, adj_close:float);
> > > --group data by symbols
> > > symbolgrp = group data by symbol;
> > > --count data in every group
> > > symcount = foreach symbolgrp generate group,COUNT(data);
> > > --order the counted list by count
> > > symcountordered = order symcount by $1;
> > > store symcountordered into 's3://steamdata/nyseoutput/daily';
> > >
> > > *Error:*
> > >
> > > Message: org.apache.pig.backend.executionengine.ExecException: ERROR
> > 2118:
> > > Input path does not exist: s3://steamdata/nysedata/NYSE_daily.txt
> > >
> > > Input(s):
> > > Failed to read data from "s3://steamdata/nysedata/NYSE_daily.txt"
> > >
> > > Please help me; what am I doing wrong? I can assure you that the input
> > > path/file exists on s3 and the AWS key and secret key entered are
> > correct.
> > >
> > > Thanking You,
> > >
> > >
> > > --
> > > Regards,
> > > Ouch Whisper
> > > 010101010101
> >
> >
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>

Re: pig script - failed reading input from s3

Posted by Panshul Whisper <ou...@gmail.com>.
Thank you for the advice, David.

I tried this and it works with the native filesystem. But my problem is not
solved yet, because I have to work with files much bigger than 5GB. My test
data file is 9GB. How do I make it read from s3://?

Thanking You,

Regards,


On Mon, Apr 8, 2013 at 3:27 PM, David LaBarbera <
davidlabarbera@localresponse.com> wrote:

> Try
> fs.s3n.aws…
>
> and also load from s3
> data = load 's3n://...'
>
> The "n" stands for native. I believe S3 also supports block device storage
> (s3://) which allows bigger files to be stored. I don't know how (if at
> all) the two types interact.
>
> David
>
> On Apr 7, 2013, at 1:11 PM, Panshul Whisper <ou...@gmail.com> wrote:
>
> > Hello
> >
> > I am trying to run a Pig script which is supposed to read input from s3
> > and write back to s3. The cluster
> > scenario is as follows:
> > * Cluster is installed on EC2 using Cloudera Manager 4.5 Automatic
> > Installation
> > * Installed version: CDH4
> > * Script location on - one of the nodes of cluster
> > * running as : $ pig countGroups_daily.pig
> >
> > *The Pig Script*:
> > set fs.s3.awsAccessKeyId xxxxxxxxxxxxxxxxxx
> > set fs.s3.awsSecretAccessKey xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > --load the sample input file
> > data = load 's3://steamdata/nysedata/NYSE_daily.txt' as
> > (exchange:chararray, symbol:chararray, date:chararray, open:float,
> > high:float, low:float, close:float, volume:int, adj_close:float);
> > --group data by symbols
> > symbolgrp = group data by symbol;
> > --count data in every group
> > symcount = foreach symbolgrp generate group,COUNT(data);
> > --order the counted list by count
> > symcountordered = order symcount by $1;
> > store symcountordered into 's3://steamdata/nyseoutput/daily';
> >
> > *Error:*
> >
> > Message: org.apache.pig.backend.executionengine.ExecException: ERROR
> 2118:
> > Input path does not exist: s3://steamdata/nysedata/NYSE_daily.txt
> >
> > Input(s):
> > Failed to read data from "s3://steamdata/nysedata/NYSE_daily.txt"
> >
> > Please help me; what am I doing wrong? I can assure you that the input
> > path/file exists on s3 and the AWS key and secret key entered are
> correct.
> >
> > Thanking You,
> >
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
>
>


-- 
Regards,
Ouch Whisper
010101010101

Re: pig script - failed reading input from s3

Posted by David LaBarbera <da...@localresponse.com>.
Try 
fs.s3n.aws…

and also load from s3 
data = load 's3n://...' 

The "n" stands for native. I believe S3 also supports block device storage (s3://) which allows bigger files to be stored. I don't know how (if at all) the two types interact.

David

On Apr 7, 2013, at 1:11 PM, Panshul Whisper <ou...@gmail.com> wrote:

> Hello
> 
> I am trying to run a Pig script which is supposed to read input from s3
> and write back to s3. The cluster
> scenario is as follows:
> * Cluster is installed on EC2 using Cloudera Manager 4.5 Automatic
> Installation
> * Installed version: CDH4
> * Script location on - one of the nodes of cluster
> * running as : $ pig countGroups_daily.pig
> 
> *The Pig Script*:
> set fs.s3.awsAccessKeyId xxxxxxxxxxxxxxxxxx
> set fs.s3.awsSecretAccessKey xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> --load the sample input file
> data = load 's3://steamdata/nysedata/NYSE_daily.txt' as
> (exchange:chararray, symbol:chararray, date:chararray, open:float,
> high:float, low:float, close:float, volume:int, adj_close:float);
> --group data by symbols
> symbolgrp = group data by symbol;
> --count data in every group
> symcount = foreach symbolgrp generate group,COUNT(data);
> --order the counted list by count
> symcountordered = order symcount by $1;
> store symcountordered into 's3://steamdata/nyseoutput/daily';
> 
> *Error:*
> 
> Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118:
> Input path does not exist: s3://steamdata/nysedata/NYSE_daily.txt
> 
> Input(s):
> Failed to read data from "s3://steamdata/nysedata/NYSE_daily.txt"
> 
> Please help me; what am I doing wrong? I can assure you that the input
> path/file exists on s3 and the AWS key and secret key entered are correct.
> 
> Thanking You,
> 
> 
> -- 
> Regards,
> Ouch Whisper
> 010101010101