Posted to user@pig.apache.org by Matthew Smith <Ma...@g2-inc.com> on 2010/08/19 20:35:57 UTC

ORDER Issue (repost to avoid spam filters)

All,

 

I am running pig-0.7.0 and I have been running into an issue with the
ORDER command. I have tried running Pig out of the box on two separate
Linux OSes (Ubuntu 10.04 and openSUSE 11.2), and the same issue occurs
on both. I run these commands in a script file:

 

start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
    dip:chararray, sport:int, dport:int, protocol:int, packets:int,
    bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
fail = ORDER target BY bytes DESC;
not_reached = LIMIT fail 10;
dump not_reached;

 

 

The error is listed below. I then run:

 

 

start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
    dip:chararray, sport:int, dport:int, protocol:int, packets:int,
    bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
dump target;

 

 

This script produces a large list of sips matching the filter. What am
I doing wrong that keeps Pig from ORDERing these records properly? I
have been wrestling with this issue for a week now. Any help would be
greatly appreciated.

 

 

 

Best,

 

Matthew

 

ERROR:

java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/user/matt/pigsample_24118161_1282155871461
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:527)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/user/matt/pigsample_24118161_1282155871461
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(PigFileInputFormat.java:37)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
        at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:153)
        at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:108)
        ... 6 more


Re: ORDER Issue (repost to avoid spam filters)

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Are you using pig local mode?
If yes, does it work when run against Hadoop (mapreduce mode) instead?
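
For example, to try the same script both ways (the script file name
here is just a placeholder):

    pig -x local order_test.pig
    pig -x mapreduce order_test.pig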

Regards,
Mridul



Re: ORDER Issue (repost to avoid spam filters)

Posted by Thejas M Nair <te...@yahoo-inc.com>.
Can you check if the initial MR jobs in the order-by query failed because of
some other error? (Specifically the sampling MR job that is part of
order-by.) Maybe, for some reason (a bug?), pig did not capture/log that error.
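
For background, an order-by compiles into more than one MR job: a sampling
job writes the pigsample_* file that the error reports as missing, and the
sort job's partitioner then reads it. One way to see that plan (reusing the
aliases from the earlier script) is EXPLAIN in grunt:

    grunt> fail = ORDER target BY bytes DESC;
    grunt> explain fail;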
-Thejas






RE: ORDER Issue (repost to avoid spam filters)

Posted by Matthew Smith <Ma...@g2-inc.com>.
Update:
After downloading and installing pig-0.6.0, I ran the script again over
the same data set. It produced the desired results. I don't know what I
am doing wrong in 0.7.0, but I will be reverting to 0.6.0 until I can
sort out what went wrong there. Thoughts are still welcome and wanted
:D

Thanks,
Matt




RE: ORDER Issue (repost to avoid spam filters)

Posted by Matthew Smith <Ma...@g2-inc.com>.
Changed the script to:
start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
dip:chararray, sport:int, dport:int, protocol:int, packets:int,
bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
not_null_bytes = FILTER target BY bytes is not null;
dump not_null_bytes;

and dumped the expected tuples. There were plenty of valid records. I
will revert everything to pig-0.6.0 and re-run the scripts to determine
whether the issue is in pig-0.7.0.

Matt




Re: ORDER Issue (repost to avoid spam filters)

Posted by Thejas M Nair <te...@yahoo-inc.com>.
I was wondering if the bytes column has all null values (probably
because the input has formatting issues).

Can you check if the following query gives any output:

start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
dip:chararray, sport:int, dport:int, protocol:int, packets:int,
bytes:int, flags:chararray, startTime:long, endTime:long);

target = FILTER start BY sip matches '51.37.8.63';

non_null_bytes = FILTER target by bytes is not null;

dump non_null_bytes;

-Thejas





RE: ORDER Issue (repost to avoid spam filters)

Posted by Matthew Smith <Ma...@g2-inc.com>.
UPDATE: I attempted my code in the Amazon cloud (aws.amazon.com) and the
script worked as intended over the data set. This leads me to believe
that the issue is with pig-0.7.0 or my configuration. I would, however,
like to not pay for something that is free :D. Any other ideas would be
most welcome.

 

@Thejas

I changed the Script to:

start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
dip:chararray, sport:int, dport:int, protocol:int, packets:int,
bytes:int, flags:chararray, startTime:long, endTime:long);

target = FILTER start BY sip matches '51.37.8.63';

just_bytes= FOREACH target GENERATE bytes;

fail = ORDER just_bytes BY bytes DESC;

not_reached = LIMIT fail 10;

dump not_reached;

 

and received the same error as before. I then changed the script to:

 

start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
dip:chararray, sport:int, dport:int, protocol:int, packets:int,
bytes:int, flags:chararray, startTime:long, endTime:long);

target = FILTER start BY sip matches '51.37.8.63';

stored = STORE target INTO 'myoutput';

second_start = LOAD 'myoutput/part-m-00000' USING PigStorage('\t') AS
(sip:chararray, dip:chararray, sport:int, dport:int, protocol:int,
packets:int, bytes:int, flags:chararray, startTime:long, endTime:long);

fail = ORDER second_start BY bytes DESC;

not_reached = LIMIT fail 10;

dump not_reached;

 

and received the same error.

 

@Mridul

I am using local mode at the moment. I don't understand the second
question.

 

Thanks,

Matt

 

 

 

 


Re: ORDER Issue (repost to avoid spam filters)

Posted by Thejas M Nair <te...@yahoo-inc.com>.
I think 0.7 had an issue where order-by used to fail if the input was empty. But that does not seem to be the case here.
I am wondering if there is a parsing/data-format issue that is causing the bytes column to be empty, though I am not aware of an empty/null value in the sort column causing issues.
Can you try dumping just the bytes column?
Another thing you can try is to store the output of the filter and load the data again before doing the order-by.
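
A minimal sketch of both ideas, reusing the 'target' alias from your
script (the output path and new alias names below are just placeholders):

    just_bytes = FOREACH target GENERATE bytes;
    dump just_bytes;

    -- or: store the filtered relation, reload it, then sort
    STORE target INTO 'filtered_out';
    reloaded = LOAD 'filtered_out' USING PigStorage('\t') AS (sip:chararray,
        dip:chararray, sport:int, dport:int, protocol:int, packets:int,
        bytes:int, flags:chararray, startTime:long, endTime:long);
    fail = ORDER reloaded BY bytes DESC;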

Please let us know what you find.

Thanks,
Thejas



