Posted to user@pig.apache.org by Charles Gonçalves <ch...@gmail.com> on 2011/02/27 04:25:43 UTC

Problem when executionengine.util.MapRedUtil combine input paths

I tried to process a large number of small files with Pig and ran into a
strange problem.

2011-02-27 00:00:58,746 [Thread-15] INFO
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
to process : *43458*
2011-02-27 00:00:58,755 [Thread-15] INFO
 org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths to process : *43458*
2011-02-27 00:01:14,173 [Thread-15] INFO
 org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths (combined) to process : *329*

When the script finishes, the result covers only a subset of the input
files.
These are logs from a whole month, but the results are only from day 21.


Maybe I'm missing something.
Any ideas?

-- 
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.:  55 31 34741485
Lab.: 55 31 34095840

Re: Problem when executionengine.util.MapRedUtil combine input paths

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
FWIW, something similar happened with the HBase loader in 0.8 -- only the
first of the combined splits was read in. (I worked around this by turning
off split combination in the loader's setLocation; see PIG-1680.)

D


Re: Problem when executionengine.util.MapRedUtil combine input paths

Posted by Charles Gonçalves <ch...@gmail.com>.
Ok ...

I'm sending both.
Versions:

Apache Pig version 0.8.0 (r1043805)
compiled Dec 08 2010, 17:26:09

Hadoop 0.20.2




Re: Problem when executionengine.util.MapRedUtil combine input paths

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Combined input splits should be able to handle compressed files. Pig
creates a separate RecordReader for each file within one input split, so
gzip concatenation should not be the cause. I am not sure what is
happening in your script. If possible, give us more information (script,
UDF, data, version).

Daniel



Re: Problem when executionengine.util.MapRedUtil combine input paths

Posted by Charles Gonçalves <ch...@gmail.com>.
Guys,

The amount of data in the source dir:
hdfs://hydra1:57810/user/cdh-hadoop/mscdata/201010_raw  22567369111

What I did was:
I ran with all 43458 logs, and the counters are:

Counter              Map             Reduce          Total
FILE_BYTES_READ      253,905,706     372,708,857     626,614,563
HDFS_BYTES_READ      2,553,123,734   0               2,553,123,734
FILE_BYTES_WRITTEN   619,877,917     372,708,857     992,586,774
HDFS_BYTES_WRITTEN   0               535             535


I did a manual merge of the files and ran again on the resulting 336 files
(the merge of all those files).
The job hasn't finished yet, and the counters are:

Counter              Map             Reduce          Total
FILE_BYTES_READ      21,054,970,818  0               21,054,970,818
HDFS_BYTES_READ      16,772,063,486  0               16,772,063,486
FILE_BYTES_WRITTEN   39,797,038,008  10,404,287,551  50,201,325,55


I think the problem could be in the combination of the input files.
Is the combining class aware of compression?
Because *all my files are compressed*.
Maybe the class performs a concatenation and we hit the HDFS limitation
on concatenated gzip files.
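For context on the gzip-concatenation hypothesis: a byte-for-byte concatenation of .gz files is itself a valid multi-member gzip stream, and command-line gzip decodes every member; the historical limitation is in readers that stop after the first member. A quick sketch (the sample contents and /tmp paths are arbitrary):

```shell
# Two gzip files concatenated form a valid multi-member gzip stream;
# gzip -dc decodes every member, not just the first one.
printf 'day-01\n' | gzip > /tmp/part1.gz
printf 'day-21\n' | gzip > /tmp/part2.gz
cat /tmp/part1.gz /tmp/part2.gz > /tmp/month.gz
gzip -dc /tmp/month.gz   # prints day-01 and day-21
```

A reader that decodes only the first member would print only day-01, which matches the "only one day survives" symptom.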


Re: Problem when executionengine.util.MapRedUtil combine input paths

Posted by Charles Gonçalves <ch...@gmail.com>.
On Mon, Feb 28, 2011 at 7:39 PM, Thejas M Nair <te...@yahoo-inc.com> wrote:

>  Hi Charles,
> Which load function are you using ?
>
I'm using a user-defined load function ..

> Is the default (PigStorage?).
>
Nope ...


>  In the hadoop counters for the job in the jobtracker ui, do you see the
> expected number of input records being read?
>
Is it possible to see the counters in the history interface of the
JobTracker?
I will run the jobs again to compare the counters, but my guess is
probably not!


Re: Problem when executionengine.util.MapRedUtil combine input paths

Posted by Thejas M Nair <te...@yahoo-inc.com>.
Hi Charles,
Which load function are you using? Is it the default (PigStorage)?
In the Hadoop counters for the job in the JobTracker UI, do you see the expected number of input records being read?
-Thejas




Re: Problem when executionengine.util.MapRedUtil combine input paths

Posted by Charles Gonçalves <ch...@gmail.com>.
I'm not using any filtering in the script.
I just want to see the total traffic per day across all logs.

If I combine 1000 log files into one and run the script on that combined
file, I get the correct answer for those logs.
But when I run with all *43458* log files, I get incorrect output.
The correct result would be a histogram for each day of 2010-10, but the
result contains only data from 2010-10-21.
And if I process all the logs with an awk script, I get the correct answer.
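The awk cross-check mentioned above can be sketched as follows; the log layout is an assumption (a date field followed by a byte count), since the thread never shows the actual format:

```shell
# Hypothetical per-day traffic histogram: sum field 2 grouped by the
# date in field 1. Sample data and /tmp path are made up for illustration.
printf '2010-10-01 100\n2010-10-21 250\n2010-10-21 50\n' > /tmp/sample.log
awk '{ total[$1] += $2 } END { for (d in total) print d, total[d] }' \
    /tmp/sample.log | sort
```

Because awk streams each file independently, it sidesteps any split-combination behavior and gives a reliable baseline to compare against.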


On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai <ji...@yahoo-inc.com> wrote:

> Not sure if I get your question. In 0.8, Pig combine small files into one
> map, so it is possible you get less output files.

This is not the problem.
But thanks anyway!

> If that is your concern, you can try to disable split combine using
> "-Dpig.splitCombination=false"
>
> Daniel

Re: Problem when executionengine.util.MapRedUtil combine input paths

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Not sure if I get your question. In 0.8, Pig combines small files into
one map, so it is possible you get fewer output files. If that is your
concern, you can try disabling split combination with
"-Dpig.splitCombination=false"

Daniel
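The property is passed on the Pig command line like any JVM-style `-D` flag. A minimal invocation (the script name here is a placeholder, not from the thread):

```shell
# Run a Pig script with split combination disabled, per Daniel's suggestion.
pig -Dpig.splitCombination=false traffic_histogram.pig
```

Note the flag must appear before the script name so it is picked up as a property rather than a script argument.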



Re: Problem when executionengine.util.MapRedUtil combine input paths

Posted by Romain Rigaux <ro...@gmail.com>.
Normally Pig 0.8 just combines the small files
(http://pig.apache.org/docs/r0.8.0/cookbook.html#Combine+Small+Input+Files)
into bigger ones; you should not lose any records.

You might be filtering out/limiting some records in your script. You can try
just a LOAD and STORE and see that the output is the same as the input data.

Romain
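Romain's pass-through check can be run straight from the shell with `pig -e`. A sketch, where the input path is taken from earlier in the thread and `MyLoader` stands in for the user-defined load function (both placeholders):

```shell
# Identity script: LOAD then STORE with no transformations. If records
# vanish here, the loss happens at load/split time, not in the script logic.
pig -e "raw = LOAD '/user/cdh-hadoop/mscdata/201010_raw' USING MyLoader(); STORE raw INTO '/tmp/passthrough_check';"
```

Comparing record counts between the input and `/tmp/passthrough_check` isolates whether split combination is dropping data before any script logic runs.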
