Posted to common-user@hadoop.apache.org by Matthias Scherer <ma...@1und1.de> on 2013/04/18 21:34:54 UTC

How to process only input files containing 100% valid rows

Hi all,

In my mapreduce job, I would like to process only whole input files containing only valid rows. If one map task processing an input split of a file detects an invalid row, the whole file should be "marked" as invalid and not processed at all. This input file will then be cleansed by another process, and taken again as input to the next run of my mapreduce job.

My first idea was to set a counter in the mapper after detecting an invalid line, using the name of the file (derived from the input split) as the counter name. Additionally, I would put the input filename into the map output value (which is already a MapWritable, so adding the filename is no problem). In the reducer I could then filter out any rows belonging to the files for which the mappers wrote counters.

Each job has a few thousand input files, so in the worst case there could be as many counters written to mark invalid input files. Is this a feasible approach? Does the framework guarantee that all counters written in the mappers are synchronized (visible) in the reducers? And could this number of counters lead to an OOME in the JobTracker?

Are there better approaches? I could also process the files using a non-splittable input format. Is there a way to reject the rows already emitted by the map task processing an input split?
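
Roughly, the mapper side of this idea could look like the following sketch (the tab-separated schema, the field count and the counter group name are only illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ValidatingMapper extends Mapper<LongWritable, Text, Text, MapWritable> {

  private static final Text SOURCE_FILE = new Text("sourceFile"); // illustrative key name
  private static final int EXPECTED_FIELDS = 5;                   // illustrative schema

  private String fileName;

  @Override
  protected void setup(Context context) {
    // Derive the name of the file this split belongs to.
    fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t", -1);
    if (fields.length != EXPECTED_FIELDS) {
      // One dynamic counter per invalid file, named after the file.
      context.getCounter("InvalidFiles", fileName).increment(1);
      return;
    }
    MapWritable value = new MapWritable();
    value.put(SOURCE_FILE, new Text(fileName));
    for (int i = 1; i < fields.length; i++) {
      value.put(new Text("field" + i), new Text(fields[i]));
    }
    context.write(new Text(fields[0]), value);
  }
}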

Thanks,
Matthias


Re: How to process only input files containing 100% valid rows

Posted by Niels Basjes <Ni...@basjes.nl>.
How about a different approach:
If you use the multiple output option you can process the valid lines in a
normal way and put the invalid lines in a special separate output file.
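Sketched roughly (the "invalid" named output, the validation rule and the driver call are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplittingMapper extends Mapper<LongWritable, Text, Text, Text> {

  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t", -1);
    if (fields.length == 5) {                         // illustrative validation rule
      context.write(new Text(fields[0]), line);       // valid lines: normal output
    } else {
      mos.write("invalid", NullWritable.get(), line); // invalid lines: separate file
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}

// In the driver the named output has to be declared, e.g.:
//   MultipleOutputs.addNamedOutput(job, "invalid",
//       TextOutputFormat.class, NullWritable.class, Text.class);
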
On Apr 18, 2013 9:36 PM, "Matthias Scherer" <ma...@1und1.de>
wrote:

> Hi all,
>
> In my mapreduce job, I would like to process only whole input files
> containing only valid rows. If one map task processing an input split of a
> file detects an invalid row, the whole file should be "marked" as invalid
> and not processed at all. This input file will then be cleansed by another
> process, and taken again as input to the next run of my mapreduce job.
>
> My first idea was to set a counter in the mapper after detecting an
> invalid line with the name of the file as the counter name (derived from
> input split). Then additionally put the input filename to the map output
> value (which is already a MapWritable, so adding the filename is no
> problem). And in the reducer I could filter out any rows belonging to the
> counters written in the mapper.
>
> Each job has some thousand input files. So in the worst case there could
> be as many counters written to mark invalid input files. Is this a feasible
> approach? Does the framework guarantee that all counters written in the
> mappers are synchronized (visible) in the reducers? And could this number
> of counters lead to OOME in the jobtracker?
>
> Are there better approaches? I could also process the files using a non
> splitable input format. Is there a way to reject the already outputted rows
> of the map task processing an input split?
>
> Thanks,
>
> Matthias

Re: How to process only input files containing 100% valid rows

Posted by Wellington Chevreuil <we...@gmail.com>.
How about using a combiner to mark all rows from a dirty file as dirty, for
instance by putting a "dirty" flag into the key? Then in the reducer you
can simply ignore these rows and/or output the bad file name.

It will still have to pass through the whole file, but it at least avoids
the case where you could end up with too many counters...

Regards.


2013/4/19 Matthias Scherer <ma...@1und1.de>

> I have to add that we have 1-2 Billion of Events per day, split to some
> thousands of files. So pre-reading each file in the InputFormat should be
> avoided.
>
> And yes, we could use MultipleOutputs and write bad files to process each
> input file. But we (our Operations team) think that there is more / better
> control if we reject whole files containing bad records.
>
> Regards
>
> Matthias

Re: AW: How to process only input files containing 100% valid rows

Posted by Nitin Pawar <ni...@gmail.com>.
Reject the entire file even if a single record is invalid? There has to be
a really serious reason to take this approach.
In any case, to check that a file contains only valid lines you are already
opening the files and parsing them. Why not then parse and separate the
incorrect lines, as suggested in the previous mails?
That way you will also get a count of the invalid records, and you will not
miss the valid records just because of a small number of invalid records in a file.
On Apr 19, 2013 3:23 PM, "Matthias Scherer" <ma...@1und1.de>
wrote:

> I have to add that we have 1-2 Billion of Events per day, split to some
> thousands of files. So pre-reading each file in the InputFormat should be
> avoided.
>
> And yes, we could use MultipleOutputs and write bad files to process each
> input file. But we (our Operations team) think that there is more / better
> control if we reject whole files containing bad records.
>
> Regards
>
> Matthias

Re: AW: How to process only input files containing 100% valid rows

Posted by MARCOS MEDRADO RUBINELLI <ma...@buscapecompany.com>.
Matthias,

As far as I know, there are no guarantees on when counters will be updated during the job. One thing you can do is write a metadata file along with your parsed events, listing which files have errors and should be ignored in the next step of your ETL workflow.

If you really don't want to have "dirty" records mixed in, you can accomplish it using secondary sort (a rough sketch follows after this list). In a nutshell:
- create a composite key from the filename and an enum BROKEN = 0, CLEAN = 1
- create a sorting comparator that ensures BROKEN comes before CLEAN
- create a grouping comparator and a partitioner on the filename only, to ensure both BROKEN and CLEAN are processed by the same reducer
- if you find a broken line, send it with a BROKEN key
- in the reducer, if you get a BROKEN key, write that filename somewhere so you know you will have to scrub and re-submit it, and ignore both the BROKEN and the CLEAN records
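
Sketched compactly below; to keep it short the flag is encoded into a Text key as "<filename>\u0001<0|1>" instead of a dedicated WritableComparable, and all names are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// The mapper emits new Text(fileName + "\u0001" + flag) as key, where flag is
// "0" for a BROKEN line and "1" for a CLEAN one; the natural Text ordering then
// already sorts BROKEN before CLEAN within each file.
public class FileValiditySecondarySort {

  static String fileOf(Text key) {
    String s = key.toString();
    return s.substring(0, s.indexOf('\u0001'));
  }

  static boolean isBroken(Text key) {
    return key.toString().endsWith("\u0001" + "0");
  }

  // Partition on the filename part only, so all records of one file meet in one reducer.
  public static class FilePartitioner extends Partitioner<Text, MapWritable> {
    @Override
    public int getPartition(Text key, MapWritable value, int numPartitions) {
      return (fileOf(key).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group on the filename part only, so BROKEN and CLEAN keys share one reduce() call.
  public static class FileGroupingComparator extends WritableComparator {
    public FileGroupingComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return fileOf((Text) a).compareTo(fileOf((Text) b));
    }
  }

  // Because BROKEN sorts first, the key seen at the start of the group already
  // tells us whether the whole file has to be rejected.
  public static class FileValidityReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text key, Iterable<MapWritable> values, Context context)
        throws IOException, InterruptedException {
      if (isBroken(key)) {
        // Record the filename somewhere (side file, MultipleOutputs, ...) for
        // scrubbing and re-submission, and skip every record of this file.
        return;
      }
      Text fileName = new Text(fileOf(key));
      for (MapWritable value : values) {
        context.write(fileName, value);
      }
    }
  }
}

In the driver you would wire these in with job.setPartitionerClass(FilePartitioner.class) and job.setGroupingComparatorClass(FileGroupingComparator.class); with this key encoding no separate sort comparator is needed, because "0" already sorts before "1".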

Regards,
Marcos

On 19-04-2013 06:39, Matthias Scherer wrote:
I have to add that we have 1-2 Billion of Events per day, split to some thousands of files. So pre-reading each file in the InputFormat should be avoided.

And yes, we could use MultipleOutputs and write bad files to process each input file. But we (our Operations team) think that there is more / better control if we reject whole files containing bad records.

Regards
Matthias


AW: How to process only input files containing 100% valid rows

Posted by Matthias Scherer <ma...@1und1.de>.
I have to add that we have 1-2 billion events per day, split across a few thousand files. So pre-reading each file in the InputFormat should be avoided.

And yes, we could use MultipleOutputs and write out "bad" files while processing each input file. But we (our Operations team) think that there is more / better control if we reject whole files containing bad records.

Regards
Matthias

Re: How to process only input files containing 100% valid rows

Posted by Steve Lewis <lo...@gmail.com>.
With files that small it is much better to write a custom input format
which checks the entire file and only passes records from good files. If
you need Hadoop you are probably processing a large number of these files,
and an input format could easily read the entire file and handle it if it
is as short as a few thousand lines.
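
A rough sketch of such an input format; it buffers the whole file in memory (fine for files of a few thousand lines) and uses an illustrative tab-separated validation rule:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Whole-file input format: each file is one split; the reader validates the
// entire file up front and emits no records at all if any line is invalid.
public class ValidFileInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;          // one map task per file
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new ValidFileRecordReader();
  }

  static class ValidFileRecordReader extends RecordReader<LongWritable, Text> {
    private List<String> lines = new ArrayList<String>();
    private int pos = -1;
    private LongWritable key = new LongWritable();
    private Text value = new Text();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
      FileSplit split = (FileSplit) genericSplit;
      FileSystem fs = split.getPath().getFileSystem(context.getConfiguration());
      FSDataInputStream in = fs.open(split.getPath());
      BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          if (!isValid(line)) {      // illustrative validation rule
            lines.clear();           // reject the whole file: emit nothing
            // A real implementation would also record the bad filename somewhere.
            return;
          }
          lines.add(line);
        }
      } finally {
        reader.close();
      }
    }

    private boolean isValid(String line) {
      return line.split("\t", -1).length == 5;   // illustrative
    }

    @Override
    public boolean nextKeyValue() {
      if (++pos >= lines.size()) return false;
      key.set(pos);
      value.set(lines.get(pos));
      return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return lines.isEmpty() ? 1f : (float) (pos + 1) / lines.size(); }
    @Override public void close() { }
  }
}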


On Thu, Apr 18, 2013 at 12:34 PM, Matthias Scherer <
matthias.scherer@1und1.de> wrote:

> Hi all,
>
> In my mapreduce job, I would like to process only whole input files
> containing only valid rows. If one map task processing an input split of a
> file detects an invalid row, the whole file should be "marked" as invalid
> and not processed at all. This input file will then be cleansed by another
> process, and taken again as input to the next run of my mapreduce job.
>
> My first idea was to set a counter in the mapper after detecting an
> invalid line with the name of the file as the counter name (derived from
> input split). Then additionally put the input filename to the map output
> value (which is already a MapWritable, so adding the filename is no
> problem). And in the reducer I could filter out any rows belonging to the
> counters written in the mapper.
>
> Each job has some thousand input files. So in the worst case there could
> be as many counters written to mark invalid input files. Is this a feasible
> approach? Does the framework guarantee that all counters written in the
> mappers are synchronized (visible) in the reducers? And could this number
> of counters lead to OOME in the jobtracker?
>
> Are there better approaches? I could also process the files using a non
> splitable input format. Is there a way to reject the already outputted rows
> of the map task processing an input split?
>
> Thanks,
>
> Matthias



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
