Posted to common-dev@hadoop.apache.org by Niels Basjes <Ni...@basjes.nl> on 2014/06/01 07:53:46 UTC

Re: Change proposal for FileInputFormat isSplitable

The Hadoop framework uses the filename extension to automatically insert
the "right" decompression codec in the read pipeline.
So if someone does what you describe then they would need to unload all
compression codecs or face decompression errors. And if it really was
gzipped then it would not be splittable at all.
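For readers following along: the codec selection described here is essentially a suffix-to-codec lookup. In Hadoop this resolution is done by CompressionCodecFactory; the self-contained sketch below uses a hard-coded registry and plain strings as illustrative stand-ins, not the actual Hadoop API:

```java
import java.util.Map;

// Simplified, self-contained sketch of suffix-based codec selection.
// Real Hadoop resolves this via CompressionCodecFactory.getCodec(Path);
// the registry and codec names below are illustrative stand-ins.
public class CodecLookup {
    static final Map<String, String> SUFFIX_TO_CODEC = Map.of(
            ".gz",  "GzipCodec",   // decompresses, but cannot be split
            ".bz2", "BZip2Codec"); // decompresses and can be split

    // Returns the codec name registered for the file's suffix,
    // or null when the file is treated as plain (splittable) input.
    static String codecFor(String filename) {
        for (Map.Entry<String, String> e : SUFFIX_TO_CODEC.entrySet()) {
            if (filename.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("part-00000.gz"));
        System.out.println(codecFor("part-00000.txt"));
    }
}
```

This is exactly why a misleading suffix is dangerous: the lookup keys on the name alone, never on the file's actual contents.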

Niels
On May 31, 2014 11:12 PM, "Chris Douglas" <cd...@apache.org> wrote:

> On Fri, May 30, 2014 at 11:05 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> > How would someone create the situation you are referring to?
>
> By adopting a naming convention where the filename suffix doesn't
> imply that the raw data are compressed with that codec.
>
> For example, if a user named SequenceFiles foo.lzo and foo.gz to
> record which codec was used, then isSplittable would spuriously return
> false. -C
>
> > On May 31, 2014 1:06 AM, "Doug Cutting" <cu...@apache.org> wrote:
> >
> >> I was trying to explain my comment, where I stated that, "changing the
> >> default implementation to return false would be an incompatible
> >> change".  The patch was added 6 months after that comment, so the
> >> comment didn't address the patch.
> >>
> >> The patch does not appear to change the default implementation to
> >> return false unless the suffix of the file name is that of a known
> >> unsplittable compression format.  So the folks who'd be harmed by this
> >> are those who used a suffix like ".gz" for an Avro, Parquet or
> >> other-format file.  Their applications might suddenly run much slower
> >> and it would be difficult for them to determine why.  Such folks are
> >> probably few, but perhaps exist.  I'd prefer a change that avoided
> >> that possibility entirely.
> >>
> >> Doug
> >>
> >> On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> >> > Hi,
> >> >
> >> > The way I see the effects of the original patch on existing subclasses:
> >> > - implemented isSplitable
> >> >    --> no performance difference.
> >> > - did not implement isSplitable
> >> >    --> then there is no performance difference if the container is either
> >> > not compressed or uses a splittable compression.
> >> >    --> If it uses a common non splittable compression (like gzip) then the
> >> > output will suddenly be different (which is the correct answer) and the
> >> > jobs will finish sooner because the input is not processed multiple times.
> >> >
> >> > Where do you see a performance impact?
> >> >
> >> > Niels
> >> > On May 30, 2014 8:06 PM, "Doug Cutting" <cu...@apache.org> wrote:
> >> >
> >> >> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <Ni...@basjes.nl> wrote:
> >> >> > For arguments I still do not fully understand this was rejected by
> >> >> > Todd and Doug.
> >> >>
> >> >> Performance is a part of compatibility.
> >> >>
> >> >> Doug
> >> >>
> >>
>

Re: Change proposal for FileInputFormat isSplitable

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

I talked to some people and they agreed with me that the situation where
this problem really occurs is when someone builds a FileInputFormat
derivative that also uses a LineRecordReader derivative. This is exactly
the scenario that occurs if someone follows the Yahoo Hadoop tutorial.

Instead of changing the FileInputFormat (which many of the committers
considered to be a bad idea) I created a very simple patch for the
LineRecordReader that throws an exception (intentionally failing the entire
job) when it receives a split for a compressed file that had not been
compressed using a SplittableCompressionCodec and where the split does not
start at the beginning of the file. So fail if it detects a non splittable
file that has been split.
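The guard could look roughly like this. This is a sketch in the spirit of the patch; in the real patch the logic would live in LineRecordReader's initialization, and the class and method names here are illustrative only:

```java
// Sketch of the fail-fast check described above. In the real patch this
// logic would sit in LineRecordReader; the names here are illustrative.
public class SplitGuard {
    static void checkSplit(long splitStart, boolean compressedInput,
                           boolean splittableCodec) {
        if (compressedInput && !splittableCodec && splitStart != 0) {
            // A non-splittable compressed file was split anyway:
            // fail the task (and thus the job) instead of silently
            // producing garbage.
            throw new IllegalStateException(
                    "Cannot start mid-file in input compressed with a "
                    + "non-splittable codec; fix the InputFormat's isSplitable()");
        }
    }

    public static void main(String[] args) {
        checkSplit(0L, true, false); // first split of a gzipped file: allowed
        boolean rejected = false;
        try {
            checkSplit(128L * 1024 * 1024, true, false); // later split
        } catch (IllegalStateException expected) {
            rejected = true;
        }
        System.out.println("later split rejected: " + rejected);
    }
}
```

Note that the check is cheap: it only inspects the split offset and codec properties, so valid jobs pay essentially nothing.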

So if you run this against a 1GB gzipped file then the first split of the
whole file will complete successfully and all other splits will fail without
even reading a single line.

As far as I can tell this is a simple, clean and compatible patch that does
not break anything. Also the change is limited to the most common place
where this problem occurs.
The only 'big' effect is that people who have been running a broken
implementation will no longer be able to run this broken code iff they feed
it 'large' non-splittable files. Which I think is a good thing.

What do you (the committers) think of this approach?

The patch I submitted a few days ago also includes the JavaDoc improvements
(in FileInputFormat) provided by Gian Merlino.

https://issues.apache.org/jira/browse/MAPREDUCE-2094

Niels Basjes

P.S. I still think that the FileInputFormat.isSplitable() should implement
a safe default instead of an optimistic default.
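A 'safe default' in this sense could be as simple as refusing to split anything the format author has not explicitly vouched for. The sketch below is hypothetical (the class here only mirrors the shape of FileInputFormat's isSplitable, it is not the Hadoop class):

```java
// Hypothetical sketch (not the actual Hadoop API) of a pessimistic
// default: never split unless a subclass explicitly opts in.
public class SafeDefaults {
    public static class SafeFileInputFormat {
        // Safe default: refuse to split; correctness over speed.
        protected boolean isSplitable(String filename) {
            return false;
        }
    }

    // A format that knows its container supports splitting opts back in.
    public static class MySplittableFormat extends SafeFileInputFormat {
        @Override
        protected boolean isSplitable(String filename) {
            return !filename.endsWith(".gz"); // still refuse plain gzip
        }
    }

    public static void main(String[] args) {
        System.out.println(new SafeFileInputFormat().isSplitable("a.txt"));
        System.out.println(new MySplittableFormat().isSplitable("a.txt"));
    }
}
```

The cost of this default is only performance for formats that never override it; the optimistic default risks correctness instead.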



On Sat, Jun 14, 2014 at 10:33 AM, Niels Basjes <Ni...@basjes.nl> wrote:

> I did some digging through the code base and inspected all the situations
> I know where this goes wrong (including the Yahoo tutorial) and found a
> place that may be a spot to avoid the effects of this problem. (Instead of
> solving the cause of the problem)
>
> It turns out that all of those use cases use the LineRecordReader to read
> the data. This class (both the mapred and mapreduce versions) has the
> notion of the split that needs to be read, whether the file is compressed,
> and whether this is a splittable compression codec.
>
> Now if we were to add code there that validates whether the provided splits
> are valid or not (i.e. did the developer make this bug or not) then we could
> avoid the garbage data problem before it is fed into the actual mapper.
> This must then write error messages (+ message "did you know you have been
> looking at corrupted data for a long time") that will appear in the logs of
> all the mapper attempts.
>
> At that point we can do one of these two actions in the LineRecordReader:
> - Fail hard with an exception. The job fails and the user immediately goes
> to the developer of the inputformat with a bug report.
> - Avoid the problem: Read the entire file iff the start of the split is 0,
> else read nothing. Many users will see a dramatic change in their results
> and (hopefully) start digging deeper. (Iff a human actually looks at the
> data)
>
> I vote for the "fail hard" because then people are forced to fix the
> problem and correct the historical impact.
>
> Would this be a good / compatible solution?
>
> If so then I think we should have this in both the 2.x and 3.x.
>
> For the 3.x I also realized that perhaps the isSplittable is something
> that could be delegated to the record reader. Would that make sense or is
> this something that does not belong there?
> If not then I would still propose making the isSplittable abstract to fix
> the problem before it is created (in 3.x)
>
> Niels Basjes
> On Jun 13, 2014 11:47 PM, "Chris Douglas" <cd...@apache.org> wrote:
>
>> On Fri, Jun 13, 2014 at 2:54 AM, Niels Basjes <Ni...@basjes.nl> wrote:
>> > Hmmm, people only look at logs when they have a problem. So I don't think
>> > this would be enough.
>>
>> This change to the framework will cause disruptions to users, to aid
>> InputFormat authors' debugging. The latter is a much smaller
>> population and better equipped to handle this complexity.
>>
>> A log statement would print during submission, so it would be visible
>> to users. If a user's job is producing garbage but submission was
>> non-interactive, a log statement would be sufficient to debug the
>> issue. If the naming conflict is common in some contexts, the warning
>> can be disabled using the log configuration.
>>
>> Beyond that, input validation is the responsibility of the InputFormat
>> author.
>>
>> > Perhaps this makes sense:
>> > - For 3.0: Shout at the developer who does it wrong (i.e. make it abstract
>> > and force them to think about this) i.e. Create new abstract method
>> > isSplittable (tt) in FileInputFormat, remove isSplitable (one t).
>> >
>> > To avoid needless code duplication (which we already have in the codebase)
>> > create a helper method something like 'fileNameIndicatesSplittableFile'
>> > ( returns enum: Splittable/NonSplittable/Unknown ).
>> >
>> > - For 2.x: Keep the end user safe: Avoid "silently producing garbage" in all
>> > situations where the developer already did it wrong. (i.e. change
>> > isSplitable ==> return false) This costs performance only in those
>> > situations where the developer actually did it wrong (i.e. they didn't
>> > think this through)
>> >
>> > How about that?
>>
>> -1 on the 2.x change for compatibility reasons.
>>
>> While we can break compatibility in the 3.x line, the tradeoff is
>> still not very compelling, frankly. -C
>>
>


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Change proposal for FileInputFormat isSplitable

Posted by Niels Basjes <Ni...@basjes.nl>.
I did some digging through the code base and inspected all the situations I
know where this goes wrong (including the Yahoo tutorial) and found a place
that may be a spot to avoid the effects of this problem. (Instead of
solving the cause of the problem)

It turns out that all of those use cases use the LineRecordReader to read
the data. This class (both the mapred and mapreduce versions) has the
notion of the split that needs to be read, whether the file is compressed,
and whether this is a splittable compression codec.

Now if we were to add code there that validates whether the provided splits
are valid or not (i.e. did the developer make this bug or not) then we could
avoid the garbage data problem before it is fed into the actual mapper.
This must then write error messages (+ message "did you know you have been
looking at corrupted data for a long time") that will appear in the logs of
all the mapper attempts.

At that point we can do one of these two actions in the LineRecordReader:
- Fail hard with an exception. The job fails and the user immediately goes
to the developer of the inputformat with a bug report.
- Avoid the problem: Read the entire file iff the start of the split is 0,
else read nothing. Many users will see a dramatic change in their results
and (hopefully) start digging deeper. (Iff a human actually looks at the
data)

I vote for the "fail hard" because then people are forced to fix the
problem and correct the historical impact.

Would this be a good / compatible solution?

If so then I think we should have this in both the 2.x and 3.x.

For the 3.x I also realized that perhaps the isSplittable is something that
could be delegated to the record reader. Would that make sense or is this
something that does not belong there?
If not then I would still propose making the isSplittable abstract to fix
the problem before it is created (in 3.x)

Niels Basjes
On Jun 13, 2014 11:47 PM, "Chris Douglas" <cd...@apache.org> wrote:

> On Fri, Jun 13, 2014 at 2:54 AM, Niels Basjes <Ni...@basjes.nl> wrote:
> > Hmmm, people only look at logs when they have a problem. So I don't think
> > this would be enough.
>
> This change to the framework will cause disruptions to users, to aid
> InputFormat authors' debugging. The latter is a much smaller
> population and better equipped to handle this complexity.
>
> A log statement would print during submission, so it would be visible
> to users. If a user's job is producing garbage but submission was
> non-interactive, a log statement would be sufficient to debug the
> issue. If the naming conflict is common in some contexts, the warning
> can be disabled using the log configuration.
>
> Beyond that, input validation is the responsibility of the InputFormat
> author.
>
> > Perhaps this makes sense:
> > - For 3.0: Shout at the developer who does it wrong (i.e. make it abstract
> > and force them to think about this) i.e. Create new abstract method
> > isSplittable (tt) in FileInputFormat, remove isSplitable (one t).
> >
> > To avoid needless code duplication (which we already have in the codebase)
> > create a helper method something like 'fileNameIndicatesSplittableFile'
> > ( returns enum: Splittable/NonSplittable/Unknown ).
> >
> > - For 2.x: Keep the end user safe: Avoid "silently producing garbage" in all
> > situations where the developer already did it wrong. (i.e. change
> > isSplitable ==> return false) This costs performance only in those
> > situations where the developer actually did it wrong (i.e. they didn't
> > think this through)
> >
> > How about that?
>
> -1 on the 2.x change for compatibility reasons.
>
> While we can break compatibility in the 3.x line, the tradeoff is
> still not very compelling, frankly. -C
>

Re: Change proposal for FileInputFormat isSplitable

Posted by Chris Douglas <cd...@apache.org>.
On Fri, Jun 13, 2014 at 2:54 AM, Niels Basjes <Ni...@basjes.nl> wrote:
> Hmmm, people only look at logs when they have a problem. So I don't think
> this would be enough.

This change to the framework will cause disruptions to users, to aid
InputFormat authors' debugging. The latter is a much smaller
population and better equipped to handle this complexity.

A log statement would print during submission, so it would be visible
to users. If a user's job is producing garbage but submission was
non-interactive, a log statement would be sufficient to debug the
issue. If the naming conflict is common in some contexts, the warning
can be disabled using the log configuration.

Beyond that, input validation is the responsibility of the InputFormat author.

> Perhaps this makes sense:
> - For 3.0: Shout at the developer who does it wrong (i.e. make it abstract
> and force them to think about this) i.e. Create new abstract method
> isSplittable (tt) in FileInputFormat, remove isSplitable (one t).
>
> To avoid needless code duplication (which we already have in the codebase)
> create a helper method something like 'fileNameIndicatesSplittableFile' (
> returns enum:  Splittable/NonSplittable/Unknown ).
>
> - For 2.x: Keep the end user safe: Avoid "silently producing garbage" in all
> situations where the developer already did it wrong. (i.e. change
> isSplitable ==> return false) This costs performance only in those
> situations where the developer actually did it wrong (i.e. they didn't
> think this through)
>
> How about that?

-1 on the 2.x change for compatibility reasons.

While we can break compatibility in the 3.x line, the tradeoff is
still not very compelling, frankly. -C

Re: Change proposal for FileInputFormat isSplitable

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

On Wed, Jun 11, 2014 at 8:25 PM, Chris Douglas <cd...@apache.org> wrote:

> On Wed, Jun 11, 2014 at 1:35 AM, Niels Basjes <Ni...@basjes.nl> wrote:
> > That's not what I meant. What I understood from what was described is that
> > sometimes people use an existing file extension (like .gz) for a file that
> > is not a gzipped file.
>


> Understood, but this change also applies to other loaded codecs, like
> .lzo, .bz, etc. Adding a new codec changes the default behavior for
> all InputFormats that don't override this method.
>

Yes it would. I think that forcing the developer of the file-based
InputFormat to implement this would be the best way to go.
Making this method abstract is the first thing that springs to mind.

This would break backwards compatibility, so I think we can only do that
with the 3.0.0 version.


> > I consider "silently producing garbage" one of the worst kinds of problem
> > to tackle.
> > Because many custom file based input formats have stumbled (getting
> > "silently produced garbage") over the current isSplitable implementation I
> > really want to avoid any more of this in the future.
> > That is why I want to change the implementations in this area of Hadoop in
> > such a way that this "silently producing garbage" effect is taken out.
>
> Adding validity assumptions to a common base class will affect a lot
> of users, most of whom are not InputFormat authors.
>

True, the thing is that if a user uses an InputFormat written by someone
else and it then "silently produces garbage", they are also affected in a
much worse way.


> > So the question remains: What is the way this should be changed?
> > I'm willing to build it and submit a patch.
>
> Would a logged warning suffice? This would aid debugging without an
> incompatible change in behavior. It could also be disabled easily. -C


Hmmm, people only look at logs when they have a problem. So I don't think
this would be enough.

Perhaps this makes sense:
- For 3.0: Shout at the developer who does it wrong (i.e. make it abstract
and force them to think about this) i.e. Create new abstract method
isSplittable (tt) in FileInputFormat, remove isSplitable (one t).

To avoid needless code duplication (which we already have in the codebase)
create a helper method something like 'fileNameIndicatesSplittableFile' (
returns enum:  Splittable/NonSplittable/Unknown ).

- For 2.x: Keep the end user safe: Avoid "silently producing garbage" in all
situations where the developer already did it wrong. (i.e. change
isSplitable ==> return false) This costs performance only in those
situations where the developer actually did it wrong (i.e. they didn't
think this through)

How about that?
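The helper suggested above could be sketched like this. The method name and enum follow the suggestion in this mail; everything else is hypothetical and not part of the actual Hadoop API, and which suffix lands in which category is purely illustrative:

```java
// Hypothetical sketch of the proposed helper; not the actual Hadoop API.
// The suffix-to-category mapping below is illustrative only.
public class SplittabilityHelper {
    enum Splittability { SPLITTABLE, NON_SPLITTABLE, UNKNOWN }

    static Splittability fileNameIndicatesSplittableFile(String name) {
        if (name.endsWith(".gz")) {
            return Splittability.NON_SPLITTABLE; // plain gzip cannot be split
        }
        if (name.endsWith(".bz2")) {
            return Splittability.SPLITTABLE;     // bzip2 supports split reads
        }
        return Splittability.UNKNOWN;            // no registered suffix
    }

    public static void main(String[] args) {
        System.out.println(fileNameIndicatesSplittableFile("part-0.gz"));
        System.out.println(fileNameIndicatesSplittableFile("part-0.txt"));
    }
}
```

The three-valued result is the point: UNKNOWN lets each subclass decide its own policy instead of the base class guessing true or false.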

P.S. I created an issue for the NLineInputFormat problem I found:
https://issues.apache.org/jira/browse/MAPREDUCE-5925

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Change proposal for FileInputFormat isSplitable

Posted by Chris Douglas <cd...@apache.org>.
On Wed, Jun 11, 2014 at 1:35 AM, Niels Basjes <Ni...@basjes.nl> wrote:
> That's not what I meant. What I understood from what was described is that
> sometimes people use an existing file extension (like .gz) for a file that
> is not a gzipped file.

Understood, but this change also applies to other loaded codecs, like
.lzo, .bz, etc. Adding a new codec changes the default behavior for
all InputFormats that don't override this method.

> I consider "silently producing garbage" one of the worst kinds of problem
> to tackle.
> Because many custom file based input formats have stumbled (getting
> "silently produced garbage") over the current isSplitable implementation I
> really want to avoid any more of this in the future.
> That is why I want to change the implementations in this area of Hadoop in
> such a way that this "silently producing garbage" effect is taken out.

Adding validity assumptions to a common base class will affect a lot
of users, most of whom are not InputFormat authors.

> So the question remains: What is the way this should be changed?
> I'm willing to build it and submit a patch.

Would a logged warning suffice? This would aid debugging without an
incompatible change in behavior. It could also be disabled easily. -C

>> > The safest way would be either 2 or 4. Solution 3 would effectively be the
>> > same as the current implementation, yet it would catch the problem
>> > situations as long as people stick to normal file name conventions.
>> > Solution 3 would also allow removing some code duplication in several
>> > subclasses.
>> >
>> > I would go for solution 3.
>> >
>> > Niels Basjes
>>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes

Re: Change proposal for FileInputFormat isSplitable

Posted by Niels Basjes <Ni...@basjes.nl>.
On Tue, Jun 10, 2014 at 8:10 PM, Chris Douglas <cd...@apache.org> wrote:

> On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> > and if you then give the file the .gz extension this breaks all common
> > sense / conventions about file names.
>


> That the suffix for all compression codecs in every context- and all
> future codecs- should determine whether a file can be split is not an
> assumption we can make safely. Again, that's not an assumption that
> held when people built their current systems, and they would be justly
> annoyed with the project for changing it.


That's not what I meant. What I understood from what was described is that
sometimes people use an existing file extension (like .gz) for a file that
is not a gzipped file.
Whether a file is splittable or not depends greatly on the actual codec
implementation that is used to read it. Using the default GzipCodec a .gz
file is not splittable, but that can be changed with a different
implementation like for example this
https://github.com/nielsbasjes/splittablegzip
So given a file extension, the file 'must' actually be in the format that
the extension describes.

The flow is roughly as follows:
- What is the file extension?
- Get the codec class registered to that extension.
- Is this a splittable codec? (Does this class implement the
SplittableCompressionCodec interface?)
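That flow can be sketched self-contained as follows. The interfaces here mimic Hadoop's CompressionCodec / SplittableCompressionCodec but are local stand-ins; real code would use CompressionCodecFactory.getCodec on the path instead of the hard-coded map:

```java
import java.util.Map;

// Self-contained sketch of the three steps above. The interfaces mimic
// Hadoop's CompressionCodec/SplittableCompressionCodec but are local
// stand-ins; real code would use CompressionCodecFactory.getCodec(Path).
public class SplitFlow {
    interface CompressionCodec {}
    interface SplittableCompressionCodec extends CompressionCodec {}
    static class GzipCodec implements CompressionCodec {}
    static class BZip2Codec implements SplittableCompressionCodec {}

    static final Map<String, CompressionCodec> BY_SUFFIX = Map.of(
            ".gz", new GzipCodec(),
            ".bz2", new BZip2Codec());

    static boolean isSplitable(String filename) {
        // Steps 1 and 2: suffix -> registered codec (null = uncompressed).
        CompressionCodec codec = null;
        for (Map.Entry<String, CompressionCodec> e : BY_SUFFIX.entrySet()) {
            if (filename.endsWith(e.getKey())) {
                codec = e.getValue();
            }
        }
        if (codec == null) {
            return true; // no codec registered: plain input, splittable
        }
        // Step 3: splittable only if the codec implements the marker interface.
        return codec instanceof SplittableCompressionCodec;
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("events.gz"));
        System.out.println(isSplitable("events.bz2"));
        System.out.println(isSplitable("events.txt"));
    }
}
```

Swapping in a splittable gzip codec (like the one linked above) under the ".gz" suffix would flip the answer, which is exactly the point being made here: splittability belongs to the codec, not the extension.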

> > I hold "correct data" much higher than performance and scalability; so the
> > performance impact is a concern but it is much less important than the list
> > of bugs we are facing right now.
>
> These are not bugs. NLineInputFormat doesn't support compressed input,
> and why would it? -C
>

I'm not saying it should (in fact, for this one I agree that it shouldn't).
The reality is that it accepts the file, decompresses it and then produces
output that 'looks good' but really is garbage.

I consider "silently producing garbage" one of the worst kinds of problem
to tackle.
Because many custom file based input formats have stumbled (getting
"silently produced garbage") over the current isSplitable implementation I
really want to avoid any more of this in the future.
That is why I want to change the implementations in this area of Hadoop in
such a way that this "silently producing garbage" effect is taken out.

So the question remains: What is the way this should be changed?
I'm willing to build it and submit a patch.




> > The safest way would be either 2 or 4. Solution 3 would effectively be the
> > same as the current implementation, yet it would catch the problem
> > situations as long as people stick to normal file name conventions.
> > Solution 3 would also allow removing some code duplication in several
> > subclasses.
> >
> > I would go for solution 3.
> >
> > Niels Basjes
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Change proposal for FileInputFormat isSplitable

Posted by Chris Douglas <cd...@apache.org>.
On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> and if you then give the file the .gz extension this breaks all common
> sense / conventions about file names.

That the suffix for all compression codecs in every context- and all
future codecs- should determine whether a file can be split is not an
assumption we can make safely. Again, that's not an assumption that
held when people built their current systems, and they would be justly
annoyed with the project for changing it.

> I hold "correct data" much higher than performance and scalability; so the
> performance impact is a concern but it is much less important than the list
> of bugs we are facing right now.

These are not bugs. NLineInputFormat doesn't support compressed input,
and why would it? -C

> The safest way would be either 2 or 4. Solution 3 would effectively be the
> same as the current implementation, yet it would catch the problem
> situations as long as people stick to normal file name conventions.
> Solution 3 would also allow removing some code duplication in several
> subclasses.
>
> I would go for solution 3.
>
> Niels Basjes

Re: Change proposal for FileInputFormat isSplitable

Posted by Niels Basjes <Ni...@basjes.nl>.
On Mon, Jun 2, 2014 at 1:21 AM, Chris Douglas <cd...@apache.org> wrote:

> On Sat, May 31, 2014 at 10:53 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> > The Hadoop framework uses the filename extension to automatically insert
> > the "right" decompression codec in the read pipeline.
>
> This would be the new behavior, incompatible with existing code.
>

You are right, I was wrong. It is the LineRecordReader that inserts it.

Looking at this code and where it is used I noticed that the bug I'm trying
to prevent is present in the current trunk.
The NLineInputFormat does not override isSplitable and uses the
LineRecordReader, which is capable of reading gzipped input. The overall
effect is that this InputFormat silently produces garbage (missing lines +
duplicated lines) when run against a gzipped file. I just verified this.

> So if someone does what you describe then they would need to unload all
> compression codecs or face decompression errors. And if it really was
> gzipped then it would not be splittable at all.

> Assume an InputFormat configured for a job assumes that isSplitable
> returns true because it extends FileInputFormat. After the change, it
> could spuriously return false based on the suffix of the input files.
> In the prenominate example, SequenceFile is splittable, even if the
> codec used in each block is not. -C
>

and if you then give the file the .gz extension this breaks all common
sense / conventions about file names.


Let's reiterate the options I see now:
1) isSplitable --> return true
    Too unsafe, I say "must change". I alone hit my head twice so far on
this, many others have too, even the current trunk still has this bug in
there.

2) isSplitable --> return false
    Safe but too slow in some cases. In those cases the actual
implementation can simply override it very easily and regain its original
performance.

3) isSplitable --> true (same as the current implementation) unless you use
a file extension that is associated with a non-splittable compression codec
(i.e. .gz or something like that).
    If a custom format wants to break with well known conventions about
filenames then they should simply override the isSplitable with their own.

4) isSplitable --> abstract
    Compatibility breaker. I see this as the cleanest way to force the
developer of the custom fileinputformat to think about their specific case.

I hold "correct data" much higher than performance and scalability; so the
performance impact is a concern but it is much less important than the list
of bugs we are facing right now.

The safest way would be either 2 or 4. Solution 3 would effectively be the
same as the current implementation, yet it would catch the problem
situations as long as people stick to normal file name conventions.
Solution 3 would also allow removing some code duplication in several
subclasses.

I would go for solution 3.

Niels Basjes

Re: Change proposal for FileInputFormat isSplitable

Posted by Chris Douglas <cd...@apache.org>.
On Sat, May 31, 2014 at 10:53 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> The Hadoop framework uses the filename extension to automatically insert
> the "right" decompression codec in the read pipeline.

This would be the new behavior, incompatible with existing code.

> So if someone does what you describe then they would need to unload all
> compression codecs or face decompression errors. And if it really was
> gzipped then it would not be splittable at all.

Assume an InputFormat configured for a job assumes that isSplitable
returns true because it extends FileInputFormat. After the change, it
could spuriously return false based on the suffix of the input files.
In the prenominate example, SequenceFile is splittable, even if the
codec used in each block is not. -C

> Niels
> On May 31, 2014 11:12 PM, "Chris Douglas" <cd...@apache.org> wrote:
>
>> On Fri, May 30, 2014 at 11:05 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>> > How would someone create the situation you are referring to?
>>
>> By adopting a naming convention where the filename suffix doesn't
>> imply that the raw data are compressed with that codec.
>>
>> For example, if a user named SequenceFiles foo.lzo and foo.gz to
>> record which codec was used, then isSplittable would spuriously return
>> false. -C
>>
>> > On May 31, 2014 1:06 AM, "Doug Cutting" <cu...@apache.org> wrote:
>> >
>> >> I was trying to explain my comment, where I stated that, "changing the
>> >> default implementation to return false would be an incompatible
>> >> change".  The patch was added 6 months after that comment, so the
>> >> comment didn't address the patch.
>> >>
>> >> The patch does not appear to change the default implementation to
>> >> return false unless the suffix of the file name is that of a known
>> >> unsplittable compression format.  So the folks who'd be harmed by this
>> >> are those who used a suffix like ".gz" for an Avro, Parquet or
>> >> other-format file.  Their applications might suddenly run much slower
>> >> and it would be difficult for them to determine why.  Such folks are
>> >> probably few, but perhaps exist.  I'd prefer a change that avoided
>> >> that possibility entirely.
>> >>
>> >> Doug
>> >>
>> >> On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>> >> > Hi,
>> >> >
>> >> > The way I see the effects of the original patch on existing subclasses:
>> >> > - implemented isSplitable
>> >> >    --> no performance difference.
>> >> > - did not implement isSplitable
>> >> >    --> then there is no performance difference if the container is either
>> >> > not compressed or uses a splittable compression.
>> >> >    --> If it uses a common non splittable compression (like gzip) then the
>> >> > output will suddenly be different (which is the correct answer) and the
>> >> > jobs will finish sooner because the input is not processed multiple times.
>> >> >
>> >> > Where do you see a performance impact?
>> >> >
>> >> > Niels
>> >> > On May 30, 2014 8:06 PM, "Doug Cutting" <cu...@apache.org> wrote:
>> >> >
>> >> >> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <Ni...@basjes.nl> wrote:
>> >> >> > For arguments I still do not fully understand this was rejected by
>> >> >> > Todd and Doug.
>> >> >>
>> >> >> Performance is a part of compatibility.
>> >> >>
>> >> >> Doug
>> >> >>
>> >>
>>