Posted to common-dev@hadoop.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2006/09/15 21:21:15 UTC

Re: svn commit: r446695 - in /lucene/hadoop/trunk: CHANGES.txt src/java/org/apache/hadoop/mapred/JobConf.java

cutting@apache.org wrote:
> Author: cutting
> Date: Fri Sep 15 12:14:50 2006
> New Revision: 446695
>
> URL: http://svn.apache.org/viewvc?view=rev&rev=446695
> Log:
> HADOOP-534.  Change the default value classes in JobConf to be Text, not the now-deprecated UTF8.  Contributed by Hairong.
>   

Shouldn't such changes be reserved for major releases, i.e. for 0.7? 
Nutch relies heavily on UTF8 being the default; this change will make it 
more difficult to upgrade it to 0.6.2.
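
(One way to reduce that reliance would be for Nutch jobs to declare their 
key/value classes explicitly instead of inheriting JobConf's defaults.  A 
minimal sketch of what I mean; the helper class below is made up, only the 
JobConf setters are real:)

import org.apache.hadoop.io.UTF8;
import org.apache.hadoop.mapred.JobConf;

public class PinJobClasses {

  // Illustrative only: state the key/value classes the job actually emits,
  // so it no longer depends on JobConf's built-in defaults
  // (UTF8 before HADOOP-534, Text after it).
  public static void pinToUtf8(JobConf job) {
    job.setOutputKeyClass(UTF8.class);     // map/reduce output key type
    job.setOutputValueClass(UTF8.class);   // map/reduce output value type
  }
}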

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: svn commit: r446695 - in /lucene/hadoop/trunk: CHANGES.txt src/java/org/apache/hadoop/mapred/JobConf.java

Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> If you consider users that collected terabytes of data using 0.6.1, 
> there must be a way for them to upgrade this data to whatever release 
> comes next. My thinking was that if we have a release that contains both 
> UTF8 and Text, we could write a converter, to be included in application 
> packages (e.g. in Nutch) for that specific release only.

UTF8 is still there; it's just deprecated and no longer the default.

> Let's say I have data in SequenceFile-s and MapFile-s using 0.5.x 
> formats. How would I go about converting them from UTF8 to Text? Would 
> the current code read the data produced by 0.5.x code?

SequenceFiles and MapFiles with UTF8 data can still be written and read 
just fine, since SequenceFile names the classes of its keys and values 
in the file header.  SequenceFile's format has changed, but the change 
is back-compatible, i.e., 0.6 can read sequence files written by 0.5, 
but not vice-versa.
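
So a converter is mostly mechanical: open the old file, take the key class 
recorded in its header, and rewrite the UTF8 values as Text.  A rough sketch 
of that idea; it is written against the later SequenceFile.createWriter 
factory method, and the exact Writer construction in 0.5/0.6 differs, so 
treat it as an outline rather than code that compiles against those releases 
as-is:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.UTF8;
import org.apache.hadoop.io.Writable;

public class Utf8ToTextConverter {

  // Rewrites a SequenceFile whose values are UTF8 into one whose values are
  // Text.  The key class is read from the input file's header and kept as-is.
  public static void convert(FileSystem fs, Configuration conf,
                             Path in, Path out) throws Exception {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, reader.getKeyClass(), Text.class);
    try {
      Writable key = (Writable) reader.getKeyClass().newInstance();
      UTF8 oldValue = new UTF8();
      Text newValue = new Text();
      while (reader.next(key, oldValue)) {
        newValue.set(oldValue.toString());   // re-encode as standard UTF-8
        writer.append(key, newValue);
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}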

Converting the data will require code changes, and code changes may also be 
required to get things to run correctly, since TextInputFormat now 
returns Text instances rather than UTF8 instances.  Code changes are 
also required for things which relied on the default value types (mostly 
things which also used TextInputFormat).
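
Concretely, a map method that used to cast the value handed to it by 
TextInputFormat to UTF8 now has to cast it to Text, and usually emit Text as 
well.  A minimal sketch against the old non-generic Mapper interface; the 
class name is made up:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LineMapper implements Mapper {

  public void configure(JobConf job) {}
  public void close() throws IOException {}

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    Text line = (Text) value;                        // was: (UTF8) value in 0.5
    output.collect(key, new Text(line.toString()));  // emit Text, not UTF8
  }
}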

We could find no way to seamlessly upgrade UTF8 so that it was no longer 
limited to less than 64k bytes, so we decided it was better to make a clear 
break.  This also permitted us to change to using real UTF-8, rather 
than Java's modified UTF-8 encoding. 
(http://issues.apache.org/jira/browse/HADOOP-302)
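
(For anyone wondering where the 64k limit comes from: UTF8's on-disk form is, 
as far as I recall, the same scheme as java.io.DataOutput.writeUTF, a two-byte 
length prefix followed by Java's modified UTF-8, so a value whose encoded form 
exceeds 65535 bytes cannot be represented at all, while Text stores a 
variable-length byte count followed by standard UTF-8.  A tiny illustration 
using plain JDK classes plus Text:)

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.UTFDataFormatException;
import java.util.Arrays;
import org.apache.hadoop.io.Text;

public class Utf8LimitDemo {
  public static void main(String[] args) throws Exception {
    char[] big = new char[100000];
    Arrays.fill(big, 'x');
    String s = new String(big);

    // Text has no 64k limit and stores standard UTF-8 bytes.
    Text t = new Text(s);
    System.out.println("Text byte length: " + t.getLength());  // 100000

    // writeUTF uses a two-byte length prefix, so any string whose encoded
    // form exceeds 65535 bytes is rejected outright.
    DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
    try {
      out.writeUTF(s);
    } catch (UTFDataFormatException e) {
      System.out.println("writeUTF rejected the >64k string: " + e.getMessage());
    }
  }
}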

>> So should we revert these in 0.6?
> 
> This looks like a lot of work ... perhaps we should just burn bridges, 
> and make a 0.7.0 at this point, because it's definitely not API 
> compatible with 0.6.1.

But 0.6.0 has not yet been a stable release that folks could use.

> As for Nutch ... it could be upgraded to 0.6.1. On the other hand, Nutch 
> is not compatible with 0.6.1 either, so perhaps it should be upgraded to 
> 0.7.0 (plus a suitable converter for existing data).

I think we should release 0.6.2 with this patch, and update Nutch to use 
that release.  In general, we should probably not update Nutch until 
Hadoop releases are stable, which sometimes takes a week.  This is 
isomorphic to calling it 0.7.0, but is more consistent with our monthly 
releases with bugfix point releases in the first week.  With a monthly 
release schedule we cannot afford to do a lot of testing (alphas, betas, 
etc.) before releases are made.  If an incompatible change is 
half-completed, then I think it's reasonable to complete it as a bugfix 
rather than force a new major version number.

Doug

Re: svn commit: r446695 - in /lucene/hadoop/trunk: CHANGES.txt src/java/org/apache/hadoop/mapred/JobConf.java

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> Shouldn't such changes be reserved for major releases, i.e. for 0.7? 
>> Nutch relies heavily on UTF8 being the default; this change will make 
>> it more difficult to upgrade it to 0.6.2.
>
> Good question.  I think the intent was to switch as much as possible 
> from UTF8 to Text in 0.6.  Lots of things were switched, but these 
> defaults were missed.  So I was considering 0.6 the major release that 
> contains the change from UTF8 to Text in public APIs.

Hmm. Without having at least one official release where we have both 
UTF8 and Text, and the API is compatible, there will be no easy way to 
upgrade existing data. The latest release to offer this is 0.6.1, if I'm 
not mistaken - or perhaps 0.5.x, if we consider changes to SequenceFile 
format...?

If you consider users that collected terabytes of data using 0.6.1, 
there must be a way for them to upgrade this data to whatever release 
comes next. My thinking was that if we have a release that contains both 
UTF8 and Text, we could write a converter, to be included in application 
packages (e.g. in Nutch) for that specific release only.

Let's say I have data in SequenceFile-s and MapFile-s using 0.5.x 
formats. How would I go about converting them from UTF8 to Text? Would 
the current code read the data produced by 0.5.x code?

>
> Right now, in 0.6, the default input format is not consistent 
> (TextInputFormat now returns Text, not UTF8).  In our current monthly 
> release strategy, the .0 releases are effectively alphas, candidates 
> that sometimes are good enough to become the final release, and 
> sometimes require point releases.
>
> A consistent alternative might be to revert other places where UTF8 
> was changed to Text.
>
> http://issues.apache.org/jira/browse/HADOOP-450 (TextInputFormat)
> http://issues.apache.org/jira/browse/HADOOP-499 (contrib/streaming)
> http://issues.apache.org/jira/browse/HADOOP-460 (smallJobsBenchmark)
>
> So should we revert these in 0.6?

This looks like a lot of work ... perhaps we should just burn bridges, 
and make a 0.7.0 at this point, because it's definitely not API 
compatible with 0.6.1.

As for Nutch ... it could be upgraded to 0.6.1. On the other hand, Nutch 
is not compatible with 0.6.1 either, so perhaps it should be upgraded to 
0.7.0 (plus a suitable converter for existing data).


> I hate incompatible changes, but didn't see a way to make this change 
> compatibly, yet it seems like a good change.  What do you think?

I propose to skip 0.6.2, and go directly to 0.7.0. And I would 
appreciate any insights into the above questions about converting old 
data ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: svn commit: r446695 - in /lucene/hadoop/trunk: CHANGES.txt src/java/org/apache/hadoop/mapred/JobConf.java

Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> Shouldn't such changes be reserved for major releases, i.e. for 0.7? 
> Nutch relies heavily on UTF8 being the default; this change will make it 
> more difficult to upgrade it to 0.6.2.

Good question.  I think the intent was to switch as much as possible 
from UTF8 to Text in 0.6.  Lots of things were switched, but these 
defaults were missed.  So I was considering 0.6 the major release that 
contains the change from UTF8 to Text in public APIs.

Right now, in 0.6, the default input format is not consistent 
(TextInputFormat now returns Text, not UTF8).  In our current monthly 
release strategy, the .0 releases are effectively alphas, candidates 
that sometimes are good enough to become the final release, and 
sometimes require point releases.

A consistent alternative might be to revert other places where UTF8 was 
changed to Text.

http://issues.apache.org/jira/browse/HADOOP-450 (TextInputFormat)
http://issues.apache.org/jira/browse/HADOOP-499 (contrib/streaming)
http://issues.apache.org/jira/browse/HADOOP-460 (smallJobsBenchmark)

So should we revert these in 0.6?

The patch at http://issues.apache.org/jira/browse/HADOOP-533 seemed like 
the simplest way to make 0.6 consistent.

I hate incompatible changes, but didn't see a way to make this change 
compatibly, yet it seems like a good change.  What do you think?

Doug