Posted to common-dev@hadoop.apache.org by "Johannes Herr (JIRA)" <ji...@apache.org> on 2014/08/01 17:06:38 UTC

[jira] [Created] (HADOOP-10921) MapFile.fix fails silently when file is block compressed

Johannes Herr created HADOOP-10921:
--------------------------------------

             Summary: MapFile.fix fails silently when file is block compressed
                 Key: HADOOP-10921
                 URL: https://issues.apache.org/jira/browse/HADOOP-10921
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 0.20.2
            Reporter: Johannes Herr


MapFile provides a method 'fix' to reconstruct missing 'index' files. If the 'data' file is block compressed, the method will compute offsets that are too large, which will lead to keys not being found in the MapFile. (See the attached test case.)

Tested against 0.20.2, but the trunk version appears to have the same problem.

The cause of the problem is that 'dataReader.getPosition()' is used to find the offset to write for the next entry to be indexed. When the file is block compressed, however, 'dataReader.getPosition()' seems to return the position of the next compressed block, not of the block that contains the last entry. This position will therefore be too large in most cases, and a seek operation with this offset will incorrectly report the key as not present.
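To make the failure mode concrete, here is a minimal, self-contained simulation (plain Java, not Hadoop code) of the behavior described above. The block size, offsets, and key layout are made up for illustration: entries are grouped three per block, and the "index" records the start of the *next* block for each entry, as 'getPosition()' appears to do. A lookup that seeks to the indexed offset then lands past the block actually holding the key.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the buggy indexing: 9 keys (0..8), three per
// compressed block, block i starting at byte offset i*100.
public class BlockOffsetDemo {
    static final int ENTRIES_PER_BLOCK = 3;
    static final long BLOCK_SIZE = 100L;

    public static void main(String[] args) {
        // index entries: {key, offset-as-recorded-by-the-buggy-fix}
        List<long[]> index = new ArrayList<>();
        for (int key = 0; key < 9; key++) {
            int blockOfEntry = key / ENTRIES_PER_BLOCK;
            // Buggy offset: start of the NEXT block, not the block
            // that actually contains this entry.
            long buggyOffset = (blockOfEntry + 1) * BLOCK_SIZE;
            index.add(new long[]{key, buggyOffset});
        }

        // Simulate a lookup of key 4 (lives in block 1, keys 3..5).
        long offsetForKey4 = index.get(4)[1];
        // The first key stored at that offset, given the layout above:
        long firstKeyAtOffset = (offsetForKey4 / BLOCK_SIZE) * ENTRIES_PER_BLOCK;

        System.out.println("indexed offset for key 4: " + offsetForKey4);
        System.out.println("first key found there: " + firstKeyAtOffset);
        // Keys scan forward from the seek position, so key 4 is never seen.
        System.out.println("key 4 found: " + (firstKeyAtOffset <= 4));
    }
}
```

The seek for key 4 lands at offset 200 (block 2, whose first key is 6), so the scan never encounters key 4 and the lookup reports it missing, matching the silent failure reported here.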

I think it's not obvious how to fix this, since the SequenceFile reader does not expose the offset of the currently buffered entries. I've experimented with watching the offset change, which mostly works, but it is quite ugly and not exact in edge cases.

The method should probably throw an exception when the 'data' file is block compressed instead of silently creating invalid files. A workaround for block-compressed files is to read the 'data' file as a sequence file, write the entries to a new MapFile, and then replace the old file. This also avoids the problems mentioned below.
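The workaround described above might look roughly like the following sketch against the 0.20-era API. This is only an illustration, not a tested implementation: the class name, method name, and path arguments are placeholders, and error handling is omitted. The key point is that MapFile.Writer computes its own index offsets while appending, so the rebuilt index is correct regardless of compression.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper sketching the workaround: stream entries from the
// broken MapFile's 'data' file into a fresh MapFile.
public class RebuildMapFile {
    @SuppressWarnings("unchecked")
    public static void rebuild(Configuration conf, Path dataFile, String newDir)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, dataFile, conf);
        MapFile.Writer writer = new MapFile.Writer(conf, fs, newDir,
                (Class<? extends WritableComparable>) reader.getKeyClass(),
                (Class<? extends Writable>) reader.getValueClass());
        WritableComparable key = (WritableComparable)
                ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable)
                ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            // Writer tracks positions itself, so the new 'index' is valid.
            writer.append(key, value);
        }
        writer.close();
        reader.close();
        // The caller would then replace the old MapFile directory with newDir.
    }
}
```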

A few side notes: 

1. The 'index' files created by the fix method are not block compressed (whereas the 'index' files created by the MapFile Writer always are, since the 'index' file is read completely anyway).

2. The fix method does not index the first entry; the Writer does.

3. The header offset is not used.



--
This message was sent by Atlassian JIRA
(v6.2#6252)