Posted to user@hadoop.apache.org by Ma...@materna.de on 2017/07/26 14:27:49 UTC
MapFile.Reader.getClosest() seeks to behind keys

I am using MapFile.Reader from hadoop-common 2.6.5 (the version packaged with Apache Spark 2.2.0). When I use methods such as seek() or getClosest(), I end up at keys that are very different from, and much larger than, the key I searched for.

In a long debugging session I found the code that is responsible for my problem:

    private synchronized int seekInternal(WritableComparable key,
        final boolean before)
      throws IOException {
      readIndex();                                // make sure index is read

      if (seekIndex != -1                         // seeked before
          && seekIndex+1 < count
          && comparator.compare(key, keys[seekIndex+1])<0 // before next indexed
          && comparator.compare(key, nextKey)
          >= 0) {                                 // but after last seeked
        // do nothing
      } else {
        seekIndex = binarySearch(key);
        if (seekIndex < 0)                        // decode insertion point
          seekIndex = -seekIndex-2;

        if (seekIndex == -1)                      // belongs before first entry
          seekPosition = firstPosition;           // use beginning of file
        else
          seekPosition = positions[seekIndex];    // else use index
      }
      data.seek(seekPosition);

readIndex() builds an in-memory index from the contents of the index file. With my example data I see about 300 entries with positions. There are only 3 distinct positions, at about 300k, 600k and 900k. Because these positions are so high, I assume the index stores references to the second through last blocks of the underlying sequence file. In addition, firstPosition is 203, which is a position at the very beginning of the data file.

seekIndex is always -1 on entry, so the else-block is executed. binarySearch() is a binary search over the in-memory index (from readIndex()) and returns an offset into it. In my example I am searching for a key between the very first and the second key, and binarySearch() returns -4. In all my tests seekPosition is taken from the positions[] array and firstPosition is never used. As a result, the requested key is not found.
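
The decode step described above can be checked in isolation. The following is only a minimal sketch: it assumes that binarySearch() in MapFile.Reader uses the same return convention as java.util.Arrays.binarySearch (a negative result is -(insertion point) - 1), and the class name, helper methods, keys and positions are made up for illustration.

```java
import java.util.Arrays;

public class SeekIndexDecode {

    // Same arithmetic as the "decode insertion point" step in seekInternal():
    // a negative result r = -(p + 1) encodes insertion point p, and
    // -r - 2 = p - 1 is the index of the last entry that sorts before the key.
    static int decode(int searchResult) {
        return searchResult < 0 ? -searchResult - 2 : searchResult;
    }

    // Picks the seek position the way the else-branch does.
    static long seekPositionFor(String[] keys, long[] positions,
                                long firstPosition, String key) {
        // Stand-in for MapFile's binarySearch(); assumed to share its convention.
        int seekIndex = decode(Arrays.binarySearch(keys, key));
        return seekIndex == -1 ? firstPosition : positions[seekIndex];
    }

    public static void main(String[] args) {
        // Hypothetical index contents, not taken from the real file.
        String[] keys = {"b", "d", "f", "h"};
        long[] positions = {1000L, 2000L, 3000L, 4000L};
        long firstPosition = 203L;

        System.out.println(Arrays.binarySearch(keys, "g"));                       // -4
        System.out.println(decode(-4));                                           // 2
        System.out.println(seekPositionFor(keys, positions, firstPosition, "g")); // 3000
        System.out.println(seekPositionFor(keys, positions, firstPosition, "a")); // 203
    }
}
```

Under that assumption, a return value of -4 decodes to seekIndex 2, which would mean that the searched key sorts after three entries of the in-memory index. firstPosition is only chosen when the search returns -1, that is, when the key sorts before every index entry.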

While debugging I set seekPosition = firstPosition by hand, and suddenly the correct key was found. I have worked with several other MapFiles and never had such issues. Does anyone have an idea what is wrong here?


-    I rebuilt the index file with the fix() method (the files are identical)

-    Wrote all keys to a text file. The entries are in the correct order and look fine.

-    Checked the configuration settings, but there seems to be no setting that affects MapFiles in this way. All settings are at their system defaults.

-    Tests with other keys show the same effect: the closest keys are always larger than the requested one. They are behind it.
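
The order check from the second item can also be done programmatically. This is a minimal sketch, assuming the keys have already been dumped into a list. The class and method names are hypothetical, and the comparator passed in has to order keys the same way as the comparator the MapFile was written with. Otherwise a dump can look sorted while the reader sees a different order.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class KeyOrderCheck {

    // Returns the index of the first key that sorts before its predecessor,
    // or -1 if no key does. Equal neighbours are not reported.
    static <K> int firstOutOfOrder(List<K> keys, Comparator<? super K> cmp) {
        for (int i = 1; i < keys.size(); i++) {
            if (cmp.compare(keys.get(i - 1), keys.get(i)) > 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical keys; in practice these would be the dumped MapFile keys.
        List<String> sorted = Arrays.asList("a", "b", "c");
        List<String> broken = Arrays.asList("a", "c", "b");
        Comparator<String> cmp = Comparator.naturalOrder();

        System.out.println(firstOutOfOrder(sorted, cmp)); // -1
        System.out.println(firstOutOfOrder(broken, cmp)); // 2
    }
}
```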

Any ideas?