Posted to user@hadoop.apache.org by Ma...@materna.de on 2017/07/26 14:27:49 UTC
MapFile.Reader.getClosest() seeks to behind keys
I am using MapFile.Reader from hadoop-common 2.6.5 (which is packaged with Apache Spark 2.2.0). When I call methods like seek() or getClosest(), they return keys that are much larger than the key I searched for.
After a long debugging session I found the code that is responsible for my problem:
  private synchronized int seekInternal(WritableComparable key,
      final boolean before)
    throws IOException {
    readIndex();                                // make sure index is read
    if (seekIndex != -1                         // seeked before
        && seekIndex+1 < count
        && comparator.compare(key, keys[seekIndex+1])<0 // before next indexed
        && comparator.compare(key, nextKey)
        >= 0) {                                 // but after last seeked
      // do nothing
    } else {
      seekIndex = binarySearch(key);
      if (seekIndex < 0)                        // decode insertion point
        seekIndex = -seekIndex-2;
      if (seekIndex == -1)                      // belongs before first entry
        seekPosition = firstPosition;           // use beginning of file
      else
        seekPosition = positions[seekIndex];    // else use index
    }
    data.seek(seekPosition);
readIndex() builds an in-memory map from the contents of the index file. With my example data I see about 300 entries, but only three distinct position values: roughly 300k, 600k and 900k. Because these offsets are so large, I assume the map stores references to the second through last blocks of the underlying sequence file. firstPosition, in contrast, is 203, which is an offset at the very beginning of the data file.
The variable seekPosition is always -1, so the else block is executed. binarySearch() appears to be an ordinary binary search over the in-memory keys; it returns an index into the arrays built by readIndex(), or a negative value that encodes the insertion point. In my example I am searching for a key that lies between the very first and the second key, and binarySearch() returns -4, which the code above decodes to seekIndex = 2. In all my tests seekPosition is taken from the positions[] array and firstPosition is never used. As a result, the requested key is not found.
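To make the decoding step easier to follow, here is a minimal, self-contained sketch. It is not Hadoop code: the class SeekDecodeSketch, the long keys and the offsets are made up for illustration, and java.util.Arrays.binarySearch stands in for MapFile's own binarySearch(). It assumes both use the same "-(insertion point) - 1" encoding, which the -seekIndex-2 decoding in the excerpt suggests.

```java
import java.util.Arrays;

// Hypothetical stand-in for the seek logic in the excerpt; not Hadoop code.
class SeekDecodeSketch {

    // Mirrors the "decode insertion point" step: a negative search result
    // becomes the index of the last indexed key that is smaller than the key.
    static int decode(int searchResult) {
        return searchResult < 0 ? -searchResult - 2 : searchResult;
    }

    // Chooses the offset to seek to. A decoded index of -1 means the key
    // sorts before every indexed key, so the start of the file is used.
    static long seekPosition(long[] indexKeys, long[] positions,
                             long firstPosition, long key) {
        int seekIndex = decode(Arrays.binarySearch(indexKeys, key));
        return seekIndex == -1 ? firstPosition : positions[seekIndex];
    }

    public static void main(String[] args) {
        long[] indexKeys = {10, 20, 30, 40};                  // made-up index keys
        long[] positions = {203, 300_000, 600_000, 900_000};  // made-up offsets
        long firstPosition = 203;

        // Key 5 sorts before all index keys: the search returns -1,
        // decode(-1) is -1, and firstPosition is used.
        System.out.println(seekPosition(indexKeys, positions, firstPosition, 5));

        // Key 35 has three smaller index keys: the search returns -4,
        // decode(-4) is 2, and positions[2] is used.
        System.out.println(seekPosition(indexKeys, positions, firstPosition, 35));
    }
}
```

Under that assumption, a result of -4 means the search found three index keys that compare smaller than the requested key, and firstPosition can only be chosen when the result is exactly -1.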
While debugging I set seekPosition = firstPosition by hand and, to my surprise, the correct key was found. I have worked with several other MapFiles and never had such issues. Does anyone have an idea what is wrong here? What I have tried so far:
- Rebuilt the index file with the fix() method (the resulting files are identical).
- Wrote all keys to a text file. The entries are in the correct order and look fine.
- Checked the configuration settings, but there seems to be no setting that affects MapFiles in this way. All settings are at their system defaults.
- Tested with other keys, with the same effect: the closest key returned is always larger than the requested one, i.e. the reader ends up behind it.
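The order check from the second item can also be done programmatically. The sketch below is only an illustration: OrderCheckSketch and firstOutOfOrder are hypothetical names, and plain strings stand in for the real key type. It reports the first key that sorts before its predecessor under a given comparator. Whether a difference between the visual order of the text dump and the order defined by the reader's comparator plays any role here is purely an assumption.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Hadoop.
class OrderCheckSketch {

    // Returns the index of the first key that sorts before its predecessor
    // under cmp, or -1 if the whole list is in non-decreasing order.
    static <T> int firstOutOfOrder(List<T> keys, Comparator<? super T> cmp) {
        for (int i = 1; i < keys.size(); i++) {
            if (cmp.compare(keys.get(i - 1), keys.get(i)) > 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("9", "10", "11"); // made-up keys

        // Ordered when compared as numbers ...
        Comparator<String> numeric = Comparator.comparingInt(Integer::parseInt);
        System.out.println(firstOutOfOrder(keys, numeric));

        // ... but not when compared as strings, because "10" sorts before "9".
        Comparator<String> lexical = Comparator.naturalOrder();
        System.out.println(firstOutOfOrder(keys, lexical));
    }
}
```

Running the same kind of check over the keys from the index file, with the comparator the reader actually uses, would show whether the index is sorted from the reader's point of view.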
Any ideas?