Posted to notifications@accumulo.apache.org by "cshannon (via GitHub)" <gi...@apache.org> on 2023/04/14 20:05:28 UTC

[GitHub] [accumulo] cshannon commented on issue #1327: No-chop merges

cshannon commented on issue #1327:
URL: https://github.com/apache/accumulo/issues/1327#issuecomment-1509174746

I talked to @keith-turner about this quite a bit today, and we came up with an alternative to the strategy in my original two draft PRs (#3246 and #3286), where I was trying to handle multiple ranges per file with fencing while still storing a single file metadata entry per RFile.

After talking through everything, I am going to try the following in one or more new PRs to handle both the reading/fencing case and the storing of metadata ranges.

1. After going through the scenarios for how fencing off RFiles might be used with merges, splits, scans, etc., we think it might be better to treat each range as its own file. (This is basically a variation of option 1, which I detailed in my post [here](https://github.com/apache/accumulo/issues/1327#issuecomment-1427102074).) The idea is that if we can treat each range as its own file, the rest of the code wouldn't need as much modification because it would still just be dealing with file abstractions.
2. We would only need to create a fenced RFile iterator that handles a single range (we wouldn't need an iterator that handles multiple ranges anymore). It's still to be determined whether the fencing iterator can just implement SortedKeyValueIterator or also needs to implement FileSKVIterator. We may also need to fence the index.
3. For storing files and ranges in the metadata table (DataFileValue), we realized that it may be better to store one file metadata entry per range rather than trying to store multiple ranges in a single file entry. This should work better because, after thinking about how the metadata is used for splits, etc., we realized that the current DataFileValue fields of size, numEntries, and time really should be associated with each range and not with each file. To accomplish this, we think it could work to change the DataFile column qualifier (StoredTabletFile) to include an optional range in addition to the URI, making it unique per range. You'd end up with one or more entries stored per file (still just one entry if there is no range or the range covers the entire file).
4. The code (file operations, scans, etc.) that deals with StoredTabletFile would hopefully not need a lot of modification if we can encapsulate the fencing and range handling in the iterator and encapsulate the range in StoredTabletFile, so that ranges are just treated like normal files. In other words, if we do it right, the code that iterates/scans over 10 unique files would hopefully be identical to the code that scans over 10 "files" that are really just 10 unique ranges of 1 file, since the scanning code wouldn't know or care about the difference.
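
As a rough illustration of points 2 through 4, here is a small, self-contained Java sketch. It does not use the Accumulo API: `RowRange`, `FencedFileRef`, `FencedIterator`, and `scanRows` are hypothetical names, a `TreeMap` stands in for an RFile, and the qualifier encoding is made up purely for illustration. It also simply concatenates results per entry, whereas a real scan would merge the sources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.NoSuchElementException;
import java.util.TreeMap;

final class FencedFileSketch {

  /** Hypothetical row range (startExclusive, endInclusive]; null means unbounded. */
  static final class RowRange {
    final String startExclusive;
    final String endInclusive;

    RowRange(String startExclusive, String endInclusive) {
      this.startExclusive = startExclusive;
      this.endInclusive = endInclusive;
    }

    boolean isPastEnd(String row) {
      return endInclusive != null && row.compareTo(endInclusive) > 0;
    }
  }

  /** Hypothetical per-range file entry: a path plus an optional fence. */
  static final class FencedFileRef {
    final String path;
    final RowRange fence; // null means the entry covers the whole file

    FencedFileRef(String path, RowRange fence) {
      this.path = path;
      this.fence = fence;
    }

    /** Illustrative qualifier only; this is NOT Accumulo's real encoding. */
    String qualifier() {
      return fence == null ? path : path + ";" + fence.startExclusive + ";" + fence.endInclusive;
    }
  }

  /** Exposes only the entries of a sorted source that fall inside a single fence. */
  static final class FencedIterator implements Iterator<Map.Entry<String, String>> {
    private final Iterator<Map.Entry<String, String>> source;
    private final RowRange fence;
    private Map.Entry<String, String> next;

    FencedIterator(NavigableMap<String, String> file, RowRange fence) {
      this.fence = fence;
      // Seek past the lower bound instead of filtering from the first key.
      NavigableMap<String, String> view = (fence == null || fence.startExclusive == null)
          ? file : file.tailMap(fence.startExclusive, false);
      this.source = view.entrySet().iterator();
      advance();
    }

    private void advance() {
      next = null;
      if (source.hasNext()) {
        Map.Entry<String, String> candidate = source.next();
        // The source is sorted, so the first row past the upper bound ends the scan.
        if (fence == null || !fence.isPastEnd(candidate.getKey())) {
          next = candidate;
        }
      }
    }

    @Override
    public boolean hasNext() {
      return next != null;
    }

    @Override
    public Map.Entry<String, String> next() {
      if (next == null) {
        throw new NoSuchElementException();
      }
      Map.Entry<String, String> result = next;
      advance();
      return result;
    }
  }

  /** Scan code sees only "files"; it does not know whether an entry is fenced. */
  static List<String> scanRows(Map<String, NavigableMap<String, String>> files,
      List<FencedFileRef> refs) {
    List<String> rows = new ArrayList<>();
    for (FencedFileRef ref : refs) {
      Iterator<Map.Entry<String, String>> it = new FencedIterator(files.get(ref.path), ref.fence);
      while (it.hasNext()) {
        rows.add(it.next().getKey());
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    NavigableMap<String, String> file = new TreeMap<>();
    for (String row : new String[] {"a", "b", "c", "d", "e"}) {
      file.put(row, "value-" + row);
    }
    Map<String, NavigableMap<String, String>> files = new HashMap<>();
    files.put("F1.rf", file);
    // Two metadata entries that point at the same file with different fences.
    List<FencedFileRef> refs = Arrays.asList(
        new FencedFileRef("F1.rf", new RowRange(null, "b")),
        new FencedFileRef("F1.rf", new RowRange("c", null)));
    System.out.println(scanRows(files, refs)); // prints [a, b, d, e]
  }
}
```

The two entries share one path but have distinct qualifiers, which is the property that would let per-range size, numEntries, and time live on separate metadata entries.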

Anyways, I'm going to work on it and see how it goes. It may not work as well in practice, or it could run into some roadblocks, but if it works it could make things a lot cleaner. As I said, I'll do the work in one or more new PRs and keep the current ones open so we can compare the two approaches.