
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/11/16 20:17:41 UTC

[GitHub] [accumulo] milleruntime opened a new issue #2361: Utility to generate splits

milleruntime opened a new issue #2361:
URL: https://github.com/apache/accumulo/issues/2361


   It would be cool to have a utility that takes a set of RFiles and a desired number of split points and then generates the split points across those RFiles. This would help users who have the data and want to quickly generate the split points instead of having to analyze the files manually. I think this would be fairly easy to do using the RFile API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] milleruntime commented on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

milleruntime commented on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-975872579


   @keith-turner I got the index reader working pretty well, but I noticed it picks up the empty byte quite often. I copied the encoding that uses DefaultFormatter from the GetSplitsCommand so the splits print properly. But I am not sure we want to include the empty byte in user splits. For example, using test ingest and running the command, I get splits like this:
   <pre>
   row_0000000010\x00
   row_0000000054\x00
   row_0000000098\x00
   row_0000000142\x00
   row_0000000186\x00
   </pre>
   What do you think? Should I drop the empty byte?
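
If the empty byte is dropped, a helper along these lines could do it. This is only a sketch: `EmptyByteStripper` and `stripTrailingZero` are hypothetical names, and the method works on a plain byte array rather than on any Accumulo type.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EmptyByteStripper {
  // Returns a copy of row without a single trailing 0x00 byte, if one is present.
  // Hypothetical helper for illustration; not part of the Accumulo API.
  static byte[] stripTrailingZero(byte[] row) {
    if (row.length > 0 && row[row.length - 1] == 0) {
      return Arrays.copyOf(row, row.length - 1);
    }
    return row.clone();
  }

  public static void main(String[] args) {
    // Same shape as the splits listed above: a row followed by one empty byte.
    byte[] withZero = "row_0000000010\0".getBytes(StandardCharsets.UTF_8);
    byte[] stripped = stripTrailingZero(withZero);
    System.out.println(new String(stripped, StandardCharsets.UTF_8));
  }
}
```

Only one trailing byte is removed, so a row that legitimately ends in several zero bytes would keep all but the last; whether that is the right behavior is an open question in this thread.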



[GitHub] [accumulo] keith-turner commented on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

keith-turner commented on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-972149849


   > And getting that index through the Rfile reader here
   
   Yeah, that is the code I was thinking about.  I looked around and found the following code that the tserver uses to find a single split point by inspecting indexes.
   
   https://github.com/apache/accumulo/blob/f8bb900ae080fe0f54dfe04f9e1ad8c4dd2e7930/server/base/src/main/java/org/apache/accumulo/server/util/FileUtil.java#L289
   
   The code makes two passes.  First, it counts the number of index entries.  Second, it reads through them again using a merged view of the indexes and takes the count/2 entry.  Could possibly do something similar for N entries: do one pass over the index data to count the entries, and then another pass to take every count/N entry.  The code above falls back to scanning the data in the rfiles instead of the index, so we would probably need to do that sometimes for this use case also.
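
The two-pass idea can be sketched roughly as follows. This is not the FileUtil code. `TwoPassSplits`, `pickSplits`, and the `Supplier` of row strings are hypothetical stand-ins for a merged, sorted view of the index rows that can be read twice. The sketch spaces the picks at count/(N+1) so that N = 1 reduces to the count/2 midpoint described above; that spacing is an assumption, and "every count/N entry" would be an equally reasonable reading.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

final class TwoPassSplits {
  // indexRows: hypothetical stand-in for a merged, sorted view of index rows.
  static List<String> pickSplits(Supplier<Iterator<String>> indexRows, int numSplits) {
    // Pass 1: count the index entries.
    long count = 0;
    for (Iterator<String> it = indexRows.get(); it.hasNext(); it.next()) {
      count++;
    }

    List<String> splits = new ArrayList<>();
    if (numSplits <= 0 || count == 0) {
      return splits;
    }

    // Pass 2: take the entry at position k * count / (numSplits + 1), k = 1..numSplits.
    long seen = 0;
    int k = 1;
    String last = null;
    Iterator<String> it = indexRows.get();
    while (it.hasNext() && k <= numSplits) {
      String row = it.next();
      seen++;
      long target = Math.max(1, k * count / (numSplits + 1));
      if (seen >= target) {
        // Index entries may repeat a row; never emit the same split twice.
        if (!row.equals(last)) {
          splits.add(row);
          last = row;
        }
        k++;
      }
    }
    return splits;
  }

  public static void main(String[] args) {
    List<String> rows = List.of("r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9");
    System.out.println(pickSplits(rows::iterator, 2)); // prints [r3, r6]
  }
}
```

When there are fewer distinct index rows than requested splits, this returns fewer splits than asked for, which is the situation where falling back to the data would be needed.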



[GitHub] [accumulo] milleruntime commented on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

milleruntime commented on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-974464158


   @keith-turner I am having trouble figuring out how to merge multiple FileSKVIterator instances to get one iterable across multiple rfiles. I want to do what `RFileScanner` is doing but not have to worry about creating the iterator stack. I tried doing this:
   <pre>
   List<SortedKeyValueIterator<Key,Value>> readers = new ArrayList<>(files.size());
   FileSKVIterator reader = FileOperations.getInstance().newIndexReaderBuilder()...build();
   readers.add(reader);
   var iterator = new MultiIterator(readers, false);
   </pre>
   
   But I am getting an `UnsupportedOperationException` in `MultiIndexIterator` when calling `iterator.seek()`. I started to rewrite it to use `new RFile.Reader(cb)` like the `PrintInfo` command does, but wasn't sure if that would work either. Any ideas?
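
For reference, the general idea of a merged view over several sorted sources can be sketched in plain Java with a priority queue. This is not `MultiIterator` and does not use any Accumulo class. `MergedView` and its nested `Head` are hypothetical names, and each source is assumed to be an already sorted `Iterator<String>` of rows. The sketch never seeks; it only reads each source forward.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

final class MergedView implements Iterator<String> {
  // Pairs the next unread value of a source with the source it came from.
  private static final class Head {
    final String value;
    final Iterator<String> source;

    Head(String value, Iterator<String> source) {
      this.value = value;
      this.source = source;
    }
  }

  // Min-heap ordered by the head value of each source.
  private final PriorityQueue<Head> heap =
      new PriorityQueue<>((a, b) -> a.value.compareTo(b.value));

  MergedView(List<Iterator<String>> sources) {
    for (Iterator<String> source : sources) {
      if (source.hasNext()) {
        heap.add(new Head(source.next(), source));
      }
    }
  }

  @Override
  public boolean hasNext() {
    return !heap.isEmpty();
  }

  @Override
  public String next() {
    Head head = heap.poll();
    if (head == null) {
      throw new NoSuchElementException();
    }
    // Refill the heap from the source that just produced the smallest value.
    if (head.source.hasNext()) {
      heap.add(new Head(head.source.next(), head.source));
    }
    return head.value;
  }

  public static void main(String[] args) {
    List<Iterator<String>> sources = new ArrayList<>();
    sources.add(List.of("row_a", "row_d").iterator());
    sources.add(List.of("row_b", "row_c", "row_e").iterator());
    MergedView merged = new MergedView(sources);
    while (merged.hasNext()) {
      System.out.println(merged.next());
    }
  }
}
```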



[GitHub] [accumulo] keith-turner edited a comment on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

keith-turner edited a comment on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-972149849


   > And getting that index through the Rfile reader here
   
   Yeah that is the code I was thinking about.  Looked around and found the following code that the tserver uses to find a single split point by inspecting indexes.
   
   https://github.com/apache/accumulo/blob/f8bb900ae080fe0f54dfe04f9e1ad8c4dd2e7930/server/base/src/main/java/org/apache/accumulo/server/util/FileUtil.java#L289
   
   The code makes two passes.  First, it counts the number of index entries.  Second, it reads through them again using a merged view of the indexes and takes the count/2 entry.  Could possibly do something similar for N entries: do one pass over the index data to count the entries, and then another pass to take every count/N entry.  The code above falls back to scanning the data in the rfiles instead of the index in some cases, so we would probably need to do that sometimes for this use case also.



[GitHub] [accumulo] milleruntime commented on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

milleruntime commented on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-973228335


   > when a tablet is only using a sub range of an rfile and that subrange falls between index entries.
   
   OK, so if I am looking at the entire rfile and not just a range, then maybe I don't have to worry about this case. The use case I was thinking of was where the user just has a file (or set of files) and wants the splits across the whole file.
   
   Also, I thought I could fall back to calculating the splits just based on the size of the file (to prevent having to scan the file twice), but it doesn't look like that works.
   <pre>
   long fileSize = fs.getFileStatus(file).getLen();
   long splitSize = fileSize / numSplits;
   ...
   int size = key.getSize() + val.getSize();
   count += size;
   if (count > splitSize) {
       splits.add(stripOffEmptyByte(key.getRow()));
   }
   </pre>
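
One way the size-based loop might be made to behave is sketched below. It differs from the snippet above in two ways, both assumptions of this sketch: the running counter is reset after each split is emitted, and the total passed in is the sum of the key and value sizes rather than the on-disk file length, since the two can differ when the file is compressed. `SizeBasedSplits`, `pickSplits`, and the row/size pairs are hypothetical stand-ins, and the chunk size uses `numSplits + 1` so the last split does not land at the end of the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SizeBasedSplits {
  // entries: (row, key size + value size) pairs in sorted order (hypothetical input).
  // totalSize: sum of those sizes, assumed to be known up front.
  static List<String> pickSplits(List<Map.Entry<String, Integer>> entries,
      long totalSize, int numSplits) {
    List<String> splits = new ArrayList<>();
    if (numSplits <= 0 || totalSize <= 0) {
      return splits;
    }
    // N splits produce N + 1 ranges, so aim for totalSize / (N + 1) bytes per range.
    long splitSize = Math.max(1, totalSize / (numSplits + 1));
    long count = 0;
    String lastSplit = null;
    for (Map.Entry<String, Integer> entry : entries) {
      String row = entry.getKey();
      count += entry.getValue();
      if (count >= splitSize && splits.size() < numSplits && !row.equals(lastSplit)) {
        splits.add(row);
        lastSplit = row;
        count = 0; // start measuring the next range from zero
      }
    }
    return splits;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> entries = List.of(
        Map.entry("r1", 10), Map.entry("r2", 10), Map.entry("r3", 10),
        Map.entry("r4", 10), Map.entry("r5", 10), Map.entry("r6", 10));
    System.out.println(pickSplits(entries, 60, 2)); // prints [r2, r4]
  }
}
```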



[GitHub] [accumulo] milleruntime commented on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

milleruntime commented on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-973101955


   > The code above falls back to scanning the data in the rfiles instead of the index in some cases, so we would probably need to do that sometimes for this use case also.
   
   What is the case where there are no index entries?



[GitHub] [accumulo] keith-turner commented on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

keith-turner commented on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-973141536


   > What is the case where there are no index entries?
   
   I think the following was the case I remembered.  Maybe this can happen when a tablet is only using a sub range of an rfile and that subrange falls between index entries.  For this issue, if the number of index entries is less than the desired number of splits, then we may want to fall back to scanning the data.
   
   https://github.com/apache/accumulo/blob/f8bb900ae080fe0f54dfe04f9e1ad8c4dd2e7930/server/base/src/main/java/org/apache/accumulo/server/util/FileUtil.java#L327-L340



[GitHub] [accumulo] milleruntime commented on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

milleruntime commented on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-972118749


   > > I think this would be fairly easy to do using the RFile API.
   > 
   > May be best to do this with internal code that can access the rfile indexes directly. Thinking that the rows in the rfile indexes could be used to quickly generate split points.
   
   Are you talking about the Key object stored here: https://github.com/apache/accumulo/blob/6a74b4667e3bd33e34b5262c5dd8ea64167fb657/core/src/main/java/org/apache/accumulo/core/file/rfile/MultiLevelIndex.java#L50
   And getting that index through the Rfile reader here:
   https://github.com/apache/accumulo/blob/6cfb9180a0d3e5115922314ff2062e0706ef0795/core/src/main/java/org/apache/accumulo/core/file/rfile/RFile.java#L1449






[GitHub] [accumulo] keith-turner commented on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

keith-turner commented on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-970739501


   > I think this would be fairly easy to do using the RFile API.
   
   May be best to do this with internal code that can access the rfile indexes directly.  Thinking that the rows in the rfile indexes could be used to quickly generate split points.






[GitHub] [accumulo] EdColeman commented on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-973262661


   Another consideration may be to just accept the desired size (based on split threshold?) and then run through the file(s) and spit out a split that would be the row-index before the desired size was met / exceeded.
   
   Also, depending on compression, the file's HDFS size and the entity size may be reported differently. I assume you would track the uncompressed size, because I think that's what is maintained in the metadata and used in split calculations.
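
A threshold-driven variant of this idea might look like the sketch below. It takes a desired size instead of a number of splits and emits the row just before the running total would exceed that size. `ThresholdSplits`, `splitsForThreshold`, and the per-row sizes are hypothetical; the input is assumed to be one entry per distinct row, in sorted order, carrying that row's uncompressed size in bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ThresholdSplits {
  // rowSizes: (row, uncompressed bytes for that row), sorted, one entry per row.
  static List<String> splitsForThreshold(List<Map.Entry<String, Long>> rowSizes,
      long threshold) {
    List<String> splits = new ArrayList<>();
    long accumulated = 0;
    String previousRow = null;
    for (Map.Entry<String, Long> entry : rowSizes) {
      long size = entry.getValue();
      // If adding this row would exceed the threshold, split at the previous row.
      if (previousRow != null && accumulated + size > threshold) {
        splits.add(previousRow);
        accumulated = 0;
      }
      accumulated += size;
      previousRow = entry.getKey();
    }
    return splits;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Long>> rowSizes = List.of(
        Map.entry("a", 40L), Map.entry("b", 40L), Map.entry("c", 40L),
        Map.entry("d", 40L), Map.entry("e", 40L));
    System.out.println(splitsForThreshold(rowSizes, 100L)); // prints [b, d]
  }
}
```

A single row larger than the threshold still ends up in a range of its own, since a split can only fall on a row boundary.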



[GitHub] [accumulo] milleruntime commented on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

milleruntime commented on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-975655485


   It appears that you can't call seek() when using the newIndexReaderBuilder() but I guess you don't need to when just reading the indices.



[GitHub] [accumulo] milleruntime closed issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

milleruntime closed issue #2361:
URL: https://github.com/apache/accumulo/issues/2361


   






[GitHub] [accumulo] keith-turner commented on issue #2361: Utility to generate splits

Posted by GitBox <gi...@apache.org>.

keith-turner commented on issue #2361:
URL: https://github.com/apache/accumulo/issues/2361#issuecomment-973254755


   > OK so if I am looking at the entire rfile and not just a range, then maybe I don't have to worry about this case.
   
   May have to handle a case that is similar.  If a user requests 100 split points and the files only have 60 index entries, then this is not enough to satisfy the request.  Could scan the data instead of the indexes in this case to get the 100 split points.
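
The fallback decision could be sketched as below. `SplitSourceChooser`, `indexHasEnough`, and `chooseRows` are hypothetical names, and both sources are modeled as suppliers of sorted row iterators rather than real rfile readers. Counting stops as soon as enough index entries have been seen, so a large index is not read in full just to make the decision.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

final class SplitSourceChooser {
  // True if the index yields at least numSplits entries; stops counting early.
  static boolean indexHasEnough(Iterator<String> indexRows, int numSplits) {
    int seen = 0;
    while (seen < numSplits && indexRows.hasNext()) {
      indexRows.next();
      seen++;
    }
    return seen >= numSplits;
  }

  // Prefer the (cheap) index rows; fall back to the data rows when the index is too small.
  static Iterator<String> chooseRows(Supplier<Iterator<String>> indexRows,
      Supplier<Iterator<String>> dataRows, int numSplits) {
    return indexHasEnough(indexRows.get(), numSplits) ? indexRows.get() : dataRows.get();
  }

  public static void main(String[] args) {
    List<String> index = List.of("i1", "i2", "i3");
    List<String> data = List.of("d1", "d2", "d3", "d4", "d5");
    // Four splits requested but only three index entries: the data rows are used.
    Iterator<String> rows = chooseRows(index::iterator, data::iterator, 4);
    System.out.println(rows.next());
  }
}
```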

