You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/06/14 07:38:04 UTC

[GitHub] [accumulo] milleruntime opened a new pull request #2160: Attempt to fix problems with cacheId

milleruntime opened a new pull request #2160:
URL: https://github.com/apache/accumulo/pull/2160


   The code between the BCFile cache and the RFile API is a mess. This is an attempt to prevent cacheId from being null. I am hoping with all the tests passing we can catch all the spots where it isn't set.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner commented on a change in pull request #2160: Make CachableBlockFile builder require nonNull for cacheId

Posted by GitBox <gi...@apache.org>.
keith-turner commented on a change in pull request #2160:
URL: https://github.com/apache/accumulo/pull/2160#discussion_r675618707



##########
File path: core/src/main/java/org/apache/accumulo/core/file/blockfile/impl/CachableBlockFile.java
##########
@@ -94,7 +89,8 @@ public CachableBuilder fsPath(FileSystem fs, Path dataFile) {
       return this;
     }
 
-    public CachableBuilder input(InputStream is) {
+    public CachableBuilder input(InputStream is, String cacheId) {

Review comment:
       This is a nice change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #2160: Attempt to fix problems with cacheId

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #2160:
URL: https://github.com/apache/accumulo/pull/2160#discussion_r651826376



##########
File path: core/src/test/java/org/apache/accumulo/core/file/rfile/MultiLevelIndexTest.java
##########
@@ -18,6 +18,7 @@
  */
 package org.apache.accumulo.core.file.rfile;
 
+import static org.apache.accumulo.core.crypto.CryptoServiceFactory.newInstance;

Review comment:
       I like static imports for some things, but `newInstance` seems too generic and prone to error, since it looks like `Class.newInstance` and it isn't obvious it's using the CryptoServiceFactory.

##########
File path: core/src/main/java/org/apache/accumulo/core/client/rfile/RFileScanner.java
##########
@@ -347,7 +349,7 @@ public SamplerConfiguration getSamplerConfiguration() {
       for (int i = 0; i < sources.length; i++) {
         // TODO may have been a bug with multiple files and caching in older version...
         FSDataInputStream inputStream = (FSDataInputStream) sources[i].getInputStream();
-        CachableBuilder cb = new CachableBuilder().cacheId("source-" + i).input(inputStream)
+        CachableBuilder cb = new CachableBuilder().input(inputStream, "cache-" + rand + i)

Review comment:
       I like the idea of specifying the cache id as a required field alongside the input stream. However, I wonder if we need the cache key to be more strongly related to the input stream, like how Object.toString() shows unique names for objects.

##########
File path: core/src/main/java/org/apache/accumulo/core/client/rfile/RFileScanner.java
##########
@@ -339,6 +340,7 @@ public SamplerConfiguration getSamplerConfiguration() {
   @Override
   public Iterator<Entry<Key,Value>> iterator() {
     try {
+      int rand = new SplittableRandom().nextInt(1_000_000);

Review comment:
       (Could also have used ThreadLocalRandom instead)
   
   I think you're trying to use random numbers to get a unique value for the cache keys. However, since the cache only exists for the lifetime of the process, a 1-up `final static AtomicLong` would be better than random for guaranteeing uniqueness for the lifetime of the process. Could even make it local to the current class, as in:
   
   ```java
   private static final AtomicLong counter = new AtomicLong(0);
   ...
     .cacheId(this.getClass().getName() + counter.incrementAndGet())
   ...
   ```
   
   Also (and this shows my lack of understanding of the caching code), wouldn't we want these cache keys to be predictable so we can use them to look up cached blocks later? What value are they for efficient subsequent scans if they contain random elements (or even unpredictable 1-up elements) that prevent looking up the cached content?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2160: Attempt to fix problems with cacheId

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2160:
URL: https://github.com/apache/accumulo/pull/2160#discussion_r651875373



##########
File path: core/src/main/java/org/apache/accumulo/core/client/rfile/RFileScanner.java
##########
@@ -339,6 +340,7 @@ public SamplerConfiguration getSamplerConfiguration() {
   @Override
   public Iterator<Entry<Key,Value>> iterator() {
     try {
+      int rand = new SplittableRandom().nextInt(1_000_000);

Review comment:
       Reverted changing the cacheId 04920fb 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2160: Attempt to fix problems with cacheId

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2160:
URL: https://github.com/apache/accumulo/pull/2160#discussion_r651087146



##########
File path: core/src/main/java/org/apache/accumulo/core/client/rfile/RFileScanner.java
##########
@@ -347,7 +347,7 @@ public SamplerConfiguration getSamplerConfiguration() {
       for (int i = 0; i < sources.length; i++) {
         // TODO may have been a bug with multiple files and caching in older version...
         FSDataInputStream inputStream = (FSDataInputStream) sources[i].getInputStream();
-        CachableBuilder cb = new CachableBuilder().cacheId("source-" + i).input(inputStream)
+        CachableBuilder cb = new CachableBuilder().cacheId(opts.in.getPaths()[i]).input(inputStream)

Review comment:
       What about starting cacheId with "cache-" and then use Random int for the second part? With something like 1000 max. So we always set it to "cache-####". We could just create one random integer per scanner and then add "i".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2160: Attempt to fix problems with cacheId

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2160:
URL: https://github.com/apache/accumulo/pull/2160#discussion_r651756120



##########
File path: core/src/main/java/org/apache/accumulo/core/client/rfile/RFileScanner.java
##########
@@ -347,7 +347,7 @@ public SamplerConfiguration getSamplerConfiguration() {
       for (int i = 0; i < sources.length; i++) {
         // TODO may have been a bug with multiple files and caching in older version...
         FSDataInputStream inputStream = (FSDataInputStream) sources[i].getInputStream();
-        CachableBuilder cb = new CachableBuilder().cacheId("source-" + i).input(inputStream)
+        CachableBuilder cb = new CachableBuilder().cacheId(opts.in.getPaths()[i]).input(inputStream)

Review comment:
       Found `SplittableRandom` which is supposed to be better and faster than `Random`. Whether we need a random or not, it is nice that it doesn't get dinged by sec-bugs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on pull request #2160: Attempt to fix problems with cacheId

Posted by GitBox <gi...@apache.org>.
milleruntime commented on pull request #2160:
URL: https://github.com/apache/accumulo/pull/2160#issuecomment-861544072


   > I think I need to spend some time doing a deep dive looking into how this cacheId is used to know if the overall strategy here is correct.
   > 
   > I do like the check for nonnull and the API changes to the builder to require the cacheId when the InputStream is used, but have concerns about the value of using cacheIds that aren't predictable.
   
   OK how about I just simplify this change to add the nonnull check and include the cacheId (keeping the value the same). Then you can take it and explore further.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #2160: Attempt to fix problems with cacheId

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #2160:
URL: https://github.com/apache/accumulo/pull/2160#discussion_r651002414



##########
File path: core/src/main/java/org/apache/accumulo/core/client/rfile/RFileScanner.java
##########
@@ -347,7 +347,7 @@ public SamplerConfiguration getSamplerConfiguration() {
       for (int i = 0; i < sources.length; i++) {
         // TODO may have been a bug with multiple files and caching in older version...
         FSDataInputStream inputStream = (FSDataInputStream) sources[i].getInputStream();
-        CachableBuilder cb = new CachableBuilder().cacheId("source-" + i).input(inputStream)
+        CachableBuilder cb = new CachableBuilder().cacheId(opts.in.getPaths()[i]).input(inputStream)

Review comment:
       While this is more informative than `source-1,source-2,...`, I'm not sure this will work in all cases. The Opts are constructed using RFileScannerBuilder from either a set of RFileSource objects, or a set of String filenames. We only have the path information if the builder used the String filenames. But, the RFileSource version only has InputStreams.
   
   The `getSources()` method will translate any provided String filenames into RFileSources, but any that were passed in as RFileSource won't materialize path information in the reverse, in your new `getPaths()` method. Rather, that new method you added will only show paths if it was constructed with paths. It won't show anything that was constructed with RFileSources.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on pull request #2160: Make CachableBlockFile builder require nonNull for cacheId

Posted by GitBox <gi...@apache.org>.
milleruntime commented on pull request #2160:
URL: https://github.com/apache/accumulo/pull/2160#issuecomment-885718535


   There is definitely more room for improvement with this code but I am going to merge this change since it should at least throw an exception when the cacheId is null.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2160: Make CachableBlockFile builder require nonNull for cacheId

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2160:
URL: https://github.com/apache/accumulo/pull/2160#discussion_r651891798



##########
File path: core/src/main/java/org/apache/accumulo/core/client/rfile/RFileScanner.java
##########
@@ -347,7 +349,7 @@ public SamplerConfiguration getSamplerConfiguration() {
       for (int i = 0; i < sources.length; i++) {
         // TODO may have been a bug with multiple files and caching in older version...
         FSDataInputStream inputStream = (FSDataInputStream) sources[i].getInputStream();
-        CachableBuilder cb = new CachableBuilder().cacheId("source-" + i).input(inputStream)
+        CachableBuilder cb = new CachableBuilder().input(inputStream, "cache-" + rand + i)

Review comment:
       Reverted changing the cacheId 04920fb and just specified what was there in the builder method.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime merged pull request #2160: Make CachableBlockFile builder require nonNull for cacheId

Posted by GitBox <gi...@apache.org>.
milleruntime merged pull request #2160:
URL: https://github.com/apache/accumulo/pull/2160


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org