You are viewing a plain text version of this content. The canonical link for it is here.

Posted to notifications@accumulo.apache.org by "ddanielr (via GitHub)" <gi...@apache.org> on 2023/02/03 18:03:24 UTC

[GitHub] [accumulo] ddanielr opened a new pull request, #3187: WIP: Make scan initial wait configurable

ddanielr opened a new pull request, #3187:
URL: https://github.com/apache/accumulo/pull/3187

   Add new property `scan.inital.wait.enabled` to allow clients the option to wait for writes on inital scans.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] dlmarion commented on a diff in pull request #3187: WIP: Make scan initial wait configurable

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.

dlmarion commented on code in PR #3187:
URL: https://github.com/apache/accumulo/pull/3187#discussion_r1097510205


##########
core/src/main/java/org/apache/accumulo/core/conf/Property.java:
##########
@@ -931,9 +934,7 @@ public enum Property {
       "The sampling percentage to use for replication traces"),
   REPLICATION_RPC_TIMEOUT("replication.rpc.timeout", "2m", PropertyType.TIMEDURATION,
       "Amount of time for a single replication RPC call to last before failing"
-          + " the attempt. See replication.work.attempts."),
-
-  ;
+          + " the attempt. See replication.work.attempts."),;

Review Comment:
   Should be able to remove trailing comma before the semi-colon.



##########
core/src/main/java/org/apache/accumulo/core/client/impl/ThriftScanner.java:
##########
@@ -116,7 +116,10 @@ public static boolean getBatchFromServer(ClientContext context, Range range, Key
             Constants.SCANNER_DEFAULT_READAHEAD_THRESHOLD, null, batchTimeOut, classLoaderContext);
 
         TabletType ttype = TabletType.type(extent);
-        boolean waitForWrites = !serversWaitedForWrites.get(ttype).contains(server);
+        boolean waitForWrites = false;
+        if (context.getConfiguration().getBoolean(Property.TSERV_SCAN_INITIAL_WAIT_ENABLED)) {

Review Comment:
   It should only be configurable where TabletType == TabletType.USER.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] dlmarion commented on pull request #3187: WIP: Make scan initial wait configurable

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.

dlmarion commented on PR #3187:
URL: https://github.com/apache/accumulo/pull/3187#issuecomment-1460893844

   I closed #3182 as won't fix. We can re-open in the future if we need to.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] ctubbsii commented on pull request #3187: WIP: Make scan initial wait configurable

Posted by "ctubbsii (via GitHub)" <gi...@apache.org>.

ctubbsii commented on PR #3187:
URL: https://github.com/apache/accumulo/pull/3187#issuecomment-1420032733

   > In the client code, the first time the client performs a scan on a server, the client waits for all writes to complete before the scan can begin. It can be argued, that for user tables, the application may not be concerned with waiting for the writes to complete given that the waits are there for a specific use case.
   
   My understanding is that the specific use case is consistency from the same client, to ensure that a previous write is visible on subsequent scans. The comment says this is especially important for metadata, but I think it's just as important for consistency in general. While this may not be strictly required in all use cases, I don't know of a compelling reason to relax the behavior. I don't think effort should be put into making client views of the data less consistent... not without a substantially compelling reason, and I haven't seen any reason even proposed.
   
   In any case, I definitely don't think we should do anything in 1.10 at all for this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] keith-turner commented on pull request #3187: WIP: Make scan initial wait configurable

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.

keith-turner commented on PR #3187:
URL: https://github.com/apache/accumulo/pull/3187#issuecomment-1420122157

   > My understanding is that the specific use case is consistency from the same client, to ensure that a previous write is visible on subsequent scans. 
   
   I don't believe this was for same client.  The wait for write behavior is for the following case.
   
   1. client does a write and then the client process dies
   2. client process starts again and does a read, expecting to see anything it wrote before.  However if the write is still in flight on the tserver side it may not see it.
   
   Within the same process if a client always flushes the batch writer before reading it will see the changes w/o this check.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] keith-turner commented on pull request #3187: WIP: Make scan initial wait configurable

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.

keith-turner commented on PR #3187:
URL: https://github.com/apache/accumulo/pull/3187#issuecomment-1420129527

   This comment is just for discussion, I don't have a particular action in mind.  I am wondering if this should be scoped differently, like should it be scoped at the table level (with a per table prop) or at the scanner level with a method on the scanner to disable this.  With the way the config is so widely scoped seems like it could be set because someone wants to change the behavior on one table, but it ends up unintentionally changing the behavior for other tables.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] dlmarion closed pull request #3187: WIP: Make scan initial wait configurable

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.

dlmarion closed pull request #3187: WIP: Make scan initial wait configurable
URL: https://github.com/apache/accumulo/pull/3187


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] dlmarion commented on pull request #3187: WIP: Make scan initial wait configurable

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.

dlmarion commented on PR #3187:
URL: https://github.com/apache/accumulo/pull/3187#issuecomment-1419662983

   In the client code, the first time the client performs a scan on a server, the client waits for all writes to complete before the scan can begin. It can be argued, that for user tables, the application may not be concerned with waiting for the writes to complete given that the waits are there for a specific use case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] ddanielr commented on pull request #3187: WIP: Make scan initial wait configurable

Posted by "ddanielr (via GitHub)" <gi...@apache.org>.

ddanielr commented on PR #3187:
URL: https://github.com/apache/accumulo/pull/3187#issuecomment-1420075671

   > > In the client code, the first time the client performs a scan on a server, the client waits for all writes to complete before the scan can begin. It can be argued, that for user tables, the application may not be concerned with waiting for the writes to complete given that the waits are there for a specific use case.
   > 
   > My understanding is that the specific use case is consistency from the same client, to ensure that a previous write is visible on subsequent scans. The comment says this is especially important for metadata, but I think it's just as important for consistency in general. While this may not be strictly required in all use cases (some users may not care about this consistency), I don't know of a compelling reason to relax the behavior. I don't think effort should be put into making client views of the data less consistent... not without a substantially compelling reason, and I haven't seen any reason even proposed, least of all a substantially compelling one.
   > 
   > In any case, I definitely don't think we should do anything in 1.10 at all for this.
   
   I had asked for branch clarification in #3182. The issue wasn't assigned to a particular branch so I started at 1.10 with the expectation that I could re-target as desired. Happy to move this work to 2.1 instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] ctubbsii commented on pull request #3187: WIP: Make scan initial wait configurable

Posted by "ctubbsii (via GitHub)" <gi...@apache.org>.

ctubbsii commented on PR #3187:
URL: https://github.com/apache/accumulo/pull/3187#issuecomment-1422338787

   @dlmarion wrote:
   > This was something that a user noticed their client waiting on, and not performing scans, when the user thought scans should have been running.
   
   Aside from their surprise at the observation, what kind of impact did this cause? Do we know how long the user's scans were delayed? Was it reasonable? Was it a problem? I'm asking because I don't want to add a bunch of complexity and configuration knobs if it's not really a problem, and it's not clear from the discussion how much, if at all, it is a problem.
   
   The current behavior seems "safe", and I'm also reluctant to add knobs to allow the user to shoot themselves in the foot and perform operations that aren't really safe. Also, if the user doesn't care about consistency, can't they query using a ScanServer instead? If so, it's probably fine if this is what users get when the access a read/write server (tserver).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] ddanielr commented on pull request #3187: WIP: Make scan initial wait configurable

Posted by "ddanielr (via GitHub)" <gi...@apache.org>.

ddanielr commented on PR #3187:
URL: https://github.com/apache/accumulo/pull/3187#issuecomment-1416226257

   I'm working on the IT test now but all the current IT's use the `getConnector()` call from the harness.
   
   Since this is a client config property, I need to create a new client with this property set to `false`.
   
   For the IT, would comparing scan duration between the two clients be a valid test of this property?
   Or, is there a better way to test this functionality? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] EdColeman commented on a diff in pull request #3187: WIP: Make scan initial wait configurable

Posted by "EdColeman (via GitHub)" <gi...@apache.org>.

EdColeman commented on code in PR #3187:
URL: https://github.com/apache/accumulo/pull/3187#discussion_r1096167446


##########
core/src/main/java/org/apache/accumulo/core/conf/Property.java:
##########
@@ -419,6 +419,9 @@ public enum Property {
       PropertyType.COUNT, "Max number of files a major compaction thread can open at once. "),
   TSERV_SCAN_MAX_OPENFILES("tserver.scan.files.open.max", "100", PropertyType.COUNT,
       "Maximum total files that all tablets in a tablet server can open for scans. "),
+  @Experimental
+  TSERV_SCAN_INITIAL_WAIT_ENABLED("tserver.scan.initial.wait.enabled", "true", PropertyType.BOOLEAN,
+      "Determines if the client waits for writes before scanning a table for the first time"),

Review Comment:
   Might want to include why.  Something along the lines of:
   
   `Determines if the client waits for writes in progress before scanning a table for the first time to ensure the client sees all data previously written.`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] dlmarion commented on pull request #3187: WIP: Make scan initial wait configurable

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.

dlmarion commented on PR #3187:
URL: https://github.com/apache/accumulo/pull/3187#issuecomment-1420627043

   > I haven't seen any reason even proposed, least of all a substantially compelling one.
   
   This was something that a user noticed their client waiting on, and not performing scans, when the user thought scans should have been running.
   
   > My understanding is that the specific use case is consistency from the same client, to ensure that a previous write is visible on subsequent scans.
   > I don't believe this was for same client. The wait for write behavior is for the following case.
   > client does a write and then the client process dies
   > client process starts again and does a read, expecting to see anything it wrote before. However if the write is still in flight on the tserver side it may not see it.
   
   My understanding of the use-case for waiting for writes is you have an application running on a host, the application 
    dies and is restarted, so you want the application to wait for writes to finish before reading. The current implementation makes several assumptions:
   
     1. All user applications both read and write data.
     2. All newly started applications are able to wait for some unknown amount of time for writes to complete before their scans can start.
     3. That in an environment where you could be constantly loading data, the application performing scans has an understanding of what consistent means.
   
   The thought here was to be able to bypass the wait condition in the client for user tables. I have no issue with the implementation being done at the table or scanner level. I think doing it as either a client property or scanner setting would be easier, as it would not involve any server side code to deal with the table property. Doing it at the client seemed sufficient to meet the needs of an application running in an environment where 1, 2, and 3 were not the case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org