You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/04 01:13:29 UTC

[GitHub] [iceberg] rdblue opened a new pull request #1165: Make CharSequenceSet thread safe

rdblue opened a new pull request #1165:
URL: https://github.com/apache/iceberg/pull/1165


   `MergingSnapshotProducer` was recently refactored to separate out manifest filtering and merging so it could be reused for delete files. That refactor also updated the filter to use a `CharSequenceSet` instead of a `HashSet` and `CharSequenceWrapper`. The `CharSequenceSet` embeds the wrapper and is easier to use, but this introduced a bug where multiple threads using the same `CharSequenceSet` would use the same seq wrapper in `contains`. This was causing deletes of specific files to miss data files if the wrapper was reused while testing a file that should be deleted was in the delete set.
   
   The solution is to make `CharSequenceSet` thread safe by using a thread-local wrapper.
   
   This only affects operations that delete specific files, which are mostly in tests. Spark deletes using an expression and using partition tuples. User-facing operations that are affected are `RewriteFiles` and `SnapshotManager` (when cherry-picking a dynamic overwrite commit). `OverwriteFiles` also exposes the method, but it is only used in tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1165: Make CharSequenceSet thread safe

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1165:
URL: https://github.com/apache/iceberg/pull/1165#issuecomment-654347550


   @jun-he, I've made the changes you suggested. Thanks for pointing those out!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JingsongLi commented on a change in pull request #1165: Make CharSequenceSet thread safe

Posted by GitBox <gi...@apache.org>.
JingsongLi commented on a change in pull request #1165:
URL: https://github.com/apache/iceberg/pull/1165#discussion_r449824804



##########
File path: api/src/main/java/org/apache/iceberg/util/CharSequenceSet.java
##########
@@ -54,10 +54,21 @@ public boolean isEmpty() {
     return wrapperSet.isEmpty();
   }
 
+  public CharSequenceWrapper wrapper() {

Review comment:
       Maybe you can use guava `Suppliers.memoize` to help lazy-init.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1165: Make CharSequenceSet thread safe

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1165:
URL: https://github.com/apache/iceberg/pull/1165#discussion_r450322844



##########
File path: api/src/main/java/org/apache/iceberg/util/CharSequenceSet.java
##########
@@ -54,10 +54,21 @@ public boolean isEmpty() {
     return wrapperSet.isEmpty();
   }
 
+  public CharSequenceWrapper wrapper() {
+    if (containsWrapper == null) {
+      synchronized (this) {
+        if (containsWrapper == null) {
+          this.containsWrapper = ThreadLocal.withInitial(() -> CharSequenceWrapper.wrap(null));
+        }
+      }
+    }
+    return containsWrapper.get();
+  }
+
   @Override
   public boolean contains(Object obj) {
     if (obj instanceof CharSequence) {
-      return wrapperSet.contains(containsWrapper.set((CharSequence) obj));
+      return wrapperSet.contains(wrapper().set((CharSequence) obj));

Review comment:
       Yeah, this is probably a good idea.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1165: Make CharSequenceSet thread safe

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1165:
URL: https://github.com/apache/iceberg/pull/1165


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1165: Make CharSequenceSet thread safe

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1165:
URL: https://github.com/apache/iceberg/pull/1165#discussion_r450349373



##########
File path: api/src/main/java/org/apache/iceberg/util/CharSequenceSet.java
##########
@@ -54,10 +54,21 @@ public boolean isEmpty() {
     return wrapperSet.isEmpty();
   }
 
+  public CharSequenceWrapper wrapper() {
+    if (containsWrapper == null) {
+      synchronized (this) {
+        if (containsWrapper == null) {
+          this.containsWrapper = ThreadLocal.withInitial(() -> CharSequenceWrapper.wrap(null));

Review comment:
       Yes, this is a simpler solution. Good point!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JingsongLi commented on a change in pull request #1165: Make CharSequenceSet thread safe

Posted by GitBox <gi...@apache.org>.
JingsongLi commented on a change in pull request #1165:
URL: https://github.com/apache/iceberg/pull/1165#discussion_r449952579



##########
File path: api/src/main/java/org/apache/iceberg/util/CharSequenceSet.java
##########
@@ -54,10 +54,21 @@ public boolean isEmpty() {
     return wrapperSet.isEmpty();
   }
 
+  public CharSequenceWrapper wrapper() {

Review comment:
       Sounds good to me




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1165: Make CharSequenceSet thread safe

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1165:
URL: https://github.com/apache/iceberg/pull/1165#issuecomment-654383112


   Merging this now that tests are passing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1165: Make CharSequenceSet thread safe

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1165:
URL: https://github.com/apache/iceberg/pull/1165#discussion_r449916239



##########
File path: api/src/main/java/org/apache/iceberg/util/CharSequenceSet.java
##########
@@ -54,10 +54,21 @@ public boolean isEmpty() {
     return wrapperSet.isEmpty();
   }
 
+  public CharSequenceWrapper wrapper() {

Review comment:
       Thanks for pointing this out. I didn't know about `Suppliers.memoize` and it looks useful.
   
   I'm not sure that we want to use it here. We've had quite a few problems with Guava and have finally decided to shade just the classes we use into an Iceberg module. I think it is probably better to avoid adding to the set of Guava classes that we use. Since this is already done and it doesn't look like there is much of a benefit to changing over to Guava, I'm inclined to leave it as it is.
   
   If others feel strongly about this, I can use it, but I think we should avoid expanding our use of Guava.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdsr commented on a change in pull request #1165: Make CharSequenceSet thread safe

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #1165:
URL: https://github.com/apache/iceberg/pull/1165#discussion_r449939544



##########
File path: api/src/main/java/org/apache/iceberg/util/CharSequenceSet.java
##########
@@ -54,10 +54,21 @@ public boolean isEmpty() {
     return wrapperSet.isEmpty();
   }
 
+  public CharSequenceWrapper wrapper() {

Review comment:
       Sounds good to me




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #1165: Make CharSequenceSet thread safe

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #1165:
URL: https://github.com/apache/iceberg/pull/1165#discussion_r450014758



##########
File path: api/src/main/java/org/apache/iceberg/util/CharSequenceSet.java
##########
@@ -54,10 +54,21 @@ public boolean isEmpty() {
     return wrapperSet.isEmpty();
   }
 
+  public CharSequenceWrapper wrapper() {
+    if (containsWrapper == null) {
+      synchronized (this) {
+        if (containsWrapper == null) {
+          this.containsWrapper = ThreadLocal.withInitial(() -> CharSequenceWrapper.wrap(null));
+        }
+      }
+    }
+    return containsWrapper.get();
+  }
+
   @Override
   public boolean contains(Object obj) {
     if (obj instanceof CharSequence) {
-      return wrapperSet.contains(containsWrapper.set((CharSequence) obj));
+      return wrapperSet.contains(wrapper().set((CharSequence) obj));

Review comment:
       should it be set back to `null` after the operation?

##########
File path: api/src/main/java/org/apache/iceberg/util/CharSequenceSet.java
##########
@@ -54,10 +54,21 @@ public boolean isEmpty() {
     return wrapperSet.isEmpty();
   }
 
+  public CharSequenceWrapper wrapper() {
+    if (containsWrapper == null) {
+      synchronized (this) {
+        if (containsWrapper == null) {
+          this.containsWrapper = ThreadLocal.withInitial(() -> CharSequenceWrapper.wrap(null));

Review comment:
       Wondering if making sense to have a static ThreadLocal object shared by all `CharSequenceSet`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org