You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/10/12 00:37:14 UTC

[GitHub] [druid] gianm commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

gianm commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r992875793


##########
core/src/main/java/org/apache/druid/data/input/impl/CloudObjectInputSource.java:
##########
@@ -108,9 +112,40 @@ public List<CloudObjectLocation> getObjects()
   @Nullable
   @JsonProperty
   @JsonInclude(JsonInclude.Include.NON_NULL)
-  public String getFilter()
+  public String getObjectGlob()
   {
-    return filter;
+    return objectGlob;
+  }
+
+  /**
+   * Strips out blob store's protocol and bucket from {@link #getObjectGlob}.
+   * This is to reduce user errors when crafting a objectGlob.

Review Comment:
   I think we shouldn't remove the bucket without verifying that it actually matches. This leads to surprising outcomes like the glob `s3://foo/*.txt` matching the file `s3://bar/x.txt`. I think we have a couple of options:
   
   - Don't remove the protocol and bucket: if the user provides the glob `s3://foo/*.txt` then it matches nothing.
   - Be "smart": apply the glob to the entire URI if it starts with a protocol; else apply it only to the object.
   
   IMO, the first option is best. My experience with "smart" features is they end up being more confusing than they're worth.



##########
core/src/test/java/org/apache/druid/data/input/impl/CloudObjectInputSourceTest.java:
##########
@@ -200,7 +205,72 @@ public void testWithObjects()
 
     List<CloudObjectLocation> returnedLocations = splits.map(InputSplit::get).collect(Collectors.toList()).get(0);
 
-    Assert.assertEquals(null, inputSource.getFilter());
+    Assert.assertEquals(null, inputSource.getObjectGlob());
     Assert.assertEquals(OBJECTS, returnedLocations);
   }
+
+  @Test
+  public void testGlobSubdirectories()
+  {
+    PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:**.parquet");
+    Assert.assertTrue(m.matches(Paths.get("db/date=2022-08-01/001.parquet")));
+    Assert.assertTrue(m.matches(Paths.get("db/date=2022-08-01/002.parquet")));
+
+    PathMatcher m2 = FileSystems.getDefault().getPathMatcher("glob:db/date=2022-08-01/*.parquet");
+    Assert.assertTrue(m2.matches(Paths.get("db/date=2022-08-01/001.parquet")));
+    Assert.assertFalse(m2.matches(Paths.get("db/date=2022-08-01/_junk/0/001.parquet")));
+  }
+
+  @Test
+  public void testGlobSubdirectories2()

Review Comment:
   This test cast isn't needed: we don't need to test this library that we aren't using.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org