You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/04 15:40:32 UTC

[GitHub] [druid] didip opened a new pull request, #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

didip opened a new pull request, #13027:
URL: https://github.com/apache/druid/pull/13027

   <!-- Thanks for trying to help us make Apache Druid be the best it can be! Please fill out as much of the following information as is possible (where relevant, and remove it when irrelevant) to help make the intention and scope of this PR clear in order to ease review. -->
   
   <!-- Please read the doc for contribution (https://github.com/apache/druid/blob/master/CONTRIBUTING.md) before making this PR. Also, once you open a PR, please _avoid using force pushes and rebasing_ since these make it difficult for reviewers to see what you've changed in response to their reviews. See [the 'If your pull request shows conflicts with master' section](https://github.com/apache/druid/blob/master/CONTRIBUTING.md#if-your-pull-request-shows-conflicts-with-master) for more details. -->
   
   Fixes https://github.com/apache/druid/pull/12659.
   
   <!-- Replace XXXX with the id of the issue fixed in this PR. Remove this section if there is no corresponding issue. Don't reference the issue in the title of this pull-request. -->
   
   <!-- If you are a committer, follow the PR action item checklist for committers:
   https://github.com/apache/druid/blob/master/dev/committer-instructions.md#pr-and-issue-action-item-checklist-for-committers. -->
   
   ### Description
   
   <!-- Describe the goal of this PR, what problem are you fixing. If there is a corresponding issue (referenced above), it's not necessary to repeat the description here, however, you may choose to keep one summary sentence. -->
   
   Removed:
   ```
   import org.apache.commons.io.FilenameUtils;
   ```
   Add:
   ```
   import java.nio.file.FileSystems;
   import java.nio.file.PathMatcher;
   import java.nio.file.Paths;
   ```
   <!-- Describe your patch: what did you change in code? How did you fix the problem? -->
   
   `org.apache.commons.io.FilenameUtils` failed to stop at the correct folder structure.
   
   ```java
   Assert.assertTrue(FilenameUtils.wildcardMatch("db/date=2022-08-01/001.parquet", "db/date=2022-08-01/*.parquet"));
   
   // This proved that FilenameUtils did the wrong thing. It should have been false.
   Assert.assertTrue(FilenameUtils.wildcardMatch("db/date=2022-08-01/_junk/0/001.parquet", "db/date=2022-08-01/*.parquet"));
   ```
   
   `java.nio.file.FileSystems` does the right thing.
   
   ```java
   PathMatcher m2 = FileSystems.getDefault().getPathMatcher("glob:db/date=2022-08-01/*.parquet");
   Assert.assertTrue(m2.matches(Paths.get("db/date=2022-08-01/001.parquet")));
   Assert.assertFalse(m2.matches(Paths.get("db/date=2022-08-01/_junk/0/001.parquet")));
   ```
   
   <!-- If there are several relatively logically separate changes in this PR, create a mini-section for each of them. For example: -->
   
   <!--
   In each section, please describe design decisions made, including:
    - Choice of algorithms
    - Behavioral aspects. What configuration values are acceptable? How are corner cases and error conditions handled, such as when there are insufficient resources?
    - Class organization and design (how the logic is split between classes, inheritance, composition, design patterns)
    - Method organization and design (how the logic is split between methods, parameters and return types)
    - Naming (class, method, API, configuration, HTTP endpoint, names of emitted metrics)
   -->
   
   
   <!-- It's good to describe an alternative design (or mention an alternative name) for every design (or naming) decision point and compare the alternatives with the designs that you've implemented (or the names you've chosen) to highlight the advantages of the chosen designs and names. -->
   
   <!-- If there was a discussion of the design of the feature implemented in this PR elsewhere (e. g. a "Proposal" issue, any other issue, or a thread in the development mailing list), link to that discussion from this PR description and explain what have changed in your final design compared to your original proposal or the consensus version in the end of the discussion. If something hasn't changed since the original discussion, you can omit a detailed discussion of those aspects of the design here, perhaps apart from brief mentioning for the sake of readability of this PR description. -->
   
   <!-- Some of the aspects mentioned above may be omitted for simple and small changes. -->
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `S3InputSource`
    * `OssInputSource`
    * `GoogleCloudStorageInputSource`
    * `AzureInputSource`
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   - [x] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1289907979

   @gianm @abhishekagarwal87 ok, the rest of the failures are due to unrelated Python errors. What do you think now?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1244769844

   @gianm Sure thing.
   
   All of our customers (data scientists & data engineers) use Spark to massage data for Druid to consume. They typically do a number of transformations and at the end the push raw parquet files to a specific S3 bucket+path.
   
   Now, these people have varying proficiency in Spark. Some are extremely good, some are barely passable. Spark does a lot of things behind the scene and they usually involved in creating `_temporary` folders, e.g. during merge or shuffling. These are the junk folders I talked about. 
   
   It's incredibly tedious to chase these down and remind my customers to clean up. Some are capable to clean up, some completely ignored my requests (but later complained when their count data didn't match because the parquet files inside `_temporary` folder is accidentally ingested).
   
   As for the confusion with `s3://mybucket/path/to/parquet` confusion, I can provide a convenience method to strip them out, that should help reducing confusion.
   
   As for filename filter, I am not sure if it's that useful if you have the full path glob feature. It might introduce another confusion by having 2 different filter JSON attributes. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1304424989

   This pull request **introduces 2 alerts** when merging c91807e52c19efd382d1822839086abac43cddcd into a17ffdfc5d82c8cf60dfb91015f69142e427d76f - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-e2748f5191d53b089bff6e3a35733e60dd2a1960)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1299270329

   This pull request **introduces 2 alerts** when merging f389e38a107ca575c87e19ebaf3fe8081f636c08 into 176934e849d50a2360f50db7771c1775110308f3 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-0d022448403c79db006ea023e99eae69ee82b5b2)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r985988603


##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -773,7 +773,7 @@ The following is an example of a Combining input source spec:
         "delegates" : [
          {
           "type": "local",
-          "filter" : "*.csv",
+          "filter" : "**.csv",

Review Comment:
   This is not really true though. The local input source filter continues to work the old way, right? 
   
   I think that's also where the contention lies. We can't change the filter for the local input source as that has been in use for a while. We do want to have similar `filter` behavior for other input sources too. I think the only way out is to have a different `globFilter` in addition to `filter`. The latter applies just to the filer names. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r985988603


##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -773,7 +773,7 @@ The following is an example of a Combining input source spec:
         "delegates" : [
          {
           "type": "local",
-          "filter" : "*.csv",
+          "filter" : "**.csv",

Review Comment:
   This is not really correct change though. The local input source filter continues to work the old way, right? 
   
   I think that's also where the contention lies. We can't change the filter for the local input source as that has been in use for a while. We do want to have similar `filter` behavior for other input sources too. I think the only way out is to have a different `globFilter` in addition to `filter`. The latter applies just to the filer names. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r996613354


##########
core/src/main/java/org/apache/druid/data/input/impl/CloudObjectInputSource.java:
##########
@@ -108,9 +112,40 @@ public List<CloudObjectLocation> getObjects()
   @Nullable
   @JsonProperty
   @JsonInclude(JsonInclude.Include.NON_NULL)
-  public String getFilter()
+  public String getObjectGlob()
   {
-    return filter;
+    return objectGlob;
+  }
+
+  /**
+   * Strips out blob store's protocol and bucket from {@link #getObjectGlob}.
+   * This is to reduce user errors when crafting a objectGlob.

Review Comment:
   I agree with you, let's not be too clever.



##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -63,7 +63,7 @@ Sample specs:
       "type": "index_parallel",

Review Comment:
   Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
gianm commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1244734899

   > @gianm I would say it is quite common when people are massaging data using Spark into Iceberg.
   
   @didip would you mind giving an example of how people would use the filter glob to read Iceberg data? I'm not familiar with how Iceberg stores data, so this would help me understand how the feature is likely to be used.
   
   I continue to be concerned about the confusingness of whole-path globs, so, I do think if we ship the feature then we should be really clear about how it works. Docs should explain what string is used as the path for the match. For example, if your prefix is `s3://mybucket/myprefix`, and there is an object `s3://mybucket/myprefix/foo/bar.txt` then is the path `/foo/bar.txt` (the part after the prefix) or is it `foo/bar.txt` (the part after the prefix, with leading `/` stripped), or is it `myprefix/foo/bar.txt` (the entire S3 object key)?
   
   We could also offer both `filter` (name glob) and `pathFilter` (whole-path glob) options. That way, consistency with `local` input source is preserved (where its `filter` applies to filenames only), and also, users with simple use cases that don't involve Iceberg integration can have a more-intuitive name-based matching.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r993242690


##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -63,7 +63,7 @@ Sample specs:
       "type": "index_parallel",

Review Comment:
   the documentation needs an update as well. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r1009641832


##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -773,7 +773,7 @@ The following is an example of a Combining input source spec:
         "delegates" : [
          {
           "type": "local",
-          "filter" : "*.csv",
+          "objectGlob" : "**.csv",

Review Comment:
   Ah, I must have searched and replaced too agressively.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r996613223


##########
core/src/test/java/org/apache/druid/data/input/impl/CloudObjectInputSourceTest.java:
##########
@@ -200,7 +205,72 @@ public void testWithObjects()
 
     List<CloudObjectLocation> returnedLocations = splits.map(InputSplit::get).collect(Collectors.toList()).get(0);
 
-    Assert.assertEquals(null, inputSource.getFilter());
+    Assert.assertEquals(null, inputSource.getObjectGlob());
     Assert.assertEquals(OBJECTS, returnedLocations);
   }
+
+  @Test
+  public void testGlobSubdirectories()
+  {
+    PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:**.parquet");
+    Assert.assertTrue(m.matches(Paths.get("db/date=2022-08-01/001.parquet")));
+    Assert.assertTrue(m.matches(Paths.get("db/date=2022-08-01/002.parquet")));
+
+    PathMatcher m2 = FileSystems.getDefault().getPathMatcher("glob:db/date=2022-08-01/*.parquet");
+    Assert.assertTrue(m2.matches(Paths.get("db/date=2022-08-01/001.parquet")));
+    Assert.assertFalse(m2.matches(Paths.get("db/date=2022-08-01/_junk/0/001.parquet")));
+  }
+
+  @Test
+  public void testGlobSubdirectories2()

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r1009641477


##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -307,7 +307,7 @@ Sample specs:
       "type": "index_parallel",
       "inputSource": {
         "type": "azure",
-        "filter": "*.json",
+        "objectGlob": "*.json",

Review Comment:
   Thank you!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
gianm commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r1013105918


##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -101,7 +101,7 @@ Sample specs:
       "type": "index_parallel",
       "inputSource": {
         "type": "s3",
-        "filter": "*.json",
+        "objectGlob": "*.json",

Review Comment:
   `**.json`



##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -46,7 +46,7 @@ Sample specs:
       "type": "index_parallel",
       "inputSource": {
         "type": "s3",
-        "filter": "*.json",
+        "objectGlob": "*.json",

Review Comment:
   `**.json`



##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -81,7 +81,7 @@ Sample specs:
       "type": "index_parallel",
       "inputSource": {
         "type": "s3",
-        "filter": "*.json",
+        "objectGlob": "*.json",

Review Comment:
   `**.json`



##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -122,7 +122,7 @@ Sample specs:
       "type": "index_parallel",
       "inputSource": {
         "type": "s3",
-        "filter": "*.json",
+        "objectGlob": "*.json",

Review Comment:
   `**.json`



##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -81,7 +81,7 @@ Sample specs:
       "type": "index_parallel",
       "inputSource": {
         "type": "s3",
-        "filter": "*.json",
+        "objectGlob": "*.json",

Review Comment:
   `**.json`



##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -122,7 +122,7 @@ Sample specs:
       "type": "index_parallel",
       "inputSource": {
         "type": "s3",
-        "filter": "*.json",
+        "objectGlob": "*.json",

Review Comment:
   `**.json`



##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -226,7 +226,7 @@ Sample specs:
       "type": "index_parallel",
       "inputSource": {
         "type": "google",
-        "filter": "*.json",
+        "objectGlob": "*.json",

Review Comment:
   `**.json`



##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -261,7 +261,7 @@ Sample specs:
       "type": "index_parallel",
       "inputSource": {
         "type": "google",
-        "filter": "*.json",
+        "objectGlob": "*.json",

Review Comment:
   `**.json`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1295732338

   This pull request **introduces 2 alerts** when merging 37a026cfcb8ad860d444a7370b63048fdbe6c1f2 into d851985cf5b0faf8e6630c3011dc80288e959f28 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-1247e5239164675bb1e48775b33f5ee35641addb)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1237417948

   This pull request **introduces 2 alerts** when merging 5ee673ab6d02c210efed99b6a1e1c1ecd92a1d2f into 6805a7f9c243445b95090e9aa777512f6411b506 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-d0801e6c5dce11f96e7f9345ec90351b91d1cbb7)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1237606085

   The last error I am seeing seems unrelated to my work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1289822137

   This pull request **introduces 2 alerts** when merging d101b2864684a4899351d76b9b3af0158ebba095 into d98c808d3f087645d8dcfd516e77b0058bdfd389 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-238ea26370b598e9d3937ca5a24df6b474002e2d)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1295748619

   An unrelated typescript check failed:
   ```
    FAIL  e2e-tests/multi-stage-query.spec.ts (302.447 s)
     ● Multi-stage query › runs a query that reads external data
       thrown: "Exceeded timeout of 300000 ms for a test.
       Use jest.setTimeout(newTimeout) to increase the timeout value, if this is a long-running test."
   ```
   
   @abhishekagarwal87 @gianm Any thought on the PR so far?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1299035954

   This pull request **introduces 2 alerts** when merging 14069f1d563c02260add3da56cf521fc22b930ef into 176934e849d50a2360f50db7771c1775110308f3 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-d2af05bf2d25d27d94b547012fc4e60ca61abfac)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1274028384

   @abhishekagarwal87 @gianm thought?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1242590704

   @gianm question, if it only matched just the filename, we lose the ability to filter out junk subfolders per example below.
   
   ```
   ("db/date=2022-08-01/001.parquet", "db/date=2022-08-01/*.parquet"));
   ("db/date=2022-08-01/_junk/0/001.parquet", "db/date=2022-08-01/*.parquet"));
   ```
   Sometimes, we don't have control over the input folder, this globing technique will allow us to do so.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r1010760240


##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -182,7 +182,7 @@ Sample specs:
 |uris|JSON array of URIs where S3 objects to be ingested are located.|None|`uris` or `prefixes` or `objects` must be set|
 |prefixes|JSON array of URI prefixes for the locations of S3 objects to be ingested. Empty objects starting with one of the given prefixes will be skipped.|None|`uris` or `prefixes` or `objects` must be set|
 |objects|JSON array of S3 Objects to be ingested.|None|`uris` or `prefixes` or `objects` must be set|
-|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter) for more information. Files matching the filter criteria are considered for ingestion. Files not matching the filter criteria are ignored.|None|no|
+|objectGlob|A wildcard filter for files. See [here](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-) for more information. Files matching the filter criteria are considered for ingestion. Files not matching the filter criteria are ignored.|None|no|

Review Comment:
   Addressed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1297484020

   This pull request **introduces 2 alerts** when merging 4c89a1bfbb010c1ac3d76a4fe0d71a13d5813d2d into 675fd982fb5ca274b057495a90563ecc248ad823 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-c0714566943e94d7d8159ff733604020323a8f72)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
gianm commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1243003517

   That's true, it's an ease-of-use vs. power tradeoff. I am legitimately worried that people will find it confusing to reason about whether, for example, prefix `s3://a/b` plus filter `b/*txt` applies to `s3://a/b.txt` or `s3://a/b/c.txt` or both or neither.
   
   Is this case (where there is a `_junk` subdirectory that we want to ignore) common? If it doesn't happen, we don't need to worry about it. If it happens, but is uncommon, maybe we can deal with it some other way?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1289921107

   This pull request **introduces 2 alerts** when merging 98633fb91014ff553f60cfec1183453eabc78f4a into d98c808d3f087645d8dcfd516e77b0058bdfd389 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-1eeb72c96d00ab14d8f866f0c8fb64efcbbf1ea9)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1299285648

   @gianm addressed!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1307902204

   What do you think @gianm ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r1009348853


##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -307,7 +307,7 @@ Sample specs:
       "type": "index_parallel",
       "inputSource": {
         "type": "azure",
-        "filter": "*.json",
+        "objectGlob": "*.json",

Review Comment:
   ```suggestion
           "objectGlob": "**.json",
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1243244677

   FWIW, I think it's best if the feature lets us read data folders that standard big data pipelines generate. Someone will likely ask for it soon enough if it's a common pattern. so just like that, I am going to change my vote again and say that we go ahead with this glob notation :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1244874475

   This pull request **introduces 2 alerts** when merging dd925886305618531977d7ef5c39b54e23dddeb8 into 08d6aca52865ab7bcbf75e16c4ca153bd3ca06f3 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-6b3e775808a9a8f447521ef47c3387dd7c527975)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1245013199

   This pull request **introduces 2 alerts** when merging 56426aa0d3c3b200896047395727bf4c2dc27531 into 08d6aca52865ab7bcbf75e16c4ca153bd3ca06f3 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-a64a51feb4ac9114778f78c3f28a4593bfc01377)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1280345118

   This pull request **introduces 2 alerts** when merging 207e6d60980139ab6ab9afad45882e195230d138 into 3bbb76f17bab32b3d3e12a472e8403affeb09108 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-744b50f0d5676756442de9014aa641b6d74c417b)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1280288713

   @gianm @abhishekagarwal87 ok, I addressed the 3 comments. We'll see what the CI results look like.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
gianm commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r1009976897


##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -182,7 +182,7 @@ Sample specs:
 |uris|JSON array of URIs where S3 objects to be ingested are located.|None|`uris` or `prefixes` or `objects` must be set|
 |prefixes|JSON array of URI prefixes for the locations of S3 objects to be ingested. Empty objects starting with one of the given prefixes will be skipped.|None|`uris` or `prefixes` or `objects` must be set|
 |objects|JSON array of S3 Objects to be ingested.|None|`uris` or `prefixes` or `objects` must be set|
-|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter) for more information. Files matching the filter criteria are considered for ingestion. Files not matching the filter criteria are ignored.|None|no|
+|objectGlob|A wildcard filter for files. See [here](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-) for more information. Files matching the filter criteria are considered for ingestion. Files not matching the filter criteria are ignored.|None|no|

Review Comment:
   IMO, this isn't clear enough for a new user to understand how it works. I suggest this instead:
   
   ```
   A glob for the object part of the S3 URI. In the URI `s3://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br /><br />The glob must match the entire object part, not just the filename. For example, the glob `*.json` does not match `s3://foo/bar/file.json`, because the object part is `bar/file.json`, and the`*` does not match the slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br />For more information, refer to the documentation for [`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).
   ```
   
   It would be formatted like this:
   
   > A glob for the object part of the S3 URI. In the URI `s3://foo/bar/file.json`, the glob is applied to `bar/file.json`.
   >
   > The glob must match the entire object part, not just the filename. For example, the glob `*.json` does not match `s3://foo/bar/file.json`, because the object part is `bar/file.json`, and the`*` does not match the slash. To match all objects ending in `.json`, use `**.json` instead.
   >
   > For more information, refer to the documentation for [`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r1009347352


##########
docs/ingestion/native-batch-input-source.md:
##########
@@ -773,7 +773,7 @@ The following is an example of a Combining input source spec:
         "delegates" : [
          {
           "type": "local",
-          "filter" : "*.csv",
+          "objectGlob" : "**.csv",

Review Comment:
   does local input source has `objectGlob`? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1244820505

   @gianm ok, I made an improvement, if users put blob store scheme and bucket, the filtering logic will ignore them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1243057390

   @gianm I would say it is quite common when people are massaging data using Spark into Iceberg.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm merged pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
gianm merged PR #13027:
URL: https://github.com/apache/druid/pull/13027


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1304412404

   Fix documentation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] lgtm-com[bot] commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1237460987

   This pull request **introduces 2 alerts** when merging 172521a9ecc3d01519f53a2158ac9e2643611916 into 6805a7f9c243445b95090e9aa777512f6411b506 - [view on LGTM.com](https://lgtm.com/projects/g/apache/druid/rev/pr-7b13b493355feae95c69036357c62d07048251ff)
   
   **new alerts:**
   
   * 2 for Uncontrolled data used in path expression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
gianm commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1239058349

   Hmm. IMO, we should definitely change something, since the behavior of `FilenameUtils.wildcardMatch` is just really weird. For example, this returns `true`:
   
   ```
   FilenameUtils.wildcardMatch("a/b/c.txt", "a*.txt")
   ```
   
   Which is weird since no real shell works this way. Generally it is expected that `*` does not match `/`.
   
   This patch fixes the weirdness to a globbing implementation where `*` properly doesn't match `/`, and changing the examples to use `**.suffix` (which _does_ match `/` in normal shells) instead of `*.suffix`.
   
   However, there is another way to fix it that IMO is better. We could change the `filter` glob to match file _names_ rather than _paths_. Then, `*.suffix` would still work fine. It's closer to what the `local` input source does. It's also close to what the `find` Unix command does when you do `find [directory] -name [glob]`. (It searches in directory for files whose names match the provided glob.)
   
   I like this way better because it avoids the awkward `**` construction in the examples, and avoids the need for people to think about entire paths in their minds: they can simply think about the file names. (One reason to avoid working with entire paths is that gets weird with cloud files. Like, in `s3://a/b`, will the path-glob be applied to `s3://a/b`, or `/a/b/`, or `/b`, or `b`? Better to dodge the question entirely by using names, i.e. apply it to `b` alone.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1239160250

   I do like the `find` like approach as well. That's easier to understand. It might not be as powerful but I will happily trade that off with the ease of use and understanding. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1264393073

   I really think using `**.parquet` is a small price to pay. Especially considering that this is a standard Java syntax.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1264390564

   @gianm @abhishekagarwal87 Any further thought?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a diff in pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

Posted by GitBox <gi...@apache.org>.
gianm commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r992875793


##########
core/src/main/java/org/apache/druid/data/input/impl/CloudObjectInputSource.java:
##########
@@ -108,9 +112,40 @@ public List<CloudObjectLocation> getObjects()
   @Nullable
   @JsonProperty
   @JsonInclude(JsonInclude.Include.NON_NULL)
-  public String getFilter()
+  public String getObjectGlob()
   {
-    return filter;
+    return objectGlob;
+  }
+
+  /**
+   * Strips out blob store's protocol and bucket from {@link #getObjectGlob}.
+   * This is to reduce user errors when crafting a objectGlob.

Review Comment:
   I think we shouldn't remove the bucket without verifying that it actually matches. This leads to surprising outcomes like the glob `s3://foo/*.txt` matching the file `s3://bar/x.txt`. I think we have a couple of options:
   
   - Don't remove the protocol and bucket: if the user provides the glob `s3://foo/*.txt` then it matches nothing.
   - Be "smart": apply the glob to the entire URI if it starts with a protocol; else apply it only to the object.
   
   IMO, the first option is best. My experience with "smart" features is they end up being more confusing than they're worth.



##########
core/src/test/java/org/apache/druid/data/input/impl/CloudObjectInputSourceTest.java:
##########
@@ -200,7 +205,72 @@ public void testWithObjects()
 
     List<CloudObjectLocation> returnedLocations = splits.map(InputSplit::get).collect(Collectors.toList()).get(0);
 
-    Assert.assertEquals(null, inputSource.getFilter());
+    Assert.assertEquals(null, inputSource.getObjectGlob());
     Assert.assertEquals(OBJECTS, returnedLocations);
   }
+
+  @Test
+  public void testGlobSubdirectories()
+  {
+    PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:**.parquet");
+    Assert.assertTrue(m.matches(Paths.get("db/date=2022-08-01/001.parquet")));
+    Assert.assertTrue(m.matches(Paths.get("db/date=2022-08-01/002.parquet")));
+
+    PathMatcher m2 = FileSystems.getDefault().getPathMatcher("glob:db/date=2022-08-01/*.parquet");
+    Assert.assertTrue(m2.matches(Paths.get("db/date=2022-08-01/001.parquet")));
+    Assert.assertFalse(m2.matches(Paths.get("db/date=2022-08-01/_junk/0/001.parquet")));
+  }
+
+  @Test
+  public void testGlobSubdirectories2()

Review Comment:
   This test cast isn't needed: we don't need to test this library that we aren't using.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org