Posted to commits@pinot.apache.org by "travis-cook-sfdc (via GitHub)" <gi...@apache.org> on 2023/04/14 00:02:14 UTC

[GitHub] [pinot] travis-cook-sfdc opened a new issue, #10611: IngestionJobSpec: includeFileNamePattern with Regex does not work as documented

travis-cook-sfdc opened a new issue, #10611:
URL: https://github.com/apache/pinot/issues/10611

   According to the [docs](https://docs.pinot.apache.org/configuration-reference/job-specification#top-level-spec), `includeFileNamePattern` and `excludeFileNamePattern` are documented like:
   
   > Only Files matching this pattern will be included from inputDirURI. Both glob and regex patterns are supported.
   Examples:
   Use 'glob:.avro' or 'regex:^..(avro)$' to include all avro files one level deep in the inputDirURI.
   Alternatively, use 'glob:*/.avro' to include all the avro files in inputDirURI as well as its subdirectories - bear in mind that, with this approach, the pattern needs to match the absolute path. You can use [Glob tool](https://www.digitalocean.com/community/tools/glob) or [Regex Tool](https://www.regextester.com/) to test out your patterns.
   
   
   A few issues here:
   
   1️⃣ The example of `regex:^..(avro)$` does not actually work. When running a job with this pattern, you'll get an error like this:
   ```
   Caused by: groovy.lang.GroovyRuntimeException: Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): startup failed:
   SimpleTemplateScript1.groovy: 1: illegal string body character after dollar sign;
      solution: either escape a literal dollar sign "\$5" or bracket the value expression "${5}" @ line 1, column 10.
      out.print("""
               ^
   
   1 error
   ```
   
   I'm assuming this is because of the templating that was introduced in #5341 (also not documented). Job specs appear to have special handling for both `$`, which needs to be escaped as `\$`, and backslashes, which are automatically escaped to `\\`.
   
   2️⃣ Related to the above, it's not clear how someone would write a single backslash character in their regex. For example, I think `.*\.parquet$` is impossible to express, because there is no way to produce a single backslash: `\` turns into `\\`, and `\\` stays as `\\`.
   This issue can be worked around by using a character class and writing `.*[.]parquet$`, but it feels wrong.
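   
   To illustrate the workaround above, here is a minimal sketch showing that `[.]` and `\.` behave the same in `java.util.regex`. The class name, the helper `fullMatch`, and the sample file names are made up for illustration; this is plain Java, not Pinot code.
   
   ```java
   import java.util.regex.Pattern;
   
   class DotEscapeDemo {
       // True when the entire input matches the regex (Matcher.matches() semantics).
       static boolean fullMatch(String regex, String input) {
           return Pattern.compile(regex).matcher(input).matches();
       }
   
       public static void main(String[] args) {
           String escaped = ".*\\.parquet$";   // Java source spelling of the regex .*\.parquet$
           String bracketed = ".*[.]parquet$"; // same meaning, no backslash required
   
           // Hypothetical file names: one with a literal dot, one without.
           for (String name : new String[] {"part-0.gz.parquet", "part-0.gz_parquet"}) {
               System.out.println(name + " -> " + fullMatch(escaped, name)
                       + " / " + fullMatch(bracketed, name));
           }
       }
   }
   ```
   
   Both patterns accept the first name and reject the second, so the character class is a drop-in replacement wherever a backslash cannot be written.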
   
   3️⃣ What flavor of regex is actually being used here? `regextester.com`, linked in the documentation, only supports PCRE and JavaScript regex. However, since the code uses [PathMatcher](https://docs.oracle.com/javase/7/docs/api/java/nio/file/FileSystem.html#getPathMatcher(java.lang.String)), this is Java regex, which has different syntax. Pinot should link to a regex tester that supports the Java flavor.
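   
   As a rough check of the flavor, the sketch below feeds a `java.util.regex` construct (`\p{Alnum}`) to a `regex:` `PathMatcher` and also shows that the whole path string has to match. The class name, helper name, and file name are hypothetical, and the behavior shown is that of the JDK's default file system, not something taken from the Pinot code.
   
   ```java
   import java.nio.file.FileSystems;
   import java.nio.file.PathMatcher;
   import java.nio.file.Paths;
   
   class RegexFlavorDemo {
       // Matches a single file name against a "regex:" PathMatcher pattern.
       static boolean matchesName(String fileName, String regex) {
           PathMatcher matcher = FileSystems.getDefault().getPathMatcher("regex:" + regex);
           return matcher.matches(Paths.get(fileName));
       }
   
       public static void main(String[] args) {
           // \p{Alnum} is the java.util.regex spelling of "ASCII letter or digit".
           System.out.println(matchesName("events01.avro", "\\p{Alnum}+[.]avro")); // true
           // The pattern must cover the whole path string, not just a substring.
           System.out.println(matchesName("events01.avro", "[.]avro"));            // false
       }
   }
   ```
   
   A tester that does not implement Java syntax may accept or reject patterns differently from what this prints.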
   
   4️⃣ Can you provide some examples of the _absolute path_ I should be matching to?  I've submitted an ingestion job spec that has `includeFileNamePattern: regex:^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$`
   
   I have an S3 file at the following path:
   `s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet`
   
   According to regex101.com, this is a match using Java 8 syntax:
   https://regex101.com/r/9ZKOhm/1
   
   It's unclear to me what I'm doing wrong that's causing this pattern to not match.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] travis-cook-sfdc commented on issue #10611: IngestionJobSpec: includeFileNamePattern with Regex does not work as documented

Posted by "travis-cook-sfdc (via GitHub)" <gi...@apache.org>.
travis-cook-sfdc commented on issue #10611:
URL: https://github.com/apache/pinot/issues/10611#issuecomment-1507851543

   I spent a little bit more time with this and now understand why 4️⃣ was an issue.
   
   ```java
   import java.nio.file.FileSystems;
   import java.nio.file.Path;
   import java.nio.file.PathMatcher;
   import java.nio.file.Paths;
   
   public class FileTest {
   
       // Prints whether the path matches the given "syntax:pattern" string.
       public static void matches(Path path, String syntaxAndPattern) {
           PathMatcher matcher = FileSystems.getDefault().getPathMatcher(syntaxAndPattern);
           System.out.println(matcher.matches(path));
       }
   
       public static void main(String[] args) {
           Path path = Paths.get("s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet");
           // Print the path as the matcher will see it.
           System.out.println(path.toString());
           // Pattern written with "s3://".
           matches(path, "regex:^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$");
           // Same pattern written with a single slash, "s3:/".
           matches(path, "regex:^s3:/redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$");
       }
   }
   
   FileTest.main(new String[] {})
   
   ```
   
   Returns
   ```
   s3:/redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet
   false
   true
   ```
   
   Because Pinot matches the regex against a Java `Path` object using `getPathMatcher`, and Java paths collapse `//` to `/`, any regex sent for ingestion has to be written against the collapsed form (`s3:/...`, not `s3://...`).
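   
   One possible way to sidestep this, sketched under assumptions: writing the scheme separator as `/+` makes the pattern match whether or not the slashes have been collapsed. The bucket name `example-bucket`, the file layout, and the class and helper names are hypothetical, and the sketch assumes a Unix-like default file system, as the output above suggests.
   
   ```java
   import java.nio.file.FileSystems;
   import java.nio.file.Path;
   import java.nio.file.PathMatcher;
   import java.nio.file.Paths;
   
   class SchemeSlashDemo {
       // Converts the location string to a Path (which collapses "//")
       // and tests it against a "syntax:pattern" string.
       static boolean matches(String location, String syntaxAndPattern) {
           Path path = Paths.get(location);
           PathMatcher matcher = FileSystems.getDefault().getPathMatcher(syntaxAndPattern);
           return matcher.matches(path);
       }
   
       public static void main(String[] args) {
           // Hypothetical location, standing in for a real S3 URI.
           String location = "s3://example-bucket/table/ds=2023-03-02/part-0.parquet";
   
           // Literal "s3://" no longer matches once the Path has collapsed the slashes.
           System.out.println(matches(location, "regex:^s3://example-bucket/.*[.]parquet$")); // false
           // "/+" accepts one or more slashes after the scheme.
           System.out.println(matches(location, "regex:^s3:/+example-bucket/.*[.]parquet$")); // true
       }
   }
   ```
   
   Whether this is preferable to documenting the single-slash form is a judgment call; it only keeps the pattern valid in both a plain regex tester and the `PathMatcher`.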
   
   I think it would be useful to clean up the documentation significantly.

