Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/11/16 13:24:28 UTC

[GitHub] [pinot] mneedham opened a new pull request #7776: Handle segment names that may contain characters typically used in DateTime fields

mneedham opened a new pull request #7776:
URL: https://github.com/apache/pinot/pull/7776


   I wanted to import a CSV file that contains a DateTime field. 
   
   The CSV file looks like this:
   
   ```
   ID,Date
   10224738,09-05-2015T09:58:00
   ```
   And then the schema file:
   
   ```
   {
       "schemaName": "crimes",
       "dimensionFieldSpecs": [
         {
           "name": "ID",
           "dataType": "INT"
         }
       ],
       "dateTimeFieldSpecs": [{
         "name": "Date",
         "dataType": "STRING",
         "format" : "1:SECONDS:SIMPLE_DATE_FORMAT:MM-dd-yyyy'T'HH:mm:ss",
         "granularity": "1:HOURS"
       }]
   }
   ```
     
   But we get this error when running the ingestion job:
   
   ```
   2021/11/16 11:37:50.382 ERROR [SegmentGenerationJobRunner] [pool-2-thread-1] Failed to generate Pinot segment for file - file:/data/mark.csv
   java.lang.IllegalArgumentException: null
   	at shaded.com.google.common.base.Preconditions.checkArgument(Preconditions.java:108) ~[pinot-all-0.9.0-SNAPSHOT-jar-with-dependencies.jar:0.9.0-SNAPSHOT-540e70e9e3e24bdb2a14f56b2c1264180abaeda8]
   	at org.apache.pinot.segment.spi.creator.name.SimpleSegmentNameGenerator.generateSegmentName(SimpleSegmentNameGenerator.java:53) ~[pinot-all-0.9.0-SNAPSHOT-jar-with-dependencies.jar:0.9.0-SNAPSHOT-540e70e9e3e24bdb2a14f56b2c1264180abaeda8]
   	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.handlePostCreation(SegmentIndexCreationDriverImpl.java:268) ~[pinot-all-0.9.0-SNAPSHOT-jar-with-dependencies.jar:0.9.0-SNAPSHOT-540e70e9e3e24bdb2a14f56b2c1264180abaeda8]
   	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.build(SegmentIndexCreationDriverImpl.java:258) ~[pinot-all-0.9.0-SNAPSHOT-jar-with-dependencies.jar:0.9.0-SNAPSHOT-540e70e9e3e24bdb2a14f56b2c1264180abaeda8]
   	at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:119) ~[pinot-all-0.9.0-SNAPSHOT-jar-with-dependencies.jar:0.9.0-SNAPSHOT-540e70e9e3e24bdb2a14f56b2c1264180abaeda8]
   	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:263) ~[pinot-batch-ingestion-standalone-0.9.0-SNAPSHOT-shaded.jar:0.9.0-SNAPSHOT-540e70e9e3e24bdb2a14f56b2c1264180abaeda8]
   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
   	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
   	at java.lang.Thread.run(Thread.java:829) [?:?]
   ```
   
   The issue is that the min and max time values don't pass the `isValidSegmentName` check that was added to `SimpleSegmentNameGenerator` in https://github.com/apache/pinot/pull/7085. Both values are `09-05-2015T09:58:00`, which contains a `:`. The same problem would occur with other characters that may appear in date fields, such as a space or a forward slash.
   
   This PR replaces those problematic characters inside `SimpleSegmentNameGenerator` before the `isValidSegmentName` check.
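
A minimal, self-contained sketch of the idea, for illustration only. The class name, `sanitize`, and `looksValid` are hypothetical; in particular `looksValid` is only a stand-in for `isValidSegmentName`, whose actual rules may differ. The character set (colon, space, forward slash) is the one this PR replaces.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual SimpleSegmentNameGenerator code.
final class SegmentNameSanitizerSketch {
  // Colon, space and forward slash: the characters replaced in this PR.
  private static final Pattern DATE_TIME_CHARS = Pattern.compile("[: /]");

  // Replaces each problematic character with an underscore; null stays null.
  static String sanitize(Object timeValue) {
    if (timeValue == null) {
      return null;
    }
    return DATE_TIME_CHARS.matcher(timeValue.toString()).replaceAll("_");
  }

  // Assumed stand-in for the validity check: rejects the same three characters.
  static boolean looksValid(String namePart) {
    return !DATE_TIME_CHARS.matcher(namePart).find();
  }

  public static void main(String[] args) {
    String raw = "09-05-2015T09:58:00";
    System.out.println(looksValid(raw));           // false
    System.out.println(sanitize(raw));             // 09-05-2015T09_58_00
    System.out.println(looksValid(sanitize(raw))); // true
  }
}
```

With the sample row above, both the min and max time values would become `09-05-2015T09_58_00` before the segment name is assembled.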


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mneedham commented on a change in pull request #7776: Handle segment names that may contain characters typically used in DateTime fields

Posted by GitBox <gi...@apache.org>.
mneedham commented on a change in pull request #7776:
URL: https://github.com/apache/pinot/pull/7776#discussion_r750314519



##########
File path: pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/name/SimpleSegmentNameGenerator.java
##########
@@ -50,6 +50,12 @@ public SimpleSegmentNameGenerator(String segmentNamePrefix, @Nullable String seg
 
   @Override
   public String generateSegmentName(int sequenceId, @Nullable Object minTimeValue, @Nullable Object maxTimeValue) {
+    if (minTimeValue != null) {
+      minTimeValue = minTimeValue.toString().replaceAll("[: \\/]", "_");

Review comment:
       got it, have updated it. 






[GitHub] [pinot] mneedham closed pull request #7776: Handle segment names that may contain characters typically used in DateTime fields

Posted by GitBox <gi...@apache.org>.
mneedham closed pull request #7776:
URL: https://github.com/apache/pinot/pull/7776


   




[GitHub] [pinot] mneedham commented on pull request #7776: Handle segment names that may contain characters typically used in DateTime fields

Posted by GitBox <gi...@apache.org>.
mneedham commented on pull request #7776:
URL: https://github.com/apache/pinot/pull/7776#issuecomment-970506423


   I think we can close this and rather use the normalised segment generator




[GitHub] [pinot] richardstartin commented on a change in pull request #7776: Handle segment names that may contain characters typically used in DateTime fields

Posted by GitBox <gi...@apache.org>.
richardstartin commented on a change in pull request #7776:
URL: https://github.com/apache/pinot/pull/7776#discussion_r750305717



##########
File path: pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/name/SimpleSegmentNameGenerator.java
##########
@@ -50,6 +50,12 @@ public SimpleSegmentNameGenerator(String segmentNamePrefix, @Nullable String seg
 
   @Override
   public String generateSegmentName(int sequenceId, @Nullable Object minTimeValue, @Nullable Object maxTimeValue) {
+    if (minTimeValue != null) {
+      minTimeValue = minTimeValue.toString().replaceAll("[: \\/]", "_");

Review comment:
       It's better to precompile these, e.g.:
   
   ```java
   private static final Pattern REPLACEMENT_REGEX = Pattern.compile("[: \\/]");
   
   ...
   
   minTimeValue = REPLACEMENT_REGEX.matcher(minTimeValue.toString()).replaceAll("_");
   ```
   
   This isn't a very hot code path, but always doing this will make it easier to ban on-the-fly regex compilation with static analysis.
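
For context, `String.replaceAll(regex, replacement)` is documented as equivalent to `Pattern.compile(regex).matcher(str).replaceAll(replacement)`, so it compiles the pattern on every call. The sketch below compares the two forms; the class and method names are hypothetical and it only shows that both produce the same result for the regex in this diff.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of on-the-fly and precompiled regex replacement.
final class PrecompiledPatternSketch {
  private static final String REGEX = "[: \\/]";
  // Compiled once, when the class is initialised.
  private static final Pattern REPLACEMENT_REGEX = Pattern.compile(REGEX);

  // Compiles REGEX again on each invocation.
  static String viaString(String value) {
    return value.replaceAll(REGEX, "_");
  }

  // Reuses the precompiled pattern.
  static String viaPattern(String value) {
    return REPLACEMENT_REGEX.matcher(value).replaceAll("_");
  }

  public static void main(String[] args) {
    String value = "09/05/2015 09:58:00";
    System.out.println(viaString(value));  // 09_05_2015_09_58_00
    System.out.println(viaPattern(value)); // 09_05_2015_09_58_00
  }
}
```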






[GitHub] [pinot] mneedham commented on pull request #7776: Handle segment names that may contain characters typically used in DateTime fields

Posted by GitBox <gi...@apache.org>.
mneedham commented on pull request #7776:
URL: https://github.com/apache/pinot/pull/7776#issuecomment-970326686


   cc'ing @walterddr and @Jackie-Jiang as they worked on the original PR around this

