Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/22 09:21:40 UTC

[GitHub] [hudi] Zhangshunyu opened a new issue, #7032: [SUPPORT] When metatable enabled some query result will be empty

Zhangshunyu opened a new issue, #7032:
URL: https://github.com/apache/hudi/issues/7032

   With the metadata table enabled, "id, t" configured as the column stats columns, and data skipping enabled, we take some id values from the table (all of them exist in the table) and use them as filters to query details. Some ids return results, but others return nothing. The queries look like this:
   select * from table_a where id in ('id001');
   select * from table_a where id in ('id002');
   Both 'id001' and 'id002' exist, but the query for 'id001' returns rows while the query for 'id002' returns an empty result.
   We also found that the list of candidate files after the index filter is applied is empty for 'id002'. Could something be wrong with the MIN/MAX values in the metadata table?
   Our config is as follows:
   
   hudi 0.11
   spark 3.1.1
   
   DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "years,months,days",
   "hoodie.sql.insert.mode" -> "non-strict",
   "hoodie.bulkinsert.sort.mode" -> "GLOBAL_SORT",
   "hoodie.metadata.enable" -> "true",
   "hoodie.bulkinsert.shuffle.parallelism" -> "300",
   "hoodie.parquet.max.file.size" -> "134217728",
   "hoodie.parquet.compression.codec" -> "snappy",
   "hoodie.parquet.dictionary.enabled" -> "false",
   "hoodie.metadata.index.column.stats.enable" -> "true",
   "hoodie.enable.data.skipping" -> "true",
   "hoodie.cleaner.policy.failed.writes" -> "LAZY",
   "hoodie.clean.automatic" -> "false",
   "hoodie.metadata.index.column.stats.column.list" -> "id, t",
   "hoodie.metadata.index.column.stats.file.group.count" -> "10",
   "hoodie.metadata.clean.async" -> "true",
   "hoodie.metadata.compact.max.delta.commits" -> "4")


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #7032: [SUPPORT] [The max values are incorrect in hudi metatable dataframe] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #7032:
URL: https://github.com/apache/hudi/issues/7032#issuecomment-1289945167

   Thanks.
   Closing the issue, as we already have a fix.
   




[GitHub] [hudi] nsivabalan commented on issue #7032: [SUPPORT] [The max values are incorrect in hudi metatable dataframe] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #7032:
URL: https://github.com/apache/hudi/issues/7032#issuecomment-1289823373

   So, does this impact any MOR table? @alexeykudinkin: we have tests on the column stats index, right? May I know how we missed this? It seems basic.
   




[GitHub] [hudi] Zhangshunyu commented on issue #7032: [SUPPORT] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on issue #7032:
URL: https://github.com/apache/hudi/issues/7032#issuecomment-1288748469

   > We are observing the same behavior with Hudi 0.11.1 and Spark 3.3.0. In our case we are filtering by a string column containing a timestamp like "202001110858". We obtain different results if enabling or disabling "hoodie.enable.data.skipping" when reading
   
   Yes, we are also using a string column as the index column and filtering on it.




[GitHub] [hudi] alexeykudinkin commented on issue #7032: [SUPPORT] [The max values are incorrect in hudi metatable dataframe] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on issue #7032:
URL: https://github.com/apache/hudi/issues/7032#issuecomment-1289519305

   @Zhangshunyu do you want to put a PR for the change you've made?




[GitHub] [hudi] nsivabalan closed issue #7032: [SUPPORT] [The max values are incorrect in hudi metatable dataframe] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #7032: [SUPPORT] [The max values are incorrect in hudi metatable dataframe] When metatable enabled, some query using index column as filter will get empty result
URL: https://github.com/apache/hudi/issues/7032




[GitHub] [hudi] Zhangshunyu commented on issue #7032: [SUPPORT] [The max values are incorrect in hudi metatable dataframe] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on issue #7032:
URL: https://github.com/apache/hudi/issues/7032#issuecomment-1288977072

   ![image](https://user-images.githubusercontent.com/13940237/197527861-8369f1e4-209e-4680-a998-24e560cb1003.png)
   We printed it like this.




[GitHub] [hudi] Zhangshunyu commented on issue #7032: [SUPPORT] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on issue #7032:
URL: https://github.com/apache/hudi/issues/7032#issuecomment-1288901483

   We checked the min/max values in the Parquet file, and they are OK,
   but the min/max in transposedColStatsDF of HoodieFileIndex is wrong: it uses the min as the max...
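
A minimal sketch of why a max value that equals the min makes a file disappear from the candidate list. It is an illustration only and does not use Hudi's actual classes: `MinMaxPruningSketch`, `FileStats`, `candidateFiles`, and the file name `f1.parquet` are hypothetical, and the pruning rule assumed here is the usual "keep a file only if min <= value <= max" check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MinMaxPruningSketch {

    /** Hypothetical per-file statistics for one string column (not a Hudi class). */
    static final class FileStats {
        final String fileName;
        final String min;
        final String max;

        FileStats(String fileName, String min, String max) {
            this.fileName = fileName;
            this.min = min;
            this.max = max;
        }
    }

    /** Keeps a file only if the filter value can lie inside its [min, max] range. */
    static List<String> candidateFiles(List<FileStats> stats, String value) {
        List<String> candidates = new ArrayList<>();
        for (FileStats s : stats) {
            if (s.min.compareTo(value) <= 0 && value.compareTo(s.max) <= 0) {
                candidates.add(s.fileName);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // The file really contains both 'id001' and 'id002'.
        List<FileStats> correct =
            Arrays.asList(new FileStats("f1.parquet", "id001", "id002"));
        // Same file, but the index recorded the min value as the max.
        List<FileStats> corrupted =
            Arrays.asList(new FileStats("f1.parquet", "id001", "id001"));

        System.out.println(candidateFiles(correct, "id002"));   // [f1.parquet]
        System.out.println(candidateFiles(corrupted, "id001")); // [f1.parquet]
        System.out.println(candidateFiles(corrupted, "id002")); // []
    }
}
```

Under these assumptions, a filter on the smallest value still finds the file, while any larger value is pruned, which matches the "some ids work, some are empty" symptom.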
   




[GitHub] [hudi] Zhangshunyu commented on issue #7032: [SUPPORT] [The max values are incorrect in hudi metatable dataframe] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on issue #7032:
URL: https://github.com/apache/hudi/issues/7032#issuecomment-1289832785

   > @Zhangshunyu do you want to put a PR for the change you've made?
   
   OK, I will raise a PR to fix it.




[GitHub] [hudi] ssandona commented on issue #7032: [SUPPORT] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
ssandona commented on issue #7032:
URL: https://github.com/apache/hudi/issues/7032#issuecomment-1288506888

   We are observing the same behavior with Hudi 0.11.1. In our case we are filtering by a string column containing a timestamp like "202001110858". We obtain different results if enabling or disabling "hoodie.enable.data.skipping".




[GitHub] [hudi] Zhangshunyu commented on issue #7032: [SUPPORT] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on issue #7032:
URL: https://github.com/apache/hudi/issues/7032#issuecomment-1288973266

   @alexeykudinkin 




[GitHub] [hudi] Zhangshunyu commented on issue #7032: [SUPPORT] [The max values are incorrect in hudi metatable dataframe] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on issue #7032:
URL: https://github.com/apache/hudi/issues/7032#issuecomment-1289215164

   I have fixed this problem by changing this code in HoodieMetadataPayload:
       Comparable maxValue =
           (Comparable) Stream.of(
               (Comparable) unwrapStatisticValueWrapper(prevColumnStats.**getMinValue()**),
               (Comparable) unwrapStatisticValueWrapper(newColumnStats.**getMinValue()**))
           .filter(Objects::nonNull)
           .max(Comparator.naturalOrder())
           .orElse(null);
   
   Both highlighted calls should be getMaxValue(), because this block computes the merged max value.
   
   @yihua 
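
A self-contained sketch of the merge logic described above, assuming plain string statistics. It is not the Hudi implementation: `ColumnStatsMergeSketch`, `Stats`, `buggyMax`, and `merge` are hypothetical names, and the wrapper unwrapping done by `unwrapStatisticValueWrapper` is left out. It contrasts a max computed from the two min values with a max computed from the two max values.

```java
import java.util.Comparator;
import java.util.Objects;
import java.util.stream.Stream;

public class ColumnStatsMergeSketch {

    /** Hypothetical min/max pair for one column of one file (not a Hudi class). */
    static final class Stats {
        final String min;
        final String max;

        Stats(String min, String max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Mirrors the reported bug: the "max" is taken over the two min values. */
    static String buggyMax(Stats prev, Stats next) {
        return Stream.of(prev.min, next.min)
            .filter(Objects::nonNull)
            .max(Comparator.naturalOrder())
            .orElse(null);
    }

    /** Merges two records: smallest of the mins, largest of the maxes. */
    static Stats merge(Stats prev, Stats next) {
        String min = Stream.of(prev.min, next.min)
            .filter(Objects::nonNull)
            .min(Comparator.naturalOrder())
            .orElse(null);
        String max = Stream.of(prev.max, next.max)
            .filter(Objects::nonNull)
            .max(Comparator.naturalOrder())
            .orElse(null);
        return new Stats(min, max);
    }

    public static void main(String[] args) {
        Stats prev = new Stats("id001", "id002");
        Stats next = new Stats("id001", "id003");

        System.out.println(buggyMax(prev, next));         // id001
        Stats merged = merge(prev, next);
        System.out.println(merged.min + " .. " + merged.max); // id001 .. id003
    }
}
```

With the buggy variant the merged max collapses to a min value, so any filter value above it falls outside the recorded range.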




[GitHub] [hudi] Zhangshunyu commented on issue #7032: [SUPPORT] When metatable enabled, some query using index column as filter will get empty result

Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on issue #7032:
URL: https://github.com/apache/hudi/issues/7032#issuecomment-1288515267

   @alexeykudinkin @yihua @nsivabalan 
   Could you please take a look at this problem? Thanks!

