Posted to issues@kylin.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/12/19 07:38:00 UTC

[jira] [Commented] (KYLIN-5371) Kylin4 multi-partition query bug

    [ https://issues.apache.org/jira/browse/KYLIN-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649191#comment-17649191 ] 

ASF GitHub Bot commented on KYLIN-5371:
---------------------------------------

liuzhao-lz opened a new pull request, #2049:
URL: https://github.com/apache/kylin/pull/2049

   ## Proposed changes
   
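   When the model was created, both a day partition and an hour partition were specified in the partition section, with the formats yyyy-MM-dd and HH respectively. Querying by a day-partition value then returns unexpected results. The cause is a bug in segment pruning for this scenario.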
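   
   The pruning problem can be illustrated with a minimal sketch, assuming segments are built per hour so that each segment's `[start, end)` range is time-granular while the filter value is date-granular. `SegRange`, `pointCheck` and `dayOverlapCheck` are hypothetical names used only for illustration; they are not Kylin APIs. The overlap check is one possible way to handle a date-granular value and is not necessarily what this PR implements.
   
   ```scala
   import java.time.{LocalDate, LocalDateTime, ZoneOffset}

   object SegmentPruningSketch {
     // Hypothetical stand-in for a segment's [start, end) range in epoch millis.
     final case class SegRange(start: Long, end: Long)

     val HourMillis: Long = 60L * 60 * 1000
     val DayMillis: Long = 24L * HourMillis

     // UTC is used here only to keep the example deterministic.
     def millis(dt: LocalDateTime): Long = dt.toInstant(ZoneOffset.UTC).toEpochMilli

     // Point check with the same shape as the EqualTo branch quoted in the issue:
     // the filter value is treated as a single instant.
     def pointCheck(ts: Long, seg: SegRange): Boolean =
       ts >= seg.start && ts < seg.end

     // Illustrative alternative (an assumption, not the PR's code): treat a
     // date-granular value as the whole day [dayStart, dayStart + 1 day) and
     // keep every segment that overlaps it.
     def dayOverlapCheck(dayStart: Long, seg: SegRange): Boolean =
       dayStart < seg.end && dayStart + DayMillis > seg.start

     def main(args: Array[String]): Unit = {
       // pdate = '2022-12-19' parsed with yyyy-MM-dd yields midnight of that day.
       val day = millis(LocalDate.of(2022, 12, 19).atStartOfDay())
       // Three hourly segments: 00:00-01:00, 01:00-02:00, 02:00-03:00.
       val hourly = (0 until 3).map { h =>
         val s = millis(LocalDateTime.of(2022, 12, 19, h, 0))
         SegRange(s, s + HourMillis)
       }
       // Only the segment containing midnight passes the point check.
       println(hourly.map(pointCheck(day, _)))
       // All three segments overlap the queried day.
       println(hourly.map(dayOverlapCheck(day, _)))
     }
   }
   ```
   
   With the point check, every hourly segment that starts after midnight is pruned, which matches the incomplete results shown in the screenshots taken before the fix.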
   
   --Result without the day-partition filter
   ![1e9077f923e57f65e3712b75d639a31](https://user-images.githubusercontent.com/49258176/208371800-d8d86f55-d342-47d1-962a-f668d759dd65.png)
   
   --Results with the day-partition filter, before the fix
   ![6bbf3bf08dea6b5a269d57c65b4a1b7](https://user-images.githubusercontent.com/49258176/208371938-f4977bbf-678b-460e-9b7b-ff6443110350.png)
   ![e4046323d0a7a8521dae904c5030f26](https://user-images.githubusercontent.com/49258176/208371954-2f5a0a03-b181-4ded-a2c1-4a1b05b58f82.png)
   ![6858ba031d6d44b050edc76a6e96251](https://user-images.githubusercontent.com/49258176/208371975-6e64407d-0b40-49f5-bc48-d16234e45647.png)
   
   --Results with the day-partition filter, after the fix
   ![7f372221624b83ff4ad29023c86d373](https://user-images.githubusercontent.com/49258176/208372022-d51d7b49-1348-40ca-a5cc-8ead7b8ab068.png)
   ![38abe82bacc6c48a8b2ee8d1b31a180](https://user-images.githubusercontent.com/49258176/208372047-9d3b3364-9b99-47f5-9b46-060eace9177c.png)
   ![559a5f4b672c8d2521cb7f68fd88f72](https://user-images.githubusercontent.com/49258176/208372063-91cac6fd-2da2-4fc8-969f-3e1314164b4d.png)
   
   ## Branch to commit
   - [ ] Branch **kylin3** for v2.x to v3.x
   - [x] Branch **kylin4** for v4.x
   - [ ] Branch **kylin5** for v5.x
   
   ## Types of changes
   
   What types of changes does your code introduce to Kylin?
   _Put an `x` in the boxes that apply_
   
   - [x] Bugfix (non-breaking change which fixes an issue)
   - [ ] New feature (non-breaking change which adds functionality)
   - [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
   - [ ] Documentation Update (if none of the other choices apply)
   
   ## Checklist
   
   _Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code._
   
   - [x] I have created an issue on [Kylin's jira](https://issues.apache.org/jira/browse/KYLIN), and have described the bug/feature there in detail
   - [x] Commit messages in my PR start with the related jira ID, like "KYLIN-0000 Make Kylin project open-source"
   - [ ] Compiling and unit tests pass locally with my changes
   - [x] I have added tests that prove my fix is effective or that my feature works
   - [ ] I have added necessary documentation (if appropriate)
   - [ ] Any dependent changes have been merged
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at user@kylin.apache.org or dev@kylin.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...
   




> Kylin4 multi-partition query bug
> ----------------
>
>                 Key: KYLIN-5371
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5371
>             Project: Kylin
>          Issue Type: Bug
>    Affects Versions: v4.0.1, v4.0.2
>            Reporter: Liu Zhao
>            Assignee: Liu Zhao
>            Priority: Major
>         Attachments: image-2022-12-19-11-33-36-654.png, image-2022-12-19-11-34-06-372.png, image-2022-12-19-11-34-45-932.png, image-2022-12-19-11-35-03-652.png, image-2022-12-19-11-49-48-323.png
>
>
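> If two partition columns, date and hour, are specified for incremental builds when creating the model, the build succeeds, but a query whose WHERE clause only specifies = some date value returns unexpected results.
> // pdate and phour are both partition columns and were also specified as partitions when the model was created; see the attached images for details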
> --q1:
> select pdate, phour, count(1) from lz_test_partition where pdate = '2022-12-19' group by pdate, phour
> --q2:
> select pdate, phour, count(1) from lz_test_partition group by pdate, phour
> Looking at the source, the bug is in org.apache.spark.sql.execution.datasource.SegFilters#foldFilter and org.apache.spark.sql.execution.datasource.SegFilters#insurance: one side of the comparison uses only the date, while the other uses time-level values.
> {code:java}
> case class SegFilters(start: Long, end: Long, pattern: String) extends Logging {
>   private def insurance(value: Any)
>                        (func: Long => Filter): Filter = {
>     value match {
>       case v: Date =>
>         // see SPARK-27546
>         val ts = DateFormat.stringToMillis(v.toString)
>         func(ts)
>       case v @ (_:String | _: Int | _: Long) if pattern != null =>
>         val format = DateFormat.getDateFormat(pattern)
>         val time = format.parse(v.toString).getTime
>         func(time)
>       case v: Timestamp =>
>         func(v.getTime)
>       case _ =>
>         Trivial(true)
>     }
>   }
>   /**
>    * Recursively fold provided filters to trivial,
>    * blocks are always non-empty.
>    */
>   def foldFilter(filter: Filter): Filter = {
>     filter match {
>       case EqualTo(_, value: Any) =>
>         insurance(value) {
>           ts => Trivial(ts >= start && ts < end)    // Note: this is where the problem is. ts is date-granular, but start and end can be time-granular, so this check can wrongly prune the segment
>         }
>       case In(_, values: Array[Any]) =>
>         val satisfied = values.map(v => insurance(v) {
>           ts => Trivial(ts >= start && ts < end)
>         }).exists(_.equals(Trivial(true)))
>         Trivial(satisfied)
>       case IsNull(_) =>
>         Trivial(false)
>       case IsNotNull(_) =>
>         Trivial(true)
>       case GreaterThan(_, value: Any) =>
>         insurance(value) {
>           ts => Trivial(ts < end)
>         }
>       case GreaterThanOrEqual(_, value: Any) =>
>         insurance(value) {
>           ts => Trivial(ts < end)
>         }
>       case LessThan(_, value: Any) =>
>         insurance(value) {
>           ts => Trivial(ts > start)
>         }
>       case LessThanOrEqual(_, value: Any) =>
>         insurance(value) {
>           ts => Trivial(ts >= start)
>         }
>       case And(left: Filter, right: Filter) =>
>         And(foldFilter(left), foldFilter(right)) match {
>           case And(AlwaysFalse, _) => Trivial(false)
>           case And(_, AlwaysFalse) => Trivial(false)
>           case And(AlwaysTrue, right) => right
>           case And(left, AlwaysTrue) => left
>           case other => other
>         }
>       case Or(left: Filter, right: Filter) =>
>         Or(foldFilter(left), foldFilter(right)) match {
>           case Or(AlwaysTrue, _) => Trivial(true)
>           case Or(_, AlwaysTrue) => Trivial(true)
>           case Or(AlwaysFalse, right) => right
>           case Or(left, AlwaysFalse) => left
>           case other => other
>         }
>       case unsupportedFilter =>
>         // return 'true' to scan all partitions
>         // currently unsupported filters are:
>         // - StringStartsWith
>         // - StringEndsWith
>         // - StringContains
>         // - EqualNullSafe
>         Trivial(true)
>     }
>   }
>   def Trivial(value: Boolean): Filter = {
>     if (value) AlwaysTrue else AlwaysFalse
>   }
> }
> {code}
> See the attached images for details and the cause:
>  !image-2022-12-19-11-49-48-323.png! 
>  !image-2022-12-19-11-33-36-654.png! 
>  !image-2022-12-19-11-34-06-372.png! 
>  !image-2022-12-19-11-34-45-932.png! 
>  !image-2022-12-19-11-35-03-652.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)