You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "tanruixiang (via GitHub)" <gi...@apache.org> on 2023/06/28 06:28:56 UTC

[GitHub] [arrow-datafusion] tanruixiang opened a new issue, #6784: The `dict_id` was lost when constructing the logic plan.

tanruixiang opened a new issue, #6784:
URL: https://github.com/apache/arrow-datafusion/issues/6784

   ### Describe the bug
   
   One of the simplest sql statements: `select * from table`; In the construction of the logical plan, `Projection` will use `to_field` at the bottom to construct `DFField`, and then `to_field` ignore the `dict_id`, which will lead to the use of `IPC` when the dictionary encoding error.
   
   We are happy to contribute to the community and solve this problem.
   To solve this problem, it may be necessary to add interfaces to the `ExprSchema`, for example by adding `dict_is_ordered` and `dict_id` interfaces, or by adding a direct `get_dffield` interface. Or there is a better way other than the two mentioned above. Both methods have a certain amount of work and we are not sure which one to use or if there is a better way. We hope community can provide some comments and help.
   
   Here is a draft of one of the methods:
   https://github.com/CeresDB/arrow-datafusion/pull/3
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tanruixiang commented on issue #6784: The `dict_id` was lost when constructing the logic plan.

Posted by "tanruixiang (via GitHub)" <gi...@apache.org>.

tanruixiang commented on issue #6784:
URL: https://github.com/apache/arrow-datafusion/issues/6784#issuecomment-1616345871

   Thank you very much for your reply. I think that after providing a reasonable `provider`, `dict_id` should not be lost when building the plan, because `dict_id` and other variables such as `name` are located at the same level(In `pub struct Field`), and since `name` will not be lost, `dict_id` should not be lost either.
   At the same time we get the `name`, we can get the `dict_id`, so we need to keep the `dict_id` at least for the whole process of building the plan. (In particular, I think it's always better to keep the `dict_id` as much as possible than to just discard it and use the `default value`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #6784: The `dict_id` was lost when constructing the logic plan.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #6784:
URL: https://github.com/apache/arrow-datafusion/issues/6784#issuecomment-1616599606

   I wonder if you can explain how you are using `dict_id` in Schema? That field in particular is quite old in the code, and @tustvold  and I have discussed at various times removing it as it doesn't seem to be widely used / supported. However we couldn't convince ourselves that was a good idea -- mostly because we don't know what people are using it for
   
   > By the way, if you think my draft implementation works, I'd be happy to continue implementing it and contributing to the community.
   
   Thank you for your efforts so far -- I did look (briefly) at the code and while it follows the existing patterns for metadata / nullable it felt to me like it was going to be hard to ensure all cases were covered properly (aka it was going to be hard to use)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #6784: The `dict_id` was lost when constructing the logic plan.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #6784:
URL: https://github.com/apache/arrow-datafusion/issues/6784#issuecomment-1629246188

   🤔  it might be time to write up some sort of description of how this would work more generally


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on issue #6784: The `dict_id` was lost when constructing the logic plan.

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #6784:
URL: https://github.com/apache/arrow-datafusion/issues/6784#issuecomment-1629024255

   > I agree -- I think the right place to start might be in [arrow-rs](https://github.com/apache/arrow-rs) (aka to where the kernels that implement Eq are)
   
   I'm not sure this is something that can be handled at this level, as the dict_id is part of the schema not the arrays, I think it needs to be handled at the DF level - perhaps by selecting specialized dictionary operators


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #6784: The `dict_id` was lost when constructing the logic plan.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #6784:
URL: https://github.com/apache/arrow-datafusion/issues/6784#issuecomment-1615148606

   I would love to know what you think about dict_id handling in general -- from what I can see so far it is not well supported in arrow-rs. We have similar problems with `metadata` which can be hung off a schema or a field and gets lost frequently
   
   I am also not 100% clear if dict_id is supposed to (potentially) different per record batch or if it would be the same for the entire plan 
   
   One thing that might be possible is to compare the pointer for the dictionary array to decide if it was the same dictionary rather than trying to keep `dict_id` all the way through the plan 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tanruixiang commented on issue #6784: The `dict_id` was lost when constructing the logic plan.

Posted by "tanruixiang (via GitHub)" <gi...@apache.org>.

tanruixiang commented on issue #6784:
URL: https://github.com/apache/arrow-datafusion/issues/6784#issuecomment-1627577672

   > I wonder if you can explain how you are using `dict_id` in Schema? That field in particular is quite old in the code, and @tustvold and I have discussed at various times removing it as it doesn't seem to be widely used / supported. However we couldn't convince ourselves that was a good idea -- mostly because we don't know what people are using it for
   > 
   > > By the way, if you think my draft implementation works, I'd be happy to continue implementing it and contributing to the community.
   > 
   > Thank you for your efforts so far -- I did look (briefly) at the code and while it follows the existing patterns for metadata / nullable it felt to me like it was going to be hard to ensure all cases were covered properly (aka it was going to be hard to use)
   
   Thank you for your reply. I got what your mean. Currently `dict_id` is only used for IPC, my point is that we can further use on `dict_id` and `dict_is_ordered`, for the some operations, such as `Eq`, if `left` and `right` have the same `dict_id` ,it can only need to compare whether the Key is the same(In case of string type, the complexity changes: `O(max(leftstring_ength,rightstring_length) -> O(1)`). For sorting operations, further speedups can also be done based on `dict_is_ordered`. 
   Of course, this is a relatively big feature.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tanruixiang commented on issue #6784: The `dict_id` was lost when constructing the logic plan.

Posted by "tanruixiang (via GitHub)" <gi...@apache.org>.

tanruixiang commented on issue #6784:
URL: https://github.com/apache/arrow-datafusion/issues/6784#issuecomment-1616345905

   > I would love to know what you think about dict_id handling in general -- from what I can see so far it is not well supported in arrow-rs. We have similar problems with `metadata` which can be hung off a schema or a field and gets lost frequently
   > 
   > I am also not 100% clear if dict_id is supposed to (potentially) different per record batch or if it would be the same for the entire plan
   > 
   > One thing that might be possible is to compare the pointer for the dictionary array to decide if it was the same dictionary rather than trying to keep `dict_id` all the way through the plan 🤔
   
   Thank you very much for your reply. I think that after providing a reasonable `provider`, `dict_id` should not be lost when building the plan, because `dict_id` and other variables such as `name` are located at the same level(In `pub struct Field`), and since `name` will not be lost, `dict_id` should not be lost either.
   At the same time we get the `name`, we can get the `dict_id`, so we need to keep the `dict_id` at least for the whole process of building the plan. (In particular, I think it's always better to keep the `dict_id` as much as possible than to just discard it and use the `default value`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #6784: The `dict_id` was lost when constructing the logic plan.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #6784:
URL: https://github.com/apache/arrow-datafusion/issues/6784#issuecomment-1629020588

   Got it -- thank you for the clarification @tanruixiang 
   
   > Of course, this is a relatively big feature.
   
   I agree -- I think the right place to start might be in [arrow-rs](https://github.com/apache/arrow-rs) (aka to where the kernels that implement `Eq` are) and then once we have a good story about how that will work we can update/adjust DataFusion to account for them


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org