You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/25 04:36:19 UTC

[GitHub] [arrow] sanjibansg opened a new pull request, #14491: ARROW-17923: [C++] Consider dictionary arrays for special fragment fields

sanjibansg opened a new pull request, #14491:
URL: https://github.com/apache/arrow/pull/14491

   This PR modifies the `__filename`  field in a dataset fragment to be a `DictionaryScalar` of string instead of a string field.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- commented on pull request #14491: ARROW-17923: [C++] Consider dictionary arrays for special fragment fields

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- commented on PR #14491:
URL: https://github.com/apache/arrow/pull/14491#issuecomment-1490645549

   Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- closed pull request #14491: ARROW-17923: [C++] Consider dictionary arrays for special fragment fields

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- closed pull request #14491: ARROW-17923: [C++] Consider dictionary arrays for special fragment fields
URL: https://github.com/apache/arrow/pull/14491


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #14491: ARROW-17923: [C++] Consider dictionary arrays for special fragment fields

Posted by GitBox <gi...@apache.org>.
wjones127 commented on PR #14491:
URL: https://github.com/apache/arrow/pull/14491#issuecomment-1315719161

   > Perhaps a more general fix would be that, when we broadcast a scalar, if the type is a binary data type, we could always broadcast it into a dictionary array.
   
   I worry that doing this might mean the results of `ExecBatch::ToRecordBatch` would return a batch with an unexpected schema, if we transform it like that. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14491: ARROW-17923: [C++] Consider dictionary arrays for special fragment fields

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14491:
URL: https://github.com/apache/arrow/pull/14491#issuecomment-1297630322

   I'm probably missing something, but does this get turned into a dictionary array at some point? Returning dictionary scalars isn't really useful... @westonpace 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14491: ARROW-17923: [C++] Consider dictionary arrays for special fragment fields

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14491:
URL: https://github.com/apache/arrow/pull/14491#issuecomment-1289970529

   https://issues.apache.org/jira/browse/ARROW-17923


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #14491: ARROW-17923: [C++] Consider dictionary arrays for special fragment fields

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #14491:
URL: https://github.com/apache/arrow/pull/14491#issuecomment-1297795472

   Yes, it gets broadcast to an array before it is sent out.  Although your point is a valid one, a dictionary scalar is probably a smell of some kind.  I had not been thinking of it this way initially.  Perhaps a more general fix would be that, when we broadcast a scalar, if the type is a binary data type, we could always broadcast it into a dictionary array.
   
   I think it might be worth an attempt to play around with this idea a bit.  I suspect we might run into problems with columns that may or may not be dictionary.  For example, if a column happens to be a partition column, we can represent it as a scalar.  However, we don't necessarily know if a column is a partition column or not when we are constructing the plan, and we might bind kernels thinking the type is a normal type and then suddenly get a dictionary array.
   
   So __filename is a little bit special in that it is the only field that easily know will always be a scalar.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #14491: ARROW-17923: [C++] Consider dictionary arrays for special fragment fields

Posted by GitBox <gi...@apache.org>.
wjones127 commented on PR #14491:
URL: https://github.com/apache/arrow/pull/14491#issuecomment-1315732686

   Looking at the R test failures, it seems like this change is blocked because hash join and aggregate don't support unifying dictionaries. Thinking about it, having to do that might be expensive, since that requires re-mapping indices each time.
   
   Instead, perhaps in `MakeScanNode` we could pre-compute the dictionary buffer and pass that to the generator that processes each fragment. @sanjibansg do you want to try that?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org