You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by "icexelloss (via GitHub)" <gi...@apache.org> on 2023/04/05 19:09:23 UTC

[GitHub] [arrow] icexelloss opened a new issue, #34908: [C++] Switching the column ordering of "segment key" and "hash key"

icexelloss opened a new issue, #34908:
URL: https://github.com/apache/arrow/issues/34908

   ### Describe the enhancement requested
   
   Recently, this PR changes the ordering output columns of aggregation to
   `(hash)keys + segment_keys + aggregates`
   (https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/aggregate_node.cc#L654)
   
   I wondering it makes more sense to change this to
   `segment_keys + (hash)keys + aggregates`
   
   My reasoning is as follows:
   This only really affects the result in the case of segmented aggregation so I will solely reasoning about segmented aggregation (in non-segmented aggregation, segment_keys is empty so the result wouldn't change regardless).
   
   In segmented aggregation, the hash key is used within each segment (i.e., we don't combine grouped from different segments together) so in a way, the hash key feels like "second key", i.e., data is first split up by segments, then by hash keys. And therefore, it feels more natural to order the output columns: primary key (segment keys) + secondary key (hash keys) + aggregates. 
   
   To give an example of time ordered aggregations (an application of segmented aggregations). where segment key = "time" and hash key = "id", we almost always wanted the output to be "time, id" instead of "id, time" because the full stream is time ordered so it makes sense to make that the first column.
   
   @westonpace I wonder what is your thoughts on switch this up? If you feel strongly about keeping the current order I think we might still be able to tweak this at the substrait level with emit output to get the order we want, but I figured it's probably easier / more consistent if we just switch up the default ordering.
   
   
    
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] icexelloss closed issue #34908: [C++] Switching the column ordering of "segment key" and "hash key"

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.

icexelloss closed issue #34908: [C++] Switching the column ordering of "segment key" and "hash key"
URL: https://github.com/apache/arrow/issues/34908


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] icexelloss commented on issue #34908: [C++] Switching the column ordering of "segment key" and "hash key"

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.

icexelloss commented on issue #34908:
URL: https://github.com/apache/arrow/issues/34908#issuecomment-1501785067

   Thanks Weston!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #34908: [C++] Switching the column ordering of "segment key" and "hash key"

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #34908:
URL: https://github.com/apache/arrow/issues/34908#issuecomment-1500355286

   I'm fine with this switch.  I had no idea which should come first so I left the relative ordering the same when I changed `measures | keys | segment_keys` to `keys | segment_keys | measures`.  However, your reasoning seems pretty solid to me.
   
   Another option, though trickier, would be to keep `keys` and `segment_keys` ordered as they were in the input schema.  So, for example, if the user has `a, b, c, d, e, f` and `a` and `c` are segment_keys and `b` is a `key` then you could have the output `a, b, c,`.
   
   As long as `time` was first in the input then `time` would be first in the output.  I'm +1 either way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org