Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/19 23:21:21 UTC

[GitHub] [arrow-ballista] andygrove opened a new issue, #18: Ballista: Executor must return statistics in CompletedTask / CompletedJob

andygrove opened a new issue, #18:
URL: https://github.com/apache/arrow-ballista/issues/18

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   We cannot fix the shuffle mechanism until we have partition stats, or ShuffleReaderExec will attempt to read empty partitions, causing an error.
   
   **Describe the solution you'd like**
   The scheduler should receive partition stats and only try to read from non-empty shuffle partitions.
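
A minimal sketch of the filtering step described above, written in Python for brevity rather than in Ballista's own code. `PartitionStats` and `non_empty_partitions` are hypothetical names used only for illustration; they are not Ballista's actual types or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionStats:
    """Hypothetical per-partition stats an executor might report."""
    partition_id: int
    num_rows: int
    num_bytes: int


def non_empty_partitions(stats):
    """Return the ids of partitions that contain at least one row."""
    return [s.partition_id for s in stats if s.num_rows > 0]


# Example: partition 1 is empty, so the reader would never be asked to fetch it.
reported = [
    PartitionStats(partition_id=0, num_rows=120, num_bytes=4096),
    PartitionStats(partition_id=1, num_rows=0, num_bytes=0),
    PartitionStats(partition_id=2, num_rows=7, num_bytes=512),
]
to_read = non_empty_partitions(reported)
```

With stats like these, the scheduler would hand only partitions 0 and 2 to the shuffle reader.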
   
   **Describe alternatives you've considered**
   As a workaround we could write empty shuffle files for empty partitions.
   
   **Additional context**
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] mingmwang commented on issue #18: Ballista: Executor must return statistics in CompletedTask / CompletedJob

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #18:
URL: https://github.com/apache/arrow-ballista/issues/18#issuecomment-1236326407

   Another improvement I can think of: if the shuffle data is colocated on the same host as the shuffle reader, we should allow the reader to read from disk directly (LocalShuffle Reader) instead of fetching it over remote RPC.
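
A rough sketch of the decision this comment suggests, in Python for brevity. `ShuffleLocation` and `choose_read_mode` are hypothetical names, and comparing host names is an assumed colocation check, not how Ballista necessarily identifies executors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShuffleLocation:
    """Hypothetical description of where one shuffle partition was written."""
    host: str
    path: str


def choose_read_mode(location, reader_host):
    """Return "local" when the data sits on the reader's own host, else "remote"."""
    return "local" if location.host == reader_host else "remote"


# Example: one partition on the reader's host, one on another executor.
same_host = ShuffleLocation(host="executor-1", path="/tmp/shuffle/part-0")
other_host = ShuffleLocation(host="executor-2", path="/tmp/shuffle/part-1")
modes = [choose_read_mode(loc, "executor-1") for loc in (same_host, other_host)]
```

The local branch would open `path` directly, while the remote branch would keep the existing RPC fetch.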




[GitHub] [arrow-ballista] metesynnada commented on issue #18: Ballista: Executor must return statistics in CompletedTask / CompletedJob

Posted by GitBox <gi...@apache.org>.
metesynnada commented on issue #18:
URL: https://github.com/apache/arrow-ballista/issues/18#issuecomment-1260944301

   @mingmwang Do you need help? I can help with the implementation.




[GitHub] [arrow-ballista] mingmwang commented on issue #18: Ballista: Executor must return statistics in CompletedTask / CompletedJob

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #18:
URL: https://github.com/apache/arrow-ballista/issues/18#issuecomment-1260992548

   @thinkharderdev 




[GitHub] [arrow-ballista] Ted-Jiang commented on issue #18: Ballista: Executor must return statistics in CompletedTask / CompletedJob

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #18:
URL: https://github.com/apache/arrow-ballista/issues/18#issuecomment-1261865081

   I updated tpch-q3 to test this, using `o_shippriority = 'none'` to try to reproduce it:
   ```
   select
       l_orderkey,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       o_orderdate,
       o_shippriority
   from
       customer,
       orders,
       lineitem
   where
           c_mktsegment = 'BUILDING'
     and c_custkey = o_custkey
     and l_orderkey = o_orderkey
     and o_orderdate < date '1995-03-15'
     and o_shippriority = 'none'
   group by
       l_orderkey,
       o_orderdate,
       o_shippriority
   order by
       revenue desc,
       o_orderdate;
   0 rows in set. Query took 11.347 seconds.
   
   ```
   
   @andygrove how did you produce this error?




[GitHub] [arrow-ballista] mingmwang commented on issue #18: Ballista: Executor must return statistics in CompletedTask / CompletedJob

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #18:
URL: https://github.com/apache/arrow-ballista/issues/18#issuecomment-1260991174

   Having CompletedTask return the partition stats is quite heavy. Imagine we have 1000 map tasks and 1000 reduce tasks (partitions = 1000): the stats will grow to 1M entries.
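
The arithmetic behind that estimate can be sketched as follows. The 24 bytes per entry is purely an assumption for illustration (three 8-byte counters per partition); the real encoded size of a stats entry is not stated in this thread.

```python
def stats_entries(map_tasks, shuffle_partitions):
    """Each map task reports one stats entry per output shuffle partition."""
    return map_tasks * shuffle_partitions


def stats_bytes(map_tasks, shuffle_partitions, bytes_per_entry):
    """Total payload size, given an assumed size per stats entry."""
    return stats_entries(map_tasks, shuffle_partitions) * bytes_per_entry


entries = stats_entries(1000, 1000)
# ASSUMPTION: 24 bytes per entry (three 8-byte counters), for illustration only.
total_bytes = stats_bytes(1000, 1000, bytes_per_entry=24)
```

The entry count grows with the product of map tasks and partitions, so doubling both quadruples what the scheduler has to receive and keep.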




[GitHub] [arrow-ballista] mingmwang commented on issue #18: Ballista: Executor must return statistics in CompletedTask / CompletedJob

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #18:
URL: https://github.com/apache/arrow-ballista/issues/18#issuecomment-1260987026

   @metesynnada @andygrove 
   
   Sorry, I have not had a chance to look into this. Regarding the issue, I'm not sure we want to fix it this way. In my opinion, even when reading empty partitions, ShuffleReaderExec should not return an error or cause any data quality issue.
   We will do some testing and see what the specific error is.
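
As an illustration of the behaviour argued for here, the sketch below shows a reader that treats a missing or zero-length partition file as an empty result instead of an error. It is Python over plain text files, not ShuffleReaderExec itself, and `read_partition` is a hypothetical helper.

```python
import os


def read_partition(path):
    """Yield records from a partition file; yield nothing if it is missing or empty."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return  # an empty partition is a valid, empty result
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")
```

A caller can then concatenate the results of every partition without special-casing the empty ones.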
   
   @yahoNanJing
   




[GitHub] [arrow-ballista] mingmwang commented on issue #18: Ballista: Executor must return statistics in CompletedTask / CompletedJob

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #18:
URL: https://github.com/apache/arrow-ballista/issues/18#issuecomment-1236326221

   I can work on this improvement. 
   

