Posted to dev@phoenix.apache.org by "James Taylor (JIRA)" <ji...@apache.org> on 2015/04/14 22:05:58 UTC

[jira] [Comment Edited] (PHOENIX-1779) Parallelize fetching of next batch of records for scans corresponding to queries with no order by

    [ https://issues.apache.org/jira/browse/PHOENIX-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494751#comment-14494751 ] 

James Taylor edited comment on PHOENIX-1779 at 4/14/15 8:05 PM:
----------------------------------------------------------------

bq. Having two parallel arrays sounds more complicated than maintaining a map, IMHO. 
But you don't need a map. You've got an index that will get you exactly what you need. It'd be like using a Map<Integer,Object> where the key of the Map is the index. Sure, it'll work to do a map.get(3) to get the fourth element, but so would an array[3] or a list.get(3). If you don't want to do parallel arrays, then do a List<Pair<PeekingResultIterator,Integer>> or, maybe clearer, a List<RoundRobinIteratorState> where RoundRobinIteratorState is a class with two member variables: PeekingResultIterator iterator and int rowsRead.
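
To make the List<RoundRobinIteratorState> suggestion concrete, here is a minimal sketch, not the actual patch. Assumptions: a plain java.util.Iterator stands in for PeekingResultIterator so the example compiles without Phoenix on the classpath, and the names RoundRobinSketch and drain are hypothetical. The point is only that a positional states.get(i) gives you both the iterator and its row count, with no map and no parallel arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class RoundRobinSketch {

    // Holder pairing one iterator with the number of rows read from it.
    // Assumption: in Phoenix the field would be a PeekingResultIterator;
    // java.util.Iterator is used here only to keep the sketch self-contained.
    static final class RoundRobinIteratorState<T> {
        final Iterator<T> iterator;
        int rowsRead;

        RoundRobinIteratorState(Iterator<T> iterator) {
            this.iterator = iterator;
        }
    }

    // Hypothetical helper: takes one row from each iterator in turn until all
    // are exhausted. Each state is looked up by its list index.
    static <T> List<T> drain(List<RoundRobinIteratorState<T>> states) {
        List<T> out = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (int i = 0; i < states.size(); i++) {
                RoundRobinIteratorState<T> state = states.get(i); // index lookup, no map
                if (state.iterator.hasNext()) {
                    out.add(state.iterator.next());
                    state.rowsRead++;
                    progressed = true;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<RoundRobinIteratorState<String>> states = new ArrayList<>();
        states.add(new RoundRobinIteratorState<>(Arrays.asList("a1", "a2", "a3").iterator()));
        states.add(new RoundRobinIteratorState<>(Arrays.asList("b1").iterator()));
        System.out.println(drain(states));
        System.out.println(states.get(0).rowsRead + " rows from iterator 0");
    }
}
```

The List<Pair<PeekingResultIterator,Integer>> variant would behave the same way; the named class just makes the two fields self-describing and lets rowsRead be updated in place.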


was (Author: jamestaylor):
bq. Having two parallel arrays sounds more complicated than maintaining a map, IMHO. 
But you don't need a map. You've got an index that will get you exactly what you need. If you don't want to do parallel arrays, then do a List<Pair<PeekingResultIterator,Integer>> or, maybe clearer, a List<RoundRobinIteratorState> where RoundRobinIteratorState is a class with two member variables: PeekingResultIterator iterator and int rowsRead.

> Parallelize fetching of next batch of records for scans corresponding to queries with no order by 
> --------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-1779
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1779
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Samarth Jain
>            Assignee: Samarth Jain
>         Attachments: PHOENIX-1779.patch, wip.patch, wip3.patch, wipwithsplits.patch
>
>
> Today in Phoenix we parallelize only the first execution of scans, i.e. we load only the first batch of records, up to the scan's cache size, in parallel. Loading of subsequent batches of records in scanners is essentially serial. This could be improved, especially for queries that do not need any kind of merge sort on the client, including the ones with no order by clause. This could also potentially improve the performance of UPSERT SELECT statements that load data from one table and insert into another; one such use case is creating immutable indexes for tables that already have data. It could also potentially improve the performance of our MapReduce solution for bulk loading data by improving the speed of the loading/mapping phase.


