Posted to dev@quickstep.apache.org by J Patel <jm...@gmail.com> on 2016/08/23 01:23:52 UTC

List of potential work to do on Quickstep

Hi folks,

Here is a list of features that would be good for the community to work on.
Feel free to add or comment on this list.

1: Improve handling of aggregation: Aggregate handling in Quickstep is slow
because a separate hash table is built for each aggregate. PR
https://github.com/apache/incubator-quickstep/pull/90 is a step toward fixing
this, but there is more to be done, including increasing the space
efficiency of the hash table, improving the finalize operation (which is
single-threaded), and considering partitioning (so that finalize can be
parallelized).
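
To make the partitioning idea in item 1 concrete, here is a minimal sketch.
It is not Quickstep code: AggValues, Partition, BuildPartitions, and
FinalizePartition are hypothetical names, and the build phase is kept
single-threaded for brevity. It assumes one hash table entry holds all
aggregates for a group, and that group keys are hash-partitioned so each
partition can be finalized on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-group state: SUM and COUNT live in the same hash table
// entry, instead of one hash table per aggregate.
struct AggValues {
  std::int64_t sum = 0;
  std::int64_t count = 0;
};

// One partition is one hash table from group key to its aggregate values.
using Partition = std::unordered_map<std::int64_t, AggValues>;

struct FinalRow {
  std::int64_t key;
  std::int64_t sum;
  std::int64_t count;
};

// Build phase: route each (key, value) pair to a partition by hashing the
// key, so a given key only ever appears in one partition.
std::vector<Partition> BuildPartitions(
    const std::vector<std::pair<std::int64_t, std::int64_t>> &input,
    std::size_t num_partitions) {
  assert(num_partitions > 0);
  std::vector<Partition> partitions(num_partitions);
  for (const auto &kv : input) {
    const std::size_t p = std::hash<std::int64_t>{}(kv.first) % num_partitions;
    AggValues &agg = partitions[p][kv.first];
    agg.sum += kv.second;
    agg.count += 1;
  }
  return partitions;
}

// Finalize phase for a single partition. It reads only its own partition,
// so calls for different partitions share no state and could be handed to
// different worker threads.
std::vector<FinalRow> FinalizePartition(const Partition &partition) {
  std::vector<FinalRow> rows;
  rows.reserve(partition.size());
  for (const auto &entry : partition) {
    rows.push_back({entry.first, entry.second.sum, entry.second.count});
  }
  return rows;
}
```

Because the partitions hold disjoint keys, no merge step is needed after
finalize; the per-partition outputs can simply be concatenated.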

2: The use of ColumnVectors is very expensive, as it involves a full extra
read and write of the data and results in a bad memory access pattern. That
design needs to be rethought and refactored. Nav has suggested using an
iterator model vs. accessors, and that is a good idea. We can probably go
beyond that and think of defining patterns for taking an input, applying a
predicate, and applying a projection (copy). Any ideas here are welcome.
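
As one possible shape for the "input, predicate, projection" pattern in
item 2, here is a rough sketch. It is only an illustration under my own
assumptions: SelectProject and CopyGreaterThan are hypothetical names, not
existing Quickstep APIs, and a plain std::vector stands in for a storage
block. The point is that the predicate and the projection run in a single
pass over the input iterator, with no intermediate column materialized in
between.

```cpp
#include <cstdint>
#include <iterator>
#include <vector>

// Fused select + project: each input value is read once, tested, and (if it
// qualifies) projected straight into the output iterator.
template <typename InputIt, typename Predicate, typename Projection,
          typename OutputIt>
OutputIt SelectProject(InputIt first, InputIt last, Predicate pred,
                       Projection proj, OutputIt out) {
  for (; first != last; ++first) {
    if (pred(*first)) {
      *out++ = proj(*first);
    }
  }
  return out;
}

// Example instantiation: keep the values greater than `threshold` and copy
// them to the output unchanged (an identity projection).
std::vector<std::int64_t> CopyGreaterThan(
    const std::vector<std::int64_t> &column, std::int64_t threshold) {
  std::vector<std::int64_t> output;
  SelectProject(column.begin(), column.end(),
                [threshold](std::int64_t v) { return v > threshold; },
                [](std::int64_t v) { return v; },
                std::back_inserter(output));
  return output;
}
```

Since the predicate and projection are template parameters, the compiler can
inline both into the loop, which is the main attraction over going through a
materialized intermediate.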

3: We have Bloom filters, and they need to be optimized to work with joins.
Jianqiao is working on this.
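
For readers less familiar with the idea in item 3, here is a small sketch of
how a Bloom filter can cut down hash table probes in a join. This is not the
Quickstep implementation or Jianqiao's design: BloomFilter, ProbeStats, and
ProbeWithBloomFilter are hypothetical names, the two hash functions are an
arbitrary 64-bit mixer chosen for illustration, and the build side is
assumed to have unique keys.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// A tiny Bloom filter with two hash functions over a bit vector.
class BloomFilter {
 public:
  explicit BloomFilter(std::size_t num_bits) : bits_(num_bits, false) {
    assert(num_bits > 0);
  }

  void Insert(std::uint64_t key) {
    bits_[Hash1(key) % bits_.size()] = true;
    bits_[Hash2(key) % bits_.size()] = true;
  }

  // False means "definitely absent"; true means "possibly present".
  bool MayContain(std::uint64_t key) const {
    return bits_[Hash1(key) % bits_.size()] &&
           bits_[Hash2(key) % bits_.size()];
  }

 private:
  // Arbitrary 64-bit mixing function, used here only for illustration.
  static std::uint64_t Hash1(std::uint64_t x) {
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
  }
  // Second hash: the same mixer applied to a perturbed key.
  static std::uint64_t Hash2(std::uint64_t x) {
    return Hash1(x ^ 0x9e3779b97f4a7c15ULL);
  }

  std::vector<bool> bits_;
};

struct ProbeStats {
  std::size_t matches = 0;
  std::size_t hash_table_lookups = 0;
};

// Build a filter and a hash table on the build side, then probe. Probe keys
// rejected by the filter never touch the hash table.
ProbeStats ProbeWithBloomFilter(const std::vector<std::uint64_t> &build_keys,
                                const std::vector<std::uint64_t> &probe_keys,
                                std::size_t num_bits) {
  BloomFilter filter(num_bits);
  std::unordered_set<std::uint64_t> hash_table;
  for (std::uint64_t key : build_keys) {
    filter.Insert(key);
    hash_table.insert(key);
  }
  ProbeStats stats;
  for (std::uint64_t key : probe_keys) {
    if (!filter.MayContain(key)) {
      continue;  // Definitely no match; skip the more expensive lookup.
    }
    ++stats.hash_table_lookups;
    if (hash_table.count(key) > 0) {
      ++stats.matches;
    }
  }
  return stats;
}
```

The filter can return false positives but never false negatives, so the
match count is unaffected; only the number of hash table lookups changes.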

4: Error handling in the system can be improved. Here we need to decide
whether we want to use error return codes or the C++ throw/catch mechanism.
Right now we use a mix of both. I am starting to lean toward throw/catch, as
that way we at least have a way of catching the error at the top (rather
than crashing). We can then refactor the code to add entire throw/catch
chains. Right now the most serious gap in error handling, IMHO, is
when we are loading a large file and there is a corrupted tuple near the
end. The system crashes after making the user wait, and there is no
cleanup.
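
To illustrate the throw/catch direction in item 4 for the bulk-load case,
here is a minimal sketch. Everything in it is hypothetical (MalformedTuple,
ParseTuple, LoadAll, RunLoadCommand), and a tuple is simplified to one
integer per line. It assumes parsed tuples are staged and only appended to
the table once the whole input has parsed, so a corrupted tuple near the end
leaves the table untouched and the error is reported at the top level
instead of crashing.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical exception type for a tuple that cannot be parsed.
class MalformedTuple : public std::runtime_error {
 public:
  MalformedTuple(std::size_t line_number, const std::string &text)
      : std::runtime_error("malformed tuple at line " +
                           std::to_string(line_number) + ": " + text) {}
};

// Parse one integer from a line; throws MalformedTuple on any bad input.
int ParseTuple(const std::string &line, std::size_t line_number) {
  std::size_t consumed = 0;
  int value = 0;
  try {
    value = std::stoi(line, &consumed);
  } catch (const std::exception &) {
    throw MalformedTuple(line_number, line);
  }
  if (consumed != line.size()) {
    throw MalformedTuple(line_number, line);
  }
  return value;
}

// Parse the whole input into a staging area, then append to the table. If
// parsing throws, the staging vector is destroyed during stack unwinding and
// the table is never modified.
void LoadAll(std::istream &input, std::vector<int> *table) {
  std::vector<int> staged;
  std::string line;
  std::size_t line_number = 0;
  while (std::getline(input, line)) {
    ++line_number;
    staged.push_back(ParseTuple(line, line_number));
  }
  table->insert(table->end(), staged.begin(), staged.end());
}

// Top-level handler: returns an empty string on success, or the error
// message if anything below threw.
std::string RunLoadCommand(std::istream &input, std::vector<int> *table) {
  try {
    LoadAll(input, table);
    return "";
  } catch (const std::exception &e) {
    return e.what();
  }
}
```

Staging everything in memory is obviously not what we would do for a large
file; the sketch only shows where the throw, the cleanup, and the single
top-level catch would sit.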

5: Our type system also needs major surgery to make it easier to add new
types. Clean UDF support is also missing.

Other thoughts?

Cheers,
Jignesh

Re: List of potential work to do on Quickstep

Posted by Harshad Deshmukh <ha...@cs.wisc.edu>.

Hi Jignesh,

Thanks for sending the list. I want to share an update on point 1.

At present I am working on partitioned aggregation, which builds on top
of the QUICKSTEP-28 and QUICKSTEP-29 JIRA issues. As the first step toward
this goal, I have created the QUICKSTEP-43 JIRA issue (and a corresponding
GitHub PR), in which we create a new operator to destroy the aggregation
state (similar to the destroy hash table operator). This operator will be
useful once the finalize step in aggregation is parallel, because the shared
state can then only be destroyed after the finalize phase is complete.
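
The ordering constraint can be illustrated with a small sketch. To be clear,
this is not the operator from QUICKSTEP-43, which is a separate operator in
the query plan; it is a simplified countdown under my own assumptions, with
hypothetical names (AggregationState, FinalizeTracker), that only shows why
the shared state must outlive every finalize task.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for the shared aggregation state read by all finalize tasks.
struct AggregationState {
  std::vector<int> placeholder_data;
};

// Owns the shared state and destroys it only after the last finalize task
// has reported completion.
class FinalizeTracker {
 public:
  FinalizeTracker(std::unique_ptr<AggregationState> state,
                  std::size_t num_finalize_tasks)
      : state_(std::move(state)), remaining_(num_finalize_tasks) {
    assert(num_finalize_tasks > 0);
  }

  // Called once by each finalize task when it is done. Returns true for the
  // one call that observes the count reaching zero and destroys the state.
  bool MarkTaskFinalized() {
    if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      state_.reset();
      return true;
    }
    return false;
  }

  // Intended for single-threaded inspection (for example, in a test).
  bool state_alive() const { return state_ != nullptr; }

 private:
  std::unique_ptr<AggregationState> state_;
  std::atomic<std::size_t> remaining_;
};
```

With a single-threaded finalize, the state can be torn down right after the
one finalize call returns; with several tasks, something has to wait for all
of them, which is the role the new operator plays in the plan.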

-- 
Thanks,
Harshad