Posted to dev@spark.apache.org by Xuelin Cao <xu...@gmail.com> on 2015/01/20 03:27:08 UTC

Will Spark-SQL support vectorized query engine someday?

Hi,

     Correct me if I am wrong. It looks like the current version of
Spark-SQL uses a *tuple-at-a-time* model. Basically, each physical operator
produces one tuple at a time by recursively calling child->execute.

     There are papers that illustrate the benefits of a vectorized query
engine, and Hive-Stinger also embraces this style.

     So, the question is: will Spark-SQL support vectorized query execution
someday?

     Thanks
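
For readers unfamiliar with the term, tuple-at-a-time execution can be
sketched roughly as follows. This is not Spark SQL code: the Scan and Filter
classes and the execute() method are hypothetical names chosen only to
mirror the child->execute pattern described above.

```python
from typing import Callable, Iterable, Iterator, Tuple

Row = Tuple[int, str]


class Scan:
    """Hypothetical leaf operator: yields rows of an in-memory table one by one."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows = rows

    def execute(self) -> Iterator[Row]:
        for row in self.rows:
            yield row


class Filter:
    """Hypothetical operator: pulls one tuple at a time from its child."""

    def __init__(self, child: Scan, predicate: Callable[[Row], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def execute(self) -> Iterator[Row]:
        # Recursively ask the child for tuples; the predicate is invoked
        # once per tuple, which is the per-row overhead in this model.
        for row in self.child.execute():
            if self.predicate(row):
                yield row


table = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
plan = Filter(Scan(table), lambda row: row[0] % 2 == 0)
result = list(plan.execute())
```

Each tuple passes through one call per operator in the plan, so the call
overhead grows with both the row count and the depth of the plan.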

Re: Will Spark-SQL support vectorized query engine someday?

Posted by Reynold Xin <rx...@databricks.com>.

I don't know if there is a list, but in general running a performance
profiler can identify a lot of things...

On Tue, Jan 20, 2015 at 12:30 AM, Xuelin Cao <xu...@gmail.com>
wrote:

>
> Thanks, Reynold
>
>       Regarding the "lower-hanging fruit", can you give me some examples?
> Where can I find them in JIRA?

Re: Will Spark-SQL support vectorized query engine someday?

Posted by Xuelin Cao <xu...@gmail.com>.

Thanks, Reynold

      Regarding the "lower-hanging fruit", can you give me some examples?
Where can I find them in JIRA?


On Tue, Jan 20, 2015 at 3:55 PM, Reynold Xin <rx...@databricks.com> wrote:

> It will probably eventually make its way into part of the query engine,
> one way or another. Note that there is in general a lot of other
> lower-hanging fruit to pick before you have to do vectorization.
>
> As far as I know, Hive doesn't really have vectorization, because the
> vectorization in Hive is simply processing everything in small batches in
> order to avoid the virtual function call overhead, and hoping the JVM can
> unroll some of the loops. There is no SIMD involved.
>
> Something that is pretty useful, which isn't exactly vectorization but
> comes from similar lines of research, is being able to push predicates
> down into the columnar compression encoding. For example, one can turn
> string comparisons into integer comparisons. This will probably give much
> larger performance improvements in common queries.

Re: Will Spark-SQL support vectorized query engine someday?

Posted by Reynold Xin <rx...@databricks.com>.

It will probably eventually make its way into part of the query engine, one
way or another. Note that there is in general a lot of other lower-hanging
fruit to pick before you have to do vectorization.

As far as I know, Hive doesn't really have vectorization, because the
vectorization in Hive is simply processing everything in small batches in
order to avoid the virtual function call overhead, and hoping the JVM can
unroll some of the loops. There is no SIMD involved.
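
The batch-at-a-time style described above can be sketched as follows. This
is only an illustration of the idea, not Hive's implementation: the
function names scan_batches and filter_greater_than and the BATCH_SIZE
value are hypothetical, and Python is used purely for readability, so
nothing here says anything about JVM loop unrolling.

```python
from typing import Iterable, Iterator, List

BATCH_SIZE = 3  # hypothetical batch size, kept tiny for illustration


def scan_batches(values: List[int], batch_size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Hand out a column in fixed-size batches instead of one value at a time."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def filter_greater_than(batches: Iterable[List[int]], threshold: int) -> Iterator[List[int]]:
    """Apply the predicate in a tight loop over each batch.

    The operator is entered once per batch rather than once per row, which
    is where the saving in call overhead comes from.
    """
    for batch in batches:
        yield [value for value in batch if value > threshold]


values = [5, 12, 7, 20, 3, 15, 9]
out_batches = list(filter_greater_than(scan_batches(values), 8))
flattened = [value for batch in out_batches for value in batch]
```

The inner loop runs over a plain array of values with no per-row operator
call, which is the part a JIT compiler has a chance to optimize.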

Something that is pretty useful, which isn't exactly vectorization but comes
from similar lines of research, is being able to push predicates down into
the columnar compression encoding. For example, one can turn string
comparisons into integer comparisons. This will probably give much larger
performance improvements in common queries.
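
A minimal sketch of that last idea, assuming a simple dictionary encoding of
a string column. The helper names dictionary_encode and filter_equals are
hypothetical and do not correspond to any Spark or Hive API.

```python
from typing import Dict, List, Tuple


def dictionary_encode(column: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Replace each distinct string with a small integer code."""
    dictionary: Dict[str, int] = {}
    codes: List[int] = []
    for value in column:
        # A new string gets the next unused code; a known string reuses its code.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


def filter_equals(dictionary: Dict[str, int], codes: List[int], literal: str) -> List[int]:
    """Evaluate `column = literal` on the encoded data; returns matching row indices."""
    target = dictionary.get(literal)
    if target is None:
        # The literal never occurs in the column, so no row can match.
        return []
    # One string lookup up front, then only integer comparisons per row.
    return [index for index, code in enumerate(codes) if code == target]


cities = ["paris", "tokyo", "paris", "lima", "tokyo", "paris"]
dictionary, codes = dictionary_encode(cities)
matching_rows = filter_equals(dictionary, codes, "paris")
```

The string literal is translated to its code once, and a literal absent from
the dictionary lets the scan skip the column entirely.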

On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao <xu...@gmail.com> wrote:

> Hi,
>
>      Correct me if I am wrong. It looks like the current version of
> Spark-SQL uses a *tuple-at-a-time* model. Basically, each physical operator
> produces one tuple at a time by recursively calling child->execute.
>
>      There are papers that illustrate the benefits of a vectorized query
> engine, and Hive-Stinger also embraces this style.
>
>      So, the question is: will Spark-SQL support vectorized query execution
> someday?
>
>      Thanks
>