Posted to dev@spark.apache.org by Patrick Woody <pa...@gmail.com> on 2015/03/28 16:26:44 UTC

Lazy casting with Catalyst

Hi all,

In my application, we take input from Parquet files where BigDecimals are
written as Strings to maintain arbitrary precision.

I was hoping to convert these back to Decimal with Unlimited
precision, but I'd still like to maintain the Parquet column pruning (all
my attempts thus far seem to bring in the whole Row). Is it possible to do
this lazily through Catalyst?

Basically I'd want to do Cast(col, DecimalType()) whenever col is actually
referenced. Any tips on how to approach this would be appreciated.
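In DataFrame terms, I'd like plain references to the column to behave as
if each one carried an explicit cast, something like this (a rough sketch
against the Spark 1.3 API; the path and column names are made up, and
parquetFile / DecimalType.Unlimited are my guesses at the 1.3 names):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.DecimalType

    // Columns a, b, c, where b holds the stringified BigDecimals.
    val df = sqlContext.parquetFile("/data/input")

    // What I want implicitly on every reference to b:
    df.select(min($"b".cast(DecimalType.Unlimited))).show()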

Thanks!
-Pat

Re: Lazy casting with Catalyst

Posted by Patrick Woody <pa...@gmail.com>.
So it looks like this was actually a combination of out-of-date
artifacts and debugging still needed on my part. Ripping the logic out
and testing it in spark-shell works fine, so it is likely something
upstream in my application that causes it to pull in the whole Row.
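
The spark-shell check amounts to something like this (a sketch; the path
and schema are placeholders, and the exact plan output differs between
versions):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.DecimalType

    val df = sqlContext.parquetFile("/tmp/test.parquet")  // columns a, b, c
    val casted = df.select($"a", $"b".cast(DecimalType.Unlimited).as("b"), $"c")

    // With pruning intact, the Parquet scan under this count should not
    // need to materialize all three columns.
    casted.select(count(lit(1))).explain()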

Thanks!
-Pat

On Sat, Mar 28, 2015 at 12:34 PM, Cheng Lian <li...@gmail.com> wrote:

>
> On 3/29/15 12:26 AM, Patrick Woody wrote:
>
>  Hey Cheng,
>
>  I didn't mean that Catalyst casting was eager, just that my approaches
> thus far seem to have been. Maybe I should give a concrete example?
>
> I have columns A, B, C where B is saved as a String, but I'd like all
> references to B to go through a Cast to Decimal regardless of the code used
> on the SchemaRDD. So if someone does a min(B), it uses Decimal ordering
> instead of String ordering.
>
>  One approach I took was to select everything, with the casts on certain
> columns, but when I then did a count(literal(1)) on top of that RDD it
> seemed to bring in the whole row.
>
> What version of Spark SQL are you using? Would you mind providing a brief
> snippet that can reproduce this issue? This might be a bug depending on
> your concrete usage. Thanks in advance!
>
>
>  Thanks!
> -Pat
>
> On Sat, Mar 28, 2015 at 11:35 AM, Cheng Lian <li...@gmail.com>
> wrote:
>
>> Hi Pat,
>>
>> I don't understand what "lazy casting" means here. Why do you think
>> current Catalyst casting is "eager"? Casting happens at runtime, and
>> doesn't disable column pruning.
>>
>> Cheng
>>
>>
>> On 3/28/15 11:26 PM, Patrick Woody wrote:
>>
>>> Hi all,
>>>
>>> In my application, we take input from Parquet files where BigDecimals are
>>> written as Strings to maintain arbitrary precision.
>>>
>>> I was hoping to convert these back to Decimal with Unlimited
>>> precision, but I'd still like to maintain the Parquet column pruning (all
>>> my attempts thus far seem to bring in the whole Row). Is it possible to do
>>> this lazily through Catalyst?
>>>
>>> Basically I'd want to do Cast(col, DecimalType()) whenever col is actually
>>> referenced. Any tips on how to approach this would be appreciated.
>>>
>>> Thanks!
>>> -Pat
>>>
>>>
>>
>
>

Re: Lazy casting with Catalyst

Posted by Cheng Lian <li...@gmail.com>.
On 3/29/15 12:26 AM, Patrick Woody wrote:
> Hey Cheng,
>
> I didn't mean that Catalyst casting was eager, just that my
> approaches thus far seem to have been. Maybe I should give a concrete
> example?
>
> I have columns A, B, C where B is saved as a String, but I'd like all
> references to B to go through a Cast to Decimal regardless of the code
> used on the SchemaRDD. So if someone does a min(B), it uses Decimal
> ordering instead of String ordering.
>
> One approach I took was to select everything, with the casts on
> certain columns, but when I then did a count(literal(1)) on top of
> that RDD it seemed to bring in the whole row.
What version of Spark SQL are you using? Would you mind providing a
brief snippet that can reproduce this issue? This might be a bug
depending on your concrete usage. Thanks in advance!
>
> Thanks!
> -Pat
>
> On Sat, Mar 28, 2015 at 11:35 AM, Cheng Lian <lian.cs.zju@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Hi Pat,
>
>     I don't understand what "lazy casting" means here. Why do you think
>     current Catalyst casting is "eager"? Casting happens at runtime,
>     and doesn't disable column pruning.
>
>     Cheng
>
>
>     On 3/28/15 11:26 PM, Patrick Woody wrote:
>
>         Hi all,
>
>         In my application, we take input from Parquet files where
>         BigDecimals are written as Strings to maintain arbitrary
>         precision.
>
>         I was hoping to convert these back to Decimal with Unlimited
>         precision, but I'd still like to maintain the Parquet column
>         pruning (all my attempts thus far seem to bring in the whole
>         Row). Is it possible to do this lazily through Catalyst?
>
>         Basically I'd want to do Cast(col, DecimalType()) whenever
>         col is actually referenced. Any tips on how to approach this
>         would be appreciated.
>
>         Thanks!
>         -Pat
>
>
>


Re: Lazy casting with Catalyst

Posted by Patrick Woody <pa...@gmail.com>.
Hey Cheng,

I didn't mean that Catalyst casting was eager, just that my approaches
thus far seem to have been. Maybe I should give a concrete example?

I have columns A, B, C where B is saved as a String, but I'd like all
references to B to go through a Cast to Decimal regardless of the code used
on the SchemaRDD. So if someone does a min(B), it uses Decimal ordering
instead of String ordering.

One approach I took was to select everything, with the casts on certain
columns, but when I then did a count(literal(1)) on top of that RDD it
seemed to bring in the whole row.
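
Concretely, the approach was along these lines (a sketch against the 1.3
DataFrame API; the path is made up, and count(lit(1)) stands in for the
count(literal(1)) above):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.DecimalType

    val df = sqlContext.parquetFile("/data/events")  // columns A, B, C

    // Select everything, casting B so downstream code only sees a Decimal.
    val casted = df.select($"A", $"B".cast(DecimalType.Unlimited).as("B"), $"C")

    // min over the casted column compares as Decimal, not String.
    casted.select(min($"B")).show()

    // The query that appeared to read the whole row instead of no columns:
    casted.select(count(lit(1))).show()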

Thanks!
-Pat

On Sat, Mar 28, 2015 at 11:35 AM, Cheng Lian <li...@gmail.com> wrote:

> Hi Pat,
>
> I don't understand what "lazy casting" means here. Why do you think current
> Catalyst casting is "eager"? Casting happens at runtime, and doesn't
> disable column pruning.
>
> Cheng
>
>
> On 3/28/15 11:26 PM, Patrick Woody wrote:
>
>> Hi all,
>>
>> In my application, we take input from Parquet files where BigDecimals are
>> written as Strings to maintain arbitrary precision.
>>
>> I was hoping to convert these back to Decimal with Unlimited
>> precision, but I'd still like to maintain the Parquet column pruning (all
>> my attempts thus far seem to bring in the whole Row). Is it possible to do
>> this lazily through Catalyst?
>>
>> Basically I'd want to do Cast(col, DecimalType()) whenever col is actually
>> referenced. Any tips on how to approach this would be appreciated.
>>
>> Thanks!
>> -Pat
>>
>>
>

Re: Lazy casting with Catalyst

Posted by Cheng Lian <li...@gmail.com>.
Hi Pat,

I don't understand what "lazy casting" means here. Why do you think
current Catalyst casting is "eager"? Casting happens at runtime, and
doesn't disable column pruning.
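
One quick way to see this for yourself (a sketch, assuming the 1.3
DataFrame API; the path and column name are placeholders):

    import sqlContext.implicits._
    import org.apache.spark.sql.types.DecimalType

    val df = sqlContext.parquetFile("/data/input")

    // Only b is referenced here, so column pruning pushes a b-only
    // projection into the Parquet scan; the Cast is just a per-row
    // expression evaluated on top of the pruned scan.
    df.select($"b".cast(DecimalType.Unlimited)).explain()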

Cheng

On 3/28/15 11:26 PM, Patrick Woody wrote:
> Hi all,
>
> In my application, we take input from Parquet files where BigDecimals are
> written as Strings to maintain arbitrary precision.
>
> I was hoping to convert these back to Decimal with Unlimited
> precision, but I'd still like to maintain the Parquet column pruning (all
> my attempts thus far seem to bring in the whole Row). Is it possible to do
> this lazily through Catalyst?
>
> Basically I'd want to do Cast(col, DecimalType()) whenever col is actually
> referenced. Any tips on how to approach this would be appreciated.
>
> Thanks!
> -Pat
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org