You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Suraj Shetiya <su...@gmail.com> on 2015/04/11 03:57:46 UTC

Query regarding infering data types in pyspark

Hi,

In pyspark when if I read a json file using sqlcontext I find that the date
field is not infered as date instead it is converted to string. And when I
try to convert it to date using df.withColumn(df.DateCol.cast("timestamp"))
it does not parse it successfuly and adds a null instead there. Should I
use UDF to convert the date ? Is this expected behaviour (not throwing an
error after failure to cast all fields)?

-- 
Regards,
Suraj

Re: Query regarding infering data types in pyspark

Posted by Davies Liu <da...@databricks.com>.
It does not work now, could you file a jira for it?

On Wed, Apr 15, 2015 at 9:29 AM, Suraj Shetiya <su...@gmail.com> wrote:
> Thank you :)
>
> That worked. I had another query regarding date being used as filter.
>
> With the new df which has the column cast as date I am unable to apply a
> filter that compares the dates.
> The query I am using is :
> df.filter(df.Datecol > datetime.date(2015,1,1)).show()
>
> I do not want to use date as a string to compare them. Please suggest.
>
>
> On Tue, Apr 14, 2015 at 4:59 AM, Davies Liu <da...@databricks.com> wrote:
>>
>> Hey Suraj,
>>
>> You should use "date" for DataType:
>>
>> df.withColumn(df.DateCol.cast("date"))
>>
>> Davies
>>
>> On Sat, Apr 11, 2015 at 10:57 PM, Suraj Shetiya <su...@gmail.com>
>> wrote:
>> > Humble reminder
>> >
>> > On Sat, Apr 11, 2015 at 12:16 PM, Suraj Shetiya <su...@gmail.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> Below is one line from the json file.
>> >> I have highlighted the field that represents the date.
>> >>
>> >>
>> >>
>> >> "YEAR":2015,"QUARTER":1,"MONTH":1,"DAY_OF_MONTH":31,"DAY_OF_WEEK":6,"FL_DATE":"2015-01-31","UNIQUE_CARRIER":"NK","AI
>> >>
>> >> RLINE_ID":20416,"CARRIER":"NK","TAIL_NUM":"N614NK","FL_NUM":126,"ORIGIN_AIRPORT_ID":11697,"ORIGIN_AIRPORT_SEQ_ID":1169
>> >>
>> >> 703,"ORIGIN_CITY_MARKET_ID":32467,"ORIGIN":"FLL","ORIGIN_CITY_NAME":"Fort
>> >> Lauderdale, FL","ORIGIN_STATE_ABR":"FL","ORI
>> >>
>> >> GIN_STATE_FIPS":12,"ORIGIN_STATE_NM":"Florida","ORIGIN_WAC":33,"DEST_AIRPORT_ID":13577,"DEST_AIRPORT_SEQ_ID":1357702,"
>> >> DEST_CITY_MARKET_ID":31135,"DEST":"MYR","DEST_CITY_NAME":"Myrtle Beach,
>> >> SC","DEST_STATE_ABR":"SC","DEST_STATE_FIPS":45
>> >> ,"DEST_STATE_NM":"South
>> >>
>> >> Carolina","DEST_WAC":37,"CRS_DEP_TIME":2010,"DEP_TIME":2009.0,"DEP_DELAY":-1.0,"DEP_DELAY_NEW"
>> >>
>> >> :0.0,"DEP_DEL15":0.0,"DEP_DELAY_GROUP":-1.0,"DEP_TIME_BLK":"2000-2059","TAXI_OUT":17.0,"WHEELS_OFF":2026.0,"WHEELS_ON"
>> >>
>> >> :2147.0,"TAXI_IN":5.0,"CRS_ARR_TIME":2149,"ARR_TIME":2152.0,"ARR_DELAY":3.0,"ARR_DELAY_NEW":3.0,"ARR_DEL15":0.0,"ARR_DELAY_GROUP":0.0,"ARR_TIME_BLK":"2100-2159","Unnamed:
>> >> 47":null}
>> >>
>> >> Please let me know if you need access to the dataset.
>> >>
>> >> On Sat, Apr 11, 2015 at 11:56 AM, Davies Liu <da...@databricks.com>
>> >> wrote:
>> >>>
>> >>> What's the format you have in json file?
>> >>>
>> >>> On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya
>> >>> <su...@gmail.com>
>> >>> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > In pyspark when if I read a json file using sqlcontext I find that
>> >>> > the
>> >>> > date
>> >>> > field is not infered as date instead it is converted to string. And
>> >>> > when I
>> >>> > try to convert it to date using
>> >>> > df.withColumn(df.DateCol.cast("timestamp"))
>> >>> > it does not parse it successfuly and adds a null instead there.
>> >>> > Should
>> >>> > I
>> >>> > use UDF to convert the date ? Is this expected behaviour (not
>> >>> > throwing
>> >>> > an
>> >>> > error after failure to cast all fields)?
>> >>> >
>> >>> > --
>> >>> > Regards,
>> >>> > Suraj
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Suraj
>> >
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Suraj
>
>
>
>
> --
> Regards,
> Suraj

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Query regarding infering data types in pyspark

Posted by Suraj Shetiya <su...@gmail.com>.
Thank you :)

That worked. I had another query regarding date being used as filter.

With the new df which has the column cast as date I am unable to apply a
filter that compares the dates.
The query I am using is :
df.filter(df.Datecol > datetime.date(2015,1,1)).show()

I do not want to use date as a string to compare them. Please suggest.


On Tue, Apr 14, 2015 at 4:59 AM, Davies Liu <da...@databricks.com> wrote:

> Hey Suraj,
>
> You should use "date" for DataType:
>
> df.withColumn(df.DateCol.cast("date"))
>
> Davies
>
> On Sat, Apr 11, 2015 at 10:57 PM, Suraj Shetiya <su...@gmail.com>
> wrote:
> > Humble reminder
> >
> > On Sat, Apr 11, 2015 at 12:16 PM, Suraj Shetiya <su...@gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> Below is one line from the json file.
> >> I have highlighted the field that represents the date.
> >>
> >>
> >>
> "YEAR":2015,"QUARTER":1,"MONTH":1,"DAY_OF_MONTH":31,"DAY_OF_WEEK":6,"FL_DATE":"2015-01-31","UNIQUE_CARRIER":"NK","AI
> >>
> RLINE_ID":20416,"CARRIER":"NK","TAIL_NUM":"N614NK","FL_NUM":126,"ORIGIN_AIRPORT_ID":11697,"ORIGIN_AIRPORT_SEQ_ID":1169
> >>
> 703,"ORIGIN_CITY_MARKET_ID":32467,"ORIGIN":"FLL","ORIGIN_CITY_NAME":"Fort
> >> Lauderdale, FL","ORIGIN_STATE_ABR":"FL","ORI
> >>
> GIN_STATE_FIPS":12,"ORIGIN_STATE_NM":"Florida","ORIGIN_WAC":33,"DEST_AIRPORT_ID":13577,"DEST_AIRPORT_SEQ_ID":1357702,"
> >> DEST_CITY_MARKET_ID":31135,"DEST":"MYR","DEST_CITY_NAME":"Myrtle Beach,
> >> SC","DEST_STATE_ABR":"SC","DEST_STATE_FIPS":45
> ,"DEST_STATE_NM":"South
> >>
> Carolina","DEST_WAC":37,"CRS_DEP_TIME":2010,"DEP_TIME":2009.0,"DEP_DELAY":-1.0,"DEP_DELAY_NEW"
> >>
> :0.0,"DEP_DEL15":0.0,"DEP_DELAY_GROUP":-1.0,"DEP_TIME_BLK":"2000-2059","TAXI_OUT":17.0,"WHEELS_OFF":2026.0,"WHEELS_ON"
> >>
> :2147.0,"TAXI_IN":5.0,"CRS_ARR_TIME":2149,"ARR_TIME":2152.0,"ARR_DELAY":3.0,"ARR_DELAY_NEW":3.0,"ARR_DEL15":0.0,"ARR_DELAY_GROUP":0.0,"ARR_TIME_BLK":"2100-2159","Unnamed:
> >> 47":null}
> >>
> >> Please let me know if you need access to the dataset.
> >>
> >> On Sat, Apr 11, 2015 at 11:56 AM, Davies Liu <da...@databricks.com>
> >> wrote:
> >>>
> >>> What's the format you have in json file?
> >>>
> >>> On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya <surajshetiya@gmail.com
> >
> >>> wrote:
> >>> > Hi,
> >>> >
> >>> > In pyspark when if I read a json file using sqlcontext I find that
> the
> >>> > date
> >>> > field is not infered as date instead it is converted to string. And
> >>> > when I
> >>> > try to convert it to date using
> >>> > df.withColumn(df.DateCol.cast("timestamp"))
> >>> > it does not parse it successfuly and adds a null instead there.
> Should
> >>> > I
> >>> > use UDF to convert the date ? Is this expected behaviour (not
> throwing
> >>> > an
> >>> > error after failure to cast all fields)?
> >>> >
> >>> > --
> >>> > Regards,
> >>> > Suraj
> >>
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Suraj
> >
> >
> >
> >
> > --
> > Regards,
> > Suraj
>



-- 
Regards,
Suraj

Re: Query regarding infering data types in pyspark

Posted by Davies Liu <da...@databricks.com>.
Hey Suraj,

You should use "date" for DataType:

df.withColumn(df.DateCol.cast("date"))

Davies

On Sat, Apr 11, 2015 at 10:57 PM, Suraj Shetiya <su...@gmail.com> wrote:
> Humble reminder
>
> On Sat, Apr 11, 2015 at 12:16 PM, Suraj Shetiya <su...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> Below is one line from the json file.
>> I have highlighted the field that represents the date.
>>
>>
>> "YEAR":2015,"QUARTER":1,"MONTH":1,"DAY_OF_MONTH":31,"DAY_OF_WEEK":6,"FL_DATE":"2015-01-31","UNIQUE_CARRIER":"NK","AI
>> RLINE_ID":20416,"CARRIER":"NK","TAIL_NUM":"N614NK","FL_NUM":126,"ORIGIN_AIRPORT_ID":11697,"ORIGIN_AIRPORT_SEQ_ID":1169
>> 703,"ORIGIN_CITY_MARKET_ID":32467,"ORIGIN":"FLL","ORIGIN_CITY_NAME":"Fort
>> Lauderdale, FL","ORIGIN_STATE_ABR":"FL","ORI
>> GIN_STATE_FIPS":12,"ORIGIN_STATE_NM":"Florida","ORIGIN_WAC":33,"DEST_AIRPORT_ID":13577,"DEST_AIRPORT_SEQ_ID":1357702,"
>> DEST_CITY_MARKET_ID":31135,"DEST":"MYR","DEST_CITY_NAME":"Myrtle Beach,
>> SC","DEST_STATE_ABR":"SC","DEST_STATE_FIPS":45    ,"DEST_STATE_NM":"South
>> Carolina","DEST_WAC":37,"CRS_DEP_TIME":2010,"DEP_TIME":2009.0,"DEP_DELAY":-1.0,"DEP_DELAY_NEW"
>> :0.0,"DEP_DEL15":0.0,"DEP_DELAY_GROUP":-1.0,"DEP_TIME_BLK":"2000-2059","TAXI_OUT":17.0,"WHEELS_OFF":2026.0,"WHEELS_ON"
>> :2147.0,"TAXI_IN":5.0,"CRS_ARR_TIME":2149,"ARR_TIME":2152.0,"ARR_DELAY":3.0,"ARR_DELAY_NEW":3.0,"ARR_DEL15":0.0,"ARR_DELAY_GROUP":0.0,"ARR_TIME_BLK":"2100-2159","Unnamed:
>> 47":null}
>>
>> Please let me know if you need access to the dataset.
>>
>> On Sat, Apr 11, 2015 at 11:56 AM, Davies Liu <da...@databricks.com>
>> wrote:
>>>
>>> What's the format you have in json file?
>>>
>>> On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya <su...@gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > In pyspark when if I read a json file using sqlcontext I find that the
>>> > date
>>> > field is not infered as date instead it is converted to string. And
>>> > when I
>>> > try to convert it to date using
>>> > df.withColumn(df.DateCol.cast("timestamp"))
>>> > it does not parse it successfuly and adds a null instead there. Should
>>> > I
>>> > use UDF to convert the date ? Is this expected behaviour (not throwing
>>> > an
>>> > error after failure to cast all fields)?
>>> >
>>> > --
>>> > Regards,
>>> > Suraj
>>
>>
>>
>>
>> --
>> Regards,
>> Suraj
>
>
>
>
> --
> Regards,
> Suraj

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Query regarding infering data types in pyspark

Posted by Suraj Shetiya <su...@gmail.com>.
Humble reminder

On Sat, Apr 11, 2015 at 12:16 PM, Suraj Shetiya <su...@gmail.com>
wrote:

> Hi,
>
> Below is one line from the json file.
> I have highlighted the field that represents the date.
>
> "YEAR":2015,"QUARTER":1,"MONTH":1,"DAY_OF_MONTH":31,"DAY_OF_WEEK":6,
> *"FL_DATE":"2015-01-31"*,"UNIQUE_CARRIER":"NK","AI
>  RLINE_ID":20416,"CARRIER":"NK","TAIL_NUM":"N614NK","FL_
> NUM":126,"ORIGIN_AIRPORT_ID":11697,"ORIGIN_AIRPORT_SEQ_ID":1169
>  703,"ORIGIN_CITY_MARKET_ID":32467,"ORIGIN":"FLL","ORIGIN_CITY_NAME":"Fort
> Lauderdale, FL","ORIGIN_STATE_ABR":"FL","ORI
>  GIN_STATE_FIPS":12,"ORIGIN_STATE_NM":"Florida","ORIGIN_
> WAC":33,"DEST_AIRPORT_ID":13577,"DEST_AIRPORT_SEQ_ID":1357702,"
>  DEST_CITY_MARKET_ID":31135,"DEST":"MYR","DEST_CITY_NAME":"Myrtle Beach,
> SC","DEST_STATE_ABR":"SC","DEST_STATE_FIPS":45    ,"DEST_STATE_NM":"South
> Carolina","DEST_WAC":37,"CRS_DEP_TIME":2010,"DEP_TIME":
> 2009.0,"DEP_DELAY":-1.0,"DEP_DELAY_NEW"    :0.0,"DEP_DEL15":0.0,"DEP_
> DELAY_GROUP":-1.0,"DEP_TIME_BLK":"2000-2059","TAXI_OUT":
> 17.0,"WHEELS_OFF":2026.0,"WHEELS_ON"    :2147.0,"TAXI_IN":5.0,"CRS_
> ARR_TIME":2149,"ARR_TIME":2152.0,"ARR_DELAY":3.0,"ARR_
> DELAY_NEW":3.0,"ARR_DEL15":0.0,"ARR_DELAY_GROUP":0.0,"ARR_
> TIME_BLK":"2100-2159","Unnamed: 47":null}
>
> Please let me know if you need access to the dataset.
>
> On Sat, Apr 11, 2015 at 11:56 AM, Davies Liu <da...@databricks.com>
> wrote:
>
>> What's the format you have in json file?
>>
>> On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya <su...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > In pyspark when if I read a json file using sqlcontext I find that the
>> date
>> > field is not infered as date instead it is converted to string. And
>> when I
>> > try to convert it to date using
>> df.withColumn(df.DateCol.cast("timestamp"))
>> > it does not parse it successfuly and adds a null instead there. Should I
>> > use UDF to convert the date ? Is this expected behaviour (not throwing
>> an
>> > error after failure to cast all fields)?
>> >
>> > --
>> > Regards,
>> > Suraj
>>
>
>
>
> --
> Regards,
> Suraj
>



-- 
Regards,
Suraj

Re: Query regarding infering data types in pyspark

Posted by Suraj Shetiya <su...@gmail.com>.
Hi,

Below is one line from the json file.
I have highlighted the field that represents the date.

"YEAR":2015,"QUARTER":1,"MONTH":1,"DAY_OF_MONTH":31,"DAY_OF_WEEK":6,
*"FL_DATE":"2015-01-31"*,"UNIQUE_CARRIER":"NK","AI
 RLINE_ID":20416,"CARRIER":"NK","TAIL_NUM":"N614NK","FL_
NUM":126,"ORIGIN_AIRPORT_ID":11697,"ORIGIN_AIRPORT_SEQ_ID":1169
 703,"ORIGIN_CITY_MARKET_ID":32467,"ORIGIN":"FLL","ORIGIN_CITY_NAME":"Fort
Lauderdale, FL","ORIGIN_STATE_ABR":"FL","ORI    GIN_STATE_FIPS":12,"ORIGIN_
STATE_NM":"Florida","ORIGIN_WAC":33,"DEST_AIRPORT_ID":
13577,"DEST_AIRPORT_SEQ_ID":1357702,"    DEST_CITY_MARKET_ID":31135,"
DEST":"MYR","DEST_CITY_NAME":"Myrtle Beach,
SC","DEST_STATE_ABR":"SC","DEST_STATE_FIPS":45
   ,"DEST_STATE_NM":"South Carolina","DEST_WAC":37,"CRS_
DEP_TIME":2010,"DEP_TIME":2009.0,"DEP_DELAY":-1.0,"DEP_DELAY_NEW"
 :0.0,"DEP_DEL15":0.0,"DEP_DELAY_GROUP":-1.0,"DEP_TIME_
BLK":"2000-2059","TAXI_OUT":17.0,"WHEELS_OFF":2026.0,"WHEELS_ON"
 :2147.0,"TAXI_IN":5.0,"CRS_ARR_TIME":2149,"ARR_TIME":
2152.0,"ARR_DELAY":3.0,"ARR_DELAY_NEW":3.0,"ARR_DEL15":0.
0,"ARR_DELAY_GROUP":0.0,"ARR_TIME_BLK":"2100-2159","Unnamed: 47":null}

Please let me know if you need access to the dataset.

On Sat, Apr 11, 2015 at 11:56 AM, Davies Liu <da...@databricks.com> wrote:

> What's the format you have in json file?
>
> On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya <su...@gmail.com>
> wrote:
> > Hi,
> >
> > In pyspark when if I read a json file using sqlcontext I find that the
> date
> > field is not infered as date instead it is converted to string. And when
> I
> > try to convert it to date using
> df.withColumn(df.DateCol.cast("timestamp"))
> > it does not parse it successfuly and adds a null instead there. Should I
> > use UDF to convert the date ? Is this expected behaviour (not throwing an
> > error after failure to cast all fields)?
> >
> > --
> > Regards,
> > Suraj
>



-- 
Regards,
Suraj

Re: Query regarding infering data types in pyspark

Posted by Davies Liu <da...@databricks.com>.
What's the format you have in json file?

On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya <su...@gmail.com> wrote:
> Hi,
>
> In pyspark when if I read a json file using sqlcontext I find that the date
> field is not infered as date instead it is converted to string. And when I
> try to convert it to date using df.withColumn(df.DateCol.cast("timestamp"))
> it does not parse it successfuly and adds a null instead there. Should I
> use UDF to convert the date ? Is this expected behaviour (not throwing an
> error after failure to cast all fields)?
>
> --
> Regards,
> Suraj

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org