Posted to user@spark.apache.org by Adaryl Wakefield <ad...@hotmail.com> on 2017/10/03 17:43:51 UTC

how do you deal with datetime in Spark?

I gave myself a project to start actually writing Spark programs. I'm using Scala and Spark 2.2.0. In my project, I had to do some grouping and filtering by dates. It was awful and took forever. I was trying to use DataFrames and SQL as much as possible. I see that there are date functions in the DataFrame API, but trying to use them was frustrating. Even following code samples was a headache, because apparently the code is different depending on which version of Spark you are using. I was really hoping for a rich set of date functions like you'd find in T-SQL, but I never really found them.

Is there a best practice for dealing with dates and times in Spark? I feel like taking a date/time string, converting it to a date/time object, and then manipulating data based on the various components of the timestamp (hour, day, year, etc.) should be a heck of a lot easier than what I'm finding, and perhaps I'm just not looking in the right place.

You can see my work here: https://github.com/BobLovesData/Apache-Spark-In-24-Hours/blob/master/src/net/massstreet/hour10/BayAreaBikeAnalysis.scala

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData



RE: how do you deal with datetime in Spark?

Posted by Adaryl Wakefield <ad...@hotmail.com>.
In my first attempt, I actually tried using case classes and then putting them into a Dataset. Scala, I guess, doesn't have a native date/time data type, and I still wound up having to do some sort of conversion when I tried to put the data into the Dataset, because I still had to define the column as a String. I mean, is that right? Is it not possible to create a case class with a field of type date or timestamp?
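To make the question concrete, here is a minimal sketch of the kind of thing I mean (the names are made up, and it assumes a SparkSession called spark and a DataFrame df whose columns line up with the case class):

import java.sql.Timestamp
import org.apache.spark.sql.functions._

// Can a case class field be a timestamp rather than a String?
case class Trip(tripId: Long, startTime: Timestamp)

import spark.implicits._

// Cast the string column to a timestamp, then map the rows into the case class
val trips = df
  .withColumn("startTime", col("startTime").cast("timestamp"))
  .as[Trip]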

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData





Re: how do you deal with datetime in Spark?

Posted by Nicholas Hakobian <ni...@rallyhealth.com>.
I'd suggest first converting the string containing your date/time to a TimestampType or a DateType; the built-in functions for year, month, day, etc. will then work as expected. If your date is in a "standard" format, you can perform the conversion just by casting the column to a date or timestamp type. The string formats it can auto-convert are listed at this link:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L270-L295

If casting won't work, you can manually convert it by specifying a format string with the following built-in function:
http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.unix_timestamp

The format string uses the Java SimpleDateFormat syntax, if I remember correctly (http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
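For example, a minimal Scala sketch of both approaches (the column names are made up; it assumes a DataFrame df with a string column "start_date"):

import org.apache.spark.sql.functions._

// If the string is already in a castable format such as "2017-10-03 17:43:51",
// a plain cast is enough
val casted = df.withColumn("start_ts", col("start_date").cast("timestamp"))

// Otherwise, parse explicitly with a SimpleDateFormat-style pattern
val parsed = df.withColumn(
  "start_ts",
  unix_timestamp(col("start_date"), "MM/dd/yyyy HH:mm").cast("timestamp"))

// After conversion, the built-in extractors behave as expected
parsed.groupBy(year(col("start_ts")), month(col("start_ts"))).count()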

Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health
nicholas.hakobian@rallyhealth.com



RE: how do you deal with datetime in Spark?

Posted by Adaryl Wakefield <ad...@hotmail.com>.
HA! Yeah, in an earlier attempt I tried to convert everything to unix_timestamp. That went over like a lead balloon…

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData




Re: how do you deal with datetime in Spark?

Posted by Steve Loughran <st...@hortonworks.com>.
On 3 Oct 2017, at 18:43, Adaryl Wakefield <ad...@hotmail.com> wrote:

You can see my work here: https://github.com/BobLovesData/Apache-Spark-In-24-Hours/blob/master/src/net/massstreet/hour10/BayAreaBikeAnalysis.scala


Once you've done that one, I have a few hundred MB of London bike stats if you want them. Their timestamps come in as strings, but "01/01/1970" is by far the most popular dropoff time, which is 0 in the epoch... (see the sketch after the sample rows for one way to weed those out).

9809600,0,6248,01/01/1970 00:00,0,NA,31/01/2012 19:31,365,City Road: Angel
9806201,0,6422,01/01/1970 00:00,0,NA,31/01/2012 19:32,17,Hatton Wall: Holborn
9802063,0,4096,01/01/1970 00:00,0,NA,31/01/2012 19:34,338,Wellington Street : Strand
9804765,0,5276,01/01/1970 00:00,0,NA,31/01/2012 19:37,93,Cloudesley Road: Angel
9806779,1970,14,31/01/2012 20:11,410,Edgware Road Station: Paddington
9813333,0,5810,01/01/1970 00:00,0,NA,31/01/2012 19:39,114,Park Road (Baker Street): Regent's Park
9803952,0,5682,01/01/1970 00:00,0,NA,31/01/2012 19:41,210,Hinde Street: Marylebone
9818659,0,5572,01/01/1970 00:00,0,NA,31/01/2012 19:41,87,Devonshire Square: Liverpool Street
9808144,0,5244,01/01/1970 00:00,0,NA,31/01/2012 19:42,374,Waterloo Station 1: Waterloo
9814365,0,5422,01/01/1970 00:00,0,NA,31/01/2012 19:48,15,Great Russell Street: Bloomsbury
9816863,0,6079,01/01/1970 00:00,0,NA,31/01/2012 19:49,258,Kensington Gore: Knightsbridge
9818469,0,4903,01/01/1970 00:00,0,NA,31/01/2012 19:50,341,Craven Street: Strand
9811512,0,5572,01/01/1970 00:00,0,NA,31/01/2012 19:50,298,Curlew Street: Shad Thames
9817931,0,708,01/01/1970 00:00,0,NA,31/01/2012 19:51,341,Craven Street: Strand
9816429,0,3210,01/01/1970 00:00,0,NA,31/01/2012 19:59,388,Southampton Street: Strand
9806284,0,4359,01/01/1970 00:00,0,NA,31/01/2012 20:06,335,Tavistock Street: Covent Garden
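One way to parse those strings and drop the epoch-zero placeholders (a sketch; the column names are invented, and raw stands for the loaded DataFrame):

import org.apache.spark.sql.functions._

// The date strings above follow a dd/MM/yyyy HH:mm pattern
val parsed = raw.withColumn(
  "start_ts",
  unix_timestamp(col("start_date"), "dd/MM/yyyy HH:mm").cast("timestamp"))

// Rows that parsed to 0 seconds since the epoch are the 01/01/1970 junk
val clean = parsed.filter(unix_timestamp(col("start_ts")) =!= 0)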


Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net<http://www.massstreet.net/>
www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba>
Twitter: @BobLovesData<http://twitter.com/BobLovesData>


Re: how do you deal with datetime in Spark?

Posted by Vadim Semenov <va...@datadoghq.com>.
I usually check the list of Hive UDFs, since Spark has implemented almost all of them:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

And/or check `org.apache.spark.sql.functions` directly:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/functions.html

You can also see the list of all datetime functions here:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L368-L399

and what they do here:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
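For instance, a few of those built-ins in one go (a sketch; "ts" is a made-up timestamp column on a DataFrame df):

import org.apache.spark.sql.functions._

df.select(
  year(col("ts")), month(col("ts")), dayofmonth(col("ts")), hour(col("ts")),
  date_add(col("ts"), 7),               // shift by seven days
  datediff(current_date(), col("ts")),  // whole days between two dates
  date_format(col("ts"), "yyyy-MM")     // format with a SimpleDateFormat pattern
).show()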


