Posted to user@spark.apache.org by Jerry <je...@gmail.com> on 2015/08/10 22:26:23 UTC

Is there any external dependencies for lag() and lead() when using data frames?

Hello,

Using Apache Spark 1.4.1, I'm unable to use lag or lead when making queries
against a data frame, and I'm trying to figure out whether I just have a bad
setup or whether this is a bug. As for the exceptions I get: when using
selectExpr() with a string as an argument, I get "NoSuchElementException:
key not found: lag", and when using the select method with
...spark.sql.functions.lag I get an AnalysisException. If I replace lag with
abs in the first case, Spark runs without exception, so the rest of the
syntax appears to be correct.
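For anyone following along, lag and lead are window functions: lag(col, n) returns the value of col from n rows earlier within the ordered window, and lead(col, n) the value from n rows later. Their per-row semantics can be sketched in plain Scala (no Spark required; the helper names here are illustrative, not Spark's API):

```scala
object LagLeadDemo {
  // lag: the value n rows before position i in the ordered window, or default
  def lag[A](xs: Seq[A], i: Int, n: Int, default: A): A =
    if (i - n >= 0) xs(i - n) else default

  // lead: the value n rows after position i, or default
  def lead[A](xs: Seq[A], i: Int, n: Int, default: A): A =
    if (i + n < xs.length) xs(i + n) else default

  def main(args: Array[String]): Unit = {
    val values = Seq(10, 20, 30)
    println(values.indices.map(i => lag(values, i, 1, -1)))   // Vector(-1, 10, 20)
    println(values.indices.map(i => lead(values, i, 1, -1)))  // Vector(20, 30, -1)
  }
}
```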

As for how I'm running it: the code is written in Java, with a static method
that takes the SparkContext as an argument, uses it to create a
JavaSparkContext, and from that creates an SQLContext, which loads a
JSON file from the local disk and runs those queries on the resulting data
frame object. FYI: the Java code is compiled, jarred, and then pointed to
with -cp when starting the spark shell, so all I do is "Test.run(sc)" in the shell.
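That workflow would look roughly like the following. All file, jar, and class names are invented for illustration, and the assembly jar path assumes a stock Spark 1.4.1 binary distribution; `--jars` is the usual way to put a jar on the shell's classpath:

```shell
# Compile against the Spark assembly jar (path is illustrative)
javac -cp "$SPARK_HOME/lib/spark-assembly-1.4.1-hadoop2.6.0.jar" Test.java
# Package the compiled class into a jar
jar cf test.jar Test.class
# Start the shell with the jar available, then run Test.run(sc) at the prompt
spark-shell --jars test.jar
```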

Let me know what to look for to debug this problem. I'm not sure where to
look to solve this problem.

Thanks,
        Jerry

RE: Is there any external dependencies for lag() and lead() when using data frames?

Posted by Benjamin Ross <br...@Lattice-Engines.com>.
I forgot to mention, my setup was:

-          Spark 1.4.1 running in standalone mode

-          Datastax spark cassandra connector 1.4.0-M1

-          Cassandra DB

-          Scala version 2.10.4


From: Benjamin Ross
Sent: Tuesday, August 11, 2015 10:16 AM
To: Jerry; Michael Armbrust
Cc: user
Subject: RE: Is there any external dependencies for lag() and lead() when using data frames?

Jerry,
I was able to use window functions without the hive thrift server.  HiveContext does not imply that you need the hive thrift server running.

Here’s what I used to test this out:
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "kv", "keyspace" -> "test"))
      .load()
    val w = Window.orderBy("value").rowsBetween(-2, 0)


I then submitted this using spark-submit.



From: Jerry [mailto:jerry.comp@gmail.com]
Sent: Monday, August 10, 2015 10:55 PM
To: Michael Armbrust
Cc: user
Subject: Re: Is there any external dependencies for lag() and lead() when using data frames?

By the way, if Hive is present in the Spark install, does it show up in the text when you start the spark shell? Any commands I can run to check if it exists? I didn't set up the spark machine that I use, so I don't know what's present or absent.
Thanks,
        Jerry

On Mon, Aug 10, 2015 at 2:38 PM, Jerry <je...@gmail.com> wrote:
Thanks...   looks like I now hit that bug about HiveMetaStoreClient as I now get the message about being unable to instantiate it. On a side note, does anyone know where hive-site.xml is typically located?
Thanks,
        Jerry

On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust <mi...@databricks.com> wrote:
You will need to use a HiveContext for window functions to work.

On Mon, Aug 10, 2015 at 1:26 PM, Jerry <je...@gmail.com> wrote:
Hello,
Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries to a data frame and I'm trying to figure out if I just have a bad setup or if this is a bug. As for the exceptions I get: when using selectExpr() with a string as an argument, I get "NoSuchElementException: key not found: lag" and when using the select method and ...spark.sql.functions.lag I get an AnalysisException. If I replace lag with abs in the first case, Spark runs without exception, so none of the other syntax is incorrect.
As for how I'm running it: the code is written in Java with a static method that takes the SparkContext as an argument, which is used to create a JavaSparkContext, which in turn is used to create an SQLContext, which loads a JSON file from the local disk and runs those queries on that data frame object. FYI: the Java code is compiled, jarred, and then pointed to with -cp when starting the spark shell, so all I do is "Test.run(sc)" in the shell.
Let me know what to look for to debug this problem. I'm not sure where to look to solve this problem.
Thanks,
        Jerry




RE: Is there any external dependencies for lag() and lead() when using data frames?

Posted by Benjamin Ross <br...@Lattice-Engines.com>.
Jerry,
I was able to use window functions without the hive thrift server.  HiveContext does not imply that you need the hive thrift server running.

Here’s what I used to test this out:
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "kv", "keyspace" -> "test"))
      .load()
    val w = Window.orderBy("value").rowsBetween(-2, 0)


I then submitted this using spark-submit.



From: Jerry [mailto:jerry.comp@gmail.com]
Sent: Monday, August 10, 2015 10:55 PM
To: Michael Armbrust
Cc: user
Subject: Re: Is there any external dependencies for lag() and lead() when using data frames?

By the way, if Hive is present in the Spark install, does it show up in the text when you start the spark shell? Any commands I can run to check if it exists? I didn't set up the spark machine that I use, so I don't know what's present or absent.
Thanks,
        Jerry

On Mon, Aug 10, 2015 at 2:38 PM, Jerry <je...@gmail.com> wrote:
Thanks...   looks like I now hit that bug about HiveMetaStoreClient as I now get the message about being unable to instantiate it. On a side note, does anyone know where hive-site.xml is typically located?
Thanks,
        Jerry

On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust <mi...@databricks.com> wrote:
You will need to use a HiveContext for window functions to work.

On Mon, Aug 10, 2015 at 1:26 PM, Jerry <je...@gmail.com> wrote:
Hello,
Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries to a data frame and I'm trying to figure out if I just have a bad setup or if this is a bug. As for the exceptions I get: when using selectExpr() with a string as an argument, I get "NoSuchElementException: key not found: lag" and when using the select method and ...spark.sql.functions.lag I get an AnalysisException. If I replace lag with abs in the first case, Spark runs without exception, so none of the other syntax is incorrect.
As for how I'm running it: the code is written in Java with a static method that takes the SparkContext as an argument, which is used to create a JavaSparkContext, which in turn is used to create an SQLContext, which loads a JSON file from the local disk and runs those queries on that data frame object. FYI: the Java code is compiled, jarred, and then pointed to with -cp when starting the spark shell, so all I do is "Test.run(sc)" in the shell.
Let me know what to look for to debug this problem. I'm not sure where to look to solve this problem.
Thanks,
        Jerry




Re: Is there any external dependencies for lag() and lead() when using data frames?

Posted by Jerry <je...@gmail.com>.
By the way, if Hive is present in the Spark install, does it show up in the
text when you start the spark shell? Any commands I can run to check if it
exists? I didn't set up the spark machine that I use, so I don't know what's
present or absent.

Thanks,
        Jerry

On Mon, Aug 10, 2015 at 2:38 PM, Jerry <je...@gmail.com> wrote:

> Thanks...   looks like I now hit that bug about HiveMetaStoreClient as I
> now get the message about being unable to instantiate it. On a side note,
> does anyone know where hive-site.xml is typically located?
>
> Thanks,
>         Jerry
>
> On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> You will need to use a HiveContext for window functions to work.
>>
>> On Mon, Aug 10, 2015 at 1:26 PM, Jerry <je...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Using Apache Spark 1.4.1 I'm unable to use lag or lead when making
>>> queries to a data frame and I'm trying to figure out if I just have a bad
>>> setup or if this is a bug. As for the exceptions I get: when using
>>> selectExpr() with a string as an argument, I get "NoSuchElementException:
>>> key not found: lag" and when using the select method and
>>> ...spark.sql.functions.lag I get an AnalysisException. If I replace lag
>>> with abs in the first case, Spark runs without exception, so none of the
>>> other syntax is incorrect.
>>>
>>> As for how I'm running it; the code is written in Java with a static
>>> method that takes the SparkContext as an argument which is used to create a
>>> JavaSparkContext which then is used to create an SQLContext which loads a
>>> json file from the local disk and runs those queries on that data frame
>>> object. FYI: the java code is compiled, jared and then pointed to with -cp
>>> when starting the spark shell, so all I do is "Test.run(sc)" in shell.
>>>
>>> Let me know what to look for to debug this problem. I'm not sure where
>>> to look to solve this problem.
>>>
>>> Thanks,
>>>         Jerry
>>>
>>
>>
>

Re: Is there any external dependencies for lag() and lead() when using data frames?

Posted by Jerry <je...@gmail.com>.
Thanks...   looks like I now hit that bug about HiveMetaStoreClient as I
now get the message about being unable to instantiate it. On a side note,
does anyone know where hive-site.xml is typically located?

Thanks,
        Jerry

On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> You will need to use a HiveContext for window functions to work.
>
> On Mon, Aug 10, 2015 at 1:26 PM, Jerry <je...@gmail.com> wrote:
>
>> Hello,
>>
>> Using Apache Spark 1.4.1 I'm unable to use lag or lead when making
>> queries to a data frame and I'm trying to figure out if I just have a bad
>> setup or if this is a bug. As for the exceptions I get: when using
>> selectExpr() with a string as an argument, I get "NoSuchElementException:
>> key not found: lag" and when using the select method and
>> ...spark.sql.functions.lag I get an AnalysisException. If I replace lag
>> with abs in the first case, Spark runs without exception, so none of the
>> other syntax is incorrect.
>>
>> As for how I'm running it; the code is written in Java with a static
>> method that takes the SparkContext as an argument which is used to create a
>> JavaSparkContext which then is used to create an SQLContext which loads a
>> json file from the local disk and runs those queries on that data frame
>> object. FYI: the java code is compiled, jared and then pointed to with -cp
>> when starting the spark shell, so all I do is "Test.run(sc)" in shell.
>>
>> Let me know what to look for to debug this problem. I'm not sure where to
>> look to solve this problem.
>>
>> Thanks,
>>         Jerry
>>
>
>

When will window ....

Posted by Martin Senne <ma...@martin-senne.de>.
When will window functions be integrated into Spark (without HiveContext?)

Sent with AquaMail for Android
http://www.aqua-mail.com


On 10 August 2015 at 23:04:22, Michael Armbrust <mi...@databricks.com> wrote:

> You will need to use a HiveContext for window functions to work.
>
> On Mon, Aug 10, 2015 at 1:26 PM, Jerry <je...@gmail.com> wrote:
>
> > Hello,
> >
> > Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries
> > to a data frame and I'm trying to figure out if I just have a bad setup or
> > if this is a bug. As for the exceptions I get: when using selectExpr() with
> > a string as an argument, I get "NoSuchElementException: key not found: lag"
> > and when using the select method and ...spark.sql.functions.lag I get an
> > AnalysisException. If I replace lag with abs in the first case, Spark runs
> > without exception, so none of the other syntax is incorrect.
> >
> > As for how I'm running it; the code is written in Java with a static
> > method that takes the SparkContext as an argument which is used to create a
> > JavaSparkContext which then is used to create an SQLContext which loads a
> > json file from the local disk and runs those queries on that data frame
> > object. FYI: the java code is compiled, jared and then pointed to with -cp
> > when starting the spark shell, so all I do is "Test.run(sc)" in shell.
> >
> > Let me know what to look for to debug this problem. I'm not sure where to
> > look to solve this problem.
> >
> > Thanks,
> >         Jerry
> >

Re: Is there any external dependencies for lag() and lead() when using data frames?

Posted by Michael Armbrust <mi...@databricks.com>.
You will need to use a HiveContext for window functions to work.
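Concretely, the fix amounts to constructing a HiveContext (a drop-in extension of SQLContext, shipped in the spark-hive module) and going through it for the window function. A minimal sketch under Spark 1.4.x APIs, with file, table, and column names invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("lag-demo"))
val sqlContext = new HiveContext(sc)  // HiveContext instead of plain SQLContext

// File and column names here are hypothetical
val df = sqlContext.read.json("people.json")
val w = Window.orderBy("age")

// DataFrame API form: lag over an explicit window spec
df.select(df("age"), lag(df("age"), 1).over(w)).show()

// SQL form: the OVER clause is resolved by the HiveContext's parser
df.registerTempTable("people")
sqlContext.sql("SELECT age, lag(age, 1) OVER (ORDER BY age) FROM people").show()
```

Note that selectExpr() may still fail to resolve lag in 1.4 even with a HiveContext, since it goes through the simpler expression parser; the sql() form above avoids that.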

On Mon, Aug 10, 2015 at 1:26 PM, Jerry <je...@gmail.com> wrote:

> Hello,
>
> Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries
> to a data frame and I'm trying to figure out if I just have a bad setup or
> if this is a bug. As for the exceptions I get: when using selectExpr() with
> a string as an argument, I get "NoSuchElementException: key not found: lag"
> and when using the select method and ...spark.sql.functions.lag I get an
> AnalysisException. If I replace lag with abs in the first case, Spark runs
> without exception, so none of the other syntax is incorrect.
>
> As for how I'm running it; the code is written in Java with a static
> method that takes the SparkContext as an argument which is used to create a
> JavaSparkContext which then is used to create an SQLContext which loads a
> json file from the local disk and runs those queries on that data frame
> object. FYI: the java code is compiled, jared and then pointed to with -cp
> when starting the spark shell, so all I do is "Test.run(sc)" in shell.
>
> Let me know what to look for to debug this problem. I'm not sure where to
> look to solve this problem.
>
> Thanks,
>         Jerry
>