Posted to user@spark.apache.org by Scott Ribe <sc...@elevated-dev.com> on 2021/02/18 19:42:16 UTC

how to serve data over JDBC using simplest setup

I need a little help figuring out how some pieces fit together. I have some tables in parquet files, and I want to access them using SQL over JDBC. I gather that I need to run the thrift server, but how do I configure it to load my files into datasets and expose views?

The context is this: trying to figure out if we want to use Spark for historical data, and so far, just using spark shell for some experiments:

- I have established that we can easily export to Parquet and it is very efficient at storing this data
- Spark SQL queries the data with reasonable performance

Now I am at the step of testing whether the client-side piece that we are considering can deal effectively with querying the volume of data.

Which is why I'm looking for the simplest setup. If the client integration works, then yes we move on to configuring a proper cluster. (And it is a real question; I've already had one potential client-side piece prove totally incompetent at handling a decent volume of data...)

(The environment I am working in is just the straight download of spark-3.0.1-bin-hadoop3.2)

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/






Re: how to serve data over JDBC using simplest setup

Posted by "Lalwani, Jayesh" <jl...@amazon.com.INVALID>.
Presto has slightly lower latency than Spark, but I've found that it gets stuck on some edge cases. 

If you are on AWS, then the simplest solution is to use Athena. Athena is built on Presto, has a JDBC driver, and is serverless, so there are no operational headaches to deal with.

On 2/18/21, 3:32 PM, "Scott Ribe" <sc...@elevated-dev.com> wrote:


    > On Feb 18, 2021, at 12:52 PM, Jeff Evans <je...@gmail.com> wrote:
    >
    > It sounds like the tool you're after, then, is a distributed SQL engine like Presto.  But I could be totally misunderstanding what you're trying to do.

    Presto may well be a longer-term solution as our use grows. For now, a simple data set loaded into Spark and served via JDBC (to be accessed via a Postgres foreign data wrapper) will get us the next small step.



Re: how to serve data over JDBC using simplest setup

Posted by Scott Ribe <sc...@elevated-dev.com>.
> On Feb 18, 2021, at 12:52 PM, Jeff Evans <je...@gmail.com> wrote:
> 
> It sounds like the tool you're after, then, is a distributed SQL engine like Presto.  But I could be totally misunderstanding what you're trying to do.

Presto may well be a longer-term solution as our use grows. For now, a simple data set loaded into Spark and served via JDBC (to be accessed via a Postgres foreign data wrapper) will get us the next small step.
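
For concreteness, the client side here is nothing exotic: the thrift server speaks the HiveServer2 protocol, so any client with the Hive JDBC driver on its classpath can query it. A minimal smoke test in Scala (a sketch, assuming the default port 10000, no authentication, and a table named "trades" that I've made up for illustration):

    import java.sql.DriverManager

    object ThriftSmokeTest {
      def main(args: Array[String]): Unit = {
        // Requires the Hive JDBC driver (org.apache.hive:hive-jdbc) on the
        // classpath; 10000 is the thrift server's default port.
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection(
          "jdbc:hive2://localhost:10000/default", "", "")
        val stmt = conn.createStatement()
        val rs = stmt.executeQuery("SELECT COUNT(*) FROM trades")
        while (rs.next()) println(s"rows: ${rs.getLong(1)}")
        rs.close(); stmt.close(); conn.close()
      }
    }

The Postgres foreign data wrapper would sit where this test client sits, speaking the same protocol.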


Re: how to serve data over JDBC using simplest setup

Posted by Jeff Evans <je...@gmail.com>.
It sounds like the tool you're after, then, is a distributed SQL engine
like Presto.  But I could be totally misunderstanding what you're trying to
do.

On Thu, Feb 18, 2021 at 1:48 PM Scott Ribe <sc...@elevated-dev.com>
wrote:

> I have a client side piece that needs access via JDBC.
>
> > On Feb 18, 2021, at 12:45 PM, Jeff Evans <je...@gmail.com>
> wrote:
> >
> > If the data is already in Parquet files, I don't see any reason to
> involve JDBC at all.  You can read Parquet files directly into a
> DataFrame.
> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
>
>

Re: how to serve data over JDBC using simplest setup

Posted by Scott Ribe <sc...@elevated-dev.com>.
I have a client side piece that needs access via JDBC.

> On Feb 18, 2021, at 12:45 PM, Jeff Evans <je...@gmail.com> wrote:
> 
> If the data is already in Parquet files, I don't see any reason to involve JDBC at all.  You can read Parquet files directly into a DataFrame.  https://spark.apache.org/docs/latest/sql-data-sources-parquet.html




Re: how to serve data over JDBC using simplest setup

Posted by Jeff Evans <je...@gmail.com>.
If the data is already in Parquet files, I don't see any reason to involve
JDBC at all.  You can read Parquet files directly into a DataFrame.
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
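
For instance, from spark-shell (the path is just a placeholder):

    // spark is the SparkSession that spark-shell creates for you
    val df = spark.read.parquet("/data/tables/trades")
    df.printSchema()
    df.createOrReplaceTempView("trades")
    spark.sql("SELECT COUNT(*) FROM trades").show()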

On Thu, Feb 18, 2021 at 1:42 PM Scott Ribe <sc...@elevated-dev.com>
wrote:

> I need a little help figuring out how some pieces fit together. I have
> some tables in parquet files, and I want to access them using SQL over
> JDBC. I gather that I need to run the thrift server, but how do I configure
> it to load my files into datasets and expose views?
>
> The context is this: trying to figure out if we want to use Spark for
> historical data, and so far, just using spark shell for some experiments:
>
> - I have established that we can easily export to Parquet and it is very
> efficient at storing this data
> - Spark SQL queries the data with reasonable performance
>
> Now I am at the step of testing whether the client-side piece that we are
> considering can deal effectively with querying the volume of data.
>
> Which is why I'm looking for the simplest setup. If the client integration
> works, then yes we move on to configuring a proper cluster. (And it is a
> real question; I've already had one potential client-side piece prove
> totally incompetent at handling a decent volume of data...)
>
> (The environment I am working in is just the straight download of
> spark-3.0.1-bin-hadoop3.2)
>
> --
> Scott Ribe
> scott_ribe@elevated-dev.com
> https://www.linkedin.com/in/scottribe/
>
>
>
>

Re: how to serve data over JDBC using simplest setup

Posted by Scott Ribe <sc...@elevated-dev.com>.
> On Feb 18, 2021, at 1:13 PM, Lalwani, Jayesh <jl...@amazon.com.INVALID> wrote:
> 
> Have you tried any of those? Where are you getting stuck?

Thanks! The 3rd one in your list I had not found, and it seems to fill in what I was missing (CREATE EXTERNAL TABLE).

I'd found the first two, but they only got me creating and querying tables in spark shell, or launching a hive server that had no data. (Google had also provided me with a wide variety of irrelevant material--mostly about using JDBC from within spark to import data, which I had figured out pretty quickly.)
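
For anyone who lands on this thread later, the missing piece looks roughly like this (a sketch; the table name and path are invented, and I'm assuming the stock download with its built-in Hive support). It registers the Parquet files as a table, so there is no import step at all:

    // Run from spark-shell; the same statement works as plain SQL over
    // beeline or JDBC once the thrift server (sbin/start-thriftserver.sh)
    // is up and pointed at the same metastore.
    spark.sql("""
      CREATE TABLE trades
      USING parquet
      LOCATION '/data/tables/trades'
    """)
    // Hive-style equivalent, which needs an explicit schema:
    //   CREATE EXTERNAL TABLE trades (...) STORED AS PARQUET LOCATION '...'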




Re: how to serve data over JDBC using simplest setup

Posted by "Lalwani, Jayesh" <jl...@amazon.com.INVALID>.
There are several step-by-step guides that you can find online by googling:

https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-thrift-server.html
https://medium.com/@saipeddy/setting-up-a-thrift-server-4eb0c55c11f0
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.3/bk_spark-component-guide/content/config-sts.html

Have you tried any of those? Where are you getting stuck?


On 2/18/21, 2:44 PM, "Scott Ribe" <sc...@elevated-dev.com> wrote:


    I need a little help figuring out how some pieces fit together. I have some tables in parquet files, and I want to access them using SQL over JDBC. I gather that I need to run the thrift server, but how do I configure it to load my files into datasets and expose views?

    The context is this: trying to figure out if we want to use Spark for historical data, and so far, just using spark shell for some experiments:

    - I have established that we can easily export to Parquet and it is very efficient at storing this data
    - Spark SQL queries the data with reasonable performance

    Now I am at the step of testing whether the client-side piece that we are considering can deal effectively with querying the volume of data.

    Which is why I'm looking for the simplest setup. If the client integration works, then yes we move on to configuring a proper cluster. (And it is a real question; I've already had one potential client-side piece prove totally incompetent at handling a decent volume of data...)

    (The environment I am working in is just the straight download of spark-3.0.1-bin-hadoop3.2)

    --
    Scott Ribe
    scott_ribe@elevated-dev.com
    https://www.linkedin.com/in/scottribe/



