Posted to user@spark.apache.org by Mitch Shepherd <Mi...@marklogic.com> on 2022/11/23 16:31:40 UTC

Creating a Spark 3 Connector

Hello,

I’m wondering if anyone can point me in the right direction for a Spark connector developer guide.

I’m looking for information on writing a new connector for Spark to move data between Apache Spark and other systems.

Any information would be helpful. I found a similar guide for Kafka <https://docs.confluent.io/platform/current/connect/devguide.html> but haven’t been able to track down documentation for Spark.

Best,
Mitch

This message and any attached documents contain information of MarkLogic and/or its customers that may be confidential and/or privileged. If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message. This email may contain pricing or other suggested contract terms related to MarkLogic software or services. Any such terms are not binding on MarkLogic unless and until they are included in a definitive agreement executed by MarkLogic.

Re: Creating a Spark 3 Connector

Posted by Jungtaek Lim <ka...@gmail.com>.

Bjørn, that is the "Spark Connect" project, which decouples the client and
server from the Spark driver. It is not related to data sources (also known
as connectors).

Mitch, as far as I understand, unfortunately we don't have dedicated
documentation for implementing data sources/connectors. We encourage you to
look at reference implementations such as Kafka and to study the interfaces.
Each interface has its own documentation, which will guide you in
implementing your own.

Please post any questions to the dev@ mailing list if you have doubts or get
stuck while implementing it.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Thu, Nov 24, 2022 at 7:06 AM Bjørn Jørgensen <bj...@gmail.com>
wrote:

> This is from the vote for Spark Connect. Is this what you are looking for?
>
> The goal of the SPIP is to introduce a DataFrame-based client/server API
> for Spark
>
> Please also refer to:
>
> - Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark Connect
> - A client and server interface for Apache Spark.
> <https://lists.apache.org/thread/3fd2n34hlyg872nr55rylbv5cg8m1556>
> - Design doc: Spark Connect - A client and server interface for Apache
> Spark.
> <https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit?usp=sharing>
> - JIRA: SPARK-39375 <https://issues.apache.org/jira/browse/SPARK-39375>

Re: Creating a Spark 3 Connector

Posted by Bjørn Jørgensen <bj...@gmail.com>.

This is from the vote for Spark Connect. Is this what you are looking for?

The goal of the SPIP is to introduce a DataFrame-based client/server API
for Spark

Please also refer to:

- Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark Connect -
A client and server interface for Apache Spark.
<https://lists.apache.org/thread/3fd2n34hlyg872nr55rylbv5cg8m1556>
- Design doc: Spark Connect - A client and server interface for Apache
Spark.
<https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit?usp=sharing>
- JIRA: SPARK-39375 <https://issues.apache.org/jira/browse/SPARK-39375>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Spark Partitions Size control

Posted by vijay khatri <vi...@gmail.com>.

Hi Team,

I am reading data from SQL Server tables through PySpark and writing it
to S3 in Parquet format.

Some tables hold a lot of data, so the files written to S3 for those
tables are gigabytes in size.

I need help with the following:

I want each partition to be about 128 MB. How can I set that?

I don't know the data size of the tables in advance. Some tables have 2
columns but billions of records, and some have 200 columns but only
thousands of records.

Thanks in advance for your help.

Regards,
Vijay
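
The arithmetic behind the 128 MB target in the question above can be sketched as follows. This is only an illustration, not an answer from the thread: the helper names (estimate_total_bytes, partitions_for_target) and the idea of estimating table size as row count times a sampled average row size are assumptions, and the size of the Parquet files actually written will differ because of compression and encoding.

```python
import math

# 128 MB target taken from the question; expressed here in bytes.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_total_bytes(row_count: int, avg_row_bytes: float) -> int:
    """Hypothetical size estimate: rows times an average row size.

    Both inputs are assumed to come from somewhere else, for example a
    COUNT(*) query and a small sample of rows.
    """
    return int(row_count * avg_row_bytes)


def partitions_for_target(total_bytes: int,
                          target_bytes: int = TARGET_PARTITION_BYTES) -> int:
    """Number of partitions needed so each holds at most target_bytes."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    # Always keep at least one partition, even for an empty table.
    return max(1, math.ceil(total_bytes / target_bytes))


if __name__ == "__main__":
    # A narrow table with billions of rows versus a wide table with
    # thousands of rows; the row sizes are made-up illustration values.
    narrow = estimate_total_bytes(row_count=2_000_000_000, avg_row_bytes=16)
    wide = estimate_total_bytes(row_count=5_000, avg_row_bytes=4_000)
    print(partitions_for_target(narrow))
    print(partitions_for_target(wide))

    # With a PySpark DataFrame `df`, the result could then be applied as:
    #   df.repartition(partitions_for_target(narrow)).write.parquet(path)
```

The estimate is in uncompressed bytes, so the resulting Parquet files would usually come out smaller than the target and the number may need tuning per table.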
