Calling HTTP Rest APIs from Spark Job

Posted to user@spark.apache.org by Chetan Khatri <ch...@gmail.com> on 2020/05/14 21:03:07 UTC

Hi Spark Users,

How can I invoke a REST API call from Spark code so that it does not run
only on the Spark driver but is distributed / parallel?

Spark with Scala is my tech stack.

Thanks

Re: Calling HTTP Rest APIs from Spark Job

Posted by Chetan Khatri <ch...@gmail.com>.
Hi Sean,
Thanks for great answer.

What I am trying to do is use something like Scala Future (cats-effect IO)
to make concurrent calls. I was trying to understand whether there are any
limits or thresholds on making those calls.
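
Roughly what I have in mind, as a sketch (callApi and the endpoint are
placeholders for the real call, ids is a Dataset[String] of request keys,
and spark.implicits._ is in scope):

  import java.util.concurrent.Executors
  import scala.concurrent.duration._
  import scala.concurrent.{Await, ExecutionContext, Future}

  // Placeholder for the real HTTP call (hypothetical endpoint).
  def callApi(id: String): String =
    scala.io.Source.fromURL(s"https://api.example.com/items/$id").mkString

  val results = ids.mapPartitions { it =>
    val pool = Executors.newFixedThreadPool(8)            // bounded concurrency per task
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    val calls = it.map(id => Future(callApi(id))).toList  // fan out within the partition
    val out = Await.result(Future.sequence(calls), 10.minutes)
    pool.shutdown()
    out.iterator
  }

That way each task slot keeps several requests in flight instead of just one.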

Re: Calling HTTP Rest APIs from Spark Job

Posted by Sean Owen <sr...@gmail.com>.
No, it means # HTTP calls = # executor slots. But even then, you're
welcome to, say, use thread pools to execute even more calls concurrently,
as most of the work is I/O bound. Your code can do what you want.

Re: Calling HTTP Rest APIs from Spark Job

Posted by Chetan Khatri <ch...@gmail.com>.
Thanks, that means the number of executors = the number of HTTP calls I can
make. I can't run more HTTP calls within a single executor; I mean, I can't
go beyond the threshold set by the number of executors.

Re: Calling HTTP Rest APIs from Spark Job

Posted by Sean Owen <sr...@gmail.com>.
The default is not 200; it's the number of executor slots. Yes, you can only
execute as many tasks simultaneously as you have slots, regardless of the
number of partitions.
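
The 200 you may be thinking of is spark.sql.shuffle.partitions, which only
controls how many shuffle partitions a DataFrame ends up with. The slot
count comes from the executor settings; for example, with the numbers from
this thread (illustrative only):

  import org.apache.spark.sql.SparkSession

  // 4 executors x 8 cores = 32 task slots, so at most 32 tasks (and hence
  // 32 single-threaded HTTP calls) run at the same time.
  val spark = SparkSession.builder()
    .appName("rest-calls")
    .config("spark.executor.instances", "4")       // honoured on YARN / Kubernetes
    .config("spark.executor.cores", "8")
    .config("spark.sql.shuffle.partitions", "200") // shuffle partition count, not slots
    .getOrCreate()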

Re: Calling HTTP Rest APIs from Spark Job

Posted by Chetan Khatri <ch...@gmail.com>.
Thanks Sean, Jerry.

The default number of Spark DataFrame partitions is 200, right? Does it have
a relationship with the number of cores? With 8 cores and 4 workers, isn't it
the case that I can only make 8 * 4 = 32 HTTP calls? Because in Spark it is
not true that the number of partitions = the number of cores.

Thanks

Re: Calling HTTP Rest APIs from Spark Job

Posted by Sean Owen <sr...@gmail.com>.
Yes, any code that you write in the functions you apply with Spark runs in
the executors. You would be running as many HTTP clients as you have
partitions.
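
For example, with mapPartitions you get one client per task/partition,
roughly like this (a sketch: the endpoint is made up, ids is a
Dataset[String] with spark.implicits._ in scope, and it assumes an HTTP
client such as Apache HttpClient 4.x is on the classpath):

  import org.apache.http.client.methods.HttpGet
  import org.apache.http.impl.client.HttpClients
  import org.apache.http.util.EntityUtils

  val responses = ids.mapPartitions { it =>
    val client = HttpClients.createDefault()   // one client per partition
    val out = it.map { id =>
      val resp = client.execute(new HttpGet(s"https://api.example.com/items/$id"))
      try EntityUtils.toString(resp.getEntity) finally resp.close()
    }.toList                                   // force the calls before closing the client
    client.close()
    out.iterator
  }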

Re: Calling HTTP Rest APIs from Spark Job

Posted by Jerry Vinokurov <gr...@gmail.com>.
I believe that if you do this within the context of an operation that is
already parallelized such as a map, the work will be distributed to
executors and they will do it in parallel. I could be wrong about this as I
never investigated this specific use case, though.

-- 
http://www.google.com/profiles/grapesmoker

Re: Calling HTTP Rest APIs from Spark Job

Posted by Chetan Khatri <ch...@gmail.com>.
Thanks for the quick response.

I am curious to know whether the data would be pulled in parallel for 100+
HTTP requests, or whether it would only run on the driver node. The POST body
would be part of the DataFrame. Think of it as: I have a DataFrame of
employee_id and employee_name, and now an HTTP GET call has to be made for
each employee_id, and the DataFrame is dynamic for each Spark job run.
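
To make it concrete, something like this is what I mean (a rough sketch;
the endpoint and case classes are made up, employees is the per-run
Dataset, and spark.implicits._ is in scope):

  case class Employee(employee_id: String, employee_name: String)
  case class Enriched(employee_id: String, employee_name: String, details: String)

  // employees: Dataset[Employee], rebuilt on every job run
  val enriched = employees.map { e =>
    // Hypothetical endpoint; each GET runs inside the executor task that owns the row.
    val details = scala.io.Source.fromURL(
      s"https://api.example.com/employees/${e.employee_id}").mkString
    Enriched(e.employee_id, e.employee_name, details)
  }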

Does it make sense?

Thanks


Re: Calling HTTP Rest APIs from Spark Job

Posted by Jerry Vinokurov <gr...@gmail.com>.
Hi Chetan,

You can pretty much use any client to do this. When I was using Spark at a
previous job, we used OkHttp, but I'm sure there are plenty of others. In
our case, we had a startup phase in which we gathered metadata via a REST
API and then broadcast it to the workers. I think if you need all the
workers to have access to whatever you're getting from the API, that's the
way to do it.
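
As a rough sketch (the endpoint is made up; spark is the SparkSession, ids
is a Dataset[String], and spark.implicits._ is in scope):

  // Fetch small, shared metadata once on the driver, then broadcast it so
  // every executor reads a local copy instead of calling the API again.
  val metadata: String =
    scala.io.Source.fromURL("https://api.example.com/metadata").mkString
  val metaBc = spark.sparkContext.broadcast(metadata)

  val annotated = ids.map(id => (id, metaBc.value))   // executors read the broadcast copy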

Jerry

-- 
http://www.google.com/profiles/grapesmoker