Posted to user@spark.apache.org by Sourav Mazumder <so...@gmail.com> on 2017/11/21 17:07:37 UTC

Custom Data Source for getting data from Rest based services

Hi All,

Need your thoughts/inputs on a custom Data Source for accessing Rest based
services in parallel using Spark.

Often, a batch-oriented business application has to call a target REST
service a large number of times, each time with a different set of
parameter (key/value) values.

Example use cases include:

- Getting results/predictions from Machine Learning/NLP systems
- Accessing utility APIs (like address validation) in bulk for thousands of
inputs
- Ingesting data from systems that support only parametric data queries (say,
for time series data)
- Indexing data into search systems
- Web crawling
- Accessing business applications that do not support bulk download
- others ....

Typically, for these use cases, the number of times the service is called
(with varying parameters/data) can be high. So people use or develop a
parallel processing framework (in their language of choice) to call the
APIs in parallel. But it is usually hard to make such a framework run in a
distributed manner across multiple machines.
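For illustration, a minimal single-machine version of this pattern might look like the following (the `call_api` body is a hypothetical stand-in; a real version would issue an HTTP request with urllib.request or similar):

```python
from concurrent.futures import ThreadPoolExecutor

def call_api(params):
    # Stand-in for the HTTP call to the target REST service; a real
    # implementation would send `params` to the endpoint and parse the reply.
    return {"input": params, "result": params["a"] + params["b"]}

# Many parameter sets, one call per set, executed on a pool of threads.
param_sets = [{"a": i, "b": i + 1} for i in range(100)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(call_api, param_sets))
```

This works on one machine; the point of the Spark data source is to distribute exactly this loop across the executors of a cluster.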

I found that Spark's distributed programming paradigm lends itself well to
this, and I have been building a custom Data Source for it. Here is the
link to the repo -
https://github.com/sourav-mazumder/Data-Science-Extensions/tree/master/spark-datasource-rest

The interface goes like this:
- Inputs: the REST API endpoint URL; the input data in a temporary Spark
table (the table name has to be passed); the HTTP method (GET, POST, PUT,
or DELETE); userid/password (for services that need authentication);
connection parameters (connection timeout, read timeout); and a parameter
to call the target REST API only once per parameter set (useful for
services you have to pay for or that have a daily/hourly limit).
- Output: a DataFrame of Rows of Struct, where the Struct holds the output
returned by the target API.
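As a sketch, an invocation from PySpark might look like the snippet below. The option names and the endpoint are illustrative assumptions only; check the repo's README for the exact spellings.

```python
# Hypothetical option set for the REST data source described above;
# the keys are assumptions for illustration, not verified names.
options = {
    "url": "https://api.example.com/addressValidation",  # target REST endpoint
    "input": "address_inputs",     # temporary Spark table holding parameter rows
    "method": "POST",              # GET, POST, PUT or DELETE
    "userId": "myuser",            # for services that need authentication
    "userPassword": "secret",
    "connectionTimeout": "1000",   # connection parameters, in milliseconds
    "readTimeout": "5000",
    "callStrictlyOnce": "Y",       # call the API only once per parameter set
}

# With a live SparkSession `spark`, the read itself would be roughly:
# df = spark.read.format("rest").options(**options).load()
```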

Any thoughts/inputs on this?
a) Would this be useful for the applications/use cases you develop?
b) What do you typically use to address this type of need?
c) What else should be considered to make this framework more
generic/useful?

Regards,
Sourav

P.S. I found this resource (https://www.alibabacloud.com/forum/read-474)
where a similar requirement is discussed and a solution is proposed. I'm
not sure what the status of that proposal is. However, I found some more
things that need to be addressed in that proposal -
a) The proposal covers calling the REST API for one set of key/value
parameters. In the approach above, one can call the same REST API multiple
times with different sets of values for the keys.
b) There should be an option to call the REST API only once for a given set
of key/value parameters. This is important because one often has to pay for
accessing a REST API, and there may also be a per-day/per-hour limit.
c) It does not support calling a REST service through POST or other HTTP
methods.
d) Results in other formats (like XML or CSV) cannot be handled.

Re: Custom Data Source for getting data from Rest based services

Posted by Jean Georges Perrin <jg...@jgp.net>.
If you need Java code, you can have a look at:
https://github.com/jgperrin/net.jgp.labs.spark.datasources

and:
https://databricks.com/session/extending-apache-sparks-ingestion-building-your-own-java-data-source

> On Dec 24, 2017, at 2:56 AM, Subarna Bhattacharyya <su...@climformatics.com> wrote:


Re: Custom Data Source for getting data from Rest based services

Posted by Subarna Bhattacharyya <su...@climformatics.com>.
Hi Sourav,
Looks like this would be a good utility for the development of large-scale
data-driven products based on data services.

We are an early-stage startup called Climformatics, and we are building a
customized high-resolution climate prediction tool. This effort requires
synthesizing large-scale data input from multiple data sources. This tool
can help in getting large volumes of data from multiple data services
through API calls, which tend to be limited for bulk use.

One feature that would help us further is a handle for setting limits on
how many data points can be grabbed at once, since the data sources we
access are often limited in the number of service calls one can make at a
time (say, per minute).
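As a rough sketch, that kind of client-side throttling could be layered on top of the per-row calls like this (a deliberately simple illustration, not part of the actual package; it sleeps out a whole window rather than tracking timestamps):

```python
import time

def throttled(param_sets, call, max_per_window, window_seconds):
    """Invoke call() for each parameter set, letting at most
    max_per_window calls start within each time window."""
    results = []
    for i, params in enumerate(param_sets):
        if i > 0 and i % max_per_window == 0:
            time.sleep(window_seconds)  # wait for the next window to open
        results.append(call(params))
    return results

# e.g. at most 10 calls per minute:
# results = throttled(param_sets, call_api, max_per_window=10, window_seconds=60)
```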

Also, we need a way to pass the parameter inputs (for multiple calls)
through the URL path itself. Many of the data sources we use need the
parameters to be included in the URI path rather than passed as key/value
query parameters. An example is https://www.wunderground.com/weather/api/d/docs.
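Supporting that could be as simple as treating the URL as a template and substituting each row's parameter values into the path (the template below is a made-up example of the shape such APIs use):

```python
def build_url(template, params):
    # Substitute parameter values into the URI path itself, instead of
    # sending them as query-string key/value pairs.
    return template.format(**params)

template = "https://api.example.com/conditions/{state}/{city}.json"
url = build_url(template, {"state": "CA", "city": "San_Francisco"})
# url == "https://api.example.com/conditions/CA/San_Francisco.json"
```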

We will take a closer look at the GitHub link you provided and get back to
you with feedback.

Thanks,
Sincerely,
Subarna



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Custom Data Source for getting data from Rest based services

Posted by Sourav Mazumder <so...@gmail.com>.
It would be great if you can elaborate on the bulk provisioning use case.

Regards,
Sourav

On Sun, Nov 26, 2017 at 11:53 PM, shankar.roy <sh...@gmail.com> wrote:


Re: Custom Data Source for getting data from Rest based services

Posted by "shankar.roy" <sh...@gmail.com>.
This would be a useful feature.
We can leverage it while doing bulk provisioning.






Re: Custom Data Source for getting data from Rest based services

Posted by smazumder <so...@gmail.com>.
@sathich

Here are my thoughts on your points -

1. Yes, this should be able to handle any complex JSON structure returned
by the target REST API. Essentially, what it returns is Rows of that
complex structure. One can then use Spark SQL to flatten it further using
functions like inline, explode, etc.
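In plain-Python terms, the flattening that explode performs on one nested field of each Row looks roughly like this (the response shape is made up for illustration):

```python
# One Row holding the complex structure returned by the target API.
row = {"query": "addr-1",
       "response": {"matches": [{"score": 0.9}, {"score": 0.4}]}}

# Per row, `SELECT query, explode(response.matches)` produces one output
# row per element of the nested array:
flattened = [{"query": row["query"], "score": m["score"]}
             for m in row["response"]["matches"]]
```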

2. In my current implementation I have kept an option called
"callStrictlyOnce". It ensures that the REST API is called only once for
each set of parameter values, and the result is persisted/cached for
subsequent use.
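The idea behind such an option can be sketched as a cache keyed on the parameter set (a single-process illustration only; the data source itself persists/caches the results through Spark):

```python
import json

_cache = {}

def call_strictly_once(endpoint, params, do_call):
    """Invoke do_call(endpoint, params) at most once per distinct
    (endpoint, parameter-set) pair; repeats return the cached result."""
    # Canonicalize the parameter dict so key order does not matter.
    key = (endpoint, json.dumps(params, sort_keys=True))
    if key not in _cache:
        _cache[key] = do_call(endpoint, params)
    return _cache[key]
```

This matters for paid or rate-limited APIs, where a repeated parameter set should never trigger a second billable call.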

3. I'm not sure exactly what you have in mind regarding extending this to
Spark Streaming. As it stands, this cannot be used as a Spark Streaming
receiver, since it does not implement the interfaces required of a custom
streaming receiver. But you can use it within a Spark Streaming
application as a regular data source, to merge with the data you are
receiving from the streaming source.

Regards,
Sourav





Re: Custom Data Source for getting data from Rest based services

Posted by sathich <sa...@yahoo.com>.
Hi Sourav,
This is quite a useful addition to the Spark family; this is a use case
that comes up more often than it is talked about:
* getting 3rd-party mapping data (geo coordinates),
* accessing database data through REST,
* downloading data from a bulk data API service.


It will be really useful to be able to interact with the application layer
through a REST API and to send data over to the REST API (the case of a
POST request, which you already mentioned).

I have a few follow-up thoughts:
1) What's your thought on the case where a REST API returns more complex
nested JSON data? Will this map seamlessly to a DataFrame, given that
DataFrames are flatter in nature?
2) How can this DataFrame be kept in a distributed cache on the Spark
workers so it stays available, to encourage re-use of slow-changing data
(does broadcast work on a DataFrame?)? This is related to your (b).
3) The last case on my mind is how this can be extended to streaming:
controlling the frequency of the REST API calls and performing a join of
two DataFrames, one slow-moving (maybe a lookup table in a DB accessed
over REST) and one a fast-moving event stream.


Thanks
Sathi


Re: Custom Data Source for getting data from Rest based services

Posted by vaish02 <vc...@gmail.com>.
We extensively use PubMed & clinical trial databases for our work, and it
involves making a large number of parametric REST API queries. Usually, if
the data download is large, the requests get timed out and we have to run
queries in very small batches. We also make extensive use of a large
number (thousands) of NLP queries for our ML work.

Given that our content is quite large and we are constrained by the public
database interfaces, such a framework would be very beneficial for our use
case. Since I just stumbled on this post, I will try to use this package in
the context of our framework and let you know the difference between using
the library and the way we do it conventionally. Thanks for sharing it
with the community.


