Posted to user@hadoop.apache.org by Ralph Soika <ra...@imixs.com> on 2017/07/30 21:21:34 UTC

How to write a Job for importing Files from an external Rest API into Hadoop

Hi,

I would like to ask what the best way is to implement a job that
imports files into HDFS.

I have an external system offering data that is accessible through a
REST API. My goal is to have a job running in Hadoop which periodically
(maybe started by cron?) checks the REST API for new data.

It would be nice if this job could also run on multiple data nodes.
But in contrast to all the MapReduce examples I found, my job looks for
new or changed data on an external interface and compares it with the
data that is already stored.

This is a conceptual outline of the job (a rough sketch in Java
follows below the list):

 1. The job asks the REST API if there are new files
 2. if so, the job imports the first file in the list
 3. the job checks if the file already exists
     1. if not, the job imports the file
     2. if yes, the job compares the data with the data already stored
         1. if the data has changed, the job updates the file
 4. if more files exist, the job continues with step 2
 5. otherwise it ends.
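
To make the steps a bit more concrete, below is a very rough sketch in
plain Java of how I imagine the loop, using the HDFS FileSystem API.
The RestClient interface and its two methods are only placeholders for
my external system, and the byte comparison is just the simplest change
detection I could think of:

import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RestImport {

    /** Placeholder for the client of my external system (hypothetical API). */
    public interface RestClient {
        List<String> listNewFiles() throws Exception;
        byte[] readFile(String name) throws Exception;
    }

    public void run(RestClient rest, FileSystem fs, Path importDir)
            throws Exception {
        // 1. ask the REST API for new files
        for (String name : rest.listNewFiles()) {
            // 2. fetch the next file in the list
            Path target = new Path(importDir, name);
            byte[] remoteData = rest.readFile(name);

            if (!fs.exists(target)) {
                // 3.1 file does not exist yet -> import it
                write(fs, target, remoteData);
            } else if (hasChanged(fs, target, remoteData)) {
                // 3.2.1 file exists but its content changed -> update it
                write(fs, target, remoteData);
            }
            // 4. continue with the next file; 5. stop when the list is done
        }
    }

    private void write(FileSystem fs, Path target, byte[] data) throws Exception {
        try (OutputStream out = fs.create(target, true)) {  // true = overwrite
            out.write(data);
        }
    }

    private boolean hasChanged(FileSystem fs, Path target, byte[] remoteData)
            throws Exception {
        // naive byte comparison; a checksum or timestamp from the
        // REST API would of course be cheaper for large files
        byte[] stored = new byte[(int) fs.getFileStatus(target).getLen()];
        try (InputStream in = fs.open(target)) {
            IOUtils.readFully(in, stored, 0, stored.length);
        }
        return !Arrays.equals(stored, remoteData);
    }
}

My open question is how to turn something like this into a proper job
so that the import can be distributed over multiple data nodes.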


Can anybody give me a little help on how to start (it's the first job I 
am writing...)?


===
Ralph




-- 


Re: How to write a Job for importing Files from an external Rest API into Hadoop

Posted by Ralph Soika <ra...@imixs.com>.
Hi Ravi,

thanks a lot for your response and the code example!
I think this will help me a lot to get started. I am glad to see that my 
idea is not too exotic.
I will report back if I can adapt the solution to my problem.

best regards
Ralph


On 31.07.2017 22:05, Ravi Prakash wrote:
> Hi Ralph!
>
> Although not totally similar to your use case, DistCp may be the 
> closest thing to what you want. 
> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java 
> . The client builds a file list, and then submits an MR job to copy 
> over all the files.
>
> HTH
> Ravi
>
> On Sun, Jul 30, 2017 at 2:21 PM, Ralph Soika <ralph.soika@imixs.com 
> <ma...@imixs.com>> wrote:
>
>     [quoted original message trimmed]
>

-- 
*Imixs*...extends the way people work together
We are an open source company, read more at: www.imixs.org
------------------------------------------------------------------------
Imixs Software Solutions GmbH
Agnes-Pockels-Bogen 1, 80992 München
*Web:* www.imixs.com
*Office:* +49 (0)89-452136 16 *Mobil:* +49-177-4128245
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsfuehrer: Gaby Heinle u. Ralph Soika


Re: How to write a Job for importing Files from an external Rest API into Hadoop

Posted by Ravi Prakash <ra...@gmail.com>.
Hi Ralph!

Although not totally similar to your use case, DistCp may be the closest
thing to what you want.
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
. The client builds a file list, and then submits an MR job to copy over
all the files.
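
If you end up driving DistCp from your own code, it looks roughly like
this. The paths are just examples where your REST import could stage
the files first, and the DistCpOptions constructor shown below is the
Hadoop 2.x style; on trunk it has been replaced by a builder:

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class CopyStagedFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // example paths: files fetched from the REST API are staged in
        // /staging/rest-import and copied to /data/imported by DistCp
        DistCpOptions options = new DistCpOptions(
                Collections.singletonList(new Path("/staging/rest-import")),
                new Path("/data/imported"));
        options.setSyncFolder(true);  // like -update: copy only new/changed files

        DistCp distCp = new DistCp(conf, options);
        Job job = distCp.execute();   // builds the file list, runs the MR job
        System.exit(job.isSuccessful() ? 0 : 1);
    }
}

The same pattern should work for your case: let the client figure out
what is new or changed via the REST API, and let the mappers of the MR
job do the actual copying.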

HTH
Ravi

On Sun, Jul 30, 2017 at 2:21 PM, Ralph Soika <ra...@imixs.com> wrote:

> [quoted original message trimmed]