Posted to dev@spark.apache.org by "Varadhan, Jawahar" <va...@yahoo.com.INVALID> on 2015/08/14 22:15:43 UTC

Setting up Spark/flume/? to Ingest 10TB from FTP

What is the best way to bring such a huge file from an FTP server into Hadoop and persist it in HDFS? Since a single JVM process might run out of memory, I was wondering if I can use Spark or Flume to do this. Any help on this matter is appreciated.
I prefer an application/process running inside Hadoop that does this transfer.
Thanks.

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

Posted by Jörn Franke <jo...@gmail.com>.
Well, what do you do in case of failure?
I think one should use a professional ingestion tool that ideally does
not need to reload everything after a failure and that verifies, via
checksums, that the file has been transferred correctly.
I am not sure if Flume supports FTP, but SSH/SCP should be supported.
You may also check other Flume sources, or write your own for FTP
(taking the comments above into account). I hope your file is compressed.
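
For example, here is a rough, untested sketch of the checksum idea,
reusing the curl/hdfs pipe quoted below (the host and paths are
placeholders, and the process substitution needs bash):

  # Compute an MD5 of the stream while piping the file into HDFS.
  curl -sS ftp://ftphost/incoming/bigfile \
    | tee >(md5sum > /tmp/bigfile.md5) \
    | hdfs dfs -put - /data/bigfile

  # Then read the HDFS copy back and compare the digest with the
  # checksum published on the FTP server.
  hdfs dfs -cat /data/bigfile | md5sum

A plain pipe like this still restarts from scratch on failure, which is
why a proper ingestion tool (or resumable, chunked transfers) is worth
considering for 10TB.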

On Fri, Aug 14, 2015 at 22:23, Marcelo Vanzin <va...@cloudera.com> wrote:

> Why do you need to use Spark or Flume for this?
>
> You can just use curl and hdfs:
>
>   curl ftp://blah | hdfs dfs -put - /blah
>
>
> On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar <
> varadhan@yahoo.com.invalid> wrote:
>
>> What is the best way to bring such a huge file from an FTP server into
>> Hadoop and persist it in HDFS? Since a single JVM process might run out
>> of memory, I was wondering if I can use Spark or Flume to do this. Any
>> help on this matter is appreciated.
>>
>> I prefer an application/process running inside Hadoop that does this
>> transfer.
>>
>> Thanks.
>>
>
>
>
> --
> Marcelo
>

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

Posted by Steve Loughran <st...@hortonworks.com>.
With the right FTP client JAR on your classpath (I forget which), you can use ftp:// as a source for a Hadoop FS operation. You may even be able to use it as an input for some Spark (non-streaming) job directly.
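
For example, something like the following might work as a rough sketch (host, credentials, and paths are placeholders, and the fs.ftp.* property names may differ between Hadoop versions):

  # Copy straight from the FTP server into HDFS with distcp, so the
  # transfer runs as a MapReduce job inside the cluster.
  hadoop distcp \
    -D fs.ftp.host=ftp.example.com \
    -D fs.ftp.user.ftp.example.com=myuser \
    -D fs.ftp.password.ftp.example.com=mypassword \
    ftp://ftp.example.com/incoming/bigfile \
    hdfs:///data/bigfile

Since distcp runs in the cluster, this would also satisfy the preference for a process running inside Hadoop.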


On 14 Aug 2015, at 14:11, Varadhan, Jawahar <va...@yahoo.com.INVALID> wrote:

Thanks Marcelo. But our problem is a little more complicated.

We have 10+ FTP sites that we will be transferring data from. The FTP server info, filename, and credentials all arrive via Kafka messages. So I want to read those Kafka messages, dynamically connect to the FTP site, download those fat files, and store them in HDFS.

Hence, I was planning to use Spark Streaming with Kafka, or Flume with Kafka. But Flume runs in a JVM and may not be the best option, as the huge file will create memory issues. Please suggest some way to run this inside the cluster.




________________________________
From: Marcelo Vanzin <va...@cloudera.com>
To: "Varadhan, Jawahar" <va...@yahoo.com>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>
Sent: Friday, August 14, 2015 3:23 PM
Subject: Re: Setting up Spark/flume/? to Ingest 10TB from FTP

Why do you need to use Spark or Flume for this?

You can just use curl and hdfs:

  curl ftp://blah | hdfs dfs -put - /blah




On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar <va...@yahoo.com.invalid> wrote:
What is the best way to bring such a huge file from an FTP server into Hadoop and persist it in HDFS? Since a single JVM process might run out of memory, I was wondering if I can use Spark or Flume to do this. Any help on this matter is appreciated.

I prefer an application/process running inside Hadoop that does this transfer.

Thanks.



--
Marcelo

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Fri, Aug 14, 2015 at 2:11 PM, Varadhan, Jawahar <
varadhan@yahoo.com.invalid> wrote:

> Hence, I was planning to use Spark Streaming with Kafka, or Flume with
> Kafka. But Flume runs in a JVM and may not be the best option, as the
> huge file will create memory issues. Please suggest some way to run this
> inside the cluster.
>

I'm not sure why you think memory would be a problem. You don't need to
read all 10TB into memory to transfer the file.

I'm far from the best person to give advice about Flume, but this seems
like it would be a job more in line with what Sqoop does; although a quick
search seems to indicate Sqoop cannot yet read from FTP.

But writing your own code to read from an FTP server when a message arrives
from Kafka shouldn't really be hard.
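
For instance, here is a rough, untested sketch building on the curl/hdfs
pipe above (the topic name, message format, and consumer options are
assumptions and depend on your Kafka version):

  # Assumes each Kafka message is a single line of the form
  # "ftp://user:pass@host/path/to/file /hdfs/target/path".
  kafka-console-consumer.sh --zookeeper zk1:2181 --topic ftp-transfers |
  while read -r src dest; do
    curl -sS "$src" | hdfs dfs -put - "$dest"
  done

Each transfer is then a bounded-memory stream from the FTP server into
HDFS, and a failed transfer could simply be re-queued on the topic.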

-- 
Marcelo

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

Posted by "Varadhan, Jawahar" <va...@yahoo.com.INVALID>.
Thanks Marcelo. But our problem is a little more complicated.

We have 10+ FTP sites that we will be transferring data from. The FTP server info, filename, and credentials all arrive via Kafka messages. So I want to read those Kafka messages, dynamically connect to the FTP site, download those fat files, and store them in HDFS.
Hence, I was planning to use Spark Streaming with Kafka, or Flume with Kafka. But Flume runs in a JVM and may not be the best option, as the huge file will create memory issues. Please suggest some way to run this inside the cluster.

 

From: Marcelo Vanzin <va...@cloudera.com>
To: "Varadhan, Jawahar" <va...@yahoo.com>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>
Sent: Friday, August 14, 2015 3:23 PM
Subject: Re: Setting up Spark/flume/? to Ingest 10TB from FTP

Why do you need to use Spark or Flume for this?
You can just use curl and hdfs:
  curl ftp://blah | hdfs dfs -put - /blah



On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar <va...@yahoo.com.invalid> wrote:

What is the best way to bring such a huge file from an FTP server into Hadoop and persist it in HDFS? Since a single JVM process might run out of memory, I was wondering if I can use Spark or Flume to do this. Any help on this matter is appreciated.
I prefer an application/process running inside Hadoop that does this transfer.
Thanks.



-- 
Marcelo

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

Posted by Marcelo Vanzin <va...@cloudera.com>.
Why do you need to use Spark or Flume for this?

You can just use curl and hdfs:

  curl ftp://blah | hdfs dfs -put - /blah


On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar <
varadhan@yahoo.com.invalid> wrote:

> What is the best way to bring such a huge file from an FTP server into
> Hadoop and persist it in HDFS? Since a single JVM process might run out
> of memory, I was wondering if I can use Spark or Flume to do this. Any
> help on this matter is appreciated.
>
> I prefer an application/process running inside Hadoop that does this
> transfer.
>
> Thanks.
>



-- 
Marcelo