Posted to user@spark.apache.org by Paul Tremblay <pa...@gmail.com> on 2017/04/21 18:36:31 UTC

splitting a huge file

We are tasked with loading a big file (possibly 2TB) into a data warehouse.
In order to do this efficiently, we need to split the file into smaller
files.

I don't believe there is a way to do this with Spark, because in order for
Spark to distribute the file to the worker nodes, it first has to be split
up, right?

We ended up using a single machine with a single thread to do the
splitting. I just want to make sure I am not missing something obvious.

Thanks!

-- 
Paul Henry Tremblay
Attunix

Re: splitting a huge file

Posted by Steve Loughran <st...@hortonworks.com>.
> On 21 Apr 2017, at 19:36, Paul Tremblay <pa...@gmail.com> wrote:
> 
> We are tasked with loading a big file (possibly 2TB) into a data warehouse. In order to do this efficiently, we need to split the file into smaller files.
> 
> I don't believe there is a way to do this with Spark, because in order for Spark to distribute the file to the worker nodes, it first has to be split up, right? 

If it is in HDFS, it's already been broken up by block size and scattered around the filesystem, probably into 128/256 MB blocks, each replicated 3x, offering lots of places for data-local work.

If it's in another FS, different strategies may apply, including no data locality at all.
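
A rough sketch of what that looks like from the Spark side (the path and app name here are made up, and this assumes a splittable text file): each HDFS block surfaces as at least one partition, so the read is already parallel without any pre-splitting.

    import org.apache.spark.sql.SparkSession

    object BlockPartitions {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("block-partitions").getOrCreate()
        // textFile() asks the underlying input format for splits; on HDFS those
        // line up with the 128/256 MB blocks, so a 2 TB file gives thousands of them.
        val lines = spark.sparkContext.textFile("hdfs:///data/huge_input.txt")
        println(s"partitions: ${lines.getNumPartitions}")
        spark.stop()
      }
    }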

> 
> We ended up using a single machine with a single thread to do the splitting. I just want to make sure I am not missing something obvious.
> 

You don't explicitly need to split up the file if you can run different workers against different parts of the same file; the split only needs to be logical (offsets into the file), not a physical split into separate files.

This is what org.apache.hadoop.mapreduce.InputFormat.getSplits() does: you will need to define an input format for your data source and provide the split calculation.
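
A rough sketch of that route, using the stock TextInputFormat as a stand-in for a custom format (the class choice and path are illustrative, not from the thread): Spark calls getSplits() on the format and turns each split into a partition.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.sql.SparkSession

    object InputFormatSplits {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("input-format-splits").getOrCreate()
        // Hand Spark an explicit Hadoop InputFormat; its getSplits() result
        // determines the partitioning of the resulting RDD of (key, value) pairs.
        val records = spark.sparkContext.newAPIHadoopFile(
          "hdfs:///data/huge_input.txt",
          classOf[TextInputFormat],
          classOf[LongWritable],
          classOf[Text])
        println(s"splits/partitions: ${records.getNumPartitions}")
        spark.stop()
      }
    }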

> Thanks!
> 
> -- 
> Paul Henry Tremblay
> Attunix




Re: splitting a huge file

Posted by Roger Marin <ro...@rogersmarin.com>.
If the file is in HDFS already, you can use Spark to read it with a specific
input format (depending on the file type), which takes care of splitting it.

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html
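
A minimal sketch of that suggestion, using CSV purely as an example file type (the path, header option, and format are assumptions, not details from the thread):

    import org.apache.spark.sql.SparkSession

    object FormatAwareRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("format-aware-read").getOrCreate()
        // A format-specific reader handles the splitting itself; no manual pre-split.
        val df = spark.read
          .option("header", "true")
          .csv("hdfs:///data/huge_input.csv")
        println(s"partitions: ${df.rdd.getNumPartitions}")
        spark.stop()
      }
    }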

On Sat, Apr 22, 2017 at 4:36 AM, Paul Tremblay <pa...@gmail.com>
wrote:

> We are tasked with loading a big file (possibly 2TB) into a data
> warehouse. In order to do this efficiently, we need to split the file into
> smaller files.
>
> I don't believe there is a way to do this with Spark, because in order for
> Spark to distribute the file to the worker nodes, it first has to be split
> up, right?
>
> We ended up using a single machine with a single thread to do the
> splitting. I just want to make sure I am not missing something obvious.
>
> Thanks!
>
> --
> Paul Henry Tremblay
> Attunix
>

Re: splitting a huge file

Posted by Jörn Franke <jo...@gmail.com>.
What is your DWH technology?
If the file is on HDFS then, depending on the format, Spark can read parts of it in parallel.
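
If the warehouse really does need many smaller files, a rough sketch of doing the split itself in Spark rather than on a single machine (the paths and the partition count are made up; 2000 partitions works out to roughly 1 GB per output file for a 2 TB input):

    import org.apache.spark.sql.SparkSession

    object SplitForWarehouse {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("split-for-warehouse").getOrCreate()
        // Read the big file in parallel, then write one part file per partition.
        val lines = spark.read.textFile("hdfs:///data/huge_input.txt")
        lines.repartition(2000)
          .write
          .text("hdfs:///data/huge_input_split/")
        spark.stop()
      }
    }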

> On 21. Apr 2017, at 20:36, Paul Tremblay <pa...@gmail.com> wrote:
> 
> We are tasked with loading a big file (possibly 2TB) into a data warehouse. In order to do this efficiently, we need to split the file into smaller files.
> 
> I don't believe there is a way to do this with Spark, because in order for Spark to distribute the file to the worker nodes, it first has to be split up, right? 
> 
> We ended up using a single machine with a single thread to do the splitting. I just want to make sure I am not missing something obvious.
> 
> Thanks!
> 
> -- 
> Paul Henry Tremblay
> Attunix
