Posted to user@spark.apache.org by ayan guha <gu...@gmail.com> on 2017/06/08 05:26:40 UTC

Read Data From NFS

Hi Guys

Quick one: how does Spark deal with (i.e. create partitions for) large files sitting
on NFS, assuming all executors can see the file in exactly the same way?

i.e., when I run

r = sc.textFile("file://my/file")

what happens if the file is on NFS?

Is there any difference compared to

r = sc.textFile("hdfs://my/file")

Are the same input formats used in both cases?
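
For concreteness, a minimal sketch of the two reads being compared (the paths are placeholders and sc is assumed to be an existing SparkContext):

# Placeholder paths; sc is an existing SparkContext.
nfs_rdd  = sc.textFile("file:///nfs/mount/big_file.txt")   # file on an NFS mount visible to every executor
hdfs_rdd = sc.textFile("hdfs:///data/big_file.txt")        # the same data stored in HDFS

print(nfs_rdd.getNumPartitions())    # how many partitions does each read produce?
print(hdfs_rdd.getNumPartitions())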

-- 
Best Regards,
Ayan Guha

Re: Read Data From NFS

Posted by ayan guha <gu...@gmail.com>.
Hi

So, for example, if I set the parallelism to 100, then 100 partitions will be
created, right? My question is how Spark divides the file. In other words,
how does it decide that the first x lines will be read by the first partition,
the next y lines by the second partition, and so on? In the case of HDFS the
file is already divided into blocks, but on NFS it is one whole file, isn't it?
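
For concreteness, a small sketch of the experiment behind this question (the path and the value 100 are placeholders; sc is an assumed SparkContext):

# Ask for at least 100 input splits. Hadoop's FileInputFormat divides the
# file into byte ranges, and each record reader skips ahead to the next
# newline, so no split starts in the middle of a line.
r = sc.textFile("file:///nfs/mount/big_file.txt", minPartitions=100)
print(r.getNumPartitions())   # typically 100, occasionally a little more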

Best Regards,
Ayan Guha

Re: Read Data From NFS

Posted by Riccardo Ferrari <fe...@gmail.com>.
Hi Ayan,
You might be interested in the official Spark docs:
https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism and
its spark.default.parallelism setting.
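
For reference, a minimal sketch of setting that property (the values are illustrative only):

from pyspark import SparkConf, SparkContext

# spark.default.parallelism mainly affects shuffles and sc.parallelize();
# sc.textFile() is driven by its optional minPartitions argument instead.
conf = (SparkConf()
        .setAppName("nfs-read")
        .set("spark.default.parallelism", "100"))
sc = SparkContext(conf=conf)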

Best,


Re: Read Data From NFS

Posted by ayan guha <gu...@gmail.com>.
I understand how it works with HDFS. My question is: when HDFS is not the
file system, how is the number of partitions calculated? Hope that makes it
clearer.

Best Regards,
Ayan Guha

Re: Read Data From NFS

Posted by vaquar khan <va...@gmail.com>.
As per the Spark docs:
The textFile method also takes an optional second argument for controlling
the number of partitions of the file. By default, Spark creates one
partition for each block of the file (blocks being 128 MB by default in
HDFS), but you can also ask for a higher number of partitions by passing a
larger value. Note that you cannot have fewer partitions than blocks.


sc.textFile doesn't commence any reading. It simply defines a
driver-resident data structure which can be used for further processing.

It is not until an action is called on an RDD that Spark will build up a
strategy to perform all the required transforms (including the read) and
then return the result.

If an action is called to run the sequence, and your next
transformation after the read is a map, then Spark will need to read a
small section of the file (according to the partitioning strategy
based on the number of cores) and then immediately start mapping it, until it
needs to return a result to the driver, or shuffle before the next sequence
of transformations.

If your partitioning strategy (defaultMinPartitions) seems to be swamping
the workers because the Java representation of your partition (an InputSplit in
HDFS terms) is bigger than the available executor memory, then you need to
specify the number of partitions to read as the second parameter to textFile.
You can calculate the ideal number of partitions by dividing your file size
by your target partition size (allowing for memory growth). A simple check
that the file can be read would be:

sc.textFile(file, numPartitions).count()
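
As a rough illustration of that calculation (the file size, target split size and path below are assumptions, and sc is an existing SparkContext):

import os

path = "/nfs/mount/big_file.txt"            # hypothetical NFS path, also mounted on the driver
target_split = 128 * 1024 * 1024            # aim for roughly 128 MB per partition
num_partitions = max(2, os.path.getsize(path) // target_split)

r = sc.textFile("file://" + path, minPartitions=int(num_partitions))
r.count()                                    # forces the read and confirms the file is readable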

You can find a good explanation here:
https://stackoverflow.com/questions/29011574/how-does-partitioning-work-for-data-from-files-on-hdfs



Regards,
Vaquar khan



Re: Read Data From NFS

Posted by ayan guha <gu...@gmail.com>.
Hi

My question is: what happens if I have one file of, say, 100 GB? Then how many
partitions will there be?

Best
Ayan
Best Regards,
Ayan Guha

Re: Read Data From NFS

Posted by vaquar khan <va...@gmail.com>.
Hi Ayan,

If you have multiple files (for example, 12 files) and you use the
following code, then you will get 12 partitions.

r = sc.textFile("file://my/file/*")

I am not sure what else you want to know about the file system; please check the API docs.
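
A quick way to confirm the partition count, assuming a hypothetical directory of twelve small files and an existing SparkContext sc:

# With a glob over 12 files that are each smaller than one split,
# FileInputFormat produces at least one split per file, hence 12 partitions.
r = sc.textFile("file:///nfs/mount/inputs/*")
print(r.getNumPartitions())   # expected: 12 for 12 small files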


Regards,
Vaquar khan


Re: Read Data From NFS

Posted by ayan guha <gu...@gmail.com>.
Anyone?

-- 
Best Regards,
Ayan Guha