Posted to user@hadoop.apache.org by Sidharth Kumar <si...@gmail.com> on 2017/04/06 19:55:52 UTC

Anatomy of read in hdfs

Hi Genies,

I have a small doubt: is the HDFS read operation a parallel or a sequential
process? From my understanding it should be parallel, but "Hadoop: The
Definitive Guide, 4th edition", in its anatomy of a read, says: "*Data is
streamed from the datanode back to the client, which calls read() repeatedly
on the stream (step 4). When the end of the block is reached, DFSInputStream
will close the connection to the datanode, then find the best datanode for
the next block (step 5). This happens transparently to the client, which
from its point of view is just reading a continuous stream*."

So could you kindly explain how the read operation exactly happens?


Thanks for your help in advance

Sidharth

Re: Anatomy of read in hdfs

Posted by Mohammad Tariq <do...@gmail.com>.
Hi Sidharth,

I'm sorry, I didn't quite get the first part of your question. What do you
mean by real time? Could you please elaborate a bit? That'll help me answer
your question better.

And for your second question,

This is how write happens -

Suppose your file resides in your local file system and you have written a
program (using the HDFS API): an input stream gets created on this file and
data gets read from it. Data is buffered at the client side and streamed,
packet by packet, to the datanode where it has to be written; once a full
block (the block size you configured) has been written, the client moves on
to the next block. Each packet written to a datanode is forwarded along a
pipeline of other datanodes for replication, based on the replication factor
you have specified in your configuration.

This process continues until the whole file has been written to the target
HDFS location. Again, since this program is a standalone application, the
write happens sequentially.
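To make that concrete, here is a minimal sketch of such a standalone write
using the HDFS Java API (the local and HDFS paths below are made up for
illustration):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Input stream on the local file (hypothetical path).
            InputStream in = new BufferedInputStream(
                    new FileInputStream("/tmp/local-file.txt"));

            // A single output stream to HDFS (hypothetical path). The client
            // buffers the data and streams it packet by packet down the
            // datanode pipeline, so from the application's point of view this
            // is one sequential stream.
            OutputStream out = fs.create(new Path("/user/sidharth/copy.txt"));

            IOUtils.copyBytes(in, out, 4096, true); // closes both streams
        }
    }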

However, if your source file is already in HDFS and you have written a
distributed application, say a MapReduce program, to copy it to some other
HDFS location then reads and writes will happen in parallel based on the
number of mappers and reducers you have.

One important thing to note here is that parallelism on the read side is
determined by the number of mappers created from the InputFormat you are
using, and it cannot be controlled directly unless you change the way the
InputFormat computes its splits or do some other tweaking. However, you can
tune the write-side parallelism by changing the number of reducers in your
program.
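For illustration only, a minimal driver sketch (class name and paths are
invented) showing where each knob lives: the read-side parallelism falls out
of the input splits computed by the InputFormat, while the write-side
parallelism is set explicitly through the number of reducers:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class CopyJobDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "hdfs-copy");
            job.setJarByClass(CopyJobDriver.class);

            // Read-side parallelism: one map task per input split, decided
            // by the InputFormat (by default roughly one split per block).
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/user/sidharth/input"));

            // Write-side parallelism: chosen directly by the number of
            // reducers (each reducer writes its own output file).
            job.setNumReduceTasks(8);

            job.setOutputFormatClass(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job,
                    new Path("/user/sidharth/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }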

Hope this helps!


Tariq, Mohammad
about.me/mti


On Sun, Apr 9, 2017 at 3:20 PM, Sidharth Kumar <si...@gmail.com>
wrote:

> Thanks Tariq, It really helped me to understand but just one another doubt
> that if reading is not a parallel process then to ready a file of 100GB and
>  hdfs block size is 128MB. It will take lot much to read the complete file
> but it's not the scenerio in the real time. And second question is write
> operations as well is sequential process ? And will every datanode have
> their own data streamer which listen to data queue to get the packets and
> create pipeline. So, can you kindly help me to get clear idea of hdfs read
> and write operations.
>
> Regards
> Sidharth
>
> On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <do...@gmail.com> wrote:
>
> Hi Sidhart,
>
> When you read data from HDFS using a framework, like MapReduce, blocks of
> a HDFS file are read in parallel by multiple mappers created in that
> particular program. Input splits to be precise.
>
> On the other hand if you have a standalone java program then it's just a
> single thread process and will read the data sequentially.
>
>
> On Friday, April 7, 2017, Sidharth Kumar <si...@gmail.com>
> wrote:
>
>> Thanks for your response . But I dint understand yet,if you don't mind
>> can you tell me what do you mean by "*With Hadoop, the idea is to
>> parallelize the readers (one per block for the mapper) with processing
>> framework like MapReduce.*"
>>
>> And also how the concept of parallelize the readers will work with hdfs
>>
>> Thanks a lot in advance for your help.
>>
>>
>> Regards
>> Sidharth
>>
>> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pk...@octo.com> wrote:
>>
>> Hi Sidharth,
>>
>> The reads are sequential.
>> With Hadoop, the idea is to parallelize the readers (one per block for
>> the mapper) with processing framework like MapReduce.
>>
>> Regards,
>> Philippe
>>
>>
>> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <
>> sidharthkumar2707@gmail.com> wrote:
>>
>>> Hi Genies,
>>>
>>> I have a small doubt that hdfs read operation is parallel or sequential
>>> process. Because from my understanding it should be parallel but if I read
>>> "hadoop definitive guide 4" in anatomy of read it says "*Data is
>>> streamed from the datanode back **to the client, which calls read()
>>> repeatedly on the stream (step 4). When the end of the **block is
>>> reached, DFSInputStream will close the connection to the datanode, then
>>> find **the best datanode for the next block (step 5). This happens
>>> transparently to the client, **which from its point of view is just
>>> reading a continuous stream*."
>>>
>>> So can you kindly explain me how read operation will exactly happens.
>>>
>>>
>>> Thanks for your help in advance
>>>
>>> Sidharth
>>>
>>>
>>
>>
>> --
>> Philippe Kernévez
>>
>>
>>
>> Directeur technique (Suisse),
>> pkernevez@octo.com
>> +41 79 888 33 32
>>
>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>> OCTO Technology http://www.octo.ch
>>
>>
>>
>
> --
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> <http://about.me/mti>
>
>
>
>

Re: Anatomy of read in hdfs

Posted by Sidharth Kumar <si...@gmail.com>.
Thanks Philippe, but your answers raised another set of questions for me.
Please help me understand them:
1) If we read the anatomy of an HDFS write in the Hadoop definitive guide, it
says the data queue is consumed by the streamer. So, can you tell me whether
there is only one streamer in a cluster which consumes packets from the data
queue and creates a pipeline for each packet to store onto the datanodes, or
whether there are multiple streamers which consume packets from the data
queue and store them onto the datanodes in parallel?
2) Multiple blogs have been written claiming that read and write are parallel
processes (below I have pasted one such link). Can you also help me by
explaining whether they are wrong?
http://stackoverflow.com/questions/30400249/hadoop-pipeline-write-and-parallel-read


Thanks for your help in advance

Sidharth

On 10-Apr-2017 3:31 PM, "Philippe Kernévez" <pk...@octo.com> wrote:

>
>
> On Mon, Apr 10, 2017 at 11:46 AM, Sidharth Kumar <
> sidharthkumar2707@gmail.com> wrote:
>
>> Thanks Philippe,
>>
>> I am looking for answer only restricted to HDFS. Because we can do read
>> and write operations from CLI using commands like "*hadoop fs
>> -copyfromlocal /(local disk location) /(hdfs path)" *and read using "*hadoop
>> fs -text /(hdfs file)" *as well.
>>
>> So my question are
>> 1) when I write data using -copyfromlocal command how data from data
>> queue is being pushed to data streamer ? Do we have only one data streamer
>> which listen to data queue and store data into individual datanode one by
>> one or we have multiple streamer which listen to data queue and create
>> pipeline for each individual packets?
>>
> ​On stream per command. You may start several command, one per file, but
> the bottleneck will quickly be ​the network.
> This command is only used to do import/export data from/to hadoop cluster.
> The main reads and writes should occurs inside the cluster, when you will
> do processing.
>
>
>> 2) Similarly when we read data, client will receive packets one after
>> another in sequential manner like 2nd data node will wait for 1st node to
>> send it's block first or it will be a parallel process.
>>
> ​Depend on the reader. If you use cmd cli, yes the reads will be
> sequential. If you use Hadoop Yarn processing patterns (MapReduce, Spark,
> Tez, etc.)​ then multiple reader (Map) will be started to do parallel
> processing of you data.
>
> ​What do you want to do with the data that you read ?
>
> Regards,
> Philippe​
>
>
>
>>
>>
>> Thanks for your help in advance.
>>
>> Sidharth
>>
>>
>> On 10-Apr-2017 1:50 PM, "Philippe Kernévez" <pk...@octo.com> wrote:
>>
>>> Hi Sidharth,
>>>
>>> As it has been explained, HDFS is not just a file system. It's a part of
>>> the Hadoop platform. To take advantage of HDFS you have to understand how
>>> Hadoop storage (HDFS) AND Yarn processing (say MapReduce) work all together
>>> to implements jobs and parallel processing.
>>> That says that you will have to rethink the design of your programs to
>>> take advantage of HDFS.
>>>
>>> You may start with this kind of tutorial
>>> https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm
>>>
>>> Then have a deeper read of the Hadoop documentation
>>> http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client
>>> /hadoop-mapreduce-client-core/MapReduceTutorial.html
>>>
>>> Regards,
>>> Philippe
>>>
>>>
>>>
>>> On Sun, Apr 9, 2017 at 11:13 PM, daemeon reiydelle <da...@gmail.com>
>>> wrote:
>>>
>>>> Readers ARE parallel processes, one per map task. There are defaults in
>>>> map phase, about how many readers there are for the input file(s). Default
>>>> is one mapper task block (or file, where any file is smaller than the hdfs
>>>> block size). There is no java framework per se for splitting up an file
>>>> (technically not so, but let's simplify, outside of your own custom code).
>>>>
>>>>
>>>> *.......*
>>>>
>>>>
>>>>
>>>> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 <(415)%20501-0198>London
>>>> (+44) (0) 20 8144 9872 <+44%2020%208144%209872>*
>>>>
>>>> On Sun, Apr 9, 2017 at 2:50 AM, Sidharth Kumar <
>>>> sidharthkumar2707@gmail.com> wrote:
>>>>
>>>>> Thanks Tariq, It really helped me to understand but just one another
>>>>> doubt that if reading is not a parallel process then to ready a file of
>>>>> 100GB and  hdfs block size is 128MB. It will take lot much to read the
>>>>> complete file but it's not the scenerio in the real time. And second
>>>>> question is write operations as well is sequential process ? And will every
>>>>> datanode have their own data streamer which listen to data queue to get the
>>>>> packets and create pipeline. So, can you kindly help me to get clear idea
>>>>> of hdfs read and write operations.
>>>>>
>>>>> Regards
>>>>> Sidharth
>>>>>
>>>>> On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <do...@gmail.com> wrote:
>>>>>
>>>>> Hi Sidhart,
>>>>>
>>>>> When you read data from HDFS using a framework, like MapReduce, blocks
>>>>> of a HDFS file are read in parallel by multiple mappers created in that
>>>>> particular program. Input splits to be precise.
>>>>>
>>>>> On the other hand if you have a standalone java program then it's just
>>>>> a single thread process and will read the data sequentially.
>>>>>
>>>>>
>>>>> On Friday, April 7, 2017, Sidharth Kumar <si...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for your response . But I dint understand yet,if you don't
>>>>>> mind can you tell me what do you mean by "*With Hadoop, the idea is
>>>>>> to parallelize the readers (one per block for the mapper) with processing
>>>>>> framework like MapReduce.*"
>>>>>>
>>>>>> And also how the concept of parallelize the readers will work with
>>>>>> hdfs
>>>>>>
>>>>>> Thanks a lot in advance for your help.
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Sidharth
>>>>>>
>>>>>> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pk...@octo.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Sidharth,
>>>>>>
>>>>>> The reads are sequential.
>>>>>> With Hadoop, the idea is to parallelize the readers (one per block
>>>>>> for the mapper) with processing framework like MapReduce.
>>>>>>
>>>>>> Regards,
>>>>>> Philippe
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <
>>>>>> sidharthkumar2707@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Genies,
>>>>>>>
>>>>>>> I have a small doubt that hdfs read operation is parallel or
>>>>>>> sequential process. Because from my understanding it should be parallel but
>>>>>>> if I read "hadoop definitive guide 4" in anatomy of read it says "*Data
>>>>>>> is streamed from the datanode back **to the client, which calls
>>>>>>> read() repeatedly on the stream (step 4). When the end of the **block
>>>>>>> is reached, DFSInputStream will close the connection to the datanode, then
>>>>>>> find **the best datanode for the next block (step 5). This happens
>>>>>>> transparently to the client, **which from its point of view is just
>>>>>>> reading a continuous stream*."
>>>>>>>
>>>>>>> So can you kindly explain me how read operation will exactly happens.
>>>>>>>
>>>>>>>
>>>>>>> Thanks for your help in advance
>>>>>>>
>>>>>>> Sidharth
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Philippe Kernévez
>>>>>>
>>>>>>
>>>>>>
>>>>>> Directeur technique (Suisse),
>>>>>> pkernevez@octo.com
>>>>>> +41 79 888 33 32 <+41%2079%20888%2033%2032>
>>>>>>
>>>>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>>>>> OCTO Technology http://www.octo.ch
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> [image: http://]
>>>>>
>>>>> Tariq, Mohammad
>>>>> about.me/mti
>>>>> [image: http://]
>>>>> <http://about.me/mti>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Philippe Kernévez
>>>
>>>
>>>
>>> Directeur technique (Suisse),
>>> pkernevez@octo.com
>>> +41 79 888 33 32 <+41%2079%20888%2033%2032>
>>>
>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>> OCTO Technology http://www.octo.ch
>>>
>>
>
>
> --
> Philippe Kernévez
>
>
>
> Directeur technique (Suisse),
> pkernevez@octo.com
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.ch
>

Re: Anatomy of read in hdfs

Posted by Philippe Kernévez <pk...@octo.com>.
On Mon, Apr 10, 2017 at 11:46 AM, Sidharth Kumar <
sidharthkumar2707@gmail.com> wrote:

> Thanks Philippe,
>
> I am looking for answer only restricted to HDFS. Because we can do read
> and write operations from CLI using commands like "*hadoop fs
> -copyfromlocal /(local disk location) /(hdfs path)" *and read using "*hadoop
> fs -text /(hdfs file)" *as well.
>
> So my question are
> 1) when I write data using -copyfromlocal command how data from data queue
> is being pushed to data streamer ? Do we have only one data streamer which
> listen to data queue and store data into individual datanode one by one or
> we have multiple streamer which listen to data queue and create pipeline
> for each individual packets?
>
One stream per command. You may start several commands, one per file, but
the bottleneck will quickly be the network.
This command is only used to import/export data to/from the Hadoop cluster.
The main reads and writes should occur inside the cluster, when you do
processing.
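As an illustration only (the file names and paths are invented), starting one
stream per file from a client program could look like this with the Java API;
each copy gets its own write pipeline, and the client network quickly becomes
the limit:

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelUpload {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical local files; each one gets its own write stream.
            List<String> localFiles =
                    Arrays.asList("/tmp/a.log", "/tmp/b.log", "/tmp/c.log");

            ExecutorService pool = Executors.newFixedThreadPool(localFiles.size());
            for (String f : localFiles) {
                pool.submit(() -> {
                    // One copy per thread, i.e. one stream (and one datanode
                    // pipeline) per file, until the client's network
                    // bandwidth becomes the bottleneck.
                    fs.copyFromLocalFile(new Path(f),
                            new Path("/user/sidharth/" + new Path(f).getName()));
                    return null;
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }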


> 2) Similarly when we read data, client will receive packets one after
> another in sequential manner like 2nd data node will wait for 1st node to
> send it's block first or it will be a parallel process.
>
It depends on the reader. If you use the command-line CLI, yes, the reads
will be sequential. If you use Hadoop YARN processing patterns (MapReduce,
Spark, Tez, etc.) then multiple readers (map tasks) will be started to do
parallel processing of your data.

What do you want to do with the data that you read?

Regards,
Philippe



>
>
> Thanks for your help in advance.
>
> Sidharth
>
>
> On 10-Apr-2017 1:50 PM, "Philippe Kernévez" <pk...@octo.com> wrote:
>
>> Hi Sidharth,
>>
>> As it has been explained, HDFS is not just a file system. It's a part of
>> the Hadoop platform. To take advantage of HDFS you have to understand how
>> Hadoop storage (HDFS) AND Yarn processing (say MapReduce) work all together
>> to implements jobs and parallel processing.
>> That says that you will have to rethink the design of your programs to
>> take advantage of HDFS.
>>
>> You may start with this kind of tutorial
>> https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm
>>
>> Then have a deeper read of the Hadoop documentation
>> http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client
>> /hadoop-mapreduce-client-core/MapReduceTutorial.html
>>
>> Regards,
>> Philippe
>>
>>
>>
>> On Sun, Apr 9, 2017 at 11:13 PM, daemeon reiydelle <da...@gmail.com>
>> wrote:
>>
>>> Readers ARE parallel processes, one per map task. There are defaults in
>>> map phase, about how many readers there are for the input file(s). Default
>>> is one mapper task block (or file, where any file is smaller than the hdfs
>>> block size). There is no java framework per se for splitting up an file
>>> (technically not so, but let's simplify, outside of your own custom code).
>>>
>>>
>>> *.......*
>>>
>>>
>>>
>>> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 <(415)%20501-0198>London
>>> (+44) (0) 20 8144 9872 <+44%2020%208144%209872>*
>>>
>>> On Sun, Apr 9, 2017 at 2:50 AM, Sidharth Kumar <
>>> sidharthkumar2707@gmail.com> wrote:
>>>
>>>> Thanks Tariq, It really helped me to understand but just one another
>>>> doubt that if reading is not a parallel process then to ready a file of
>>>> 100GB and  hdfs block size is 128MB. It will take lot much to read the
>>>> complete file but it's not the scenerio in the real time. And second
>>>> question is write operations as well is sequential process ? And will every
>>>> datanode have their own data streamer which listen to data queue to get the
>>>> packets and create pipeline. So, can you kindly help me to get clear idea
>>>> of hdfs read and write operations.
>>>>
>>>> Regards
>>>> Sidharth
>>>>
>>>> On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <do...@gmail.com> wrote:
>>>>
>>>> Hi Sidhart,
>>>>
>>>> When you read data from HDFS using a framework, like MapReduce, blocks
>>>> of a HDFS file are read in parallel by multiple mappers created in that
>>>> particular program. Input splits to be precise.
>>>>
>>>> On the other hand if you have a standalone java program then it's just
>>>> a single thread process and will read the data sequentially.
>>>>
>>>>
>>>> On Friday, April 7, 2017, Sidharth Kumar <si...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for your response . But I dint understand yet,if you don't mind
>>>>> can you tell me what do you mean by "*With Hadoop, the idea is to
>>>>> parallelize the readers (one per block for the mapper) with processing
>>>>> framework like MapReduce.*"
>>>>>
>>>>> And also how the concept of parallelize the readers will work with hdfs
>>>>>
>>>>> Thanks a lot in advance for your help.
>>>>>
>>>>>
>>>>> Regards
>>>>> Sidharth
>>>>>
>>>>> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pk...@octo.com>
>>>>> wrote:
>>>>>
>>>>> Hi Sidharth,
>>>>>
>>>>> The reads are sequential.
>>>>> With Hadoop, the idea is to parallelize the readers (one per block for
>>>>> the mapper) with processing framework like MapReduce.
>>>>>
>>>>> Regards,
>>>>> Philippe
>>>>>
>>>>>
>>>>> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <
>>>>> sidharthkumar2707@gmail.com> wrote:
>>>>>
>>>>>> Hi Genies,
>>>>>>
>>>>>> I have a small doubt that hdfs read operation is parallel or
>>>>>> sequential process. Because from my understanding it should be parallel but
>>>>>> if I read "hadoop definitive guide 4" in anatomy of read it says "*Data
>>>>>> is streamed from the datanode back **to the client, which calls
>>>>>> read() repeatedly on the stream (step 4). When the end of the **block
>>>>>> is reached, DFSInputStream will close the connection to the datanode, then
>>>>>> find **the best datanode for the next block (step 5). This happens
>>>>>> transparently to the client, **which from its point of view is just
>>>>>> reading a continuous stream*."
>>>>>>
>>>>>> So can you kindly explain me how read operation will exactly happens.
>>>>>>
>>>>>>
>>>>>> Thanks for your help in advance
>>>>>>
>>>>>> Sidharth
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Philippe Kernévez
>>>>>
>>>>>
>>>>>
>>>>> Directeur technique (Suisse),
>>>>> pkernevez@octo.com
>>>>> +41 79 888 33 32 <+41%2079%20888%2033%2032>
>>>>>
>>>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>>>> OCTO Technology http://www.octo.ch
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> [image: http://]
>>>>
>>>> Tariq, Mohammad
>>>> about.me/mti
>>>> [image: http://]
>>>> <http://about.me/mti>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Philippe Kernévez
>>
>>
>>
>> Directeur technique (Suisse),
>> pkernevez@octo.com
>> +41 79 888 33 32 <+41%2079%20888%2033%2032>
>>
>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>> OCTO Technology http://www.octo.ch
>>
>


-- 
Philippe Kernévez



Directeur technique (Suisse),
pkernevez@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.ch

Re: Anatomy of read in hdfs

Posted by Sidharth Kumar <si...@gmail.com>.
Thanks Philippe,

I am looking for an answer restricted to HDFS only, because we can do read
and write operations from the CLI using commands like "hadoop fs
-copyFromLocal /(local disk location) /(hdfs path)" and read using "hadoop
fs -text /(hdfs file)" as well.

So my questions are:
1) When I write data using the -copyFromLocal command, how is data from the
data queue pushed to the data streamer? Do we have only one data streamer
which listens to the data queue and stores data onto individual datanodes one
by one, or do we have multiple streamers which listen to the data queue and
create a pipeline for each individual packet?

2) Similarly, when we read data, will the client receive packets one after
another in a sequential manner (i.e. will the 2nd datanode wait for the 1st
node to send its block first), or will it be a parallel process?


Thanks for your help in advance.

Sidharth


On 10-Apr-2017 1:50 PM, "Philippe Kernévez" <pk...@octo.com> wrote:

> Hi Sidharth,
>
> As it has been explained, HDFS is not just a file system. It's a part of
> the Hadoop platform. To take advantage of HDFS you have to understand how
> Hadoop storage (HDFS) AND Yarn processing (say MapReduce) work all together
> to implements jobs and parallel processing.
> That says that you will have to rethink the design of your programs to
> take advantage of HDFS.
>
> You may start with this kind of tutorial
> https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm
>
> Then have a deeper read of the Hadoop documentation
> http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-
> client/hadoop-mapreduce-client-core/MapReduceTutorial.html
>
> Regards,
> Philippe
>
>
>
> On Sun, Apr 9, 2017 at 11:13 PM, daemeon reiydelle <da...@gmail.com>
> wrote:
>
>> Readers ARE parallel processes, one per map task. There are defaults in
>> map phase, about how many readers there are for the input file(s). Default
>> is one mapper task block (or file, where any file is smaller than the hdfs
>> block size). There is no java framework per se for splitting up an file
>> (technically not so, but let's simplify, outside of your own custom code).
>>
>>
>> *.......*
>>
>>
>>
>> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 <(415)%20501-0198>London
>> (+44) (0) 20 8144 9872 <+44%2020%208144%209872>*
>>
>> On Sun, Apr 9, 2017 at 2:50 AM, Sidharth Kumar <
>> sidharthkumar2707@gmail.com> wrote:
>>
>>> Thanks Tariq, It really helped me to understand but just one another
>>> doubt that if reading is not a parallel process then to ready a file of
>>> 100GB and  hdfs block size is 128MB. It will take lot much to read the
>>> complete file but it's not the scenerio in the real time. And second
>>> question is write operations as well is sequential process ? And will every
>>> datanode have their own data streamer which listen to data queue to get the
>>> packets and create pipeline. So, can you kindly help me to get clear idea
>>> of hdfs read and write operations.
>>>
>>> Regards
>>> Sidharth
>>>
>>> On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <do...@gmail.com> wrote:
>>>
>>> Hi Sidhart,
>>>
>>> When you read data from HDFS using a framework, like MapReduce, blocks
>>> of a HDFS file are read in parallel by multiple mappers created in that
>>> particular program. Input splits to be precise.
>>>
>>> On the other hand if you have a standalone java program then it's just a
>>> single thread process and will read the data sequentially.
>>>
>>>
>>> On Friday, April 7, 2017, Sidharth Kumar <si...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for your response . But I dint understand yet,if you don't mind
>>>> can you tell me what do you mean by "*With Hadoop, the idea is to
>>>> parallelize the readers (one per block for the mapper) with processing
>>>> framework like MapReduce.*"
>>>>
>>>> And also how the concept of parallelize the readers will work with hdfs
>>>>
>>>> Thanks a lot in advance for your help.
>>>>
>>>>
>>>> Regards
>>>> Sidharth
>>>>
>>>> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pk...@octo.com> wrote:
>>>>
>>>> Hi Sidharth,
>>>>
>>>> The reads are sequential.
>>>> With Hadoop, the idea is to parallelize the readers (one per block for
>>>> the mapper) with processing framework like MapReduce.
>>>>
>>>> Regards,
>>>> Philippe
>>>>
>>>>
>>>> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <
>>>> sidharthkumar2707@gmail.com> wrote:
>>>>
>>>>> Hi Genies,
>>>>>
>>>>> I have a small doubt that hdfs read operation is parallel or
>>>>> sequential process. Because from my understanding it should be parallel but
>>>>> if I read "hadoop definitive guide 4" in anatomy of read it says "*Data
>>>>> is streamed from the datanode back **to the client, which calls
>>>>> read() repeatedly on the stream (step 4). When the end of the **block
>>>>> is reached, DFSInputStream will close the connection to the datanode, then
>>>>> find **the best datanode for the next block (step 5). This happens
>>>>> transparently to the client, **which from its point of view is just
>>>>> reading a continuous stream*."
>>>>>
>>>>> So can you kindly explain me how read operation will exactly happens.
>>>>>
>>>>>
>>>>> Thanks for your help in advance
>>>>>
>>>>> Sidharth
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Philippe Kernévez
>>>>
>>>>
>>>>
>>>> Directeur technique (Suisse),
>>>> pkernevez@octo.com
>>>> +41 79 888 33 32 <+41%2079%20888%2033%2032>
>>>>
>>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>>> OCTO Technology http://www.octo.ch
>>>>
>>>>
>>>>
>>>
>>> --
>>>
>>>
>>> [image: http://]
>>>
>>> Tariq, Mohammad
>>> about.me/mti
>>> [image: http://]
>>> <http://about.me/mti>
>>>
>>>
>>>
>>>
>>
>
>
> --
> Philippe Kernévez
>
>
>
> Directeur technique (Suisse),
> pkernevez@octo.com
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.ch
>

Re: Anatomy of read in hdfs

Posted by Philippe Kernévez <pk...@octo.com>.
Hi Sidharth,

As has been explained, HDFS is not just a file system; it's a part of the
Hadoop platform. To take advantage of HDFS you have to understand how Hadoop
storage (HDFS) AND YARN processing (say, MapReduce) work together to
implement jobs and parallel processing.
That means you will have to rethink the design of your programs to take
advantage of HDFS.

You may start with this kind of tutorial
https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm

Then have a deeper read of the Hadoop documentation
http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Regards,
Philippe



On Sun, Apr 9, 2017 at 11:13 PM, daemeon reiydelle <da...@gmail.com>
wrote:

> Readers ARE parallel processes, one per map task. There are defaults in
> map phase, about how many readers there are for the input file(s). Default
> is one mapper task block (or file, where any file is smaller than the hdfs
> block size). There is no java framework per se for splitting up an file
> (technically not so, but let's simplify, outside of your own custom code).
>
>
> *.......*
>
>
>
> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 <(415)%20501-0198>London
> (+44) (0) 20 8144 9872 <+44%2020%208144%209872>*
>
> On Sun, Apr 9, 2017 at 2:50 AM, Sidharth Kumar <
> sidharthkumar2707@gmail.com> wrote:
>
>> Thanks Tariq, It really helped me to understand but just one another
>> doubt that if reading is not a parallel process then to ready a file of
>> 100GB and  hdfs block size is 128MB. It will take lot much to read the
>> complete file but it's not the scenerio in the real time. And second
>> question is write operations as well is sequential process ? And will every
>> datanode have their own data streamer which listen to data queue to get the
>> packets and create pipeline. So, can you kindly help me to get clear idea
>> of hdfs read and write operations.
>>
>> Regards
>> Sidharth
>>
>> On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <do...@gmail.com> wrote:
>>
>> Hi Sidhart,
>>
>> When you read data from HDFS using a framework, like MapReduce, blocks of
>> a HDFS file are read in parallel by multiple mappers created in that
>> particular program. Input splits to be precise.
>>
>> On the other hand if you have a standalone java program then it's just a
>> single thread process and will read the data sequentially.
>>
>>
>> On Friday, April 7, 2017, Sidharth Kumar <si...@gmail.com>
>> wrote:
>>
>>> Thanks for your response . But I dint understand yet,if you don't mind
>>> can you tell me what do you mean by "*With Hadoop, the idea is to
>>> parallelize the readers (one per block for the mapper) with processing
>>> framework like MapReduce.*"
>>>
>>> And also how the concept of parallelize the readers will work with hdfs
>>>
>>> Thanks a lot in advance for your help.
>>>
>>>
>>> Regards
>>> Sidharth
>>>
>>> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pk...@octo.com> wrote:
>>>
>>> Hi Sidharth,
>>>
>>> The reads are sequential.
>>> With Hadoop, the idea is to parallelize the readers (one per block for
>>> the mapper) with processing framework like MapReduce.
>>>
>>> Regards,
>>> Philippe
>>>
>>>
>>> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <
>>> sidharthkumar2707@gmail.com> wrote:
>>>
>>>> Hi Genies,
>>>>
>>>> I have a small doubt that hdfs read operation is parallel or sequential
>>>> process. Because from my understanding it should be parallel but if I read
>>>> "hadoop definitive guide 4" in anatomy of read it says "*Data is
>>>> streamed from the datanode back **to the client, which calls read()
>>>> repeatedly on the stream (step 4). When the end of the **block is
>>>> reached, DFSInputStream will close the connection to the datanode, then
>>>> find **the best datanode for the next block (step 5). This happens
>>>> transparently to the client, **which from its point of view is just
>>>> reading a continuous stream*."
>>>>
>>>> So can you kindly explain me how read operation will exactly happens.
>>>>
>>>>
>>>> Thanks for your help in advance
>>>>
>>>> Sidharth
>>>>
>>>>
>>>
>>>
>>> --
>>> Philippe Kernévez
>>>
>>>
>>>
>>> Directeur technique (Suisse),
>>> pkernevez@octo.com
>>> +41 79 888 33 32 <+41%2079%20888%2033%2032>
>>>
>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>> OCTO Technology http://www.octo.ch
>>>
>>>
>>>
>>
>> --
>>
>>
>> [image: http://]
>>
>> Tariq, Mohammad
>> about.me/mti
>> [image: http://]
>> <http://about.me/mti>
>>
>>
>>
>>
>


-- 
Philippe Kernévez



Directeur technique (Suisse),
pkernevez@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.ch

Re: Anatomy of read in hdfs

Posted by daemeon reiydelle <da...@gmail.com>.
Readers ARE parallel processes, one per map task. There are defaults in the
map phase for how many readers there are for the input file(s). The default
is one mapper task per block (or per file, where a file is smaller than the
HDFS block size). There is no Java framework per se for splitting up a file
(technically not quite so, but let's simplify and leave aside your own custom
code).
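As a rough back-of-the-envelope sketch (using the 100GB file and 128MB block
size from the earlier question; the real split count depends on the
InputFormat), that default works out to roughly one map task per block:

    public class SplitCount {
        public static void main(String[] args) {
            long fileSize  = 100L * 1024 * 1024 * 1024; // 100 GB file
            long blockSize = 128L * 1024 * 1024;        // 128 MB HDFS block

            // One split per block by default, so roughly this many map
            // tasks (parallel readers) get launched for the file:
            long mapTasks = (fileSize + blockSize - 1) / blockSize;
            System.out.println(mapTasks); // 800
        }
    }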


*.......*



Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Sun, Apr 9, 2017 at 2:50 AM, Sidharth Kumar <si...@gmail.com>
wrote:

> Thanks Tariq, It really helped me to understand but just one another doubt
> that if reading is not a parallel process then to ready a file of 100GB and
>  hdfs block size is 128MB. It will take lot much to read the complete file
> but it's not the scenerio in the real time. And second question is write
> operations as well is sequential process ? And will every datanode have
> their own data streamer which listen to data queue to get the packets and
> create pipeline. So, can you kindly help me to get clear idea of hdfs read
> and write operations.
>
> Regards
> Sidharth
>
> On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <do...@gmail.com> wrote:
>
> Hi Sidhart,
>
> When you read data from HDFS using a framework, like MapReduce, blocks of
> a HDFS file are read in parallel by multiple mappers created in that
> particular program. Input splits to be precise.
>
> On the other hand if you have a standalone java program then it's just a
> single thread process and will read the data sequentially.
>
>
> On Friday, April 7, 2017, Sidharth Kumar <si...@gmail.com>
> wrote:
>
>> Thanks for your response . But I dint understand yet,if you don't mind
>> can you tell me what do you mean by "*With Hadoop, the idea is to
>> parallelize the readers (one per block for the mapper) with processing
>> framework like MapReduce.*"
>>
>> And also how the concept of parallelize the readers will work with hdfs
>>
>> Thanks a lot in advance for your help.
>>
>>
>> Regards
>> Sidharth
>>
>> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pk...@octo.com> wrote:
>>
>> Hi Sidharth,
>>
>> The reads are sequential.
>> With Hadoop, the idea is to parallelize the readers (one per block for
>> the mapper) with processing framework like MapReduce.
>>
>> Regards,
>> Philippe
>>
>>
>> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <
>> sidharthkumar2707@gmail.com> wrote:
>>
>>> Hi Genies,
>>>
>>> I have a small doubt that hdfs read operation is parallel or sequential
>>> process. Because from my understanding it should be parallel but if I read
>>> "hadoop definitive guide 4" in anatomy of read it says "*Data is
>>> streamed from the datanode back **to the client, which calls read()
>>> repeatedly on the stream (step 4). When the end of the **block is
>>> reached, DFSInputStream will close the connection to the datanode, then
>>> find **the best datanode for the next block (step 5). This happens
>>> transparently to the client, **which from its point of view is just
>>> reading a continuous stream*."
>>>
>>> So can you kindly explain me how read operation will exactly happens.
>>>
>>>
>>> Thanks for your help in advance
>>>
>>> Sidharth
>>>
>>>
>>
>>
>> --
>> Philippe Kernévez
>>
>>
>>
>> Directeur technique (Suisse),
>> pkernevez@octo.com
>> +41 79 888 33 32 <+41%2079%20888%2033%2032>
>>
>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>> OCTO Technology http://www.octo.ch
>>
>>
>>
>
> --
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> <http://about.me/mti>
>
>
>
>

Re: Anatomy of read in hdfs

Posted by Sidharth Kumar <si...@gmail.com>.
Thanks Tariq, it really helped me understand, but I have just one more
doubt: if reading is not a parallel process, then reading a 100GB file with
an HDFS block size of 128MB would take very long, but that's not what
happens in practice. And my second question: is the write operation a
sequential process as well? And will every datanode have its own data
streamer which listens to the data queue to get the packets and create the
pipeline? So, can you kindly help me get a clear idea of HDFS read and write
operations?

Regards
Sidharth

On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <do...@gmail.com> wrote:

Hi Sidhart,

When you read data from HDFS using a framework, like MapReduce, blocks of a
HDFS file are read in parallel by multiple mappers created in that
particular program. Input splits to be precise.

On the other hand if you have a standalone java program then it's just a
single thread process and will read the data sequentially.


On Friday, April 7, 2017, Sidharth Kumar <si...@gmail.com>
wrote:

> Thanks for your response . But I dint understand yet,if you don't mind can
> you tell me what do you mean by "*With Hadoop, the idea is to parallelize
> the readers (one per block for the mapper) with processing framework like
> MapReduce.*"
>
> And also how the concept of parallelize the readers will work with hdfs
>
> Thanks a lot in advance for your help.
>
>
> Regards
> Sidharth
>
> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pk...@octo.com> wrote:
>
> Hi Sidharth,
>
> The reads are sequential.
> With Hadoop, the idea is to parallelize the readers (one per block for the
> mapper) with processing framework like MapReduce.
>
> Regards,
> Philippe
>
>
> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <
> sidharthkumar2707@gmail.com> wrote:
>
>> Hi Genies,
>>
>> I have a small doubt that hdfs read operation is parallel or sequential
>> process. Because from my understanding it should be parallel but if I read
>> "hadoop definitive guide 4" in anatomy of read it says "*Data is
>> streamed from the datanode back **to the client, which calls read()
>> repeatedly on the stream (step 4). When the end of the **block is
>> reached, DFSInputStream will close the connection to the datanode, then
>> find **the best datanode for the next block (step 5). This happens
>> transparently to the client, **which from its point of view is just
>> reading a continuous stream*."
>>
>> So can you kindly explain me how read operation will exactly happens.
>>
>>
>> Thanks for your help in advance
>>
>> Sidharth
>>
>>
>
>
> --
> Philippe Kernévez
>
>
>
> Directeur technique (Suisse),
> pkernevez@octo.com
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.ch
>
>
>

-- 


Tariq, Mohammad
about.me/mti

Re: Anatomy of read in hdfs

Posted by Mohammad Tariq <do...@gmail.com>.
Hi Sidharth,

When you read data from HDFS using a framework like MapReduce, blocks of an
HDFS file are read in parallel by multiple mappers created in that
particular program. Input splits, to be precise.

On the other hand, if you have a standalone Java program then it's just a
single-threaded process and will read the data sequentially.
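A minimal sketch of such a standalone, sequential read (the HDFS path is
invented for illustration):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSequentialRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // A single open()/read() loop: DFSInputStream fetches block
            // after block (switching datanodes under the hood), so a
            // standalone program like this reads the file sequentially.
            try (FSDataInputStream in =
                         fs.open(new Path("/user/sidharth/big-file.txt"));
                 BufferedReader reader =
                         new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process one line at a time
                }
            }
        }
    }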

On Friday, April 7, 2017, Sidharth Kumar <si...@gmail.com>
wrote:

> Thanks for your response . But I dint understand yet,if you don't mind can
> you tell me what do you mean by "*With Hadoop, the idea is to parallelize
> the readers (one per block for the mapper) with processing framework like
> MapReduce.*"
>
> And also how the concept of parallelize the readers will work with hdfs
>
> Thanks a lot in advance for your help.
>
>
> Regards
> Sidharth
>
> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pkernevez@octo.com
> <javascript:_e(%7B%7D,'cvml','pkernevez@octo.com');>> wrote:
>
> Hi Sidharth,
>
> The reads are sequential.
> With Hadoop, the idea is to parallelize the readers (one per block for the
> mapper) with processing framework like MapReduce.
>
> Regards,
> Philippe
>
>
> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <
> sidharthkumar2707@gmail.com
> <javascript:_e(%7B%7D,'cvml','sidharthkumar2707@gmail.com');>> wrote:
>
>> Hi Genies,
>>
>> I have a small doubt that hdfs read operation is parallel or sequential
>> process. Because from my understanding it should be parallel but if I read
>> "hadoop definitive guide 4" in anatomy of read it says "*Data is
>> streamed from the datanode back **to the client, which calls read()
>> repeatedly on the stream (step 4). When the end of the **block is
>> reached, DFSInputStream will close the connection to the datanode, then
>> find **the best datanode for the next block (step 5). This happens
>> transparently to the client, **which from its point of view is just
>> reading a continuous stream*."
>>
>> So can you kindly explain me how read operation will exactly happens.
>>
>>
>> Thanks for your help in advance
>>
>> Sidharth
>>
>>
>
>
> --
> Philippe Kernévez
>
>
>
> Directeur technique (Suisse),
> pkernevez@octo.com <javascript:_e(%7B%7D,'cvml','pkernevez@octo.com');>
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.ch
>
>
>

-- 


Tariq, Mohammad
about.me/mti

Re: Anatomy of read in hdfs

Posted by Sidharth Kumar <si...@gmail.com>.
Thanks for your response. But I didn't understand it yet; if you don't mind,
can you tell me what you mean by "*With Hadoop, the idea is to parallelize
the readers (one per block for the mapper) with a processing framework like
MapReduce.*"

And also, how will the concept of parallelizing the readers work with HDFS?

Thanks a lot in advance for your help.


Regards
Sidharth

On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <pk...@octo.com> wrote:

Hi Sidharth,

The reads are sequential.
With Hadoop, the idea is to parallelize the readers (one per block for the
mapper) with processing framework like MapReduce.

Regards,
Philippe


On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <si...@gmail.com>
wrote:

> Hi Genies,
>
> I have a small doubt that hdfs read operation is parallel or sequential
> process. Because from my understanding it should be parallel but if I read
> "hadoop definitive guide 4" in anatomy of read it says "*Data is streamed
> from the datanode back **to the client, which calls read() repeatedly on
> the stream (step 4). When the end of the **block is reached,
> DFSInputStream will close the connection to the datanode, then find **the
> best datanode for the next block (step 5). This happens transparently to
> the client, **which from its point of view is just reading a continuous
> stream*."
>
> So can you kindly explain me how read operation will exactly happens.
>
>
> Thanks for your help in advance
>
> Sidharth
>
>


-- 
Philippe Kernévez



Directeur technique (Suisse),
pkernevez@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.ch

Re: Anatomy of read in hdfs

Posted by Philippe Kernévez <pk...@octo.com>.
Hi Sidharth,

The reads are sequential.
With Hadoop, the idea is to parallelize the readers (one per block, for the
mapper) with a processing framework like MapReduce.

Regards,
Philippe


On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <si...@gmail.com>
wrote:

> Hi Genies,
>
> I have a small doubt that hdfs read operation is parallel or sequential
> process. Because from my understanding it should be parallel but if I read
> "hadoop definitive guide 4" in anatomy of read it says "*Data is streamed
> from the datanode back **to the client, which calls read() repeatedly on
> the stream (step 4). When the end of the **block is reached,
> DFSInputStream will close the connection to the datanode, then find **the
> best datanode for the next block (step 5). This happens transparently to
> the client, **which from its point of view is just reading a continuous
> stream*."
>
> So can you kindly explain me how read operation will exactly happens.
>
>
> Thanks for your help in advance
>
> Sidharth
>
>


-- 
Philippe Kernévez



Directeur technique (Suisse),
pkernevez@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.ch