Posted to common-user@hadoop.apache.org by "Venkates .P.B." <ve...@gmail.com> on 2007/08/01 15:09:54 UTC

Loading data into HDFS

A few queries regarding the way data is loaded into HDFS.

-Is it common practice to load data into HDFS only through the master
node? We are able to copy only around 35 logs (64 KB each) per minute in a
two-slave configuration.

-We are concerned about the time it would take to update filenames and block
maps in the master node when data is loaded from a few or all of the slave nodes.
Can anyone let me know how long this update generally takes?

And one more question: what if a node crashes soon after the data is
copied onto it? How is data consistency maintained here?

Thanks in advance,
Venkates P B

Re: Loading data into HDFS

Posted by Dmitry <dm...@hotmail.com>.
I am not sure you are following the right technique. We have the same
issue with loading through the master/slave nodes and are still trying to find
more details on how to do it better, so I can't advise you right now.
Keep posting; somebody can probably give you the correct answer. Good
questions, actually.

thanks,
DT
www.ejinz.com
Search News



RE: Loading data into HDFS

Posted by Runping Qi <ru...@yahoo-inc.com>.
Hadoop Aggregate package (o.a.h.mapred.lib.aggregate) is a good fit for your
aggregation problem.

Runping
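
As a rough, untested sketch of how that package might fit the use case Ted
describes (several aggregations per input record), a plug-in descriptor along
these lines emits one aggregator entry per aggregation type, so everything is
computed in a single map-reduce job.  The field layout and names here are made
up, and method signatures may differ slightly between Hadoop versions:

import java.util.ArrayList;
import java.util.Map.Entry;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorBaseDescriptor;

public class SessionAggDescriptor extends ValueAggregatorBaseDescriptor {
  // Input value: one session summary per line, e.g. "user \t video \t day \t event"
  public ArrayList<Entry<Text, Text>> generateKeyValPairs(Object key, Object val) {
    ArrayList<Entry<Text, Text>> out = new ArrayList<Entry<Text, Text>>();
    String[] f = val.toString().split("\t");
    String user = f[0], video = f[1], day = f[2];
    // One entry per aggregation type; the framework sums/merges them in one reduce.
    out.add(generateEntry(LONG_VALUE_SUM, "views:user:" + user, ONE));
    out.add(generateEntry(LONG_VALUE_SUM, "views:user_day:" + user + ":" + day, ONE));
    out.add(generateEntry(LONG_VALUE_SUM, "views:video:" + video, ONE));
    out.add(generateEntry(UNIQ_VALUE_COUNT, "uniq_viewers:video:" + video, new Text(user)));
    return out;
  }
}

The job itself would then be driven by ValueAggregatorJob with this descriptor
class listed in its configuration.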





Re: Loading data into HDFS

Posted by Jim Kellerman <ji...@powerset.com>.
This request isn't so much about loading data into HDFS, but we really
need the ability to create a file that supports atomic appends for the
HBase redo log. Since HDFS files currently don't exist until they are
closed, the best we can do right now is close the current redo log and
open a new one fairly frequently to minimize the number of updates that
would get lost otherwise. I don't think we need the multi-appender model
that GFS supports, just a single appender.

-Jim
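
For illustration, a minimal sketch of that close-and-reopen workaround,
assuming a SequenceFile-based log and made-up names (this is not HBase's
actual log code):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RollingRedoLog {
  private final FileSystem fs;
  private final Configuration conf;
  private final Path dir;
  private SequenceFile.Writer writer;
  private long edits;
  private static final long ROLL_EVERY = 10000;  // bounds how many edits a crash can lose

  public RollingRedoLog(Configuration conf, Path dir) throws IOException {
    this.conf = conf;
    this.fs = FileSystem.get(conf);
    this.dir = dir;
    roll();
  }

  public synchronized void append(Text row, Text edit) throws IOException {
    writer.append(row, edit);
    if (++edits >= ROLL_EVERY) {
      roll();
    }
  }

  private void roll() throws IOException {
    if (writer != null) {
      writer.close();  // closing is what makes the log file exist (and survive) in HDFS
    }
    writer = SequenceFile.createWriter(fs, conf,
        new Path(dir, "redo." + System.currentTimeMillis()), Text.class, Text.class);
    edits = 0;
  }
}

With a real single-appender append, rolling would no longer be needed just to
bound the loss window.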

-- 
Jim Kellerman, Senior Engineer; Powerset
jim@powerset.com

Re: Loading data into HDFS

Posted by Ted Dunning <td...@veoh.com>.

One issue that we see in building log aggregators using hadoop is that we
often want to do several aggregations in a single reduce task.

For instance, we have viewers who view videos and sometimes watch them to
completion and sometimes scrub to different points in the video and
sometimes close a browser without a session completion event.  We want to
run a map to extract the session id as the reduce key and then have a reduce that
summarizes the basics of the session (user + video + what happened).
The interesting stuff happens in the second map/reduce where we want to have
map emit records for user aggregation, user by day aggregation, video
aggregation, and video by day aggregation (this is somewhat simplified, of
course).  In the second reduce, we want to compute various aggregation
functions such as total count, total distinct count and a few distributional
measures such as estimated non-robot distinct views or long-tail
coefficients or day part usage pattern.

Systems like Pig seem to be built with the assumption that there should be a
separate reduce task per aggregation.  Since we have about a dozen aggregate
functions that we need to compute per aggregation type, that would entail
about an order of magnitude decrease in performance for us unless the
language is clever enough to put all aggregation functions for the same
aggregation into the same reduce task.

Also, our current ingestion rates are completely dominated by our downstream
processing so log file collection isn't a huge deal.  For backfilling old
data, we have some need for higher bandwidth, but even a single NFS server
suffices as a source for heavily compressed and consolidated log files.  Our
transaction volumes are pretty modest compared to something like Yahoo, of
course, since we still have less than 20 million monthly uniques, but we
expect this to continue the strong growth that we have been seeing.
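
As a rough sketch of that second map (hypothetical record layout, using the
org.apache.hadoop.mapred API), one map can emit a record per aggregation so
that a single reduce stage sees all of them, keyed by aggregation type:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Input value: one session summary per line, e.g. "user \t video \t day \t outcome".
public class MultiAggMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    String[] f = value.toString().split("\t");
    String user = f[0], video = f[1], day = f[2];
    // Prefix each key with the aggregation it belongs to; the reducer then
    // computes all of the aggregate functions for whatever group it receives.
    out.collect(new Text("user\t" + user), value);
    out.collect(new Text("user_day\t" + user + "\t" + day), value);
    out.collect(new Text("video\t" + video), value);
    out.collect(new Text("video_day\t" + video + "\t" + day), value);
  }
}

The matching reducer would switch on the key prefix and compute the counts,
distinct counts and distributional measures over the grouped values, which is
what keeps everything in one reduce stage instead of one job per aggregation.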




Re: Loading data into HDFS

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
I'll have our operations folks comment on our current techniques.

We use map-reduce jobs, running on all nodes in the cluster, to copy from the
source, generally using either the HTTP(S) or HDFS protocol.

We've seen write rates as high as 8.3 GBytes/sec on 900 nodes.  This
is network limited.  We see roughly 20 MBytes/sec/node (double the
other rate) on one-rack clusters, with everything connected with
gigabit Ethernet.

We (the Yahoo grid team) are planning to put some more energy into
making the system more useful for real-time log handling in the next
few releases.  For example, I would like to be able to tail -f a file
as it is written, to have a generic log aggregation system, and to
have the map-reduce framework log directly into HDFS using that
system.

I'd love to hear thoughts on other achievable improvements that would  
really help in this area.
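
As a rough, untested sketch of that kind of copy job (hypothetical input
format: each map input line is "<source-url> <hdfs-destination-path>"; retries
and checksums are ignored), every map task pulls its own share of files, so
the copy is spread across the cluster:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class HttpToHdfsCopyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    String[] parts = value.toString().split("\\s+");
    InputStream in = new URL(parts[0]).openStream();
    FSDataOutputStream dst = fs.create(new Path(parts[1]));
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = in.read(buf)) != -1) {
      dst.write(buf, 0, n);
      reporter.progress();  // keep the task from timing out during long copies
    }
    in.close();
    dst.close();
    out.collect(value, new Text("copied"));
  }
}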



Re: Loading data into HDFS

Posted by Jeff Hammerbacher <je...@gmail.com>.
We have a service which writes one copy of a logfile directly into HDFS
(writes go to the namenode).  As Dennis mentions, since HDFS does not support
atomic appends, if a failure occurs before closing a file, it never appears
in the file system.  Thus we have to rotate logfiles at a greater frequency
than we'd like in order to "checkpoint" the data into HDFS.  The system certainly
isn't perfect, but bulk-loading the data into HDFS was proving rather slow.
I'd be curious to hear actual performance numbers and methodologies for bulk
loads.  I'll try to dig some up myself on Monday.
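
For illustration, a minimal sketch of that kind of rotating writer (made-up
names; the real service is not shown).  The close() on a time boundary is the
"checkpoint" that makes the data visible in HDFS:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RotatingHdfsLogWriter {
  private final FileSystem fs;
  private final Path dir;
  private final long rotateMillis;
  private FSDataOutputStream out;
  private long openedAt;

  public RotatingHdfsLogWriter(Configuration conf, Path dir, long rotateMillis)
      throws IOException {
    this.fs = FileSystem.get(conf);
    this.dir = dir;
    this.rotateMillis = rotateMillis;
    open();
  }

  public synchronized void write(String line) throws IOException {
    if (System.currentTimeMillis() - openedAt > rotateMillis) {
      out.close();  // anything written before this close is now durable
      open();
    }
    out.write((line + "\n").getBytes("UTF-8"));
  }

  private void open() throws IOException {
    openedAt = System.currentTimeMillis();
    out = fs.create(new Path(dir, "log." + openedAt));
  }
}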


Re: Loading data into HDFS

Posted by Dennis Kubes <ku...@apache.org>.
You can copy data from any node, so if you can do it from multiple nodes
your performance will be better (although be sure not to overlap
files).  The master node is updated once a block has been copied its
replication number of times.  So if the default replication is 3, then the 3
replicas must be active before the master is updated and the data
"appears" in the dfs.

How long the updates take is a function of your server load,
network speed, and file size.  Generally it is fast.

So the process is: the data is loaded into the dfs, replicas are
created, and the master node is updated.  In terms of consistency, if
the data node crashes before the data is loaded, then the data won't
appear in the dfs.  If the name node crashes before it is updated but
all replicas are active, the data will appear once the name node has
been fixed and updated through block reports.  If a single node holding
a replica crashes after the namenode has been updated, then the data
will be replicated from one of the other 2 replicas to another node,
if one is available, to restore 3 copies.

Dennis Kubes
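
As a minimal sketch of loading from an arbitrary node (paths and class name
made up; it just assumes the node has a Hadoop config pointing at the
namenode), copying from "any node" simply means running the DFS client there:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadLogs {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();  // picks up hadoop-site.xml (fs.default.name)
    FileSystem fs = FileSystem.get(conf);
    Path src = new Path(args[0]);              // local log directory on this node
    Path dst = new Path(args[1]);              // per-node HDFS destination, e.g. /logs/node7
    // Blocks are written and replicated by the DFS client; the namenode only
    // records the metadata, so several nodes can do this in parallel as long
    // as their destination paths don't overlap.
    fs.copyFromLocalFile(src, dst);
  }
}

The same thing can be done from the shell with bin/hadoop dfs -put on each
node.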


Re: Loading data into HDFS

Posted by "Venkates .P.B." <ve...@gmail.com>.
Am I missing something very fundamental? Can someone comment on these
queries?

Thanks,
Venkates P B
