Posted to common-user@hadoop.apache.org by Garri Santos <ga...@phpugph.com> on 2008/04/03 08:39:34 UTC

Is it possible in Hadoop to overwrite or update a file?

Hi!

I'm starting to take a look at Hadoop and the whole HDFS idea. I'm wondering
whether it's possible to update or overwrite a file copied to Hadoop.


Thanks,
Garri

Re: Is it possible in Hadoop to overwrite or update a file?

Posted by Ted Dunning <td...@veoh.com>.
My suggestion actually is similar to what bigtable and hbase do.

That is to keep some recent updates in memory, burping them to disk at
relatively frequent intervals.  Then when a number of burps are available,
they can be merged to a larger burp.  This pyramid can be extended as
needed.

Searches then would have to probe the files at each level to find the latest
version of the record.

File append or a highly reliable in-memory storage for the current partial
burp (something like zookeeper or replicated memcache?) is required for this
approach to not lose recent data on machine failure, but it has the
potential for high write rates while maintaining reasonable read rates.  Of
course, truly random reads will kill you, but reads biased toward recently
updated records should have awesome performance.
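
To make this concrete, here is a minimal sketch (in Java, with invented
class and method names) of the layered scheme described above: recent
updates sit in a sorted in-memory buffer, full buffers are frozen into
immutable runs, and reads probe the newest data first. A real
implementation would write each frozen run to a sorted file and merge
small runs into larger ones in the background.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.TreeMap;

  public class LayeredStore {
    private TreeMap<String, String> memBuffer = new TreeMap<>();
    private final List<TreeMap<String, String>> frozenRuns = new ArrayList<>();
    private final int flushThreshold;

    public LayeredStore(int flushThreshold) {
      this.flushThreshold = flushThreshold;
    }

    public void put(String key, String value) {
      memBuffer.put(key, value);
      if (memBuffer.size() >= flushThreshold) {
        flush();                          // "burp" the buffer to a run
      }
    }

    private void flush() {
      frozenRuns.add(0, memBuffer);       // newest run first
      memBuffer = new TreeMap<>();
    }

    public String get(String key) {
      String v = memBuffer.get(key);      // most recent data wins
      if (v != null) {
        return v;
      }
      for (TreeMap<String, String> run : frozenRuns) {
        v = run.get(key);
        if (v != null) {
          return v;
        }
      }
      return null;                        // record not present at any level
    }
  }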


On 4/4/08 5:04 AM, "Andrzej Bialecki" <ab...@getopt.org> wrote:

> Ted Dunning wrote:
> 
>> This factor of roughly 150 in speed seems pretty significant and is the motivation
>> for not supporting random read/write.
>> 
>> This doesn't mean that random access update should never be done, but it
>> does mean that scaling a design based around random access will be more
>> difficult than scaling a design based on sequential read and write.
>> 
>> On 4/3/08 12:07 PM, "Andrzej Bialecki" <ab...@getopt.org> wrote:
>> 
>>> In general, if updates are relatively frequent and small compared to the
>>> size of data then this could be useful.
>> 
>> 
> 
> Hehe ... yes, good calculations :) What I had in mind, though, when saying
> "relatively frequent" was rather a situation where updates are usually
> small, arrive at unpredictable intervals (e.g. picked up from a queue
> listener), and then need to set flags on a few records. Running a
> sequential update in the face of such minor changes doesn't usually pay
> off, and queueing the changes until it does pay off is sometimes not
> possible (it takes too long to fill the batch).
> 


Re: Is it possible in Hadoop to overwrite or update a file?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Ted Dunning wrote:

> This factor of roughly 150 in speed seems pretty significant and is the motivation
> for not supporting random read/write.
> 
> This doesn't mean that random access update should never be done, but it
> does mean that scaling a design based around random access will be more
> difficult than scaling a design based on sequential read and write.
> 
> On 4/3/08 12:07 PM, "Andrzej Bialecki" <ab...@getopt.org> wrote:
> 
>> In general, if updates are relatively frequent and small compared to the
>> size of data then this could be useful.
> 
> 

Hehe ... yes, good calculations :) What I had in mind, though, when saying
"relatively frequent" was rather a situation where updates are usually
small, arrive at unpredictable intervals (e.g. picked up from a queue
listener), and then need to set flags on a few records. Running a
sequential update in the face of such minor changes doesn't usually pay
off, and queueing the changes until it does pay off is sometimes not
possible (it takes too long to fill the batch).


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Is it possible in Hadoop to overwrite or update a file?

Posted by Ted Dunning <td...@veoh.com>.
Interesting you should say this.

I have been using this exact example (slightly modified) as an interview
question lately.  I have to admit I stole it from Doug's Hadoop slides.

If you have a 1 TB database of 100-byte records and you want to update 1% of
them, how long will it take?

Assume for argument's sake that you have a disk with 100 MB/s maximum
bandwidth, a 10 ms seek time, and 6,000 rpm (so one rotation takes about
10 ms).

The answer is surprising.  If you seek for each update, then you need to do

  n = 1% x 10^12 B / 100 B = 10^8 updates

Each update will require (roughly) a 10 ms seek, a 10 ms read (one rotation),
and a 10 ms write, for a total of 30 ms.

The total time for this strategy is n x 30 ms = 3 x 10^6 seconds = 35 days.

On the other hand, if you read the entire database sequentially and then
rewrite it sequentially, it will take 2 x 10^12 B / (10^8 B/s) = 2 x 10^4 s
= 5.6 hours.
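
(Spelling the same back-of-the-envelope arithmetic out as a tiny Java
program, using the disk parameters assumed above, for anyone who wants to
plug in their own numbers:)

  public class UpdateCost {
    public static void main(String[] args) {
      double dbBytes = 1e12;                           // 1 TB database
      double recordBytes = 100;                        // 100-byte records
      double updates = 0.01 * dbBytes / recordBytes;   // 10^8 updates

      // 10 ms seek + 10 ms rotation (read) + 10 ms write per update
      double randomSecs = updates * 0.030;
      // read the whole database and rewrite it sequentially at 100 MB/s
      double sequentialSecs = 2 * dbBytes / 100e6;

      System.out.printf("random updates:     %.1f days%n", randomSecs / 86400);
      System.out.printf("sequential rewrite: %.1f hours%n", sequentialSecs / 3600);
      System.out.printf("ratio:              %.0f%n", randomSecs / sequentialSecs);
    }
  }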

This factor of roughly 150 in speed seems pretty significant and is the motivation
for not supporting random read/write.

This doesn't mean that random access update should never be done, but it
does mean that scaling a design based around random access will be more
difficult than scaling a design based on sequential read and write.

On 4/3/08 12:07 PM, "Andrzej Bialecki" <ab...@getopt.org> wrote:

> In general, if updates are relatively frequent and small compared to the
> size of data then this could be useful.


Re: Is it possible in Hadoop to overwrite or update a file?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Ted Dunning wrote:
> I sympathize fully with Owen's thoughts here, but Andrzej's point that
> (essentially) users ought to be able to do it if they really, really want to
> is a good one.

One particular scenario where the ability to update blocks would be
beneficial is flipping flags in records - there is no change in file size.
Admittedly, for record-compressed files this could result in a block size
change (a different bit pattern compresses to a different size), but in
that case we could re-compress the record and resize the block as needed.

In general, if updates are relatively frequent and small compared to the 
size of data then this could be useful.
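
(As an illustration of why the flag-flipping case is so cheap when in-place
writes are available: on a local filesystem it is a single seek and a
one-byte write, as in the hypothetical sketch below - the record size and
flag offset are made up for the example. HDFS currently offers no
equivalent operation.)

  import java.io.IOException;
  import java.io.RandomAccessFile;

  public class FlagFlipper {
    static final int RECORD_SIZE = 100;   // assumed fixed record size
    static final int FLAG_OFFSET = 0;     // assumed position of the flag byte

    // Flip the flag of one record in place; the file size never changes.
    public static void setFlag(String path, long recordIndex, byte flag)
        throws IOException {
      RandomAccessFile file = new RandomAccessFile(path, "rw");
      try {
        file.seek(recordIndex * RECORD_SIZE + FLAG_OFFSET);
        file.writeByte(flag);             // overwrite a single byte
      } finally {
        file.close();
      }
    }
  }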


> 
> It IS true that the original point of Hadoop is high-performance sequential
> writing applications.  It does that, more or less, pretty well.  But does
> supporting high-performance computing imply that the low-performance
> access paradigms should be completely unsupported?

Also, my point was that FileSystem is a very useful abstraction, because
you can move your application transparently from using local data to
using distributed data. At the moment, though, it is also very limiting:
even though some of the FS implementations support much richer
functionality, that functionality is not available through the FileSystem
abstraction.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Is it possible in Hadoop to overwrite or update a file?

Posted by Ted Dunning <td...@veoh.com>.
I sympathize fully with Owen's thoughts here, but Andrzej's point that
(essentially) users ought to be able to do it if they really, really want to
is a good one.

It IS true that the original point of Hadoop is high-performance sequential
writing applications.  It does that, more or less, pretty well.  But does
supporting high-performance computing imply that the low-performance
access paradigms should be completely unsupported?


On 4/3/08 7:59 AM, "Owen O'Malley" <oo...@yahoo-inc.com> wrote:

> 
> On Apr 3, 2008, at 3:53 AM, Andrzej Bialecki wrote:
> 
>> Hmm ... Exactly why are random writes not possible? For performance
>> reasons? Or because of the problem of synchronizing replicas?
> 
> The HDFS protocols needed to support random writes would be much more
> complicated. Furthermore, part of the performance of map/reduce comes
> from doing large streaming reads and writes. An application that seeks
> all over the file to make updates will perform very badly. Additionally,
> if the file is compressed, there is no possibility of updating it anyway.
> 
> -- Owen


Re: Is it possible in Hadoop to overwrite or update a file?

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Apr 3, 2008, at 3:53 AM, Andrzej Bialecki wrote:

> Hmm ... Exactly why are random writes not possible? For performance
> reasons? Or because of the problem of synchronizing replicas?

The HDFS protocols needed to support random writes would be much more
complicated. Furthermore, part of the performance of map/reduce comes
from doing large streaming reads and writes. An application that seeks
all over the file to make updates will perform very badly. Additionally,
if the file is compressed, there is no possibility of updating it anyway.

-- Owen

Re: Is it possible in Hadoop to overwrite or update a file?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Owen O'Malley wrote:
> 
> On Apr 2, 2008, at 11:39 PM, Garri Santos wrote:
> 
>> Hi!
>>
>> I'm starting to take a look at Hadoop and the whole HDFS idea. I'm
>> wondering whether it's possible to update or overwrite a file copied to Hadoop.
> 
> No. Although we are making progress on HADOOP-1700, which would allow
> appending to files.

Hmm ... Exactly why are random writes not possible? For performance
reasons? Or because of the problem of synchronizing replicas?

I imagine that we could allow the client to "update" an arbitrary byte 
range within a FS block and let the namenode know that this new block 
content should replace older copies of the same block on all datanodes.

If we could do that, and we also had the append functionality, then this
should be equivalent to full random write access.

And another thought: the FileSystem abstraction currently aims for the
lowest common denominator of all supported filesystems, even though some
operations are already unsupported on some of them. Shouldn't we raise the
bar for the common abstraction and bring it to at least a more or less
POSIX-y level, with random writes and appends? Concrete filesystems could
always throw UnsupportedOperationException, but much-needed additional
functionality would become available on those filesystems where we can
easily support it even now. What do you think?
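
(A rough sketch of what such a richer abstraction might look like - the
interface and method names below are invented for illustration and are not
part of Hadoop. A backend like HDFS could simply throw
UnsupportedOperationException from writeAt() until real support exists,
while e.g. the local filesystem could implement it directly.)

  import java.io.IOException;

  public interface RandomWritableFileSystem {

    /** Overwrite 'length' bytes starting at 'offset' in an existing file. */
    void writeAt(String path, long offset, byte[] data, int off, int length)
        throws IOException;

    /** Append bytes to the end of an existing file. */
    void append(String path, byte[] data) throws IOException;
  }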

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Is it possible in Hadoop to overwrite or update a file?

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Apr 2, 2008, at 11:39 PM, Garri Santos wrote:

> Hi!
>
> I'm starting to take a look at Hadoop and the whole HDFS idea. I'm
> wondering whether it's possible to update or overwrite a file copied to Hadoop.

No. Although we are making progress on HADOOP-1700, which would allow
appending to files.

-- Owen

Re: Is it possible in Hadoop to overwrite or update a file?

Posted by Ted Dunning <td...@veoh.com>.
You can overwrite it, but you can't update it.  Soon you will be able to
append to it, but you won't be able to do any other updates.
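
(For the overwrite case, a minimal sketch using the standard FileSystem
API: passing overwrite=true to create() replaces the file's contents
wholesale - there is no way to modify bytes in the middle of an existing
HDFS file. The path below is made up for the example.)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class OverwriteExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      Path file = new Path("/user/garri/data.txt");   // hypothetical path
      FSDataOutputStream out = fs.create(file, true); // true = overwrite
      out.write("new contents\n".getBytes("UTF-8"));
      out.close();
    }
  }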


On 4/2/08 11:39 PM, "Garri Santos" <ga...@phpugph.com> wrote:

> Hi!
> 
> I'm starting to take a look at Hadoop and the whole HDFS idea. I'm wondering
> whether it's possible to update or overwrite a file copied to Hadoop.
> 
> 
> Thanks,
> Garri