Posted to common-user@hadoop.apache.org by Thomas FRIOL <th...@anyware-tech.com> on 2006/07/13 17:17:19 UTC

What about append in Hadoop files?

Hi,

I would like to know why it is currently not possible to append data to
an existing file (Path), or why the FSDataOutputStream must be closed
before the file is written to the DFS.

In fact, my problem is that I have a servlet which regularly writes data
into a file in the DFS. Today, if my JVM crashes, I lose all my data,
because my output stream is closed only when the JVM shuts down.

So, does anyone have a solution to this problem?

Thanks for any help.

Thomas.

-- 
Thomas FRIOL
Développeur Eclipse / Eclipse Developer
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel.     : +33 (0)561 000 653
Mobile   : +33 (0)609 704 810
Fax      : +33 (0)561 005 146
www.anyware-tech.com


Re: What about append in Hadoop files?

Posted by Thomas FRIOL <th...@anyware-tech.com>.
Doug Cutting wrote:
> Thomas FRIOL wrote:
>> I would like to know why it is currently not possible to append data
>> to an existing file (Path), or why the FSDataOutputStream must be
>> closed before the file is written to the DFS.
>
> Those are the current semantics of the filesystem: a file is not 
> readable until it is closed, and files are write-once.  This 
> considerably simplifies the implementation and supports the primary 
> intended uses for DFS.  The simpler we keep DFS the easier it is to 
> make it reliable and scalable.  At this point we are prioritizing 
> reliability and scalability over new features.  Over time, when 
> reliability and scalability are sufficiently demonstrated, these 
> restrictions may be removed.
>
>> In fact, my problem is that I have a servlet which regularly writes
>> data into a file in the DFS. Today, if my JVM crashes, I lose all my
>> data, because my output stream is closed only when the JVM shuts down.
>
> You could periodically close the file and start writing a new file.
That's exactly what I am doing now, but there is still some data that we 
cannot guarantee won't be lost in a JVM crash.
>
> DFS is currently primarily used to support large, offline, batch 
> computations.  For example, a log of critical data with tight 
> transactional requirements is probably an inappropriate use of DFS at 
> this time.  Again, this may change, but that's where we are now.
>
OK, thanks for your help.
> Doug

Re: What about append in Hadoop files?

Posted by Paul Sutter <su...@gmail.com>.
When I first started using Hadoop, I was shocked and disturbed that
the append functionality didn't exist.

But as it turns out, we've had no problem at all working around it. I
have grown to really like the simple atomicity of the current
feature set.
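
One workaround that keeps that atomicity is to write the whole file to a
temporary path, close it, and rename it into place, so readers only ever
see complete files. A rough sketch against the FileSystem API (the .tmp
naming is just an illustration, adapt as needed):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write the whole file to a temporary path, close it, then rename it
// into place.  Readers only ever see complete, closed files.
public class AtomicCommit {
    public static void commit(FileSystem fs, Path dst, byte[] payload)
            throws IOException {
        Path tmp = new Path(dst.getParent(), dst.getName() + ".tmp");
        FSDataOutputStream out = fs.create(tmp, true); // overwrite stale temp
        try {
            out.write(payload);
        } finally {
            out.close(); // the temp file only becomes readable after this
        }
        if (!fs.rename(tmp, dst)) {
            throw new IOException("rename failed: " + tmp + " -> " + dst);
        }
    }
}

Since close is what makes a file visible, the rename is effectively the
commit point.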

On 7/14/06, Konstantin Shvachko <sh...@yahoo-inc.com> wrote: <chopped>

Re: What about append in Hadoop files?

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Eric,

I remember Doug advised somebody on a related issue to use a directory 
instead of a file for long-lasting appends.
You can logically divide your output into smaller files and close them 
whenever a logical boundary is reached.
The directory can then be treated as a collection of records. Maybe this 
will work for you.
IMO the concurrent append feature is a high-priority task.
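
A rough sketch of the reading side, treating the directory as one logical
stream of records (method names follow the current FileSystem API and are
worth double-checking against your Hadoop version; newline-delimited
records are just an assumption here):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Treat a directory of closed part files as one logical collection of
// records: read every file under dir, in name order, line by line.
public class DirectoryRecords {
    public static void readAll(Path dir) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] parts = fs.listStatus(dir);
        Arrays.sort(parts); // FileStatus sorts by path, i.e. part name
        for (FileStatus part : parts) {
            BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(part.getPath())));
            try {
                String record;
                while ((record = in.readLine()) != null) {
                    System.out.println(record); // process one record
                }
            } finally {
                in.close();
            }
        }
    }
}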

--Konstantin

Doug Cutting wrote: <chopped>


Re: What about append in Hadoop files?

Posted by Doug Cutting <cu...@apache.org>.
drwho wrote:
> If so, is GFS also suitable only for large, offline, batch computations?
> I wonder how Google is going to use GFS for Writely or their online
> spreadsheet or their BigTable (their gigantic relational DB).

Did I say anything about GFS?  I don't think so.  Also, I said, 
"currently" and "primarily", not "forever" and "exclusively".  I would 
love for DFS to be more suitable for online, incremental stuff, but 
we're a ways from that right now.  As I said, we're pursuing 
reliability, scalability and performance before features like append. 
If you'd like to try to implement append without disrupting work on 
reliability, scalability, and performance, we'd welcome your 
contributions.  The project direction is determined by contributors.

Note that BigTable is a complex layer on top of GFS that caches and 
batches I/O.  So, while GFS does implement some features that DFS still 
does not (like appends), GFS is probably not used directly by, e.g., 
Writely.  Finally, BigTable is not relational.

Doug

> Doug Cutting <cu...@apache.org> wrote: <chopped>


Re: What about append in Hadoop files?

Posted by drwho <dr...@yahoo.com>.
If so, is GFS also suitable only for large, offline, batch computations?
I wonder how Google is going to use GFS for Writely or their online
spreadsheet or their BigTable (their gigantic relational DB).


Doug Cutting <cu...@apache.org> wrote: <chopped>

> DFS is currently primarily used to support large, offline, batch
> computations.  For example, a log of critical data with tight
> transactional requirements is probably an inappropriate use of DFS at
> this time.  Again, this may change, but that's where we are now.
>
> Doug

Thanks much.

-eric


Re: What about append in Hadoop files?

Posted by Doug Cutting <cu...@apache.org>.
Thomas FRIOL wrote:
> I would like to know why it is currently not possible to append data to
> an existing file (Path), or why the FSDataOutputStream must be closed
> before the file is written to the DFS.

Those are the current semantics of the filesystem: a file is not readable 
until it is closed, and files are write-once.  This considerably 
simplifies the implementation and supports the primary intended uses for 
DFS.  The simpler we keep DFS the easier it is to make it reliable and 
scalable.  At this point we are prioritizing reliability and scalability 
over new features.  Over time, when reliability and scalability are 
sufficiently demonstrated, these restrictions may be removed.
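
Concretely, the only lifecycle the API supports is create, write, close,
then read. A minimal sketch (the path and record contents are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// The only supported lifecycle: create, write, close, then read.
public class WriteOnceDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/demo/events-00001"); // hypothetical path

        FSDataOutputStream out = fs.create(p);   // the single write pass
        out.writeBytes("some records\n");
        out.close();        // only now does the file become readable

        FSDataInputStream in = fs.open(p);       // reading is fine
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();
        // There is no way to reopen p for writing; it is immutable now.
    }
}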

> In fact, my problem is that I have a servlet which regularly writes data
> into a file in the DFS. Today, if my JVM crashes, I lose all my data,
> because my output stream is closed only when the JVM shuts down.

You could periodically close the file and start writing a new file.
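
For a servlet that writes continuously, that could look something like
the following rough sketch (the time-based rolling policy and the log-
naming are only illustrative):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Closes the current file and starts a new one every rollMillis, so a
// crash can only lose what was written since the last close.
public class RollingWriter {
    private final FileSystem fs;
    private final Path dir;
    private final long rollMillis;
    private FSDataOutputStream out;
    private long openedAt;
    private int seq = 0;

    public RollingWriter(Path dir, long rollMillis) throws IOException {
        this.fs = FileSystem.get(new Configuration());
        this.dir = dir;
        this.rollMillis = rollMillis;
        fs.mkdirs(dir);
    }

    public synchronized void write(String line) throws IOException {
        long now = System.currentTimeMillis();
        if (out == null || now - openedAt >= rollMillis) {
            if (out != null) {
                out.close(); // everything written so far is now durable
            }
            out = fs.create(new Path(dir, "log-" + seq++));
            openedAt = now;
        }
        out.writeBytes(line + "\n");
    }
}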

DFS is currently primarily used to support large, offline, batch 
computations.  For example, a log of critical data with tight 
transactional requirements is probably an inappropriate use of DFS at 
this time.  Again, this may change, but that's where we are now.

Doug