Posted to common-user@hadoop.apache.org by Thomas FRIOL <th...@anyware-tech.com> on 2006/07/13 17:17:19 UTC
What about append in hadoop files ?
Hi,
I would like to know today why it is not possible to append data into
an existing file (Path) or why the FSDataOutputStream must be closed
before the file is written to the DFS.
In fact, my problem is that I have a servlet which is regularly writing
data into a file in the DFS. Today, if my JVM crashes, I lose all my
data because my output stream is only closed when the JVM shuts down.
So, does anyone have a solution to this problem?
Thanks for any help.
Thomas.
--
Thomas FRIOL
Développeur Eclipse / Eclipse Developer
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tél : +33 (0)561 000 653
Portable : +33 (0)609 704 810
Fax : +33 (0)561 005 146
www.anyware-tech.com
Re: What about append in hadoop files ?
Posted by Thomas FRIOL <th...@anyware-tech.com>.
Doug Cutting a écrit :
> Thomas FRIOL wrote:
>> I would like to know today why it is not possible to append data
>> into an existing file (Path) or why the FSDataOutputStream must be
>> closed before the file is written to the DFS.
>
> Those are the current semantics of the filesystem: a file is not
> readable until it is closed, and files are write-once. This
> considerably simplifies the implementation and supports the primary
> intended uses for DFS. The simpler we keep DFS the easier it is to
> make it reliable and scalable. At this point we are prioritizing
> reliability and scalability over new features. Over time, when
> reliability and scalability are sufficiently demonstrated, these
> restrictions may be removed.
>
>> In fact, my problem is that I have a servlet which is regularly
>> writing data into a file in the DFS. Today, if my JVM crashes, I
>> lose all my data because my output stream is only closed when the
>> JVM shuts down.
>
> You could periodically close the file and start writing a new file.
That's exactly what I am doing now, but there is still some data that
we cannot guarantee will not be lost if the JVM crashes.
>
> DFS is currently primarily used to support large, offline, batch
> computations. For example, a log of critical data with tight
> transactional requirements is probably an inappropriate use of DFS at
> this time. Again, this may change, but that's where we are now.
>
Ok thanks for your help.
> Doug
Re: What about append in hadoop files ?
Posted by Paul Sutter <su...@gmail.com>.
When I first started using Hadoop, I was shocked and disturbed that
the append functionality didn't exist.
But as it turns out, we've had no problem at all working around it. I
have grown to really like the simple atomicity of the current
feature set.
On 7/14/06, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:
> Eric,
>
> I remember Doug advised somebody on a related issue to use a directory
> instead of a file for long-lasting appends.
> You can logically divide your output into smaller files and close them
> whenever a logical boundary is reached.
> The directory can be treated as a collection of records. Maybe this
> will work for you.
> IMO the concurrent append feature is a high priority task.
>
> --Konstantin
>
> Doug Cutting wrote:
>
> > drwho wrote:
> >
> >> If so, is GFS also suitable only for large, offline, batch
> >> computations?
> >> I wonder how Google is going to use GFS for Writely, their online
> >> spreadsheet, or their BigTable (their gigantic relational DB).
> >
> >
> > Did I say anything about GFS? I don't think so. Also, I said,
> > "currently" and "primarily", not "forever" and "exclusively". I would
> > love for DFS to be more suitable for online, incremental stuff, but
> > we're a ways from that right now. As I said, we're pursuing
> > reliability, scalability and performance before features like append.
> > If you'd like to try to implement append w/o disrupting work on
> > reliability, scalability, and performance, we'd welcome your
> > contributions. The project direction is determined by contributors.
> >
> > Note that BigTable is a complex layer on top of GFS that caches and
> > batches I/O. So, while GFS does implement some features that DFS
> > still does not (like appends), GFS is probably not used directly by,
> > e.g., writely. Finally, BigTable is not relational.
> >
> > Doug
> >
> >> Doug Cutting <cu...@apache.org> wrote: <chopped>
> >>
> >> DFS is currently primarily used to support large, offline, batch
> >> computations. For example, a log of critical data with tight
> >> transactional requirements is probably an inappropriate use of DFS at
> >> this time. Again, this may change, but that's where we are now.
> >>
> >> Doug
> >>
> >>
> >>
> >>
> >> Thanks much.
> >>
> >> -eric
> >>
> >
> >
> >
>
>
Re: What about append in hadoop files ?
Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Eric,
I remember Doug advised somebody on a related issue to use a directory
instead of a file for long-lasting appends.
You can logically divide your output into smaller files and close them
whenever a logical boundary is reached.
The directory can be treated as a collection of records. Maybe this
will work for you.
IMO the concurrent append feature is a high priority task.
--Konstantin
Doug Cutting wrote:
> drwho wrote:
>
>> If so, is GFS also suitable only for large, offline, batch
>> computations?
>> I wonder how Google is going to use GFS for Writely, their online
>> spreadsheet, or their BigTable (their gigantic relational DB).
>
>
> Did I say anything about GFS? I don't think so. Also, I said,
> "currently" and "primarily", not "forever" and "exclusively". I would
> love for DFS to be more suitable for online, incremental stuff, but
> we're a ways from that right now. As I said, we're pursuing
> reliability, scalability and performance before features like append.
> If you'd like to try to implement append w/o disrupting work on
> reliability, scalability, and performance, we'd welcome your
> contributions. The project direction is determined by contributors.
>
> Note that BigTable is a complex layer on top of GFS that caches and
> batches I/O. So, while GFS does implement some features that DFS
> still does not (like appends), GFS is probably not used directly by,
> e.g., writely. Finally, BigTable is not relational.
>
> Doug
>
>> Doug Cutting <cu...@apache.org> wrote: <chopped>
>>
>> DFS is currently primarily used to support large, offline, batch
>> computations. For example, a log of critical data with tight
>> transactional requirements is probably an inappropriate use of DFS at
>> this time. Again, this may change, but that's where we are now.
>>
>> Doug
>>
>>
>>
>>
>> Thanks much.
>>
>> -eric
>>
>
>
>
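Konstantin's directory-as-a-collection-of-records idea is simple on the read side: list the closed part files and stream their records in name order. Below is a minimal sketch in Python, using the local filesystem as a stand-in for DFS (a real reader would list and open the parts through Hadoop's FileSystem API; the function name here is hypothetical):

```python
import os

def read_records(directory):
    """Treat a directory of closed part files as one logical record
    stream, reading parts in name order.  (Local-filesystem sketch;
    on DFS the listing and open calls would go through the
    FileSystem API instead of os/open.)"""
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as part:
            for line in part:
                yield line.rstrip("\n")
```

Because a part file is only readable once it has been closed, a reader that sticks to already-closed parts always sees a consistent prefix of the overall record stream.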
Re: What about append in hadoop files ?
Posted by Doug Cutting <cu...@apache.org>.
drwho wrote:
> If so, is GFS also suitable only for large, offline, batch computations?
> I wonder how Google is going to use GFS for Writely, their online
> spreadsheet, or their BigTable (their gigantic relational DB).
Did I say anything about GFS? I don't think so. Also, I said,
"currently" and "primarily", not "forever" and "exclusively". I would
love for DFS to be more suitable for online, incremental stuff, but
we're a ways from that right now. As I said, we're pursuing
reliability, scalability and performance before features like append.
If you'd like to try to implement append w/o disrupting work on
reliability, scalability, and performance, we'd welcome your
contributions. The project direction is determined by contributors.
Note that BigTable is a complex layer on top of GFS that caches and
batches I/O. So, while GFS does implement some features that DFS still
does not (like appends), GFS is probably not used directly by, e.g.,
writely. Finally, BigTable is not relational.
Doug
> Doug Cutting <cu...@apache.org> wrote: <chopped>
>
> DFS is currently primarily used to support large, offline, batch
> computations. For example, a log of critical data with tight
> transactional requirements is probably an inappropriate use of DFS at
> this time. Again, this may change, but that's where we are now.
>
> Doug
>
>
>
>
> Thanks much.
>
> -eric
>
>
Re: What about append in hadoop files ?
Posted by drwho <dr...@yahoo.com>.
If so, is GFS also suitable only for large, offline, batch computations?
I wonder how Google is going to use GFS for Writely, their online
spreadsheet, or their BigTable (their gigantic relational DB).
Doug Cutting <cu...@apache.org> wrote: <chopped>
DFS is currently primarily used to support large, offline, batch
computations. For example, a log of critical data with tight
transactional requirements is probably an inappropriate use of DFS at
this time. Again, this may change, but that's where we are now.
Doug
Thanks much.
-eric
Re: What about append in hadoop files ?
Posted by Doug Cutting <cu...@apache.org>.
Thomas FRIOL wrote:
> I would like to know today why it is not possible to append data into
> an existing file (Path) or why the FSDataOutputStream must be closed
> before the file is written to the DFS.
Those are the current semantics of the filesystem: a file is not readable
until it is closed, and files are write-once. This considerably
simplifies the implementation and supports the primary intended uses for
DFS. The simpler we keep DFS the easier it is to make it reliable and
scalable. At this point we are prioritizing reliability and scalability
over new features. Over time, when reliability and scalability are
sufficiently demonstrated, these restrictions may be removed.
> In fact, my problem is that I have a servlet which is regularly writing
> data into a file in the DFS. Today, if my JVM crashes, I lose all my
> data because my output stream is only closed when the JVM shuts down.
You could periodically close the file and start writing a new file.
DFS is currently primarily used to support large, offline, batch
computations. For example, a log of critical data with tight
transactional requirements is probably an inappropriate use of DFS at
this time. Again, this may change, but that's where we are now.
Doug
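Doug's workaround, periodically closing the file and starting a new one, bounds the loss from a crash to whatever was written to the not-yet-closed part. A minimal sketch in Python against the local filesystem (a real version would open each part with Hadoop's FileSystem.create() and close the returned FSDataOutputStream at the same boundaries; the class and parameter names here are hypothetical):

```python
import os

class RollingWriter:
    """Append records to a sequence of files, closing each part after
    `records_per_file` records so a crash loses at most the one part
    that is still open.  (Local-filesystem illustration of the
    close-and-roll pattern described in the thread.)"""

    def __init__(self, directory, records_per_file=1000):
        self.directory = directory
        self.records_per_file = records_per_file
        self.part = 0
        self.count = 0
        self.out = None
        os.makedirs(directory, exist_ok=True)

    def _roll(self):
        # Close the finished part; once closed, its contents are readable.
        if self.out is not None:
            self.out.close()
        path = os.path.join(self.directory, "part-%05d" % self.part)
        self.out = open(path, "w")
        self.part += 1
        self.count = 0

    def write(self, record):
        if self.out is None or self.count >= self.records_per_file:
            self._roll()
        self.out.write(record + "\n")
        self.count += 1

    def close(self):
        if self.out is not None:
            self.out.close()
            self.out = None
```

Rolling by record count is the simplest policy; rolling on elapsed time or bytes written works the same way and lets you keep each part near a convenient size.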