Posted to common-user@hadoop.apache.org by Koert Kuipers <ko...@tresata.com> on 2012/08/04 19:54:00 UTC

fs cache giving me headaches

Nothing has confused me as much in Hadoop as FileSystem.close().
Any decent Java programmer who sees that an object implements Closeable
writes code like this:
final FileSystem fs = FileSystem.get(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

So I started out using Hadoop's FileSystem like this, and I ran into all
sorts of weird errors where FileSystems in unrelated code (sometimes not
even my code) started misbehaving and streams were unexpectedly shut. Then
I realized that FileSystem uses a cache and close() closes the cached
instance for everyone! Not pretty in my opinion, but I can live with it. So
I checked other code and found that basically nobody closes FileSystems.
Apparently the expected way of using FileSystems is to simply never close
them. So I adopted this approach (which I think is really contrary to Java
conventions for a Closeable).
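
The surprise described above can be reproduced with a small stand-in model. The sketch below does not use Hadoop at all: ToyFs and ToyFsCache are hypothetical names that only mimic the behaviour reported in this post (a cache that hands the same instance to every caller, and a close() that really closes it).

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a cached FileSystem; not Hadoop code.
class ToyFs implements Closeable {
    private boolean closed = false;

    // Fails once the instance has been closed.
    String read() throws IOException {
        if (closed) {
            throw new IOException("toy filesystem is closed");
        }
        return "data";
    }

    @Override
    public void close() {
        closed = true;
    }
}

// Hypothetical cache: every caller asking for the same key gets the same object.
class ToyFsCache {
    private final Map<String, ToyFs> cache = new HashMap<>();

    synchronized ToyFs get(String key) {
        return cache.computeIfAbsent(key, k -> new ToyFs());
    }
}

class SharedCloseDemo {
    // Returns true if a second holder of the cached instance breaks after the
    // first holder follows the usual try/finally-close idiom.
    static boolean otherCallerBroken() {
        ToyFsCache cache = new ToyFsCache();
        ToyFs mine = cache.get("hdfs://cluster");
        ToyFs theirs = cache.get("hdfs://cluster"); // same object as 'mine'
        try {
            mine.read();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } finally {
            mine.close(); // closes the shared instance for everyone
        }
        try {
            theirs.read();
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("other caller broken: " + otherCallerBroken());
    }
}
```

In this model the second caller never called close() itself, yet its read() fails, which matches the "streams unexpectedly shut" symptom.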

Lately I started working on some code for a daemon/server where many
FileSystem objects are created for the different users (UGIs) that use the
service. As it turns out, other projects have run into trouble with the
FileSystem cache in situations like this (for example, Scribe and Hoop). I
imagine the cache can get very large and cause problems (I have not tested
this myself).

Looking at the code for Hoop, I noticed they simply turned off the
FileSystem cache and made sure to close every FileSystem. So here the
suggested approach to deal with FileSystems seems to be:
// or FileSystem.get(conf), but with caching turned off in the conf
final FileSystem fs = FileSystem.newInstance(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

This code bypasses the cache, if I understand it correctly, avoiding any
cache size limitations. However, if I adopt this approach I basically cannot
re-use any existing code or libraries that do not close FileSystems, which
splits the codebase in two and is pretty ugly. This code is also not
efficient in situations where only a few FileSystem objects are in use
and a cache would improve performance, so the split works both ways.

In short, there is no single way to code with FileSystem that works in both
situations! Ideally I would have liked fs.close() to do the right thing
depending on the settings: if the cache is turned off it closes the
FileSystem, and if it is turned on it is a no-op. That way I could always use
FileSystem.get(conf) and always close my filesystems, and the code would be
usable irrespective of whether the cache is turned on or off.
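
To make the wished-for behaviour concrete, here is a minimal sketch of it as a stand-in model. It is not how Hadoop behaves and uses no Hadoop classes: ToyResource, ToyHandle and ToyFactory are hypothetical names. The idea is that the factory knows whether the instance came from a cache, and the handle's close() only releases instances the caller actually owns.

```java
import java.io.Closeable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical resource standing in for a FileSystem; not Hadoop code.
class ToyResource implements Closeable {
    private boolean closed = false;

    boolean isClosed() { return closed; }

    @Override
    public void close() { closed = true; }
}

// Handle whose close() releases the resource only when the caller owns it.
class ToyHandle implements Closeable {
    private final ToyResource resource;
    private final boolean owned;

    ToyHandle(ToyResource resource, boolean owned) {
        this.resource = resource;
        this.owned = owned;
    }

    ToyResource resource() { return resource; }

    @Override
    public void close() {
        if (owned) {
            resource.close(); // uncached instance: really close it
        }
        // cached instance: no-op, the cache still owns it
    }
}

// Hypothetical factory with a switch for caching.
class ToyFactory {
    private final boolean cacheEnabled;
    private final Map<String, ToyResource> cache = new HashMap<>();

    ToyFactory(boolean cacheEnabled) { this.cacheEnabled = cacheEnabled; }

    synchronized ToyHandle get(String key) {
        if (cacheEnabled) {
            return new ToyHandle(cache.computeIfAbsent(key, k -> new ToyResource()), false);
        }
        return new ToyHandle(new ToyResource(), true);
    }
}

class CloseSemanticsDemo {
    // Runs identical caller code and reports whether the underlying
    // resource ended up closed.
    static boolean closedAfterUse(ToyFactory factory) {
        ToyHandle handle = factory.get("hdfs://cluster");
        try {
            // do something with handle.resource()
        } finally {
            handle.close();
        }
        return handle.resource().isClosed();
    }

    public static void main(String[] args) {
        System.out.println("cache on:  closed=" + closedAfterUse(new ToyFactory(true)));
        System.out.println("cache off: closed=" + closedAfterUse(new ToyFactory(false)));
    }
}
```

With this shape the caller code is the same in both configurations. The trade-off is that, with the cache on, nothing in this sketch ever closes the cached instances, so cache growth in a long-running server is still unsolved.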

Any insights or suggestions? Thanks!

Re: fs cache giving me headaches

Posted by "Aaron T. Myers" <at...@cloudera.com>.

On Tue, Aug 7, 2012 at 11:44 AM, Koert Kuipers <ko...@tresata.com> wrote:

> I do create new UGIs, and I do not hand them off to threads. However, I
> assumed that FileSystem.get(conf) would fetch from the filesystem cache
> based on the UGI (based on equality, that is, not identity). So my
> assumption was that if different threads create UGIs that are equal, they
> would fetch the same FileSystem from the cache. Is that wrong?


https://issues.apache.org/jira/browse/HADOOP-6670

Yes, UGIs are compared using identity, not value equality, for exactly this
purpose.
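
The difference between the two kinds of cache key can be seen in a small sketch. This is only an illustration with standard Java collections, not Hadoop's cache: ToyUser is a hypothetical class with value equality, and IdentityHashMap stands in for "compared using identity".

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical user token with value equality; a stand-in, not Hadoop's UGI.
final class ToyUser {
    private final String name;

    ToyUser(String name) { this.name = name; }

    @Override
    public boolean equals(Object o) {
        return o instanceof ToyUser && ((ToyUser) o).name.equals(name);
    }

    @Override
    public int hashCode() { return Objects.hash(name); }
}

class IdentityKeyDemo {
    // Number of cache entries created when two equal-but-distinct keys are used.
    static int entriesFor(Map<ToyUser, Object> cache) {
        ToyUser first = new ToyUser("alice");
        ToyUser second = new ToyUser("alice"); // equal to 'first', different object
        cache.computeIfAbsent(first, u -> new Object());
        cache.computeIfAbsent(second, u -> new Object());
        return cache.size();
    }

    public static void main(String[] args) {
        // Value equality: both keys hit the same entry.
        System.out.println("value equality: " + entriesFor(new HashMap<>()));
        // Identity: each key object gets its own entry.
        System.out.println("identity:       " + entriesFor(new IdentityHashMap<>()));
    }
}
```

Under value equality the two tokens share one entry; under identity they get two, so two requests that each create their own token for the same user would not share a cached instance.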

--
Aaron T. Myers
Software Engineer, Cloudera

Re: fs cache giving me headaches

Posted by Koert Kuipers <ko...@tresata.com>.

I am a little confused...
I do create new UGIs, and I do not hand them off to threads. However, I
assumed that FileSystem.get(conf) would fetch from the filesystem cache
based on the UGI (based on equality, that is, not identity). So my
assumption was that if different threads create UGIs that are equal, they
would fetch the same FileSystem from the cache. Is that wrong?

On Tue, Aug 7, 2012 at 11:25 AM, Daryn Sharp <da...@yahoo-inc.com> wrote:

> There is no UGI caching, so each request will receive a unique UGI even
> for the same user.  Thus you can safely call FileSystem.closeAllForUGI(ugi)
> when the request is complete.  If however you spin off threads that
> continue to use the UGI even after the request is completed, then you'll
> have to determine for yourself when it's safe to close the filesystems.
>
> I've been kicking around a few ways to transparently close cached
> filesystems for a ugi when that ugi goes out of scope.  I should probably
> file a jira (if it stops going down) for discussion.
>
> Daryn
>
>
> On Aug 7, 2012, at 10:15 AM, Koert Kuipers wrote:
>
> Daryn,
> The problem with FileSystem.closeAllForUGI(ugi) for me is that a server
> can be multi-threaded, and a user could be making multiple requests at the
> same time, so if I used closeAllForUGI, isn't there a risk of shutting down
> the other requests for the same user?
>
> On Mon, Aug 6, 2012 at 2:52 PM, Daryn Sharp <da...@yahoo-inc.com> wrote:
>
>> Yes, the implementation of fs.close() leaves something to be desired.
>>  There's actually been debate in the past about close being a no-op for a
>> cached fs, but the idea was rejected by the majority of people.
>>
>> In the server case, you can use FileSystem.closeAllForUGI(ugi) at the end
>> of a request to flush all the fs cache entries for the ugi.  You'll get the
>> benefit of the cache during execution of the request, and be able to close
>> the cached fs instances to prevent memory leaks. I hope this helps!
>>
>> Daryn

Re: fs cache giving me headaches

Posted by Daryn Sharp <da...@yahoo-inc.com>.

There is no UGI caching, so each request will receive a unique UGI even for the same user.  Thus you can safely call FileSystem.closeAllForUGI(ugi) when the request is complete.  If however you spin off threads that continue to use the UGI even after the request is completed, then you'll have to determine for yourself when it's safe to close the filesystems.
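
The per-request pattern described here can be sketched with a stand-in model. Nothing below is Hadoop code: RequestUser, CachedFs and PerUserCache are hypothetical names, and closeAllFor is a toy analogue of FileSystem.closeAllForUGI(ugi). The assumption being modelled is the one stated above: each request gets its own token object, and the cache is keyed by that object's identity.

```java
import java.io.Closeable;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical per-request user token; compared by identity only.
final class RequestUser {
    final String name;

    RequestUser(String name) { this.name = name; }
}

// Hypothetical stand-in for a cached filesystem.
class CachedFs implements Closeable {
    private boolean closed = false;

    boolean isClosed() { return closed; }

    @Override
    public void close() { closed = true; }
}

// Toy cache keyed first by token identity, then by filesystem URI.
class PerUserCache {
    private final Map<RequestUser, Map<String, CachedFs>> byUser = new IdentityHashMap<>();

    synchronized CachedFs get(RequestUser user, String uri) {
        return byUser.computeIfAbsent(user, u -> new HashMap<>())
                     .computeIfAbsent(uri, k -> new CachedFs());
    }

    // Closes and evicts every instance cached for this exact token object.
    synchronized void closeAllFor(RequestUser user) {
        Map<String, CachedFs> entries = byUser.remove(user);
        if (entries != null) {
            for (CachedFs fs : entries.values()) {
                fs.close();
            }
        }
    }

    synchronized int userCount() { return byUser.size(); }
}

class RequestDemo {
    // Two overlapping requests from the same user name. Returns true if
    // finishing request A closed its filesystem and left request B's open.
    static boolean otherRequestSurvives() {
        PerUserCache cache = new PerUserCache();
        RequestUser requestA = new RequestUser("alice");
        RequestUser requestB = new RequestUser("alice"); // same name, new token
        CachedFs fsA = cache.get(requestA, "hdfs://cluster");
        CachedFs fsB = cache.get(requestB, "hdfs://cluster");
        try {
            // request A does its work with fsA
        } finally {
            cache.closeAllFor(requestA); // end of request A
        }
        return fsA.isClosed() && !fsB.isClosed();
    }

    public static void main(String[] args) {
        System.out.println("other request survives: " + otherRequestSurvives());
    }
}
```

Within one request, repeated lookups with the same token reuse the cached instance; at the end of the request only that token's entries are closed and evicted, so the cache does not grow with the number of requests served.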

I've been kicking around a few ways to transparently close cached filesystems for a ugi when that ugi goes out of scope.  I should probably file a jira (if it stops going down) for discussion.

Daryn


On Aug 7, 2012, at 10:15 AM, Koert Kuipers wrote:

> Daryn,
> The problem with FileSystem.closeAllForUGI(ugi) for me is that a server can be multi-threaded, and a user could be making multiple requests at the same time, so if I used closeAllForUGI, isn't there a risk of shutting down the other requests for the same user?
>
> On Mon, Aug 6, 2012 at 2:52 PM, Daryn Sharp <da...@yahoo-inc.com> wrote:
>
>> Yes, the implementation of fs.close() leaves something to be desired.  There's actually been debate in the past about close being a no-op for a cached fs, but the idea was rejected by the majority of people.
>>
>> In the server case, you can use FileSystem.closeAllForUGI(ugi) at the end of a request to flush all the fs cache entries for the ugi.  You'll get the benefit of the cache during execution of the request, and be able to close the cached fs instances to prevent memory leaks. I hope this helps!
>>
>> Daryn



Re: fs cache giving me headaches

Posted by Daryn Sharp <da...@yahoo-inc.com>.

There is no UGI caching, so each request will receive a unique UGI even for the same user.  Thus you can safely call FileSystem.closeAllForUGI(ugi) when the request is complete.  If however you spin off threads that continue to use the UGI even after the request is completed, then you'll have to determine for yourself when it's safe to close the filesystems.

I've been kicking around a few ways to transparently close cached filesystems for a ugi when that ugi goes out of scope.  I should probably file a jira (if it stops going down) for discussion.

Daryn


On Aug 7, 2012, at 10:15 AM, Koert Kuipers wrote:

Daryn,
The problem with FileSystem.closeAllForUGI(ugi) for me is that a server can be multi-threaded, and a user could be doing multiple request at the same time, so if i used closeAllForUGI isn't there a risk of shutting down the other requests for the same user?

On Mon, Aug 6, 2012 at 2:52 PM, Daryn Sharp <da...@yahoo-inc.com>> wrote:
Yes, the implementation of fs.close() leaves something to be desired.  There's actually been debate in the past about close being a no-op for a cached fs, but the idea was rejected by the majority of people.

In the server case, you can use FileSystem.closeAllForUGI(ugi) at the end of a request to flush all the fs cache entries for the ugi.  You'll get the benefit of the cache during execution of the request, and be able to close the cached fs instances to prevent memory leaks. I hope this helps!

Daryn


On Aug 6, 2012, at 12:32 PM, Koert Kuipers wrote:

---------- Forwarded message ----------
From: "Koert Kuipers" <ko...@tresata.com>>
Date: Aug 4, 2012 1:54 PM
Subject: fs cache giving me headaches
To: <co...@hadoop.apache.org>>

nothing has confused me as much in hadoop as FileSystem.close().
any decent java programmer that sees that an object implements Closable writes code like this:
Final FileSystem fs = FileSystem.get(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

so i started out using hadoop FileSystem like this, and i ran into all sorts of weird errors where FileSystems in unrelated code (sometimes not even my code) started misbehaving and streams where unexpectedly shut. Then i realized that FileSystem uses a cache and close() closes it for everyone! Not pretty in my opinion, but i can live with it. So i checked other code and found that basically nobody closes FileSystems. Apparently the expected way of using FileSystems is to simple never close them. So i adopted this approach (which i think is really contrary to java conventions for a Closeable).

Lately i started working on some code for a daemon/server where many FileSystems objects are created for different users (UGIs) that use the service. As it turns out other projects have run into trouble with the FileSystem cache in situations like this (for example, Scribe and Hoop). I imagine the cache can get very large and cause problems (i have not tested this myself).

Looking at the code for Hoop i noticed they simply turned off the FileSystem cache and made sure to close every FileSystem. So here the suggested approach to deal with FileSystems seems to be:
final FileSystem fs = FileSystem.newInstance(conf); // or FileSystem.get(conf) but with caching turned off in the conf
try {
    // do something with fs
} finally {
    fs.close();
}

This code bypasses the cache if i understand it correctly, avoiding any cache size limitations. However if i adopt this approach i basically can not re-use any existing code or libraries that do not close FileSystems, splitting the codebase into two which is pretty ugly. And this code is not efficient in situations where there are very few used FileSystem objects and a cache would improve performance, so the split works both ways.

In short, there is no single way to code with FileSystem that works in both situations! Ideally i would have liked fs.close() to do the right thing depending on the settings: if cache is turned off it closes the FileSystem, and if it is turned on it's a NOOP. That way i could always use FileSystem.get(conf) and always close my filesystems, and the code would be usable irrespective of whether the cache is turned on or off.

Any insights or suggestions? Thanks!

Re: fs cache giving me headaches

Posted by Daryn Sharp <da...@yahoo-inc.com>.

There is no UGI caching, so each request will receive a unique UGI even for the same user.  Thus you can safely call FileSystem.closeAllForUGI(ugi) when the request is complete.  If however you spin off threads that continue to use the UGI even after the request is completed, then you'll have to determine for yourself when it's safe to close the filesystems.
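
To make the argument above concrete, here is a toy model in plain Java. It is a sketch under one assumption taken from the reply: the cache distinguishes UGI objects, not user names, so two requests by the same user get separate entries. ToyUgi, ToyFs, ToyCache and closeAllForUgi are hypothetical stand-ins and do not reproduce Hadoop's actual implementation.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class PerRequestUgiDemo {
    /** Hypothetical stand-in for a UGI: no equals/hashCode, so identity is used. */
    static final class ToyUgi {
        final String userName;

        ToyUgi(String userName) {
            this.userName = userName;
        }
    }

    static final class ToyFs {
        private boolean open = true;

        boolean isOpen() {
            return open;
        }

        void close() {
            open = false;
        }
    }

    /** Cache key: the URI plus the UGI object, compared by identity. */
    static final class Key {
        final String uri;
        final ToyUgi ugi;

        Key(String uri, ToyUgi ugi) {
            this.uri = uri;
            this.ugi = ugi;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && uri.equals(((Key) o).uri) && ugi == ((Key) o).ugi;
        }

        @Override
        public int hashCode() {
            return 31 * uri.hashCode() + System.identityHashCode(ugi);
        }
    }

    static final class ToyCache {
        private final Map<Key, ToyFs> cache = new HashMap<>();

        synchronized ToyFs get(String uri, ToyUgi ugi) {
            return cache.computeIfAbsent(new Key(uri, ugi), k -> new ToyFs());
        }

        /** Closes and evicts every entry that belongs to exactly this UGI object. */
        synchronized void closeAllForUgi(ToyUgi ugi) {
            Iterator<Map.Entry<Key, ToyFs>> it = cache.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Key, ToyFs> entry = it.next();
                if (entry.getKey().ugi == ugi) {
                    entry.getValue().close();
                    it.remove();
                }
            }
        }

        synchronized int size() {
            return cache.size();
        }
    }

    /** Two concurrent requests by the same user; the first one finishes and cleans up. */
    static boolean otherRequestSurvives() {
        ToyCache cache = new ToyCache();
        ToyUgi request1 = new ToyUgi("alice");
        ToyUgi request2 = new ToyUgi("alice"); // same user, separate UGI object
        ToyFs fs1 = cache.get("hdfs://namenode/", request1);
        ToyFs fs2 = cache.get("hdfs://namenode/", request2);
        cache.closeAllForUgi(request1); // end of request 1
        return !fs1.isOpen() && fs2.isOpen() && cache.size() == 1;
    }

    public static void main(String[] args) {
        System.out.println(otherRequestSurvives()); // prints "true"
    }
}
```

If a server reused one UGI object across requests, both requests would map to the same key and the cleanup would no longer be safe, which is the caveat about spun-off threads in the reply.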

I've been kicking around a few ways to transparently close cached filesystems for a ugi when that ugi goes out of scope.  I should probably file a jira (if it stops going down) for discussion.

Daryn


On Aug 7, 2012, at 10:15 AM, Koert Kuipers wrote:

Daryn,
The problem with FileSystem.closeAllForUGI(ugi) for me is that a server can be multi-threaded, and a user could be doing multiple requests at the same time, so if i used closeAllForUGI isn't there a risk of shutting down the other requests for the same user?

On Mon, Aug 6, 2012 at 2:52 PM, Daryn Sharp <da...@yahoo-inc.com> wrote:
Yes, the implementation of fs.close() leaves something to be desired.  There's actually been debate in the past about close being a no-op for a cached fs, but the idea was rejected by the majority of people.

In the server case, you can use FileSystem.closeAllForUGI(ugi) at the end of a request to flush all the fs cache entries for the ugi.  You'll get the benefit of the cache during execution of the request, and be able to close the cached fs instances to prevent memory leaks. I hope this helps!

Daryn


On Aug 6, 2012, at 12:32 PM, Koert Kuipers wrote:

---------- Forwarded message ----------
From: "Koert Kuipers" <ko...@tresata.com>
Date: Aug 4, 2012 1:54 PM
Subject: fs cache giving me headaches
To: <co...@hadoop.apache.org>

nothing has confused me as much in hadoop as FileSystem.close().
any decent java programmer that sees that an object implements Closeable writes code like this:
final FileSystem fs = FileSystem.get(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

so i started out using hadoop FileSystem like this, and i ran into all sorts of weird errors where FileSystems in unrelated code (sometimes not even my code) started misbehaving and streams were unexpectedly shut. Then i realized that FileSystem uses a cache and close() closes it for everyone! Not pretty in my opinion, but i can live with it. So i checked other code and found that basically nobody closes FileSystems. Apparently the expected way of using FileSystems is to simply never close them. So i adopted this approach (which i think is really contrary to java conventions for a Closeable).

Lately i started working on some code for a daemon/server where many FileSystems objects are created for different users (UGIs) that use the service. As it turns out other projects have run into trouble with the FileSystem cache in situations like this (for example, Scribe and Hoop). I imagine the cache can get very large and cause problems (i have not tested this myself).

Looking at the code for Hoop i noticed they simply turned off the FileSystem cache and made sure to close every FileSystem. So here the suggested approach to deal with FileSystems seems to be:
final FileSystem fs = FileSystem.newInstance(conf); // or FileSystem.get(conf) but with caching turned off in the conf
try {
    // do something with fs
} finally {
    fs.close();
}

This code bypasses the cache if i understand it correctly, avoiding any cache size limitations. However if i adopt this approach i basically can not re-use any existing code or libraries that do not close FileSystems, splitting the codebase into two which is pretty ugly. And this code is not efficient in situations where there are very few used FileSystem objects and a cache would improve performance, so the split works both ways.

In short, there is no single way to code with FileSystem that works in both situations! Ideally i would have liked fs.close() to do the right thing depending on the settings: if cache is turned off it closes the FileSystem, and if it is turned on it's a NOOP. That way i could always use FileSystem.get(conf) and always close my filesystems, and the code would be usable irrespective of whether the cache is turned on or off.

Any insights or suggestions? Thanks!

Re: fs cache giving me headaches

Posted by Koert Kuipers <ko...@tresata.com>.

Daryn,
The problem with FileSystem.closeAllForUGI(ugi) for me is that a server can
be multi-threaded, and a user could be doing multiple requests at the same
time, so if i used closeAllForUGI isn't there a risk of shutting down the
other requests for the same user?

On Mon, Aug 6, 2012 at 2:52 PM, Daryn Sharp <da...@yahoo-inc.com> wrote:

> Yes, the implementation of fs.close() leaves something to be desired.
>  There's actually been debate in the past about close being a no-op for a
> cached fs, but the idea was rejected by the majority of people.
>
> In the server case, you can use FileSystem.closeAllForUGI(ugi) at the end
> of a request to flush all the fs cache entries for the ugi.  You'll get the
> benefit of the cache during execution of the request, and be able to close
> the cached fs instances to prevent memory leaks. I hope this helps!
>
> Daryn
>
>
> On Aug 6, 2012, at 12:32 PM, Koert Kuipers wrote:
>
> ---------- Forwarded message ----------
> From: "Koert Kuipers" <ko...@tresata.com>
> Date: Aug 4, 2012 1:54 PM
> Subject: fs cache giving me headaches
> To: <co...@hadoop.apache.org>
>
> nothing has confused me as much in hadoop as FileSystem.close().
> any decent java programmer that sees that an object implements Closeable
> writes code like this:
> final FileSystem fs = FileSystem.get(conf);
> try {
>     // do something with fs
> } finally {
>     fs.close();
> }
>
> so i started out using hadoop FileSystem like this, and i ran into all
> sorts of weird errors where FileSystems in unrelated code (sometimes not
> even my code) started misbehaving and streams were unexpectedly shut. Then
> i realized that FileSystem uses a cache and close() closes it for everyone!
> Not pretty in my opinion, but i can live with it. So i checked other code
> and found that basically nobody closes FileSystems. Apparently the expected
> way of using FileSystems is to simply never close them. So i adopted this
> approach (which i think is really contrary to java conventions for a
> Closeable).
>
> Lately i started working on some code for a daemon/server where many
> FileSystems objects are created for different users (UGIs) that use the
> service. As it turns out other projects have run into trouble with the
> FileSystem cache in situations like this (for example, Scribe and Hoop). I
> imagine the cache can get very large and cause problems (i have not tested
> this myself).
>
> Looking at the code for Hoop i noticed they simply turned off the
> FileSystem cache and made sure to close every FileSystem. So here the
> suggested approach to deal with FileSystems seems to be:
> final FileSystem fs = FileSystem.newInstance(conf); // or
> FileSystem.get(conf) but with caching turned off in the conf
> try {
>     // do something with fs
> } finally {
>     fs.close();
> }
>
> This code bypasses the cache if i understand it correctly, avoiding any
> cache size limitations. However if i adopt this approach i basically can
> not re-use any existing code or libraries that do not close FileSystems,
> splitting the codebase into two which is pretty ugly. And this code is not
> efficient in situations where there are very few used FileSystem objects
> and a cache would improve performance, so the split works both ways.
>
> In short, there is no single way to code with FileSystem that works in
> both situations! Ideally i would have liked fs.close() to do the right
> thing depending on the settings: if cache is turned off it closes the
> FileSystem, and if it is turned on it's a NOOP. That way i could always use
> FileSystem.get(conf) and always close my filesystems, and the code would be
> usable irrespective of whether the cache is turned on or off.
>
> Any insights or suggestions? Thanks!
>
>
>

Re: fs cache giving me headaches

Posted by Daryn Sharp <da...@yahoo-inc.com>.

Yes, the implementation of fs.close() leaves something to be desired.  There's actually been debate in the past about close being a no-op for a cached fs, but the idea was rejected by the majority of people.

In the server case, you can use FileSystem.closeAllForUGI(ugi) at the end of a request to flush all the fs cache entries for the ugi.  You'll get the benefit of the cache during execution of the request, and be able to close the cached fs instances to prevent memory leaks. I hope this helps!

Daryn


On Aug 6, 2012, at 12:32 PM, Koert Kuipers wrote:

---------- Forwarded message ----------
From: "Koert Kuipers" <ko...@tresata.com>
Date: Aug 4, 2012 1:54 PM
Subject: fs cache giving me headaches
To: <co...@hadoop.apache.org>

nothing has confused me as much in hadoop as FileSystem.close().
any decent java programmer that sees that an object implements Closeable writes code like this:
final FileSystem fs = FileSystem.get(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

so i started out using hadoop FileSystem like this, and i ran into all sorts of weird errors where FileSystems in unrelated code (sometimes not even my code) started misbehaving and streams were unexpectedly shut. Then i realized that FileSystem uses a cache and close() closes it for everyone! Not pretty in my opinion, but i can live with it. So i checked other code and found that basically nobody closes FileSystems. Apparently the expected way of using FileSystems is to simply never close them. So i adopted this approach (which i think is really contrary to java conventions for a Closeable).

Lately i started working on some code for a daemon/server where many FileSystems objects are created for different users (UGIs) that use the service. As it turns out other projects have run into trouble with the FileSystem cache in situations like this (for example, Scribe and Hoop). I imagine the cache can get very large and cause problems (i have not tested this myself).

Looking at the code for Hoop i noticed they simply turned off the FileSystem cache and made sure to close every FileSystem. So here the suggested approach to deal with FileSystems seems to be:
final FileSystem fs = FileSystem.newInstance(conf); // or FileSystem.get(conf) but with caching turned off in the conf
try {
    // do something with fs
} finally {
    fs.close();
}

This code bypasses the cache if i understand it correctly, avoiding any cache size limitations. However if i adopt this approach i basically can not re-use any existing code or libraries that do not close FileSystems, splitting the codebase into two which is pretty ugly. And this code is not efficient in situations where there are very few used FileSystem objects and a cache would improve performance, so the split works both ways.

In short, there is no single way to code with FileSystem that works in both situations! Ideally i would have liked fs.close() to do the right thing depending on the settings: if cache is turned off it closes the FileSystem, and if it is turned on it's a NOOP. That way i could always use FileSystem.get(conf) and always close my filesystems, and the code would be usable irrespective of whether the cache is turned on or off.
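
The wished-for behaviour can be sketched in plain Java. This is an illustration of the idea only, not how Hadoop behaves (per the reply at the top of this message, a no-op close for cached instances was debated and rejected). ToyFs and FsHandle are hypothetical names, and the "shared" flag stands in for "the cache is turned on".

```java
public class CacheAwareCloseDemo {
    /** Hypothetical stand-in for a filesystem client; only tracks open/closed. */
    static final class ToyFs {
        private boolean open = true;

        boolean isOpen() {
            return open;
        }

        void close() {
            open = false;
        }
    }

    /** Handle whose close() closes the delegate only when it is not shared. */
    static final class FsHandle implements AutoCloseable {
        private final ToyFs delegate;
        private final boolean shared;

        FsHandle(ToyFs delegate, boolean shared) {
            this.delegate = delegate;
            this.shared = shared;
        }

        ToyFs fs() {
            return delegate;
        }

        @Override
        public void close() {
            if (!shared) {
                delegate.close(); // private instance: really close it
            }
            // shared (cached) instance: leave it open for other callers
        }
    }

    /** Runs the same caller code with the cache on or off; reports whether the fs stays open. */
    static boolean openAfterClose(boolean cacheEnabled) {
        ToyFs fs = new ToyFs();
        try (FsHandle handle = new FsHandle(fs, cacheEnabled)) {
            if (!handle.fs().isOpen()) {
                throw new IllegalStateException("unexpectedly closed");
            }
        }
        return fs.isOpen();
    }

    public static void main(String[] args) {
        System.out.println(openAfterClose(true));  // prints "true"
        System.out.println(openAfterClose(false)); // prints "false"
    }
}
```

With a handle like this the caller always writes the same try/close code, and only the handle knows whether closing is safe.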

Any insights or suggestions? Thanks!

Re: fs cache giving me headaches

Posted by Daryn Sharp <da...@yahoo-inc.com>.

Yes, the implementation of fs.close() leaves something to be desired.  There's actually been debate in the past about close being a no-op for a cached fs, but the idea was rejected by the majority of people.

In the server case, you can use FileSystem.closeAllForUGI(ugi) at the end of a request to flush all the fs cache entries for the ugi.  You'll get the benefit of the cache during execution of the request, and be able to close the cached fs instances to prevent memory leaks. I hope this helps!
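
The per-request pattern described above can be sketched with a toy model. This is not Hadoop code: ToyCache, ToyFs, closeAllForUser and handleRequest are hypothetical stand-ins, with closeAllForUser playing the role that FileSystem.closeAllForUGI(ugi) plays in the suggestion, and plain strings standing in for UGIs. It is a sketch of the shape of the pattern, not of the real cache's internals.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a per-user cache that is flushed at the end of each request.
final class PerRequestCleanupDemo {
    static final class ToyFs {
        private boolean open = true;
        boolean isOpen() { return open; }
        void close() { open = false; }
    }

    static final class ToyCache {
        // user -> (uri -> instance)
        private final Map<String, Map<String, ToyFs>> byUser = new HashMap<>();

        ToyFs get(String user, String uri) {
            return byUser.computeIfAbsent(user, u -> new HashMap<>())
                         .computeIfAbsent(uri, k -> new ToyFs());
        }

        // Close and evict every entry that belongs to the given user.
        void closeAllForUser(String user) {
            Map<String, ToyFs> removed = byUser.remove(user);
            if (removed != null) {
                removed.values().forEach(ToyFs::close);
            }
        }

        int size() {
            return byUser.values().stream().mapToInt(Map::size).sum();
        }
    }

    // One request: reuse cached instances while it runs, flush them when it ends.
    static void handleRequest(ToyCache cache, String user) {
        try {
            ToyFs first = cache.get(user, "hdfs://namenode");
            ToyFs second = cache.get(user, "hdfs://namenode"); // cache hit, same instance
            // ... do the request's work with first/second ...
        } finally {
            cache.closeAllForUser(user);
        }
    }

    public static void main(String[] args) {
        ToyCache cache = new ToyCache();
        handleRequest(cache, "alice");
        handleRequest(cache, "bob");
        System.out.println("entries left after requests: " + cache.size()); // 0
    }
}
```

Note that the library code called inside the request never has to close anything itself; the request boundary does the cleanup, so the cache cannot grow with the number of users served.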

Daryn


On Aug 6, 2012, at 12:32 PM, Koert Kuipers wrote:

---------- Forwarded message ----------
From: "Koert Kuipers" <ko...@tresata.com>
Date: Aug 4, 2012 1:54 PM
Subject: fs cache giving me headaches
To: <co...@hadoop.apache.org>

nothing has confused me as much in hadoop as FileSystem.close().
any decent java programmer that sees that an object implements Closeable writes code like this:
final FileSystem fs = FileSystem.get(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

so i started out using hadoop FileSystem like this, and i ran into all sorts of weird errors where FileSystems in unrelated code (sometimes not even my code) started misbehaving and streams were unexpectedly shut. Then i realized that FileSystem uses a cache and close() closes the cached instance for everyone! Not pretty in my opinion, but i can live with it. So i checked other code and found that basically nobody closes FileSystems. Apparently the expected way of using FileSystems is to simply never close them. So i adopted this approach (which i think is really contrary to java conventions for a Closeable).

Lately i started working on some code for a daemon/server where many FileSystem objects are created for different users (UGIs) that use the service. As it turns out other projects have run into trouble with the FileSystem cache in situations like this (for example, Scribe and Hoop). I imagine the cache can get very large and cause problems (i have not tested this myself).

Looking at the code for Hoop i noticed they simply turned off the FileSystem cache and made sure to close every FileSystem. So here the suggested approach to deal with FileSystems seems to be:
final FileSystem fs = FileSystem.newInstance(conf); // or FileSystem.get(conf) but with caching turned off in the conf
try {
    // do something with fs
} finally {
    fs.close();
}

This code bypasses the cache if i understand it correctly, avoiding any cache size limitations. However if i adopt this approach i basically can not re-use any existing code or libraries that do not close FileSystems, splitting the codebase into two which is pretty ugly. And this code is not efficient in situations where there are very few used FileSystem objects and a cache would improve performance, so the split works both ways.

In short, there is no single way to code with FileSystem that works in both situations! Ideally i would have liked fs.close() to do the right thing depending on the settings: if the cache is turned off it closes the FileSystem, and if it is turned on it's a no-op. That way i could always use FileSystem.get(conf) and always close my filesystems, and the code would be usable irrespective of whether the cache is turned on or off.

Any insights or suggestions? Thanks!

Fwd: fs cache giving me headaches

Posted by Koert Kuipers <ko...@tresata.com>.

---------- Forwarded message ----------
From: "Koert Kuipers" <ko...@tresata.com>
Date: Aug 4, 2012 1:54 PM
Subject: fs cache giving me headaches
To: <co...@hadoop.apache.org>

nothing has confused me as much in hadoop as FileSystem.close().
any decent java programmer that sees that an object implements Closeable
writes code like this:
final FileSystem fs = FileSystem.get(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

so i started out using hadoop FileSystem like this, and i ran into all
sorts of weird errors where FileSystems in unrelated code (sometimes not
even my code) started misbehaving and streams were unexpectedly shut. Then
i realized that FileSystem uses a cache and close() closes the cached
instance for everyone! Not pretty in my opinion, but i can live with it. So
i checked other code and found that basically nobody closes FileSystems.
Apparently the expected way of using FileSystems is to simply never close
them. So i adopted this approach (which i think is really contrary to java
conventions for a Closeable).

Lately i started working on some code for a daemon/server where many
FileSystem objects are created for different users (UGIs) that use the
service. As it turns out other projects have run into trouble with the
FileSystem cache in situations like this (for example, Scribe and Hoop). I
imagine the cache can get very large and cause problems (i have not tested
this myself).

Looking at the code for Hoop i noticed they simply turned off the
FileSystem cache and made sure to close every FileSystem. So here the
suggested approach to deal with FileSystems seems to be:
final FileSystem fs = FileSystem.newInstance(conf); // or
// FileSystem.get(conf) but with caching turned off in the conf
try {
    // do something with fs
} finally {
    fs.close();
}

This code bypasses the cache if i understand it correctly, avoiding any
cache size limitations. However if i adopt this approach i basically can
not re-use any existing code or libraries that do not close FileSystems,
splitting the codebase into two which is pretty ugly. And this code is not
efficient in situations where there are very few used FileSystem objects
and a cache would improve performance, so the split works both ways.

In short, there is no single way to code with FileSystem that works in both
situations! Ideally i would have liked fs.close() to do the right thing
depending on the settings: if the cache is turned off it closes the
FileSystem, and if it is turned on it's a no-op. That way i could always use
FileSystem.get(conf) and always close my filesystems, and the code would be
usable irrespective of whether the cache is turned on or off.

Any insights or suggestions? Thanks!
