Posted to common-user@hadoop.apache.org by Taeho Kang <tk...@gmail.com> on 2008/11/03 07:57:10 UTC

Question on opening file info from namenode in DFSClient

Dear Hadoop Users and Developers,

I was wondering if there's a plan to add a "file info cache" to DFSClient?

It could eliminate the network round trips to the Namenode, and I think
it would greatly improve DFSClient's performance. The code I was looking
at is this:

-----------------------
DFSClient.java

    /**
     * Grab the open-file info from namenode
     */
    synchronized void openInfo() throws IOException {
      /* Maybe, we could add a file info cache here! */
      LocatedBlocks newInfo = callGetBlockLocations(src, 0, prefetchSize);
      if (newInfo == null) {
        throw new IOException("Cannot open filename " + src);
      }
      if (locatedBlocks != null) {
        Iterator<LocatedBlock> oldIter =
            locatedBlocks.getLocatedBlocks().iterator();
        Iterator<LocatedBlock> newIter =
            newInfo.getLocatedBlocks().iterator();
        while (oldIter.hasNext() && newIter.hasNext()) {
          if (!oldIter.next().getBlock().equals(newIter.next().getBlock())) {
            throw new IOException("Blocklist for " + src + " has changed!");
          }
        }
      }
      this.locatedBlocks = newInfo;
      this.currentNode = null;
    }
-----------------------
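
To make the idea concrete, here is a minimal sketch of what such a cache
might look like inside DFSClient, reusing the fields from the method
above. The CACHE field and openInfoCached() method are hypothetical, not
existing code, and the sketch deliberately skips invalidation, which is
the hard part:

-----------------------
// Hypothetical addition to DFSClient.java; needs java.util.Collections,
// java.util.HashMap and java.util.Map imported.

    private static final Map<String, LocatedBlocks> CACHE =
        Collections.synchronizedMap(new HashMap<String, LocatedBlocks>());

    synchronized void openInfoCached() throws IOException {
      LocatedBlocks cached = CACHE.get(src);
      if (cached != null) {
        this.locatedBlocks = cached;  // skip the namenode round trip
        this.currentNode = null;
        return;
      }
      openInfo();                     // fall back to the existing RPC path
      CACHE.put(src, this.locatedBlocks);
    }
-----------------------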

Does anybody have an opinion on this matter?

Thank you in advance,

Taeho

Re: Question on opening file info from namenode in DFSClient

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Consider the case of a file getting removed and recreated with the same
name while a DFSClient (say, in a running mapper/reducer) still holds
cached info about the old file.
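
To make the hazard concrete, a hedged sketch using the public FileSystem
API; the path is made up, and the cache mentioned in the comments is the
hypothetical one from the original post:

-----------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StaleCacheHazard {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/cached-file");  // made-up path

    // Set up: create the file once so it can be opened.
    FSDataOutputStream out = fs.create(p);
    out.writeBytes("old contents");
    out.close();

    // Client A opens the file; suppose its block locations get cached.
    fs.open(p).close();

    // Client B removes and recreates the same path: same name, new blocks.
    fs.delete(p, false);
    out = fs.create(p);
    out.writeBytes("new contents");
    out.close();

    // A cache entry keyed only by the path would still point at the
    // old, now-deleted blocks, so a read through it would fail or
    // return the wrong data.
  }
}
-----------------------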

- Mridul


Taeho Kang wrote:
> I was wondering if there's a plan to add a "file info cache" to DFSClient?
> [...]


Re: Question on opening file info from namenode in DFSClient

Posted by Dhruba Borthakur <dh...@gmail.com>.
Hi Taeho,

Thanks for your explanation. If your application opens a DFS file and
does not close it, then the DFSClient will automatically keep the block
locations cached. So you could achieve your desired goal by developing
a cache layer (above HDFS) that does not close the HDFS file even when
the user has closed it. This cache layer needs to manage the pool of
open HDFS file handles.
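
A minimal sketch of such a layer, assuming handles keyed by path; the
OpenFileCache class and its methods are illustrative, not part of any
Hadoop API:

-----------------------
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileCache {
  private final FileSystem fs;
  private final Map<String, FSDataInputStream> handles =
      new HashMap<String, FSDataInputStream>();

  public OpenFileCache(FileSystem fs) { this.fs = fs; }

  /** Return a shared open handle, opening and caching it on first use. */
  public synchronized FSDataInputStream open(Path p) throws IOException {
    FSDataInputStream in = handles.get(p.toString());
    if (in == null) {
      in = fs.open(p);  // the namenode is contacted once, here
      handles.put(p.toString(), in);
    }
    return in;          // callers must not close() this shared handle
  }

  /** Drop and close a handle, e.g. when the file is known to have changed. */
  public synchronized void evict(Path p) throws IOException {
    FSDataInputStream in = handles.remove(p.toString());
    if (in != null) {
      in.close();
    }
  }
}
-----------------------

Note that concurrent readers of a shared handle should stick to the
positioned-read methods (e.g. readFully(long, byte[], int, int)), which
do not move the stream's seek pointer.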

does this help?
thanks,
dhruba




On Fri, Nov 7, 2008 at 12:53 AM, Taeho Kang <tk...@gmail.com> wrote:
> One of my co-workers is writing a BigTable-like application that could
> be used for online, near-real-time services.
> [...]

Re: Question on opening file info from namenode in DFSClient

Posted by Owen O'Malley <om...@apache.org>.
On Nov 7, 2008, at 12:53 AM, Taeho Kang wrote:

> One of my co-workers is writing a BigTable-like application that could
> be used for online, near-real-time services.

How is the new BigTable-like application different from HBase and
HyperTable?

-- Owen

Re: Question on opening file info from namenode in DFSClient

Posted by stack <st...@duboce.net>.
Taeho Kang wrote:
> Hi, thanks for your reply, Dhruba.
>
> One of my co-workers is writing a BigTable-like application that could
> be used for online, near-real-time services.
Can your co-worker be convinced to instead spend his time helping along
the ongoing BigTable-like efforts?
> I think HBase developers would have run into similar issues as well.
In HBase, we open the file once and keep it open. The file is shared
amongst all clients.
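
For what it's worth, a shared handle can serve concurrent readers
through the positioned-read API; a hedged sketch (readAt is a made-up
helper, not HBase code):

-----------------------
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

public class SharedReads {
  /**
   * Read len bytes at the given offset from a handle that was opened
   * once and is shared by all callers. The positioned read does not
   * move the stream's seek pointer, so concurrent callers do not
   * interfere with each other.
   */
  static byte[] readAt(FSDataInputStream in, long offset, int len)
      throws IOException {
    byte[] buf = new byte[len];
    in.readFully(offset, buf, 0, len);
    return buf;
  }
}
-----------------------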

St.Ack

Re: Question on opening file info from namenode in DFSClient

Posted by Taeho Kang <tk...@gmail.com>.
Hi, thanks for your reply, Dhruba.

One of my co-workers is writing a BigTable-like application that could
be used for online, near-real-time services. Since the application could
be hooked into online services, there would be times when a large number
of users (e.g. 1000 users) request access to a few files in a very short
time.

Of course, in a batch-processing job this is a rare case, but for online
services it's quite a common one.
I think HBase developers would have run into similar issues as well.

Is this enough explanation?

Thanks in advance,

Taeho



On Tue, Nov 4, 2008 at 3:12 AM, Dhruba Borthakur <dh...@gmail.com> wrote:

> In the current code, details about block locations of a file are
> cached on the client when the file is opened. This cache remains with
> the client until the file is closed.
> [...]

Re: Question on opening file info from namenode in DFSClient

Posted by Dhruba Borthakur <dh...@gmail.com>.
In the current code, details about block locations of a file are
cached on the client when the file is opened. This cache remains with
the client until the file is closed. If the same file is re-opened by
the same DFSClient, it re-contacts the namenode and refetches the
block locations. This works ok for most map-reduce apps because it is
rare that the same DFSClient re-opens the same file again.
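
As a tiny illustration of that behavior (the path comes from the
command line; nothing is cached across the two opens):

-----------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReopenRefetch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path(args[0]);        // path of an existing file
    FSDataInputStream a = fs.open(p);  // block locations fetched from the namenode
    a.close();                         // ...and discarded on close
    FSDataInputStream b = fs.open(p);  // re-opening contacts the namenode again
    b.close();
  }
}
-----------------------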

Can you please explain your use-case?

thanks,
dhruba


On Sun, Nov 2, 2008 at 10:57 PM, Taeho Kang <tk...@gmail.com> wrote:
> I was wondering if there's a plan to add a "file info cache" to DFSClient?
> [...]