Posted to common-user@hadoop.apache.org by Taeho Kang <tk...@gmail.com> on 2008/11/03 07:57:10 UTC
Question on opening file info from namenode in DFSClient
Dear Hadoop Users and Developers,
I was wondering if there's a plan to add a "file info cache" to DFSClient?
It could eliminate the network round trip to the Namenode, and I
think it would greatly improve DFSClient's performance.
The code I was looking at is this:
-----------------------
DFSClient.java
/**
 * Grab the open-file info from namenode
 */
synchronized void openInfo() throws IOException {
  /* Maybe, we could add a file info cache here! */
  LocatedBlocks newInfo = callGetBlockLocations(src, 0, prefetchSize);
  if (newInfo == null) {
    throw new IOException("Cannot open filename " + src);
  }
  if (locatedBlocks != null) {
    Iterator<LocatedBlock> oldIter = locatedBlocks.getLocatedBlocks().iterator();
    Iterator<LocatedBlock> newIter = newInfo.getLocatedBlocks().iterator();
    while (oldIter.hasNext() && newIter.hasNext()) {
      if (!oldIter.next().getBlock().equals(newIter.next().getBlock())) {
        throw new IOException("Blocklist for " + src + " has changed!");
      }
    }
  }
  this.locatedBlocks = newInfo;
  this.currentNode = null;
}
-----------------------
Does anybody have an opinion on this matter?
Thank you in advance,
Taeho
Re: Question on opening file info from namenode in DFSClient
Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Consider the case of a file getting removed and recreated with the same
name while the DFSClient (in a mapper/reducer) still holds cached info
about the file and the job is running.
- Mridul
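Mridul's hazard can be seen in a toy simulation: a client-side cache keyed only by path keeps serving the old block list after the file is removed and recreated under the same name. All class, map, and block names below are hypothetical stand-ins, not the real DFSClient code:

```java
import java.util.HashMap;
import java.util.Map;

/** Toy simulation of the staleness hazard: a client-side cache keyed by
 *  path keeps serving the old block list after the file is removed and
 *  recreated under the same name. All names here are hypothetical. */
class StaleCacheDemo {
    static Map<String, String> namenode = new HashMap<>();    // path -> block list
    static Map<String, String> clientCache = new HashMap<>(); // the proposed cache

    static String open(String path) {
        // a cache hit short-circuits the namenode round trip -- and any recreation
        return clientCache.computeIfAbsent(path, namenode::get);
    }

    public static void main(String[] args) {
        namenode.put("/data/f", "blk_1,blk_2");
        String first = open("/data/f");           // caches blk_1,blk_2

        // file removed and recreated with the same name (Mridul's scenario)
        namenode.put("/data/f", "blk_7,blk_8");

        String second = open("/data/f");          // still the stale block list!
        System.out.println(first.equals(second)); // prints "true": client never sees blk_7,blk_8
    }
}
```

Any real cache would need an invalidation signal (e.g. a file id or generation number from the namenode) to detect this, which itself costs a namenode round trip.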
Taeho Kang wrote:
> [...]
Re: Question on opening file info from namenode in DFSClient
Posted by Dhruba Borthakur <dh...@gmail.com>.
Hi Taeho,
Thanks for your explanation. If your application opens a DFS file and
does not close it, then the DFSClient will automatically keep the block
locations cached. So you could achieve your desired goal by
developing a cache layer (above HDFS) that does not close the HDFS
file even when the user has closed it. This cache layer needs to manage
this cache pool of HDFS file handles.
Does this help?
thanks,
dhruba
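Dhruba's suggestion could be sketched roughly as follows: a small LRU pool of open handles sitting above HDFS, so a user-level "close" just leaves the handle in the pool and the underlying file (with its cached block locations) stays open. This is a generic sketch under stated assumptions: HandlePool and its details are hypothetical, and the opener function is assumed to wrap something like fs.open(new Path(path)).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU pool of open file handles, kept above HDFS so the
 *  underlying file (and its cached block locations) stays open across
 *  user-level open/close cycles. */
class HandlePool<H> {
    private final int capacity;
    private final Function<String, H> opener;  // e.g. path -> fs.open(new Path(path))
    private final LinkedHashMap<String, H> pool;

    HandlePool(int capacity, Function<String, H> opener) {
        this.capacity = capacity;
        this.opener = opener;
        // access-order map: iteration order is least-recently-used first
        this.pool = new LinkedHashMap<String, H>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, H> e) {
                return size() > HandlePool.this.capacity; // evict the LRU handle
            }
        };
    }

    /** Reuse an already-open handle when possible; open and cache otherwise. */
    synchronized H get(String path) {
        H h = pool.get(path);
        if (h == null) {
            h = opener.apply(path);
            pool.put(path, h);
        }
        return h;
    }
}
```

A real implementation would also have to close evicted handles, reference-count handles still in use, and invalidate an entry when the file it points at is replaced.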
On Fri, Nov 7, 2008 at 12:53 AM, Taeho Kang <tk...@gmail.com> wrote:
> Hi, thanks for your reply Dhruba,
>
> One of my co-workers is writing a BigTable-like application that could be
> used for online, near-real-time services. Since the application could be
> hooked into online services, there would be times when a large number of users
> (e.g. 1000 users) request access to a few files in a very short time.
>
> Of course, in a batch-processing job this is a rare case, but for online
> services it's quite common.
> I think HBase developers would have run into similar issues as well.
>
> Is this enough explanation?
>
> Thanks in advance,
>
> Taeho
>
>
>
> On Tue, Nov 4, 2008 at 3:12 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
>
>> In the current code, details about block locations of a file are
>> cached on the client when the file is opened. This cache remains with
>> the client until the file is closed. If the same file is re-opened by
>> the same DFSClient, it re-contacts the namenode and refetches the
>> block locations. This works well for most map-reduce apps because it is
>> rare that the same DFSClient re-opens the same file again.
>>
>> Can you please explain your use case?
>>
>> thanks,
>> dhruba
>>
>>
>> On Sun, Nov 2, 2008 at 10:57 PM, Taeho Kang <tk...@gmail.com> wrote:
>> > [...]
>>
>
Re: Question on opening file info from namenode in DFSClient
Posted by Owen O'Malley <om...@apache.org>.
On Nov 7, 2008, at 12:53 AM, Taeho Kang wrote:
> One of my co-workers is writing a BigTable-like application that could
> be used for online, near-real-time services.
How is the new BigTable-like application different from HBase and
HyperTable?
-- Owen
Re: Question on opening file info from namenode in DFSClient
Posted by stack <st...@duboce.net>.
Taeho Kang wrote:
> Hi, thanks for your reply Dhruba,
>
> One of my co-workers is writing a BigTable-like application that could be
> used for online, near-real-time services.
Can your co-worker be convinced to instead spend his time helping-along
the ongoing bigtable-like efforts?
> I think HBase developers would have run into similar issues as well.
>
In HBase, we open the file once and keep it open. The file is shared
amongst all clients.
St.Ack
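The pattern stack describes can be sketched with plain java.nio standing in for HDFS: open the file once, share the handle, and serve each caller with a positioned read so there is no shared cursor to fight over (HDFS's FSDataInputStream offers a similar positioned read(position, buffer, offset, length)). The class and method names here are illustrative only:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch of the one-shared-open-handle pattern: a single FileChannel is
 *  opened once and shared, with positioned reads so concurrent callers
 *  never disturb each other's position. */
class SharedHandleDemo {
    static String readAt(FileChannel shared, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        shared.read(buf, pos);  // positioned read: does not move a shared cursor
        return new String(buf.array(), 0, buf.position());
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("shared", ".txt");
        Files.writeString(p, "hello world");
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            // two "clients" share the one open handle
            System.out.println(readAt(ch, 0, 5));  // prints "hello"
            System.out.println(readAt(ch, 6, 5));  // prints "world"
        }
        Files.delete(p);
    }
}
```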
Re: Question on opening file info from namenode in DFSClient
Posted by Taeho Kang <tk...@gmail.com>.
Hi, thanks for your reply Dhruba,
One of my co-workers is writing a BigTable-like application that could be
used for online, near-real-time services. Since the application could be
hooked into online services, there would be times when a large number of users
(e.g. 1000 users) request access to a few files in a very short time.
Of course, in a batch-processing job this is a rare case, but for online
services it's quite common.
I think HBase developers would have run into similar issues as well.
Is this enough explanation?
Thanks in advance,
Taeho
On Tue, Nov 4, 2008 at 3:12 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> In the current code, details about block locations of a file are
> cached on the client when the file is opened. This cache remains with
> the client until the file is closed. If the same file is re-opened by
> the same DFSClient, it re-contacts the namenode and refetches the
> block locations. This works well for most map-reduce apps because it is
> rare that the same DFSClient re-opens the same file again.
>
> Can you please explain your use case?
>
> thanks,
> dhruba
>
>
> On Sun, Nov 2, 2008 at 10:57 PM, Taeho Kang <tk...@gmail.com> wrote:
> > [...]
>
Re: Question on opening file info from namenode in DFSClient
Posted by Dhruba Borthakur <dh...@gmail.com>.
In the current code, details about block locations of a file are
cached on the client when the file is opened. This cache remains with
the client until the file is closed. If the same file is re-opened by
the same DFSClient, it re-contacts the namenode and refetches the
block locations. This works well for most map-reduce apps because it is
rare that the same DFSClient re-opens the same file again.
Can you please explain your use case?
thanks,
dhruba
On Sun, Nov 2, 2008 at 10:57 PM, Taeho Kang <tk...@gmail.com> wrote:
> [...]