Posted to dev@nutch.apache.org by Mike Smith <mi...@gmail.com> on 2007/01/16 21:30:03 UTC

Amazon S3/Ec2 problem [injection and fs.rename() problem]

I've been testing Tom's nice S3/EC2 patch on a couple of EC2/S3 machines.
The Injector fails to inject URLs, because fs.rename() in line 145 of
CrawlDb.java deletes the whole content and only renames the parent folder
from xxxxx to current. Basically, crawl_dir/crawldb/current will be an
empty folder after renaming.
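
Roughly, the failing pattern is the rename of the temp output directory
over the current db. A minimal sketch of what goes wrong (the temp dir
name is made up, and conf stands for a Configuration pointing at the S3
filesystem; this is not copied from CrawlDb.java):

  FileSystem fs = FileSystem.get(conf);
  Path tempDir = new Path("crawl_dir/crawldb/1168972437");  // hypothetical temp output
  Path current = new Path("crawl_dir/crawldb/current");
  // On S3FileSystem only the parent inode is moved; the children are lost.
  fs.rename(tempDir, current);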

I have not gone through the Hadoop fs.rename code; I thought maybe somebody
has solved this problem before.

Thanks, Mike

Re: Amazon S3/Ec2 problem [injection and fs.rename() problem]

Posted by Mike Smith <mi...@gmail.com>.
I went through the S3FileSystem.java code and fixed the renameRaw() method.
Now it iterates through the folders recursively and renames them. Also, in
the case of an existing destination folder, it moves the src folder under
the dst folder.

Here is the replacement code for S3FileSystem.java. The existing
renameRaw() method should be replaced by the following two methods:


@Override
public boolean renameRaw(Path src, Path dst) throws IOException {
  Path absoluteDst = makeAbsolute(dst);
  Path absoluteSrc = makeAbsolute(src);

  INode inode = store.getINode(absoluteDst);
  // Check whether the dst folder already exists. If it does, move the
  // src folder under the existing path.
  if (inode != null && inode.isDirectory()) {
    Path newDst = new Path(absoluteDst.toString() + "/" + absoluteSrc.getName());
    return renameRaw(src, newDst, src);
  } else {
    // If the dst folder does not exist, it will be created by the rename.
    return renameRaw(src, dst, src);
  }
}

// Recursively walks all the subfolders and renames them.
public boolean renameRaw(Path src, Path dst, Path orgSrc) throws IOException {
  Path absoluteSrc = makeAbsolute(src);
  // Rewrite the original source prefix (orgSrc) to the destination prefix (dst).
  // NOTE: replaceFirst treats orgSrc as a regular expression; that is fine for
  // plain paths but would need quoting for paths with regex metacharacters.
  Path newDst = new Path(src.toString().replaceFirst(orgSrc.toString(),
      dst.toString()));
  Path absoluteDst = makeAbsolute(newDst);
  LOG.info(absoluteSrc.toString());
  INode inode = store.getINode(absoluteSrc);
  if (inode == null) {
    return false;
  }
  // Store the inode (file or directory) under its new path.
  store.storeINode(absoluteDst, inode);
  if (inode.isDirectory()) {
    // Rename each child of the directory recursively.
    Path[] contents = listPathsRaw(absoluteSrc);
    if (contents == null) {
      return false;
    }
    for (Path p : contents) {
      if (!renameRaw(p, dst, orgSrc)) {
        return false;
      }
    }
  }
  store.deleteINode(absoluteSrc);
  return true;
}
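
A quick way to check the two cases, assuming a FileSystem handle fs backed
by the patched S3FileSystem (the paths below are made up):

  // dst does not exist: src is renamed to dst, with its children intact.
  fs.rename(new Path("/tmp/db-123"), new Path("/crawldb/current"));
  // dst already exists as a directory: src is moved under it,
  // so the second db ends up at /crawldb/current/db-456.
  fs.rename(new Path("/tmp/db-456"), new Path("/crawldb/current"));
  boolean ok = fs.exists(new Path("/crawldb/current/db-456"));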


On 1/16/07, Mike Smith <mi...@gmail.com> wrote:
>
> I've been testing Tom's nice S3/EC2 patch on a couple of EC2/S3 machines.
> The Injector fails to inject URLs, because fs.rename() in line 145 of
> CrawlDb.java deletes the whole content and only renames the parent folder
> from xxxxx to current. Basically, crawl_dir/crawldb/current will be an
> empty folder after renaming.
>
> I have not gone through the Hadoop fs.rename code; I thought maybe
> somebody has solved this problem before.
>
> Thanks, Mike
>

Re: Amazon S3/Ec2 problem [injection and fs.rename() problem]

Posted by Tom White <to...@gmail.com>.
I've created a Jira issue for this:
https://issues.apache.org/jira/browse/HADOOP-901.

Thanks,

Tom

On 17/01/07, Mike Smith <mi...@gmail.com> wrote:
> I went through the S3FileSystem.java code and fixed the renameRaw()
> method. Now it iterates through the folders recursively and renames them.
> Also, in the case of an existing destination folder, it moves the src
> folder under the dst folder.
> [snip]
