Posted to dev@nutch.apache.org by Doğacan Güney <do...@agmlab.com> on 2006/12/28 17:15:38 UTC
linkdb bug
Hi,
After today's big update, it seems invertlinks doesn't work if a linkdb
doesn't exist already, because fs.exists checks the wrong directory
(linkdb/ but not linkdb/current).
A simple patch is attached.
--
Doğacan Güney
Re: linkdb bug
Posted by Andrzej Bialecki <ab...@getopt.org>.
Doğacan Güney wrote:
> Hi,
>
> After today's big update, it seems invertlinks doesn't work if a
> linkdb doesn't exist already, because fs.exists checks the wrong
> directory (linkdb/ but not linkdb/current).
>
> A simple patch is attached.
>
> --
> Doğacan Güney
> ------------------------------------------------------------------------
>
> Index: src/java/org/apache/nutch/crawl/LinkDb.java
> ===================================================================
> --- src/java/org/apache/nutch/crawl/LinkDb.java (revision 490745)
> +++ src/java/org/apache/nutch/crawl/LinkDb.java (working copy)
> @@ -212,6 +212,7 @@
>    public void invert(Path linkDb, Path[] segments, boolean normalize, boolean filter, boolean force) throws IOException {
>
>      Path lock = new Path(linkDb, LOCK_NAME);
> +    Path currentLinkDb = new Path(linkDb, CURRENT_NAME);
>      FileSystem fs = FileSystem.get(getConf());
>      LockUtil.createLockFile(fs, lock, force);
>      if (LOG.isInfoEnabled()) {
> @@ -233,14 +234,14 @@
>        LockUtil.removeLockFile(fs, lock);
>        throw e;
>      }
> -    if (fs.exists(linkDb)) {
> +    if (fs.exists(currentLinkDb)) {
>        if (LOG.isInfoEnabled()) {
>          LOG.info("LinkDb: merging with existing linkdb: " + linkDb);
>        }
>        // try to merge
>        Path newLinkDb = job.getOutputPath();
>        job = LinkDb.createMergeJob(getConf(), linkDb, normalize, filter);
> -      job.addInputPath(new Path(linkDb, CURRENT_NAME));
> +      job.addInputPath(currentLinkDb);
>        job.addInputPath(newLinkDb);
>        try {
>          JobClient.runJob(job);
>
Indeed, this may cause problems, especially if you already have a
directory called linkdb that is completely empty (i.e. it doesn't contain
the CURRENT_NAME subdir).
I'll fix it - thanks!
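
To illustrate the empty-directory case described above, here is a minimal sketch. It uses java.nio.file on the local filesystem rather than the Hadoop FileSystem API that LinkDb.java actually calls, so it is only an analogy. The class name LinkDbExistsSketch and the methods shouldMergeOld and shouldMergeNew are hypothetical, and the value "current" for CURRENT_NAME is assumed from the "linkdb/current" path mentioned in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LinkDbExistsSketch {
    // Assumption: mirrors the "linkdb/current" layout mentioned in the thread.
    static final String CURRENT_NAME = "current";

    // Pre-patch style check: true for any existing linkdb directory,
    // even one that is completely empty.
    static boolean shouldMergeOld(Path linkDb) {
        return Files.exists(linkDb);
    }

    // Patched style check: true only if the "current" subdirectory exists,
    // i.e. there is actually something to merge with.
    static boolean shouldMergeNew(Path linkDb) {
        return Files.exists(linkDb.resolve(CURRENT_NAME));
    }

    public static void main(String[] args) throws IOException {
        // An empty directory stands in for a linkdb that has no data yet.
        Path linkDb = Files.createTempDirectory("linkdb");

        System.out.println("old check on empty linkdb: " + shouldMergeOld(linkDb));
        System.out.println("new check on empty linkdb: " + shouldMergeNew(linkDb));

        // Once the "current" subdirectory exists, the patched check passes too.
        Path current = Files.createDirectory(linkDb.resolve(CURRENT_NAME));
        System.out.println("new check with current/:   " + shouldMergeNew(linkDb));

        // Clean up the temporary directories.
        Files.delete(current);
        Files.delete(linkDb);
    }
}
```

On an empty directory the old-style check reports true while the patched-style check reports false, which is the distinction the patch relies on when deciding whether to run the merge job.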
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: linkdb bug
Posted by Andrzej Bialecki <ab...@getopt.org>.
Doğacan Güney wrote:
> There is a problem with the indexer too. It doesn't check for the new
> CrawlDatum statuses. Patch attached.
I just fixed both bugs (rev 491291) - thanks!
--
Best regards,
Andrzej Bialecki <><
Re: linkdb bug
Posted by Doğacan Güney <do...@agmlab.com>.
There is a problem with the indexer too. It doesn't check for the new
CrawlDatum statuses. Patch attached.