Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2014/09/21 18:53:33 UTC

[jira] [Comment Edited] (NUTCH-1844) testresources/testcrawl not referenced anywhere in code.

    [ https://issues.apache.org/jira/browse/NUTCH-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142510#comment-14142510 ] 

Chris A. Mattmann edited comment on NUTCH-1844 at 9/21/14 4:52 PM:
-------------------------------------------------------------------

After examining the Nutch 1.2 CrawlDbConverter:
http://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/tools/compat/CrawlDbConverter.html

And running it:

{noformat}
[chipotle:~/tmp/nutch1.2] mattmann% java -Djava.ext.dirs=build:lib org.apache.nutch.tools.compat.CrawlDbConverter ../nutch/src/testresources/testcrawl/crawldb foo -withMetadata
[chipotle:~/tmp/nutch1.2] mattmann% ls
CHANGES.txt         LICENSE.txt         README.txt          build/              conf/               default.properties  hadoop.log          lib/                src/
KEYS                NOTICE.txt          bin/                build.xml           contrib/            docs/               index.html          site/
[chipotle:~/tmp/nutch1.2] mattmann% ls foo
ls: foo: No such file or directory
[chipotle:~/tmp/nutch1.2] mattmann% ls ../nutch/src/testresources/testcrawl/
crawldb/  index/    indexes/  linkdb/   segments/
[chipotle:~/tmp/nutch1.2] mattmann% ls ../nutch/src/testresources/
fetch-test-site/ test-mime-util/  testcrawl/
[chipotle:~/tmp/nutch1.2] mattmann% ls ../nutch/src/testresources/
fetch-test-site/ test-mime-util/  testcrawl/
[chipotle:~/tmp/nutch1.2] mattmann% java -Djava.ext.dirs=build:lib org.apache.nutch.tools.compat.CrawlDbConverter ../nutch/src/testresources/testcrawl/crawldb ../nutch/src/testresources/testcrawl/crawldb2 -withMetadata
[chipotle:~/tmp/nutch1.2] mattmann% ls ../nutch/src/testresources/
fetch-test-site/ test-mime-util/  testcrawl/
[chipotle:~/tmp/nutch1.2] mattmann% ls ../nutch/src/testresources/test
ls: ../nutch/src/testresources/test: No such file or directory
[chipotle:~/tmp/nutch1.2] mattmann% ls ../nutch/src/testresources/testcrawl/
crawldb/  index/    indexes/  linkdb/   segments/
[chipotle:~/tmp/nutch1.2] mattmann% java -Djava.ext.dirs=build:lib org.apache.nutch.tools.compat.CrawlDbConverter ../nutch/src/testresources/testcrawl/crawldb ../nutch/src/testresources/testcrawl/crawldb2 
[chipotle:~/tmp/nutch1.2] mattmann% ls ../nutch/src/testresources/testcrawl/
crawldb/  index/    indexes/  linkdb/   segments/
{noformat}

I tried it against each of:
* crawldb
* whole crawl dir
* segments

etc. In every case it produces no output, and I can't figure out how to use it. So, rather than invest more time here, I propose the following: if I don't hear objections within 48 hours, I'll delete testresources/testcrawl, since it's not referenced anywhere in the code.
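
For anyone who wants to double-check the "not referenced" claim before the deletion, here is a minimal sketch of one way to look. The helper name `find_references` and the example paths are assumptions for illustration, not part of Nutch, and it relies on a grep that supports `-r` (GNU and BSD grep both do).

```shell
# Hypothetical helper (not part of Nutch): list files under a directory
# whose contents mention a given fixed string.
#   $1 = file or directory to search
#   $2 = literal string to look for
find_references() {
  # -r: recurse, -l: print only matching file names, -F: treat pattern as literal
  grep -rlF -- "$2" "$1" 2>/dev/null
}

# Example invocations (paths are assumptions about the checkout layout):
#   find_references src/java testcrawl
#   find_references src/test testcrawl
#   find_references build.xml testcrawl
```

Empty output, with a non-zero exit status from grep, would mean no file in the searched location mentions the name. This only checks literal mentions, so a path assembled at runtime would not show up.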


> testresources/testcrawl not referenced anywhere in code.
> --------------------------------------------------------
>
>                 Key: NUTCH-1844
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1844
>             Project: Nutch
>          Issue Type: Bug
>          Components: test
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.10
>
>
> While working on NUTCH-1526 in Review Board https://reviews.apache.org/r/9119/ [~lewismc] tried to test out the ./bin/nutch dump tool on src/testresources/testcrawl and found that it failed due to an old o.a.h.io.UTF8 key type (instead of the o.a.h.io.Text type). 
> I looked into this - how were Nutch tests passing using this old code? I found that Andrzej a long time ago wrote a tool to update the index from the old UTF8 key format to Text - I also found that *nowhere in the Nutch code* is the testcrawl referenced.
> My suggestion: 
> * we remove the testcrawl (it's not used)
> * if we don't remove it, we at least run Andrzej's tool on it and then upgrade it to use o.a.h.io.Text keys. 
> I'll take care of this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)