Posted to dev@nutch.apache.org by "Vladimir Garvardt (JIRA)" <ji...@apache.org> on 2008/06/21 15:32:46 UTC
[jira] Commented: (NUTCH-442) Integrate Solr/Nutch
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607005#action_12607005 ]
Vladimir Garvardt commented on NUTCH-442:
-----------------------------------------
Hello.
I'm trying to apply this patch and have run into a problem that I cannot solve myself.
I checked out Nutch trunk (rev 670194), downloaded the attachments from this issue, and started patching.
First I applied Crawl.patch, then Indexer.patch, and then NUTCH-442_v5.patch. Applying the last patch produced a warning message, caused by a conflict between Crawl.patch and NUTCH-442_v5.patch.
Crawl.patch makes the following change:
// index, dedup & merge
+ indexer.index(indexes, solrUrl, crawlDb, linkDb,
+ Arrays.asList(fs.listPaths(segments, HadoopFSUtil.getPassAllFilter())));
and NUTCH-442_v5.patch makes the following change:
// index, dedup & merge
- indexer.index(indexes, crawlDb, linkDb, fs.listPaths(segments, HadoopFSUtil.getPassAllFilter()));
+ indexer.index(indexes, null, crawlDb, linkDb,
+ Arrays.asList(fs.listPaths(segments, HadoopFSUtil.getPassAllFilter())));
The main difference between these patches is the second parameter.
First I built Nutch with the second parameter set to null: crawling finished successfully, but no data was added to Solr.
Then I changed the second parameter to solrUrl and rebuilt Nutch. During indexing the following exception was thrown and indexing failed (no data in Solr):
Indexer: starting
Indexer: crawldb: crawl/crawldb
Indexer: linkdb: crawl/linkdb
Indexer: solrUrl: http://localhost:8984/solr/
Indexer: adding segment: file:/home/vladimirga/Documents/dev/src/lucene-src/nutch-2008-06-21/wrk-01/crawl/segments/20080621200352
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:318)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:148)
What could cause this problem, and how can I fix it so that Nutch indexes into Solr?
Thanks.
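A note on tracking down the cause: the "Job failed!" IOException thrown from JobClient.runJob only reports that the Hadoop job did not succeed; the underlying error is usually recorded in the Hadoop log rather than on the console. The sketch below is a small standalone Java helper, not part of Nutch or of any patch attached to this issue, that prints the exception lines found in a log file. The class name JobFailureLogScan, the method findErrorLines, and the default path logs/hadoop.log are assumptions made for illustration; point it at wherever your installation writes its log.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch or Hadoop.
class JobFailureLogScan {

    // Returns the (trimmed) lines that look like the start of an exception
    // report: lines beginning with "Caused by:" or containing "Exception".
    static List<String> findErrorLines(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.startsWith("Caused by:") || trimmed.contains("Exception")) {
                hits.add(trimmed);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default log location; pass another path as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "logs/hadoop.log");
        if (!Files.exists(log)) {
            System.err.println("No log file found at " + log);
            return;
        }
        for (String hit : findErrorLines(Files.readAllLines(log))) {
            System.out.println(hit);
        }
    }
}
```

Run it from the Nutch working directory after a failed indexing job. Any "Caused by:" line it prints is a better starting point for diagnosis than the three-frame trace above.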
> Integrate Solr/Nutch
> --------------------
>
> Key: NUTCH-442
> URL: https://issues.apache.org/jira/browse/NUTCH-442
> Project: Nutch
> Issue Type: New Feature
> Environment: Ubuntu linux
> Reporter: rubdabadub
> Attachments: Crawl.patch, Indexer.patch, NUTCH-442_v4.patch, NUTCH-442_v5.patch, NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, schema.xml
>
>
> Hi:
> I tried out Sami's patch for Solr/Nutch, which can be found here (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html), and I can confirm it worked :-) That led me to request the following:
> I would be very grateful if this could be included in Nutch 0.9, as I am trying to eliminate my Python-based crawler, which posts documents to Solr. As I am in a corporate environment, I can't install the trunk version in production, so I am asking for this to be included in the 0.9 release. I hope my wish will be granted.
> I look forward to getting some feedback.
> Thank you.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (NUTCH-442) Integrate Solr/Nutch
Posted by Julien Nioche <li...@gmail.com>.
Vladimir,
There is a duplication of changes between the Crawl and Indexer patches on
one hand and NUTCH-442_v5.patch on the other.
I simply replaced, in 442_v5, the sections that are also modified by the Crawl
and Indexer patches, then applied the modified patch to the code. That worked fine.
J.
--
DigitalPebble Ltd
http://www.digitalpebble.com