
Posted to dev@nutch.apache.org by "Vladimir Garvardt (JIRA)" <ji...@apache.org> on 2008/06/21 15:32:46 UTC

[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

    [ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607005#action_12607005 ] 

Vladimir Garvardt commented on NUTCH-442:
-----------------------------------------

Hello.

I'm trying to apply this patch and have run into a problem that I cannot solve by myself.

I checked out Nutch trunk (rev 670194), downloaded the attachments from this issue and started patching.
First I applied Crawl.patch, then Indexer.patch and then NUTCH-442_v5.patch. On applying the last patch I got a warning message. This happened because of a conflict between Crawl.patch and NUTCH-442_v5.patch.

Crawl.patch performs the following action:
// index, dedup & merge
+      indexer.index(indexes, solrUrl, crawlDb, linkDb,
+          Arrays.asList(fs.listPaths(segments, HadoopFSUtil.getPassAllFilter())));

and NUTCH-442_v5.patch performs the following action:
       // index, dedup & merge
-      indexer.index(indexes, crawlDb, linkDb, fs.listPaths(segments, HadoopFSUtil.getPassAllFilter()));
+      indexer.index(indexes, null, crawlDb, linkDb,
+          Arrays.asList(fs.listPaths(segments, HadoopFSUtil.getPassAllFilter())));


The main difference between these patches is the second parameter.
First I tried building Nutch with the second parameter set to null: crawling finished successfully, but no data was added to Solr.
Then I changed the second parameter to solrUrl and rebuilt Nutch. During indexing the following exception was thrown and indexing failed (no data in Solr):
Indexer: starting
Indexer: crawldb: crawl/crawldb
Indexer: linkdb: crawl/linkdb
Indexer: solrUrl: http://localhost:8984/solr/
Indexer: adding segment: file:/home/vladimirga/Documents/dev/src/lucene-src/nutch-2008-06-21/wrk-01/crawl/segments/20080621200352
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
	at org.apache.nutch.indexer.Indexer.index(Indexer.java:318)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:148)

What could be causing this problem, and how can I fix it so that Nutch indexes into Solr?

Thanks.

> Integrate Solr/Nutch
> --------------------
>
>                 Key: NUTCH-442
>                 URL: https://issues.apache.org/jira/browse/NUTCH-442
>             Project: Nutch
>          Issue Type: New Feature
>         Environment: Ubuntu linux
>            Reporter: rubdabadub
>         Attachments: Crawl.patch, Indexer.patch, NUTCH-442_v4.patch, NUTCH-442_v5.patch, NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, schema.xml
>
>
> Hi:
> I tried out Sami's patch for Solr/Nutch, which can be found here (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html), and I can confirm it worked :-) That led me to request the following:
> I would be very grateful if this could be included in Nutch 0.9, as I am trying to eliminate my Python-based crawler, which posts documents to Solr. As I am in a corporate environment, I can't install the trunk version in production, so I am asking for this to be included in the 0.9 release. I hope my wish will be granted.
> I look forward to getting some feedback.
> Thank you.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (NUTCH-442) Integrate Solr/Nutch

Posted by Julien Nioche <li...@gmail.com>.

Vladimir,

The Crawl and Indexer patches on the one hand and NUTCH-442_v5.patch on the
other duplicate some of the same changes.
I simply replaced the sections of NUTCH-442_v5.patch that are also modified by
the Crawl and Indexer patches, then applied the modified patch to the code.
That worked fine.
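
To see which files are touched by more than one patch before editing any of
them, the file headers of the patches can be compared. The following is only a
minimal sketch, assuming each patch is a unified diff with "--- " / "+++ "
header pairs (the form svn diff produces). The class name PatchOverlap, its
methods and the sample patch text are hypothetical and are not taken from the
attachments on this issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PatchOverlap {

    /**
     * Returns the paths named on "+++ " header lines of a unified diff.
     * Heuristic: a "+++ " line only counts as a header when the line
     * before it starts with "--- ".
     */
    public static Set<String> touchedFiles(String patchText) {
        Set<String> files = new LinkedHashSet<>();
        String previous = "";
        for (String line : patchText.split("\\r?\\n")) {
            if (line.startsWith("+++ ") && previous.startsWith("--- ")) {
                String rest = line.substring(4).trim();
                // svn diff appends a tab and "(working copy)" or "(revision N)"
                int tab = rest.indexOf('\t');
                files.add(tab >= 0 ? rest.substring(0, tab) : rest);
            }
            previous = line;
        }
        return files;
    }

    /** Maps each file touched by two or more patches to those patch names. */
    public static Map<String, List<String>> overlaps(Map<String, String> patchesByName) {
        Map<String, List<String>> byFile = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : patchesByName.entrySet()) {
            for (String file : touchedFiles(e.getValue())) {
                byFile.computeIfAbsent(file, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Keep only the files that appear in at least two patches.
        byFile.values().removeIf(names -> names.size() < 2);
        return byFile;
    }

    public static void main(String[] args) {
        // Made-up sample input; real use would read the patch files from disk.
        Map<String, String> patches = new LinkedHashMap<>();
        patches.put("first.patch",
            "--- src/java/Example.java\t(revision 1)\n"
          + "+++ src/java/Example.java\t(working copy)\n"
          + "@@ -1 +1 @@\n-old\n+new\n");
        patches.put("second.patch",
            "--- src/java/Example.java\t(revision 1)\n"
          + "+++ src/java/Example.java\t(working copy)\n"
          + "@@ -1 +1 @@\n-old\n+other\n");
        overlaps(patches).forEach((file, names) ->
            System.out.println(file + " is modified by " + names));
    }
}
```

Any file reported by more than one patch is a candidate for the kind of manual
merge described above.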

J.



-- 
DigitalPebble Ltd
http://www.digitalpebble.com