Posted to solr-user@lucene.apache.org by Adam Estrada <es...@gmail.com> on 2011/01/03 17:45:30 UTC

Re: [Nutch] and Solr integration

All,

I realize that the documentation says that you crawl first and then add to Solr,
but I spent several hours running the same command through Cygwin with
-solrindex http://localhost:8983/solr on the command line (e.g. bin/nutch
crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex
http://localhost:8983/solr) and it worked. Does anyone know why it's not
working for me anymore? I am using the Lucid build of Solr, which is what I
was using before. I neglected to write down the command-line syntax, which is
biting me in the arse. Any tips on this one would be great!
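
The crawl-first-then-index flow mentioned above can be sketched as follows (a sketch only: the solrindex arguments follow the Nutch 1.x tutorial, and the crawl/ output paths are the defaults from the command above, so they may differ in your setup):

```shell
# Step 1: crawl only -- no Solr flag, results land under crawl/
bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50

# Step 2: push the crawled data into Solr afterwards
# (solrindex takes the Solr URL, then the crawldb, linkdb, and segments)
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
  crawl/linkdb crawl/segments/*
```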

Thanks,
Adam

On Mon, Dec 20, 2010 at 4:21 PM, Anurag <an...@gmail.com> wrote:

>
> Why are you using solrindex in the argument? It is used when we need to
> index the crawled data into Solr.
> For more, read http://wiki.apache.org/nutch/NutchTutorial .
>
> Also, for Nutch-Solr integration this blog is very useful:
> http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
> I integrated Nutch and Solr and it works well.
>
> Thanks
>
> On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene] <
> ml-node+2122347-622655030-146354@n3.nabble.com> wrote:
>
> > All,
> >
> > I have a couple of websites that I need to crawl, and the following
> > command line used to work, I think. Solr is up and running and everything
> > is fine there, and I can go through and index the site, but I really need
> > the results added to Solr after the crawl. Does anyone have any idea how
> > to make that happen, or what I'm doing wrong? These errors are being
> > thrown from Hadoop, which I am not using at all.
> >
> > $ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50
> > -solrindex http://localhost:8983/solr
> > crawl started in: crawl
> > rootUrlDir = http://localhost:8983/solr
> > threads = 10
> > depth = 100
> > indexer=lucene
> > topN = 50
> > Injector: starting at 2010-12-20 15:23:25
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: http://localhost:8983/solr
> > Injector: Converting injected urls to crawl db entries.
> > Exception in thread "main" java.io.IOException: No FileSystem for scheme: http
> >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
> >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
> >         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
> >         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:169)
> >         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> >         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> >         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> >         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >         at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
> >
> >
>
>
>
> --
> Kumar Anurag
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122623.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: [Nutch] and Solr integration

Posted by Adam Estrada <es...@gmail.com>.
BLEH! <facepalm> This is entirely possible to do in a single step AS LONG AS
YOU GET THE SYNTAX CORRECT ;-)

http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solr
http://localhost:8983/solr

The correct param is -solr, NOT -solrindex.
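
Putting it together, a quick sanity check before re-running a long crawl might look like this (a sketch: it assumes a default standalone Solr on port 8983 with the stock /solr/admin/ping handler of that era; adjust the URL if your solrconfig.xml differs):

```shell
# Confirm Solr is actually reachable before kicking off the crawl
curl -s http://localhost:8983/solr/admin/ping

# Then crawl and index in one step, with the correct flag: -solr, not -solrindex
bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 \
  -solr http://localhost:8983/solr
```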

Cheers,
Adam
