Posted to user@nutch.apache.org by Ahmad Ajiloo <ah...@gmail.com> on 2011/11/20 14:38:51 UTC

Intranet Document Search with Nutch

Hello,
I tried to set up an intranet search over my documents by following this guide:
http://wiki.apache.org/nutch/IntranetDocumentSearch

The configuration section says: "When configured correctly, there should be
a core located at *http://localhost:8983/solr/nutch*. You can test this by
accessing the administration page at *http://localhost:8983/solr/nutch/admin*
where you can also verify that the schema is being correctly loaded."

I copied schema.xml from Nutch to Solr, but there is no page at this link
(*http://localhost:8983/solr/nutch*) and I get this error:
HTTP ERROR 404

Problem accessing /solr/nutch. Reason:

    NOT_FOUND

*Powered by Jetty://*
----------------------------------
So when I run this command (> ./bin/nutch crawl urls -dir crawl -depth 2
-solr http://localhost:8983/solr/nutch), no data is sent to Solr!

Re: Intranet Document Search with Nutch

Posted by Ahmad Ajiloo <ah...@gmail.com>.
Yes, I'm using the "crawl" command, so the segments were parsed and the crawlDb was updated.

On Tue, Nov 22, 2011 at 4:25 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> Did you parse the segments? Did you update the crawldb with the new
> segments?
>
> Also, copying the schema won't magically create a new Solr core, so unless
> you did that manually, localhost:8983/solr/nutch won't exist.

Re: Intranet Document Search with Nutch

Posted by Markus Jelsma <ma...@openindex.io>.
Did you parse the segments? Did you update the crawldb with the new segments?

Also, copying the schema won't magically create a new Solr core, so unless you
did that manually, localhost:8983/solr/nutch won't exist.
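
One way to tell "Solr is not running" apart from "Solr is running but has no
core called nutch" is to look at the HTTP status of the core URL. The following
is only a minimal sketch in Python: the helper names http_status and describe
are made up for illustration, and nothing here is part of Nutch or Solr.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if no server answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return None


def describe(status):
    """Turn a status code (or None) into a short, human-readable diagnosis."""
    if status is None:
        return "no server reachable"
    if status == 404:
        return "server is up, but nothing is deployed at this path"
    if 200 <= status < 400:
        return "path responds"
    return "unexpected status %d" % status
```

Calling describe(http_status("http://localhost:8983/solr/nutch/admin/")) and
comparing it with the same call for http://localhost:8983/solr/admin/ would
show whether only the "nutch" core path is missing, which is what the 404 from
Jetty in this thread suggests.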

On Tuesday 22 November 2011 13:44:33 Ahmad Ajiloo wrote:
> I checked my logs, but they show no reason for the incomplete indexing.
> After dumping the segment, it is clear that all the contents of the seed
> files were fetched. In the indexing stage, however, none of the data is
> sent to Solr, so there is no indexed data in the
> Solr_Home/example/solr/data/index directory!

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Intranet Document Search with Nutch

Posted by Ahmad Ajiloo <ah...@gmail.com>.
By debugging the code, I found that the indexing filters don't run, because
in the IndexerMapReduce.reduce() function none of the instances in the
"values" variable are parseText or parseData. I don't know why this happens.
As a result, the program doesn't run the indexing filters.
Can anyone help me?
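
If the reducer never sees ParseText or ParseData values, one thing worth
ruling out is that the segments handed to the indexer lack their parse output.
In Nutch 1.x a fetched and parsed segment normally contains the subdirectories
crawl_generate, crawl_fetch, content, crawl_parse, parse_data and parse_text.
The sketch below assumes the segments live on the local filesystem (not HDFS);
the function names missing_parts and report are hypothetical helpers, not
Nutch APIs.

```python
import os

# Subdirectories a fetched segment normally has, and those added by parsing.
FETCH_PARTS = ("crawl_generate", "crawl_fetch", "content")
PARSE_PARTS = ("crawl_parse", "parse_data", "parse_text")


def missing_parts(segment_dir):
    """Return the expected subdirectories that are absent from one segment."""
    return [part for part in FETCH_PARTS + PARSE_PARTS
            if not os.path.isdir(os.path.join(segment_dir, part))]


def report(segments_dir):
    """Map each segment name under segments_dir to its list of missing parts."""
    result = {}
    for name in sorted(os.listdir(segments_dir)):
        path = os.path.join(segments_dir, name)
        if os.path.isdir(path):
            result[name] = missing_parts(path)
    return result
```

For the crawl in this thread, report("crawl/segments") would be the call to
try. A segment that lists parse_data or parse_text as missing was fetched but
not parsed, so there would be nothing for the indexing filters to work on.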

On Tue, Nov 22, 2011 at 4:14 PM, Ahmad Ajiloo <ah...@gmail.com> wrote:

> I checked my logs, but they show no reason for the incomplete indexing.
> After dumping the segment, it is clear that all the contents of the seed
> files were fetched. In the indexing stage, however, none of the data is
> sent to Solr, so there is no indexed data in the
> Solr_Home/example/solr/data/index directory!

Re: Intranet Document Search with Nutch

Posted by Ahmad Ajiloo <ah...@gmail.com>.
I checked my logs, but they show no reason for the incomplete indexing.
After dumping the segment, it is clear that all the contents of the seed
files were fetched. In the indexing stage, however, none of the data is
sent to Solr, so there is no indexed data in the
Solr_Home/example/solr/data/index directory!
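
Rather than inspecting the index directory, the document count can also be
read from Solr itself with a match-all query that returns no rows. This is a
rough sketch, assuming a Solr instance that serves /select with wt=json; the
helper names count_query_url and num_found are invented for this example.

```python
import json
import urllib.parse


def count_query_url(solr_base):
    """Build a match-all query URL that asks only for the hit count."""
    params = urllib.parse.urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return solr_base.rstrip("/") + "/select?" + params


def num_found(response_text):
    """Extract the total hit count from a Solr JSON select response."""
    return json.loads(response_text)["response"]["numFound"]
```

Fetching count_query_url("http://localhost:8983/solr") in a browser or with
any HTTP client and passing the body to num_found gives the number of indexed
documents. A value of 0 after a crawl confirms that nothing reached Solr.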

On Sun, Nov 20, 2011 at 7:23 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Are you familiar with crawling the web as opposed to your intranet? If so,
> did the Nutch Solr configuration work correctly? As you are aware, the
> difference in configuration between the web and your intranet should be
> trivial. Please have a look at your logs and see if ALL stages of the crawl
> are working as expected, especially your indexing stages.

Re: Intranet Document Search with Nutch

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Are you familiar with crawling the web as opposed to your intranet? If so,
did the Nutch Solr configuration work correctly? As you are aware, the
difference in configuration between the web and your intranet should be
trivial. Please have a look at your logs and see if ALL stages of the crawl
are working as expected, especially your indexing stages.

On Sun, Nov 20, 2011 at 1:38 PM, Ahmad Ajiloo <ah...@gmail.com> wrote:

> Hello,
> I tried to set up an intranet search over my documents by following this guide:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
>
> The configuration section says: "When configured correctly, there should be
> a core located at *http://localhost:8983/solr/nutch*. You can test this by
> accessing the administration page at *http://localhost:8983/solr/nutch/admin*
> where you can also verify that the schema is being correctly loaded."
>
> I copied schema.xml from Nutch to Solr, but there is no page at this link
> (*http://localhost:8983/solr/nutch*) and I get this error:
> HTTP ERROR 404
>
> Problem accessing /solr/nutch. Reason:
>
>    NOT_FOUND
>
> *Powered by Jetty://*
> ----------------------------------
> So when I run this command (> ./bin/nutch crawl urls -dir crawl -depth 2
> -solr http://localhost:8983/solr/nutch), no data is sent to Solr!
>



-- 
*Lewis*