Posted to user@nutch.apache.org by ".: Abhishek :." <ab...@gmail.com> on 2011/01/31 11:17:24 UTC

Index while crawling

Hi folks,

 I should thank you all for the great help you have been offering so far. I
am learning a lot about Nutch.

 One more beginner's question here: can I search for something while Nutch
is still crawling a site? I believe this is not possible. The reason I am
asking is that I am crawling a big site that is also updated frequently with
a lot of new pages, so I just wanted to get some quick results while the
crawl is still running.

Thanks,
Abhi

Re: Index while crawling

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hello Arjun,

 I would really appreciate it if you could start a separate thread for your
questions.

 This is the second time I have seen you reply to one of my threads with an
unrelated question. The thread then takes on a different context, and I am
unable to get back to the right context with my own questions. I am sure
starting a new thread does not take much time.

 Please understand my situation and co-operate. Hope you understand.

Thanks,
Abhishek



Re: Index while crawling

Posted by Julien Nioche <li...@gmail.com>.
The Crawl command will automatically do the indexing but if you use the
separate commands (generate - fetch - parse - update) you are entirely free
NOT to do the indexing. Quite a few applications use Nutch for its crawling
capabilities but not for indexing / searching.
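
As a rough sketch of the separate commands mentioned above, the script below runs one crawl round and stops before any indexing step, then dumps the segment so the fetched content can be read directly. The directory names (crawl, urls), the SEGMENT placeholder and the DRY_RUN switch are assumptions made for illustration and are not taken from this thread. With DRY_RUN=1, the default here, the commands are only printed.

```shell
#!/bin/sh
# One crawl round that stops before any indexing step.
# DRY_RUN=1 (the default in this sketch) only prints the commands.
NUTCH="${NUTCH:-bin/nutch}"        # path to the nutch launcher (assumption)
CRAWL_DIR="${CRAWL_DIR:-crawl}"    # hypothetical crawl directory
SEED_DIR="${SEED_DIR:-urls}"       # hypothetical directory of seed URL files
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

crawl_round() {
  run "$NUTCH" inject "$CRAWL_DIR/crawldb" "$SEED_DIR"
  run "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments"
  if [ "$DRY_RUN" = "1" ]; then
    segment="$CRAWL_DIR/segments/SEGMENT"   # placeholder name for the dry run
  else
    segment=$(ls -d "$CRAWL_DIR"/segments/* | tail -1)   # newest segment
  fi
  run "$NUTCH" fetch "$segment"
  # Separate parse step; skip it if the fetcher is configured to parse
  # while fetching (the fetcher.parse property).
  run "$NUTCH" parse "$segment"
  run "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment"
  # No indexing here: read the fetched content from the segment instead.
  run "$NUTCH" readseg -dump "$segment" "$CRAWL_DIR/dump"
}

crawl_round
```

Setting DRY_RUN=0 runs the commands for real; the dump directory then holds a plain-text view of the segment.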




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Index while crawling

Posted by Arjun Kumar Reddy <ch...@iiitb.net>.
Hi List,

Is it possible to disable indexing in Nutch?

In my application I am working on Twitter feeds: I filter the tweets that
contain links and crawl those links. I only care about the contents of the
linked pages, and I am able to get those contents by reading the segments.

I don't need the search feature that Nutch provides and for which it does
the indexing. So, is it possible to remove indexing from Nutch? Doing this
should improve the performance of my crawler.


Thanks and regards,
Ch. Arjun Kumar Reddy




Re: Index while crawling

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi all,

 I am still having problems figuring this out. I followed the instructions
at the following URL:

http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

 In the end, the only search results I see are from the seed URLs that were
passed in. I think I am missing something here: nowhere in the tutorial is
the depth or the number of threads specified. I suspect that is why only the
seeds show up, and no other pages, when I search in the Solr admin screen.

 Could you please give me some pointers or advice on what I am missing?

Thanks,
Abi
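
One possible reading of the symptom above is that only a single generate/fetch round was run, so only the seeds were ever fetched. The sketch below repeats the round several times, which is roughly what a depth setting amounts to when the separate commands are used. DEPTH, TOPN, the crawl and urls directory names, the ROUND_n placeholders and the DRY_RUN switch are all assumptions for illustration. With DRY_RUN=1, the default here, the commands are only printed.

```shell
#!/bin/sh
# Repeat generate/fetch/parse/updatedb DEPTH times so that links discovered
# in one round can be fetched in the next one.
NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch launcher (assumption)
DEPTH="${DEPTH:-3}"           # hypothetical number of rounds
TOPN="${TOPN:-1000}"          # hypothetical cap on URLs generated per round
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

# Newest segment directory; a placeholder name is used in dry-run mode.
latest_segment() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "crawl/segments/ROUND_$1"
  else
    ls -d crawl/segments/* | tail -1
  fi
}

crawl_rounds() {
  run "$NUTCH" inject crawl/crawldb urls
  round=1
  while [ "$round" -le "$DEPTH" ]; do
    run "$NUTCH" generate crawl/crawldb crawl/segments -topN "$TOPN"
    segment=$(latest_segment "$round")
    run "$NUTCH" fetch "$segment"
    run "$NUTCH" parse "$segment"
    # updatedb feeds newly discovered links back into the crawldb.
    run "$NUTCH" updatedb crawl/crawldb "$segment"
    round=$((round + 1))
  done
}

crawl_rounds
```

A real script would also want to stop early when generate produces no new segment; that check is left out here to keep the sketch short.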


Re: Index while crawling

Posted by Markus Jelsma <ma...@openindex.io>.
Get your own fresh copy of Solr 1.4.1 (if you get one of the development
versions you'll need to upgrade the Solr jars in Nutch's lib directory).
Unpack it and find the example directory. In there, overwrite
solr/conf/schema.xml with the one shipped with Nutch and you're good to go.
Run java -jar start.jar and it's running. It might also be a good idea to
follow the tutorial first.


Re: Index while crawling

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi,

 I am unable to start Solr for the currently running crawl. When I try the
command below, I get messages saying the linkdb and segments do not exist in
the file system, which is indeed the case.

 So how do I run Solr in this case? Or do I have to run Solr separately
instead of starting it from Nutch itself?

Thanks,
Abhi


On Mon, Jan 31, 2011 at 11:51 PM, .: Abhishek :. <ab...@gmail.com> wrote:

> Hi Alexander,
>
>  Thanks for the response. So I should be starting solr as follows,
>
> bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb
> crawl/segments/*
>
>  But while fetching we won't have segments right? So in this case how do I
> start Solr?
>
> Thanks,
> Abhi
>
>

Re: Index while crawling

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi Alexander,

 Thanks for the response. So I should be starting Solr as follows:

bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb \
    crawl/segments/*

 But while fetching we won't have segments, right? So in this case how do I
start Solr?

Thanks,
Abhi


Re: Index while crawling

Posted by Alexander Aristov <al...@gmail.com>.
Yes, you can, but only if you use Nutch + Solr.

If you use the old Nutch front end, then you might break the index and
searching after merging content or indexes.

If you don't merge, then search should work during crawling.

But remember that results don't become available for searching immediately
after fetching. All pages must be fetched and then indexed first to be
searchable.

Best Regards
Alexander Aristov
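
A minimal sketch of the idea above: once one segment has been completely fetched and parsed, it can be pushed to Solr on its own while the crawl carries on with the next round. The Solr URL is the one used elsewhere in this thread. The crawl directory layout, the SEGMENT placeholder and the DRY_RUN switch are assumptions for illustration. With DRY_RUN=1, the default here, the commands are only printed.

```shell
#!/bin/sh
# Index one finished segment into a running Solr instance.
NUTCH="${NUTCH:-bin/nutch}"                        # nutch launcher (assumption)
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8080/solr/}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

index_segment() {
  segment="$1"
  # Update the linkdb with the links found in this segment.
  run "$NUTCH" invertlinks crawl/linkdb "$segment"
  # Send the documents of this one segment to Solr.
  run "$NUTCH" solrindex "$SOLR_URL" crawl/crawldb crawl/linkdb "$segment"
}

# SEGMENT is a placeholder; pass the real segment directory as an argument.
index_segment "${1:-crawl/segments/SEGMENT}"
```

Pages from that segment should become searchable once Solr has committed them; pages from segments that are still being fetched will not show up until they are indexed in turn.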

