Posted to user@nutch.apache.org by Beats <ta...@yahoo.com> on 2009/07/11 09:20:43 UTC

how to crawl a page but not index it

Hi all,

I want to crawl a page, then crawl all of its outlinks and index the content
of those crawled outlinks.

The problem is that I don't want to index the page from which I get these
outlinks.

Thanks in advance.




Re: how to crawl a page but not index it

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-11-16 12:13, ytthet wrote:
> 
> Hi All,
> 
> I have similar requirements to Beats.
> 
> I need to crawl a certain page to extract URLs, but not to index the page.
> 
> For example, a blog home page contains snippets of the latest posts and
> links to them. In that case, I need to extract only the links and not
> index the page.
> 
> I cannot do what Jake suggested, <meta name="robots"
> content="noindex,follow">, because I do not own the page; rather, I am
> indexing a few collections of web sites.
> 
> Has anyone found any solutions or suggestions on this matter?

This and similar use-case scenarios all boil down to your ability to
specify what is special about this page, and then just skip it in your
custom IndexingFilter (returning null from a filter discards the page
from the index).
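
For illustration, a minimal sketch of such a filter could look like the
following (the package, class name, and the hard-coded URL test are purely
illustrative, not anything shipped with Nutch, and the exact IndexingFilter
interface differs slightly between 1.x releases):

package org.example.nutch;  // hypothetical plugin package

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Drops selected pages from the index; their outlinks were already
// extracted at parse time, so the rest of the crawl is unaffected.
public class SkipPageIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Whatever makes the page "special" goes here; a hard-coded URL is
    // the simplest possible test, a configurable regex would work as well.
    if ("http://blog.example.com/".equals(url.toString())) {
      return null;   // null = discard this document from the index
    }
    return doc;      // everything else is indexed as usual
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

The class is packaged as a normal Nutch plugin (a plugin.xml declaring an
extension of org.apache.nutch.indexer.IndexingFilter) and enabled through
the plugin.includes property; some 1.x releases also expect a no-op
addIndexBackendOptions(Configuration) method.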

One simple solution, if you know in advance the URLs of the pages that you
want to discard, would be to inject these URLs with additional metadata
"homepage=true" and then check for it in your IndexingFilter.


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: how to crawl a page but not index it

Posted by ytthet <ye...@gmail.com>.
Hi All,

I have similar requirements to Beats.

I need to crawl a certain page to extract URLs, but not to index the page.

For example, a blog home page contains snippets of the latest posts and
links to them. In that case, I need to extract only the links and not index
the page.

I cannot do what Jake suggested, <meta name="robots"
content="noindex,follow">, because I do not own the page; rather, I am
indexing a few collections of web sites.

Has anyone found any solutions or suggestions on this matter?

Thanks in advance.

Y.T Thet



jakecjacobson wrote:
> 
> Hi,
> 
> Nutch should follow the meta robots directives so in page A add this
> meta directive.
> 
> <meta name="robots" content="noindex,follow">
> 
> http://www.seoresource.net/robots-metatags.htm
> 
> Jake Jacobson
> 
> http://www.linkedin.com/in/jakejacobson
> http://www.facebook.com/jakecjacobson
> http://twitter.com/jakejacobson
> 
> Our greatest fear should not be of failure,
> but of succeeding at something that doesn't really matter.
>    -- ANONYMOUS
> 
> 
> 
> On Tue, Jul 14, 2009 at 8:32 AM, Beats<ta...@yahoo.com> wrote:
>>
>> hi,
>>
>> actually what i want is to crawl a web page say 'page A' and all its
>> outlinks.
>> i want to index all the content gathered by crawling the outlinks. But
>> not
>> the 'page A'.
>> is there any way to do it in single run.
>>
>> with Regards
>>
>> Beats
>> beats@yahoo.com
>>
> 
> 


Re: how to crawl a page but not index it

Posted by Jake Jacobson <ja...@gmail.com>.
Hi,

Nutch should follow the meta robots directives, so in page A add this
meta directive:

<meta name="robots" content="noindex,follow">

http://www.seoresource.net/robots-metatags.htm

Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



On Tue, Jul 14, 2009 at 8:32 AM, Beats<ta...@yahoo.com> wrote:
>
> hi,
>
> actually what i want is to crawl a web page say 'page A' and all its
> outlinks.
> i want to index all the content gathered by crawling the outlinks. But not
> the 'page A'.
> is there any way to do it in single run.
>
> with Regards
>
> Beats
> beats@yahoo.com
>
>
>
> SunGod wrote:
>>
>> 1.create work dir test first
>>
>>
>> 2.insert url
>> ../bin/nutch inject test -urlfile urls
>>
>> 3.create fetchlist
>> ../bin/nutch generate test test/segments
>>
>> 4.fetch url
>> s1=`ls -d crawl/segments/2* | tail -1`
>> echo $s1
>> ../bin/nutch fetch test/segments/20090628160619
>>
>> 5.update crawldb
>> ../bin/nutch updatedb test test/segments/20090628160619
>>
>> loop step 3 - 5, write a bash script running is best!
>>
>> next time please use google search first
>>
>> 2009/7/13 Beats <ta...@yahoo.com>
>>
>>>
>>> can anyone help me on this..
>>>
>>> i m using solr to index the nutch doc.
>>> So i think prune tool will not work.
>>>
>>> i do not want to index the document taken from a particular set of sites
>>>
>>> with regards Beats
>>>
>>>
>>
>>
>
>
>

Re: how to crawl a page but not index it

Posted by Beats <ta...@yahoo.com>.
Hi,

Actually, what I want is to crawl a web page, say 'page A', and all its
outlinks. I want to index all the content gathered by crawling the outlinks,
but not 'page A' itself.

Is there any way to do it in a single run?

With regards,

Beats
beats@yahoo.com



SunGod wrote:
> 
> 1.create work dir test first
> 
> 
> 2.insert url
> ../bin/nutch inject test -urlfile urls
> 
> 3.create fetchlist
> ../bin/nutch generate test test/segments
> 
> 4.fetch url
> s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
> ../bin/nutch fetch test/segments/20090628160619
> 
> 5.update crawldb
> ../bin/nutch updatedb test test/segments/20090628160619
> 
> loop step 3 - 5, write a bash script running is best!
> 
> next time please use google search first
> 
> 2009/7/13 Beats <ta...@yahoo.com>
> 
>>
>> can anyone help me on this..
>>
>> i m using solr to index the nutch doc.
>> So i think prune tool will not work.
>>
>> i do not want to index the document taken from a particular set of sites
>>
>> with regards Beats
>>
>>
> 
> 



Re: how to crawl a page but not index it

Posted by SunGod <su...@cheemer.org>.
PS: these command lines are in the Nutch 0.8 style.

Nutch 1.0 changes them a bit, but they are similar.

2009/7/13 SunGod <su...@cheemer.org>

> 1.create work dir test first
>
> 2.insert url
> ../bin/nutch inject test -urlfile urls
>
> 3.create fetchlist
> ../bin/nutch generate test test/segments
>
> 4.fetch url
> s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
> ../bin/nutch fetch test/segments/20090628160619
>
> 5.update crawldb
> ../bin/nutch updatedb test test/segments/20090628160619
>
> loop step 3 - 5, write a bash script running is best!
>
> next time please use google search first
>
> 2009/7/13 Beats <ta...@yahoo.com>
>
>
>> can anyone help me on this..
>>
>> i m using solr to index the nutch doc.
>> So i think prune tool will not work.
>>
>> i do not want to index the document taken from a particular set of sites
>>
>> with regards Beats
>>
>>
>

Re: how to crawl a page but not index it

Posted by SunGod <su...@cheemer.org>.
1. create the work dir "test" first

2. inject the URLs
../bin/nutch inject test -urlfile urls

3. create a fetchlist
../bin/nutch generate test test/segments

4. fetch the URLs (use the newest segment)
s1=`ls -d test/segments/2* | tail -1`
echo $s1
../bin/nutch fetch $s1

5. update the crawldb
../bin/nutch updatedb test $s1

Loop steps 3-5; writing a bash script to run them is best!

Next time, please try a Google search first.

2009/7/13 Beats <ta...@yahoo.com>

>
> can anyone help me on this..
>
> i m using solr to index the nutch doc.
> So i think prune tool will not work.
>
> i do not want to index the document taken from a particular set of sites
>
> with regards Beats
>
>

Re: how to crawl a page but not index it

Posted by Beats <ta...@yahoo.com>.
Can anyone help me with this?

I'm using Solr to index the Nutch documents, so I think the prune tool will
not work.

I do not want to index the documents taken from a particular set of sites.

With regards, Beats