Posted to dev@nutch.apache.org by Kelvin Tan <ke...@relevanz.com> on 2005/08/29 00:56:00 UTC

Re: bot-traps and refetching

Michael,

On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> 1) bot-traps problem for OC
>
> If we have a crawling depth for each starting host, it seems that
> the crawling will eventually terminate (we can decrement the depth
> value each time an outlink falls within the same host domain).
>
> Let me know if my thought is wrong.
>

Correct. Limiting crawls by depth is probably the simplest way of avoiding death by bot-traps. There are other methods though, like assigning credits to hosts and adapting fetchlist scheduling according to credit usage, or flagging recurring path elements as suspect.
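
To illustrate that last idea, here is a toy sketch of flagging recurring path elements (not part of Nutch or the Nutch-84 patch; the class and method names are made up):

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

/**
 * Toy illustration of the "recurring path elements" heuristic: if the same
 * path segment occurs more than maxRepeats times in a single URL, the URL
 * is probably a bot-trap (e.g. /a/b/a/b/a/b/...).
 */
public class RecurringPathTrapDetector {
    private final int maxRepeats;

    public RecurringPathTrapDetector(int maxRepeats) {
        this.maxRepeats = maxRepeats;
    }

    public boolean looksLikeTrap(URL url) {
        Map counts = new HashMap(); // path segment -> occurrence count
        String[] segments = url.getPath().split("/");
        for (int i = 0; i < segments.length; i++) {
            if (segments[i].length() == 0) continue;
            Integer c = (Integer) counts.get(segments[i]);
            int n = (c == null) ? 1 : c.intValue() + 1;
            counts.put(segments[i], new Integer(n));
            if (n > maxRepeats) return true; // same segment keeps repeating: suspect
        }
        return false;
    }
}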

> 2) refetching
>
> If OC's fetchlist is kept online (in memory), then the next time we
> refetch we have to restart from seeds.txt once again. Is that right?
>

Maybe with the current implementation. But if you implement a CrawlSeedSource that reads in the FetcherOutput directory in the Nutch segment, then you can seed a crawl using what's already been fetched.
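
As a starting point, something roughly like this could work. To keep it simple, it reads a plain text dump of the previously fetched URLs (one per line) instead of reading the FetcherOutput files directly, and the getSeedURLs() method is just an assumed shape; check the CrawlSeedSource interface in the Nutch-84 sources for the actual contract:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/**
 * Sketch of a seed source that re-seeds a crawl from the URLs fetched in a
 * previous run. For brevity it reads a text file with one URL per line
 * (e.g. a dump of the previous segment's FetcherOutput); a real
 * implementation would read the segment itself.
 */
public class PreviousRunSeedSource /* implements CrawlSeedSource (assumed) */ {
    private final String urlListFile;

    public PreviousRunSeedSource(String urlListFile) {
        this.urlListFile = urlListFile;
    }

    public Iterator getSeedURLs() throws IOException {
        List urls = new ArrayList();
        BufferedReader in = new BufferedReader(new FileReader(urlListFile));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.length() == 0) continue;
                urls.add(new URL(line)); // each previously fetched URL becomes a seed
            }
        } finally {
            in.close();
        }
        return urls.iterator();
    }
}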


> 3) page content checking
>
> In the OC API, I found a WebDBContentSeenFilter, which uses the Nutch
> webdb data structure to see if the fetched page content has been
> seen before. That means we have to use Nutch to create a webdb
> (maybe nutch/updatedb) in order to support this function. Is that
> right?

Exactly right. 
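
The underlying idea is just checking whether the exact same content has been fetched before. A toy, in-memory illustration of that check (this is not the actual WebDBContentSeenFilter, which consults the webdb rather than an in-memory set):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/**
 * Concept-only version of a content-seen check: digest every fetched page
 * and treat a repeated digest as duplicate content.
 */
public class InMemoryContentSeenChecker {
    private final Set seenDigests = new HashSet();

    /** Returns true the first time this content is seen, false for duplicates. */
    public synchronized boolean markSeen(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            return seenDigests.add(toHex(digest));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < bytes.length; i++) {
            String h = Integer.toHexString(bytes[i] & 0xff);
            if (h.length() == 1) sb.append('0');
            sb.append(h);
        }
        return sb.toString();
    }
}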

k



A further concern about the refetching scenario for depth-controlled crawling

Posted by Michael Ji <fj...@yahoo.com>.
hi Kelvin:

When we consider refetching: say we have a list of fetched
URLs saved locally (in fetchedURLs?).

Then, the next day, we only need to look at the URLs in that
list to do the fetching; we don't necessarily need to start
from seeds.txt again.

But then we lose the meaning of depth for individual URLs, in
two respects: 1) we didn't keep the depth of each individual
site, we only know the depth of a child page on the fly. Of
course, we could save the depth value along with the URLs in
the local file, at the cost of a little overhead, but then we
might hit a second problem: 2) the hierarchy of a site can be
changed by the webmaster, for example during site maintenance,
or on purpose, etc.

So another idea came to mind: we only distinguish sites as
"in-domain" or "out-links". We might add a flag for each URL
saved locally. We can assume that a normal site has a limited
depth, say 100.

When we refetch, for an in-domain site we don't care about its
original depth in the previous fetch; we just crawl it to a
depth of at most 100, because we treat all the content of an
in-domain site as valuable. For an out-link site, we only
fetch it once and get the content of its home page.
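
Roughly, I mean something like this (just a sketch; the class
and method names are made up):

import java.net.URL;

/**
 * Rough sketch of the policy described above: URLs on a seed's own domain
 * are crawled down to a generous maximum depth, while off-domain links are
 * fetched only once (their home page).
 */
public class InDomainRefetchPolicy {
    private static final int MAX_IN_DOMAIN_DEPTH = 100;

    public boolean shouldFetch(URL seed, URL candidate, int depth, boolean alreadyFetched) {
        boolean inDomain = seed.getHost().equalsIgnoreCase(candidate.getHost());
        if (inDomain) {
            // in-domain content is assumed valuable: refetch anything within the depth cap
            return depth <= MAX_IN_DOMAIN_DEPTH;
        }
        // out-links: fetch once (the home page) and never go deeper
        return !alreadyFetched;
    }
}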

What do you think?

thanks,

Michael Ji,


Re: controlled depth crawling

Posted by Kelvin Tan <ke...@relevanz.com>.
Michael, you don't need to modify FetcherThread at all.
 
Declare DepthFLFilter in beans.xml within the fetchlist scope filter list:
 
<property name="filters">
      <list>
        <bean class="org.supermind.crawl.scope.NutchUrlFLFilter"/>
        <bean class="org.foo.DepthFLFilter">
          <property name="max"><value>20</value></property>
        </bean>
      </list>
    </property>

That's all you need to do.
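
For reference, the filter itself boils down to roughly this (a sketch pieced together from the snippets quoted below; the exact interface and constant names in the patch may differ):

// Sketch only -- the parent interface and ALLOW/REJECT constants follow the
// Nutch-84 snippets in this thread and may not match the patch exactly.
public class DepthFLFilter implements FetchListScope.Filter {
    private int max; // injected from beans.xml via the "max" property

    public void setMax(int max) {
        this.max = max;
    }

    /** ALLOW the URL while its depth (distance from the seed) is within max. */
    public int filter(FetchListScope.Input input) {
        return input.parent.depth <= max ? ALLOW : REJECT;
    }
}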
 
k

On Mon, 29 Aug 2005 17:18:09 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> I see your idea and agree with you.
>
> Then, I guess the filter will apply in
>
> FetcherThread.java
> with lines of
> "
> if ( fetchListScope.isInScope(flScopeIn) &
> depthFLFilter.filter(flScopeIn) ).... "
>
> Am I right?
>
> I am on a business trip this week, so it's hard to squeeze in time
> for testing and developing. But I will keep you updated.
>
> thanks,
>
> Michael,
>
>
> --- Kelvin Tan <ke...@relevanz.com> wrote:
>
>> Hey Michael, I don't think that would work, because every link on
>> a single page would be decrementing its parent depth.
>>
>> Instead, I would stick to the DepthFLFilter I provided, and
>> changed ScheduledURL's ctor to
>>
>> public ScheduledURL(ScheduledURL parent, URL url) { this.id =
>> assignId();
>> this.seedIndex = parent.seedIndex; this.parentId = parent.id;
>> this.depth = parent.depth + 1; this.url = url; }
>>
>> Then in beans.xml, declare DepthFLFilter as a bean, and set the
>> "max" property to 5.
>>
>> You can even have a more fine-grained control by making a
>> FLFilter that allows you to specify a host and maxDepth, and if a
>> host is not declared, then the default depth is used. Something
>> like
>>
>> <bean
>>
> class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
>
>> <property
>> name="defaultMax"><value>20</value></property> <property
>> name="hosts"> <map> <entry>
>> <key>www.nutch.org</key> <value>7</value> </entry> <entry>
>> <key>www.apache.org</key> <value>2</value> </entry> </map>
>> </property> </bean>
>>
>> (formatting is probably going to end up warped).
>>
>> See what I mean?
>>
>> k
>>
>> On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
>>
>>>
>>> Hi Kelvin:
>>>
>>> I tried to implement controlled depth crawling
>>>
>> based on your Nutch-
>>> 84 and the discussion we had before.
>>>
>>> 1. In DepthFLFilter Class,
>>>
>>> I did a bit modification
>>> "
>>> public synchronized int
>> filter(FetchListScope.Input input) {
>>
>>> input.parent.decrementDepth();
>>> return input.parent.depth >= 0 ? ALLOW : REJECT; }
>>>
>> "
>>
>>> 2 In ScheduledURL Class
>>> add one member variable and one member function "
>>>
>> public int depth;
>>
>>> public void decrementDepth() {
>>> depth --;
>>> }
>>> "
>>>
>>> 3 Then
>>>
>>> we need an initial depth for each domain; for the
>>>
>> initial testing;
>>> I can set a default value 5 for all the site in
>>>
>> seeds.txt and for
>>> each outlink, the value will be 1;
>>>
>>> In that way, a pretty vertical crawling is done
>>>
>> for on-site domain
>>> while outlink homepage is still visible;
>>>
>>> Further more, should we define a depth value for
>>>
>> each url in
>>> seeds.txt?
>>>
>>> Did I in the right track?
>>>
>>> Thanks,
>>>
>>> Michael Ji
>>>
>>>
>>> __________________________________  Yahoo! Mail
>>> Stay connected, organized, and protected. Take the
>>>
>> tour:
>>> http://tour.mail.yahoo.com/mailtour.html
>
>
> ____________________________________________________ Start your day
> with Yahoo! - make it your home page http://www.yahoo.com/r/hs



Re: controlled depth crawling

Posted by Michael Ji <fj...@yahoo.com>.
hi Kelvin:

I see your idea and agree with you.

Then I guess the filter will be applied in

FetcherThread.java
with lines like
"
if ( fetchListScope.isInScope(flScopeIn) &
depthFLFilter.filter(flScopeIn) )....
"

Am I right?

I am on a business trip this week, so it's hard to squeeze in
time to do testing and developing. But I will keep you
updated.

thanks,

Michael,


--- Kelvin Tan <ke...@relevanz.com> wrote:

> Hey Michael, I don't think that would work, because
> every link on a single page would be decrementing
> its parent depth.
> 
> Instead, I would stick to the DepthFLFilter I
> provided, and changed ScheduledURL's ctor to
> 
> public ScheduledURL(ScheduledURL parent, URL url) {
>     this.id = assignId();
>     this.seedIndex = parent.seedIndex;
>     this.parentId = parent.id;
>     this.depth = parent.depth + 1;
>     this.url = url;
>   }
> 
> Then in beans.xml, declare DepthFLFilter as a bean,
> and set the "max" property to 5.
> 
> You can even have a more fine-grained control by
> making a FLFilter that allows you to specify a host
> and maxDepth, and if a host is not declared, then
> the default depth is used. Something like
> 
> <bean
>
class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
>           <property
> name="defaultMax"><value>20</value></property>
> 		<property name="hosts">
>             <map>
>               <entry>
>                 <key>www.nutch.org</key>
>                 <value>7</value>
>               </entry>
>               <entry>
>                 <key>www.apache.org</key>
>                 <value>2</value>
>               </entry>
>             </map>
>           </property>
>         </bean>
> 
> (formatting is probably going to end up warped).
> 
> See what I mean?
> 
> k
> 
> On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji
> wrote:
> >
> > Hi Kelvin:
> >
> > I tried to implement controlled depth crawling
> based on your Nutch-
> > 84 and the discussion we had before.
> >
> > 1. In DepthFLFilter Class,
> >
> > I did a bit modification
> > "
> > public synchronized int
> filter(FetchListScope.Input input) {
> > input.parent.decrementDepth();
> > return input.parent.depth >= 0 ? ALLOW : REJECT; }
> "
> >
> > 2 In ScheduledURL Class
> > add one member variable and one member function "
> public int depth;
> >
> > public void decrementDepth() {
> > depth --;
> > }
> > "
> >
> > 3 Then
> >
> > we need an initial depth for each domain; for the
> initial testing;
> > I can set a default value 5 for all the site in
> seeds.txt and for
> > each outlink, the value will be 1;
> >
> > In that way, a pretty vertical crawling is done
> for on-site domain
> > while outlink homepage is still visible;
> >
> > Further more, should we define a depth value for
> each url in
> > seeds.txt?
> >
> > Did I in the right track?
> >
> > Thanks,
> >
> > Michael Ji
> >
> >
> > __________________________________
> > Yahoo! Mail
> > Stay connected, organized, and protected. Take the
> tour:
> > http://tour.mail.yahoo.com/mailtour.html
> 
> 
> 



		

Re: controlled depth crawling

Posted by Kelvin Tan <ke...@relevanz.com>.
Hey Michael, I don't think that would work, because every link on a single page would be decrementing its parent's depth.

Instead, I would stick to the DepthFLFilter I provided, and change ScheduledURL's ctor to

public ScheduledURL(ScheduledURL parent, URL url) {
    this.id = assignId();
    this.seedIndex = parent.seedIndex;
    this.parentId = parent.id;
    this.depth = parent.depth + 1;
    this.url = url;
  }

Then in beans.xml, declare DepthFLFilter as a bean, and set the "max" property to 5. 

You can even have more fine-grained control by making an FLFilter that allows you to specify a host and maxDepth; if a host is not declared, then the default depth is used. Something like

<bean class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
  <property name="defaultMax"><value>20</value></property>
  <property name="hosts">
    <map>
      <entry>
        <key>www.nutch.org</key>
        <value>7</value>
      </entry>
      <entry>
        <key>www.apache.org</key>
        <value>2</value>
      </entry>
    </map>
  </property>
</bean>

(formatting is probably going to end up warped).
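
The filter class itself isn't in the patch; here is a rough sketch of how it could look, following the same filter contract as DepthFLFilter (the interface name and the field read off Input are guesses):

import java.util.HashMap;
import java.util.Map;

// Sketch only -- not part of the Nutch-84 patch; setters mirror the bean
// properties above.
public class ExtendedDepthFLFilter implements FetchListScope.Filter {
    private int defaultMax;             // used when a host has no explicit entry
    private Map hosts = new HashMap();  // host -> max depth, from beans.xml

    public void setDefaultMax(int defaultMax) {
        this.defaultMax = defaultMax;
    }

    public void setHosts(Map hosts) {
        this.hosts = hosts;
    }

    public int filter(FetchListScope.Input input) {
        // using the parent's host here; depending on what Input exposes,
        // you may want the candidate URL's host instead
        String host = input.parent.url.getHost();
        Object perHost = hosts.get(host);
        int max = (perHost == null) ? defaultMax : Integer.parseInt(perHost.toString());
        return input.parent.depth <= max ? ALLOW : REJECT;
    }
}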

See what I mean?

k

On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
>
> Hi Kelvin:
>
> I tried to implement controlled depth crawling based on your Nutch-
> 84 and the discussion we had before.
>
> 1. In DepthFLFilter Class,
>
> I did a bit modification
> "
> public synchronized int filter(FetchListScope.Input input) {
> input.parent.decrementDepth();
> return input.parent.depth >= 0 ? ALLOW : REJECT; } "
>
> 2 In ScheduledURL Class
> add one member variable and one member function " public int depth;
>
> public void decrementDepth() {
> depth --;
> }
> "
>
> 3 Then
>
> we need an initial depth for each domain; for the initial testing;
> I can set a default value 5 for all the site in seeds.txt and for
> each outlink, the value will be 1;
>
> In that way, a pretty vertical crawling is done for on-site domain
> while outlink homepage is still visible;
>
> Further more, should we define a depth value for each url in
> seeds.txt?
>
> Did I in the right track?
>
> Thanks,
>
> Michael Ji
>
>
> __________________________________
> Yahoo! Mail
> Stay connected, organized, and protected. Take the tour:
> http://tour.mail.yahoo.com/mailtour.html



controlled depth crawling

Posted by Michael Ji <fj...@yahoo.com>.
Hi Kelvin:

I tried to implement controlled depth crawling based
on your Nutch-84 and the discussion we had before.

1. In the DepthFLFilter class,

I made a small modification
"
public synchronized int filter(FetchListScope.Input
input) {
    input.parent.decrementDepth();
    return input.parent.depth >= 0 ? ALLOW : REJECT;
  }
"

2. In the ScheduledURL class,
I added one member variable and one member function
"
public int depth;

public void decrementDepth() {
    depth --;
  }
"

3. Then

we need an initial depth for each domain. For the initial
testing, I can set a default value of 5 for all the sites in
seeds.txt, and for each outlink the value will be 1.

That way, a fairly vertical crawl is done for the on-site
domain while outlink homepages are still visible.

Furthermore, should we define a depth value for each URL in
seeds.txt?

Am I on the right track?

Thanks,

Michael Ji


		


Re: bot-traps and refetching

Posted by Michael Ji <fj...@yahoo.com>.
hi Kelvin:

I believe my previous email about the "further concern about
controlled crawling" confused you a bit with my half-formed
thoughts. But I believe that controlled crawling is generally
very important for an efficient vertical crawling application.

After reviewing our previous discussion, I think the solutions
for bot-traps and refetching in OC might be combined into one.

1) Refetching will look at the FetcherOutput of the last run,
and queue the URLs according to their domain name (for the
HTTP 1.1 protocol), as your FetcherThread does.

2) We might just count the number of URLs within the same
domain (on the fly, in the queue?). If that number goes over a
certain threshold, we stop adding new URLs for that domain; it
is equivalent in spirit to controlled crawling, but by "width"
rather than depth (rough sketch below).
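
In code, I imagine something like this (just a sketch; the
class name is made up):

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the "width" control described above: count how many URLs have
 * been queued per host and stop accepting new ones once a host goes over
 * its threshold.
 */
public class PerHostQuota {
    private final int maxUrlsPerHost;
    private final Map counts = new HashMap(); // host -> Integer count

    public PerHostQuota(int maxUrlsPerHost) {
        this.maxUrlsPerHost = maxUrlsPerHost;
    }

    /** Returns true if another URL may still be queued for this host. */
    public synchronized boolean tryAcquire(String host) {
        Integer c = (Integer) counts.get(host);
        int n = (c == null) ? 0 : c.intValue();
        if (n >= maxUrlsPerHost) {
            return false; // over threshold: stop adding URLs for this host
        }
        counts.put(host, new Integer(n + 1));
        return true;
    }
}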

Will it work as proposed?

thanks, 

Michael Ji,


--- Kelvin Tan <ke...@relevanz.com> wrote:

> Michael,
> 
> On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji
> wrote:
> > Hi Kelvin:
> >
> > 2) refetching
> >
> > If OC's fetchlist is online (memory residence),
> the next time
> > refetch we have to restart from seeds.txt once
> again. Is it right?
> >
> 
> Maybe with the current implementation. But if you
> implement a CrawlSeedSource that reads in the
> FetcherOutput directory in the Nutch segment, then
> you can seed a crawl using what's already been
> fetched.
> 
> 



		