Posted to dev@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2005/08/29 04:37:16 UTC

controlled depth crawling

Hi Kelvin:

I tried to implement controlled depth crawling based
on your Nutch-84 and the discussion we had before.

1. In the DepthFLFilter class,

I made a slight modification:
"
public synchronized int filter(FetchListScope.Input
input) {
    input.parent.decrementDepth();
    return input.parent.depth >= 0 ? ALLOW : REJECT;
  }
"

2. In the ScheduledURL class,
I added one member variable and one member function:
"
public int depth;

public void decrementDepth() {
    depth --;
  }
"

3. Then

We need an initial depth for each domain. For the
initial testing, I can set a default value of 5 for all
the sites in seeds.txt, and for each outlink the value
will be 1.

In that way, a fairly deep vertical crawl is done for
the on-site domain while the outlink homepage is still
visible.

Furthermore, should we define a depth value for each
URL in seeds.txt?
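
(Purely to illustrate what I mean -- this is not an existing
format -- seeds.txt could carry an optional per-URL depth,
with URLs that omit it falling back to the default:

http://www.nutch.org/      7
http://www.apache.org/     2
http://some-other-site.example/

Just an idea for discussion.)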

Am I on the right track?

Thanks,

Michael Ji


		


a further concern about the refetching scenario in depth-controlled crawling

Posted by Michael Ji <fj...@yahoo.com>.
hi Kelvin:

When we consider refetching: say we have a
list of fetched URLs saved locally (in fetchedURLs?).

Then, when we move on to the next day, we only need to
look at the URLs in that list to do the fetching; we
don't necessarily need to start from seeds.txt.

But then we lose the meaning of depth for an individual
URL, in two respects: 1) we didn't keep the depth of an
individual site, we only know the depth of a child site
on the fly. Of course, we could save the depth value
with the URLs in the local file, at the cost of a bit
of overhead, but then we might hit a second problem:
2) the hierarchy of a site may be changed by the
webmaster, for example during site maintenance, or on
purpose, etc.

So another idea came to mind: we only distinguish
sites as "in-domain" or "out-link". We could keep a
flag for each URL saved locally. We can assume that a
normal site has a limited depth, say 100.

When we do refetching, for an in-domain site we don't
care about its original depth in the previous fetch;
we just crawl it to at most depth 100, because we treat
all the content of an in-domain site as valuable. For
an out-link site, we only fetch once and
get the content of its home page.
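
Just to make the idea concrete, here is a rough sketch
of the kind of refetch filter I have in mind. The class
name, the inDomain flag and the constant values are made
up for illustration; only FetchListScope.Input and the
ALLOW/REJECT contract come from the snippets already in
this thread.
"
// Sketch only, not existing Nutch-84 code.
public class RefetchFLFilter {
  public static final int ALLOW = 1;    // assumed values; the real
  public static final int REJECT = -1;  // filter contract defines these

  // "a normal site should have a limited depth, say 100"
  private int maxInDomainDepth = 100;

  public synchronized int filter(FetchListScope.Input input) {
    if (input.parent.inDomain) {
      // in-domain: recrawl regardless of the depth recorded in the
      // previous fetch, but still cap it so we never descend forever
      return input.parent.depth <= maxInDomainDepth ? ALLOW : REJECT;
    }
    // out-link: only the home page (depth 0) is fetched
    return input.parent.depth == 0 ? ALLOW : REJECT;
  }
}
"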

What do you think?

thanks,

Michael Ji,


Re: controlled depth crawling

Posted by Kelvin Tan <ke...@relevanz.com>.
Michael, you don't need to modify FetcherThread at all.
 
Declare DepthFLFilter in beans.xml within the fetchlist scope filter list:
 
<property name="filters">
      <list>
        <bean class="org.supermind.crawl.scope.NutchUrlFLFilter"/>
        <bean class="org.foo.DepthFLFilter">
          <property name="max"><value>20</value></property>
        </bean>
      </list>
    </property>

That's all you need to do.
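
For anyone without the Nutch-84 patch handy, a max-based filter of this shape would look roughly like the sketch below. This is a reconstruction from the snippets quoted in this thread, not the actual Nutch-84 source; it assumes depth starts at 0 for seeds and grows by 1 per link, as in the ScheduledURL ctor quoted further down.

public class DepthFLFilter {
  private int max;                 // injected via <property name="max"> above

  public void setMax(int max) {
    this.max = max;
  }

  public synchronized int filter(FetchListScope.Input input) {
    // allow the link while its parent is still within the configured depth;
    // ALLOW/REJECT are the scope-filter constants used elsewhere in this thread
    return input.parent.depth < max ? ALLOW : REJECT;
  }
}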
 
k

On Mon, 29 Aug 2005 17:18:09 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> I see your idea and agree with you.
>
> Then, I guess the filter will apply in
>
> FetcherThread.java
> with lines of
> "
> if ( fetchListScope.isInScope(flScopeIn) &
> depthFLFilter.filter(flScopeIn) ).... "
>
> Am I right?
>
> I am in the business trip this week. Hard to squeeze time to do
> testing and developing. But I will keep you updated.
>
> thanks,
>
> Micheal,
>
>
> --- Kelvin Tan <ke...@relevanz.com> wrote:
>
>> Hey Michael, I don't think that would work, because every link on
>> a single page would be decrementing its parent depth.
>>
>> Instead, I would stick to the DepthFLFilter I provided, and
>> changed ScheduledURL's ctor to
>>
>> public ScheduledURL(ScheduledURL parent, URL url) { this.id =
>> assignId();
>> this.seedIndex = parent.seedIndex; this.parentId = parent.id;
>> this.depth = parent.depth + 1; this.url = url; }
>>
>> Then in beans.xml, declare DepthFLFilter as a bean, and set the
>> "max" property to 5.
>>
>> You can even have a more fine-grained control by making a
>> FLFilter that allows you to specify a host and maxDepth, and if a
>> host is not declared, then the default depth is used. Something
>> like
>>
>> <bean
>>
> class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
>
>> <property
>> name="defaultMax"><value>20</value></property> <property
>> name="hosts"> <map> <entry>
>> <key>www.nutch.org</key> <value>7</value> </entry> <entry>
>> <key>www.apache.org</key> <value>2</value> </entry> </map>
>> </property> </bean>
>>
>> (formatting is probably going to end up warped).
>>
>> See what I mean?
>>
>> k
>>
>> On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
>>
>>>
>>> Hi Kelvin:
>>>
>>> I tried to implement controlled depth crawling
>>>
>> based on your Nutch-
>>> 84 and the discussion we had before.
>>>
>>> 1. In DepthFLFilter Class,
>>>
>>> I did a bit modification
>>> "
>>> public synchronized int
>> filter(FetchListScope.Input input) {
>>
>>> input.parent.decrementDepth();
>>> return input.parent.depth >= 0 ? ALLOW : REJECT; }
>>>
>> "
>>
>>> 2 In ScheduledURL Class
>>> add one member variable and one member function "
>>>
>> public int depth;
>>
>>> public void decrementDepth() {
>>> depth --;
>>> }
>>> "
>>>
>>> 3 Then
>>>
>>> we need an initial depth for each domain; for the
>>>
>> initial testing;
>>> I can set a default value 5 for all the site in
>>>
>> seeds.txt and for
>>> each outlink, the value will be 1;
>>>
>>> In that way, a pretty vertical crawling is done
>>>
>> for on-site domain
>>> while outlink homepage is still visible;
>>>
>>> Further more, should we define a depth value for
>>>
>> each url in
>>> seeds.txt?
>>>
>>> Did I in the right track?
>>>
>>> Thanks,
>>>
>>> Michael Ji
>>>
>>>




Re: controlled depth crawling

Posted by Michael Ji <fj...@yahoo.com>.
hi Kelvin:

I see your idea and agree with you.

Then, I guess the filter will be applied in

FetcherThread.java
with lines of
"
if ( fetchListScope.isInScope(flScopeIn) &
depthFLFilter.filter(flScopeIn) )....
"

Am I right?

I am on a business trip this week, so it's hard to
squeeze in time for testing and development, but I will
keep you updated.

thanks,

Michael,


--- Kelvin Tan <ke...@relevanz.com> wrote:

> Hey Michael, I don't think that would work, because
> every link on a single page would be decrementing
> its parent depth.
> 
> Instead, I would stick to the DepthFLFilter I
> provided, and changed ScheduledURL's ctor to
> 
> public ScheduledURL(ScheduledURL parent, URL url) {
>     this.id = assignId();
>     this.seedIndex = parent.seedIndex;
>     this.parentId = parent.id;
>     this.depth = parent.depth + 1;
>     this.url = url;
>   }
> 
> Then in beans.xml, declare DepthFLFilter as a bean,
> and set the "max" property to 5.
> 
> You can even have a more fine-grained control by
> making a FLFilter that allows you to specify a host
> and maxDepth, and if a host is not declared, then
> the default depth is used. Something like
> 
> <bean
>
class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
>           <property
> name="defaultMax"><value>20</value></property>
> 		<property name="hosts">
>             <map>
>               <entry>
>                 <key>www.nutch.org</key>
>                 <value>7</value>
>               </entry>
>               <entry>
>                 <key>www.apache.org</key>
>                 <value>2</value>
>               </entry>
>             </map>
>           </property>
>         </bean>
> 
> (formatting is probably going to end up warped).
> 
> See what I mean?
> 
> k
> 
> On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji
> wrote:
> >
> > Hi Kelvin:
> >
> > I tried to implement controlled depth crawling
> based on your Nutch-
> > 84 and the discussion we had before.
> >
> > 1. In DepthFLFilter Class,
> >
> > I did a bit modification
> > "
> > public synchronized int
> filter(FetchListScope.Input input) {
> > input.parent.decrementDepth();
> > return input.parent.depth >= 0 ? ALLOW : REJECT; }
> "
> >
> > 2 In ScheduledURL Class
> > add one member variable and one member function "
> public int depth;
> >
> > public void decrementDepth() {
> > depth --;
> > }
> > "
> >
> > 3 Then
> >
> > we need an initial depth for each domain; for the
> initial testing;
> > I can set a default value 5 for all the site in
> seeds.txt and for
> > each outlink, the value will be 1;
> >
> > In that way, a pretty vertical crawling is done
> for on-site domain
> > while outlink homepage is still visible;
> >
> > Further more, should we define a depth value for
> each url in
> > seeds.txt?
> >
> > Did I in the right track?
> >
> > Thanks,
> >
> > Michael Ji
> >
> >
> 
> 
> 



		

Re: controlled depth crawling

Posted by Kelvin Tan <ke...@relevanz.com>.
Hey Michael, I don't think that would work, because every link on a single page would be decrementing its parent depth. 

Instead, I would stick to the DepthFLFilter I provided, and change ScheduledURL's ctor to

public ScheduledURL(ScheduledURL parent, URL url) {
    this.id = assignId();
    this.seedIndex = parent.seedIndex;
    this.parentId = parent.id;
    this.depth = parent.depth + 1;  // each outlink is one level deeper than its parent
    this.url = url;
  }

Then in beans.xml, declare DepthFLFilter as a bean, and set the "max" property to 5. 

You can get even more fine-grained control by making a FLFilter that allows you to specify a host and maxDepth; if a host is not declared, then the default depth is used. Something like

<bean class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
          <property name="defaultMax"><value>20</value></property>
		<property name="hosts">
            <map>
              <entry>
                <key>www.nutch.org</key>
                <value>7</value>
              </entry>
              <entry>
                <key>www.apache.org</key>
                <value>2</value>
              </entry>
            </map>
          </property>
        </bean>

(formatting is probably going to end up warped).
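
In Java, such a filter could be sketched roughly as follows (just an illustration of the proposal above, not existing code; only FetchListScope.Input and the ALLOW/REJECT contract come from the snippets in this thread):

import java.util.HashMap;
import java.util.Map;

public class ExtendedDepthFLFilter {
  private int defaultMax;
  private Map hosts = new HashMap();   // host name -> max depth for that host

  public void setDefaultMax(int defaultMax) { this.defaultMax = defaultMax; }

  public void setHosts(Map hosts) { this.hosts = hosts; }

  public synchronized int filter(FetchListScope.Input input) {
    // keyed on the parent's host here for simplicity; a real implementation
    // might look at the candidate link's host instead
    Object max = hosts.get(input.parent.url.getHost());
    // bean map values may arrive as Strings, so parse defensively
    int limit = (max != null) ? Integer.parseInt(max.toString()) : defaultMax;
    return input.parent.depth < limit ? ALLOW : REJECT;
  }
}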

See what I mean?

k

On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
>
> Hi Kelvin:
>
> I tried to implement controlled depth crawling based on your Nutch-
> 84 and the discussion we had before.
>
> 1. In DepthFLFilter Class,
>
> I did a bit modification
> "
> public synchronized int filter(FetchListScope.Input input) {
> input.parent.decrementDepth();
> return input.parent.depth >= 0 ? ALLOW : REJECT; } "
>
> 2 In ScheduledURL Class
> add one member variable and one member function " public int depth;
>
> public void decrementDepth() {
> depth --;
> }
> "
>
> 3 Then
>
> we need an initial depth for each domain; for the initial testing;
> I can set a default value 5 for all the site in seeds.txt and for
> each outlink, the value will be 1;
>
> In that way, a pretty vertical crawling is done for on-site domain
> while outlink homepage is still visible;
>
> Further more, should we define a depth value for each url in
> seeds.txt?
>
> Did I in the right track?
>
> Thanks,
>
> Michael Ji
>
>