Posted to dev@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2005/08/27 18:01:28 UTC

indexing and refetching by using NUTCH-84) Fetcher for constrained crawls

Hi there,

I installed the NUTCH-84 patch in Nutch 0.7 and ran the
patch test script successfully with my seeds.txt.

It created /segment/ with sub-directories of
"content", "fetcher", "parse_data" and "parse_text". 

Here are the issues I ran into and my concerns:

1) Indexing

I then ran nutch/index for this segment successfully,
but no results (hits) are returned when searching
after I launch the Tomcat instance.

2) Domain control

As I understand it, this patch is for controlled domain
crawling. It seems we can define the fetch depth for
both the domain site and outlinked sites ourselves. If
so, where do I set these parameters?

3) Refetching

Based on the fetched data, I tried several things,
such as running nutch/updatedb, nutch/generate and
nutch/fetcher, but none of them seem to work.

Is there a scenario in which I can use this patch for
refetching?

thanks,

Michael Ji,



		

Re: Limiting crawl depth using NUTCH-84) Fetcher for constrained crawls

Posted by Kelvin Tan <ke...@relevanz.com>.
Hey Michael, please see inline.. 

On Sat, 27 Aug 2005 14:54:21 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> Thanks your hint for depth control. I will try it tonight and will
> let you know the result.
>
> I guess the design of patch-84 is to become an independent crawler
> by itself. Is it true?
>

nutch-84 was designed to be a standalone focused/constrained crawler.

> So, it will replace the commands of "nutch/admintool create..,
> nutch/generate, nutch/updateda", etc, by only using OC APIs.
>

No. Nutch is more than just the crawler (webdb, analyzer, etc). nutch-84 was created for people who want a crawler (to use with Lucene) but not necessarily the whole Nutch infrastructure. Take, for instance, the fact that the Nutch query language is slightly different from Lucene's. By extending the simple PostFetchProcessor below, you can crawl and add documents directly to a Lucene index without needing WebDB (or the bin/nutch index command).

// Imports assume the Nutch 0.7 / Lucene 1.4 package layout; PostFetchProcessor
// itself comes from the NUTCH-84 patch.
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

public abstract class LucenePostFetchProcessor implements PostFetchProcessor {
  private String index;
  protected IndexWriter writer; // protected so subclasses can add documents
  private boolean overwrite;
  private Analyzer analyzer;

  public void process(FetcherOutput fo, Content content, Parse parse)
      throws IOException {
    if (writer == null) initWriter();  // lazily open the index on first document
    writeDocument(fo, content, parse);
  }

  /** Subclasses decide how a fetched page is mapped to a Lucene Document. */
  protected abstract void writeDocument(
      FetcherOutput fo, Content content, Parse parse) throws IOException;

  private void initWriter() throws IOException {
    writer = new IndexWriter(index, analyzer, overwrite);
  }

  public void close() throws IOException {
    if (writer != null) writer.close();
  }

  // setters for index, overwrite and analyzer
}
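For illustration, a minimal concrete subclass (not part of the patch; the field names and the Content/Parse accessors below are based on the Nutch 0.7 API, so treat it as a sketch):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

public class SimpleLucenePostFetchProcessor extends LucenePostFetchProcessor {
  /** Maps each fetched page to a Lucene Document with url, title and content fields. */
  protected void writeDocument(FetcherOutput fo, Content content, Parse parse)
      throws IOException {
    Document doc = new Document();
    doc.add(Field.UnIndexed("url", content.getUrl()));         // stored, not indexed
    doc.add(Field.Text("title", parse.getData().getTitle()));  // indexed and stored
    doc.add(Field.UnStored("content", parse.getText()));       // indexed, not stored
    writer.addDocument(doc);  // inherited IndexWriter from LucenePostFetchProcessor
  }
}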


> I mean, OC can form its own fetch list for the fetching next round,
> for example. Only the fetched result needs to be indexed and merged.
>

If what you mean is that OC will run continuously until there are no more URLs to fetch, you are correct. Unfortunately, until we deal with the problem of bot traps, I don't think this is a good idea for a production environment.

HTH,
k


Re: Limiting crawl depth using NUTCH-84) Fetcher for constrained crawls

Posted by Michael Ji <fj...@yahoo.com>.
hi Kelvin:

Thanks for your hint on depth control. I will try it
tonight and will let you know the result.

I guess patch-84 is designed to be an independent
crawler in its own right. Is that true?

So, it will replace the commands "nutch/admintool
create..", "nutch/generate", "nutch/updatedb", etc.,
by only using the OC APIs.

I mean, OC can form its own fetchlist for the
next round of fetching, for example; only the fetched
results need to be indexed and merged.

thanks,

Michael Ji

--- Kelvin Tan <ke...@relevanz.com> wrote:

> If we add a depth field to ScheduledURL, then controlling the depth
> of a crawl is simple:
> 
> /**
>  * Limits a crawl to a fixed depth. Seeds are depth 0.
>  */
> public class DepthFLFilter implements ScopeFilter<FetchListScope.Input> {
>   private int max;
> 
>   public synchronized int filter(FetchListScope.Input input) {
>     return input.parent.depth < max ? ALLOW : REJECT;
>   }
> 
>   public void setMax(int max) {
>     this.max = max;
>   }
> }
> 
> On Sat, 27 Aug 2005 13:19:18 -0400, Kelvin Tan
> wrote:
> > Hey Michael, did you use the nutch-84 segment
> location as the
> > argument for the respective nutch commands, e.g..
> >
> > bin/nutch updatedb db <path_to_segment>
> >
> > If intending to integrate with webdb, you'll need
> to ensure the
> > directory structure of the segment output is what
> Nutch expects,
> > which means
> > db/segments/<segment_name>
> >
> > I haven't tried running Nutch with the index
> created, but when I
> > open the index in Luke, everything looks correct.
> Let me know if
> > you still have problems.
> >
> > To customize how domains are crawled, you'll want
> to write a
> > ScopeFilter. Take a look at SameParentHostFLFilter
> for an example.
> > When I have some time later today, I'll see if I
> can hack something
> > quick to limit crawling by depth..
> >
> > k
> >
> > On Sat, 27 Aug 2005 09:01:28 -0700 (PDT), Michael
> Ji wrote:
> >
> >> Hi there,
> >>
> >> I installed Nutch-84 patch in Nutch 07 and run
> patch test script  
> >> successfully with my seeds.txt.
> >>
> >> It created /segment/ with sub-directories of
> "content",
> >> "fetcher",  "parse_data" and "parse_text".
> >>
> >> Followings are the issues I met and concerning:
> >>
> >> 1) Indexing
> >>
> >> Then, I run nutch/index for this segment
> successfully. But there
> >> is  no result (hits) returned in searching after
> I launch tomcat
> >> box.
> >>
> >> 2) Domain control
> >>
> >> As I understood, this patch is for control domain
> crawling. Seems
> >>  we can define the fetching depth for both domain
> site and  
> >> outlinking site by ourself. If so, where these
> parameters I can  
> >> input?
> >>
> >> 3) Refetching
> >>
> >> Based on the fetched data, I tried several
> things, such as,
> >> running  nutch/updatedb, nutch/gengerate,
> nutch/fetcher. Seems
> >> not working.
> >>
> >> Is there a scenario that I can adopt this patch
> for refetching  
> >> purpose?
> >>
> >> thanks,
> >>
> >> Michael Ji,
> >>
> >>
>

Re: crawling ability of NUTCH-84

Posted by Kelvin Tan <ke...@relevanz.com>.
Michael, 

On Sun, 28 Aug 2005 08:31:29 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> Just a curious question.
>
> As I know, the goal of nutch global crawling ability will reach 10
> billions page based on implementation of map reduced.
>
> OC, seeming to fall in the middle, is for control industry domain
> crawling. How many sites is its' goal?dealing with couple of
> thousand sites?
>

The goal of OC is to facilitate focused crawling. I see at least 2 kinds of focused crawling:

1. Whole-web focused crawling, like spidering all pages/sites on the WWW related to research publications on leukemia
2. Crawling a given list of URLs/sites comprehensively, like Teleport Pro.

Although OC was designed with scenario #2 in mind, I think it would also be suitable for scenario #1.

If size of crawl is a concern, I don't think it'd be difficult to build in a throttling mechanism to ensure that the in-memory data structures don't get too large.
I've been travelling around a lot lately, so I haven't had a chance to test OC on crawls > 200k pages.


> I believe the importance for industry domain crawling is in-time
> updating. So identifying content of fetched page and saving post-
> parsing time is critical.
>

I agree. High on my todo list are:

1. Refetch using If-Modified-Since (a rough sketch of the conditional-GET mechanics follows this list)
2. Using an alternate link extractor if NekoHTML turns out to be a bottleneck
3. Parsing downloaded pages to extract data into databases to facilitate aggregation, like defining a site template to map HTML pages to database columns (think job sites for example). 
4. Move post-fetch processing into a separate thread if it turns out to be a bottleneck
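The sketch for item 1, using plain java.net and placeholder arguments; OC itself would presumably go through Nutch's protocol plugins rather than HttpURLConnection:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalGet {
  /**
   * Returns true if the page has changed since lastFetchTime (millis since
   * the epoch), i.e. the server did not answer 304 Not Modified.
   */
  public static boolean modifiedSince(URL url, long lastFetchTime) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setIfModifiedSince(lastFetchTime);  // adds the If-Modified-Since request header
    int code = conn.getResponseCode();
    conn.disconnect();
    return code != HttpURLConnection.HTTP_NOT_MODIFIED;
  }
}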

k


crawling ability of NUTCH-84

Posted by Michael Ji <fj...@yahoo.com>.
hi Kelvin:

Just a curious question.

As far as I know, the goal of Nutch's whole-web crawling
ability is to reach 10 billion pages, based on the
MapReduce implementation.

OC seems to fall in the middle, aimed at controlled
crawling of industry domains. How many sites is it
targeting? A couple of thousand?

I believe the key requirement for industry-domain
crawling is timely updating, so identifying the content
of fetched pages and saving post-parsing time is critical.

thanks,

Michael Ji,


		

a further concern of refetching scenario about depth controlled crawling

Posted by Michael Ji <fj...@yahoo.com>.
hi Kelvin:

Consider refetching. Say we have a
list of fetched URLs saved locally (in fetchedURLs?);

then, the next day, we only need to look at
the URLs in that list to do the fetching; we don't
necessarily need to start from seeds.txt.

But then we lose the meaning of depth for individual
URLs, in two respects: 1) we didn't keep the depth of an
individual site; we only know the depth of a child site
on the fly. Of course, we could save the depth value
with the URLs in a local file, adding a bit of overhead,
but then we run into the second problem; 2) the hierarchy
of a site may be changed by the webmaster, for
example during site maintenance, or on purpose, etc.

So another idea came to mind: we only
distinguish sites as "in-domain" or "out-links"; we
might keep a flag for each URL saved locally; we can
assume that a normal site has a limited
depth, say 100.

When we refetch, for an in-domain site we don't
care about its original depth in the previous
fetch; we just crawl it to a depth of at most 100,
because we treat all the content of an in-domain site
as valuable. For an out-link site, we fetch only once
and get the content of its home page.

What do you think?

thanks,

Michael Ji,


Re: controlled depth crawling

Posted by Kelvin Tan <ke...@relevanz.com>.
Michael, you don't need to modify FetcherThread at all.
 
Declare DepthFLFilter in beans.xml within the fetchlist scope filter list:
 
<property name="filters">
      <list>
        <bean class="org.supermind.crawl.scope.NutchUrlFLFilter"/>
        <bean class="org.foo.DepthFLFilter">
          <property name="max"><value>20</value></property>
        </bean>
      </list>
    </property>

That's all you need to do.
 
k

On Mon, 29 Aug 2005 17:18:09 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> I see your idea and agree with you.
>
> Then, I guess the filter will apply in
>
> FetcherThread.java
> with lines of
> "
> if ( fetchListScope.isInScope(flScopeIn) &
> depthFLFilter.filter(flScopeIn) ).... "
>
> Am I right?
>
> I am in the business trip this week. Hard to squeeze time to do
> testing and developing. But I will keep you updated.
>
> thanks,
>
> Micheal,
>
>
> --- Kelvin Tan <ke...@relevanz.com> wrote:
>
>> Hey Michael, I don't think that would work, because every link on
>> a single page would be decrementing its parent depth.
>>
>> Instead, I would stick to the DepthFLFilter I provided, and
>> changed ScheduledURL's ctor to
>>
>> public ScheduledURL(ScheduledURL parent, URL url) { this.id =
>> assignId();
>> this.seedIndex = parent.seedIndex; this.parentId = parent.id;
>> this.depth = parent.depth + 1; this.url = url; }
>>
>> Then in beans.xml, declare DepthFLFilter as a bean, and set the
>> "max" property to 5.
>>
>> You can even have a more fine-grained control by making a
>> FLFilter that allows you to specify a host and maxDepth, and if a
>> host is not declared, then the default depth is used. Something
>> like
>>
>> <bean
>>
> class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
>
>> <property
>> name="defaultMax"><value>20</value></property> <property
>> name="hosts"> <map> <entry>
>> <key>www.nutch.org</key> <value>7</value> </entry> <entry>
>> <key>www.apache.org</key> <value>2</value> </entry> </map>
>> </property> </bean>
>>
>> (formatting is probably going to end up warped).
>>
>> See what I mean?
>>
>> k
>>
>> On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
>>
>>>
>>> Hi Kelvin:
>>>
>>> I tried to implement controlled depth crawling
>>>
>> based on your Nutch-
>>> 84 and the discussion we had before.
>>>
>>> 1. In DepthFLFilter Class,
>>>
>>> I did a bit modification
>>> "
>>> public synchronized int
>> filter(FetchListScope.Input input) {
>>
>>> input.parent.decrementDepth();
>>> return input.parent.depth >= 0 ? ALLOW : REJECT; }
>>>
>> "
>>
>>> 2 In ScheduledURL Class
>>> add one member variable and one member function "
>>>
>> public int depth;
>>
>>> public void decrementDepth() {
>>> depth --;
>>> }
>>> "
>>>
>>> 3 Then
>>>
>>> we need an initial depth for each domain; for the
>>>
>> initial testing;
>>> I can set a default value 5 for all the site in
>>>
>> seeds.txt and for
>>> each outlink, the value will be 1;
>>>
>>> In that way, a pretty vertical crawling is done
>>>
>> for on-site domain
>>> while outlink homepage is still visible;
>>>
>>> Further more, should we define a depth value for
>>>
>> each url in
>>> seeds.txt?
>>>
>>> Did I in the right track?
>>>
>>> Thanks,
>>>
>>> Michael Ji
>>>
>>>
>>> __________________________________  Yahoo! Mail
>>> Stay connected, organized, and protected. Take the
>>>
>> tour:
>>> http://tour.mail.yahoo.com/mailtour.html
>
>



Re: controlled depth crawling

Posted by Michael Ji <fj...@yahoo.com>.
hi Kelvin:

I see your idea and agree with you.

Then, I guess the filter would be applied in

FetcherThread.java
with lines like
"
if ( fetchListScope.isInScope(flScopeIn) &
depthFLFilter.filter(flScopeIn) )....
"

Am I right?

I am on a business trip this week, so it is hard to
squeeze in time for testing and development, but I will
keep you updated.

thanks,

Michael,


--- Kelvin Tan <ke...@relevanz.com> wrote:

> Hey Michael, I don't think that would work, because
> every link on a single page would be decrementing
> its parent depth.
> 
> Instead, I would stick to the DepthFLFilter I
> provided, and changed ScheduledURL's ctor to
> 
> public ScheduledURL(ScheduledURL parent, URL url) {
>     this.id = assignId();
>     this.seedIndex = parent.seedIndex;
>     this.parentId = parent.id;
>     this.depth = parent.depth + 1;
>     this.url = url;
>   }
> 
> Then in beans.xml, declare DepthFLFilter as a bean,
> and set the "max" property to 5.
> 
> You can even have a more fine-grained control by
> making a FLFilter that allows you to specify a host
> and maxDepth, and if a host is not declared, then
> the default depth is used. Something like
> 
> <bean
>
class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
>           <property
> name="defaultMax"><value>20</value></property>
> 		<property name="hosts">
>             <map>
>               <entry>
>                 <key>www.nutch.org</key>
>                 <value>7</value>
>               </entry>
>               <entry>
>                 <key>www.apache.org</key>
>                 <value>2</value>
>               </entry>
>             </map>
>           </property>
>         </bean>
> 
> (formatting is probably going to end up warped).
> 
> See what I mean?
> 
> k
> 
> On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji
> wrote:
> >
> > Hi Kelvin:
> >
> > I tried to implement controlled depth crawling
> based on your Nutch-
> > 84 and the discussion we had before.
> >
> > 1. In DepthFLFilter Class,
> >
> > I did a bit modification
> > "
> > public synchronized int
> filter(FetchListScope.Input input) {
> > input.parent.decrementDepth();
> > return input.parent.depth >= 0 ? ALLOW : REJECT; }
> "
> >
> > 2 In ScheduledURL Class
> > add one member variable and one member function "
> public int depth;
> >
> > public void decrementDepth() {
> > depth --;
> > }
> > "
> >
> > 3 Then
> >
> > we need an initial depth for each domain; for the
> initial testing;
> > I can set a default value 5 for all the site in
> seeds.txt and for
> > each outlink, the value will be 1;
> >
> > In that way, a pretty vertical crawling is done
> for on-site domain
> > while outlink homepage is still visible;
> >
> > Further more, should we define a depth value for
> each url in
> > seeds.txt?
> >
> > Did I in the right track?
> >
> > Thanks,
> >
> > Michael Ji
> >
> >

Re: controlled depth crawling

Posted by Kelvin Tan <ke...@relevanz.com>.
Hey Michael, I don't think that would work, because every link on a single page would be decrementing its parent depth. 

Instead, I would stick to the DepthFLFilter I provided, and change ScheduledURL's ctor to

public ScheduledURL(ScheduledURL parent, URL url) {
    this.id = assignId();
    this.seedIndex = parent.seedIndex;
    this.parentId = parent.id;
    this.depth = parent.depth + 1;
    this.url = url;
  }

Then in beans.xml, declare DepthFLFilter as a bean, and set the "max" property to 5. 

You can even have more fine-grained control by making an FLFilter that lets you specify a host and maxDepth; if a host is not declared, then the default depth is used. Something like

<bean class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
  <property name="defaultMax"><value>20</value></property>
  <property name="hosts">
    <map>
      <entry>
        <key>www.nutch.org</key>
        <value>7</value>
      </entry>
      <entry>
        <key>www.apache.org</key>
        <value>2</value>
      </entry>
    </map>
  </property>
</bean>

(formatting is probably going to end up warped).
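For reference, a rough sketch of what such an ExtendedDepthFLFilter could look like. It is not in the patch as posted; it assumes ScheduledURL exposes its URL the same way it exposes depth, and that Spring injects the map values as strings:

import java.util.HashMap;
import java.util.Map;

/** Per-host depth limit, falling back to defaultMax for undeclared hosts. */
public class ExtendedDepthFLFilter implements ScopeFilter<FetchListScope.Input> {
  private int defaultMax;
  private Map<String, String> hosts = new HashMap<String, String>();

  public synchronized int filter(FetchListScope.Input input) {
    // Assumes ScheduledURL's java.net.URL is accessible as a "url" field.
    String host = input.parent.url.getHost();
    String max = hosts.get(host);
    int limit = (max != null) ? Integer.parseInt(max) : defaultMax;
    return input.parent.depth < limit ? ALLOW : REJECT;
  }

  public void setDefaultMax(int defaultMax) { this.defaultMax = defaultMax; }
  public void setHosts(Map<String, String> hosts) { this.hosts = hosts; }
}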

See what I mean?

k

On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
>
> Hi Kelvin:
>
> I tried to implement controlled depth crawling based on your Nutch-
> 84 and the discussion we had before.
>
> 1. In DepthFLFilter Class,
>
> I did a bit modification
> "
> public synchronized int filter(FetchListScope.Input input) {
> input.parent.decrementDepth();
> return input.parent.depth >= 0 ? ALLOW : REJECT; } "
>
> 2 In ScheduledURL Class
> add one member variable and one member function " public int depth;
>
> public void decrementDepth() {
> depth --;
> }
> "
>
> 3 Then
>
> we need an initial depth for each domain; for the initial testing;
> I can set a default value 5 for all the site in seeds.txt and for
> each outlink, the value will be 1;
>
> In that way, a pretty vertical crawling is done for on-site domain
> while outlink homepage is still visible;
>
> Further more, should we define a depth value for each url in
> seeds.txt?
>
> Did I in the right track?
>
> Thanks,
>
> Michael Ji
>
>



controlled depth crawling

Posted by Michael Ji <fj...@yahoo.com>.
Hi Kelvin:

I tried to implement controlled depth crawling based
on your Nutch-84 and the discussion we had before.

1. In DepthFLFilter Class, 

I did a bit modification
"
public synchronized int filter(FetchListScope.Input
input) {
    input.parent.decrementDepth();
    return input.parent.depth >= 0 ? ALLOW : REJECT;
  }
"

2 In ScheduledURL Class
add one member variable and one member function
"
public int depth;

public void decrementDepth() {
    depth --;
  }
"

3 Then

we need an initial depth for each domain. For the
initial testing, I can set a default value of 5 for all
the sites in seeds.txt, and for each outlink the value
will be 1.

That way, a fairly thorough vertical crawl is done for
the on-site domain while outlink homepages are still
visible.

Furthermore, should we define a depth value for each
URL in seeds.txt?

Am I on the right track?

Thanks,

Michael Ji


		


Re: bot-traps and refetching

Posted by Michael Ji <fj...@yahoo.com>.
hi Kelvin:

I believe my previous email about further concerns with
controlled crawling confused you a bit with my
half-formed thoughts. But I believe that controlled
crawling is generally very important for an efficient
vertical crawling application.

After reviewing our previous discussion, I think the
solutions for bot-traps and refetching in OC might be
combined into one.

1) Refetching will look at the FetcherOutput of the last
run, and queue the URLs according to their domain name
(for the HTTP 1.1 protocol) as your FetcherThread does.

2) We might simply count the number of URLs within the
same domain (on the fly, as they are queued?). If that
number goes over a certain threshold, we stop adding new
URLs for that domain. It is equivalent in the sense of
controlled crawling, but controls "width" instead; a
rough sketch of such a per-host cap follows below.

Will it work as proposed?
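For concreteness, a minimal sketch of the per-host cap in (2), using the ScopeFilter contract shown earlier in the thread. It assumes FetchListScope.Input exposes the candidate URL as a public url field, which is hypothetical; the actual accessor is defined in the patch:

import java.util.HashMap;
import java.util.Map;

/** Caps the number of URLs queued per host ("width" control). */
public class HostWidthFLFilter implements ScopeFilter<FetchListScope.Input> {
  private int maxPerHost = 1000;
  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  public synchronized int filter(FetchListScope.Input input) {
    // Hypothetical field: the candidate URL being considered for the fetchlist.
    String host = input.url.getHost();
    Integer seen = counts.get(host);
    int n = (seen == null) ? 0 : seen.intValue();
    if (n >= maxPerHost) return REJECT;
    counts.put(host, new Integer(n + 1));
    return ALLOW;
  }

  public void setMaxPerHost(int maxPerHost) { this.maxPerHost = maxPerHost; }
}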

thanks, 

Michael Ji,


--- Kelvin Tan <ke...@relevanz.com> wrote:

> Michael,
> 
> On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji
> wrote:
> > Hi Kelvin:
> >
> > 2) refetching
> >
> > If OC's fetchlist is online (memory residence),
> the next time
> > refetch we have to restart from seeds.txt once
> again. Is it right?
> >
> 
> Maybe with the current implementation. But if you
> Implement a CrawlSeedSource that reads in the
> FetcherOutput directory in the Nutch segment, then
> you can seed a crawl using what's already been
> fetched.
> 
> 



		

Re: bot-traps and refetching

Posted by Kelvin Tan <ke...@relevanz.com>.
Michael,

On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> 1) bot-traps problem for OC
>
> If we have a crawling depth for each starting host, it seems that
> the crawling will be finalized in the end ( we can decrement depth
> value in each time the outlink falls in same host domain).
>
> Let me know if my thought is wrong.
>

Correct. Limiting crawls by depth is probably the simplest way of avoiding death by bot-traps. There are other methods though, like assigning credits to hosts and adapting fetchlist scheduling according to credit usage, or flagging recurring path elements as suspect.
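As a rough illustration of the last idea (flagging recurring path elements as suspect), a small standalone helper that a ScopeFilter could call; the class name and threshold are placeholders, not part of the patch:

import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class PathRepeatCheck {
  /**
   * Flags URLs whose path repeats any single segment more than maxRepeats
   * times, e.g. /a/b/a/b/a/b/..., a common symptom of a bot trap.
   */
  public static boolean looksLikeBotTrap(URL url, int maxRepeats) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    StringTokenizer tok = new StringTokenizer(url.getPath(), "/");
    while (tok.hasMoreTokens()) {
      String segment = tok.nextToken();
      Integer seen = counts.get(segment);
      int n = (seen == null) ? 1 : seen.intValue() + 1;
      if (n > maxRepeats) return true;
      counts.put(segment, new Integer(n));
    }
    return false;
  }
}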

> 2) refetching
>
> If OC's fetchlist is online (memory residence), the next time
> refetch we have to restart from seeds.txt once again. Is it right?
>

Maybe with the current implementation. But if you implement a CrawlSeedSource that reads in the FetcherOutput directory in the Nutch segment, then you can seed a crawl using what's already been fetched. 


> 3) page content checking
>
> In OC API, I found an API WebDBContentSeenFilter, who uses Nutch
> webdb data structure to see if the fetched page content has been
> seen before. That means, we have to use Nutch to create a webdb
> (maybe nutch/updatedb) in order to support this function. Is it
> right?

Exactly right. 

k



bot-traps and refetching

Posted by Michael Ji <fj...@yahoo.com>.
Hi Kelvin:

1) bot-traps problem for OC

If we have a crawl depth for each starting host, it
seems that the crawl will terminate in the end (we
can decrement the depth value each time an outlink
falls within the same host domain).

Let me know if my thought is wrong.

2) refetching

If OC's fetchlist is online (memory-resident), then the
next time we refetch we have to restart from seeds.txt
again. Is that right?

3) page content checking

In the OC API, I found WebDBContentSeenFilter, which
uses the Nutch webdb data structure to see whether the
fetched page content has been seen before. That means we
have to use Nutch to create a webdb (maybe
nutch/updatedb) in order to support this function. Is
that right?

thanks,

Michael,




		

Launch Nutch Search Engine successfully based on Nutch-84 data

Posted by Michael Ji <fj...@yahoo.com>.
After running NUTCH-84, a segment/ containing the
fetched data is saved on the local drive;

1)
then run nutch/index for that segment;
2)
then, a bit tricky: create a dummy directory
inside that segment/ and move all the previous
contents of segment/ into that dummy directory;
3)
then launch the Tomcat instance in the directory
parallel to segment/

My guess is that Nutch expects the layout
segments/200888***/...

Just want to share my experience with you,

Michael Ji,


		

Re: Limiting crawl depth using NUTCH-84) Fetcher for constrained crawls

Posted by Kelvin Tan <ke...@relevanz.com>.
Hey Michael,

On Sat, 27 Aug 2005 21:13:27 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> I started to dig into the code and data structure of OC;
>
> Just curious questions:
>
> 1) Where OC forms a fetchlist? I didn't see it in segment/ of OC
> created.

Crawls are seeded using a CrawlSeedSource. These URLs are injected into the respective FetcherThread's fetchlists. 

After the initial seed, URLs are added to fetchlists from the parsed pages' outlinks. OC builds the fetchlist online, versus Nutch's offline fetchlist building.

>
> 2) In OC, FetchList is organized in such a way of URLs sequence per
> host. Then, what if there are too many hosts, saying ten thousand.
> How about I/O performance concern? Will it exceed the system open-
> file limitation?
>

The concern, if any, would be memory, not I/O, because DefaultFetchList currently stores everything in memory. Still, it's an interface, so it would be simple for someone to implement a fetchlist that bounds memory use, persisting to disk where appropriate.

k


Re: Limiting crawl depth using NUTCH-84) Fetcher for constrained crawls

Posted by Michael Ji <fj...@yahoo.com>.
Hi Kelvin:

I have started to dig into the code and data structures
of OC.

Just a couple of curious questions:

1) Where does OC form a fetchlist? I didn't see one in
the segment/ that OC created.

2) In OC, the FetchList is organized as a sequence of
URLs per host. What happens if there are too many
hosts, say ten thousand? What about I/O performance?
Will it exceed the system's open-file limit?

thanks,

Michael Ji

--- Kelvin Tan <ke...@relevanz.com> wrote:

> If we add a depth field to ScheduledURL, then controlling the depth
> of a crawl is simple:
> 
> /**
>  * Limits a crawl to a fixed depth. Seeds are depth 0.
>  */
> public class DepthFLFilter implements ScopeFilter<FetchListScope.Input> {
>   private int max;
> 
>   public synchronized int filter(FetchListScope.Input input) {
>     return input.parent.depth < max ? ALLOW : REJECT;
>   }
> 
>   public void setMax(int max) {
>     this.max = max;
>   }
> }
> 
> On Sat, 27 Aug 2005 13:19:18 -0400, Kelvin Tan
> wrote:
> > Hey Michael, did you use the nutch-84 segment
> location as the
> > argument for the respective nutch commands, e.g..
> >
> > bin/nutch updatedb db <path_to_segment>
> >
> > If intending to integrate with webdb, you'll need
> to ensure the
> > directory structure of the segment output is what
> Nutch expects,
> > which means
> > db/segments/<segment_name>
> >
> > I haven't tried running Nutch with the index
> created, but when I
> > open the index in Luke, everything looks correct.
> Let me know if
> > you still have problems.
> >
> > To customize how domains are crawled, you'll want
> to write a
> > ScopeFilter. Take a look at SameParentHostFLFilter
> for an example.
> > When I have some time later today, I'll see if I
> can hack something
> > quick to limit crawling by depth..
> >
> > k
> >
> > On Sat, 27 Aug 2005 09:01:28 -0700 (PDT), Michael
> Ji wrote:
> >
> >> Hi there,
> >>
> >> I installed Nutch-84 patch in Nutch 07 and run
> patch test script  
> >> successfully with my seeds.txt.
> >>
> >> It created /segment/ with sub-directories of
> "content",
> >> "fetcher",  "parse_data" and "parse_text".
> >>
> >> Followings are the issues I met and concerning:
> >>
> >> 1) Indexing
> >>
> >> Then, I run nutch/index for this segment
> successfully. But there
> >> is  no result (hits) returned in searching after
> I launch tomcat
> >> box.
> >>
> >> 2) Domain control
> >>
> >> As I understood, this patch is for control domain
> crawling. Seems
> >>  we can define the fetching depth for both domain
> site and  
> >> outlinking site by ourself. If so, where these
> parameters I can  
> >> input?
> >>
> >> 3) Refetching
> >>
> >> Based on the fetched data, I tried several
> things, such as,
> >> running  nutch/updatedb, nutch/gengerate,
> nutch/fetcher. Seems
> >> not working.
> >>
> >> Is there a scenario that I can adopt this patch
> for refetching  
> >> purpose?
> >>
> >> thanks,
> >>
> >> Michael Ji,
> >>
> >>
>

Limiting crawl depth using NUTCH-84) Fetcher for constrained crawls

Posted by Kelvin Tan <ke...@relevanz.com>.
If we add a depth field to ScheduledURL, then controlling the depth of a crawl is simple:

/**
 * Limits a crawl to a fixed depth. Seeds are depth 0.
 */
public class DepthFLFilter implements ScopeFilter<FetchListScope.Input> {
  private int max;

  public synchronized int filter(FetchListScope.Input input) {
    return input.parent.depth < max ? ALLOW : REJECT;
  }

  public void setMax(int max) {
    this.max = max;
  }
}

On Sat, 27 Aug 2005 13:19:18 -0400, Kelvin Tan wrote:
> Hey Michael, did you use the nutch-84 segment location as the
> argument for the respective nutch commands, e.g..
>
> bin/nutch updatedb db <path_to_segment>
>
> If intending to integrate with webdb, you'll need to ensure the
> directory structure of the segment output is what Nutch expects,
> which means
> db/segments/<segment_name>
>
> I haven't tried running Nutch with the index created, but when I
> open the index in Luke, everything looks correct. Let me know if
> you still have problems.
>
> To customize how domains are crawled, you'll want to write a
> ScopeFilter. Take a look at SameParentHostFLFilter for an example.
> When I have some time later today, I'll see if I can hack something
> quick to limit crawling by depth..
>
> k
>
> On Sat, 27 Aug 2005 09:01:28 -0700 (PDT), Michael Ji wrote:
>
>> Hi there,
>>
>> I installed Nutch-84 patch in Nutch 07 and run patch test script  
>> successfully with my seeds.txt.
>>
>> It created /segment/ with sub-directories of "content",
>> "fetcher",  "parse_data" and "parse_text".
>>
>> Followings are the issues I met and concerning:
>>
>> 1) Indexing
>>
>> Then, I run nutch/index for this segment successfully. But there
>> is  no result (hits) returned in searching after I launch tomcat
>> box.
>>
>> 2) Domain control
>>
>> As I understood, this patch is for control domain crawling. Seems
>>  we can define the fetching depth for both domain site and  
>> outlinking site by ourself. If so, where these parameters I can  
>> input?
>>
>> 3) Refetching
>>
>> Based on the fetched data, I tried several things, such as,
>> running  nutch/updatedb, nutch/gengerate, nutch/fetcher. Seems
>> not working.
>>
>> Is there a scenario that I can adopt this patch for refetching  
>> purpose?
>>
>> thanks,
>>
>> Michael Ji,
>>
>>



Re: indexing and refetching by using NUTCH-84) Fetcher for constrained crawls

Posted by Kelvin Tan <ke...@relevanz.com>.
Hey Michael, did you use the nutch-84 segment location as the argument for the respective nutch commands, e.g..

bin/nutch updatedb db <path_to_segment>

If intending to integrate with webdb, you'll need to ensure the directory structure of the segment output is what Nutch expects, which means
db/segments/<segment_name>

I haven't tried running Nutch with the index created, but when I open the index in Luke, everything looks correct. Let me know if you still have problems.

To customize how domains are crawled, you'll want to write a ScopeFilter. Take a look at SameParentHostFLFilter for an example. When I have some time later today, I'll see if I can hack something quick to limit crawling by depth..

k

On Sat, 27 Aug 2005 09:01:28 -0700 (PDT), Michael Ji wrote:
>
> Hi there,
>
> I installed Nutch-84 patch in Nutch 07 and run patch test script
> successfully with my seeds.txt.
>
> It created /segment/ with sub-directories of "content", "fetcher",
> "parse_data" and "parse_text".
>
> Followings are the issues I met and concerning:
>
> 1) Indexing
>
> Then, I run nutch/index for this segment successfully. But there is
> no result (hits) returned in searching after I launch tomcat box.
>
> 2) Domain control
>
> As I understood, this patch is for control domain crawling. Seems
> we can define the fetching depth for both domain site and
> outlinking site by ourself. If so, where these parameters I can
> input?
>
> 3) Refetching
>
> Based on the fetched data, I tried several things, such as, running
> nutch/updatedb, nutch/gengerate, nutch/fetcher. Seems not working.
>
> Is there a scenario that I can adopt this patch for refetching
> purpose?
>
> thanks,
>
> Michael Ji,
>
>