Posted to java-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2003/06/09 21:44:47 UTC

Re: High Capacity (Distributed) Crawler

Leo,

Have you started this project?  Where is it hosted?
It would be nice to see a few alternative implementations of a robust
and scalable Java web crawler with the ability to index whatever it
fetches.

Thanks,
Otis

--- Leo Galambos <Le...@seznam.cz> wrote:
> Hi.
> 
> I would like to write $SUBJ (HCDC), because LARM does not offer many
> of the options that web/HTTP crawling requires, IMHO. Here is my
> list:
> 
> 1. I would like to manage the decision of what is gathered first,
> based on PageRank, number of errors, connection speed, and so on
> 2. a pure Java solution, without any DBMS/JDBC
> 3. better configurability of error handling
> 4. NIO style, as suggested by the LARM specification
> 5. egothor's filters for automatic processing of various data formats
> 6. handling of "Expires" HTTP headers, plus heuristic rules that
> describe how fast a page can expire (.php pages often expire faster
> than .html) - a sketch follows below
> 7. reindexing without any data exports from a full-text index
> 8. an open protocol between the crawler and a full-text engine
> 
> If anyone wants to join (or just extend the wish list), let me know,
> please.
> 
> -g-
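
Item 6, for instance, might boil down to something like this (an
illustrative sketch only - the rules and numbers are invented here,
not taken from HCDC or egothor):

  import java.util.Date;

  class ExpiryHeuristic {
      /** Trust an explicit "Expires" header; otherwise guess from the URI. */
      static long expiresAtMillis(String uri, Date httpExpires) {
          if (httpExpires != null)
              return httpExpires.getTime();      // the server told us explicitly
          long now = System.currentTimeMillis();
          if (uri.endsWith(".php"))              // dynamic pages go stale quickly
              return now + 24L * 3600 * 1000;    // revisit after ~1 day
          return now + 7L * 24 * 3600 * 1000;    // static .html: ~1 week
      }
  }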

Re: High Capacity (Distributed) Crawler

Posted by Leo Galambos <Le...@seznam.cz>.
Otis Gospodnetic wrote:

>> What interface do you need for Lucene? Will you use PUSH (=the robot
>> will modify Lucene's index) or PULL (=the engine will get deltas from
>> the robot) mode? Tell me what you need and I will try to do my best.
>
> I'd imagine one would want to use it in PUSH mode (i.e. the crawler
> fetches a web page and adds it to the searchable index).
> How does PULL mode work?  I've never heard of web crawlers being used
> in PULL mode.  What exactly does that mean - could you please
> describe it?
It is a long story, so I will assume that everything runs on a single
box - the simplest case.
"[x]" will mark points where, I guess, Lucene may have trouble with a
fast implementation.

Crawler: the crawler stores the meta and body of every document. If
you want to retrieve the meta or body of a document (knowing its URI),
it costs O(1) - two seeks and two read requests in auxiliary data
structures. That retrieval also gives you a direct handle to the meta
and body; from then on, access through the handle is O(1) with no
extra seeks in any structure. The handle is persistent and is tied to
the URI. The meta and body are overwritten as soon as the crawler
fetches a fresh copy.
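
In code, that contract might look roughly like this (a hypothetical
sketch; the interface and names are mine, not from the actual
crawler):

  import java.io.IOException;

  public interface CrawlStore {
      /** URI lookup: O(1) - two seeks and two reads in auxiliary structures. */
      long handleOf(String uri) throws IOException;

      /** Direct access via the persistent handle: O(1), no extra seeks. */
      byte[] meta(long handle) throws IOException;
      byte[] body(long handle) throws IOException;

      /** Highest handle assigned so far (new documents get higher handles). */
      long lastHandle() throws IOException;

      /** Overwrite meta/body when a fresh copy is fetched; the handle is stable. */
      void update(long handle, byte[] meta, byte[] body) throws IOException;
  }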

Engine: the engine stores the handle of each document. Moreover, it
knows the last (highest) handle that is stored in the main index. So
the trick is this (a sketch in code follows below):
1) Build an auxiliary index from the new documents. The new documents
are those whose handle is greater than the last handle known to the
engine, so you can iterate over them easily - this process can run in
a separate thread.
2) Consult the changes. You read the meta that is stored in the index
and test whether it is obsolete (note: you already hold the handle, so
this is fast). If so, you mark the respective document as "deleted",
and its new version (if any) is appended to another index - the index
of changes. The insertion into that index runs in a separate thread,
so the main thread is not blocked. BTW: [x] the documents that are not
modified may still have their ranks updated (depth rank, PageRank,
frequency rank, etc.) in this round.

[x] The two auxiliary indices are then merged with the main index.
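
As a sketch of steps 1 and 2 against the Lucene 1.x API (CrawlStore is
the hypothetical interface above; the field names are invented):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  // Step 1: index only the documents the engine has not seen yet.
  void pullNewDocuments(CrawlStore store, long lastKnownHandle) throws Exception {
      IndexWriter aux = new IndexWriter("aux-index", new StandardAnalyzer(), true);
      for (long h = lastKnownHandle + 1; h <= store.lastHandle(); h++) {
          Document doc = new Document();
          doc.add(Field.Keyword("handle", Long.toString(h)));  // persistent handle
          doc.add(Field.UnStored("body", new String(store.body(h))));
          aux.addDocument(doc);  // in practice this loop runs in its own thread
      }
      aux.close();
      // Step 2 would walk the meta stored in the main index, compare it with
      // store.meta(h), mark obsolete documents "deleted", and append their new
      // versions to a second auxiliary index (the index of changes) before the
      // merge.
  }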

Obviously, the weak point is the test of whether anything has changed.
This can be solved easily with the index dynamization I use. Unlike
Lucene, I order barrels (segments, in your terminology) by their size.
I do not want to describe all the details - I hate long e-mails ;-) -
but the dynamization guarantees that:
a) the query time is never worse than 8x that of a fully optimized
index (if you buy 8x faster hardware, you overcome this easily);
b) the documents that are modified often are stored in the small
barrels of the main index, which means updating them is fast.

So I process only the small barrels once a day, and the larger ones
less often. If, say, 5M documents are updated daily, PULL mode can
handle the load in a few minutes. Unfortunately, the slowest part is
the HTML parser, which may run for a few hours :-(.
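
Purely as an illustration of that schedule (the real dynamization is
egothor's, and its details are not in this thread; "Barrel" and
reindex() are invented for the sketch):

  import java.util.List;

  abstract class BarrelScheduler {
      interface Barrel { long sizeInDocs(); }   // stands in for an index segment

      abstract void reindex(Barrel b);          // re-check and re-merge one barrel

      // Visit the smallest barrels (the frequently-changing documents) daily,
      // a barrel roughly twice as large half as often, and so on.
      void refreshRound(List<Barrel> barrelsSortedBySize, long day) {
          long smallest = barrelsSortedBySize.get(0).sizeInDocs();
          for (Barrel b : barrelsSortedBySize) {
              long period = Math.max(1, b.sizeInDocs() / smallest); // days between visits
              if (day % period == 0)
                  reindex(b);
          }
      }
  }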

If you want to refresh another 10^10 crap pages once a month, that can
be done too, but it is outside my single-box assumption above ;-).

-g-

Re: High Capacity (Distributed) Crawler

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Leo,

> The first beta is done (without NIO). It needs, however, further
> testing. Unfortunately, I could not find enough servers that I can
> hit.

Nice.  Pretty much any site is a candidate, as long as you are nice to
it.
You could, for instance, hit all dmoz URLs.  Or you could extract a set
of links from Yahoo.  Or you could try finding the small and large sets
of URLs that Google provided a while ago for their Google Challenge.

> I wanted to commit the robot as part of egothor (it will use it in
> PULL mode), but we have nice weather here, so I lost all motivation
> to play with the PC ;-).

Yes, I hear some places in central Europe are having temperatures of
36-38 C.  Hot!
We are not that lucky in NYC this year :(  Lots of rain and cloudy
weather, which is atypical.

> What interface do you need for Lucene? Will you use PUSH (=the robot
> will modify Lucene's index) or PULL (=the engine will get deltas from
> the robot) mode? Tell me what you need and I will try to do my best.

I'd imagine one would want to use it in PUSH mode (i.e. the crawler
fetches a web page and adds it to the searchable index).
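
With the Lucene 1.x API, that boils down to something like this (a
sketch only; "page" and the field names are made up):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
  Document doc = new Document();
  doc.add(Field.Keyword("uri", page.uri()));     // "page" = whatever was fetched
  doc.add(Field.Text("contents", page.text()));  // tokenized and indexed
  writer.addDocument(doc);                       // the page becomes searchable
  writer.close();
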
How does PULL mode work?  I've never heard of web crawlers being used
in PULL mode.  What exactly does that mean - could you please
describe it?

Thanks,
Otis


Re: High Capacity (Distributed) Crawler

Posted by Leo Galambos <Le...@seznam.cz>.
Hi Otis.

The first beta is done (without NIO). It needs, however, further
testing. Unfortunately, I could not find enough servers that I can hit.

I wanted to commit the robot as part of egothor (it will use it in
PULL mode), but we have nice weather here, so I lost all motivation to
play with the PC ;-).

What interface do you need for Lucene? Will you use PUSH (=the robot
will modify Lucene's index) or PULL (=the engine will get deltas from
the robot) mode? Tell me what you need and I will try to do my best.

-g-