Posted to dev@nutch.apache.org by Jay Lorenzo <ja...@gmail.com> on 2005/09/02 07:22:37 UTC
Re: Automating workflow using ndfs
Thanks, that's good information - it sounds like I need to take a closer
look at index deployment to see what the best solution is for automating
index management.
The initial email was more about understanding what the envisioned workflow
would be for automating the creation of those indexes in an NDFS system,
i.e., what choices are available for automating the
fetchlist->crawl->updateDb->index part of the equation when you have one
node hosting a webdb and a number of nodes crawling and indexing.
If I use a message-based system, I assume I would create new fetchlists at
given locations in the NDFS and message the fetchers where to find them.
Once the pages are crawled, I then need to update the webdb with the links
discovered during the crawl.
Maybe this is too complex a solution, but my sense is that map-reduce
systems still need a way to manage the workflow/control required if you
want to create pipelines that generate indexes.
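To make the per-stage pipeline idea concrete, here is a minimal coordinator sketch with one work queue per stage. The stage names follow the thread; everything else (handler signatures, batch representation) is invented for illustration and is not Nutch code:

```python
from collections import deque

STAGES = ["fetchlist", "crawl", "updatedb", "index"]

def run_pipeline(handlers, seed_batches):
    """Drive batches through per-stage work queues in order.

    handlers: {stage_name: fn(batch) -> batch}, one worker per stage.
    seed_batches: initial work for the first stage.
    """
    queues = {stage: deque() for stage in STAGES}
    finished = []
    queues[STAGES[0]].extend(seed_batches)
    while any(queues.values()):
        for i, stage in enumerate(STAGES):
            if not queues[stage]:
                continue
            batch = queues[stage].popleft()
            result = handlers[stage](batch)
            if i + 1 < len(STAGES):
                queues[STAGES[i + 1]].append(result)  # hand off downstream
            else:
                finished.append(result)               # index is built
    return finished
```

In a message-based deployment, each queue would live on a broker and each handler on its own node; the control flow stays the same.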
Thanks,
Jay Lorenzo
On 8/31/05, Doug Cutting <cu...@nutch.org> wrote:
>
> I assume that in most NDFS-based configurations the production search
> system will not run directly out of NDFS. Rather, indexes will be created
> offline for a deployment (i.e., merging things to create an index per
> search node), then copied out of NDFS to the local filesystem on a
> production search node and placed in production. This can be done
> incrementally, where new indexes are deployed without re-deploying old
> indexes. In this scenario, new indexes are rotated in replacing old
> indexes, and the .del file for every index is updated, to reflect
> deduping. There is no code yet which implements this.
>
> Is this what you were asking?
>
> Doug
>
>
> Jay Lorenzo wrote:
> > I'm pretty new to nutch, but in reading through the mail lists and other
> > papers, I don't think I've really seen any discussion on using ndfs with
> > respect to automating end to end workflow for data that is going to be
> > searched (fetch->index->merge->search).
> >
> > The few crawler designs I'm familiar with typically have spiders
> > (fetchers) and
> > indexers on the same box. Once pages are crawled and indexed the indexes
> > are pipelined to merge/query boxes to complete the workflow.
> >
> > When I look at the nutch design and ndfs, I'm assuming the design intent
> > for 'pure ndfs' workflow is for the webdb to generate segments on an ndfs
> > partition, and once the updating of the webdb is completed, the segments
> > are processed 'on-disk' by the subsequent
> > fetcher/index/merge/query mechanisms. Is this a correct assumption?
> >
> > Automating this kind of continuous workflow usually is dependent on the
> > implementation of some kind of control mechanism to assure that the
> > correct sequence of operations is performed.
> >
> > Are there any recommendations on the best way to automate this
> > workflow when using ndfs? I've prototyped a continuous workflow system
> > using a traditional pipeline model with per stage work queues, and I see
> > how that could be applied to a clustered filesystem like ndfs, but I'm
> > curious to hear what the design intent or best practice is envisioned
> > for automating ndfs based implementations.
> >
> >
> > Thanks,
> >
> > Jay
> >
>
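Doug's incremental deployment scheme, quoted above, could be sketched roughly as follows. The directory layout, helper names, and the stage-then-rename trick are all assumptions on my part; as Doug notes, no code implements this yet:

```python
import os
import shutil

def deploy_index(export_dir, live_root, index_name):
    """Copy one index out of the (simulated) NDFS export area onto a
    search node's local disk.  The copy is staged under a temporary
    name and renamed into place, so a searcher never opens a
    half-copied index.  Previously deployed indexes are left alone,
    which is what makes the deployment incremental."""
    src = os.path.join(export_dir, index_name)
    tmp = os.path.join(live_root, index_name + ".tmp")
    dst = os.path.join(live_root, index_name)
    shutil.copytree(src, tmp)
    os.rename(tmp, dst)  # atomic within one filesystem
    # A fuller version would also refresh each index's .del file here
    # so that deduplication survives the rotation.
    return dst

def deployed_indexes(live_root):
    """List the indexes currently rotated into production."""
    return sorted(d for d in os.listdir(live_root) if not d.endswith(".tmp"))
```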
Re: page rank in Nutch
Posted by Ken Krugler <kk...@transpac.com>.
>Lucene has a basic scoring algorithm based on tf, idf,
>and field boost values.
>
>And Nutch adopts the page-rank concept via its
>unique link analysis in the DistributedAnalysisTool
>class.
Actually I don't think most people run this. I believe it starts to
have performance issues when your page counts get large, which is one
of the reasons for the mapred work being done by Doug/Mike in a
branch.
Typically the extent of "link analysis" is the number of inbound
links to a page, which is always being calculated whenever the WebDB
is updated following a crawl.
>But when I take a look at "score in detail" for a Nutch
>search result, I don't see a factor called "link
>analysis" or anything like that.
>
>Where can I see this factor, or is it already combined
>into the score value shown on the score-detail page?
See my previous post on how inbound link counts are used to boost a
Lucene document (web page).
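As a rough illustration of that kind of boost (the exact function here is my guess, not necessarily the formula Nutch applies), the idea is a damped, monotone function of the inlink count folded into the document boost at index time:

```python
import math

def inlink_boost(inlink_count):
    # Damped so that 10x the inbound links does not mean 10x the score;
    # a page with no known inlinks keeps a neutral boost of 1.0.
    return 1.0 + math.log1p(inlink_count)

# Conceptually, at index time: doc.setBoost(inlink_boost(n_inlinks)),
# so the link factor is folded into the overall Lucene score rather
# than appearing as a separate line in "score in detail".
```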
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
page rank in Nutch
Posted by Michael Ji <fj...@yahoo.com>.
hi,
Lucene has a basic scoring algorithm based on tf, idf,
and field boost values.
And Nutch adopts the page-rank concept via its
unique link analysis in the DistributedAnalysisTool
class.
But when I take a look at "score in detail" for a Nutch
search result, I don't see a factor called "link
analysis" or anything like that.
Where can I see this factor, or is it already combined
into the score value shown on the score-detail page?
thanks,
Michael Ji,
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Re: Automating workflow using ndfs
Posted by Anjun Chen <an...@sbcglobal.net>.
I'm going to make a request in Jira now. -AJ
--- Matt Kangas <ka...@gmail.com> wrote:
> Great! Is there a ticket in JIRA requesting this
> feature? If not, we should file one and get a few
> votes in favor of it. AFAIK, that's the process for
> getting new features into Nutch.
Re: Automating workflow using ndfs
Posted by Matt Kangas <ka...@gmail.com>.
Great! Is there a ticket in JIRA requesting this feature? If not, we
should file one and get a few votes in favor of it. AFAIK, that's the
process for getting new features into Nutch.
On Sep 2, 2005, at 1:30 PM, AJ Chen wrote:
> Matt,
> This is great! It would be very useful to Nutch developers if your
> code can be shared. I'm sure quite a few applications will benefit
> from it because it fills a gap between whole-web crawling and
> single site (or a handful of sites) crawling. I'll be interested
> in adapting your plugin to Nutch convention.
> Thanks,
> -AJ
>
> Matt Kangas wrote:
>
>
>> AJ and Earl,
>>
>> I've implemented URLFilters before. In fact, I have a
>> WhitelistURLFilter that implements just what you describe: a
>> hashtable of regex-lists. We implemented it specifically because
>> we want to be able to crawl a large number of known-good paths
>> through sites, including paths through CGIs. The hash is a Nutch
>> ArrayFile, which provides low runtime overhead. We've tested it
>> on 200+ sites thus far, and haven't seen any indication that it
>> will have problems scaling further.
>>
>> The filter and its supporting WhitelistWriter currently rely on a
>> few custom classes, but it should be straightforward to adapt to
>> Nutch naming conventions, etc. If you're interested in doing this
>> work, I can see if it's ok to publish our code.
>>
>> BTW, we're currently alpha-testing the site that uses this
>> plugin, and preparing for a public beta. I'll be sure to post
>> here when we're finally open for business. :)
>>
>> --Matt
>>
>>
>> On Sep 2, 2005, at 11:43 AM, AJ Chen wrote:
>>
>>
>>> From reading http://wiki.apache.org/nutch/
>>> DissectingTheNutchCrawler, it seems that a new urlfilter is a
>>> good place to extend the inclusion regex capability. The new
>>> urlfilter will be defined by urlfilter.class property, which
>>> gets loaded by the URLFilterFactory.
>>> Regex is necessary because you want to include urls matching
>>> certain patterns.
>>>
>>> Can anybody who implemented URLFilter plugin before share some
>>> thoughts about this approach? I expect the new filter must have
>>> all capabilities that the current RegexURLFilter.java has so
>>> that it won't require change in any other classes. The
>>> difference is that the new filter uses a hash table for
>>> efficiently looking up regex for included domains (a large
>>> number!).
>>>
>>> BTW, I can't find urlfilter.class property in any of the
>>> configuration files in Nutch-0.7. Does 0.7 version still support
>>> urlfilter extension? Any difference relative to what's described
>>> in the doc DissectingTheNutchCrawler cited above?
>>>
>>> Thanks,
>>> AJ
>>>
>>> Earl Cahill wrote:
>>>
>>>
>>>
>>>>> The goal is to avoid entering 100,000 regex in the
>>>>> craw-urlfilter.xml and checking ALL these regex for each URL.
>>>>> Any comment?
>>>>>
>>>>>
>>>>>
>>>>
>>>> Sure seems like just some hash look up table could
>>>> handle it. I am having a hard time seeing when you
>>>> really need a regex and a fixed list wouldn't do. Especially if
>>>> you have forward and maybe a backwards
>>>> lookup as well in a multi-level hash, to perhaps
>>>> include/exclude at a certain sudomain level, like
>>>>
>>>> include: com->site->good (for good.site.com stuff)
>>>> exclude: com->site->bad (for bad.site.com)
>>>>
>>>> and kind of walk backwards, kind of like dns. Then
>>>> you could just do a few hash lookups instead of
>>>> 100,000 regexes.
>>>>
>>>> I realize I am talking about host and not page level
>>>> filtering, but if you want to include everything from
>>>> your 100,000 sites, I think such a strategy could
>>>> work.
>>>>
>>>> Hope this makes sense. Maybe I could write some code
>>>> to and see if it works in practice. If nothing else,
>>>> maybe the hash stuff could just be another filter
>>>> option in conf/crawl-urlfilter.txt.
>>>>
>>>> Earl
>>>>
>>>>
>>>>
>>>
>>> --
>>> AJ (Anjun) Chen, Ph.D.
>>> Canova Bioconsulting Marketing * BD * Software Development
>>> 748 Matadero Ave., Palo Alto, CA 94306, USA
>>> Cell 650-283-4091, anjun.chen@sbcglobal.net
>>> ---------------------------------------------------
>>>
>>>
>>
>> --
>> Matt Kangas / kangas@gmail.com
>>
>>
>>
>>
>
> --
> AJ (Anjun) Chen, Ph.D.
> Canova Bioconsulting Marketing * BD * Software Development
> 748 Matadero Ave., Palo Alto, CA 94306, USA
> Cell 650-283-4091, anjun.chen@sbcglobal.net
> ---------------------------------------------------
>
--
Matt Kangas / kangas@gmail.com
Re: Automating workflow using ndfs
Posted by AJ Chen <an...@sbcglobal.net>.
Matt,
This is great! It would be very useful to Nutch developers if your code
can be shared. I'm sure quite a few applications will benefit from it
because it fills a gap between whole-web crawling and single-site (or a
handful of sites) crawling. I'll be interested in adapting your plugin
to Nutch conventions.
Thanks,
-AJ
Matt Kangas wrote:
> AJ and Earl,
>
> I've implemented URLFilters before. In fact, I have a
> WhitelistURLFilter that implements just what you describe: a
> hashtable of regex-lists. We implemented it specifically because we
> want to be able to crawl a large number of known-good paths through
> sites, including paths through CGIs. The hash is a Nutch ArrayFile,
> which provides low runtime overhead. We've tested it on 200+ sites
> thus far, and haven't seen any indication that it will have problems
> scaling further.
>
> The filter and its supporting WhitelistWriter currently rely on a few
> custom classes, but it should be straightforward to adapt to Nutch
> naming conventions, etc. If you're interested in doing this work, I
> can see if it's ok to publish our code.
>
> BTW, we're currently alpha-testing the site that uses this plugin,
> and preparing for a public beta. I'll be sure to post here when we're
> finally open for business. :)
>
> --Matt
>
>
--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, anjun.chen@sbcglobal.net
---------------------------------------------------
Re: Automating workflow using ndfs
Posted by Matt Kangas <ka...@gmail.com>.
AJ and Earl,
I've implemented URLFilters before. In fact, I have a
WhitelistURLFilter that implements just what you describe: a
hashtable of regex-lists. We implemented it specifically because we
want to be able to crawl a large number of known-good paths through
sites, including paths through CGIs. The hash is a Nutch ArrayFile,
which provides low runtime overhead. We've tested it on 200+ sites
thus far, and haven't seen any indication that it will have problems
scaling further.
The filter and its supporting WhitelistWriter currently rely on a few
custom classes, but it should be straightforward to adapt to Nutch
naming conventions, etc. If you're interested in doing this work, I
can see if it's ok to publish our code.
BTW, we're currently alpha-testing the site that uses this plugin,
and preparing for a public beta. I'll be sure to post here when we're
finally open for business. :)
--Matt
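A rough sketch of what Matt describes - one hash lookup by host, after which only that host's regex list is tested. The class shape and the filter(url)-returns-url-or-None contract are assumptions; his actual WhitelistURLFilter and its ArrayFile-backed hash are not shown in this thread:

```python
import re
from urllib.parse import urlparse

class WhitelistFilter:
    def __init__(self, rules):
        # rules: {host: [regex over path(?query), ...]}
        # One dict probe per URL instead of trying every pattern.
        self.table = {host: [re.compile(p) for p in pats]
                      for host, pats in rules.items()}

    def filter(self, url):
        """Return the URL if it is whitelisted, else None (drop it)."""
        parts = urlparse(url)
        patterns = self.table.get(parts.netloc)
        if patterns is None:
            return None  # host not whitelisted at all
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query  # allow paths through CGIs
        return url if any(p.match(target) for p in patterns) else None
```

The point of the design is that adding the 100,000th site costs one more hash entry, not one more pattern checked against every URL.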
On Sep 2, 2005, at 11:43 AM, AJ Chen wrote:
> From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler,
> it seems that a new urlfilter is a good
> place to extend the inclusion regex capability. The new urlfilter
> will be defined by urlfilter.class property, which gets loaded by
> the URLFilterFactory.
> Regex is necessary because you want to include urls matching
> certain patterns.
>
> Can anybody who has implemented a URLFilter plugin before share some
> thoughts about this approach? I expect the new filter must have all
> capabilities that the current RegexURLFilter.java has so that it
> won't require change in any other classes. The difference is that
> the new filter uses a hash table for efficiently looking up regex
> for included domains (a large number!).
>
> BTW, I can't find urlfilter.class property in any of the
> configuration files in Nutch-0.7. Does version 0.7 still support
> urlfilter extension? Any difference relative to what's described in
> the doc DissectingTheNutchCrawler cited above?
>
> Thanks,
> AJ
--
Matt Kangas / kangas@gmail.com
Re: Automating workflow using ndfs
Posted by AJ Chen <an...@sbcglobal.net>.
From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler, it
seems that a new urlfilter is a good place to extend the inclusion regex
capability. The new urlfilter will be defined by urlfilter.class
property, which gets loaded by the URLFilterFactory.
Regex is necessary because you want to include urls matching certain
patterns.
Can anybody who has implemented a URLFilter plugin before share some thoughts
about this approach? I expect the new filter must have all capabilities
that the current RegexURLFilter.java has so that it won't require change
in any other classes. The difference is that the new filter uses a hash
table for efficiently looking up regex for included domains (a large
number!).
BTW, I can't find urlfilter.class property in any of the configuration
files in Nutch-0.7. Does version 0.7 still support urlfilter extension?
Any difference relative to what's described in the doc
DissectingTheNutchCrawler cited above?
Thanks,
AJ
Earl Cahill wrote:
>> The goal is to
>>avoid entering 100,000 regexes in
>>crawl-urlfilter.txt and checking ALL
>>of them for each URL. Any comment?
>>
>>
>
>Sure seems like just some hash lookup table could
>handle it. I am having a hard time seeing when you
>really need a regex and a fixed list wouldn't do.
>Especially if you have forward and maybe a backwards
>lookup as well in a multi-level hash, to perhaps
>include/exclude at a certain subdomain level, like
>
>include: com->site->good (for good.site.com stuff)
>exclude: com->site->bad (for bad.site.com)
>
>and kind of walk backwards, kind of like dns. Then
>you could just do a few hash lookups instead of
>100,000 regexes.
>
>I realize I am talking about host and not page level
>filtering, but if you want to include everything from
>your 100,000 sites, I think such a strategy could
>work.
>
>Hope this makes sense. Maybe I could write some code
>and see if it works in practice. If nothing else,
>maybe the hash stuff could just be another filter
>option in conf/crawl-urlfilter.txt.
>
>Earl
>
>
>
>
--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, anjun.chen@sbcglobal.net
---------------------------------------------------
Re: Automating workflow using ndfs
Posted by Earl Cahill <ca...@yahoo.com>.
> The goal is to
> avoid entering 100,000 regexes in
> crawl-urlfilter.txt and checking ALL
> of them for each URL. Any comment?
Sure seems like just some hash lookup table could
handle it. I am having a hard time seeing when you
really need a regex and a fixed list wouldn't do.
Especially if you have forward and maybe a backwards
lookup as well in a multi-level hash, to perhaps
include/exclude at a certain subdomain level, like
include: com->site->good (for good.site.com stuff)
exclude: com->site->bad (for bad.site.com)
and kind of walk backwards, kind of like dns. Then
you could just do a few hash lookups instead of
100,000 regexes.
I realize I am talking about host and not page level
filtering, but if you want to include everything from
your 100,000 sites, I think such a strategy could
work.
Hope this makes sense. Maybe I could write some code
and see if it works in practice. If nothing else,
maybe the hash stuff could just be another filter
option in conf/crawl-urlfilter.txt.
Earl
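Earl's walk-backwards lookup can be sketched as a nested hash keyed on reversed domain labels, DNS-style, where the most specific rule seen on the walk wins. The rule encoding and function names are made up for illustration:

```python
def build_table(rules):
    """rules: {"good.site.com": "include", "bad.site.com": "exclude"}"""
    root = {}
    for host, action in rules.items():
        node = root
        for label in reversed(host.split(".")):  # com -> site -> good
            node = node.setdefault(label, {})
        node["_action"] = action
    return root

def lookup(table, host):
    """Walk backwards through the labels, remembering the last rule
    seen, so a few hash probes replace scanning 100,000 regexes."""
    node, action = table, None
    for label in reversed(host.split(".")):
        if label not in node:
            break
        node = node[label]
        action = node.get("_action", action)
    return action
```

With Earl's example rules, lookup returns "include" for anything under good.site.com and "exclude" for bad.site.com, at a cost of at most one hash probe per domain label.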
Re: Automating workflow using ndfs
Posted by AJ Chen <an...@sbcglobal.net>.
I'm also thinking about implementing an automated workflow of
fetchlist->crawl->updateDb->index. Although my project may not require
NDFS, because it only involves deep crawling of 100,000 sites, an
appropriate workflow is still needed to automatically take care of
failed urls, newly-added urls, daily updates, etc. I'd appreciate it if
somebody could share experience on the design of such a workflow.
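The bookkeeping part of that cycle (failed urls, newly added urls) can be sketched very simply; the function name, statuses, and arguments here are invented for illustration:

```python
def plan_next_round(last_results, known_urls, newly_added):
    """Build the next fetchlist for a daily update cycle.

    last_results: {url: "ok" | "failed"} from the previous round.
    known_urls:   everything already in the webdb.
    newly_added:  urls injected since the last round.
    """
    retries = sorted(u for u, s in last_results.items() if s == "failed")
    fresh = sorted(u for u in newly_added if u not in known_urls)
    return retries + fresh  # retry failures first, then crawl new urls
```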
The nutch intranet crawler (or site-specific crawler, as I prefer to
call it) is an automated process, but it's designed to conveniently deal
with just a handful of sites. With a larger number of selected sites, I
expect a modified version is needed. One modification I can think of is
to create a lookup table in the urlfilter object that maps domains to be
crawled to their corresponding regular expressions. The goal is to
avoid entering 100,000 regexes in crawl-urlfilter.txt and checking ALL
of them for each URL. Any comment?
thanks,
-AJ
Jay Lorenzo wrote:
>Thanks, that's good information - it sounds like I need to take a closer
>look at index deployment to see what the best solution is for automating
>index management.
>
>The initial email was more about understanding what the envisioned workflow
>would be for automating the creation of those indexes in an NDFS system,
>i.e., what choices are available for automating the
>fetchlist->crawl->updateDb->index part of the equation when you have one
>node hosting a webdb and a number of nodes crawling and indexing.
>
>If I use a message-based system, I assume I would create new fetchlists at
>given locations in the NDFS and message the fetchers where to find them.
>Once the pages are crawled, I then need to update the webdb with the links
>discovered during the crawl.
>
>Maybe this is too complex a solution, but my sense is that map-reduce
>systems still need a way to manage the workflow/control required if you
>want to create pipelines that generate indexes.
>
>Thanks,
>
>Jay Lorenzo
>
--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, anjun.chen@sbcglobal.net
---------------------------------------------------