Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2005/12/06 12:32:10 UTC

Crawling TLD's + injected sites.

We're trying to index based on a country.  What I'm trying to accomplish is:
- auto crawl sites with the correct TLD
- auto crawl manually injected sites. 

From these, I then only want to follow links to sites that match the TLD.

This means that sites with the correct TLD, if found anywhere
in the crawl, get indexed and crawled.  Other TLDs, say
.com, have to be manually injected.  And if an injected .com is
crawled, we only follow links from that .com to sites
that have the country TLD; it wouldn't follow links to other .com sites
that haven't been manually injected.

That allows me to automatically approve country-specific TLD sites no
matter how they're found, while only manually approved .com/.net/.org
etc. sites would be crawled.
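
The rule described above can be sketched as a single accept/reject check. This is only an illustration of the policy, not Nutch code: the `.ca` TLD, the host names, and the function names are all made-up placeholders, and hosts are matched exactly (so `www.example.com` would need its own entry).

```python
from urllib.parse import urlsplit

# Hypothetical values for illustration only; the thread does not name
# the country TLD or any of the injected sites.
COUNTRY_TLD = ".ca"
INJECTED_HOSTS = {"example.com", "example.org"}


def host_of(url):
    """Return the lower-cased host of a URL, or "" if it has none."""
    return (urlsplit(url).hostname or "").lower()


def should_crawl(url, country_tld=COUNTRY_TLD, injected_hosts=INJECTED_HOSTS):
    """Accept a URL if its host has the country TLD or was injected by hand."""
    host = host_of(url)
    if host.endswith(country_tld):
        # Country-TLD sites are approved no matter where the link was found.
        return True
    # Everything else (.com, .net, .org, ...) must be on the injected list.
    return host in injected_hosts
```

Because the same check is applied to every outlink, a link found on an injected .com is only followed when it points at a country-TLD host or at another injected host.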

Hope that makes sense.  We've got the regex for the TLD set up already.
However, it seems that we're going to need to set up a filter for every
single .com we manually inject, and from my reading this seems like it's
going to eventually cause speed problems.

Any suggestions?

Thanks.

Re: ad feed for nutch

Posted by Byron Miller <by...@yahoo.com>.
phpadsnew is OK, but it's not easy to integrate with a keyword-based
system such as search.

I've used Inclick before with moderate success. It was under heavy
development at the time, but the developers seem to have a strong
base to work from.

In my experience it's not really affordable to run your own PPC and
try to compete.  Backfill with Google specific sites or establish a
mutually beneficial relationship with a 2nd/3rd tier PPC engine that
will co-market with you.

-byron

--- Thomas Delnoij <di...@gmail.com> wrote:

> It should be fairly easy to integrate PhpAdsNew with Nutch:
> http://phpadsnew.com/.
>
> Rgrds, Thomas
>
> On 12/7/05, Greg Cohen <gr...@gcohen.com> wrote:
> >
> > Glenn,
> >
> > I'm trying to put together a project that will also require ad serving,
> > but want it to be open source and give greater transparency to the
> > advertisers than they get today with google and overture.  If you start
> > developing one, were you thinking of making this open source project?
> >
> > Thanks.
> >
> > -greg
> >
> > -----Original Message-----
> > From: Insurance Squared Inc. [mailto:gcooke@insurancesquared.com]
> > Sent: Tuesday, December 06, 2005 3:37 AM
> > To: nutch-user@lucene.apache.org
> > Subject: ad feed for nutch
> >
> > Has anyone had any luck with advertising/ad management systems being
> > integrated into nutch? Not just something for the owner to admin ads,
> > but to allow external advertisers to manage their accounts/bids, that
> > kind of thing.
> >
> > I'm drawing up plans for one if none are available, but clearly
> > something that's already running would be nicer.
> >
> > Thanks,
> > -glenn


Re: ad feed for nutch

Posted by Thomas Delnoij <di...@gmail.com>.
It should be fairly easy to integrate PhpAdsNew with Nutch:
http://phpadsnew.com/.

Rgrds, Thomas

On 12/7/05, Greg Cohen <gr...@gcohen.com> wrote:
>
> Glenn,
>
> I'm trying to put together a project that will also require ad serving,
> but want it to be open source and give greater transparency to the
> advertisers than they get today with google and overture.  If you start
> developing one, were you thinking of making this open source project?
>
> Thanks.
>
> -greg
>
> -----Original Message-----
> From: Insurance Squared Inc. [mailto:gcooke@insurancesquared.com]
> Sent: Tuesday, December 06, 2005 3:37 AM
> To: nutch-user@lucene.apache.org
> Subject: ad feed for nutch
>
> Has anyone had any luck with advertising/ad management systems being
> integrated into nutch? Not just something for the owner to admin ads,
> but to allow external advertisers to manage their accounts/bids, that
> kind of thing.
>
> I'm drawing up plans for one if none are available, but clearly
> something that's already running would be nicer.
>
> Thanks,
> -glenn

Re: ad feed for nutch

Posted by Stefan Groschupf <sg...@media-style.com>.
AdSense has been working fine on my Nutch-generated pages for some time.
There was already a post about this a while ago.

On 06.12.2005 at 18:15, Paul Harrison wrote:

> I don't know that Google is really reading the pages in real-time.  I
> believe they are looking at the URL itself and adapting the results
> accordingly.  By looking at the URL they can pick up on the posted keywords
> and then match to the appropriate ad.  You can see the deficiencies in this
> by searching on words with double meanings and see what the ads are that
> come back from Google vs. what the results are on the page.
>
> -----Original Message-----
> From: Stefan Groschupf [mailto:sg@media-style.com]
> Sent: Tuesday, December 06, 2005 11:04 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: ad feed for nutch
>
> adsense works with nutch since they use ajax style to read content
> from pages in 'realtime'.
>
> On 06.12.2005 at 12:37, Insurance Squared Inc. wrote:
>
>> Has anyone had any luck with advertising/ad management systems
>> being integrated into nutch? Not just something for the owner to
>> admin ads, but to allow external advertisers to manage their
>> accounts/bids, that kind of thing.
>>
>> I'm drawing up plans for one if none are available, but clearly
>> something that's already running would be nicer.
>>
>> Thanks,
>> -glenn

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



RE: ad feed for nutch

Posted by Paul Harrison <pa...@personifi.com>.
I don't know that Google is really reading the pages in real-time.  I
believe they are looking at the URL itself and adapting the results
accordingly.  By looking at the URL they can pick up on the posted keywords
and then match to the appropriate ad.  You can see the deficiencies in this
by searching on words with double meanings and see what the ads are that
come back from Google vs. what the results are on the page.

-----Original Message-----
From: Stefan Groschupf [mailto:sg@media-style.com] 
Sent: Tuesday, December 06, 2005 11:04 AM
To: nutch-user@lucene.apache.org
Subject: Re: ad feed for nutch

adsense works with nutch since they use ajax style to read content  
from pages in 'realtime'.

On 06.12.2005 at 12:37, Insurance Squared Inc. wrote:

> Has anyone had any luck with advertising/ad management systems  
> being integrated into nutch? Not just something for the owner to  
> admin ads, but to allow external advertisers to manage their  
> accounts/bids, that kind of thing.
>
> I'm drawing up plans for one if none are available, but clearly  
> something that's already running would be nicer.
>
> Thanks,
> -glenn


Re: ad feed for nutch

Posted by Stefan Groschupf <sg...@media-style.com>.
AdSense works with Nutch because it uses an AJAX-style approach to read
content from the pages in 'real time'.

On 06.12.2005 at 12:37, Insurance Squared Inc. wrote:

> Has anyone had any luck with advertising/ad management systems  
> being integrated into nutch? Not just something for the owner to  
> admin ads, but to allow external advertisers to manage their  
> accounts/bids, that kind of thing.
>
> I'm drawing up plans for one if none are available, but clearly  
> something that's already running would be nicer.
>
> Thanks,
> -glenn


RE: ad feed for nutch

Posted by Greg Cohen <gr...@gcohen.com>.
Glenn,

I'm trying to put together a project that will also require ad serving, but
I want it to be open source and to give advertisers greater transparency
than they get today with Google and Overture.  If you start developing one,
were you thinking of making it an open source project?

Thanks.

-greg

-----Original Message-----
From: Insurance Squared Inc. [mailto:gcooke@insurancesquared.com] 
Sent: Tuesday, December 06, 2005 3:37 AM
To: nutch-user@lucene.apache.org
Subject: ad feed for nutch

Has anyone had any luck with advertising/ad management systems being 
integrated into nutch? Not just something for the owner to admin ads, 
but to allow external advertisers to manage their accounts/bids, that 
kind of thing.

I'm drawing up plans for one if none are available, but clearly 
something that's already running would be nicer.

Thanks,
-glenn





ad feed for nutch

Posted by "Insurance Squared Inc." <gc...@insurancesquared.com>.
Has anyone had any luck with advertising/ad management systems being 
integrated into nutch? Not just something for the owner to admin ads, 
but to allow external advertisers to manage their accounts/bids, that 
kind of thing.

I'm drawing up plans for one if none are available, but clearly 
something that's already running would be nicer.

Thanks,
-glenn



Re: Crawling TLD's + injected sites.

Posted by Thomas Delnoij <di...@gmail.com>.
> from my reading this seems like it's going to eventually cause speed
> problems.

Yes, because you would have to add one regular expression for every .com
domain to your regex-urlfilter, and each URL has to be checked against
all of them.

I think the urlfilter-db plugin was specifically designed to solve this
problem:
http://issues.apache.org/jira/browse/NUTCH-100
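
To make the speed concern concrete, here is a rough sketch comparing the two approaches. It is not the urlfilter-db plugin's code and does not use any Nutch API; the domain list and function names are invented for illustration, and it assumes the regex rules are tried one after another for each URL.

```python
import re
from urllib.parse import urlsplit

# Hypothetical list of manually injected domains.
injected = ["example.com", "example.org", "example.net"]

# Approach A: one compiled regex per injected domain.
# Each URL is tested against the patterns in turn, so the work per URL
# grows with the number of injected domains.
patterns = [
    re.compile(r"^https?://([a-z0-9-]+\.)*" + re.escape(d) + r"(?:[:/]|$)")
    for d in injected
]


def accept_by_regex(url):
    return any(p.search(url) for p in patterns)


# Approach B: a single hash-set lookup keyed on the host name.
# The work per URL depends on the number of labels in the host,
# not on how many domains have been injected.
injected_set = set(injected)


def accept_by_lookup(url):
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try the host itself and each parent domain (www.example.com, example.com, com).
    return any(".".join(labels[i:]) in injected_set for i in range(len(labels)))
```

Both functions give the same answer for ordinary lower-case http/https URLs; the difference is only in how the cost scales as the injected list grows.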

Rgds, Thomas