Posted to java-user@lucene.apache.org by "lucene@libero.it" <lu...@libero.it> on 2002/04/24 11:02:32 UTC

Italian web sites

Hi all,

I'm using Jobo for spidering web sites and Lucene for indexing. The
problem is that I'd like to spider only Italian web sites.
How can I discover the country of a web site?

Do you know of a method you could suggest?

Thanks


Laura


Re: Italian web sites

Posted by Ype Kingma <yk...@xs4all.nl>.
Laura

The best method I know of is to take character n-grams and
use the frequencies of the most frequent ones:
http://citeseer.nj.nec.com/context/698873/68861
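
A minimal sketch of the n-gram idea, in Python. It assumes a rank-order
comparison of n-gram profiles (one common way to use n-gram frequencies;
whether it matches the linked reference exactly is an assumption). The
function names, the toy training samples, and the penalty of 300 for a
missing n-gram are all hypothetical; usable profiles need far more text
per language.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the most frequent character n-grams (n = 1..n_max), best first."""
    # Keep letters only, lower-cased; pad each word so word boundaries count.
    words = "".join(c.lower() if c.isalpha() else " " for c in text).split()
    counts = Counter()
    for word in words:
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile, missing_penalty=300):
    """Sum of rank differences; n-grams absent from the language profile
    cost a fixed penalty (an assumed value)."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(abs(i - rank[g]) if g in rank else missing_penalty
               for i, g in enumerate(doc_profile))


def guess_language(text, samples):
    """Pick the language whose sample profile is closest to the text's profile."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample) for lang, sample in samples.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training samples (hypothetical; real profiles need much more text).
SAMPLES = {
    "it": "il gatto è sul tavolo e la casa è molto bella perché il sole "
          "splende nella città che non dorme mai",
    "en": "the cat is on the table and the house is very beautiful because "
          "the sun shines in the city that never sleeps",
}
```

A spider could call guess_language() on the extracted page text and keep
only pages classified as "it".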

Regards,
Ype

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Italian web sites

Posted by Karl Øie <ka...@gan.no>.
Combined with that, you could use an Italian stop-word list to run statistics
on a page :-)
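
One possible reading of "run statistics" is the fraction of words on a
page that appear in the stop-word list. A rough sketch in Python follows;
the short word list, the function names, and the 0.2 threshold are
assumptions for illustration, not tested values.

```python
# A tiny, hypothetical Italian stop-word list; a real one would be much longer.
# Words that are also common in English ("a", "i", "in") are left out on purpose.
ITALIAN_STOP_WORDS = frozenset({
    "il", "lo", "la", "gli", "le", "un", "una", "di", "da", "con", "su",
    "per", "che", "e", "è", "non", "del", "della", "sono", "nel", "al",
})


def stop_word_ratio(text, stop_words=ITALIAN_STOP_WORDS):
    """Return the fraction of words in text that are in stop_words."""
    # Treat every non-letter as a separator and compare in lower case.
    words = "".join(c.lower() if c.isalpha() else " " for c in text).split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in stop_words)
    return hits / len(words)


def looks_italian(text, threshold=0.2):
    """Guess that text is Italian when enough of its words are stop words."""
    return stop_word_ratio(text) >= threshold
```

The threshold would need tuning on real pages, and short pages give noisy
ratios, which is one reason to combine this with the n-gram approach.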





Re: Italian web sites

Posted by Marco Ferrante <fe...@unige.it>.
What do you mean? An "Italian website" can be:
  - a site that uses the Italian language
  - a site owned by an Italian organization
  - a site hosted at an Italian geographical location
Each definition has a different solution.



--------------------------------------------------
Marco Ferrante (ferrante@unige.it)
CSITA (Centro Servizi Informatici e Telematici d'Ateneo)
Università degli Studi di Genova - Italy
Via Brigata Salerno, ponte - 16147 Genova
tel (+39) 0103532621 (interno tel. 2621)
--------------------------------------------------




RE: Italian web sites

Posted by "Nader S. Henein" <ns...@bayt.net>.
Look up the site's IP address, then use the database at the
internet topology website http://netgeo.caida.org/perl/netgeo.cgi
to find the country of origin. (Use the results to populate your
own DB, so remote lookups decrease as you accumulate IPs.) But that will
give you websites hosted in Italy, not Italian websites. Unfortunately,
unless Italian pages use a different encoding, picking the language up
from the page itself (JavaScript) won't help much.
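
The "populate your own DB" part can be sketched as a small cache in
Python. The class name and the lookup_country callable are hypothetical:
lookup_country stands for whatever code queries the remote service and
returns a country code, since the response format of that service is not
described here. An in-memory dict stands in for the local DB.

```python
import socket


class CountryCache:
    """Remember IP-to-country answers so each IP is looked up remotely once."""

    def __init__(self, lookup_country, resolve=socket.gethostbyname):
        self._lookup_country = lookup_country  # hypothetical: IP string -> country code
        self._resolve = resolve                # host name -> IPv4 address string
        self._by_ip = {}                       # stands in for "your own DB"
        self.remote_lookups = 0                # how often the remote service was asked

    def country_of(self, host):
        """Return the country for host, asking the remote service only on a cache miss."""
        ip = self._resolve(host)
        if ip not in self._by_ip:
            self.remote_lookups += 1
            self._by_ip[ip] = self._lookup_country(ip)
        return self._by_ip[ip]
```

Hosts that share an IP address cost only one remote lookup, so the number
of remote queries drops as the cache fills.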






