
Posted to user@nutch.apache.org by Jayadeep Reddy <ja...@ehealthaccess.com> on 2013/10/15 12:23:23 UTC

How to Crawl Specific sites

How can I index data from Indian websites only?

-- 
Jayadeep Reddy.S,
M.D & C.E.O
e Health Access Pvt.Ltd
www.ehealthaccess.com
Hyderabad-Chennai-Banglore
http://www.youtube.com/watch?v=0k5LX8mw6Sk

RE: How to Crawl Specific sites

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.

If you rely on a filter based on TLDs (.in, .com, etc.), you won't get a
good result, since the TLD is no guarantee of language; i.e., a .com TLD may
host websites not only in English but in any other conceivable language, and
a host in France (.fr) may host websites in Greek, for example.

In conf/nutch-site.xml:

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

As far as I understand (I'm not sure), this only sets the Accept-Language
header that Nutch sends with each request, so it relies on the web server
and the author of the website to serve content in the requested language,
and it's not 100% reliable either. I start with a seed file of URLs which I
know are in the language I want, but as the crawls grow I start to see docs
in other languages (maybe I have not configured this correctly).
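
As a reading aid for the value above, here is a minimal Java sketch of how an
Accept-Language value breaks down into language ranges and weights. The class
name AcceptLanguage is made up for illustration and is not part of Nutch; the
only fact relied on is that an entry without an explicit q parameter has
weight 1.0 in HTTP.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not a Nutch class: splits an Accept-Language value. */
final class AcceptLanguage {

    /** Returns language range -> weight, in the order the ranges were listed. */
    static Map<String, Double> parse(String headerValue) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String part : headerValue.split(",")) {
            String[] pieces = part.trim().split(";");
            String range = pieces[0].trim();
            if (range.isEmpty()) {
                continue; // skip empty entries such as a trailing comma
            }
            double q = 1.0; // HTTP default when no q parameter is given
            for (int i = 1; i < pieces.length; i++) {
                String param = pieces[i].trim();
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2));
                }
            }
            weights.put(range, q);
        }
        return weights;
    }
}
```

For the value in the property above this gives ja-jp, en-us and en-gb at
weight 1.0, en at 0.7 and * at 0.3. Whether a server acts on those
preferences is entirely up to the server.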

Personally I would like to reject any document that is not in the language
intended, but I haven't gotten to that point. My next step will be to look
into the Tika parser supplied with Nutch.

My 2 cents, hope it helps!



-----Original Message-----
From: Talat UYARER [mailto:talat.uyarer@agmlab.com] 
Sent: Tuesday, October 15, 2013 5:15 PM
To: user@nutch.apache.org
Subject: Re: How to Crawl Specific sites

Hi,
In addition to Markus's answer: if you don't want to fetch non-Indian
websites again, you can do it by writing some custom code. We actually wrote
such code because we had the same need. Normally, if your websites are mixed,
like .com or .in, you can't tell a website's language from the URL. We solved
this by writing a custom FetchSchedule. We check the page's language in its
shouldFetch method; if the language is not allowed, we don't generate the URL
again. If you can wait, I will share our code.

Talat

On 15-10-2013 at 13:36, Markus Jelsma wrote:
> Hi - either by using a language detector that only allows some or all
> common languages spoken in India or by using a domain URL filter to restrict
> to the .in domain.
>   
>   
> -----Original message-----
>> From:Jayadeep Reddy <ja...@ehealthaccess.com>
>> Sent: Tuesday 15th October 2013 12:10
>> To: user@nutch.apache.org
>> Subject: How to Crawl Specific sites
>>
>> How can I index data of only Indian websites
>>
>> -- 
>> Jayadeep Reddy.S,
>> M.D & C.E.O
>> e Health Access Pvt.Ltd
>> www.ehealthaccess.com
>> Hyderabad-Chennai-Banglore
>> http://www.youtube.com/watch?v=0k5LX8mw6Sk
>>

Re: How to Crawl Specific sites

Posted by Jayadeep Reddy <ja...@ehealthaccess.com>.

Thank You





RE: How to Crawl Specific sites

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.

Answer to my own question here:

http://lucene.472066.n3.nabble.com/How-to-install-or-use-nutch-patch-td2021141.html

-----Original Message-----
From: Ralf R. Kotowski [mailto:rrk@enlle.com] 
Sent: Saturday, November 02, 2013 7:00 PM
To: user@nutch.apache.org
Subject: RE: How to Crawl Specific sites

Thank you very much,

Excuse my ignorance, I'm not familiar with how to use Jira or how to apply
patches... if someone could enlighten me, that would be great.

Thanks

-----Original Message-----
From: Talat UYARER [mailto:talat.uyarer@agmlab.com] 
Sent: Saturday, November 02, 2013 6:47 PM
To: user@nutch.apache.org; Ralf R. Kotowski
Subject: RE: How to Crawl Specific sites

Hi Ralf,
You can find it in Jira as NUTCH-1661. I uploaded it today :)

Talat



On 2 November 2013 19:10:04 "Ralf R. Kotowski" <rr...@enlle.com> wrote:
> Would you be willing to share this code?
>
> Thnx

RE: How to Crawl Specific sites

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.

Thank you very much,

Excuse my ignorance, I'm not familiar with how to use Jira or how to apply
patches... if someone could enlighten me, that would be great.

Thanks


RE: How to Crawl Specific sites

Posted by Talat UYARER <ta...@agmlab.com>.

Hi Ralf,
You can find it in Jira as NUTCH-1661. I uploaded it today :)

Talat


RE: How to Crawl Specific sites

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.

Would you be willing to share this code?

Thanks


Re: How to Crawl Specific sites

Posted by Talat UYARER <ta...@agmlab.com>.

Hi,
In addition to Markus's answer: if you don't want to fetch non-Indian
websites again, you can do it by writing some custom code. We actually wrote
such code because we had the same need. Normally, if your websites are mixed,
like .com or .in, you can't tell a website's language from the URL. We solved
this by writing a custom FetchSchedule. We check the page's language in its
shouldFetch method; if the language is not allowed, we don't generate the URL
again. If you can wait, I will share our code.

Talat
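
For readers who want a picture of the check described above, here is a rough
Java sketch of the decision only. It is not the code from this message: the
class LanguageGate and its method signature are hypothetical, it does not
implement Nutch's FetchSchedule interface, and it assumes the detected
language has already been stored with the page as a lowercase-comparable
code such as "hi" or "ta".

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not a Nutch class: decides whether to fetch a page again. */
final class LanguageGate {

    // Allowed language codes; assumed to be given in lowercase, e.g. "hi", "ta", "en".
    private final Set<String> allowedLanguages;

    LanguageGate(Set<String> allowedLanguages) {
        this.allowedLanguages = allowedLanguages;
    }

    /**
     * @param detectedLanguage language code recorded for the page on an earlier
     *                         fetch, or null if the page has not been fetched yet
     * @return true if the page should be scheduled for fetching
     */
    boolean shouldFetch(String detectedLanguage) {
        if (detectedLanguage == null) {
            // Language is unknown until the page has been fetched once.
            return true;
        }
        return allowedLanguages.contains(detectedLanguage.toLowerCase(Locale.ROOT));
    }
}
```

A real schedule would make the same decision inside its shouldFetch method,
reading the language from whatever the crawl stores for the page. Note that
a page still has to be fetched once before its language can be known.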


RE: How to Crawl Specific sites

Posted by Markus Jelsma <ma...@openindex.io>.

Hi - either by using a language detector that only allows some or all common languages spoken in India or by using a domain URL filter to restrict to the .in domain.
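
To illustrate the second option, here is a minimal Java sketch of the
host-suffix test that a domain-based URL filter performs. It is only an
illustration under assumptions: HostSuffixFilter is a made-up name, not a
Nutch class, and in practice this job would be done by a URL filter plugin
and its configuration rather than by hand-written code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not a Nutch class: keeps URLs whose host ends in an allowed suffix. */
final class HostSuffixFilter {

    // Allowed suffixes without a leading dot, in lowercase, e.g. "in".
    private final Set<String> allowedSuffixes;

    HostSuffixFilter(Set<String> allowedSuffixes) {
        this.allowedSuffixes = allowedSuffixes;
    }

    /** Returns true if the URL's host equals an allowed suffix or ends with "." + suffix. */
    boolean accepts(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return false; // unparseable URLs are rejected
        }
        if (host == null) {
            return false; // no host component, e.g. a relative URL
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String suffix : allowedSuffixes) {
            if (host.equals(suffix) || host.endsWith("." + suffix)) {
                return true;
            }
        }
        return false;
    }
}
```

With the single suffix "in", http://www.example.co.in/page is accepted and
http://www.example.com/page is rejected. The match is on whole host labels,
so a host such as linkedin.example.com is not accepted by accident.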
 
 