Posted to user@nutch.apache.org by Talat Uyarer <ta...@uyarer.com> on 2015/01/28 10:55:36 UTC

Nutch IRI URIs

Hi all,

Does anyone know how Nutch can handle IRI URIs?

Thanks

-- 
Talat

Re: Nutch IRI URIs

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Yep, I meant normalizer. Thanks, Seb.


> On Jan 30, 2015, at 9:16 AM, Sebastian Nagel <wa...@googlemail.com> wrote:
> 
> Hi,
> 
>> that can be done via a URL filter in Nutch,
> 
> Should be "URL normalizer", right?

Re: Nutch IRI URIs

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

> that can be done via a URL filter in Nutch,

Should be "URL normalizer", right?

I did this once by adding rules to regex-normalize.xml.
If the URLs are in a language with a limited set
of non-ASCII letters (as is the case for Turkish),
this results in about a dozen extra rules.
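
For illustration, here is a minimal Java sketch of the one-rule-per-letter idea.
It is not the content of regex-normalize.xml and does not use any Nutch API;
each Rule only mirrors the pattern/substitution pairing of such a rule file.
The class name, method names, and the list of twelve non-ASCII Turkish letters
are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one regex rule per non-ASCII Turkish letter.
final class TurkishRuleSketch {

    // A pattern/substitution pair, analogous to one normalization rule.
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(Pattern pattern, String substitution) {
            this.pattern = pattern;
            this.substitution = substitution;
        }
    }

    // Assumed letter set: C-cedilla, soft G, dotted/dotless I, O-umlaut,
    // S-cedilla, U-umlaut, each in upper and lower case (12 letters).
    static final String TURKISH_LETTERS =
        "\u00C7\u00E7\u011E\u011F\u0130\u0131\u00D6\u00F6\u015E\u015F\u00DC\u00FC";

    // Build one rule per letter; the substitution is the letter's UTF-8
    // bytes written as percent-encoded octets.
    static List<Rule> buildRules() {
        List<Rule> rules = new ArrayList<>();
        for (char c : TURKISH_LETTERS.toCharArray()) {
            String letter = String.valueOf(c);
            StringBuilder encoded = new StringBuilder();
            for (byte b : letter.getBytes(StandardCharsets.UTF_8)) {
                encoded.append(String.format("%%%02X", b & 0xFF));
            }
            rules.add(new Rule(Pattern.compile(Pattern.quote(letter)),
                               encoded.toString()));
        }
        return rules;
    }

    // Apply every rule in order to the URL string.
    static String normalize(String url, List<Rule> rules) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result)
                .replaceAll(Matcher.quoteReplacement(rule.substitution));
        }
        return result;
    }

    public static void main(String[] args) {
        String url =
            "http://www.avrupaparkbahceler.com/parklarimiz.php?ilce=\u00C7atalca";
        System.out.println(normalize(url, buildRules()));
    }
}
```

A fixed rule list like this only covers the letters it names, which is why a
general IRI-to-URI normalization is the better default.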

But, in general, normalization of IRIs to URIs should be done
by default. Could you open a Jira issue for this?

Internally (as keys in CrawlDb, segments, web table)
there should be only pure ASCII URIs.
Cf. NUTCH-1321 [1] and NUTCH-1708 [2].

Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-1321
[2] https://issues.apache.org/jira/browse/NUTCH-1708?focusedCommentId=13968762&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13968762





Re: Nutch IRI URIs

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thanks Talat, good question. So you want the URLs to
come through percent-encoded, like the second example?

I think that can be done via a URL filter in Nutch, or via the
parser (parse-tika by default, so you are subject to the
outlinks it extracts). Even then, I believe you could still change
the outlinks with a filter.

Does that help?

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++








Re: Nutch IRI URIs

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi Chris,

IRIs extend URIs by allowing characters from the Universal Character Set,
whereas URIs are limited to ASCII, which has far fewer characters. With
HTML5, some pages have IRI outlinks such as
http://www.avrupaparkbahceler.com/parklarimiz.php?ilce=Çatalca
I realized that when Nutch extracts outlinks, it cannot normalize an IRI
to URI form. If the outlink were normalized, it would look like
http://www.avrupaparkbahceler.com/parklarimiz.php?ilce=%C3%87atalca

Do you have any idea whether Nutch can handle IRIs? I did some tests but
could not find a solution. If there is no support, IMHO we
should add IRI support to urlnormalizer-basic. Wdyt?
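
To make the expected conversion concrete, here is a minimal sketch using only
the JDK class java.net.URI, whose toASCIIString() method percent-encodes
non-ASCII characters as UTF-8 octets. It is not Nutch code and not how
urlnormalizer-basic works; the class and method names are hypothetical.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: converts an IRI to an ASCII-only URI.
final class IriToUriSketch {

    static String toAsciiUri(String iri) {
        // URI.create parses the string, accepting non-ASCII characters;
        // toASCIIString() then encodes each of them as %XX UTF-8 octets.
        // Throws IllegalArgumentException if the string is not a valid URI.
        return URI.create(iri).toASCIIString();
    }

    public static void main(String[] args) {
        // The C-cedilla is written as a Unicode escape so the source file
        // stays ASCII regardless of compiler encoding settings.
        String iri =
            "http://www.avrupaparkbahceler.com/parklarimiz.php?ilce=\u00C7atalca";
        System.out.println(toAsciiUri(iri));
        // prints http://www.avrupaparkbahceler.com/parklarimiz.php?ilce=%C3%87atalca
    }
}
```

This sketch only covers non-ASCII characters in the path and query.
Internationalized host names would need separate handling (for example
Punycode conversion), which is left out here.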

Talat




-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

Re: Nutch IRI URIs

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Talat,

What are these? Sorry if it's obvious, but do you have a pointer?

Cheers,
Chris
