Posted to user@nutch.apache.org by Talat Uyarer <ta...@uyarer.com> on 2015/01/28 10:55:36 UTC
Nutch IRI URIs
Hi all,
Does anyone know how Nutch can handle IRI URIs?
Thanks
--
Talat
Re: Nutch IRI URIs
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Yep, I meant normalizer. Thanks, Seb.
> On Jan 30, 2015, at 9:16 AM, Sebastian Nagel <wa...@googlemail.com> wrote:
>
> Hi,
>
>> that can be done via a URL filter in Nutch,
>
> Should be "URL normalizer", right?
Re: Nutch IRI URIs
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
> that can be done via a URL filter in Nutch,
Should be "URL normalizer", right?
I did this once by adding rules to regex-normalize.xml.
If the URLs are in a language with a limited set
of non-ASCII letters (as is the case for Turkish),
this results in about a dozen extra rules.
But in general, normalization of IRIs to URIs should be done
by default. Could you open a Jira issue for this?
Internally (as keys in CrawlDb, segments, web table)
there should be only pure ASCII URIs.
Cf. NUTCH-1321 [1] and NUTCH-1708 [2].
Sebastian
[1] https://issues.apache.org/jira/browse/NUTCH-1321
[2] https://issues.apache.org/jira/browse/NUTCH-1708?focusedCommentId=13968762&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13968762
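The rule-per-letter approach described in this message can be sketched in plain Java. This is only an illustration of the idea, not the contents of an actual regex-normalize.xml: the class name TurkishIriRules and the method normalize are hypothetical, and the twelve Turkish letters listed are an assumption about which rules would be needed. Each map entry plays the role of one pattern/substitution rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: one replacement rule per non-ASCII Turkish letter.
final class TurkishIriRules {
    // Key: the letter (as a Unicode escape). Value: its UTF-8 bytes, percent-encoded.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("\u00E7", "%C3%A7"); // lowercase c with cedilla
        RULES.put("\u00C7", "%C3%87"); // uppercase C with cedilla
        RULES.put("\u011F", "%C4%9F"); // lowercase g with breve
        RULES.put("\u011E", "%C4%9E"); // uppercase G with breve
        RULES.put("\u0131", "%C4%B1"); // lowercase dotless i
        RULES.put("\u0130", "%C4%B0"); // uppercase I with dot above
        RULES.put("\u00F6", "%C3%B6"); // lowercase o with diaeresis
        RULES.put("\u00D6", "%C3%96"); // uppercase O with diaeresis
        RULES.put("\u015F", "%C5%9F"); // lowercase s with cedilla
        RULES.put("\u015E", "%C5%9E"); // uppercase S with cedilla
        RULES.put("\u00FC", "%C3%BC"); // lowercase u with diaeresis
        RULES.put("\u00DC", "%C3%9C"); // uppercase U with diaeresis
    }

    // Applies every rule in turn; characters without a rule pass through unchanged.
    static String normalize(String url) {
        String result = url;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replace(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints http://www.avrupaparkbahceler.com/parklarimiz.php?ilce=%C3%87atalca
        System.out.println(normalize(
            "http://www.avrupaparkbahceler.com/parklarimiz.php?ilce=\u00C7atalca"));
    }
}
```

A fixed table like this only covers the letters it lists, which is why it suits a single-language crawl and not IRIs in general.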
2015-01-29 21:59 GMT+01:00 Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov>:
Re: Nutch IRI URIs
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thanks Talat, good question. So what you want is for the URLs to
actually come through percent-encoded, like the 2nd example?
I think that can be done via a URL filter in Nutch, or also via the
parser (which by default is parse-tika, so you are subject to the
outlinks it extracts). Even so, I believe you could always change the
outlinks with a filter as well.
Does that help?
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Talat Uyarer <ta...@uyarer.com>
Reply-To: "user@nutch.apache.org" <us...@nutch.apache.org>
Date: Wednesday, January 28, 2015 at 10:24 PM
To: "user@nutch.apache.org" <us...@nutch.apache.org>
Subject: Re: Nutch IRI URIs
Re: Nutch IRI URIs
Posted by Talat Uyarer <ta...@uyarer.com>.
Hi Chris,
IRIs extend URIs by using the Universal Character Set, whereas URIs
are limited to ASCII, which has far fewer characters. With HTML5, some
pages have IRI outlinks such as
http://www.avrupaparkbahceler.com/parklarimiz.php?ilce=Çatalca
I noticed that when Nutch extracts outlinks, it cannot normalize IRIs
to URI form. If the outlink were normalized, it would look like
http://www.avrupaparkbahceler.com/parklarimiz.php?ilce=%C3%87atalca
Do you know whether Nutch can handle IRI URLs? I did some tests but
could not find a solution. If there is no support, IMHO we
should add IRI support to urlnormalizer-basic. WDYT?
Talat
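The conversion in the example above (Çatalca to %C3%87atalca) can be sketched as follows. This is a minimal illustration under stated assumptions, not how urlnormalizer-basic works: the class IriToUri and the method toAsciiUri are hypothetical names, the input is assumed to be an already well-formed IRI, and only non-ASCII characters are touched. Non-ASCII host names would need separate handling (for example Punycode), which this sketch does not attempt.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: percent-encodes every non-ASCII character as UTF-8.
final class IriToUri {
    private static final String HEX = "0123456789ABCDEF";

    static String toAsciiUri(String iri) {
        StringBuilder out = new StringBuilder(iri.length() + 16);
        for (int i = 0; i < iri.length(); ) {
            int cp = iri.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                // ASCII (including existing %XX escapes) is copied unchanged.
                out.append((char) cp);
            } else {
                // Encode the code point as UTF-8 and emit one %XX per byte.
                byte[] utf8 = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    int v = b & 0xFF;
                    out.append('%')
                       .append(HEX.charAt(v >> 4))
                       .append(HEX.charAt(v & 0x0F));
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints http://www.avrupaparkbahceler.com/parklarimiz.php?ilce=%C3%87atalca
        System.out.println(toAsciiUri(
            "http://www.avrupaparkbahceler.com/parklarimiz.php?ilce=\u00C7atalca"));
    }
}
```

Because ASCII is passed through as-is, running the method on an already normalized URL returns it unchanged, so applying it twice is harmless.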
2015-01-29 8:05 GMT+02:00 Mattmann, Chris A (3980)
<ch...@jpl.nasa.gov>:
--
Talat UYARER
Websitesi: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
Re: Nutch IRI URIs
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Talat,
What are these? I’m sorry but do you have a pointer (sorry if it’s
obvious).
Cheers,
Chris
-----Original Message-----
From: Talat Uyarer <ta...@uyarer.com>
Reply-To: "user@nutch.apache.org" <us...@nutch.apache.org>
Date: Wednesday, January 28, 2015 at 1:55 AM
To: "user@nutch.apache.org" <us...@nutch.apache.org>
Subject: Nutch IRI URIs