Posted to user@tika.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/09/12 16:58:57 UTC

Resolving of relative URL's

Hi,

Since TIKA-287, all relative URLs are resolved to absolute URLs regardless of
whether a base element is present. This is not always the desired behaviour.

Would it be possible to use some setting to instruct the parser not to resolve
URLs if the base element doesn't exist or does not have an href attribute
with a valid absolute URL?


Thanks,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Resolving of relative URL's

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Sep 19, 2011 at 10:56 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> There are now several cases known to us where we would like to control URL
> resolving. All cases share one similarity, URL's being relative in the
> original source. How could we instruct the parser or modify the code to do so?

I guess we could make the URL resolution mechanism pluggable.

But I still don't see how else you'd resolve relative URLs than what's
now being done in Tika's HtmlHandler.resolve() method.

Generally speaking, problems like the recursive URL you
mentioned should be avoided above the level of URL resolution. For
example, your crawler would face the exact same problem when
encountering, say, a dynamic calendar web site with links to the next or
previous day. Such an infinite URL space is perfectly valid, so no
resolution mechanism could prevent the crawler from entering such a
trap. Instead, the crawler should employ heuristics like a maximum
recursion depth to avoid such problems.
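
A maximum recursion depth could be sketched roughly as follows. This is a minimal illustration, not Nutch or Tika code: the class name DepthLimitedCrawl, the crawl() method and the outlinks function are all hypothetical, and the outlinks function stands in for whatever fetch-and-parse step the crawler really uses.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: breadth-first crawl that stops expanding pages
// once they are maxDepth hops away from the seed.
final class DepthLimitedCrawl {

    // Frontier entry: a URL plus its hop count from the seed.
    private static final class Entry {
        final URI url;
        final int depth;

        Entry(URI url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    static Set<URI> crawl(URI seed, int maxDepth, Function<URI, List<URI>> outlinks) {
        Set<URI> visited = new LinkedHashSet<>();
        Deque<Entry> frontier = new ArrayDeque<>();
        frontier.add(new Entry(seed, 0));
        while (!frontier.isEmpty()) {
            Entry entry = frontier.poll();
            if (!visited.add(entry.url)) {
                continue; // already seen this URL
            }
            if (entry.depth >= maxDepth) {
                continue; // depth cap reached: record the page, do not follow its links
            }
            for (URI link : outlinks.apply(entry.url)) {
                frontier.add(new Entry(link, entry.depth + 1));
            }
        }
        return visited;
    }
}
```

With a cap like this, an infinite URL space still gets entered, but only maxDepth levels deep.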

BR,

Jukka Zitting

Re: Resolving of relative URL's

Posted by Markus Jelsma <ma...@openindex.io>.

Jukka and others,

There are now several cases known to us where we would like to control URL
resolution. All cases share one similarity: the URLs are relative in the
original source. How could we instruct the parser, or modify the code, to do so?

Right now we need to come up with regular expressions to detect commonalities 
in URI segments and throw them away.
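
As an illustration of the kind of expression meant here, the sketch below flags URLs whose path contains a segment immediately followed by itself. The class and method names are hypothetical and the pattern is only one possible heuristic. It will also flag legitimate URLs that happen to repeat a segment.

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical sketch: detect a path segment that is directly repeated,
// e.g. /content/wrong-link/wrong-link/
final class RepeatedSegmentFilter {

    // "/" + segment + "/" + same segment, followed by "/" or end of path.
    private static final Pattern REPEATED = Pattern.compile("/([^/]+)/\\1(?:/|$)");

    static boolean looksLikeTrap(String url) {
        try {
            String path = URI.create(url).getPath();
            return path != null && REPEATED.matcher(path).find();
        } catch (IllegalArgumentException e) {
            return false; // not parseable as a URI; leave the decision to other filters
        }
    }
}
```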

Thanks

> Hi,
> 
> On Mon, Sep 12, 2011 at 6:00 PM, Markus Jelsma
> 
> <ma...@openindex.io> wrote:
> > Yes! Nutch extracts all outlinks but there is a tedious crawler trap
> > regarding to self-referring relative URL's. Consider
> > http://example.org/content/ with a list of relative links (menu on each
> > page) of which one or more is actually incorrect:
> > 
> > ../more-content/
> > ../other-content/
> > wrong-link/
> > ../even-more/content/
> > 
> > For pages without base href the wrong-link/ is resolved to
> > http://example.org/content/wrong-link/. The new page also contains the
> > same url list as above so the next wrong link is resolved as
> > http://example.org/content/wrong-link/wrong-link/......
> > 
> > An endless nightmare for a crawler :)
> 
> How would not resolving the links in Tika help in this case? To crawl
> the site, the crawler would in any case have to resolve the links, and
> come up with the exact same resolved URLs.
> 
> BR,
> 
> Jukka Zitting

Re: Resolving of relative URL's

Posted by Markus Jelsma <ma...@openindex.io>.


On Monday 12 September 2011 18:08:50 Jukka Zitting wrote:
> > For pages without base href the wrong-link/ is resolved to
> > http://example.org/content/wrong-link/. The new page also contains the
> > same url list as above so the next wrong link is resolved as
> > http://example.org/content/wrong-link/wrong-link/......
> > 
> > An endless nightmare for a crawler :)
> 
> How would not resolving the links in Tika help in this case? To crawl
> the site, the crawler would in any case have to resolve the links, and
> come up with the exact same resolved URLs.
> 

I could choose not to collect those relative URLs as outlinks. Right now I
cannot determine whether a URL was originally a relative URL.
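
If the raw href values were available before resolution (which is exactly what is missing at the moment), the check could look something like the sketch below. OutlinkFilter and its methods are hypothetical names, and the code uses only java.net.URI, not any Tika or Nutch API.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep only hrefs that were absolute in the source.
final class OutlinkFilter {

    // A URI is absolute when it has a scheme, e.g. "http:".
    static boolean isAbsolute(String rawHref) {
        try {
            return new URI(rawHref.trim()).isAbsolute();
        } catch (URISyntaxException e) {
            return false; // unparseable hrefs are treated as not collectable
        }
    }

    static List<String> keepAbsolute(List<String> rawHrefs) {
        List<String> kept = new ArrayList<>();
        for (String href : rawHrefs) {
            if (isAbsolute(href)) {
                kept.add(href);
            }
        }
        return kept;
    }
}
```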

> BR,
> 
> Jukka Zitting


Re: Resolving of relative URL's

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Sep 12, 2011 at 6:00 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Yes! Nutch extracts all outlinks but there is a tedious crawler trap regarding
> to self-referring relative URL's. Consider http://example.org/content/ with a
> list of relative links (menu on each page) of which one or more is actually
> incorrect:
>
> ../more-content/
> ../other-content/
> wrong-link/
> ../even-more/content/
>
> For pages without base href the wrong-link/ is resolved to
> http://example.org/content/wrong-link/. The new page also contains the same
> url list as above so the next wrong link is resolved as
> http://example.org/content/wrong-link/wrong-link/......
>
> An endless nightmare for a crawler :)

How would not resolving the links in Tika help in this case? To crawl
the site, the crawler would in any case have to resolve the links, and
come up with the exact same resolved URLs.

BR,

Jukka Zitting

Re: Resolving of relative URL's

Posted by Markus Jelsma <ma...@openindex.io>.

Hi,

On Monday 12 September 2011 17:35:49 Jukka Zitting wrote:
> Hi,
> 
> On Mon, Sep 12, 2011 at 4:58 PM, Markus Jelsma
> 
> <ma...@openindex.io> wrote:
> > Since TIKA-287 all relative URL's are resolved to absolutes regardless of
> > the presence of the base element. This is not always desired behaviour.
> 
> Can you describe a use case where that's not the desired behaviour? I
> would assume that a resolved URL is always preferred to an unresolved
> one.

Yes! Nutch extracts all outlinks, but there is a tedious crawler trap regarding
self-referring relative URLs. Consider http://example.org/content/ with a
list of relative links (a menu on each page) of which one or more is actually
incorrect:

../more-content/
../other-content/
wrong-link/
../even-more/content/

For pages without a base href, wrong-link/ is resolved to
http://example.org/content/wrong-link/. The new page also contains the same
URL list as above, so the next wrong link is resolved as
http://example.org/content/wrong-link/wrong-link/......

An endless nightmare for a crawler :)
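
The effect can be reproduced with plain java.net.URI resolution. The sketch below is only an illustration; the class and method names are made up for this example and have nothing to do with the Nutch or Tika code.

```java
import java.net.URI;

// Hypothetical sketch: resolve the same broken relative link against
// each page it leads to, the way a crawler following it would.
final class RelativeLinkTrapDemo {

    static URI follow(URI start, String relativeLink, int hops) {
        URI current = start;
        for (int i = 0; i < hops; i++) {
            current = current.resolve(relativeLink);
        }
        return current;
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.org/content/");
        // Each hop appends one more wrong-link/ segment to the path.
        for (int hops = 1; hops <= 3; hops++) {
            System.out.println(follow(page, "wrong-link/", hops));
        }
    }
}
```

The ../ links in the menu resolve back to the same parent each time, so only the link without ../ keeps growing.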

> 
> > Would it be possible to use some setting to instruct the parser not to
> > resolve URL's if the base element doesn't exist or does not have an href
> > attribute with a valid absolute URL?
> 
> Currently Tika looks at the CONTENT_LOCATION and RESOURCE_NAME_KEY
> metadata keys for the default base URL. If neither is present and
> there is no <base href=".."> element, then URLs in the document will
> not be resolved.

Hm, testing with Nutch I see that URLs are always extracted. It seems at least
one metadata key is present, although I'm not too sure. In the Nutch code an
empty org.apache.tika.metadata.Metadata object is passed to the parse()
method.

> 
> BR,
> 
> Jukka Zitting


Re: Resolving of relative URL's

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Sep 12, 2011 at 4:58 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Since TIKA-287 all relative URL's are resolved to absolutes regardless of the
> presence of the base element. This is not always desired behaviour.

Can you describe a use case where that's not the desired behaviour? I
would assume that a resolved URL is always preferred to an unresolved
one.

> Would it be possible to use some setting to instruct the parser not to resolve
> URL's if the base element doesn't exist or does not have an href attribute
> with a valid absolute URL?

Currently Tika looks at the CONTENT_LOCATION and RESOURCE_NAME_KEY
metadata keys for the default base URL. If neither is present and
there is no <base href=".."> element, then URLs in the document will
not be resolved.
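
The fallback described above can be modelled in plain Java roughly as follows. This is a sketch of the described behaviour, not Tika's actual implementation: the class and method names are hypothetical, the order in which the three candidates are tried is an assumption, and so is the requirement that a candidate be an absolute URI.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

// Hypothetical model of the base URL fallback; not Tika code.
final class BaseUrlFallback {

    // Returns the first candidate that parses as an absolute URI.
    // Assumed order: <base href>, then content location, then resource name.
    static Optional<URI> pickBase(String baseHref, String contentLocation, String resourceName) {
        for (String candidate : new String[] {baseHref, contentLocation, resourceName}) {
            if (candidate == null) {
                continue;
            }
            try {
                URI uri = new URI(candidate.trim());
                if (uri.isAbsolute()) {
                    return Optional.of(uri);
                }
            } catch (URISyntaxException ignored) {
                // not usable as a base; try the next candidate
            }
        }
        return Optional.empty();
    }

    // Without a base, the href is returned unresolved.
    static String resolve(Optional<URI> base, String href) {
        if (!base.isPresent()) {
            return href;
        }
        try {
            return base.get().resolve(new URI(href)).toString();
        } catch (URISyntaxException e) {
            return href;
        }
    }
}
```

In this model, a caller that passes no base element and neither metadata value gets every href back exactly as it appeared in the document.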

BR,

Jukka Zitting