Posted to user@nutch.apache.org by Michael Wechner <mi...@wyona.com> on 2006/08/18 11:42:26 UTC

Nutch doesn't dive deeper

Hi

I am trying to index http://ulysses.wyona.org/ but somehow it just 
indexes the homepage and doesn't seem to follow
any links. I have set "depth 3", and other sites are being crawled deeper 
without a problem, but not the Ulysses page.
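
For reference, the crawl is invoked roughly like this (the seed list and
output directory names here are just placeholders, not my actual paths):

    bin/nutch crawl urls -dir crawl -depth 3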

Has anyone had similar experiences?

Is it possible that Nutch has a problem with well-formed XHTML 
(application/xhtml+xml)?

Thanks

Michi

-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org
+41 44 272 91 61


Re: Nutch doesn't dive deeper

Posted by sami siren <ss...@gmail.com>.
2006/8/27, Chris Mattmann <ch...@jpl.nasa.gov>:
>
> Hi Sami,
>
>   I'm not sure that I agree that the entire set of mime types that you
> list below should be removed from the parse-plugins.xml default mapping.
> For instance, if you look at the current mapping file, many of the types
> below would have no other option for parsing them besides the TextParser.
> I think it makes a lot of sense to parse some of the below documents with
> the TextParser because, in fact, they are text documents.



> A LaTeX document is a plain text document.


Yes, it can contain textual content among other things. However, without
proper parsing the outcome (at least parts of it) is not something I would
like to see in search results.

> Text/css is essentially a plain text document.

Yes, the contents are most often ASCII, but is it really something one
wants to index by default?


> An rfc822 message is indeed (stripped of headers) a plain text document.


Yes, the contents are most often ASCII, but I guess they are just as often
encoded (for example as MIME) and then more or less useless in unparsed form.

> There's a careful tradeoff that must be made in terms of having a default
> config file that allows the greatest coverage of mime types that are
> available, and the handling of them with at least *one* parser, in
> contrast to not including any parser at all for a particular mime type. I
> struggled with this very issue when I initially created that file, and what
> you see in there now represents a "best guess" of mime types mapped to the
> available parsers that exist in Nutch. The other option with that file is
> that people can modify it on their own. For instance, in a domain-specific
> deployment, a user can add and remove whatever mime type to plugin mappings
> she wants from the parse-plugins.xml file: it was never meant to be
> something that was "set in stone" per se. It would be good to see some
> experiments to see what the best config set for parse-plugins.xml is.



My opinion is that we should not pretend to be able to parse something
when we really cannot. We should ship a default config that covers the
greatest set of mime types Nutch really can handle. Then again, the two
text-like document types you picked out are quite rare and not mainstream,
so enabling or disabling them probably doesn't make any difference in
search results.

--
 Sami Siren

Re: Nutch doesn't dive deeper

Posted by Michael Wechner <mi...@wyona.com>.
sami siren wrote:

> This is yet another side effect of applying the TextParser to non-plain-text
> documents, and in this particular case it falls short on namespace
> declarations. I propose that we remove the PlainText parser from at least
> the following mime types:
>
> * (default)
> application/rss+xml
> application/vnd.wap.wbxml
> application/vnd.wap.wmlc
> application/vnd.wap.wmlscriptc
> application/xhtml+xml
> application/x-latex
> application/x-netcdf
> application/x-tex
> application/x-texinfo
> application/x-troff
> application/x-troff-man
> application/x-troff-me
> application/x-troff-ms
> message/news
> message/rfc822
> text/css
> text/sgml
> text/vnd.wap.wml
> text/xml
> text/x-setext
>
> I would guess that handling of the text/xhtml+xml 


I guess you mean application/xhtml+xml (as you actually note above)

> mime type should be done with
> the html parser anyway.


yes, I would say so

Thanks

Michi

>
> -- 
> Sami Siren
>
> 2006/8/25, Michael Wechner <mi...@wyona.com>:
>
>>
>> I think the problem with XHTML files is as follows:
>>
>> 2006-08-25 16:06:11,925 WARN  parse.ParserFactory -
>> ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
>> contentType application/xhtml+xml via parse-plugins.xml, but its
>> plugin.xml file does not claim to support contentType:
>> application/xhtml+xml
>> 2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks
>> java.net.MalformedURLException: unknown protocol: xmlns
>>         at java.net.URL.<init>(URL.java:544)
>>         at java.net.URL.<init>(URL.java:434)
>>         at java.net.URL.<init>(URL.java:383)
>>
>>
>> Perhaps this could be resolved with
>>
>> http://issues.apache.org/jira/browse/NUTCH-359
>>
>> I am kind of surprised that nobody else is having this problem with
>> proper XHTML ;-)
>>
>> Thanks
>>
>> Michi
>>
>> Ken Gregoire wrote:
>>
>> > look here, it is blocking robots: http://ulysses.wyona.org/robots.txt
>> >
>> > User-agent: *
>> > Disallow: /foo/bar.html
>> >
>> > User-agent: lenya
>> > Disallow: /foo/bar.html
>> >
>> >
>> >
>> >
>> >
>> > Michael Wechner wrote:
>> >
>> >> Hi
>> >>
>> >> I am trying to index http://ulysses.wyona.org/ but somehow it just
>> >> indexes the homepage and doesn't seem to follow
>> >> any links. I have set "depth 3", and other sites are being crawled
>> >> deeper without a problem, but not the Ulysses page.
>> >>
>> >> Has anyone had similar experiences?
>> >>
>> >> Is it possible that Nutch has a problem with well-formed XHTML
>> >> (application/xhtml+xml)?
>> >>
>> >> Thanks
>> >>
>> >> Michi
>> >>
>> >
>>
>>
>> -- 
>> Michael Wechner
>> Wyona      -   Open Source Content Management   -    Apache Lenya
>> http://www.wyona.com                      http://lenya.apache.org
>> michael.wechner@wyona.com                        michi@apache.org
>> +41 44 272 91 61
>>
>>
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org
+41 44 272 91 61


Re: Nutch doesn't dive deeper

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Sami,

  I'm not sure that I agree that the entire set of mime types that you list
below should be removed from the parse-plugins.xml default mapping. For
instance, if you look at the current mapping file, many of the types below
would have no other option for parsing them besides the TextParser. I think
it makes a lot of sense to parse some of the below documents with the
TextParser because, in fact, they are text documents. A LaTeX document is a
plain text document. Text/css is essentially a plain text document. An rfc822
message is indeed (stripped of headers) a plain text document.

   There's a careful tradeoff that must be made in terms of having a default
config file that allows the greatest coverage of mime types that are
available, and the handling of them with at least *one* parser, in contrast
to not including any parser at all for a particular mime type. I struggled
with this very issue when I initially created that file, and what you see in
there now represents a "best guess" of mime types mapped to the available
parsers that exist in Nutch. The other option with that file is that people
can modify it on their own. For instance, in a domain-specific deployment, a
user can add and remove whatever mime type to plugin mappings she wants from
the parse-plugins.xml file: it was never meant to be something that was "set
in stone" per se. It would be good to see some experiments to see what the
best config set for parse-plugins.xml is.
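
For instance, disabling a type entirely is just a matter of deleting (or
commenting out) its mimeType element; the entries have roughly this shape
(a sketch from memory, so check it against the parse-plugins.xml that ships
with your Nutch version):

    <mimeType name="text/css">
        <plugin id="parse-text" />
    </mimeType>

Remove the element, and documents of that type simply get no parser.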

Thanks!

Cheers,
  Chris



On 8/27/06 12:30 AM, "sami siren" <ss...@gmail.com> wrote:

> This is yet another side effect of applying the TextParser to non-plain-text
> documents, and in this particular case it falls short on namespace
> declarations. I propose that we remove the PlainText parser from at least
> the following mime types:
> 
> * (default)
> application/rss+xml
> application/vnd.wap.wbxml
> application/vnd.wap.wmlc
> application/vnd.wap.wmlscriptc
> application/xhtml+xml
> application/x-latex
> application/x-netcdf
> application/x-tex
> application/x-texinfo
> application/x-troff
> application/x-troff-man
> application/x-troff-me
> application/x-troff-ms
> message/news
> message/rfc822
> text/css
> text/sgml
> text/vnd.wap.wml
> text/xml
> text/x-setext
> 
> I would guess that handling of the text/xhtml+xml mime type should be done
> with the html parser anyway.
> 
> --
>  Sami Siren
> 
> 2006/8/25, Michael Wechner <mi...@wyona.com>:
>> 
>> I think the problem with XHTML files is as follows:
>> 
>> 2006-08-25 16:06:11,925 WARN  parse.ParserFactory -
>> ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
>> contentType application/xhtml+xml via parse-plugins.xml, but its
>> plugin.xml file does not claim to support contentType:
>> application/xhtml+xml
>> 2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks
>> java.net.MalformedURLException: unknown protocol: xmlns
>>         at java.net.URL.<init>(URL.java:544)
>>         at java.net.URL.<init>(URL.java:434)
>>         at java.net.URL.<init>(URL.java:383)
>> 
>> 
>> Perhaps this could be resolved with
>> 
>> http://issues.apache.org/jira/browse/NUTCH-359
>> 
>> I am kind of surprised that nobody else is having this problem with
>> proper XHTML ;-)
>> 
>> Thanks
>> 
>> Michi
>> 
>> Ken Gregoire wrote:
>> 
>>> look here, it is blocking robots: http://ulysses.wyona.org/robots.txt
>>> 
>>> User-agent: *
>>> Disallow: /foo/bar.html
>>> 
>>> User-agent: lenya
>>> Disallow: /foo/bar.html
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Michael Wechner wrote:
>>> 
>>>> Hi
>>>> 
>>>> I am trying to index http://ulysses.wyona.org/ but somehow it just
>>>> indexes the homepage and doesn't seem to follow
>>>> any links. I have set "depth 3", and other sites are being crawled
>>>> deeper without a problem, but not the Ulysses page.
>>>>
>>>> Has anyone had similar experiences?
>>>>
>>>> Is it possible that Nutch has a problem with well-formed XHTML
>>>> (application/xhtml+xml)?
>>>> 
>>>> Thanks
>>>> 
>>>> Michi
>>>> 
>>> 
>> 
>> 
>> --
>> Michael Wechner
>> Wyona      -   Open Source Content Management   -    Apache Lenya
>> http://www.wyona.com                      http://lenya.apache.org
>> michael.wechner@wyona.com                        michi@apache.org
>> +41 44 272 91 61
>> 
>> 



Re: Nutch doesn't dive deeper

Posted by sami siren <ss...@gmail.com>.
This is yet another side effect of applying the TextParser to non-plain-text
documents, and in this particular case it falls short on namespace
declarations. I propose that we remove the PlainText parser from at least
the following mime types:

* (default)
application/rss+xml
application/vnd.wap.wbxml
application/vnd.wap.wmlc
application/vnd.wap.wmlscriptc
application/xhtml+xml
application/x-latex
application/x-netcdf
application/x-tex
application/x-texinfo
application/x-troff
application/x-troff-man
application/x-troff-me
application/x-troff-ms
message/news
message/rfc822
text/css
text/sgml
text/vnd.wap.wml
text/xml
text/x-setext

I would guess that handling of the text/xhtml+xml mime type should be done
with the html parser anyway.
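
In parse-plugins.xml terms that would mean pointing the type at the
parse-html plugin instead of parse-text, along these lines (a sketch,
assuming the schema of the current mapping file):

    <mimeType name="application/xhtml+xml">
        <plugin id="parse-html" />
    </mimeType>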

--
 Sami Siren

2006/8/25, Michael Wechner <mi...@wyona.com>:
>
> I think the problem with XHTML files is as follows:
>
> 2006-08-25 16:06:11,925 WARN  parse.ParserFactory -
> ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
> contentType application/xhtml+xml via parse-plugins.xml, but its
> plugin.xml file does not claim to support contentType:
> application/xhtml+xml
> 2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks
> java.net.MalformedURLException: unknown protocol: xmlns
>         at java.net.URL.<init>(URL.java:544)
>         at java.net.URL.<init>(URL.java:434)
>         at java.net.URL.<init>(URL.java:383)
>
>
> Perhaps this could be resolved with
>
> http://issues.apache.org/jira/browse/NUTCH-359
>
> I am kind of surprised that nobody else is having this problem with
> proper XHTML ;-)
>
> Thanks
>
> Michi
>
> Ken Gregoire wrote:
>
> > look here, it is blocking robots: http://ulysses.wyona.org/robots.txt
> >
> > User-agent: *
> > Disallow: /foo/bar.html
> >
> > User-agent: lenya
> > Disallow: /foo/bar.html
> >
> >
> >
> >
> >
> > Michael Wechner wrote:
> >
> >> Hi
> >>
> >> I am trying to index http://ulysses.wyona.org/ but somehow it just
> >> indexes the homepage and doesn't seem to follow
> >> any links. I have set "depth 3", and other sites are being crawled
> >> deeper without a problem, but not the Ulysses page.
> >>
> >> Has anyone had similar experiences?
> >>
> >> Is it possible that Nutch has a problem with well-formed XHTML
> >> (application/xhtml+xml)?
> >>
> >> Thanks
> >>
> >> Michi
> >>
> >
>
>
> --
> Michael Wechner
> Wyona      -   Open Source Content Management   -    Apache Lenya
> http://www.wyona.com                      http://lenya.apache.org
> michael.wechner@wyona.com                        michi@apache.org
> +41 44 272 91 61
>
>

Re: Nutch doesn't dive deeper

Posted by Michael Wechner <mi...@wyona.com>.
I think the problem with XHTML files is as follows:

2006-08-25 16:06:11,925 WARN  parse.ParserFactory - 
ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to 
contentType application/xhtml+xml via parse-plugins.xml, but its 
plugin.xml file does not claim to support contentType: application/xhtml+xml
2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: xmlns
        at java.net.URL.<init>(URL.java:544)
        at java.net.URL.<init>(URL.java:434)
        at java.net.URL.<init>(URL.java:383)
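
What seems to happen is that the outlink extractor's URL pattern matches
the xmlns namespace declaration in the raw markup and hands it to
java.net.URL, which treats everything before the first colon as the
protocol. A minimal reproduction (my own sketch, not Nutch code; the exact
token matched is an assumption):

    import java.net.MalformedURLException;
    import java.net.URL;

    public class XmlnsDemo {
        public static void main(String[] args) {
            try {
                // "xmlns" is taken as the protocol, and no handler exists for it
                new URL("xmlns:xlink=\"http://www.w3.org/1999/xlink\"");
            } catch (MalformedURLException e) {
                System.out.println(e.getMessage()); // unknown protocol: xmlns
            }
        }
    }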


Perhaps this could be resolved with

http://issues.apache.org/jira/browse/NUTCH-359

I am kind of surprised that nobody else is having this problem with 
proper XHTML ;-)

Thanks

Michi

Ken Gregoire wrote:

> look here, it is blocking robots: http://ulysses.wyona.org/robots.txt
>
> User-agent: *
> Disallow: /foo/bar.html
>
> User-agent: lenya
> Disallow: /foo/bar.html
>
>
>
>
>
> Michael Wechner wrote:
>
>> Hi
>>
>> I am trying to index http://ulysses.wyona.org/ but somehow it just
>> indexes the homepage and doesn't seem to follow
>> any links. I have set "depth 3", and other sites are being crawled
>> deeper without a problem, but not the Ulysses page.
>>
>> Has anyone had similar experiences?
>>
>> Is it possible that Nutch has a problem with well-formed XHTML
>> (application/xhtml+xml)?
>>
>> Thanks
>>
>> Michi
>>
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org
+41 44 272 91 61


Re: Nutch doesn't dive deeper

Posted by Michael Wechner <mi...@wyona.com>.
Ken Gregoire wrote:

> look here, it is blocking robots: http://ulysses.wyona.org/robots.txt


Right, but shouldn't it just block the URL /foo/bar.html?

Maybe I completely misunderstand how a robots.txt should be written, or is
it possible that Nutch doesn't really parse "Disallow"?
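
As far as I understand the convention, "Disallow: /foo/bar.html" should
exclude only that one path; blocking a whole site would take a catch-all
rule like this (a generic example, not what the server actually serves):

    User-agent: *
    Disallow: /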

Also, I have commented out the Disallow in http://ulysses.wyona.org/robots.txt
but get the same result, i.e. just one page crawled.

So I am not sure it has anything to do with the robots.txt.

Thanks

Michi

>
> User-agent: *
> Disallow: /foo/bar.html
>
> User-agent: lenya
> Disallow: /foo/bar.html
>
>
>
>
>
> Michael Wechner wrote:
>
>> Hi
>>
>> I am trying to index http://ulysses.wyona.org/ but somehow it just
>> indexes the homepage and doesn't seem to follow
>> any links. I have set "depth 3", and other sites are being crawled
>> deeper without a problem, but not the Ulysses page.
>>
>> Has anyone had similar experiences?
>>
>> Is it possible that Nutch has a problem with well-formed XHTML
>> (application/xhtml+xml)?
>>
>> Thanks
>>
>> Michi
>>
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org
+41 44 272 91 61


Re: Nutch doesn't dive deeper

Posted by Ken Gregoire <ke...@gordiandata.net>.
look here, it is blocking robots: http://ulysses.wyona.org/robots.txt

User-agent: *
Disallow: /foo/bar.html

User-agent: lenya
Disallow: /foo/bar.html





Michael Wechner wrote:

> Hi
>
> I am trying to index http://ulysses.wyona.org/ but somehow it just
> indexes the homepage and doesn't seem to follow
> any links. I have set "depth 3", and other sites are being crawled
> deeper without a problem, but not the Ulysses page.
>
> Has anyone had similar experiences?
>
> Is it possible that Nutch has a problem with well-formed XHTML
> (application/xhtml+xml)?
>
> Thanks
>
> Michi
>