Posted to dev@nutch.apache.org by Jérôme Charron <je...@gmail.com> on 2006/05/02 16:13:52 UTC

Re: Content-Type inconsistency?

> I'm not so sure.  When crawling Apache we had trouble with this feature.
>   Some HTML files that had an XML header and the server identified as
> "text/html" Nutch decided to treat as XML, not HTML.

Yes, the current version of the mime-type resolver is a crude one.
XML, HTML, RSS and other XML-based files are not always correctly identified
(this problem is well known, and causes trouble, for instance, with RSS feeds
that return a text/xml content-type).

> We had to turn off
> the guessing of content types to index Apache correctly.

Instead of turning off the guessing of content types, you only need to
remove the magic for xml in mime-types.xml.
In the new version (based on freedesktop) that has been sleeping on my disk
for a while, I think such problems are solved, since it introduces a lot of
information not included in the current version:
a hierarchy between content-types (text/html is a subclass of text/xml), a
way to express complex magic clauses, and so on.
For instance, it can now correctly identify RSS documents: generally RSS
feeds are associated with a generic text/xml content-type, and with the
current version we cannot identify them => they fall back to the generic
parse-text parser.
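
A minimal sketch of how a type hierarchy could constrain magic guessing. It is written in Python for brevity, not in Nutch's Java, and it is not the freedesktop-based resolver mentioned above. The names PARENT, MAGIC, is_subtype and resolve are hypothetical, and so are the magic patterns and the Atom entry; only the "text/html is a subclass of text/xml" relation comes from the message. The idea shown: a magic match may only refine the declared type to one of its subtypes, so a text/xml feed can become application/rss+xml, while a text/html page with an XML header stays text/html.

```python
import re

# Hypothetical hierarchy: child type -> parent type. The text/html entry
# mirrors the example in the message; the feed entries are assumptions.
PARENT = {
    "text/html": "text/xml",
    "application/rss+xml": "text/xml",
    "application/atom+xml": "text/xml",
}

# Hypothetical magic rules, most specific first. Each pattern is searched
# in the first bytes of the document.
MAGIC = [
    ("application/rss+xml", re.compile(rb"<rss[\s>]")),
    ("application/atom+xml", re.compile(rb"<feed[\s>]")),
    ("text/html", re.compile(rb"<html[\s>]", re.I)),
    ("text/xml", re.compile(rb"<\?xml\s")),
]


def is_subtype(candidate, ancestor):
    """Return True if candidate equals ancestor or descends from it."""
    while candidate is not None:
        if candidate == ancestor:
            return True
        candidate = PARENT.get(candidate)
    return False


def resolve(declared, content, sniff_bytes=1024):
    """Refine the declared type using magic, but only towards a subtype."""
    head = content[:sniff_bytes]
    for mime, pattern in MAGIC:
        if pattern.search(head) and is_subtype(mime, declared):
            return mime
    # No compatible magic match: trust the server's declaration.
    return declared


feed = b'<?xml version="1.0"?><rss version="2.0"><channel/></rss>'
page = b'<?xml version="1.0"?><!-- no html tag in the sniffed bytes -->'
print(resolve("text/xml", feed))
print(resolve("text/html", page))
```

With these assumed rules, the feed declared as text/xml is refined to application/rss+xml, and the page declared as text/html is left alone because text/xml is not a subtype of text/html.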


>   I think we
> shouldn't aim to guess things any more than a browser does.  If browsers
> require standards compliance, then our lives will be simpler.

Yes, but actually Nutch cannot act as a browser.
For instance, with RSS: a browser knows that a URL is an RSS feed because
there is a <link rel="alternate" type="..."/>
with the correct content-type (application/rss+xml) in the referring HTML
page.
Nutch doesn't keep such information for guessing a content-type (it could
be a good thing to add), so it must find the content-type from the URL
alone (without any context).
Since servers generally return the generic text/xml content-type, the only
way to know it is an RSS document is to use magic content-type
guessing (note that many browsers don't identify it as an RSS
document, but simply as a generic XML file).
One more thing: there is actually no officially registered
content-type for RSS. So we can only guess from the document content
that it is an RSS document.
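
A rough sketch of the "keep the context" idea, again in Python and not taken from Nutch: while parsing an HTML page, record the type attribute of each <link rel="alternate"> so that it can be used as a hint when the linked URL is fetched later. The class FeedLinkCollector, the function collect_type_hints and the example.org URLs are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FeedLinkCollector(HTMLParser):
    """Collects {absolute URL: declared type} from <link rel="alternate">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hints = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        mime = attributes.get("type")
        if "alternate" in rel and href and mime:
            self.hints[urljoin(self.base_url, href)] = mime.strip().lower()


def collect_type_hints(base_url, html):
    """Return the content-type hints found in one referring HTML page."""
    collector = FeedLinkCollector(base_url)
    collector.feed(html)
    return collector.hints


page = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" href="/feed.xml"/>'
    '<link rel="stylesheet" type="text/css" href="/site.css"/>'
    '</head><body></body></html>'
)
print(collect_type_hints("http://example.org/index.html", page))
```

A crawler could store these hints next to the outlink and consult them when the server answers with a generic text/xml; how that would fit into Nutch's own data structures is left open here.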


Jérôme

Re: Content-Type inconsistency?

Posted by Jérôme Charron <je...@gmail.com>.
> Shouldn't RSS feeds declare the correct content-type?

Yes, they should, but generally they don't (a lot of RSS feeds return a
text/xml content-type).
I don't know why. Perhaps because application/rss+xml is not registered with
IANA (http://www.iana.org/assignments/media-types/application/).
In practice, many webmasters are not aware of this, since the main entry
points for their feeds are HTML pages
that reference them (with the correct content-type in the HTML link tag) or
feed aggregators that simply try to parse the feed content (without caring
about the protocol mime-type) => their feeds are viewable and usable by
end users.

Furthermore, I see this "feature" as an extension of the cache mechanism.
The cache provides access to a document that no longer exists or is
simply temporarily unavailable. So why not give access via the cache to a
document with a wrong protocol content-type but that was correctly
identified / parsed / indexed by Nutch?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Content-Type inconsistency?

Posted by Doug Cutting <cu...@apache.org>.
Jérôme Charron wrote:
>> We had to turn off
>> the guessing of content types to index Apache correctly.
> 
> Instead of turning off the guessing of content types, you only need to
> remove the magic for xml in mime-types.xml

Perhaps that would have worked also, but, with Apache, simply trusting 
the declared Content-Type seems to work quite well.

>> I think we
>> shouldn't aim to guess things any more than a browser does.  If browsers
>> require standards compliance, then our lives will be simpler.
> 
> Yes, but actually Nutch cannot act as a browser.
> For instance, with RSS: a browser knows that a URL is an RSS feed because
> there is a <link rel="alternate" type="..."/>
> with the correct content-type (application/rss+xml) in the referring HTML
> page.
> Nutch doesn't keep such information for guessing a content-type (it could
> be a good thing to add), so it must find the content-type from the URL
> alone (without any context).

Shouldn't RSS feeds declare the correct content-type?

http://feedvalidator.org/docs/warning/NonSpecificMediaType.html

I don't see that context should be required for feeds.

Doug