Posted to dev@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/02/29 14:57:46 UTC
NUTCH-1273
Hi,
In the process of addressing NUTCH-1273 [0] I ran into a small problem with
some Tika classes.
The patch I attached to the issue currently upgrades usage of external
dependencies bar Tika. You will therefore still see javac flagging up
problems within o.a.n.util.MimeUtil#autoResolveContentType [1]
    // if returned null, or if it's the default type then try url resolution
    if (type == null
        || (type != null && type.getName().equals(MimeTypes.OCTET_STREAM))) {
      // If no mime-type header, or cannot find a corresponding registered
      // mime-type, then guess a mime-type from the url pattern
      type = this.mimeTypes.getMimeType(url) != null
          ? this.mimeTypes.getMimeType(url)
          : type;
    }
Initially I tried changing the above to:
    // if returned null, or if it's the default type then try url resolution
    if (type == null
        || (type != null && type.getName().equals(MimeTypes.OCTET_STREAM))) {
      // If no mime-type header, or cannot find a corresponding registered
      // mime-type, then guess a mime-type from the url pattern
      String mt = tika.detect(url);
      type = mt != null ? mt : type;
    }
However, after compiling I get:
[javac] MimeUtil.java:165: incompatible types
[javac] found   : java.lang.Object&java.io.Serializable&java.lang.Comparable<? extends java.lang.Object&java.io.Serializable&java.lang.Comparable<?>>
[javac] required: org.apache.tika.mime.MimeType
[javac] type = mt != null ? mt : type;
[javac] ^
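This is because Tika.detect(URL) returns the MIME type as a String (the
detectors themselves return a MediaType), so the conditional expression mixes
a String with a MimeType and its result cannot be assigned to a MimeType.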
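To make the mismatch concrete, here is a minimal sketch of the same pattern. StubMimeType and TernaryTypeDemo are hypothetical stand-ins written for this illustration and are not Tika classes. With one String branch and one non-String branch, the conditional expression takes a common supertype of the two, which is why it cannot be assigned to the MimeType variable.

```java
// Hypothetical stand-in for org.apache.tika.mime.MimeType; illustration only.
final class StubMimeType {
    private final String name;

    StubMimeType(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

final class TernaryTypeDemo {
    // One branch is a String, the other a StubMimeType. The expression's type
    // is a common supertype of both, so it fits Object but not StubMimeType:
    //   StubMimeType t = detected != null ? detected : fallback; // does not compile
    static Object mixedBranches(String detected, StubMimeType fallback) {
        return detected != null ? detected : fallback;
    }

    // Both branches are Strings, so the expression is a String and compiles.
    static String sameBranches(String detected, StubMimeType fallback) {
        return detected != null ? detected : fallback.getName();
    }

    public static void main(String[] args) {
        StubMimeType fallback = new StubMimeType("application/octet-stream");
        System.out.println(mixedBranches("text/html", fallback).getClass().getSimpleName());
        System.out.println(sameBranches(null, fallback));
    }
}
```

The second method corresponds roughly to keeping everything as a MIME type String; whether that suits MimeUtil is the open question here.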
I went to user@tika and the feedback I got was
* Switch your code to use a mimetype String
* Switch your code to use MediaType rather than MimeType, and call
DefaultDetector directly (rather than using the Tika facade class)
* If you get back a String (not null) for the mimetype, create a MimeType
object for it.
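As a rough sketch of the third suggestion (turn a non-null String into a type object), the control flow could look like the following. Everything here is an assumption made for illustration: SketchMimeType, SketchRegistry, and detectFromUrl are hypothetical stand-ins, not Tika classes or methods, and the real lookup would go through Tika's own MimeTypes registry, whose exact call is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for org.apache.tika.mime.MimeType.
final class SketchMimeType {
    private final String name;

    SketchMimeType(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// Hypothetical stand-in for a registry that maps MIME type names to objects.
final class SketchRegistry {
    private final Map<String, SketchMimeType> byName = new HashMap<>();

    SketchMimeType forName(String name) {
        return byName.computeIfAbsent(name, SketchMimeType::new);
    }
}

final class UrlFallbackSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Hypothetical stand-in for tika.detect(url): looks at the extension only
    // and returns null when it has no guess.
    static String detectFromUrl(String url) {
        if (url.endsWith(".html")) return "text/html";
        if (url.endsWith(".pdf")) return "application/pdf";
        return null;
    }

    // If the type is missing or generic, guess from the URL and convert the
    // resulting String into a type object; otherwise keep the original type.
    static SketchMimeType resolve(SketchMimeType type, String url, SketchRegistry registry) {
        if (type == null || OCTET_STREAM.equals(type.getName())) {
            String mt = detectFromUrl(url);
            if (mt != null) {
                type = registry.forName(mt);
            }
        }
        return type;
    }

    public static void main(String[] args) {
        SketchRegistry registry = new SketchRegistry();
        SketchMimeType generic = registry.forName(OCTET_STREAM);
        System.out.println(resolve(generic, "http://example.com/a.html", registry).getName());
        System.out.println(resolve(generic, "http://example.com/a.bin", registry).getName());
    }
}
```

The point of this shape is that the variable stays a type object throughout, so callers of autoResolveContentType would not need to change.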
So I suppose my question is: what do we want to do?
Thanks
[0] https://issues.apache.org/jira/browse/NUTCH-1273
[1] http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/util/MimeUtil.java?view=markup
--
*Lewis*
Re: NUTCH-1273
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Markus,
Well, it would appear that the method I mention is the only one which still
uses instances of the deprecated API. I notice that we support Tika core &
parsers 0.10 in Nutchgora and 1.0 core in trunk. I'll probably just re-open
the relevant issues, assign Nutchgora to them, and upgrade to 1.0,
removing the dependency on tika-parsers in the process (if possible). I notice
that Tika 1.1 is _not_ available on Maven Central yet, so this is really an
incremental move towards supporting 1.1 as well.
Regarding the code example and options I initially proposed, do you have
any comments on the best route to go down?
Thanks
Lewis
On Wed, Feb 29, 2012 at 2:32 PM, Markus Jelsma <ma...@openindex.io> wrote:
> Hmm, I modified the Content and MimeUtil classes to use the new .detect API
> in NUTCH-1230. I was under the impression all deprecated calls were replaced.
>
> --
> Markus Jelsma - CTO - Openindex
>
--
*Lewis*
Re: NUTCH-1273
Posted by Markus Jelsma <ma...@openindex.io>.
Hmm, I modified the Content and MimeUtil classes to use the new .detect API in
NUTCH-1230. I was under the impression all deprecated calls were replaced.
--
Markus Jelsma - CTO - Openindex