Posted to dev@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/02/29 14:57:46 UTC

NUTCH-1273

Hi,

In the process of addressing NUTCH-1273 [0] I ran into a small problem with
some Tika classes.
The patch I attached to the issue currently upgrades usage of external
dependencies bar Tika. You will therefore still see javac flagging up
problems within o.a.n.util.MimeUtil#autoResolveContentType [1]

    // if returned null, or if it's the default type then try url resolution
    if (type == null
        || (type != null && type.getName().equals(MimeTypes.OCTET_STREAM))) {
      // If no mime-type header, or cannot find a corresponding registered
      // mime-type, then guess a mime-type from the url pattern
      type = this.mimeTypes.getMimeType(url) != null
          ? this.mimeTypes.getMimeType(url)
          : type;
    }

Initially I tried changing the above to

    // if returned null, or if it's the default type then try url resolution
    if (type == null
        || (type != null && type.getName().equals(MimeTypes.OCTET_STREAM))) {

      // If no mime-type header, or cannot find a corresponding registered
      // mime-type, then guess a mime-type from the url pattern
      String mt = tika.detect(url);

      type = mt != null ? mt : type;
    }

However after compiling I get

    [javac] MimeUtil.java:165: incompatible types
    [javac] found   : java.lang.Object&java.io.Serializable&java.lang.Comparable<? extends java.lang.Object&java.io.Serializable&java.lang.Comparable<?>>
    [javac] required: org.apache.tika.mime.MimeType
    [javac]       type = mt != null ? mt : type;
    [javac]                                ^

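This is because Tika.detect(URL) returns the mimetype as a String and the
detectors themselves return a MediaType. Neither is a MimeType, so the
conditional expression mixes a String with a MimeType and its result cannot
be assigned back to type.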

I went to user@tika and the feedback I got was

* Switch your code to use a mimetype String
* Switch your code to use MediaType rather than MimeType, and call
 DefaultDetector directly (rather than using the Tika facade class)
* If you get back a String (not null) for the mimetype, create a MimeType
 object for it.
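
To make the javac complaint concrete, here is a minimal, self-contained
sketch (not the actual MimeUtil code). It assumes a hypothetical stand-in
MimeType class so that it compiles without Tika on the classpath; the class
name TernaryTypeDemo and the resolve helper are likewise made up for
illustration. The conditional expression mixes a String and a MimeType, so
its type is the common supertype javac prints above rather than MimeType.
Making both operands the same type, which is roughly what the third
suggestion amounts to, resolves it.

```java
import java.io.Serializable;

class TernaryTypeDemo {

    // Hypothetical stand-in for org.apache.tika.mime.MimeType. Like String,
    // it is Serializable and Comparable, which matches the type in the error.
    static final class MimeType implements Comparable<MimeType>, Serializable {
        private static final long serialVersionUID = 1L;
        private final String name;

        MimeType(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }

        @Override
        public int compareTo(MimeType other) {
            return name.compareTo(other.name);
        }
    }

    // Returns a MimeType for the detected name, or the existing type if
    // nothing was detected.
    static MimeType resolve(MimeType type, String detected) {
        // Rejected by javac with "incompatible types": one operand is a
        // String and the other a MimeType, so the expression is neither.
        // type = detected != null ? detected : type;

        // Accepted: both operands are MimeType.
        return detected != null ? new MimeType(detected) : type;
    }

    public static void main(String[] args) {
        MimeType fallback = new MimeType("application/octet-stream");
        System.out.println(resolve(fallback, "text/html").getName());
        System.out.println(resolve(fallback, null).getName());
    }
}
```

With real Tika, the stand-in constructor would be replaced by a lookup in the
MimeTypes registry. I believe that is MimeTypes.forName(String), which
declares MimeTypeException, but that should be checked against the Tika
version in use.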

So I suppose my question is: what do we want to do?

Thanks

[0] https://issues.apache.org/jira/browse/NUTCH-1273
[1] http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/util/MimeUtil.java?view=markup

-- 
*Lewis*

Re: NUTCH-1273

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Markus,

Well, it would appear that the method I mentioned is the only one which still
uses the deprecated API. I notice that we support Tika core & parsers 0.10 in
Nutchgora and core 1.0 in trunk. I'll probably just re-open the relevant
issues, assign Nutchgora to them, and upgrade to 1.0, removing the dependency
on tika-parsers in the process (if possible). I notice that Tika 1.1 is _not_
available on Maven Central yet, so this is really an incremental move towards
supporting 1.1 as well.

Regarding the code example and options I initially proposed, do you have
any comments on the best route to go down?

Thanks

Lewis

On Wed, Feb 29, 2012 at 2:32 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> Hmm, I modified the Content and MimeUtil classes to use the new .detect API
> in NUTCH-1230. I was under the impression all deprecated calls were replaced.
>
> --
> Markus Jelsma - CTO - Openindex
>



-- 
*Lewis*

Re: NUTCH-1273

Posted by Markus Jelsma <ma...@openindex.io>.
Hmm, I modified the Content and MimeUtil classes to use the new .detect API in
NUTCH-1230. I was under the impression all deprecated calls were replaced.

-- 
Markus Jelsma - CTO - Openindex