Posted to dev@nutch.apache.org by Ned Rockson <ne...@discoveryengine.com> on 2007/11/06 23:47:45 UTC

Tika API

I think there may be a bug in the Content.java when it tries to convert 
the textual representation of the type to a MimeType.  It always returns 
null.  I'm trying to fix it but I can't find an API for Tika (or even 
src).  Can someone point me in the right direction?

Thanks,
Ned

Re: Tika API

Posted by Dennis Kubes <ku...@apache.org>.

Chris Mattmann wrote:
> Hi Ned,
> 
>  Glad to see you're poking around with the Tika software and its use in
> Nutch. To start, you probably want to go to the website for Tika:
> 
>  http://incubator.apache.org/tika/
> 
>  On that website, you should see the links to the SVN repository. The
> version of Tika that was used was a version that I built the same day I
> committed the fix for NUTCH-562:
> 
>  http://issues.apache.org/jira/browse/NUTCH-562
> 
>  Which appears to be a version of Tika built on October 8th. The API for the
> mime framework has changed a bit since then (to its betterment), however, I
> neglected to upgrade the Nutch API because of the strong objection I
> received from Andrzej and input from Dennis Kubes regarding the use of the
> Tika API in Nutch. I stand by my email I sent in reply to the objections:
> 
>  http://www.nabble.com/forum/ViewPost.jtp?post=13142174&framed=y
> 
>  However, out of respect for the other committers, I neglected to make any
> updates to the Nutch use of the Tika API since I never heard back from
> anyone after my response.

I looked back at the messages. I tried to respond as best I could.  If I 
missed something I apologize.

My only concern was the documentation for Tika (of course Nutch has the 
same problem :)), as I figured we would have this situation here, where 
someone was asking questions about Tika and didn't know where to turn. 
But since you and Sami are both committers for Tika and Nutch and are 
active here, I thought it would be fine.

> 
>  That said, could you be a bit more specific Ned as to the exact problem
> you're having, e.g., "I tried visiting this site (URL here), the content
> type was (content/type here), and then it got into Content.java, and on line
> XXX it seems that the MimeType is getting set to null when it tries to...".
> With that info, I could probably help you quite a bit more. Also, depending
> upon how the rest of the Nutch committers want to handle the use of Tika
> (revert and remain stagnant, or use Tika and leverage the updates we're
> making to the Mime framework there), then we could come up with a strategy
> to help you out with the issue you're having.

The previous patches seem to work well; we have fetched well over 100M 
pages without any problems.  I would say let's try to move things forward 
if you feel the Tika code is ready.  Maybe I am missing something.  I 
have not delved into Tika deeply, but if it is better we should use it. 
Is it poised to break something?

Dennis

> 
> Thanks!
> 
> Cheers,
>   Chris

Re: Tika API

Posted by Ned Rockson <ne...@discoveryengine.com>.
That's strange - I did an svn up at the root of the nutch-trunk
directory and merged all of the changes with my code base.  I must
have missed the changes to the conf directory when merging, as I was
only diffing the src directory.
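
A minimal sketch of one way to catch this kind of partial merge: list the
files that exist in a fresh checkout's conf/ directory but not in the local
one. ConfDirDiff and its method names are hypothetical, introduced only for
illustration, and the check looks at file names only, not file contents.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

/** Hypothetical helper: compares the file names in two conf/ directories. */
final class ConfDirDiff {

  /** Returns the names of the regular files directly inside dir. */
  static Set<String> fileNames(Path dir) {
    try (Stream<Path> entries = Files.list(dir)) {
      Set<String> names = new TreeSet<>();
      entries.filter(Files::isRegularFile)
          .forEach(p -> names.add(p.getFileName().toString()));
      return names;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns the files present in upstream but absent from local. */
  static Set<String> missingLocally(Path upstream, Path local) {
    Set<String> missing = fileNames(upstream);
    missing.removeAll(fileNames(local));
    return missing;
  }
}
```

Run against a clean trunk checkout and a merged working copy, a non-empty
result would point at files such as tika-mimetypes.xml that the merge skipped.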

--Ned
On Nov 6, 2007, at 7:05 PM, Chris Mattmann wrote:

> [..snip..]
>
>>     return type.getName();
>>   }
>>
>>
>> The NPE was being thrown on the last line, so I did some tracing and
>> found out that the call to MimeType.clean(typeName) [typeName <-
>> "text/html"] worked fine, but the next line caused a problem.  The
>> this.mimeTypes.getRepository().forName(cleanedMimeType) was returning
>> null.  My problem was that I downloaded the trunk and it didn't have a
>> MimeUtils anymore so I had no way to trace this.
>
> Yes, this class was removed as part of NUTCH-562. Its usage was replaced
> with the class of the same name within the Tika API, which is based on the
> Nutch API for mime types.
>
>>
>> Anyway, after an hour or so of banging my head against the wall I
>> realized the update to Nutch didn't have the correct .xml file
>> describing mime types in the conf/ directory.  Thus, I unzipped the Tika
>> jar, grabbed the .xml file and changed nutch-default.xml to point to
>> that xml for mime types and it started working.
>
> This is strange: as part of the patch for NUTCH-562, a file called
> tika-mimetypes.xml was committed to the conf/ folder within the trunk.
> Do you not have this file? The nutch-default.xml file within the conf/
> folder in the nutch trunk points to tika-mimetypes.xml, so that should
> have worked. I'm wondering if you had an old version of the conf/ directory
> and neglected to run svn up on it?
>
>>
>> Sorry again for being so vague.  I'm not sure if I should submit a JIRA
>> issue for this, but I'm happy to do so if anyone else has seen this issue.
>
> No problem: let's discuss the JIRA issue once we get an answer to the above
> questions.
>
> Thanks for being more descriptive and looking forward to your response.
>
> Cheers,
>   Chris
>
>>
>> Thanks,
>> Ned


Re: Tika API

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
[..snip..]

>     return type.getName();
>   }
> 
> 
> The NPE was being thrown on the last line, so I did some tracing and
> found out that the call to MimeType.clean(typeName) [typeName <-
> "text/html"] worked fine, but the next line caused a problem.  The
> this.mimeTypes.getRepository().forName(cleanedMimeType) was returning
> null.  My problem was that I downloaded the trunk and it didn't have a
> MimeUtils anymore so I had no way to trace this.

Yes, this class was removed as part of NUTCH-562. Its usage was replaced
with the class of the same name within the Tika API, which is based on the
Nutch API for mime types.

> 
> Anyway, after an hour or so of banging my head against the wall I
> realized the update to Nutch didn't have the correct .xml file
> describing mime types in the conf/ directory.  Thus, I unzipped the Tika
> jar, grabbed the .xml file and changed nutch-default.xml to point to
> that xml for mime types and it started working.

This is strange: as part of the patch for NUTCH-562, a file called
tika-mimetypes.xml was committed to the conf/ folder within the trunk.
Do you not have this file? The nutch-default.xml file within the conf/
folder in the nutch trunk points to tika-mimetypes.xml, so that should
have worked. I'm wondering if you had an old version of the conf/ directory
and neglected to run svn up on it?

> 
> Sorry again for being so vague.  I'm not sure if I should submit a JIRA
> issue for this, but I'm happy to do so if anyone else has seen this issue.

No problem: let's discuss the JIRA issue once we get an answer to the above
questions.

Thanks for being more descriptive and looking forward to your response.

Cheers,
  Chris

> 
> Thanks,
> Ned

______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Tika API

Posted by Ned Rockson <ne...@discoveryengine.com>.
Sorry about being so vague.  I was trying to run Fetcher (and Fetcher2 
as well) and noticed that every fetched page was throwing an NPE.  
Essentially, the NPE was coming from Content.java that referenced 
HttpResponse and the line that was causing the problem had a call to 
MimeUtils.getRepository().  The snippet of code looks like this:


    MimeType type = null;
    String cleanedMimeType = null;
    try {
      cleanedMimeType = MimeType.clean(typeName);
    } catch (MimeTypeException mte) {
      // Seems to be a malformed mime type name...
    }

    // first try to get the type from the cleaned type name
    type = cleanedMimeType != null ? this.mimeTypes.getRepository().forName(
        cleanedMimeType) : null;

    // if returned null, then try url resolution
    if (type == null) {
      // If no mime-type header, or cannot find a corresponding registered
      // mime-type, then guess a mime-type from the url pattern
      type = this.mimeTypes.getRepository().getMimeType(url) != null
          ? this.mimeTypes.getRepository().getMimeType(url)
          : type;
    }

    // if magic is enabled use mime magic to guess if the mime type returned
    // from the magic guess is different than the one that's already set so far
    // if it is, go with the mime type returned by the magic
    if (this.mimeTypeMagic) {
      MimeType magicType = this.mimeTypes.getRepository().getMimeType(data);
      if (magicType != null && !type.getName().equals(magicType.getName())) {
        // If magic enabled and the current mime type differs from that of the
        // one returned from the magic, take the magic mimeType

        type = magicType;
      }
    }
    return type.getName();
  }


The NPE was being thrown on the last line, so I did some tracing and 
found out that the call to MimeType.clean(typeName) [typeName <- 
"text/html"] worked fine, but the next line caused a problem.  The call to 
this.mimeTypes.getRepository().forName(cleanedMimeType) was returning 
null.  My problem was that I downloaded the trunk and it didn't have a 
MimeUtils anymore, so I had no way to trace this.
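
A minimal sketch of the same lookup order (header type first, then a guess
from the URL) written so that a failed lookup can never be dereferenced. It
assumes a hypothetical in-memory registry: MimeResolutionSketch, register,
and resolve are illustrative names, not the Tika or Nutch API, and the
magic-based step is left out for brevity.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a MIME type registry; not the Tika API. */
final class MimeResolutionSketch {

  // Known type names, as they would be loaded from a mime types XML file.
  private final Set<String> names = new HashSet<>();
  // File extension -> type name, used for the URL-based guess.
  private final Map<String, String> byExtension = new HashMap<>();

  void register(String typeName, String extension) {
    names.add(typeName);
    byExtension.put(extension, typeName);
  }

  /** Resolves a type name, returning fallback when every lookup fails. */
  String resolve(String headerType, String url, String fallback) {
    // 1. Try the type named in the Content-Type header.
    if (headerType != null) {
      String cleaned = clean(headerType);
      if (names.contains(cleaned)) {
        return cleaned;
      }
    }
    // 2. Otherwise guess from the file extension in the URL.
    String guessed = guessFromUrl(url);
    if (guessed != null) {
      return guessed;
    }
    // 3. Nothing matched (for example, the registry is empty).
    return fallback;
  }

  /** Drops parameters such as "; charset=UTF-8" and lower-cases the name. */
  static String clean(String typeName) {
    int semi = typeName.indexOf(';');
    String base = semi >= 0 ? typeName.substring(0, semi) : typeName;
    return base.trim().toLowerCase(Locale.ROOT);
  }

  // Simplification: takes whatever follows the last '.' as the extension.
  private String guessFromUrl(String url) {
    if (url == null) {
      return null;
    }
    int dot = url.lastIndexOf('.');
    if (dot < 0 || dot == url.length() - 1) {
      return null;
    }
    return byExtension.get(url.substring(dot + 1).toLowerCase(Locale.ROOT));
  }
}
```

With an empty registry, which is roughly the situation described above,
resolve returns the fallback instead of throwing.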

Anyway, after an hour or so of banging my head against the wall I 
realized the update to Nutch didn't have the correct .xml file 
describing mime types in the conf/ directory.  Thus, I unzipped the Tika 
jar, grabbed the .xml file and changed nutch-default.xml to point to 
that xml for mime types and it started working.
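
A missing mime types file like this is easier to diagnose if it is reported
at startup rather than surfacing later as a null lookup. The following is a
rough sketch of such a check, assuming the file is looked up in a conf
directory on disk; MimeConfigCheck and requireMimeTypesFile are hypothetical
names and are not part of Nutch or Tika.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical startup check for the configured mime types file. */
final class MimeConfigCheck {

  /**
   * Returns the path of the mime types file inside confDir, or throws an
   * exception that names the missing file.
   */
  static Path requireMimeTypesFile(Path confDir, String fileName) {
    Path candidate = confDir.resolve(fileName);
    if (!Files.isRegularFile(candidate)) {
      throw new IllegalStateException(
          "MIME types file not found: " + candidate
              + " (check the conf directory and the configured file name)");
    }
    return candidate;
  }
}
```

Calling requireMimeTypesFile(confDir, "tika-mimetypes.xml") before any
fetching starts would turn a per-page NPE into a single, descriptive error.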

Sorry again for being so vague.  I'm not sure if I should submit a JIRA 
issue for this, but I'm happy to do so if anyone else has seen this issue.

Thanks,
Ned




Re: Tika API

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Ned,

 Glad to see you're poking around with the Tika software and its use in
Nutch. To start, you probably want to go to the website for Tika:

 http://incubator.apache.org/tika/

 On that website, you should see the links to the SVN repository. The
version of Tika that was used was a version that I built the same day I
committed the fix for NUTCH-562:

 http://issues.apache.org/jira/browse/NUTCH-562

 Which appears to be a version of Tika built on October 8th. The API for the
mime framework has changed a bit since then (to its betterment), however, I
neglected to upgrade the Nutch API because of the strong objection I
received from Andrzej and input from Dennis Kubes regarding the use of the
Tika API in Nutch. I stand by my email I sent in reply to the objections:

 http://www.nabble.com/forum/ViewPost.jtp?post=13142174&framed=y

 However, out of respect for the other committers, I neglected to make any
updates to the Nutch use of the Tika API since I never heard back from
anyone after my response.

 That said, could you be a bit more specific Ned as to the exact problem
you're having, e.g., "I tried visiting this site (URL here), the content
type was (content/type here), and then it got into Content.java, and on line
XXX it seems that the MimeType is getting set to null when it tries to...".
With that info, I could probably help you quite a bit more. Also, depending
upon how the rest of the Nutch committers want to handle the use of Tika
(revert and remain stagnant, or use Tika and leverage the updates we're
making to the Mime framework there), then we could come up with a strategy
to help you out with the issue you're having.

Thanks!

Cheers,
  Chris



On 11/6/07 3:47 PM, "Ned Rockson" <ne...@discoveryengine.com> wrote:

> I think there may be a bug in the Content.java when it tries to convert
> the textual representation of the type to a MimeType.  It always returns
> null.  I'm trying to fix it but I can't find an API for Tika (or even
> src).  Can someone point me in the right direction?
> 
> Thanks,
> Ned

______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.