Posted to user@nutch.apache.org by Iain Lopata <il...@hotmail.com> on 2015/04/13 16:26:06 UTC

Mimetype detection for JSON

I am fetching a page that contains JSON, and I have a plugin for parsing
JSON.

The server sets a mimetype of "text/html", and consequently my JSON parser
does not get invoked.

If I run parsechecker from the command line and specify -forceAs
"application/json", the JSON parser is invoked and works successfully.

So I believe that if I can get Tika to give me "application/json" as the
detected content type for this page, it should work during a crawl.

I have copied tika-mimetypes.xml from the Tika jar file and installed a copy
in my configuration directory.  I have updated nutch-site.xml to point to
this file, and the log entries indicate that it is being found.

In my copy of tika-mimetypes.xml I have added the match rule shown below:

<mime-type type="application/json">
  <sub-class-of type="application/javascript"/>
  <magic priority="100">
    <match value="{" type="string" offset="0"/>
  </magic>
  <glob pattern="*.json"/>
</mime-type>

I know that my match is much too broad, but I am using it only while
trying to resolve this problem.

I have also set lang.extraction.policy to identify in nutch-site.xml (again
primarily for testing purposes).

I am still getting the content type detected as text/html, and the JSON
parser is not being invoked.  Any suggestions as to what to look at next?

Thanks!

Iain

RE: Mimetype detection for JSON

Posted by Iain Lopata <il...@hotmail.com>.

Glad we are on the same page now.  So much harder to discuss code over e-mail than in person!!  I will open the report in JIRA shortly.  Thanks for sticking with me through this discussion!

> Date: Fri, 17 Apr 2015 00:07:08 +0200
> From: wastl.nagel@googlemail.com
> To: user@nutch.apache.org
> Subject: Re: Mimetype detection for JSON
> 
> Hi Iain,
> 
> > It would also seem simple enough to pass these hints to Tika
> > with the following modification to my previously proposed code:
> > try {
> >   InputStream in = new ByteArrayInputStream(data);
> >   Metadata meta = new Metadata();
> >   meta.set(Metadata.CONTENT_TYPE,typeName);
> >   meta.set(Metadata.RESOURCE_NAME_KEY,url);
> >   magicType = this.mimeTypes.detect(in, meta).toString();
> 
> 
> Ok, that would be quite close to the current state in trunk:
> 
>     if (this.mimeMagic) {
>       String magicType = null;
>       // pass URL (file name) and (cleansed) content type from protocol to Tika
>       Metadata tikaMeta = new Metadata();
>       tikaMeta.add(Metadata.RESOURCE_NAME_KEY, url);
>       tikaMeta.add(Metadata.CONTENT_TYPE,
>           (cleanedMimeType != null ? cleanedMimeType : typeName));
>       try {
>         InputStream stream = TikaInputStream.get(data);
>         try {
>           magicType = tika.detect(stream, tikaMeta);
>         } finally {
> 
> 
> Sorry, it looks like I overlooked this little difference:
> 
>  magicType = tika.detect(stream, tikaMeta);
> vs.
>  magicType = this.mimeTypes.detect(in, meta).toString();
> 
> 
> Is this correct?
> Seems plausible because only "mimeTypes" is explicitly configured to use the
> mime.types.file, while "tika" is not, cf. MimeUtil constructor.
> 
> > In fact, I am left wondering why the entire autoResolveContentType in
> > MimeUtil.java cannot be replaced by this code
> 
> Yes, of course. Please open a Jira issue. That's definitely a bug.
> 
> Thanks,
> Sebastian

Re: Mimetype detection for JSON

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Iain,

> It would also seem simple enough to pass these hints to Tika
> with the following modification to my previously proposed code:
> try {
>   InputStream in = new ByteArrayInputStream(data);
>   Metadata meta = new Metadata();
>   meta.set(Metadata.CONTENT_TYPE,typeName);
>   meta.set(Metadata.RESOURCE_NAME_KEY,url);
>   magicType = this.mimeTypes.detect(in, meta).toString();


Ok, that would be quite close to the current state in trunk:

    if (this.mimeMagic) {
      String magicType = null;
      // pass URL (file name) and (cleansed) content type from protocol to Tika
      Metadata tikaMeta = new Metadata();
      tikaMeta.add(Metadata.RESOURCE_NAME_KEY, url);
      tikaMeta.add(Metadata.CONTENT_TYPE,
          (cleanedMimeType != null ? cleanedMimeType : typeName));
      try {
        InputStream stream = TikaInputStream.get(data);
        try {
          magicType = tika.detect(stream, tikaMeta);
        } finally {


Sorry, it looks like I overlooked this little difference:

 magicType = tika.detect(stream, tikaMeta);
vs.
 magicType = this.mimeTypes.detect(in, meta).toString();


Is this correct?
Seems plausible because only "mimeTypes" is explicitly configured to use the
mime.types.file, while "tika" is not, cf. MimeUtil constructor.

> In fact, I am left wondering why the entire autoResolveContentType in
> MimeUtil.java cannot be replaced by this code

Yes, of course. Please open a Jira issue. That's definitely a bug.

Thanks,
Sebastian
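
For readers unfamiliar with the "(cleansed) content type" in the trunk excerpt above: a Content-Type header can carry parameters such as a charset, and only the bare type is useful as a detection hint. The following is a minimal sketch of that idea, not the actual MimeUtil code; ContentTypeCleaner and clean are hypothetical names.

```java
import java.util.Locale;

/** Hypothetical helper: reduces a Content-Type header value to its bare type. */
final class ContentTypeCleaner {

  private ContentTypeCleaner() {}

  /**
   * Drops any parameters after the first ';', trims whitespace and
   * lower-cases the result. Returns null when nothing usable remains.
   */
  static String clean(String headerValue) {
    if (headerValue == null) {
      return null;
    }
    int semicolon = headerValue.indexOf(';');
    String base = semicolon >= 0 ? headerValue.substring(0, semicolon) : headerValue;
    base = base.trim().toLowerCase(Locale.ROOT);
    return base.isEmpty() ? null : base;
  }
}
```

Under these assumptions, a header value such as "text/html; charset=UTF-8" is reduced to "text/html" before it is handed on as a hint.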


On 04/16/2015 04:13 PM, Iain Lopata wrote:
> Sebastian,
> 
> I am not sure I understand your response.
> 
> While you are correct that the call to the detect method in my revised code only uses the content, in the broader context of MimeUtil.java the mime type returned by the server and the filename are both considered before MimeUtil returns a final value.
> 
> It would also seem simple enough to pass these hints to Tika with the following modification to my previously proposed code:
> 
> try {
>   InputStream in = new ByteArrayInputStream(data);
>   Metadata meta = new Metadata();
>   meta.set(Metadata.CONTENT_TYPE, typeName);
>   meta.set(Metadata.RESOURCE_NAME_KEY, url);
>   magicType = this.mimeTypes.detect(in, meta).toString();
> 
>   LOG.debug("Magic Type for " + url + " is " + magicType);
> 
> } catch (Exception e) {
>   // Can't complete magic detection
> }
> 
> In fact, I am left wondering why the entire autoResolveContentType in MimeUtil.java cannot be replaced by this code, but for now I will be happy with a solution that allows me to add rules to tika-mimetypes.xml such that these rules get used by Nutch.
> 
> Iain

RE: Mimetype detection for JSON

Posted by Iain Lopata <il...@hotmail.com>.

Sebastian,

I am not sure I understand your response.

While you are correct that the call to the detect method in my revised code only uses the content, in the broader context of MimeUtil.java the mime type returned by the server and the filename are both considered before MimeUtil returns a final value.

It would also seem simple enough to pass these hints to Tika with the following modification to my previously proposed code:

try {
  InputStream in = new ByteArrayInputStream(data);
  Metadata meta = new Metadata();
  meta.set(Metadata.CONTENT_TYPE, typeName);
  meta.set(Metadata.RESOURCE_NAME_KEY, url);
  magicType = this.mimeTypes.detect(in, meta).toString();

  LOG.debug("Magic Type for " + url + " is " + magicType);

} catch (Exception e) {
  // Can't complete magic detection
}

In fact, I am left wondering why the entire autoResolveContentType in MimeUtil.java cannot be replaced by this code, but for now I will be happy with a solution that allows me to add rules to tika-mimetypes.xml such that these rules get used by Nutch.

Iain


> Date: Thu, 16 Apr 2015 00:26:20 +0200
> From: wastl.nagel@googlemail.com
> To: user@nutch.apache.org
> Subject: Re: Mimetype detection for JSON
> 
> Hi Iain,
> 
> that means MIME type detection would be done exclusively on the content,
> without the URL and the server content type. There are cases where
> both definitely add necessary support, cf. NUTCH-1605.
> 
> Maybe it's best to let Tika improve the MIME detectors; there
> is still some work ongoing, cf. TIKA-1517.
> 
> One option would be to replace the binary mime.type.magic
> with a (weighted) hierarchy of heuristics
>  magic > URL pattern > HTTP content type
> or just a list of hints to be used.
> 
> But it's not that easy, because these hints are often used in combination:
> a zip file by signature with extension .xlsx is likely to be an Excel
> Office Open XML spreadsheet. JSON is similar or even worse:
> a '{' (0x7B) in position 0 is only a weak hint:
> - it could also be '[' (but less likely)
> - RTF also has a '{' in position 0
> 
> Sebastian

Re: Mimetype detection for JSON

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Iain,

that means MIME type detection would be done exclusively on the content,
without the URL and the server content type. There are cases where
both definitely add necessary support, cf. NUTCH-1605.

Maybe it's best to let Tika improve the MIME detectors; there
is still some work ongoing, cf. TIKA-1517.

One option would be to replace the binary mime.type.magic
with a (weighted) hierarchy of heuristics
 magic > URL pattern > HTTP content type
or just a list of hints to be used.
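
As a rough illustration of such a hierarchy (a sketch only, not Nutch or Tika code): the class and method names below are hypothetical, and treating "application/octet-stream" as "no information" is an assumption based on it being Tika's generic fallback type. The first usable hint in the order magic > URL pattern > HTTP content type wins.

```java
import java.util.Arrays;
import java.util.Optional;

/** Hypothetical helper illustrating a fixed priority of detection hints. */
final class MimeHintResolver {

  // Assumption: the generic fallback type carries no information.
  private static final String UNKNOWN = "application/octet-stream";

  private MimeHintResolver() {}

  /** Returns the first informative hint: magic, then URL pattern, then HTTP header. */
  static Optional<String> resolve(String magicType, String urlPatternType,
      String httpContentType) {
    return Arrays.asList(magicType, urlPatternType, httpContentType).stream()
        .filter(MimeHintResolver::isInformative)
        .map(String::trim)
        .findFirst();
  }

  private static boolean isInformative(String type) {
    return type != null
        && !type.trim().isEmpty()
        && !UNKNOWN.equalsIgnoreCase(type.trim());
  }
}
```

With this ordering, a server-supplied "text/html" is used only when neither the content nor the URL gives a better answer. A strict ranking like this is the simplest possible form and does not combine the hints.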

But it's not that easy, because these hints are often used in combination:
a zip file by signature with extension .xlsx is likely to be an Excel
Office Open XML spreadsheet. JSON is similar or even worse:
a '{' (0x7B) in position 0 is only a weak hint:
- it could also be '[' (but less likely)
- RTF also has a '{' in position 0

Sebastian
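
To make the point about '{' concrete, here is a minimal sketch of a slightly less naive content check. It is not how Tika detects JSON; JsonSniffer and looksLikeJson are hypothetical names, and the rules (skip leading whitespace, accept '{' or '[', reject the RTF signature "{\rtf") are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical content sniffer; a weak hint only, not a JSON validator. */
final class JsonSniffer {

  // RTF documents begin with the ASCII bytes "{\rtf".
  private static final byte[] RTF_SIGNATURE =
      "{\\rtf".getBytes(StandardCharsets.US_ASCII);

  private JsonSniffer() {}

  /** True if the first non-whitespace byte is '{' or '[' and the data is not RTF. */
  static boolean looksLikeJson(byte[] data) {
    int i = 0;
    // Skip the whitespace JSON allows before a value: space, tab, LF, CR.
    while (i < data.length && isJsonWhitespace(data[i])) {
      i++;
    }
    if (i >= data.length) {
      return false;
    }
    if (data[i] == '[') {
      return true;
    }
    return data[i] == '{' && !startsWith(data, i, RTF_SIGNATURE);
  }

  private static boolean isJsonWhitespace(byte b) {
    return b == ' ' || b == '\t' || b == '\n' || b == '\r';
  }

  private static boolean startsWith(byte[] data, int offset, byte[] prefix) {
    if (data.length - offset < prefix.length) {
      return false;
    }
    for (int j = 0; j < prefix.length; j++) {
      if (data[offset + j] != prefix[j]) {
        return false;
      }
    }
    return true;
  }
}
```

Even this still accepts anything that merely starts with a bracket, so it remains a hint to be combined with the URL and the server content type rather than a reliable detector.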


On 04/15/2015 02:05 PM, Iain Lopata wrote:
> The following change to MimeUtil.java seems to solve my problem:
> 
> // magicType = tika.detect(data);
> try {
>   InputStream in = new ByteArrayInputStream(data);
>   Metadata meta = new Metadata();
>   magicType = this.mimeTypes.detect(in, meta).toString();
>   LOG.debug("Magic Type for " + url + " is " + magicType);
> } catch (Exception e) {
>   // Can't complete magic detection
> }
> 
> However, my confidence that I haven’t broken something else is modest at best.
> 
> If this looks like a bug I am happy to create the JIRA entry and submit this as a patch, but before I do so can you tell me if this looks sensible?
> 
> -----Original Message-----
> From: Iain Lopata [mailto:ilopata1@hotmail.com] 
> Sent: Tuesday, April 14, 2015 8:43 PM
> To: user@nutch.apache.org
> Subject: RE: Mimetype detection for JSON
> 
> It seems to me that setting tika-mimetypes.xml in the Nutch configuration causes MimeUtil.java to use the specified file for initial lookup and for URL resolution.  However, when it comes to magic detection, the tika-mimetypes.xml file in the Tika jar file seems to be used instead.  
> 
> If I update the Tika jar with my match rule it works perfectly. If I only place the updated tika-mimetypes.xml file in my Nutch configuration directory, the magic detection does not use my match rule.
> 
> Can anyone familiar with the Tika implementation tell me if there is a way to update Nutch's MimeUtil.java to instantiate Tika to use the configuration file from Nutch?  Or would it be better just to update the configuration file in the Tika jar?
> 
> -----Original Message-----
> From: Iain Lopata [mailto:ilopata1@hotmail.com]
> Sent: Tuesday, April 14, 2015 5:32 PM
> To: user@nutch.apache.org
> Subject: RE: Mimetype detection for JSON
> 
> Thanks Sebastian.
> 
> mime.type.magic is true.
> 
> I don’t have control over the web server, so I cannot test with application/javascript.
> 
> Time for some deeper debugging it seems.  Will update the list with findings.
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> Sent: Tuesday, April 14, 2015 4:09 PM
> To: user@nutch.apache.org
> Subject: Re: Mimetype detection for JSON
> 
> Hi Iain,
> 
>> I have copied tika-mimetypes.xml from the tika jar file and installed 
>> a copy in my configuration directory.  I have updated nutch-site.xml 
>> to point to this file and the log entries indicate that this is being found.
> 
> ... and the property mime.type.magic is true (default)?
> 
> 
>> <mime-type type="application/json">
>>           <sub-class-of type="application/javascript"/>
> 
> Just as a trial: What happens if you make the web server return "application/javascript"
> as content type?
> 
> 
>> I am still getting the content type detected as text/html and the json 
>> parser is not being invoked.  Any suggestions as to what to look at next?
> 
> The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes the following resources to Tika:
> - byte stream for magic detection
> - URL for additional file name patterns
> - content type sent by server
> URL and server content type are required as additional hints, e.g., for zip containers such as .xlsx, etc.
> 
> I fear that you have to run a debugger to find out what is going wrong.
> I would also run first Tika alone with the modified tika-mimetypes.xml, just to make sure that the mime magic works as expected.
> 
> Cheers,
> Sebastian
> 
> On 04/13/2015 04:26 PM, Iain Lopata wrote:
>> I have a page that I am fetching that contains JSON and I have a 
>> plugin for parsing JSON.
>>
>>  
>>
>> The server sets a mimetype of "text/html" and consequently my json 
>> parser does not get invoked.
>>
>>  
>>
>> If I run parsechecker from the command line and specify -forceAs 
>> "application/json" the json parser is invoked and works successfully.
>>
>>  
>>
>> So, I believe that if I can get tika to give me "application/json" as 
>> the detected content type for this page, it should work during a crawl.
>>
>>  
>>
>> I have copied tika-mimetypes.xml from the tika jar file and installed 
>> a copy in my configuration directory.  I have updated nutch-site.xml 
>> to point to this file and the log entries indicate that this is being found.
>>
>>  
>>
>> In my copy of tika-mimetypes.xml I have added the match rule shown 
>> below
>>
>>  
>>
>> <mime-type type="application/json">
>>
>>           <sub-class-of type="application/javascript"/>
>>
>>           <magic priority="100">
>>
>>                   <match value="{" type="string" offset="0"/>
>>
>>           </magic>
>>
>>           <glob pattern="*.json"/>
>>
>>   </mime-type>
>>
>>  
>>
>> I know that my match is much too broad, but I am using this just while 
>> trying to resolve this problem.
>>
>>  
>>
>> I have also set lang.extraction.policy to identify in nutch-site.xml 
>> (again primarily for testing purposes).
>>
>>  
>>
>> I am still getting the content type detected as text/html and the json 
>> parser is not being invoked.  Any suggestions as to what to look at next?
>>
>>  
>>
>> Thanks!
>>
>>  
>>
>> Iain
>>
>>
> 
> 
> 
>

RE: Mimetype detection for JSON

Posted by Iain Lopata <il...@hotmail.com>.

The following change to MimeUtil.java seems to solve my problem:

//      magicType = tika.detect(data);
        try {
            // Run magic detection through this.mimeTypes rather than the tika facade
            InputStream in = new ByteArrayInputStream(data);
            Metadata meta = new Metadata();
            magicType = this.mimeTypes.detect(in, meta).toString();
            LOG.debug("Magic Type for " + url + " is " + magicType);
        } catch (Exception e) {
            // Can't complete magic detection
        }

However, my confidence that I haven’t broken something else is modest at best.

If this looks like a bug I am happy to create the JIRA entry and submit this as a patch, but before I do so can you tell me if this looks sensible?
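
For readers following the patch: the replacement call passes an InputStream instead of the raw byte array. Stream-based detection typically peeks at the leading bytes and then rewinds, which is why the data is wrapped in a stream that supports mark/reset (ByteArrayInputStream does). The stdlib-only sketch below illustrates that peek-and-rewind pattern with a single offset-0 string match like the test rule from the original post. The class PrefixPeek and its methods are hypothetical names used for illustration; they are not part of Tika or Nutch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not Tika or Nutch code.
class PrefixPeek {

    // Reads up to n leading bytes, then rewinds the stream to where it started.
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);
        try {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buf, total, n - total);
                if (read < 0) {
                    break; // end of stream before n bytes were available
                }
                total += read;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset();
        }
    }

    // Mimics a rule such as <match value="{" type="string" offset="0"/>.
    static boolean startsWith(InputStream in, String magic) throws IOException {
        byte[] expected = magic.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(peek(in, expected.length), expected);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "{\"id\": 1}".getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(data);
        System.out.println("rule matches: " + startsWith(in, "{"));
        System.out.println("stream rewound: " + (in.available() == data.length));
    }
}
```

Because the stream is rewound after the peek, the same stream can still be handed to a parser afterwards.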


RE: Mimetype detection for JSON

Posted by Iain Lopata <il...@hotmail.com>.

It seems to me that setting tika-mimetypes.xml in the Nutch configuration causes MimeUtil.java to use the specified file for initial lookup and for URL resolution.  However, when it comes to magic detection, the tika-mimetypes.xml file in the Tika jar file seems to be used instead.  

If I update the Tika jar with my match rule it works perfectly. If I only place the updated tika-mimetypes.xml file in my Nutch configuration directory, the magic detection does not use my match rule.

Can anyone familiar with the Tika implementation tell me whether Nutch's MimeUtil.java can be changed so that it instantiates Tika with the configuration file from Nutch?  Or would it be better just to update the configuration file in the Tika jar?
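
The symptom described above can be modelled without Tika at all. The sketch below is a deliberately simplified, stdlib-only model: a "registry" is just a map from a magic prefix to a MIME type, and the class name RegistryMismatch, its methods, and the bundled rules are invented for illustration. It shows only that a custom rule has no effect if detection consults a registry that was built without it. In the real crawl the fallback was the server-supplied text/html rather than application/octet-stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the symptom; these names do not exist in Tika or Nutch.
class RegistryMismatch {

    // Stands in for the rules shipped inside the library jar.
    static Map<String, String> bundledRules() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("<html", "text/html");
        rules.put("%PDF-", "application/pdf");
        return rules;
    }

    // Stands in for the edited copy in the configuration directory.
    static Map<String, String> customRules() {
        Map<String, String> rules = new LinkedHashMap<>(bundledRules());
        rules.put("{", "application/json"); // the overly broad test rule
        return rules;
    }

    // Returns the type of the first rule whose prefix matches at offset 0.
    static String detect(Map<String, String> rules, byte[] data) {
        String head = new String(data, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (head.startsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] json = "{\"id\": 1}".getBytes(StandardCharsets.UTF_8);
        // Detection against the bundled rules never sees the custom rule.
        System.out.println("bundled: " + detect(bundledRules(), json));
        // Detection against the configured rules picks it up.
        System.out.println("custom:  " + detect(customRules(), json));
    }
}
```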


RE: Mimetype detection for JSON

Posted by Iain Lopata <il...@hotmail.com>.

Thanks Sebastian.

mime.type.magic is true.

I don’t have control over the web server, so I cannot test with application/javascript.

Time for some deeper debugging, it seems.  I will update the list with my findings.


Re: Mimetype detection for JSON

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Iain,

> I have copied tika-mimetypes.xml from the tika jar file and installed a copy
> in my configuration directory.  I have updated nutch-site.xml to point to
> this file and the log entries indicate that this is being found.

... and the property mime.type.magic is true (default)?


> <mime-type type="application/json">
>           <sub-class-of type="application/javascript"/>

Just as a trial: What happens if you make the web server return "application/javascript"
as content type?


> I am still getting the content type detected as text/html and the json
> parser is not being invoked.  Any suggestions as to what to look at next?

The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes the following
resources to Tika:
- byte stream for magic detection
- URL for additional file name patterns
- content type sent by server
URL and server content type are required as additional hints, e.g.,
for zip containers such as .xlsx, etc.
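
To make the role of the three hints concrete, here is a stdlib-only sketch of one possible way to combine them. It is an illustrative policy under stated assumptions, not the actual resolution order used by Tika or by MimeUtil: the class HintResolver, the method resolve, and the "generic container" set are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical policy for illustration; not Tika's or Nutch's real logic.
class HintResolver {

    static final String OCTET_STREAM = "application/octet-stream";

    // Magic results that only identify a container, not what is inside it.
    static final Set<String> GENERIC_CONTAINERS =
            new HashSet<>(Arrays.asList("application/zip"));

    // Any argument may be null when that hint is unavailable.
    static String resolve(String magicType, String urlType, String serverType) {
        boolean magicUnknown = magicType == null || OCTET_STREAM.equals(magicType);
        if (!magicUnknown) {
            // A zip signature plus an .xlsx file name points to the more specific type.
            if (GENERIC_CONTAINERS.contains(magicType) && urlType != null) {
                return urlType;
            }
            return magicType;
        }
        // No usable magic result: fall back to the file name, then the server header.
        if (urlType != null) {
            return urlType;
        }
        if (serverType != null) {
            return serverType;
        }
        return OCTET_STREAM;
    }

    public static void main(String[] args) {
        String xlsx = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
        System.out.println(resolve("application/zip", xlsx, OCTET_STREAM));
        System.out.println(resolve("application/json", null, "text/html"));
        System.out.println(resolve(null, null, "text/html"));
    }
}
```

Under this policy, a working magic rule for JSON would override the server's text/html, while a missing magic result leaves the server type in place, which matches the behaviour reported in this thread.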

I fear that you will have to run a debugger to find out what is going wrong.
I would also first run Tika alone with the modified tika-mimetypes.xml,
just to make sure that the mime magic works as expected.

Cheers,
Sebastian
