Posted to dev@oodt.apache.org by Thomas Bennett <lm...@gmail.com> on 2012/02/21 13:20:48 UTC

Mime type detection

Hi,

I see that the file manager extracts the MIME type from the product
references that are passed to it via the XML-RPC ingestProduct call.

I'm ingesting hdf5 files (ext .h5) into my archive.

I've captured the methodCall, and here is the actual parameter that is
passed to the File Manager on a successful ingest.

<member>
    <name>references</name>
       ...
                        <member>
                            <name>mimeType</name>
                            <value>application/octet-stream</value>
                        </member>
                        <member>
                            <name>origReference</name>
                            <value>file:/var/kat/data/1329472755.h5</value>
                        </member>
       ...
</member>

As you can see, the mimeType is detected as application/octet-stream.

This mimeType is auto-detected by the CAS-Crawler (I'm using the
AutoDetectProductCrawler crawlerId).

However, I have configured the crawler's policy/mimetypes.xml:

<mime-info>
<mime-type type="product/hdf5">
 <glob pattern="\d{10}\.h5$" isregex="true"/>
</mime-type>
</mime-info>

and policy/mime-extractor-map.xml:

<cas:mimetypemap xmlns:cas="http://oodt.jpl.nassa.gov/1.0/cas"
                 magic="true or false"
                 mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
    <mime type="product/hdf5">
        <extractor class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
            <config file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
            <preCondComparators>
                <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
            </preCondComparators>
        </extractor>
    </mime>
</cas:mimetypemap>
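
For reference, here is a small standalone check of the regex used in the glob
pattern in mimetypes.xml above. It is only a sketch with plain java.util.regex:
the class and method names are made up for illustration, and whether the
crawler applies the pattern as a whole-string match or as a substring search,
and whether it sees the bare file name or the full path, is an assumption I
have not verified. The difference matters because the pattern is only anchored
at the end.

```java
import java.util.regex.Pattern;

public class GlobRegexCheck {
    // The regex from the <glob> element in policy/mimetypes.xml.
    static final Pattern H5_NAME = Pattern.compile("\\d{10}\\.h5$");

    /** True if the pattern occurs anywhere in the input (substring search). */
    static boolean findsIn(String input) {
        return H5_NAME.matcher(input).find();
    }

    /** True only if the entire input matches the pattern. */
    static boolean matchesWhole(String input) {
        return H5_NAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        String name = "1329472755.h5";
        String path = "/var/kat/data/1329472755.h5";

        System.out.println(findsIn(name));       // true
        System.out.println(matchesWhole(name));  // true
        System.out.println(findsIn(path));       // true
        System.out.println(matchesWhole(path));  // false: the leading directories are not digits
        System.out.println(findsIn("notes.h5")); // false: no ten-digit prefix
    }
}
```

So the pattern is safe for the bare file name either way, but a whole-string
match against the full path would need a leading ".*" in the pattern.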

The AutoDetectProductCrawler now uses this to detect the file and extract
the metadata. However, the MIME type recorded for the reference is detected
by the following code in
org.apache.oodt.cas.filemgr.structs.Reference.java:

        try {
            this.mimeType = mimeTypeRepository
                    .getMimeType(new URL(origRef));
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }

So the MIME type is actually detected by the Tika library, and Tika does
not seem to know that .h5 files are HDF5 files.
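
To illustrate why an unrecognised extension ends up as
application/octet-stream, here is a minimal sketch of extension-based
detection with a default fallback. This is not Tika's or OODT's code: the
class ExtensionMimeSketch, its register/detect methods, and the lookup table
are hypothetical and use only the JDK.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ExtensionMimeSketch {
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Hypothetical lookup table; not the registry that Tika or OODT ships with.
    private final Map<String, String> byExtension = new HashMap<>();

    /** Associates a file extension (without the dot) with a MIME type. */
    void register(String extension, String mimeType) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), mimeType);
    }

    /** Looks up the type by the extension of the last path segment of the reference. */
    String detect(String reference) {
        String path = URI.create(reference).getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return DEFAULT_TYPE; // no extension at all
        }
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        // Unknown extensions fall through to the generic binary type.
        return byExtension.getOrDefault(ext, DEFAULT_TYPE);
    }

    public static void main(String[] args) {
        ExtensionMimeSketch repo = new ExtensionMimeSketch();
        String ref = "file:/var/kat/data/1329472755.h5";

        System.out.println(repo.detect(ref)); // application/octet-stream

        repo.register("h5", "application/x-hdf");
        System.out.println(repo.detect(ref)); // application/x-hdf
    }
}
```

Until the repository knows about the extension, every .h5 reference gets the
generic default, which matches what I see in the captured methodCall.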

Forcing MimeType to "application/x-hdf" in the metadata does not replace the
detected value; the forced value is appended alongside it:

MimeType
application/x-hdf
application/octet-stream
application
octet-stream
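
My guess is that MimeType behaves as a multi-valued key, so a second value is
added next to the first instead of overwriting it. The sketch below shows that
difference with plain collections. The class and its add/replace methods are
hypothetical and are not OODT's metadata API, and it does not explain the extra
"application" and "octet-stream" entries in the listing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValuedMetadataSketch {
    // Each key maps to an ordered list of values.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value, keeping whatever is already stored under the key. */
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Discards any existing values and stores only the new one. */
    void replace(String key, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(key, single);
    }

    /** Returns the values stored under the key, or an empty list. */
    List<String> get(String key) {
        return values.getOrDefault(key, new ArrayList<>());
    }

    public static void main(String[] args) {
        MultiValuedMetadataSketch met = new MultiValuedMetadataSketch();

        met.add("MimeType", "application/x-hdf");        // forced by the extractor
        met.add("MimeType", "application/octet-stream"); // detected later
        System.out.println(met.get("MimeType"));
        // [application/x-hdf, application/octet-stream]

        met.replace("MimeType", "application/x-hdf");
        System.out.println(met.get("MimeType"));
        // [application/x-hdf]
    }
}
```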

So my question: is this okay? Do I live with application/octet-stream, or
are there any recommendations on how to fix this?

Cheers,
Tom

Re: Mime type detection

Posted by Sheryl John <sh...@gmail.com>.
Hi Thomas,

There was a fix in Tika to support *.hf files, and it was for 0.9.
I saw it here:
http://mail-archives.apache.org/mod_mbox/tika-dev/201103.mbox/%3C129269711.10858.1299771239436.JavaMail.tomcat@hel.zones.apache.org%3E

The cas-filemanager 0.3 version is using tika-core-0.8.jar, and I'm not
sure whether the latest OODT version is using Tika 0.9 or above.

On Tue, Feb 21, 2012 at 4:20 AM, Thomas Bennett <lm...@gmail.com> wrote:

> Hi,
>
> I see that the file manager extracts the mime type from the product
> references that are passed to it via the xml-rcp ingestProduct call.
>
> I'm ingesting hdf5 files (ext .h5) into my archive.
>
> I've captured the methodCall and here is the actual parameter that is
> passed to the File Manager on a successful.
>
> <member>
>     <name>references</name>
>        ...
>                         <member>
>                             <name>mimeType</name>
>                             <value>application/octet-stream</value>
>                         </member>
>                         <member>
>                             <name>origReference</name>
>                             <value>file:/var/kat/data/1329472755.h5</value>
>                         </member>
>        ...
> </member>
>
> As you can see the mimeType is detected as application/octet-stream.
>
> This mimeType is auto-detected by the CAS-Crawler (I'm using the AutoDetectProductCrawler
> crawlerId).
>
> However. I configure the Crawler policy/mimetypes.xml:
>
> <mime-info>
> <mime-type type="product/hdf5">
>  <glob pattern="\d{10}\.h5$" isregex="true"/>
> </mime-type>
> </mime-info>
>
> and policy/mime-extractor-map.xml:
>
> <cas:mimetypemap xmlns:cas="http://oodt.jpl.nassa.gov/1.0/cas"
> magic="true or false"
> mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
>  <mime type="product/hdf5">
> <extractor
> class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
>  <config
> file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
> <preCondComparators>
>  <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
> </preCondComparators>
>  </extractor>
> </mime>
> </cas:mimetypemap>
>
> The AutoDetectProductCrawler now uses this to detect the file and extract
> the metadata. However, when it comes to MimeType detection, this is done in
> the following line of code in
> org.apache.oodt.cas.filemgr.structs.Reference.java:
>
>
>         try {
>
>             this.mimeType = mimeTypeRepository
>
>                     .getMimeType(new URL(origRef));
>
>         } catch (MalformedURLException e) {
>
>             e.printStackTrace();
>
>         }
> So the mime-type is actually detected by the Tika library. Woot! So Tika
> does not seem to know about .h5 files and that they are hdf5 files.
>
> Forcing a MimeType to be "application/x-hdf" in the MetaData results in
> the mimetype being appended.
>
> MimeType
> application/x-hdf
> application/octet-stream
> application
> octet-stream
>
> So my question: Is this okay? Do I live with the application/octet-stream.
> Any recommendations on how to fix this?
>
> Cheers,
> Tom
>
>
>
>
>
>


-- 
-Sheryl

Re: Mime type detection

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Tom,

On Feb 22, 2012, at 2:51 AM, Thomas Bennett wrote:

> Hi Sheryl and Chris,
> 
> Thanks for the feedback. Much appreciated :)

Anytime!

> 
> I'll go ahead and make that wiki page.

+1.

> 
> Should I open a JIRA issue relating to upgrading Tika? For the 0.5 release?

+1, let's do that...

Cheers,
Chris

> On 21 February 2012 23:48, Mattmann, Chris A (388J) <ch...@jpl.nasa.gov> wrote:
> Hey Tom,
> 
> GREAT description of how to get a MIME type added to Tika and cataloged in FM.
> I'll try and add this to the wiki if you or someone else doesn't beat me to it :)
> 
> That being said, this is a fine approach to this use case. Sheryl's other email
> stating that newer versions of Tika understand the .h5 extension out of the box
> are correct. I think we could make this automatically supported in OODT by:
> 
> 1. upgrading to Tika 1.0
> 2. filing a JIRA issue associated with #1 and making sure Tika upgrades are
> coordinated across the components in OODT to get on the same version.
> 
> Until those happen, your solution is fine!
> 
> Cheers,
> Chris
> 
> On Feb 21, 2012, at 5:20 AM, Thomas Bennett wrote:
> 
> > Hi,
> >
> > I see that the file manager extracts the mime type from the product references that are passed to it via the xml-rcp ingestProduct call.
> >
> > I'm ingesting hdf5 files (ext .h5) into my archive.
> >
> > I've captured the methodCall and here is the actual parameter that is passed to the File Manager on a successful.
> >
> > <member>
> >     <name>references</name>
> >        ...
> >                         <member>
> >                             <name>mimeType</name>
> >                             <value>application/octet-stream</value>
> >                         </member>
> >                         <member>
> >                             <name>origReference</name>
> >                             <value>file:/var/kat/data/1329472755.h5</value>
> >                         </member>
> >        ...
> > </member>
> >
> > As you can see the mimeType is detected as application/octet-stream.
> >
> > This mimeType is auto-detected by the CAS-Crawler (I'm using the AutoDetectProductCrawler crawlerId).
> >
> > However. I configure the Crawler policy/mimetypes.xml:
> >
> > <mime-info>
> >       <mime-type type="product/hdf5">
> >               <glob pattern="\d{10}\.h5$" isregex="true"/>
> >       </mime-type>
> > </mime-info>
> >
> > and policy/mime-extractor-map.xml:
> >
> > <cas:mimetypemap xmlns:cas="http://oodt.jpl.nassa.gov/1.0/cas" magic="true or false" mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
> >       <mime type="product/hdf5">
> >               <extractor class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
> >                       <config file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
> >                       <preCondComparators>
> >                               <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
> >                       </preCondComparators>
> >               </extractor>
> >       </mime>
> > </cas:mimetypemap>
> >
> > The AutoDetectProductCrawler now uses this to detect the file and extract the metadata. However, when it comes to MimeType detection, this is done in the following line of code in org.apache.oodt.cas.filemgr.structs.Reference.java:
> >
> >
> >         try {
> >             this.mimeType = mimeTypeRepository
> >
> >                     .getMimeType(new URL(origRef));
> >
> >         } catch (MalformedURLException e) {
> >
> >             e.printStackTrace();
> >
> >         }
> >
> > So the mime-type is actually detected by the Tika library. Woot! So Tika does not seem to know about .h5 files and that they are hdf5 files.
> >
> > Forcing a MimeType to be "application/x-hdf" in the MetaData results in the mimetype being appended.
> >
> > MimeType
> > application/x-hdf
> > application/octet-stream
> > application
> > octet-stream
> >
> > So my question: Is this okay? Do I live with the application/octet-stream. Any recommendations on how to fix this?
> >
> > Cheers,
> > Tom
> >
> >
> >
> >
> >
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Mime type detection

Posted by Thomas Bennett <lm...@gmail.com>.
Hi Sheryl and Chris,

Thanks for the feedback. Much appreciated :)

I'll go ahead and make that wiki page.

Should I open a JIRA issue relating to upgrading Tika? For the 0.5 release?

Cheers,
Tom

On 21 February 2012 23:48, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Tom,
>
> GREAT description of how to get a MIME type added to Tika and cataloged in
> FM.
> I'll try and add this to the wiki if you or someone else doesn't beat me
> to it :)
>
> That being said, this is a fine approach to this use case. Sheryl's other
> email
> stating that newer versions of Tika understand the .h5 extension out of
> the box
> are correct. I think we could make this automatically supported in OODT by:
>
> 1. upgrading to Tika 1.0
> 2. filing a JIRA issue associated with #1 and making sure Tika upgrades are
> coordinated across the components in OODT to get on the same version.
>
> Until those happen, your solution is fine!
>
> Cheers,
> Chris
>
> On Feb 21, 2012, at 5:20 AM, Thomas Bennett wrote:
>
> > Hi,
> >
> > I see that the file manager extracts the mime type from the product
> references that are passed to it via the xml-rcp ingestProduct call.
> >
> > I'm ingesting hdf5 files (ext .h5) into my archive.
> >
> > I've captured the methodCall and here is the actual parameter that is
> passed to the File Manager on a successful.
> >
> > <member>
> >     <name>references</name>
> >        ...
> >                         <member>
> >                             <name>mimeType</name>
> >                             <value>application/octet-stream</value>
> >                         </member>
> >                         <member>
> >                             <name>origReference</name>
> >
> <value>file:/var/kat/data/1329472755.h5</value>
> >                         </member>
> >        ...
> > </member>
> >
> > As you can see the mimeType is detected as application/octet-stream.
> >
> > This mimeType is auto-detected by the CAS-Crawler (I'm using the
> AutoDetectProductCrawler crawlerId).
> >
> > However. I configure the Crawler policy/mimetypes.xml:
> >
> > <mime-info>
> >       <mime-type type="product/hdf5">
> >               <glob pattern="\d{10}\.h5$" isregex="true"/>
> >       </mime-type>
> > </mime-info>
> >
> > and policy/mime-extractor-map.xml:
> >
> > <cas:mimetypemap xmlns:cas="http://oodt.jpl.nassa.gov/1.0/cas"
> magic="true or false"
> mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
> >       <mime type="product/hdf5">
> >               <extractor
> class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
> >                       <config
> file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
> >                       <preCondComparators>
> >                               <preCondComparator
> id="CheckThatDataFileSizeIsGreaterThanZero"/>
> >                       </preCondComparators>
> >               </extractor>
> >       </mime>
> > </cas:mimetypemap>
> >
> > The AutoDetectProductCrawler now uses this to detect the file and
> extract the metadata. However, when it comes to MimeType detection, this is
> done in the following line of code in
> org.apache.oodt.cas.filemgr.structs.Reference.java:
> >
> >
> >         try {
> >             this.mimeType = mimeTypeRepository
> >
> >                     .getMimeType(new URL(origRef));
> >
> >         } catch (MalformedURLException e) {
> >
> >             e.printStackTrace();
> >
> >         }
> >
> > So the mime-type is actually detected by the Tika library. Woot! So Tika
> does not seem to know about .h5 files and that they are hdf5 files.
> >
> > Forcing a MimeType to be "application/x-hdf" in the MetaData results in
> the mimetype being appended.
> >
> > MimeType
> > application/x-hdf
> > application/octet-stream
> > application
> > octet-stream
> >
> > So my question: Is this okay? Do I live with the
> application/octet-stream. Any recommendations on how to fix this?
> >
> > Cheers,
> > Tom
> >
> >
> >
> >
> >
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>

Re: Mime type detection

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Tom,

GREAT description of how to get a MIME type added to Tika and cataloged in FM.
I'll try and add this to the wiki if you or someone else doesn't beat me to it :)

That being said, this is a fine approach to this use case. Sheryl's other email
stating that newer versions of Tika understand the .h5 extension out of the box
is correct. I think we could make this automatically supported in OODT by:

1. upgrading to Tika 1.0
2. filing a JIRA issue associated with #1 and making sure Tika upgrades are 
coordinated across the components in OODT to get on the same version.

Until those happen, your solution is fine!

Cheers,
Chris

On Feb 21, 2012, at 5:20 AM, Thomas Bennett wrote:

> Hi,
> 
> I see that the file manager extracts the mime type from the product references that are passed to it via the xml-rcp ingestProduct call.
> 
> I'm ingesting hdf5 files (ext .h5) into my archive.
> 
> I've captured the methodCall and here is the actual parameter that is passed to the File Manager on a successful.
> 
> <member>
>     <name>references</name>
>        ...
>                         <member>
>                             <name>mimeType</name>
>                             <value>application/octet-stream</value>
>                         </member>
>                         <member>
>                             <name>origReference</name>
>                             <value>file:/var/kat/data/1329472755.h5</value>
>                         </member>
>        ...
> </member>
> 
> As you can see the mimeType is detected as application/octet-stream.
> 
> This mimeType is auto-detected by the CAS-Crawler (I'm using the AutoDetectProductCrawler crawlerId).
> 
> However. I configure the Crawler policy/mimetypes.xml:
> 
> <mime-info>
> 	<mime-type type="product/hdf5">
> 		<glob pattern="\d{10}\.h5$" isregex="true"/>
> 	</mime-type>
> </mime-info>
> 
> and policy/mime-extractor-map.xml:
> 
> <cas:mimetypemap xmlns:cas="http://oodt.jpl.nassa.gov/1.0/cas" magic="true or false" mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
> 	<mime type="product/hdf5">
> 		<extractor class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
> 			<config file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
> 			<preCondComparators>
> 				<preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
> 			</preCondComparators>
> 		</extractor>
> 	</mime>
> </cas:mimetypemap>
> 
> The AutoDetectProductCrawler now uses this to detect the file and extract the metadata. However, when it comes to MimeType detection, this is done in the following line of code in org.apache.oodt.cas.filemgr.structs.Reference.java:
> 
> 
>         try {
>             this.mimeType = mimeTypeRepository
> 
>                     .getMimeType(new URL(origRef));
> 
>         } catch (MalformedURLException e) {
> 
>             e.printStackTrace();
> 
>         }
> 
> So the mime-type is actually detected by the Tika library. Woot! So Tika does not seem to know about .h5 files and that they are hdf5 files. 
> 
> Forcing a MimeType to be "application/x-hdf" in the MetaData results in the mimetype being appended.
> 
> MimeType	
> application/x-hdf
> application/octet-stream
> application
> octet-stream
> 
> So my question: Is this okay? Do I live with the application/octet-stream. Any recommendations on how to fix this?
> 
> Cheers,
> Tom
> 
> 
> 
> 
> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Mime type detection

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Tom,

GREAT description of how to get a MIME type added to Tika and cataloged in FM.
I'll try and add this to the wiki if you or someone else doesn't beat me to it :)

That being said, this is a fine approach to this use case. Sheryl's other email
stating that newer versions of Tika understand the .h5 extension out of the box
are correct. I think we could make this automatically supported in OODT by:

1. upgrading to Tika 1.0
2. filing a JIRA issue associated with #1 and making sure Tika upgrades are 
coordinated across the components in OODT to get on the same version.

Until those happen, your solution is fine!

Cheers,
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Mime type detection

Posted by Sheryl John <sh...@gmail.com>.
Hi Thomas,

A fix to support *.hf files went into Tika 0.9.
I saw it here:
http://mail-archives.apache.org/mod_mbox/tika-dev/201103.mbox/%3C129269711.10858.1299771239436.JavaMail.tomcat@hel.zones.apache.org%3E

cas-filemanager 0.3 uses tika-core-0.8.jar, and I'm not sure whether the
latest OODT version uses Tika 0.9 or above.
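
One rough way to check which side of that fix a deployment is on is to look at
the version in the tika-core jar name on the classpath. The sketch below is
only an illustration using the JDK; the class and method names
(TikaJarVersionCheck, isAtLeast) are made up for this example, and it assumes
the jar follows the tika-core-<major>.<minor>.jar naming seen above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OODT or Tika.
class TikaJarVersionCheck {
    // Matches names such as "tika-core-0.8.jar"; captures major and minor numbers.
    private static final Pattern JAR_NAME =
            Pattern.compile("tika-core-(\\d+)\\.(\\d+)(?:\\.\\d+)*\\.jar");

    /** Returns true if the version in the jar name is at least major.minor. */
    static boolean isAtLeast(String jarName, int major, int minor) {
        Matcher m = JAR_NAME.matcher(jarName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a tika-core jar name: " + jarName);
        }
        int jarMajor = Integer.parseInt(m.group(1));
        int jarMinor = Integer.parseInt(m.group(2));
        return jarMajor > major || (jarMajor == major && jarMinor >= minor);
    }

    public static void main(String[] args) {
        System.out.println(isAtLeast("tika-core-0.8.jar", 0, 9)); // false
        System.out.println(isAtLeast("tika-core-1.0.jar", 0, 9)); // true
    }
}
```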



-- 
-Sheryl