Posted to dev@tika.apache.org by "Chris A. Mattmann (Created) (JIRA)" <ji...@apache.org> on 2012/02/16 18:19:00 UTC

[jira] [Created] (TIKA-862) JPSS HDF5 files not being detected appropriately

JPSS HDF5 files not being detected appropriately
------------------------------------------------

                 Key: TIKA-862
                 URL: https://issues.apache.org/jira/browse/TIKA-862
             Project: Tika
          Issue Type: Bug
            Reporter: Richard Yu
            Assignee: Chris A. Mattmann


As commented in TIKA-614, JPSS HDF 5 files are not being properly detected by Tika. See this:

from [~minfing]:

{quote}
We were trying to extract metadata from our h5 file (i.e. with JPSS extension). We ran the following command line:
{noformat}
[ryu@localhost hdf5extractor]$ java -jar tika-app-1.0.jar -m \
> /usr/local/staging/products/h5/SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
Content-Encoding: windows-1252
Content-Length: 22187952
Content-Type: text/plain
resourceName: SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
[ryu@localhost hdf5extractor]$
{noformat}

We noticed that the content type is text/plain and that there are only 4 lines of output (we expected a lot of metadata).

Let me know if more information is needed. Thanks!

Richard
{quote}
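
One way to check, independently of Tika, whether a file looks like HDF5 at the byte level is to search for the HDF5 format signature. The HDF5 file format specification places the 8-byte signature at offset 0, or at 512, 1024, 2048 and so on, doubling each time. The sketch below is only an illustration of that check and is not a description of how Tika performs detection; the function name and the `max_offset` cap are assumptions made for this example.

```python
import os

# HDF5 format signature: the 8 bytes that open every HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def find_hdf5_signature(path, max_offset=1 << 20):
    """Return the byte offset of the HDF5 signature, or None if not found.

    Candidate offsets are 0, 512, 1024, 2048, ... (doubling each time).
    max_offset is an arbitrary cap chosen for this sketch (hypothetical).
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        offset = 0
        while offset < size and offset <= max_offset:
            fh.seek(offset)
            if fh.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return offset
            # Move from 0 to 512, then keep doubling.
            offset = 512 if offset == 0 else offset * 2
    return None
```

Calling `find_hdf5_signature` on one of the .h5 files above and getting `None` would suggest the file itself is damaged or truncated; getting an offset would suggest the file is structurally HDF5 and the text/plain result comes from the detection side.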


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-862) JPSS HDF5 files not being detected appropriately

Posted by "Chris A. Mattmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209788#comment-13209788 ] 

Chris A. Mattmann commented on TIKA-862:
----------------------------------------

Hi Richard, thanks. Do you know why the other file wouldn't work with h5dump? Do you think it's related to Tika not parsing it too? Tika uses the NetCDF Java library, so I'm wondering if they are related....
                

        

[jira] [Updated] (TIKA-862) JPSS HDF5 files not being detected appropriately

Posted by "Richard Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Yu updated TIKA-862:
----------------------------

    Attachment: RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5

The original one is too big to attach. Here is a smaller file I found that has the same issue:

[ryu@localhost hdf5extractor]$ java -jar tika-app-1.0.jar -m /home/ryu/gtpclient/RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5 
Content-Encoding: ISO-8859-1
Content-Length: 2048
Content-Type: text/plain
resourceName: RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5
[ryu@localhost hdf5extractor]$ 


Thanks!

Richard

                

        

[jira] [Updated] (TIKA-862) JPSS HDF5 files not being detected appropriately

Posted by "Richard Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Yu updated TIKA-862:
----------------------------

    Attachment: RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5

The original one is too big to attach. Here is a smaller file I found that has the same issue:

[ryu@localhost hdf5extractor]$ java -jar tika-app-1.0.jar -m /home/ryu/gtpclient/RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5 
Content-Encoding: ISO-8859-1
Content-Length: 2048
Content-Type: text/plain
resourceName: RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5
[ryu@localhost hdf5extractor]$ 


Thanks!


                

        

[jira] [Commented] (TIKA-862) JPSS HDF5 files not being detected appropriately

Posted by "Richard Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209812#comment-13209812 ] 

Richard Yu commented on TIKA-862:
---------------------------------

The one I sent earlier does not pass the h5dump test. It also does not pass the Tika test (i.e. it showed just 4 lines).
I deleted that file from my test samples; here are the rest that I kept:
[ryu@localhost hdf5]$ ls
IICMO_npp_d20120119_t1301328_e1302569_b01180_c20120119195316463240_noaa_ops.h5
RNSCA_npp_d20111121_t1935200_e1935400_b00346_c20111122203300301515_noaa_ops.h5
SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
VSTYO_npp_d20120120_t0617066_e0618308_b01190_c20120120123536501739_noaa_ops.hdf5

[ryu@localhost hdf5]$ java -jar /usr/local/extractors/tika-app-1.0.jar -m IICMO_npp_d20120119_t1301328_e1302569_b01180_c20120119195316463240_noaa_ops.h5 
Content-Encoding: windows-1252
Content-Length: 14800864
Content-Type: text/plain
resourceName: IICMO_npp_d20120119_t1301328_e1302569_b01180_c20120119195316463240_noaa_ops.h5

[ryu@localhost hdf5]$ java -jar /usr/local/extractors/tika-app-1.0.jar -m RNSCA_npp_d20111121_t1935200_e1935400_b00346_c20111122203300301515_noaa_ops.h5 
Content-Encoding: windows-1252
Content-Length: 20888
Content-Type: text/plain
resourceName: RNSCA_npp_d20111121_t1935200_e1935400_b00346_c20111122203300301515_noaa_ops.h5

[ryu@localhost hdf5]$ java -jar /usr/local/extractors/tika-app-1.0.jar -m SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5 
Content-Encoding: windows-1252
Content-Length: 22187952
Content-Type: text/plain
resourceName: SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5

[ryu@localhost hdf5]$ java -jar /usr/local/extractors/tika-app-1.0.jar -m VSTYO_npp_d20120120_t0617066_e0618308_b01190_c20120120123536501739_noaa_ops.hdf5 
Content-Encoding: windows-1252
Content-Length: 12328128
Content-Type: text/plain
resourceName: VSTYO_npp_d20120120_t0617066_e0618308_b01190_c20120120123536501739_noaa_ops.hdf5


All of them work with h5dump. All of them are huge files except RNSCA.


I will download more small files and test them against Tika/h5dump. I'm not sure whether this information helps you; let me know. Thanks!

Richard
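
When comparing many files like this, the `Key: value` lines printed by the runs above can be parsed and checked automatically. The following is a minimal sketch under that assumption; `parse_tika_metadata` and `looks_undetected` are hypothetical helper names for this example and are not part of Tika, and the sample text is copied from the RNSCA run above.

```python
def parse_tika_metadata(output):
    """Parse 'Key: value' lines, as printed in the runs above, into a dict."""
    metadata = {}
    for line in output.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip blank lines and anything without a 'Key: value' shape
            metadata[key.strip()] = value.strip()
    return metadata


def looks_undetected(metadata):
    """True when the reported type is the text/plain result seen in this issue."""
    return metadata.get("Content-Type") == "text/plain"


# Sample output copied from the RNSCA run in this comment.
SAMPLE = """\
Content-Encoding: windows-1252
Content-Length: 20888
Content-Type: text/plain
resourceName: RNSCA_npp_d20111121_t1935200_e1935400_b00346_c20111122203300301515_noaa_ops.h5
"""

if looks_undetected(parse_tika_metadata(SAMPLE)):
    print("reported as text/plain; HDF5 metadata was not extracted")
```

Feeding each file's captured output through these helpers gives a quick list of files that h5dump accepts but that are still reported as text/plain.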




                

        

[jira] [Updated] (TIKA-862) JPSS HDF5 files not being detected appropriately

Posted by "Chris A. Mattmann (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-862:
-----------------------------------

          Component/s: parser
    Affects Version/s: 1.0

- classify and identify version (I think)
                

        

[jira] [Updated] (TIKA-862) JPSS HDF5 files not being detected appropriately

Posted by "Richard Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Yu updated TIKA-862:
----------------------------

    Attachment: RNSCA_npp_d20111121_t1935200_e1935400_b00346_c20111122203300301515_noaa_ops.h5

This file works with h5dump. The previous one does not work with h5dump.
                

        

[jira] [Commented] (TIKA-862) JPSS HDF5 files not being detected appropriately

Posted by "Chris A. Mattmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209513#comment-13209513 ] 

Chris A. Mattmann commented on TIKA-862:
----------------------------------------

Hi Richard, can you:

* attach that sample HDF 5 file to JIRA here? Or point me to a URL where I can get it?
* let me know what version of Tika you are using -- looks like 1.0 -- can you confirm that?

I'll take the above and then investigate what we're seeing and get right back to you!

                