Posted to dev@tika.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2010/04/05 22:13:27 UTC

[jira] Created: (TIKA-400) netCDF Tika Parser

netCDF Tika Parser
------------------

                 Key: TIKA-400
                 URL: https://issues.apache.org/jira/browse/TIKA-400
             Project: Tika
          Issue Type: New Feature
          Components: parser
         Environment: indep. of env.
            Reporter: Chris A. Mattmann
            Assignee: Chris A. Mattmann
             Fix For: 0.8


Like the format covered by TIKA-399, netCDF is a widely used scientific data format. I'm going to put up a Tika parser that can deal with netCDF.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-400) netCDF Tika Parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856910#action_12856910 ] 

Chris A. Mattmann commented on TIKA-400:
----------------------------------------

Hey Jukka:

I think we do, since the NetCDF lib relies on it. I agree with you about not accessing external resources. The problem is that this NetCDF library (which seems to be the most used/maintained from a Java perspective) expects to control how content is delivered to it, too. In fact, NetCDF and HDF concern themselves not only with obtaining data from a particular stream/content, but also with how that content is represented: because the data volumes are so large, they have to optimize how the data is extracted and represented for access.

So, I actually ran into something similar here: the core abstraction for opening a NetcdfFile in the lib takes only a File as input, so it's really hard to pass it a stream, which is what Tika expects. Arg! Very frustrating indeed. I'll look around and see if there is another ASL-friendly NetCDF Java library (does anyone else know of one?).

Cheers,
Chris
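
One common workaround when a library accepts only a File but the caller has a stream is to spool the stream to a temporary file first. The following is a minimal sketch of that idea using only the JDK; the class name, method name, and temp-file prefix are made up for illustration, and it is not taken from the Tika code base or the NetCDF library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: names are illustrative, not part of Tika or netCDF-Java.
public class StreamToTempFile {

    /** Copies the whole stream into a new temporary file and returns its path. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("tika-netcdf-", ".nc");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails.
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; a real caller would pass the stream handed to the parser.
        byte[] sample = "CDF placeholder bytes".getBytes(StandardCharsets.US_ASCII);
        Path tmp = spool(new ByteArrayInputStream(sample));
        try {
            // A File-based library would be given tmp.toFile() at this point.
            System.out.println("spooled " + Files.size(tmp) + " bytes to " + tmp);
        } finally {
            // The caller owns the temporary file and must clean it up.
            Files.delete(tmp);
        }
    }
}
```

The trade-off is extra disk I/O and the need to clean up the file, in exchange for not holding a potentially very large data set in memory.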



[jira] Resolved: (TIKA-400) netCDF Tika Parser

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-400.
--------------------------------

    Resolution: Fixed

> why does Maven allow external repos to be referenced at all then if they want you to just put all your artifacts into central anyways?

It's useful, for example, for companies that have private repositories and never plan to release their code to Central. But if you're an open source project, a separate repository just makes life more difficult for downstream users.

> can we resolve this issue and then open a new one

Sure, re-resolving.


[jira] Commented: (TIKA-400) netCDF Tika Parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856032#action_12856032 ] 

Chris A. Mattmann commented on TIKA-400:
----------------------------------------

Hey Jukka: 

Interesting -- makes sense, but oddly, why does Maven allow external repos to be referenced at all then if they want you to just put all your artifacts into central anyways?

I'll send an email to the NetCDF'ers asking if they would be willing to upload their jars to central, and copy tika-dev@, so stay tuned. In the meantime, FWIW, can we resolve this issue and then open a new one to track updating the Tika POM to reference the new central NetCDF jars, assuming the NetCDF'ers are cool with uploading? I'm a fan of just creating new issues and linking them.

Cool?

Cheers,
Chris


[jira] Reopened: (TIKA-400) netCDF Tika Parser

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting reopened TIKA-400:
--------------------------------


It's bad practice (see [1]) to reference external repositories in POMs meant to be released to Maven Central. Can we ask NetCDF to consider uploading their jars to Maven Central?

We should also update the license files in tika-app and tika-bundle to include the NetCDF license terms.

[1] http://www.sonatype.com/people/2010/03/why-external-repos-are-being-phased-out-of-central/


[jira] Commented: (TIKA-400) netCDF Tika Parser

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856884#action_12856884 ] 

Jukka Zitting commented on TIKA-400:
------------------------------------

BTW, do we need the commons-httpclient dependency for this? If possible, it would be good if the parsing process didn't try to access external resources.


[jira] Resolved: (TIKA-400) netCDF Tika Parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-400.
------------------------------------

    Resolution: Fixed

- Basic support added in r931037. Room for extension. Currently I have to load the whole netCDF file into memory to work around a limitation of the netCDF Java API from NCAR, which doesn't handle streams (perhaps it's even a limitation of the netCDF API itself, which is random-access-file based according to the docs). I've included basic unit tests for now. So, we've got a start; extensions welcome!
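
For readers unfamiliar with the "load the whole file into memory" approach, here is a minimal sketch of the buffering step using only the JDK. The class and method names are hypothetical, and the hand-off of the resulting byte array to the netCDF library is deliberately left out, since the exact call used in r931037 is not shown in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: names are illustrative, not the code committed in r931037.
public class StreamToMemory {

    /** Reads the stream to its end and returns everything as one byte array. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 at end of stream; otherwise the number of bytes read.
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in input; a real caller would pass the stream given to the parser.
        InputStream in = new ByteArrayInputStream(new byte[] {'C', 'D', 'F', 1});
        byte[] data = readFully(in);
        // The byte array would then be handed to an in-memory open call, if the
        // library offers one (an assumption here, not confirmed by the thread).
        System.out.println("buffered " + data.length + " bytes");
    }
}
```

Memory use grows with file size, so this only suits files that comfortably fit in the heap, which matches the "room for extension" caveat above.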
