Posted to dev@tika.apache.org by "Rishi Verma (JIRA)" <ji...@apache.org> on 2015/05/08 03:29:00 UTC

[jira] [Commented] (TIKA-1577) NetCDF Data Extraction

    [ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533722#comment-14533722 ] 

Rishi Verma commented on TIKA-1577:
-----------------------------------

Hi Annie, All

I'm going to take a crack at this. Feel free to assign this to me!

My plan: leverage Tika's ParseContext to give a couple of content extraction "modes". Doing so will allow the developer to configure netCDF variable extraction to better scale with huge amounts of variable content. I'm aiming for the following modes:
  1. "Default Mode": defaults to either Zero Mode or Preview Mode.
  2. "Zero Mode": no variable content is read. This is the same as the current capability.
  3. "Preview Mode": a limited amount of variable content is read, starting from index zero. Probably one or two indices only, since the text buffer can become massive very quickly.
  4. "Custom Mode": lets the developer specify a custom variable Range to extract for ALL variables. If the range exceeds the size of a given dimension within a variable, extraction stops at the maximum size of that dimension. I'm specifically targeting a single custom Range that applies to all variables at once, because Tika's philosophy (to me) seems to presume limited knowledge of the actual data. Plus, if the user has a very specific use case, such as extracting a particular variable's slice + range + step, then IMO Tika is not the right tool; the netCDF library should be used instead, since it gives that kind of flexibility.
  5. "Full Mode": extract all variable content. Note that this can result in a Tika exception if more than 100,000 characters are extracted when calling "handler.toString()".
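
To make the modes above a bit more concrete, here is a minimal sketch of how a per-dimension read range could be derived from a mode. Everything in it is hypothetical: ExtractionMode, VariableExtractionConfig, rangeFor and PREVIEW_LENGTH are illustrative names, not existing Tika classes, and the choice of Zero Mode as the default is only an assumption, since the default is still undecided.

```java
// Hypothetical names throughout; none of these types exist in Tika.
enum ExtractionMode { ZERO, PREVIEW, CUSTOM, FULL }

final class VariableExtractionConfig {
    // Assumed preview size, following the "one or two indices" idea.
    static final int PREVIEW_LENGTH = 2;

    final ExtractionMode mode;
    final int customStart;   // only consulted in CUSTOM mode
    final int customLength;  // only consulted in CUSTOM mode

    VariableExtractionConfig(ExtractionMode mode, int customStart, int customLength) {
        if (customStart < 0 || customLength < 0) {
            throw new IllegalArgumentException("start and length must be non-negative");
        }
        this.mode = mode;
        this.customStart = customStart;
        this.customLength = customLength;
    }

    // Assumption: "Default Mode" falls back to Zero Mode.
    static VariableExtractionConfig defaults() {
        return new VariableExtractionConfig(ExtractionMode.ZERO, 0, 0);
    }

    /** Returns {start, count} to read along one dimension of the given size. */
    int[] rangeFor(int dimensionSize) {
        switch (mode) {
            case ZERO:
                return new int[] {0, 0};
            case PREVIEW:
                return new int[] {0, Math.min(PREVIEW_LENGTH, dimensionSize)};
            case CUSTOM:
                // Clamp the requested range to what the dimension actually holds.
                int start = Math.min(customStart, dimensionSize);
                int count = Math.min(customLength, dimensionSize - start);
                return new int[] {start, count};
            case FULL:
                return new int[] {0, dimensionSize};
            default:
                throw new IllegalStateException("unknown mode: " + mode);
        }
    }
}
```

A caller could presumably hand such an object to the parser with context.set(VariableExtractionConfig.class, config), and the parser could read it back with context.get(VariableExtractionConfig.class, VariableExtractionConfig.defaults()), which is how ParseContext is normally used to pass parser-specific options.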

In terms of XHTML structure, I'm thinking of a nested "<ul><li>" structure that starts with the left-most dimension of a given variable and generates inner "<ul><li>" structures for each subsequent dimension's data. Doing this will provide some visible structure when rendering to a viewer's screen, and will also make the output much easier to parse as XML than one giant flat list of variable content.
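
As a rough illustration of that nesting, the sketch below renders a flat array of values into nested lists, one level per dimension. It assumes the values are laid out in row-major order (left-most dimension varying slowest) and builds a plain string for simplicity. NestedListSketch and render are hypothetical names; a real parser would presumably emit the same elements through XHTMLContentHandler's startElement/endElement/characters calls instead.

```java
// Hypothetical helper, not part of Tika: renders row-major data as nested lists.
final class NestedListSketch {

    /** Renders flat row-major values as nested <ul><li> lists, one level per dimension. */
    static String render(double[] flat, int[] shape) {
        if (shape.length == 0) {
            throw new IllegalArgumentException("at least one dimension is required");
        }
        int expected = 1;
        for (int size : shape) {
            expected *= size;
        }
        if (expected != flat.length) {
            throw new IllegalArgumentException("shape does not match number of values");
        }
        StringBuilder out = new StringBuilder();
        render(flat, shape, 0, 0, out);
        return out.toString();
    }

    // Writes one <ul> for dimension `dim` and returns the offset of the next unread value.
    private static int render(double[] flat, int[] shape, int dim, int offset, StringBuilder out) {
        out.append("<ul>");
        for (int i = 0; i < shape[dim]; i++) {
            out.append("<li>");
            if (dim == shape.length - 1) {
                out.append(flat[offset++]);                          // innermost dimension: a value
            } else {
                offset = render(flat, shape, dim + 1, offset, out);  // recurse into next dimension
            }
            out.append("</li>");
        }
        out.append("</ul>");
        return offset;
    }
}
```

For a 2 x 2 variable, each outer "<li>" wraps an inner "<ul>" holding one row, so an XML consumer can recover the dimensional structure from the nesting depth alone.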

> NetCDF Data Extraction
> ----------------------
>
>                 Key: TIKA-1577
>                 URL: https://issues.apache.org/jira/browse/TIKA-1577
>             Project: Tika
>          Issue Type: Improvement
>          Components: handler, parser
>    Affects Versions: 1.7
>            Reporter: Ann Burgess
>            Assignee: Ann Burgess
>              Labels: features, handler
>             Fix For: 1.9
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> A netCDF classic or 64-bit offset dataset is stored as a single file comprising two parts:
>  - a header, containing all the information about dimensions, attributes, and variables except for the variable data;
>  - a data part, comprising fixed-size data, containing the data for variables that don't have an unlimited dimension; and variable-size data, containing the data for variables that have an unlimited dimension.
> The NetCDFparser currently extracts the "header part".  
>  -- text extracts file Dimensions and Variables
>  -- metadata extracts Global Attributes
> We want the option to extract the "data part" of NetCDF files.  
> Let's use the NetCDF test file for our dev testing:  tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_200001.nc
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)