Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2011/05/20 18:01:59 UTC

Towards 1.0

Hi,

It's been a few months since 0.9 and our Tika in Action book is almost
ready for print, so I think it's a good time to start planning for the
1.0 release.

There are a few odds and ends that I'd still like to sort out in the
trunk, but overall I think we're pretty much ready for the switch
from 0.x to 1.x.

One major issue to be decided is whether we want to follow up with the
earlier intention of dropping deprecated functionality (like the
three-argument parse() method) before the 1.0 release. I think we
should do that and also make some other backwards-incompatible
cleanups while we're at it. That way we'll have less old baggage to
carry as we evolve through the 1.x release cycle.
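
As a rough illustration of what dropping such an overload involves, here is
a minimal sketch. It uses stand-in types only (a String for the input
stream, a StringBuilder for the content handler, a Map for the metadata) and
does not use Tika's actual classes; ParseOverloadSketch, Context and
UpperCaseParser are hypothetical names. The assumption is that the
deprecated three-argument form merely forwards to the four-argument one with
an empty context, so callers can migrate by passing a context themselves.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ParseOverloadSketch {

    /** Stand-in for a context object carrying optional per-parse settings, keyed by class. */
    public static class Context {
        private final Map<Class<?>, Object> values = new HashMap<>();

        public <T> void set(Class<T> key, T value) {
            values.put(key, value);
        }

        public <T> T get(Class<T> key) {
            return key.cast(values.get(key));
        }
    }

    /** Stand-in parser (hypothetical): upper-cases the input and records its length. */
    public static class UpperCaseParser {

        /** Four-argument form: the one that would remain after the cleanup. */
        public void parse(String input, StringBuilder output,
                          Map<String, String> metadata, Context context) {
            metadata.put("length", String.valueOf(input.length()));
            output.append(input.toUpperCase(Locale.ROOT));
        }

        /** Three-argument form: assumed to forward with an empty context. */
        @Deprecated
        public void parse(String input, StringBuilder output,
                          Map<String, String> metadata) {
            parse(input, output, metadata, new Context());
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        StringBuilder output = new StringBuilder();
        // Migrated call site: pass the context explicitly.
        new UpperCaseParser().parse("tika", output, metadata, new Context());
        System.out.println(output + " " + metadata.get("length"));
    }
}
```

Under that assumption, removing the three-argument method only requires
call sites to add a context argument; behaviour stays the same.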

Another thing to think about is whether we want to do a formal Apache
press release about Tika reaching 1.0 status.

BR,

Jukka Zitting

Re: Towards 1.0

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 20 May 2011, Jukka Zitting wrote:
> There are a few odds and ends that I'd still like to sort out in the 
> trunk, but overall I think we're pretty much ready for the switch 
> from 0.x to 1.x.

I'd like to get an updated POI release in first, along with the few 
patches that are waiting for it. I'll see if I can get a release vote for 
that going in the next few days.

> One major issue to be decided is whether we want to follow up with the 
> earlier intention of dropping deprecated functionality (like the 
> three-argument parse() method) before the 1.0 release. I think we should 
> do that and also make some other backwards-incompatible cleanups while 
> we're at it. That way we'll have less old baggage to carry as we evolve 
> through the 1.x release cycle.

I'm tempted to say we do what Lucene does, and do two releases. The 1.0 
would ditch the deprecated bits, and the (say) 0.9.9 would still have 
them. Not sure if we have enough changes and users to warrant that though?

> Another thing to think about is whether we want to do a formal Apache
> press release about Tika reaching 1.0 status.

I'd say yes. Sally's the one to ask about this. Since you've got all the 
blurb about Tika ready from the book, is this something you or Chris could 
take the lead on? We'd probably need to give her a rough draft now, a 
final draft when the vote starts, and it'd go out when the release hits 
the mirrors.

Nick

Re: Towards 1.0

Posted by Julien Nioche <li...@gmail.com>.
Hi

> It's been a few months since 0.9 and our Tika in Action book is almost
> ready for print, so I think it's a good time to start planning for the
> 1.0 release.
>
> There are a few odds and ends that I'd still like to sort out in the
> trunk, but overall I think we're pretty much ready for the switch
> from 0.x to 1.x.
>

+1


>
> One major issue to be decided is whether we want to follow up with the
> earlier intention of dropping deprecated functionality (like the
> three-argument parse() method) before the 1.0 release. I think we
> should do that and also make some other backwards-incompatible
> cleanups while we're at it. That way we'll have less old baggage to
> carry as we evolve through the 1.x release cycle


+1 this is the perfect time to do these changes

We'll spend some time next week on TIKA-657, i.e. processing the Enron corpus
with Tika + Behemoth; we'll probably find things to improve in the email
parser as a result. Maybe it would be good to do 1.0 after that?

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Towards 1.0

Posted by Steve Aulenbach <sa...@neoninc.org>.
Hi,

As one of those new "1.0" users Jukka mentioned, I am busy learning Tika. To
get a better handle on all Tika does, I ran a commercial tool against
revision 1130663. It identified a few missing files in /tika-parsers. In
case it is useful for the 1.0 release plans, here is the list.

/tika-parsers/src/main/java/org/apache/tika/parser/iwork/IWorkParser.java

/tika-parsers/src/test/java/org/apache/tika/mime/PatternsTest.java

/tika-parsers/src/test/java/org/apache/tika/parser/CompositeParserTest.java

/tika-parsers/src/test/java/org/apache/tika/parser/DummyParser.java

/tika-parsers/src/test/java/org/apache/tika/parser/opendocument/ODFParserTest.java

/tika-parsers/src/test/java/org/apache/tika/parser/opendocument/OpenOfficeParserTest.java


/tika-parsers/target/surefire-reports/TEST-org.apache.tika.mime.PatternsTest.xml

/tika-parsers/target/surefire-reports/TEST-org.apache.tika.parser.CompositeParserTest.xml

/tika-parsers/target/surefire-reports/TEST-org.apache.tika.parser.opendocument.ODFParserTest.xml

/tika-parsers/target/surefire-reports/TEST-org.apache.tika.parser.opendocument.OpenOfficeParserTest.xml


Thanks,

Steve

On Mon, May 23, 2011 at 7:45 AM, Jukka Zitting <ju...@gmail.com> wrote:

> Hi,
>
> Thanks for the feedback! It sounds like we should be good to go for
> the 1.0 release in about a month from now.
>
> The release doesn't need to be perfect as we can always do 1.1 and
> other releases after that, so I wouldn't put any single issue as a
> blocker. On the other hand this will probably be the first Tika
> release that many new users will encounter, so we should strive to
> make it as good as we can.
>
> It sounds like we have consensus to get rid of the deprecated stuff
> before 1.0. I don't think a separate 0.9.9 or 0.10 release for that is
> really needed, but it would be good to create a 0.x branch right
> before the backwards-incompatible changes so people who have trouble
> with the upgrade still have something more recent than 0.9 to work
> with.
>
> Chris and I will take the lead on the press release and will circulate
> the draft so everyone can chime in with suggestions for improvements.
> It would be great to have a few supporting quotes from organizations
> that are already using Tika. Any takers? I can probably get Adobe on
> board.
>
> BR,
>
> Jukka Zitting
>

Re: Towards 1.0

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Jukka, looking forward to working on this with you and Sally...

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Towards 1.0

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

Thanks for the feedback! It sounds like we should be good to go for
the 1.0 release in about a month from now.

The release doesn't need to be perfect as we can always do 1.1 and
other releases after that, so I wouldn't put any single issue as a
blocker. On the other hand this will probably be the first Tika
release that many new users will encounter, so we should strive to
make it as good as we can.

It sounds like we have consensus to get rid of the deprecated stuff
before 1.0. I don't think a separate 0.9.9 or 0.10 release for that is
really needed, but it would be good to create a 0.x branch right
before the backwards-incompatible changes so people who have trouble
with the upgrade still have something more recent than 0.9 to work
with.

Chris and I will take the lead on the press release and will circulate
the draft so everyone can chime in with suggestions for improvements.
It would be great to have a few supporting quotes from organizations
that are already using Tika. Any takers? I can probably get Adobe on
board.

BR,

Jukka Zitting

Re: Towards 1.0

Posted by Steve Aulenbach <sa...@neoninc.org>.
Hi Chris,

Thanks for the ODL reference. I'm working through it now.

Steve

On Sat, May 28, 2011 at 1:55 AM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Steve,
>
>
> > Chris, do you have a good reference for ODL files?
>
> Chapter 12 (on ODL) of the NASA Planetary Data System (PDS) Standards
> Reference is the best one I know:
>
> http://pds.nasa.gov/tools/standards-reference.shtml
>
> > It sounds like the MinODL
> > parser will allow you to traverse from Group to Group Data Fields to
> > the dimensions and variables in an HDF-EOS file
>
> +1, yep.
>
> >
> > and to dimensions and variables in netCDF land, true?
>
> +1, yep.
>
> That's the goal!
>
> Cheers,
> Chris

Re: Towards 1.0

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Steve,


> Chris, do you have a good reference for ODL files?

Chapter 12 (on ODL) of the NASA Planetary Data System (PDS) Standards Reference is the best one I know:

http://pds.nasa.gov/tools/standards-reference.shtml

> It sounds like the MinODL
> parser will allow you to traverse from Group to Group Data Fields to
> the dimensions and variables in an HDF-EOS file

+1, yep.

> 
> and to dimensions and variables in netCDF land, true?

+1, yep.

That's the goal!

Cheers,
Chris



Re: Towards 1.0

Posted by Steve Aulenbach <sa...@neoninc.org>.
Hi Chris,

Chris, do you have a good reference for ODL files? It sounds like the MinODL
parser will allow you to traverse from Group to Group Data Fields to
the dimensions and variables in an HDF-EOS file:

netcdf /Users/saulenbach/dev/tikaModisTest/MOD15A2.A2010209.h09v04.005.2010219082157.hdf
{
 variables:
   char StructMetadata.0(32000);
   char CoreMetadata.0(16125);
   char ArchiveMetadata.0(5336);
   char ENGINEERING_DATA(8337);

 Group MOD_Grid_MOD15A2 {

   Group Data Fields {
     dimensions:
       XDim = 1200;
       YDim = 1200;
     variables:
       double Fpar_1km(YDim=1200, XDim=1200);
         :scale_factor_err = 0.0; // double
         :add_offset_err = 0.0; // double
         :calibrated_nt = 21; // int
         :long_name = "MOD15A2 MODIS/Terra Gridded 1KM FPAR (8-day composite)";
         :units = "Percent";
         :MOD15A2_FILLVALUE_DOC = "MOD15A2 FILL VALUE LEGEND\n255 =
_Fillvalue, assigned when:\n    * the MODAGAGG suf. reflectance for
channel VIS, NIR was assigned its _Fillvalue, or\n    * land cover
pixel itself was assigned _Fillvalus 255 or 254.\n254 = land cover
assigned as perennial salt or inland fresh water.\n253 = land cover
assigned as barren, sparse vegetation (rock, tundra, desert.)\n252 =
land cover assigned as perennial snow, ice.\n251 = land cover assigned
as \"permanent\" wetlands/inundated marshlands.\n250 = land cover
assigned as urban/built-up.\n249 = land cover assigned as
\"unclassified\" or not able to determine.\n";


and to dimensions and variables in netCDF land, true?

netcdf file:/Users/saulenbach/src/tika/tika-site/tika-parsers/target/test-classes/test-documents/sresa1b_ncar_ccsm3_0_run1_200001.nc
{
 dimensions:
   lat = 128;
   lon = 256;
   bnds = 2;
   plev = 17;
   time = UNLIMITED;   // (1 currently)
 variables:
   float area(lat=128, lon=256);
     :long_name = "Surface area";
     :units = "meter2";
   double lat_bnds(lat=128, bnds=2);
   double lon_bnds(lon=256, bnds=2);
   int msk_rgn(lat=128, lon=256);
     :long_name = "Mask region";
     :units = "bool";
   float pr(time=1, lat=128, lon=256);
     :comment = "Created using NCL code CCSM_atmm_2cf.ncl on\n machine
eagle163s";
     :missing_value = 1.0E20f; // float
     :_FillValue = 1.0E20f; // float
     :cell_methods = "time: mean (interval: 1 month)";
     :history = "(PRECC+PRECL)*r[h2o]";
     :original_units = "m-1 s-1";
     :original_name = "PRECC, PRECL";
     :standard_name = "precipitation_flux";
     :units = "kg m-2 s-1";
     :long_name = "precipitation_flux";
     :cell_method = "time: mean";
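
To make the "dimensions" part concrete, here is a minimal sketch that pulls
the dimension names and sizes out of dump text like the above. It is plain
Java with no Tika or netCDF library involved; NcdumpDimensions is a
hypothetical name, and the sketch assumes the "name = size;" layout shown in
these dumps. Non-numeric sizes such as UNLIMITED are skipped, and dimensions
from nested groups are merged into one map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NcdumpDimensions {

    // Matches lines such as "lat = 128;" or "XDim = 1200;" (numeric sizes only).
    private static final Pattern DIMENSION =
            Pattern.compile("^\\s*(\\w+)\\s*=\\s*(\\d+)\\s*;");

    /** Collects name/size pairs from every "dimensions:" section of the dump text. */
    public static Map<String, Integer> parse(String dump) {
        Map<String, Integer> dimensions = new LinkedHashMap<>();
        boolean inDimensions = false;
        for (String line : dump.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.equals("dimensions:")) {
                inDimensions = true;
            } else if (trimmed.equals("variables:")) {
                inDimensions = false;
            } else if (inDimensions) {
                Matcher m = DIMENSION.matcher(line);
                if (m.find()) {
                    dimensions.put(m.group(1), Integer.parseInt(m.group(2)));
                }
            }
        }
        return dimensions;
    }

    public static void main(String[] args) {
        String dump = " dimensions:\n"
                + "   lat = 128;\n"
                + "   lon = 256;\n"
                + "   time = UNLIMITED;   // (1 currently)\n"
                + " variables:\n"
                + "   float area(lat=128, lon=256);\n";
        System.out.println(parse(dump));
    }
}
```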


Thanks,
Steve

On Fri, May 27, 2011 at 12:31 PM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Steve!
>
> Nice to see you show up on the list :-) Yep, I totally agree, I have a
> couple of useful additions I'm going to create issues for and contribute
> back to Tika:
>
> 1. MinODL parser for ODL files themselves and also used in 2 below;
> 2. ParseContext properties identifying:
>   - groups that are in fact ODL values, that need to be parsed with the
> MinODL parser (useful for NetCDF and for HDF)
>   - what groups to select out (e.g., in HDF, by Path
> /Group1/SubGroup1/Property, and in NetCDF just by name)
>
> I think the combination of those will help the HDF and NetCDF parsers to
> become more robust and configurable. Also, GDAL is high on my priority
> list. I've already built the Java bindings, but am working through some
> trickery with GDAL since it doesn't like the fact that Tika isn't
> file-based, and when we use TikaInputStream, it creates a file with an
> arbitrary extension (which ticks off GDAL as it's looking for something
> specific). I have a work-around in the works, though...
>
> Cheers,
> Chris

Re: Towards 1.0

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Steve!

Nice to see you show up on the list :-) Yep, I totally agree, I have a couple of useful additions I'm going to create issues for and contribute back to Tika:

1. MinODL parser for ODL files themselves and also used in 2 below;
2. ParseContext properties identifying: 
   - groups that are in fact ODL values, that need to be parsed with the MinODL parser (useful for NetCDF and for HDF)
   - what groups to select out (e.g., in HDF, by Path /Group1/SubGroup1/Property, and in NetCDF just by name)
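
A minimal sketch of how item 2 might look from the caller's side, assuming
a context that is keyed by class. Everything here is a stand-in: Context
only imitates a class-keyed context object, and GroupSelection is a
hypothetical option type, not an existing Tika class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GroupSelectionSketch {

    /** Hypothetical option: the group paths (or names) a caller wants extracted. */
    public static final class GroupSelection {
        private final Set<String> paths;

        public GroupSelection(String... paths) {
            this.paths = new LinkedHashSet<>(Arrays.asList(paths));
        }

        public boolean selects(String path) {
            return paths.contains(path);
        }
    }

    /** Stand-in for a class-keyed parse context. */
    public static final class Context {
        private final Map<Class<?>, Object> values = new HashMap<>();

        public <T> void set(Class<T> key, T value) {
            values.put(key, value);
        }

        public <T> T get(Class<T> key) {
            return key.cast(values.get(key));
        }
    }

    /** Returns only the selected groups, or everything when nothing was configured. */
    public static List<String> select(List<String> allPaths, Context context) {
        GroupSelection selection = context.get(GroupSelection.class);
        if (selection == null) {
            return allPaths;
        }
        List<String> selected = new ArrayList<>();
        for (String path : allPaths) {
            if (selection.selects(path)) {
                selected.add(path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Context context = new Context();
        context.set(GroupSelection.class, new GroupSelection("/Group1/SubGroup1/Property"));
        List<String> all = Arrays.asList("/Group1/SubGroup1/Property", "/Group2/Other");
        System.out.println(select(all, context));
    }
}
```

Keying the option by its class keeps the parser signature unchanged: a
parser that knows about the option looks it up, and others ignore it.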

I think the combination of those will help the HDF and NetCDF parsers to become more robust and configurable. Also, GDAL is high on my priority list. I've already built the Java bindings, but am working through some trickery with GDAL since it doesn't like the fact that Tika isn't file-based, and when we use TikaInputStream, it creates a file with an arbitrary extension (which ticks off GDAL as it's looking for something specific). I have a work-around in the works, though...
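
For illustration only, and not necessarily the work-around referred to
above: one possible approach is to spool the stream into a temporary file
whose suffix the caller chooses, so that a file-based library sees the
extension it expects. The sketch uses only java.nio.file, with no Tika or
GDAL calls; SuffixedTempFile and spool are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SuffixedTempFile {

    /** Copies the stream into a new temporary file that ends with the given suffix. */
    public static Path spool(InputStream in, String suffix) throws IOException {
        Path tmp = Files.createTempFile("tika-", suffix);
        // createTempFile has already created the (empty) file, so overwrite it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3};
        Path file = spool(new ByteArrayInputStream(data), ".hdf");
        try {
            System.out.println(file.getFileName() + " holds " + Files.size(file) + " bytes");
        } finally {
            // The caller owns the temporary file and is responsible for cleanup.
            Files.deleteIfExists(file);
        }
    }
}
```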

Cheers,
Chris


On May 26, 2011, at 4:20 AM, Steve Aulenbach wrote:

> Hi Chris,
> 
> I think your plan to improve the netCDF and HDF parsing is a great one. The
> richness of a full ncdump of netCDF metadata and a full ncdump of HDF-EOS
> metadata would be an excellent addition to the 1.0 release of Tika. I have
> discussed Tika with several science data users, and they usually ask about
> netCDF and HDF-EOS metadata capabilities. A GDAL parser is also a great
> idea.
> 
> Thanks,
> Steve
> 
> On Fri, May 20, 2011 at 12:22 PM, Mattmann, Chris A (388J) <
> chris.a.mattmann@jpl.nasa.gov> wrote:
> 
>> Hey Jukka et al.,
>> 
>>> It's been a few months since 0.9 and our Tika in Action book is almost
>>> ready for print, so I think it's a good time to start planning for the
>>> 1.0 release.
>> 
>> Looking forward to not writing anything for a while :-) I doubt it'll
>> happen knowing how things go, but also really really happy with where the
>> book is (and banging on those last revisions! :-) ).
>> 
>>> 
>>> There are a few odds and ends that I'd still like to sort out in the
>>> trunk, but overall I think we're pretty much ready for the switch
>>> from 0.x to 1.x.
>> 
>> +1.
>> 
>>> 
>>> One major issue to be decided is whether we want to follow up with the
>>> earlier intention of dropping deprecated functionality (like the
>>> three-argument parse() method) before the 1.0 release.
>> 
>> +1, I'd be fine with this. I'm a fan of following through on things that we
>> say we're going to do if for no other good reason than we said we're going
>> to do it.
>> 
>> +1 to dropping the 3 arg parse method.
>> 
>>> I think we
>>> should do that and also make some other backwards-incompatible
>>> cleanups while we're at it. That way we'll have less old baggage to
>>> carry as we evolve through the 1.x release cycle.
>> 
>> +1, my biggest thing to work on is improving the NetCDF and HDF parsing,
>> adding an ODL parser (I'll create an issue for this), adding some spatial
>> parsers (working on the GDAL one right now), and maybe some documentation on
>> how to use the science data file formats. I should have time over the next
>> month or so to complete these.
>> 
>>> 
>>> Another thing to think about is whether we want to do a formal Apache
>>> press release about Tika reaching 1.0 status.
>> 
>> +1. I'd be happy to work with Jukka, as Nick suggested, to draft this, and
>> then from there to work with Sally to make it happen.
>> 
>> Thanks!
>> 
>> Cheers,
>> Chris
>> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Towards 1.0

Posted by Steve Aulenbach <sa...@neoninc.org>.
Hi Chris,

I think your plan to improve the netCDF and HDF parsing is a great one. The
richness of a full ncdump of netCDF metadata and a full ncdump of HDF-EOS
metadata would be an excellent addition to the 1.0 release of Tika. I have
discussed Tika with several science data users, and they usually ask about
netCDF and HDF-EOS metadata capabilities. A GDAL parser is also a great
idea.

Thanks,
Steve

On Fri, May 20, 2011 at 12:22 PM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Jukka et al.,
>
> > It's a few months since 0.9 and our Tika in Action book is soon ready
> > for print, so I think it's good time to start planning for the 1.0
> > release.
>
> Looking forward to not writing anything for a while :-) I doubt it'll
> happen knowing how things go, but also really really happy with where the
> book is (and banging on those last revisions! :-) ).
>
> >
> > There are a few odds and ends that I'd still like to sort out in the
> > trunk, but overall I think we're in a pretty much ready for the switch
> > from 0.x to 1.x.
>
> +1.
>
> >
> > One major issue to be decided is whether we want to follow up with the
> > earlier intention of dropping deprecated functionality (like the
> > three-argument parse() method) before the 1.0 release.
>
> +1, I'd be fine with this. I'm a fan of following through on things that we
> say we're going to do if for no other good reason than we said we're going
> to do it.
>
> +1 to dropping the 3 arg parse method.
>
> > I think we
> > should do that and also make some other backwards-incompatible
> > cleanups while we're at it. That way we'll have less old baggage to
> > carry as we evolve through the 1.x release cycle.
>
> +1, my biggest thing to work on is improving the NetCDF and HDF parsing,
> adding an ODL parser (I'll create an issue for this), adding some spatial
> parsers (working on the GDAL one right now), and maybe some documentation on
> how to use the science data file formats. I should have time over the next
> month or so to complete these.
>
> >
> > Another thing to think about is whether we want to do a formal Apache
> > press release about Tika reaching 1.0 status.
>
> +1. I'd be happy to work with Jukka, as Nick suggested, to draft this, and
> then from there to work with Sally to make it happen.
>
> Thanks!
>
> Cheers,
> Chris
>

Re: Towards 1.0

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Jukka et al.,

> It's a few months since 0.9 and our Tika in Action book is soon ready
> for print, so I think it's good time to start planning for the 1.0
> release.

Looking forward to not writing anything for a while :-) I doubt it'll happen knowing how things go, but also really really happy with where the book is (and banging on those last revisions! :-) ).

> 
> There are a few odds and ends that I'd still like to sort out in the
> trunk, but overall I think we're in a pretty much ready for the switch
> from 0.x to 1.x.

+1.

> 
> One major issue to be decided is whether we want to follow up with the
> earlier intention of dropping deprecated functionality (like the
> three-argument parse() method) before the 1.0 release.

+1, I'd be fine with this. I'm a fan of following through on things that we say we're going to do if for no other good reason than we said we're going to do it. 

+1 to dropping the 3 arg parse method.

> I think we
> should do that and also make some other backwards-incompatible
> cleanups while we're at it. That way we'll have less old baggage to
> carry as we evolve through the 1.x release cycle.

+1, my biggest thing to work on is improving the NetCDF and HDF parsing, adding an ODL parser (I'll create an issue for this), adding some spatial parsers (working on the GDAL one right now), and maybe some documentation on how to use the science data file formats. I should have time over the next month or so to complete these.

> 
> Another thing to think about is whether we want to do a formal Apache
> press release about Tika reaching 1.0 status.

+1. I'd be happy to work with Jukka, as Nick suggested, to draft this, and then from there to work with Sally to make it happen.

Thanks!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Towards 1.0

Posted by Ken Krugler <kk...@transpac.com>.
Hi Jukka,

A 1.0 release sounds like a great idea.

On my list of things I'd like to straighten out by then:

1. There are still a number of HTML parser issues that I'd like to resolve first.

Many of these are assigned to me :) Hoping to have some free time after mid-June.

2. I've got a vague concern about the current state of running Tika with subsets of all parsers.

This still seems fragile.

3. Language detection is still pretty lame.

Same as with HTML parsing, many of these are assigned to me.

Hoping I've got time to take a run at using LLR to improve accuracy and performance.
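
For readers unfamiliar with the term: assuming LLR here means the log-likelihood ratio test (Dunning's G-squared statistic), the sketch below computes it for a 2x2 table of counts. How the counts map onto language detection is an assumption made for illustration: k11 is how often an n-gram occurs in one language's training text, k12 is the count of all other n-grams in that text, and k21 and k22 are the same two counts for a reference corpus. The class name is hypothetical and this is not Tika's language detection code.

```java
// Hypothetical illustration of the log-likelihood ratio (G^2) for a 2x2 table.
final class LogLikelihoodRatio {
    private LogLikelihoodRatio() {}

    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: N ln N - sum(k ln k).
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // Returns the LLR score; higher means the n-gram's frequency differs more
    // between the two corpora than chance would suggest.
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }
}
```

A score near zero means the n-gram is about equally common in both corpora, so it would carry little signal for telling the language apart; a large score marks an n-gram whose frequency is distinctive.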

-- Ken

On May 20, 2011, at 9:01am, Jukka Zitting wrote:

> Hi,
> 
> It's a few months since 0.9 and our Tika in Action book is soon ready
> for print, so I think it's good time to start planning for the 1.0
> release.
> 
> There are a few odds and ends that I'd still like to sort out in the
> trunk, but overall I think we're in a pretty much ready for the switch
> from 0.x to 1.x.
> 
> One major issue to be decided is whether we want to follow up with the
> earlier intention of dropping deprecated functionality (like the
> three-argument parse() method) before the 1.0 release. I think we
> should do that and also make some other backwards-incompatible
> cleanups while we're at it. That way we'll have less old baggage to
> carry as we evolve through the 1.x release cycle.
> 
> Another thing to think about is whether we want to do a formal Apache
> press release about Tika reaching 1.0 status.
> 
> BR,
> 
> Jukka Zitting

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g