Posted to dev@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/06/01 13:03:07 UTC
RE: [DISCUSS] 1.9 Tika release?
Will run rc against govdocs1 and commoncrawl to see what I find. Results by tomorrow.
Thank you, Chris!
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Sunday, May 31, 2015 2:52 PM
To: dev@tika.apache.org
Subject: [DISCUSS] 1.9 Tika release?
Hey Folks,
There’s been lots of new Tika goodness coming with the GeoTopic stuff,
the ExternParser fixes, and also with FFMPEG and EXIF improvements.
I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
0.9 RC right now too and am in the mood).
Please feel free to VOTE with your feet and I’m happy to cut more than
1 RC if I missed anything of course.
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Re: [DISCUSS] 1.9 Tika release?
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Ack hopefully RC #2 can spin then. Keep me posted thanks TA
-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Wednesday, June 3, 2015 at 5:46 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: [DISCUSS] 1.9 Tika release?
>Fixed eval code, thanks to Nick.
>
>Now running against doc/x list fixes to confirm success.
>
>Will rerun tomorrow on full set, with results by noon ETD.
>
>-----Original Message-----
>From: Allison, Timothy B. [mailto:tallison@mitre.org]
>Sent: Wednesday, June 03, 2015 7:28 AM
>To: dev@tika.apache.org
>Subject: RE: [DISCUSS] 1.9 Tika release?
>
>Y. Will do.
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Tuesday, June 02, 2015 10:31 PM
>To: dev@tika.apache.org
>Subject: Re: [DISCUSS] 1.9 Tika release?
>
>Thanks Tim. Looks like you and Nick have been fixing some other
>stuff too.
>
>Let me know when I should spin RC #2. Ready and willing! :-)
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann@nasa.gov
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>-----Original Message-----
>From: <Allison>, "Timothy B." <ta...@mitre.org>
>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: Monday, June 1, 2015 at 6:53 PM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: RE: [DISCUSS] 1.9 Tika release?
>
>>Thank you. Sorry about this. I had hoped to run against govdocs1
>>soonish to look for these before an rc was cut.
>>
>>There were three critical points in the code that accounted for all of
>>the exceptions:
>>
>>1) one where I had left in a debugging RuntimeException throw instead of
>>a swallow/return ""
>>2) one where poi fairly commonly throws a RuntimeException when something
>>goes wrong in paragraph.getList()
>>3) failure of imagination that a value might not be found in XWPF's
>>numbering, which led to NPE
>>
>>These are all fixed locally. Will rerun over night with results
>>tomorrow.
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>Sent: Monday, June 01, 2015 9:06 PM
>>To: dev@tika.apache.org
>>Subject: Re: [DISCUSS] 1.9 Tika release?
>>
>>ACK nw, will be happy to spin RC #2 when we’re ready :)
>>
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: chris.a.mattmann@nasa.gov
>>WWW: http://sunset.usc.edu/~mattmann/
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
>>-----Original Message-----
>>From: <Allison>, "Timothy B." <ta...@mitre.org>
>>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Date: Monday, June 1, 2015 at 6:04 PM
>>To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Subject: RE: [DISCUSS] 1.9 Tika release?
>>
>>>-1
>>>
>>>Do not release because there are roughly 200 new exceptions in govdocs1
>>>caused by the list processing code I just added to doc and docx files.
>>>
>>>Argh...
>>>
>>>Will fix asap.
>>>
>>>-----Original Message-----
>>>From: Allison, Timothy B. [mailto:tallison@mitre.org]
>>>Sent: Monday, June 01, 2015 7:03 AM
>>>To: dev@tika.apache.org
>>>Subject: RE: [DISCUSS] 1.9 Tika release?
>>>
>>>Will run rc against govdocs1 and commoncrawl to see what I find.
>>>Results
>>>by tomorrow.
>>>
>>>Thank you, Chris!
>>>
>>>-----Original Message-----
>>>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>>Sent: Sunday, May 31, 2015 2:52 PM
>>>To: dev@tika.apache.org
>>>Subject: [DISCUSS] 1.9 Tika release?
>>>
>>>Hey Folks,
>>>
>>>There’s been lots of new Tika goodness coming with the GeoTopic stuff,
>>>the ExternParser fixes, and also with FFMPEG and EXIF improvements.
>>>I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
>>>0.9 RC right now too and am in the mood).
>>>
>>>Please feel free to VOTE with your feet and I’m happy to cut more than
>>>1 RC if I missed anything of course.
>>>
>>>Cheers,
>>>Chris
>>>
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>Chris Mattmann, Ph.D.
>>>Chief Architect
>>>Instrument Software and Science Data Systems Section (398)
>>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>Office: 168-519, Mailstop: 168-527
>>>Email: chris.a.mattmann@nasa.gov
>>>WWW: http://sunset.usc.edu/~mattmann/
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>Adjunct Associate Professor, Computer Science Department
>>>University of Southern California, Los Angeles, CA 90089 USA
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>
>>
>
RE: [DISCUSS] 1.9 Tika release?
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Nick!
-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org]
Sent: Friday, June 05, 2015 6:15 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] 1.9 Tika release?
>> text/dif+xml->application/dif+xml
>Expected and fine
Agreed on the mime type, but is there a reason we're losing text? Or was that incorrect duplication earlier?
>> text/plain; charset=windows-1252->application/pdf
>> text/plain; charset=windows-1255->application/pdf
>These are (hopefully!) PDFs with junk on the front, so good
Agreed.
>> text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1
>Not sure if this is correct or not, maybe double check these by hand?
>> It looks like the change of the magic range for pdfs was a good move
>> (for govdocs1, at least). However, we’re now losing content from those
>> files that are now identified as bibtex.
>The *tex formats have an application/ mimetype but no parser, so now we
>correctly detect them they stopped going through the text parser as a
>fallback. I've hopefully fixed that in r1683702, by marking their
>mimetypes as descending from text, so the text parser can claim them if
>nothing else can
Thank you!
>> For govdocs1, we’re now at 6,653 “caught” exceptions for container
>> documents (out of 979,143=0.7%), but we have roughly 33k exceptions for
>> embedded documents out of 1,364,552=2.4%). As before, I need to confirm
>> that something didn’t go wrong with my code; it could also be the case
>> that the files are being mis-id’d as Excel… For now, though, it looks
>> like that high # is driven by embedded Excel files.
>Maybe best to raise one new jira issue per main area, and upload a single
>sample file from govdocs that shows the problem, and we can tackle them in
>turn in 1.10/1.11?
Y. Ran out of steam last night.
Re: [DISCUSS] 1.9 Tika release?
Posted by Konstantin Gribov <gr...@gmail.com>.
I think, files that moved from text/plain to pdf should also be checked by
hand since we have quite new low-priority magic for pdfs ("%PDF-1." and
"%PDF-2." in first 0.5kB of stream).
--
Best regards,
Konstantin Gribov
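The low-priority PDF magic mentioned above can be sketched roughly as follows. This is only an illustration in Python, not Tika's detector: the function name is hypothetical, and "0.5kB" is assumed to mean 512 bytes. It also shows why a hand check is worthwhile, since a plain-text file that merely mentions a PDF header near its start would match too.

```python
# Hypothetical sketch of a windowed magic match; not Tika's implementation.
PDF_MAGICS = (b"%PDF-1.", b"%PDF-2.")
WINDOW = 512  # assumption: "0.5kB" read as 512 bytes


def looks_like_pdf(data: bytes, window: int = WINDOW) -> bool:
    """Return True if a PDF magic string starts at an offset below `window`."""
    for magic in PDF_MAGICS:
        # Limiting the search end to window + len(magic) - 1 means any
        # match found must begin at an offset strictly below `window`.
        if data.find(magic, 0, window + len(magic) - 1) != -1:
            return True
    return False


# A PDF with junk on the front is still detected...
print(looks_like_pdf(b"junk bytes\r\n" + b"%PDF-1.4\n"))
# ...but so is ordinary text that merely quotes the header.
print(looks_like_pdf(b"Most files begin with %PDF-1.4 on the first line."))
```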
пт, 5 июня 2015 г. в 13:16, Nick Burch <ap...@gagravarr.org>:
> On Fri, 5 Jun 2015, Allison, Timothy B. wrote:
> > Changes in mime detection for "main" files:
> >
> > text/plain; charset=ISO-8859-1->application/x-bibtex-text-file
> > text/plain; charset=windows-1252->application/x-bibtex-text-file
> > text/html; charset=ISO-8859-1->application/x-bibtex-text-file
>
> I think these are expected and good
>
> > text/dif+xml->application/dif+xml
>
> Expected and fine
>
> > text/plain; charset=windows-1252->application/pdf
> > text/plain; charset=windows-1255->application/pdf
>
> These are (hopefully!) PDFs with junk on the front, so good
>
> > text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1
>
> Not sure if this is correct or not, maybe double check these by hand?
>
>
> > It looks like the change of the magic range for pdfs was a good move
> > (for govdocs1, at least). However, we’re now losing content from those
> > files that are now identified as bibtex.
>
> The *tex formats have an application/ mimetype but no parser, so now we
> correctly detect them they stopped going through the text parser as a
> fallback. I've hopefully fixed that in r1683702, by marking their
> mimetypes as descending from text, so the text parser can claim them if
> nothing else can
>
>
> > For govdocs1, we’re now at 6,653 “caught” exceptions for container
> > documents (out of 979,143=0.7%), but we have roughly 33k exceptions for
> > embedded documents out of 1,364,552=2.4%). As before, I need to confirm
> > that something didn’t go wrong with my code; it could also be the case
> > that the files are being mis-id’d as Excel… For now, though, it looks
> > like that high # is driven by embedded Excel files.
>
> Maybe best to raise one new jira issue per main area, and upload a single
> sample file from govdocs that shows the problem, and we can tackle them in
> turn in 1.10/1.11?
>
> Nick
RE: [DISCUSS] 1.9 Tika release?
Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 5 Jun 2015, Allison, Timothy B. wrote:
> Changes in mime detection for "main" files:
>
> text/plain; charset=ISO-8859-1->application/x-bibtex-text-file
> text/plain; charset=windows-1252->application/x-bibtex-text-file
> text/html; charset=ISO-8859-1->application/x-bibtex-text-file
I think these are expected and good
> text/dif+xml->application/dif+xml
Expected and fine
> text/plain; charset=windows-1252->application/pdf
> text/plain; charset=windows-1255->application/pdf
These are (hopefully!) PDFs with junk on the front, so good
> text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1
Not sure if this is correct or not, maybe double check these by hand?
> It looks like the change of the magic range for pdfs was a good move
> (for govdocs1, at least). However, we’re now losing content from those
> files that are now identified as bibtex.
The *tex formats have an application/ mimetype but no parser, so now we
correctly detect them they stopped going through the text parser as a
fallback. I've hopefully fixed that in r1683702, by marking their
mimetypes as descending from text, so the text parser can claim them if
nothing else can
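To illustrate the fallback described above, here is a minimal Python sketch. All names in it are hypothetical and it is not Tika's registry or parser API; it only shows the idea that a type with no parser of its own can be claimed by a parser registered for one of its declared supertypes.

```python
# Hypothetical type hierarchy and parser table, for illustration only.
SUPERTYPES = {
    "application/x-bibtex-text-file": "text/plain",
    "text/plain": "application/octet-stream",
}
PARSERS = {"text/plain": "text parser"}


def find_parser(mime, supertypes=SUPERTYPES, parsers=PARSERS):
    """Walk up the supertype chain until some parser claims the type."""
    seen = set()
    while mime is not None and mime not in seen:
        if mime in parsers:
            return parsers[mime]
        seen.add(mime)  # guard against accidental cycles in the table
        mime = supertypes.get(mime)
    return None


# With the "descends from text" link, the text parser picks the file up.
print(find_parser("application/x-bibtex-text-file"))
# Without that link, nothing claims it and the content is lost.
print(find_parser("application/x-bibtex-text-file", supertypes={}))
```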
> For govdocs1, we’re now at 6,653 “caught” exceptions for container
> documents (out of 979,143=0.7%), but we have roughly 33k exceptions for
> embedded documents out of 1,364,552=2.4%). As before, I need to confirm
> that something didn’t go wrong with my code; it could also be the case
> that the files are being mis-id’d as Excel… For now, though, it looks
> like that high # is driven by embedded Excel files.
Maybe best to raise one new jira issue per main area, and upload a single
sample file from govdocs that shows the problem, and we can tackle them in
turn in 1.10/1.11?
Nick
RE: [DISCUSS] 1.9 Tika release?
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Chris and all,
I think we're good to go for rc2. Shouldn't have been so optimistic on time estimate. I'm sorry it took so long.
I just finished the preliminary analyses.
Many thanks to Nick for figuring out that the new eval code was hotter off the press than it should have been. :)
Details on runs against govdocs1
10 "fixed" exceptions
4 "new" exceptions
More attachments in a handful of .doc files.
Metadata values roughly equivalent.
Changes in mime detection for "main" files:
text/plain; charset=ISO-8859-1->application/x-bibtex-text-file    11
text/dif+xml->application/dif+xml                                 9
text/plain; charset=windows-1252->application/pdf                 4
text/plain; charset=windows-1252->application/x-bibtex-text-file  3
text/plain; charset=windows-1255->application/pdf                 3
text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1     2
text/html; charset=ISO-8859-1->application/x-bibtex-text-file     1
Changes in mime detection for embedded files:
CONCAT(DETECTED_CONTENT_TYPE_A, '->', DETECTED_CONTENT_TYPE_B)  CNT
application/x-msmetafile->application/pdf                       42
image/x-pict->application/pdf                                   28
application/x-msmetafile->application/zlib                      5
application/x-emf->application/zlib                             2
application/octet-stream->application/zlib                      1
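For anyone wanting to reproduce this kind of tally outside the database, a rough Python sketch follows. It assumes you already have, per file, the type detected by run A and by run B; the function name and the sample pairs are made up for illustration and are not taken from the eval code.

```python
from collections import Counter


def tally_changes(pairs):
    """Count 'A->B' transitions for files whose detected type changed."""
    return Counter(f"{a}->{b}" for a, b in pairs if a != b)


# Toy input: (type in run A, type in run B) for four hypothetical files.
sample = [
    ("application/x-msmetafile", "application/pdf"),
    ("application/x-msmetafile", "application/pdf"),
    ("image/x-pict", "application/pdf"),
    ("image/png", "image/png"),  # unchanged, so not counted
]

for change, cnt in tally_changes(sample).most_common():
    print(change, cnt)
```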
Nick has already opened an issue to look into the extra wrapping around pdfs.
It looks like the change of the magic range for pdfs was a good move (for govdocs1, at least). However, we’re now losing content from those files that are now identified as bibtex.
ONE BIG AREA FOR FURTHER ANALYSIS
In the earlier eval code, we had one row per input/parent/container document. I’ve modified it so that we now have one row per document, whether it is an attachment/embedded document or a parent/container document. I’m also now storing the stacktraces from the embedded documents.
For govdocs1, we’re now at 6,653 “caught” exceptions for container documents (out of 979,143 = 0.7%), but we have roughly 33k exceptions for embedded documents (out of 1,364,552 = 2.4%). As before, I need to confirm that something didn’t go wrong with my code; it could also be the case that the files are being mis-id’d as Excel… For now, though, it looks like that high # is driven by embedded Excel files.
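As a quick check of the percentages above, the arithmetic can be sketched in Python. The helper name is hypothetical, and the embedded count uses 33,000 only because the message says "roughly 33k".

```python
def exception_rate(exceptions: int, total: int) -> float:
    """Exceptions as a percentage of documents processed."""
    return 100.0 * exceptions / total


container = exception_rate(6_653, 979_143)
embedded = exception_rate(33_000, 1_364_552)  # "roughly 33k", so approximate

print(f"container: {container:.1f}%")
print(f"embedded:  {embedded:.1f}%")
```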
Compare exceptions for container files:
DETECTED_CONTENT_TYPE_B                                                     CNT
application/xml                                                             1781
application/vnd.ms-powerpoint                                               968
application/msword                                                          681
application/vnd.ms-excel                                                    272
application/pdf                                                             107
application/vnd.google-earth.kml+xml                                        19
image/jpeg                                                                  5
application/x-tika-msoffice                                                 5
text/plain; charset=ISO-8859-1                                              5
application/vnd.ms-excel.sheet.3                                            4
application/vnd.openxmlformats-officedocument.presentationml.presentation  3
application/vnd.ms-excel.sheet.4                                            3
application/xhtml+xml; charset=UTF-8                                        2
application/rdf+xml                                                         2
image/vnd.dwg                                                               2
application/rtf                                                             2
application/rss+xml                                                         1
application/dita+xml; format=topic                                          1
application/vnd.openxmlformats-officedocument.wordprocessingml.document     1
text/html; charset=windows-1252                                             1
text/html; charset=ISO-8859-1                                               1
With exceptions for embedded files:
DETECTED_CONTENT_TYPE_B               CNT
application/vnd.ms-excel              26008
image/png                             2184
application/vnd.visio                 2019
image/x-ms-bmp                        1899
image/jpeg                            311
image/vnd.dwg                         51
application/x-font-ttf                40
image/vnd.adobe.photoshop             21
application/x-tika-msoffice           14
application/vnd.google-earth.kml+xml  13
application/vnd.ms-powerpoint         12
null                                  9
application/xml                       3
application/msword                    2
application/pdf                       1
Top 10 file types overall for parent/container files
DETECTED_CONTENT_TYPE_B           CNT
application/pdf                   230860
image/jpeg                        109282
text/html; charset=ISO-8859-1     86344
application/msword                76984
text/plain; charset=ISO-8859-1    64186
application/vnd.ms-excel          58930
application/vnd.ms-powerpoint     51169
text/html; charset=windows-1252   50292
text/plain; charset=windows-1252  42988
text/html; charset=UTF-8          41003
Top 10 file types for embedded files
DETECTED_CONTENT_TYPE_B                                    CNT
image/png                                                  470639
image/jpeg                                                 261050
application/x-msmetafile                                   231803
application/x-emf                                          103058
application/x-tika-msoffice                                82964
application/vnd.ms-excel                                   54296
image/x-pict                                               29764
application/msword                                         22699
image/x-ms-bmp                                             21525
application/x-tika-msoffice-embedded; format=ole10_native  19751
For anyone who wants to kick the tires on the apparent embedded Excel issue, here are some files, sorted in ascending order of JSON length:
Container file  Embedded file   Detected type             JSON length
428/428996.ppt  428996.ppt/11   application/vnd.ms-excel  6050
920/920182.ppt  920182.ppt/2    application/vnd.ms-excel  6093
852/852522.ppt  852522.ppt/758  application/vnd.ms-excel  6110
851/851799.ppt  851799.ppt/789  application/vnd.ms-excel  6110
854/854876.ppt  854876.ppt/696  application/vnd.ms-excel  6112
703/703075.ppt  703075.ppt/830  application/vnd.ms-excel  6112
849/849126.ppt  849126.ppt/880  application/vnd.ms-excel  6114
849/849621.ppt  849621.ppt/861  application/vnd.ms-excel  6116
847/847762.ppt  847762.ppt/992  application/vnd.ms-excel  6119
Looks like the majority are embedded in ppt, but there are several embedded in xls as well.
Cheers,
Tim
-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Wednesday, June 03, 2015 8:46 PM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] 1.9 Tika release?
Fixed eval code, thanks to Nick.
Now running against doc/x list fixes to confirm success.
Will rerun tomorrow on full set, with results by noon ETD.
-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Wednesday, June 03, 2015 7:28 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] 1.9 Tika release?
Y. Will do.
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Tuesday, June 02, 2015 10:31 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] 1.9 Tika release?
Thanks Tim. Looks like you and Nick have been fixing some other
stuff too.
Let me know when I should spin RC #2. Ready and willing! :-)
Cheers,
Chris
-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, June 1, 2015 at 6:53 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: [DISCUSS] 1.9 Tika release?
>Thank you. Sorry about this. I had hoped to run against govdocs1
>soonish to look for these before an rc was cut.
>
>There were three critical points in the code that accounted for all of
>the exceptions:
>
>1) one where I had left in a debugging RuntimeException throw instead of
>a swallow/return ""
>2) one where poi fairly commonly throws a RuntimeException when something
>goes wrong in paragraph.getList()
>3) failure of imagination that a value might not be found in XWPF's
>numbering, which led to NPE
>
>These are all fixed locally. Will rerun over night with results tomorrow.
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Monday, June 01, 2015 9:06 PM
>To: dev@tika.apache.org
>Subject: Re: [DISCUSS] 1.9 Tika release?
>
>ACK nw, will be happy to spin RC #2 when we’re ready :)
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann@nasa.gov
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>-----Original Message-----
>From: <Allison>, "Timothy B." <ta...@mitre.org>
>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: Monday, June 1, 2015 at 6:04 PM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: RE: [DISCUSS] 1.9 Tika release?
>
>>-1
>>
>>Do not release because there are roughly 200 new exceptions in govdocs1
>>caused by the list processing code I just added to doc and docx files.
>>
>>Argh...
>>
>>Will fix asap.
>>
>>-----Original Message-----
>>From: Allison, Timothy B. [mailto:tallison@mitre.org]
>>Sent: Monday, June 01, 2015 7:03 AM
>>To: dev@tika.apache.org
>>Subject: RE: [DISCUSS] 1.9 Tika release?
>>
>>Will run rc against govdocs1 and commoncrawl to see what I find. Results
>>by tomorrow.
>>
>>Thank you, Chris!
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>Sent: Sunday, May 31, 2015 2:52 PM
>>To: dev@tika.apache.org
>>Subject: [DISCUSS] 1.9 Tika release?
>>
>>Hey Folks,
>>
>>There’s been lots of new Tika goodness coming with the GeoTopic stuff,
>>the ExternParser fixes, and also with FFMPEG and EXIF improvements.
>>I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
>>0.9 RC right now too and am in the mood).
>>
>>Please feel free to VOTE with your feet and I’m happy to cut more than
>>1 RC if I missed anything of course.
>>
>>Cheers,
>>Chris
>>
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: chris.a.mattmann@nasa.gov
>>WWW: http://sunset.usc.edu/~mattmann/
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>
RE: [DISCUSS] 1.9 Tika release?
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Fixed eval code, thanks to Nick.
Now running against doc/x list fixes to confirm success.
Will rerun tomorrow on full set, with results by noon EDT.
-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Wednesday, June 03, 2015 7:28 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] 1.9 Tika release?
Y. Will do.
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Tuesday, June 02, 2015 10:31 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] 1.9 Tika release?
Thanks Tim. Looks like you and Nick have been fixing some other
stuff too.
Let me know when I should spin RC #2. Ready and willing! :-)
Cheers,
Chris
-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, June 1, 2015 at 6:53 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: [DISCUSS] 1.9 Tika release?
>Thank you. Sorry about this. I had hoped to run against govdocs1
>soonish to look for these before an rc was cut.
>
>There were three critical points in the code that accounted for all of
>the exceptions:
>
>1) one where I had left in a debugging RuntimeException throw instead of
>a swallow/return ""
>2) one where poi fairly commonly throws a RuntimeException when something
>goes wrong in paragraph.getList()
>3) failure of imagination that a value might not be found in XWPF's
>numbering, which led to NPE
>
>These are all fixed locally. Will rerun over night with results tomorrow.
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Monday, June 01, 2015 9:06 PM
>To: dev@tika.apache.org
>Subject: Re: [DISCUSS] 1.9 Tika release?
>
>ACK nw, will be happy to spin RC #2 when we’re ready :)
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann@nasa.gov
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>-----Original Message-----
>From: <Allison>, "Timothy B." <ta...@mitre.org>
>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: Monday, June 1, 2015 at 6:04 PM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: RE: [DISCUSS] 1.9 Tika release?
>
>>-1
>>
>>Do not release because there are roughly 200 new exceptions in govdocs1
>>caused by the list processing code I just added to doc and docx files.
>>
>>Argh...
>>
>>Will fix asap.
>>
>>-----Original Message-----
>>From: Allison, Timothy B. [mailto:tallison@mitre.org]
>>Sent: Monday, June 01, 2015 7:03 AM
>>To: dev@tika.apache.org
>>Subject: RE: [DISCUSS] 1.9 Tika release?
>>
>>Will run rc against govdocs1 and commoncrawl to see what I find. Results
>>by tomorrow.
>>
>>Thank you, Chris!
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>Sent: Sunday, May 31, 2015 2:52 PM
>>To: dev@tika.apache.org
>>Subject: [DISCUSS] 1.9 Tika release?
>>
>>Hey Folks,
>>
>>There’s been lots of new Tika goodness coming with the GeoTopic stuff,
>>the ExternParser fixes, and also with FFMPEG and EXIF improvements.
>>I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
>>0.9 RC right now too and am in the mood).
>>
>>Please feel free to VOTE with your feet and I’m happy to cut more than
>>1 RC if I missed anything of course.
>>
>>Cheers,
>>Chris
>>
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: chris.a.mattmann@nasa.gov
>>WWW: http://sunset.usc.edu/~mattmann/
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>
RE: [DISCUSS] 1.9 Tika release?
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Y. Will do.
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Tuesday, June 02, 2015 10:31 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] 1.9 Tika release?
Thanks Tim. Looks like you and Nick have been fixing some other
stuff too.
Let me know when I should spin RC #2. Ready and willing! :-)
Cheers,
Chris
-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, June 1, 2015 at 6:53 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: [DISCUSS] 1.9 Tika release?
>Thank you. Sorry about this. I had hoped to run against govdocs1
>soonish to look for these before an rc was cut.
>
>There were three critical points in the code that accounted for all of
>the exceptions:
>
>1) one where I had left in a debugging RuntimeException throw instead of
>a swallow/return ""
>2) one where poi fairly commonly throws a RuntimeException when something
>goes wrong in paragraph.getList()
>3) failure of imagination that a value might not be found in XWPF's
>numbering, which led to NPE
>
>These are all fixed locally. Will rerun over night with results tomorrow.
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Monday, June 01, 2015 9:06 PM
>To: dev@tika.apache.org
>Subject: Re: [DISCUSS] 1.9 Tika release?
>
>ACK nw, will be happy to spin RC #2 when we’re ready :)
>
>-----Original Message-----
>From: <Allison>, "Timothy B." <ta...@mitre.org>
>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: Monday, June 1, 2015 at 6:04 PM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: RE: [DISCUSS] 1.9 Tika release?
>
>>-1
>>
>>Do not release because there are roughly 200 new exceptions in govdocs1
>>caused by the list processing code I just added to doc and docx files.
>>
>>Argh...
>>
>>Will fix asap.
>>
>>-----Original Message-----
>>From: Allison, Timothy B. [mailto:tallison@mitre.org]
>>Sent: Monday, June 01, 2015 7:03 AM
>>To: dev@tika.apache.org
>>Subject: RE: [DISCUSS] 1.9 Tika release?
>>
>>Will run rc against govdocs1 and commoncrawl to see what I find. Results
>>by tomorrow.
>>
>>Thank you, Chris!
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>Sent: Sunday, May 31, 2015 2:52 PM
>>To: dev@tika.apache.org
>>Subject: [DISCUSS] 1.9 Tika release?
>>
>>Hey Folks,
>>
>>There’s been lots of new Tika goodness coming with the GeoTopic stuff,
>>the ExternParser fixes, and also with FFMPEG and EXIF improvements.
>>I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
>>0.9 RC right now too and am in the mood).
>>
>>Please feel free to VOTE with your feet and I’m happy to cut more than
>>1 RC if I missed anything of course.
>>
>>Cheers,
>>Chris
>>
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: chris.a.mattmann@nasa.gov
>>WWW: http://sunset.usc.edu/~mattmann/
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>
RE: [DISCUSS] 1.9 Tika release?
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you. Sorry about this. I had hoped to run against govdocs1 soonish to look for these before an rc was cut.
There were three critical points in the code that accounted for all of the exceptions:
1) one where I had left in a debugging RuntimeException throw instead of a swallow/return ""
2) one where poi fairly commonly throws a RuntimeException when something goes wrong in paragraph.getList()
3) failure of imagination that a value might not be found in XWPF's numbering, which led to NPE
These are all fixed locally. Will rerun overnight with results tomorrow.
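The three fixes listed above amount to defensive handling around the list lookup. What follows is a minimal sketch in plain Java, not the actual Tika patch: `ParagraphLike`, `NUMBERING`, and `listPrefix` are hypothetical stand-ins invented for illustration and do not correspond to real Tika or POI APIs. It only shows the general pattern of returning "" instead of letting a RuntimeException or NPE fail the whole document.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: none of these names are Tika or POI classes.
class ListPrefixSketch {

    /** Stand-in for a paragraph whose list lookup may throw or return null. */
    interface ParagraphLike {
        Integer getListId();
    }

    /** Stand-in for a numbering table in which a requested id may be absent. */
    static final Map<Integer, String> NUMBERING = new HashMap<>();
    static {
        NUMBERING.put(1, "1.");
        NUMBERING.put(2, "a)");
    }

    /** Returns the list label for a paragraph, or "" if anything goes wrong. */
    static String listPrefix(ParagraphLike paragraph) {
        Integer id;
        try {
            id = paragraph.getListId();
        } catch (RuntimeException e) {
            // Points 1 and 2: swallow the exception and fall back to an
            // empty prefix rather than rethrowing.
            return "";
        }
        if (id == null) {
            return "";
        }
        // Point 3: the id may not be present in the numbering table,
        // so check for null instead of dereferencing the result blindly.
        String label = NUMBERING.get(id);
        return label == null ? "" : label + " ";
    }

    public static void main(String[] args) {
        System.out.println("[" + listPrefix(() -> 1) + "]");
        System.out.println("[" + listPrefix(() -> { throw new RuntimeException("boom"); }) + "]");
        System.out.println("[" + listPrefix(() -> 99) + "]");
        System.out.println("[" + listPrefix(() -> null) + "]");
    }
}
```

The trade-off in this sketch is that a bad list silently loses its numbering while the rest of the text is still extracted, which is the swallow/return "" behavior described in point 1.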
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Monday, June 01, 2015 9:06 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] 1.9 Tika release?
ACK nw, will be happy to spin RC #2 when we’re ready :)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, June 1, 2015 at 6:04 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: [DISCUSS] 1.9 Tika release?
>-1
>
>Do not release because there are roughly 200 new exceptions in govdocs1
>caused by the list processing code I just added to doc and docx files.
>
>Argh...
>
>Will fix asap.
>
>-----Original Message-----
>From: Allison, Timothy B. [mailto:tallison@mitre.org]
>Sent: Monday, June 01, 2015 7:03 AM
>To: dev@tika.apache.org
>Subject: RE: [DISCUSS] 1.9 Tika release?
>
>Will run rc against govdocs1 and commoncrawl to see what I find. Results
>by tomorrow.
>
>Thank you, Chris!
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Sunday, May 31, 2015 2:52 PM
>To: dev@tika.apache.org
>Subject: [DISCUSS] 1.9 Tika release?
>
>Hey Folks,
>
>There’s been lots of new Tika goodness coming with the GeoTopic stuff,
>the ExternParser fixes, and also with FFMPEG and EXIF improvements.
>I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
>0.9 RC right now too and am in the mood).
>
>Please feel free to VOTE with your feet and I’m happy to cut more than
>1 RC if I missed anything of course.
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann@nasa.gov
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
Re: [DISCUSS] 1.9 Tika release?
Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
ACK nw, will be happy to spin RC #2 when we’re ready :)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, June 1, 2015 at 6:04 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: [DISCUSS] 1.9 Tika release?
>-1
>
>Do not release because there are roughly 200 new exceptions in govdocs1
>caused by the list processing code I just added to doc and docx files.
>
>Argh...
>
>Will fix asap.
>
>-----Original Message-----
>From: Allison, Timothy B. [mailto:tallison@mitre.org]
>Sent: Monday, June 01, 2015 7:03 AM
>To: dev@tika.apache.org
>Subject: RE: [DISCUSS] 1.9 Tika release?
>
>Will run rc against govdocs1 and commoncrawl to see what I find. Results
>by tomorrow.
>
>Thank you, Chris!
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Sunday, May 31, 2015 2:52 PM
>To: dev@tika.apache.org
>Subject: [DISCUSS] 1.9 Tika release?
>
>Hey Folks,
>
>There’s been lots of new Tika goodness coming with the GeoTopic stuff,
>the ExternParser fixes, and also with FFMPEG and EXIF improvements.
>I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
>0.9 RC right now too and am in the mood).
>
>Please feel free to VOTE with your feet and I’m happy to cut more than
>1 RC if I missed anything of course.
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann@nasa.gov
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
RE: [DISCUSS] 1.9 Tika release?
Posted by "Allison, Timothy B." <ta...@mitre.org>.
-1
Do not release because there are roughly 200 new exceptions in govdocs1 caused by the list processing code I just added to doc and docx files.
Argh...
Will fix asap.
-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Monday, June 01, 2015 7:03 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] 1.9 Tika release?
Will run rc against govdocs1 and commoncrawl to see what I find. Results by tomorrow.
Thank you, Chris!
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Sunday, May 31, 2015 2:52 PM
To: dev@tika.apache.org
Subject: [DISCUSS] 1.9 Tika release?
Hey Folks,
There’s been lots of new Tika goodness coming with the GeoTopic stuff,
the ExternParser fixes, and also with FFMPEG and EXIF improvements.
I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
0.9 RC right now too and am in the mood).
Please feel free to VOTE with your feet and I’m happy to cut more than
1 RC if I missed anything of course.
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++