Posted to dev@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/06/01 13:03:07 UTC

RE: [DISCUSS] 1.9 Tika release?

Will run rc against govdocs1 and commoncrawl to see what I find.  Results by tomorrow. 

Thank you, Chris!

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Sunday, May 31, 2015 2:52 PM
To: dev@tika.apache.org
Subject: [DISCUSS] 1.9 Tika release?

Hey Folks,

There’s been lots of new Tika goodness coming with the GeoTopic stuff,
the ExternParser fixes, and also with FFMPEG and EXIF improvements.
I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
0.9 RC right now too and am in the mood).

Please feel free to VOTE with your feet and I’m happy to cut more than
1 RC if I missed anything of course.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



Re: [DISCUSS] 1.9 Tika release?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Ack hopefully RC #2 can spin then. Keep me posted thanks TA


-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Wednesday, June 3, 2015 at 5:46 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: [DISCUSS] 1.9 Tika release?

>Fixed eval code, thanks to Nick.
>
>Now running against doc/x list fixes to confirm success.
>
>Will rerun tomorrow on full set, with results by noon EDT.
>
>-----Original Message-----
>From: Allison, Timothy B. [mailto:tallison@mitre.org]
>Sent: Wednesday, June 03, 2015 7:28 AM
>To: dev@tika.apache.org
>Subject: RE: [DISCUSS] 1.9 Tika release?
>
>Y.  Will do.
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Tuesday, June 02, 2015 10:31 PM
>To: dev@tika.apache.org
>Subject: Re: [DISCUSS] 1.9 Tika release?
>
>Thanks Tim. Looks like you and Nick have been fixing some other
>stuff too.
>
>Let me know when I should spin RC #2. Ready and willing! :-)
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: <Allison>, "Timothy B." <ta...@mitre.org>
>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: Monday, June 1, 2015 at 6:53 PM
>To: "dev@tika.apache.org" <de...@tika.apache.org>
>Subject: RE: [DISCUSS] 1.9 Tika release?
>
>>Thank you.  Sorry about this.  I had hoped to run against govdocs1
>>soonish to look for these before an rc was cut.
>>
>>There were three critical points in the code that accounted for all of
>>the exceptions:
>>
>>1) one where I had left in a debugging RuntimeException throw instead of
>>a swallow/return ""
>>2) one where poi fairly commonly throws a RuntimeException when something
>>goes wrong in paragraph.getList()
>>3) failure of imagination that a value might not be found in XWPF's
>>numbering, which led to NPE
>>
>>These are all fixed locally.  Will rerun over night with results
>>tomorrow.
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>Sent: Monday, June 01, 2015 9:06 PM
>>To: dev@tika.apache.org
>>Subject: Re: [DISCUSS] 1.9 Tika release?
>>
>>ACK nw, will be happy to spin RC #2 when we’re ready :)
>>
>>-----Original Message-----
>>From: <Allison>, "Timothy B." <ta...@mitre.org>
>>Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Date: Monday, June 1, 2015 at 6:04 PM
>>To: "dev@tika.apache.org" <de...@tika.apache.org>
>>Subject: RE: [DISCUSS] 1.9 Tika release?
>>
>>>-1
>>>
>>>Do not release because there are roughly 200 new exceptions in govdocs1
>>>caused by the list processing code I just added to doc and docx files.
>>>
>>>Argh...
>>>
>>>Will fix asap.
>>>
>
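
The three fixes described in the quoted 1 June message (no leftover debugging throw, tolerating a RuntimeException from the list lookup, and handling a value that is not found in the numbering) share a common defensive shape. Tika and POI are Java; the sketch below is in Python purely to illustrate that shape, and every name in it (`list_label`, `get_list`, `numbering`) is hypothetical rather than taken from the Tika or POI source.

```python
from typing import Callable, Mapping, Optional


def list_label(get_list: Callable[[], Optional[int]],
               numbering: Mapping[int, str]) -> str:
    """Return a list label for a paragraph, or "" if it cannot be determined.

    get_list  -- hypothetical stand-in for a library call that may raise
    numbering -- hypothetical stand-in for a numbering-id -> label table
    """
    try:
        num_id = get_list()
    except RuntimeError:
        # Fixes 1 and 2: swallow the failure and return "" instead of
        # letting it abort extraction of the whole document.
        return ""
    if num_id is None:
        return ""
    # Fix 3: the id may be absent from the numbering table, so look it up
    # with a default instead of assuming it is present.
    return numbering.get(num_id, "")


def broken() -> Optional[int]:
    raise RuntimeError("simulated failure inside the list lookup")


numbering = {1: "1.", 2: "a)"}
print(repr(list_label(lambda: 1, numbering)))   # id present in numbering
print(repr(list_label(lambda: 99, numbering)))  # id missing from numbering
print(repr(list_label(broken, numbering)))      # lookup raised
```

The design choice illustrated here is that a failure while formatting one list item degrades to an empty label rather than an exception for the entire file.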


RE: [DISCUSS] 1.9 Tika release?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Nick!

-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: Friday, June 05, 2015 6:15 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] 1.9 Tika release?


>> text/dif+xml->application/dif+xml

>Expected and fine

Agreed on the mime type, but is there a reason we're losing text?  Or was that incorrect duplication earlier?


>> text/plain; charset=windows-1252->application/pdf
>> text/plain; charset=windows-1255->application/pdf

>These are (hopefully!) PDFs with junk on the front, so good

Agreed.

>> text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1

>Not sure if this is correct or not, maybe double check these by hand?


>> It looks like the change of the magic range for pdfs was a good move 
>> (for govdocs1, at least).  However, we’re now losing content from those 
>> files that are now identified as bibtex.

>The *tex formats have an application/ mimetype but no parser, so now we 
>correctly detect them they stopped going through the text parser as a 
>fallback. I've hopefully fixed that in r1683702, by marking their 
>mimetypes as descending from text, so the text parser can claim them if 
>nothing else can

Thank you!

>> For govdocs1, we’re now at 6,653 “caught” exceptions for container 
>> documents (out of 979,143=0.7%), but we have roughly 33k exceptions for 
>> embedded documents out of 1,364,552=2.4%).  As before, I need to confirm 
>> that something didn’t go wrong with my code; it could also be the case 
>> that the files are being mis-id’d as Excel… For now, though, it looks 
>> like that high # is driven by embedded Excel files.

>Maybe best to raise one new jira issue per main area, and upload a single 
>sample file from govdocs that shows the problem, and we can tackle them in 
>turn in 1.10/1.11?

Y. Ran out of steam last night.

Re: [DISCUSS] 1.9 Tika release?

Posted by Konstantin Gribov <gr...@gmail.com>.
I think files that moved from text/plain to pdf should also be checked by
hand, since we have a quite new low-priority magic for pdfs ("%PDF-1." and
"%PDF-2." in the first 0.5kB of the stream).

-- 
Best regards,
Konstantin Gribov
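
A hand check of those files could start from the rule described in the message above: look for "%PDF-1." or "%PDF-2." anywhere in the first 0.5kB rather than only at offset 0. The sketch below is a simplified illustration in Python, not Tika's detector; the function name and the 512-byte window are assumptions based on that description.

```python
PDF_MAGICS = (b"%PDF-1.", b"%PDF-2.")
WINDOW = 512  # "first 0.5kB of stream", taken here as 512 bytes (assumption)


def pdf_magic_offset(data: bytes) -> int:
    """Return the offset of the first PDF magic lying fully inside the window, or -1."""
    head = data[:WINDOW]
    offsets = [head.find(magic) for magic in PDF_MAGICS]
    found = [offset for offset in offsets if offset >= 0]
    return min(found) if found else -1


junk_prefix = b"HTTP/1.1 200 OK\r\n\r\n"
samples = {
    "clean pdf": b"%PDF-1.4\n...",
    "pdf with junk on the front": junk_prefix + b"%PDF-1.3\n...",
    "plain text": b"This text only mentions PDF in passing.",
    "magic beyond the window": b"x" * 600 + b"%PDF-1.4\n...",
}
for name, data in samples.items():
    print(name, pdf_magic_offset(data))
```

A non-zero offset flags a file whose PDF header is preceded by other bytes, which is the case worth inspecting by hand.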

On Fri, 5 June 2015 at 13:16, Nick Burch <ap...@gagravarr.org> wrote:

> On Fri, 5 Jun 2015, Allison, Timothy B. wrote:
> > Changes in mime detection for "main" files:
> >
> > text/plain; charset=ISO-8859-1->application/x-bibtex-text-file
> > text/plain; charset=windows-1252->application/x-bibtex-text-file
> > text/html; charset=ISO-8859-1->application/x-bibtex-text-file
>
> I think these are expected and good
>
> > text/dif+xml->application/dif+xml
>
> Expected and fine
>
> > text/plain; charset=windows-1252->application/pdf
> > text/plain; charset=windows-1255->application/pdf
>
> These are (hopefully!) PDFs with junk on the front, so good
>
> > text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1
>
> Not sure if this is correct or not, maybe double check these by hand?
>
>
> > It looks like the change of the magic range for pdfs was a good move
> > (for govdocs1, at least).  However, we’re now losing content from those
> > files that are now identified as bibtex.
>
> The *tex formats have an application/ mimetype but no parser, so now we
> correctly detect them they stopped going through the text parser as a
> fallback. I've hopefully fixed that in r1683702, by marking their
> mimetypes as descending from text, so the text parser can claim them if
> nothing else can
>
>
> > For govdocs1, we’re now at 6,653 “caught” exceptions for container
> > documents (out of 979,143=0.7%), but we have roughly 33k exceptions for
> > embedded documents out of 1,364,552=2.4%).  As before, I need to confirm
> > that something didn’t go wrong with my code; it could also be the case
> > that the files are being mis-id’d as Excel… For now, though, it looks
> > like that high # is driven by embedded Excel files.
>
> Maybe best to raise one new jira issue per main area, and upload a single
> sample file from govdocs that shows the problem, and we can tackle them in
> turn in 1.10/1.11?
>
> Nick

RE: [DISCUSS] 1.9 Tika release?

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 5 Jun 2015, Allison, Timothy B. wrote:
> Changes in mime detection for "main" files:
>
> text/plain; charset=ISO-8859-1->application/x-bibtex-text-file
> text/plain; charset=windows-1252->application/x-bibtex-text-file
> text/html; charset=ISO-8859-1->application/x-bibtex-text-file

I think these are expected and good

> text/dif+xml->application/dif+xml

Expected and fine

> text/plain; charset=windows-1252->application/pdf
> text/plain; charset=windows-1255->application/pdf

These are (hopefully!) PDFs with junk on the front, so good

> text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1

Not sure if this is correct or not, maybe double check these by hand?


> It looks like the change of the magic range for pdfs was a good move 
> (for govdocs1, at least).  However, we’re now losing content from those 
> files that are now identified as bibtex.

The *tex formats have an application/ mimetype but no parser, so now we 
correctly detect them they stopped going through the text parser as a 
fallback. I've hopefully fixed that in r1683702, by marking their 
mimetypes as descending from text, so the text parser can claim them if 
nothing else can
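
The fallback described above can be pictured as a walk up the media-type hierarchy: if no parser is registered for a type, try its declared parent, and so on. The Python sketch below illustrates only that idea; the two dictionaries and the `find_parser` function are hypothetical and do not reproduce Tika's registry or parser classes.

```python
from typing import Dict, Optional

# Hypothetical child -> parent declarations ("descending from text").
SUPERTYPES: Dict[str, str] = {
    "application/x-bibtex-text-file": "text/plain",
}

# Hypothetical registry of which media types have a parser.
PARSERS: Dict[str, str] = {
    "text/plain": "text parser",
    "application/pdf": "pdf parser",
}


def find_parser(media_type: str,
                supertypes: Dict[str, str] = SUPERTYPES,
                parsers: Dict[str, str] = PARSERS) -> Optional[str]:
    """Return the parser for media_type or its nearest ancestor, else None."""
    seen = set()
    current: Optional[str] = media_type
    while current is not None and current not in seen:
        if current in parsers:
            return parsers[current]
        seen.add(current)  # guard against accidental cycles
        current = supertypes.get(current)
    return None


# With the parent declared, the text parser can claim the bibtex type.
print(find_parser("application/x-bibtex-text-file"))
# Without the declaration, nothing claims it and its content is lost.
print(find_parser("application/x-bibtex-text-file", supertypes={}))
```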


> For govdocs1, we’re now at 6,653 “caught” exceptions for container 
> documents (out of 979,143=0.7%), but we have roughly 33k exceptions for 
> embedded documents out of 1,364,552=2.4%).  As before, I need to confirm 
> that something didn’t go wrong with my code; it could also be the case 
> that the files are being mis-id’d as Excel… For now, though, it looks 
> like that high # is driven by embedded Excel files.

Maybe best to raise one new jira issue per main area, and upload a single 
sample file from govdocs that shows the problem, and we can tackle them in 
turn in 1.10/1.11?

Nick

RE: [DISCUSS] 1.9 Tika release?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Chris and all,

I think we're good to go for rc2.  Shouldn't have been so optimistic on time estimate. I'm sorry it took so long.

I just finished the preliminary analyses.

Many thanks to Nick for figuring out that the new eval code was hotter off the press than it should have been. :)

Details on runs against govdocs1

10 "fixed" exceptions
4 "new" exceptions

More attachments in a handful of .doc files.
Metadata values roughly equivalent.

Changes in mime detection for "main" files:
text/plain; charset=ISO-8859-1->application/x-bibtex-text-file    11
text/dif+xml->application/dif+xml                                 9
text/plain; charset=windows-1252->application/pdf                 4
text/plain; charset=windows-1252->application/x-bibtex-text-file  3
text/plain; charset=windows-1255->application/pdf                 3
text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1     2
text/html; charset=ISO-8859-1->application/x-bibtex-text-file     1

Changes in mime detection for embedded files:
CONCAT(DETECTED_CONTENT_TYPE_A, '->', DETECTED_CONTENT_TYPE_B)  CNT
application/x-msmetafile->application/pdf                       42
image/x-pict->application/pdf                                   28
application/x-msmetafile->application/zlib                      5
application/x-emf->application/zlib                             2
application/octet-stream->application/zlib                      1
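
The column header above suggests the table came from a grouped SQL query over the comparison results. A minimal sketch of such a query follows, using Python's built-in sqlite3 module with an in-memory table; the table name `comparisons`, the alias `TYPE_CHANGE`, and the sample rows are assumptions made for illustration, and SQLite's `||` operator stands in for `CONCAT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comparisons ("
    " DETECTED_CONTENT_TYPE_A TEXT,"
    " DETECTED_CONTENT_TYPE_B TEXT)"
)
sample_rows = [
    ("application/x-msmetafile", "application/pdf"),
    ("application/x-msmetafile", "application/pdf"),
    ("image/x-pict", "application/pdf"),
    ("image/png", "image/png"),  # unchanged type, excluded by the WHERE clause
]
conn.executemany("INSERT INTO comparisons VALUES (?, ?)", sample_rows)

QUERY = """
SELECT DETECTED_CONTENT_TYPE_A || '->' || DETECTED_CONTENT_TYPE_B AS TYPE_CHANGE,
       COUNT(*) AS CNT
FROM comparisons
WHERE DETECTED_CONTENT_TYPE_A <> DETECTED_CONTENT_TYPE_B
GROUP BY TYPE_CHANGE
ORDER BY CNT DESC, TYPE_CHANGE
"""
changes = conn.execute(QUERY).fetchall()
for type_change, cnt in changes:
    print(f"{type_change}\t{cnt}")
```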




Nick has already opened an issue to look into the extra wrapping around pdfs.



It looks like the change of the magic range for pdfs was a good move (for govdocs1, at least).  However, we’re now losing content from those files that are now identified as bibtex.



ONE BIG AREA FOR FURTHER ANALYSIS

In the earlier eval code, we had one row per input/parent/container document.  I’ve modified it so that we now have one row per document, whether it is an attachment/embedded document or a parent/container document.  I’m also now storing the stacktraces from the embedded documents.



For govdocs1, we’re now at 6,653 “caught” exceptions for container documents (out of 979,143 = 0.7%), but we have roughly 33k exceptions for embedded documents (out of 1,364,552 = 2.4%).  As before, I need to confirm that something didn’t go wrong with my code; it could also be the case that the files are being mis-id’d as Excel… For now, though, it looks like that high # is driven by embedded Excel files.
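
The percentages above can be rechecked with a few lines of arithmetic. The Python sketch below uses the totals quoted in this message and the per-type exception counts for embedded files listed further down; treating "roughly 33k" as the sum of those per-type counts is an assumption.

```python
container_exceptions, container_docs = 6_653, 979_143
embedded_docs = 1_364_552

# Per-type exception counts for embedded files, copied from the table
# "With exceptions for embedded files" later in this message.
embedded_by_type = {
    "application/vnd.ms-excel": 26008,
    "image/png": 2184,
    "application/vnd.visio": 2019,
    "image/x-ms-bmp": 1899,
    "image/jpeg": 311,
    "image/vnd.dwg": 51,
    "application/x-font-ttf": 40,
    "image/vnd.adobe.photoshop": 21,
    "application/x-tika-msoffice": 14,
    "application/vnd.google-earth.kml+xml": 13,
    "application/vnd.ms-powerpoint": 12,
    "null": 9,
    "application/xml": 3,
    "application/msword": 2,
    "application/pdf": 1,
}
embedded_exceptions = sum(embedded_by_type.values())


def pct(part: int, whole: int) -> float:
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


print("container rate %:", pct(container_exceptions, container_docs))
print("embedded exceptions:", embedded_exceptions)
print("embedded rate %:", pct(embedded_exceptions, embedded_docs))
print("excel share %:", pct(embedded_by_type["application/vnd.ms-excel"],
                            embedded_exceptions))
```

With these inputs the per-type counts sum to 32,587, the rates round to 0.7% and 2.4%, and embedded Excel files account for about 80% of the embedded exceptions, which is consistent with the observation above.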



Compare exceptions for container files:
DETECTED_CONTENT_TYPE_B                                                    CNT
application/xml                                                            1781
application/vnd.ms-powerpoint                                              968
application/msword                                                         681
application/vnd.ms-excel                                                   272
application/pdf                                                            107
application/vnd.google-earth.kml+xml                                       19
image/jpeg                                                                 5
application/x-tika-msoffice                                                5
text/plain; charset=ISO-8859-1                                             5
application/vnd.ms-excel.sheet.3                                           4
application/vnd.openxmlformats-officedocument.presentationml.presentation  3
application/vnd.ms-excel.sheet.4                                           3
application/xhtml+xml; charset=UTF-8                                       2
application/rdf+xml                                                        2
image/vnd.dwg                                                              2
application/rtf                                                            2
application/rss+xml                                                        1
application/dita+xml; format=topic                                         1
application/vnd.openxmlformats-officedocument.wordprocessingml.document    1
text/html; charset=windows-1252                                            1
text/html; charset=ISO-8859-1                                              1

With exceptions for embedded files:
DETECTED_CONTENT_TYPE_B               CNT
application/vnd.ms-excel              26008
image/png                             2184
application/vnd.visio                 2019
image/x-ms-bmp                        1899
image/jpeg                            311
image/vnd.dwg                         51
application/x-font-ttf                40
image/vnd.adobe.photoshop             21
application/x-tika-msoffice           14
application/vnd.google-earth.kml+xml  13
application/vnd.ms-powerpoint         12
null                                  9
application/xml                       3
application/msword                    2
application/pdf                       1

Top 10 file types overall for parent/container files
DETECTED_CONTENT_TYPE_B           CNT
application/pdf                   230860
image/jpeg                        109282
text/html; charset=ISO-8859-1     86344
application/msword                76984
text/plain; charset=ISO-8859-1    64186
application/vnd.ms-excel          58930
application/vnd.ms-powerpoint     51169
text/html; charset=windows-1252   50292
text/plain; charset=windows-1252  42988
text/html; charset=UTF-8          41003

Top 10 file types for embedded files
DETECTED_CONTENT_TYPE_B                                    CNT
image/png                                                  470639
image/jpeg                                                 261050
application/x-msmetafile                                   231803
application/x-emf                                          103058
application/x-tika-msoffice                                82964
application/vnd.ms-excel                                   54296
image/x-pict                                               29764
application/msword                                         22699
image/x-ms-bmp                                             21525
application/x-tika-msoffice-embedded; format=ole10_native  19751

For anyone who wants to kick the tires on the apparent embedded excel issue, here are some files sorted by asc order of the json length:

428/428996.ppt  428996.ppt/11   application/vnd.ms-excel  6050
920/920182.ppt  920182.ppt/2    application/vnd.ms-excel  6093
852/852522.ppt  852522.ppt/758  application/vnd.ms-excel  6110
851/851799.ppt  851799.ppt/789  application/vnd.ms-excel  6110
854/854876.ppt  854876.ppt/696  application/vnd.ms-excel  6112
703/703075.ppt  703075.ppt/830  application/vnd.ms-excel  6112
849/849126.ppt  849126.ppt/880  application/vnd.ms-excel  6114
849/849621.ppt  849621.ppt/861  application/vnd.ms-excel  6116
847/847762.ppt  847762.ppt/992  application/vnd.ms-excel  6119

Looks like the majority are embedded in ppt, but there are several embedded in xls as well.



Cheers,



          Tim



RE: [DISCUSS] 1.9 Tika release?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Fixed eval code, thanks to Nick.

Now running against doc/x list fixes to confirm success.

Will rerun tomorrow on full set, with results by noon EDT.



RE: [DISCUSS] 1.9 Tika release?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Y.  Will do.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Tuesday, June 02, 2015 10:31 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] 1.9 Tika release?

Thanks Tim. Looks like you and Nick have been fixing some other
stuff too.

Let me know when I should spin RC #2. Ready and willing! :-)

Cheers,
Chris



Re: [DISCUSS] 1.9 Tika release?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thanks Tim. Looks like you and Nick have been fixing some other
stuff too.

Let me know when I should spin RC #2. Ready and willing! :-)

Cheers,
Chris

-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, June 1, 2015 at 6:53 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: [DISCUSS] 1.9 Tika release?

>Thank you.  Sorry about this.  I had hoped to run against govdocs1
>soonish to look for these before an rc was cut.
>
>There were three critical points in the code that accounted for all of
>the exceptions:
>
>1) one where I had left in a debugging RuntimeException throw instead of
>a swallow/return ""
>2) one where poi fairly commonly throws a RuntimeException when something
>goes wrong in paragraph.getList()
>3) failure of imagination that a value might not be found in XWPF's
>numbering, which led to NPE
>
>These are all fixed locally.  Will rerun over night with results tomorrow.


RE: [DISCUSS] 1.9 Tika release?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you.  Sorry about this.  I had hoped to run against govdocs1 soonish to look for these before an rc was cut.

There were three critical points in the code that accounted for all of the exceptions:

1) one where I had left in a debugging RuntimeException throw instead of swallowing the problem and returning ""
2) one where POI fairly commonly throws a RuntimeException when something goes wrong in paragraph.getList()
3) a failure of imagination on my part: a value might not be found in XWPF's numbering, which led to an NPE

These are all fixed locally.  Will rerun overnight and report results tomorrow.
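
For readers following along, the three failure modes above can be sketched roughly as below. This is only an illustration of the defensive pattern, not the actual Tika fix: the class ListPrefixSketch, the methods safePrefix and formatFor, and the Map standing in for XWPF's numbering are all hypothetical names, and no real POI calls are used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from Tika or POI.
class ListPrefixSketch {

    // Cases 1 and 2: if the lookup throws a RuntimeException (or yields null),
    // fall back to an empty prefix instead of failing the whole parse.
    public static String safePrefix(Supplier<String> lookup) {
        try {
            String prefix = lookup.get();
            return prefix == null ? "" : prefix;
        } catch (RuntimeException e) {
            return "";
        }
    }

    // Case 3: the numbering id may be missing from the table, so check for
    // null before using the result rather than dereferencing it blindly.
    public static String formatFor(Map<Integer, String> numbering, Integer numId) {
        if (numbering == null || numId == null) {
            return "";
        }
        String format = numbering.get(numId);
        return format == null ? "" : format;
    }

    public static void main(String[] args) {
        // Stand-in for a document's numbering definitions (assumed shape).
        Map<Integer, String> numbering = new HashMap<>();
        numbering.put(1, "%1.");

        // Known id: the stored format is returned.
        System.out.println("[" + safePrefix(() -> formatFor(numbering, 1)) + "]");
        // Unknown id: empty string, no NullPointerException.
        System.out.println("[" + safePrefix(() -> formatFor(numbering, 42)) + "]");
        // Lookup that throws: empty string, exception is swallowed.
        System.out.println("[" + safePrefix(() -> {
            throw new IllegalStateException("bad list");
        }) + "]");
    }
}
```

The trade-off in a sketch like this is that a swallowed exception costs the list marker for that paragraph but keeps the rest of the document's text.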

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Monday, June 01, 2015 9:06 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] 1.9 Tika release?

ACK nw, will be happy to spin RC #2 when we’re ready :)



Re: [DISCUSS] 1.9 Tika release?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
ACK nw, will be happy to spin RC #2 when we’re ready :)

-----Original Message-----
From: <Allison>, "Timothy B." <ta...@mitre.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, June 1, 2015 at 6:04 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: [DISCUSS] 1.9 Tika release?

>-1
>
>Do not release because there are roughly 200 new exceptions in govdocs1
>caused by the list processing code I just added to doc and docx files.
>
>Argh...
>
>Will fix asap.


RE: [DISCUSS] 1.9 Tika release?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
-1

Do not release because there are roughly 200 new exceptions in govdocs1 caused by the list processing code I just added to doc and docx files.

Argh...

Will fix asap.

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Monday, June 01, 2015 7:03 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] 1.9 Tika release?

Will run rc against govdocs1 and commoncrawl to see what I find.  Results by tomorrow. 

Thank you, Chris!

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Sunday, May 31, 2015 2:52 PM
To: dev@tika.apache.org
Subject: [DISCUSS] 1.9 Tika release?

Hey Folks,

There’s been lots of new Tika goodness coming with the GeoTopic stuff,
the ExternParser fixes, and also with FFMPEG and EXIF improvements.
I’ve got cycles today so I will try and cut a 1.9 RC (I’m doing the OODT
0.9 RC right now too and am in the mood).

Please feel free to VOTE with your feet and I’m happy to cut more than
1 RC if I missed anything of course.

Cheers,
Chris
