Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2019/11/22 13:25:50 UTC

regression tests for 1.23-rc1

All,
  I started the regression tests on a random set of 500k files.  I found
this morning that it was _still_ going.  It turns out I had accidentally
configured extract images for PDFs, which adds to the processing time and
leads to more OOMs.
  I restarted the regression tests this morning with that feature turned
off.

       Best,

                   Tim

Re: regression tests for 1.23-rc1

Posted by Tim Allison <ta...@apache.org>.

All,

New reports are here:
http://162.242.228.174/reports/reports_tika_1.22_vs_1.23-pre-rc1.tgz

I ran these with the most recent 1.23-SNAPSHOT on the full 500k sample.
There are a few things to look into, but nothing that leaps out to me.

Unless there are objections, I'll roll rc1 shortly.

Cheers,

      Tim

On Mon, Nov 25, 2019 at 10:57 AM Tim Allison <ta...@apache.org> wrote:

> d) is not a problem.  It was caused by a bit of idiocy in my random file
> selection code that allowed for duplicate files...so the list did have 500k
> file names, but only ~270k unique file names.
>
> On Mon, Nov 25, 2019 at 10:08 AM Tim Allison <ta...@apache.org> wrote:
>
>> All,
>>   I finished the regression tests, and the reports are available here:
>> http://162.242.228.174/reports/reports_tika_1.22_vs_1.23-pre-rc1.tgz
>>   My takeaways:
>>   a) we need to fix the new code in the PDFParser that sets whether or
>> not there is a digital signature.  That should be set, not add.
>>   b) we are getting a few new exceptions on going over the safety maximum
>> for byte array allocation in POI.  We can make this configurable at the
>> Tika level.
>>   c) there are a few new problems with EMF parsing, but these won't harm
>> parsing the rest of the file.
>>   d) both runs (1.22 and 1.23-pre-rc1) only processed ~250k files, but
>> there were ~500k in the list...I need to figure out what went wrong.
>>
>>   If I find nothing concerning on d), are we ready to roll 1.23-rc1?
>>
>>               Cheers,
>>
>>                            Tim
>>
>> On Fri, Nov 22, 2019 at 8:25 AM Tim Allison <ta...@apache.org> wrote:
>>
>>> All,
>>>   I started the regression tests on a random set of 500k files.  I found
>>> this morning that it was _still_ going.  It turns out I had accidentally
>>> configured extract images for PDFs, which adds to the processing time and
>>> leads to more OOMs.
>>>   I restarted the regression tests this morning with that feature turned
>>> off.
>>>
>>>        Best,
>>>
>>>                    Tim
>>>
>>

Re: regression tests for 1.23-rc1

Posted by Tim Allison <ta...@apache.org>.

d) is not a problem.  It was caused by a bit of idiocy in my random file
selection code that allowed for duplicate files...so the list did have 500k
file names, but only ~270k unique file names.
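
The selection code itself is not shown in this thread, so the following is only a minimal sketch of how this kind of duplication can arise, assuming each file name was drawn independently (that is, with replacement). The function names and the toy file list are hypothetical.

```python
import random


def sample_with_replacement(paths, k, seed=0):
    """Draw k names independently; the same name can be drawn more than once."""
    rng = random.Random(seed)
    return [rng.choice(paths) for _ in range(k)]


def sample_without_replacement(paths, k, seed=0):
    """Draw k distinct names; raises ValueError if k > len(paths)."""
    rng = random.Random(seed)
    return rng.sample(paths, k)


# Hypothetical corpus listing, much smaller than the real one.
paths = [f"file_{i}.bin" for i in range(1000)]

dup_list = sample_with_replacement(paths, 500)
uniq_list = sample_without_replacement(paths, 500)

# The list length is the same in both cases; only the unique count differs.
print(len(dup_list), len(set(dup_list)))
print(len(uniq_list), len(set(uniq_list)))
```

Comparing len(names) against len(set(names)) before starting a run is a cheap way to catch this.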

On Mon, Nov 25, 2019 at 10:08 AM Tim Allison <ta...@apache.org> wrote:

> All,
>   I finished the regression tests, and the reports are available here:
> http://162.242.228.174/reports/reports_tika_1.22_vs_1.23-pre-rc1.tgz
>   My takeaways:
>   a) we need to fix the new code in the PDFParser that sets whether or
> not there is a digital signature.  That should be set, not add.
>   b) we are getting a few new exceptions on going over the safety maximum
> for byte array allocation in POI.  We can make this configurable at the
> Tika level.
>   c) there are a few new problems with EMF parsing, but these won't harm
> parsing the rest of the file.
>   d) both runs (1.22 and 1.23-pre-rc1) only processed ~250k files, but
> there were ~500k in the list...I need to figure out what went wrong.
>
>   If I find nothing concerning on d), are we ready to roll 1.23-rc1?
>
>               Cheers,
>
>                            Tim
>
> On Fri, Nov 22, 2019 at 8:25 AM Tim Allison <ta...@apache.org> wrote:
>
>> All,
>>   I started the regression tests on a random set of 500k files.  I found
>> this morning that it was _still_ going.  It turns out I had accidentally
>> configured extract images for PDFs, which adds to the processing time and
>> leads to more OOMs.
>>   I restarted the regression tests this morning with that feature turned
>> off.
>>
>>        Best,
>>
>>                    Tim
>>
>

Re: regression tests for 1.23-rc1

Posted by Tim Allison <ta...@apache.org>.

All,
  I finished the regression tests, and the reports are available here:
http://162.242.228.174/reports/reports_tika_1.22_vs_1.23-pre-rc1.tgz
  My takeaways:
  a) we need to fix the new code in the PDFParser that sets whether or not
there is a digital signature.  That should be set, not add.
  b) we are getting a few new exceptions on going over the safety maximum
for byte array allocation in POI.  We can make this configurable at the
Tika level.
  c) there are a few new problems with EMF parsing, but these won't harm
parsing the rest of the file.
  d) both runs (1.22 and 1.23-pre-rc1) only processed ~250k files, but
there were ~500k in the list...I need to figure out what went wrong.
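
To illustrate the set-versus-add distinction in a): in a multi-valued metadata store, add appends another value under a key, while set replaces whatever was there. The class below is a toy stand-in written for this note, not Tika's Metadata class, and the key name "has_signature" is hypothetical.

```python
from collections import defaultdict


class MultiValueMetadata:
    """Toy multi-valued metadata store (an illustration, not Tika's class)."""

    def __init__(self):
        self._values = defaultdict(list)

    def add(self, key, value):
        # Appends: earlier values for the key are kept.
        self._values[key].append(value)

    def set(self, key, value):
        # Replaces: the key ends up with exactly one value.
        self._values[key] = [value]

    def get_values(self, key):
        return list(self._values.get(key, []))


# Assume the flag is written twice, e.g. once as a default and once on detection.
added = MultiValueMetadata()
for flag in ("false", "true"):
    added.add("has_signature", flag)

assigned = MultiValueMetadata()
for flag in ("false", "true"):
    assigned.set("has_signature", flag)

print(added.get_values("has_signature"))     # two values for a yes/no field
print(assigned.get_values("has_signature"))  # one value
```

For a yes/no field, add can leave several (possibly conflicting) values behind, which is why set is the better fit.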

  If I find nothing concerning on d), are we ready to roll 1.23-rc1?

              Cheers,

                           Tim

On Fri, Nov 22, 2019 at 8:25 AM Tim Allison <ta...@apache.org> wrote:

> All,
>   I started the regression tests on a random set of 500k files.  I found
> this morning that it was _still_ going.  It turns out I had accidentally
> configured extract images for PDFs, which adds to the processing time and
> leads to more OOMs.
>   I restarted the regression tests this morning with that feature turned
> off.
>
>        Best,
>
>                    Tim
>

Re: regression tests for 1.23-rc1

Posted by Eric Pugh <ep...@opensourceconnections.com>.

I feel like you just experienced a wonderful lesson that we all periodically experience…  “Extracting data at scale”

I wonder, is there any way of coming up with heuristics to predict how long the process would take?  “Based on your settings, based on your doc types, based on sizes, based on historical records…  it will take 20 hours to run”…
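
One very rough way to sketch such a heuristic, assuming historical average parse times per document type are available: multiply each type's file count by its average time and divide by the number of parallel workers. The function name, the per-type timings, and the counts below are all made up for illustration. The estimate ignores file size, settings such as image extraction, and restarts after OOMs.

```python
def estimate_hours(file_counts, seconds_per_file, workers=1):
    """Estimate wall-clock hours for a run.

    file_counts:      {doc_type: number of files}
    seconds_per_file: {doc_type: historical average seconds per file}
    workers:          number of parallel parsing processes

    Raises KeyError if a doc type has no historical timing.
    """
    total_seconds = sum(
        count * seconds_per_file[doc_type]
        for doc_type, count in file_counts.items()
    )
    return total_seconds / workers / 3600


# Hypothetical numbers, not measurements from any real run.
history = {"pdf": 0.5, "docx": 0.2}
counts = {"pdf": 100_000, "docx": 50_000}

print(f"{estimate_hours(counts, history, workers=4):.1f} hours")
```

Adding a per-setting multiplier (for example, for image extraction) or bucketing by file size would be natural refinements once there is enough history to fit them.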

> On Nov 22, 2019, at 8:25 AM, Tim Allison <ta...@apache.org> wrote:
> 
> All,
>  I started the regression tests on a random set of 500k files.  I found
> this morning that it was _still_ going.  It turns out I had accidentally
> configured extract images for PDFs, which adds to the processing time and
> leads to more OOMs.
>  I restarted the regression tests this morning with that feature turned
> off.
> 
>       Best,
> 
>                   Tim

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>