Posted to dev@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2017/11/01 12:48:44 UTC
RE: Running tika-eval on the Rackspace vm
Sorry. Fixed.
-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Tuesday, October 31, 2017 6:08 PM
To: dev@pdfbox.apache.org
Subject: Re: Running tika-eval on the Rackspace vm
On 31.10.2017 at 20:53, Allison, Timothy B. wrote:
>> It's not possible to rename / remove the files / directories mentioned in part 1 due to not having the permissions.
> Gah. Sorry. Tilman, I added you to "collab" and chgrp to collab on /work /data2/docs /data3/batch_runs and /data4/batch_runs.
But the directories themselves don't have "w" rights for the group, so I can't benefit from my membership... (unless I missed something; I haven't done much *nix since the '90s). For example, I can't rename /work/batch-apps/tika_working/logs to /work/batch-apps/tika_working/___logs .
Tilman
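A minimal shell sketch of the permission issue described above, assuming a POSIX-like system: renaming an entry requires write permission on its parent directory, so a `chgrp` alone does not help if the group lacks the `w` bit. The function name `group_can_write` is hypothetical and not part of the VM setup.

```shell
# Hypothetical helper: report whether a directory's group has the write bit.
# Renaming or removing an entry requires write (and search) permission on the
# parent directory, not on the entry itself.
group_can_write() {
    dir="$1"
    # In `ls -ld` output, character 6 of the mode string is the group write bit.
    mode=$(ls -ld "$dir" | cut -c1-10)
    case "$mode" in
        d????w*) echo "yes" ;;
        *)       echo "no" ;;
    esac
}

# Possible fix by the directory owner (path taken from the thread):
#   chmod g+w /work/batch-apps/tika_working
```

For example, `group_can_write /work/batch-apps/tika_working` would print `no` for a directory in the state described above.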
>
>> The directory is named batch-apps, not batch_apps.
> Fixed. Thank you.
>
>> Re the "A" version - is this the "good" version, so I could simply download tika-app and put it there? Or just build tika with a specific PDFBox version?
> If the current version of tika-app has the right version of PDFBox for your "before" examples, then yes, you can just download tika-app.jar. We release less frequently than PDFBox, so it's possible that you'll want to build from scratch with the most recent previous release of PDFBox.
>
> In my mind, A is the "before/baseline" version and B is the
> SNAPSHOT/RC version. So, hopefully, B is the "good" one. 😊
>
> Let me know what other problems you encounter.
>
> Cheers,
>
> Tim
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For
> additional commands, e-mail: dev-help@pdfbox.apache.org
>
Re: Running tika-eval on the Rackspace vm
Posted by Tilman Hausherr <TH...@t-online.de>.
There's definitely some problem with creating a temp file... I
inserted this line in dumpXLSX:
TempFile.createTempFile("tilman", "txt");
and got an exception.
I also added " -Djava.io.tmpdir=/tmp" to the call but this didn't help.
Tilman
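To narrow down a failure like this, it can help to test the temp directory outside of Java. A rough sketch, assuming `mktemp` is available; `check_tmpdir` is a hypothetical helper name. The directory the JVM actually uses is whatever `java.io.tmpdir` resolves to for that process, so this only checks the directory you pass in.

```shell
# Hypothetical helper: check whether the current user can create a file in a
# directory that is meant to be used as java.io.tmpdir.
check_tmpdir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # mktemp fails if the directory is not writable for the current user.
    if probe=$(mktemp "$dir/probe.XXXXXX" 2>/dev/null); then
        rm -f "$probe"
        echo "writable: $dir"
    else
        echo "not writable: $dir"
        return 1
    fi
}

# Example: check_tmpdir /tmp
```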
Re: Running tika-eval on the Rackspace vm
Posted by Tilman Hausherr <TH...@t-online.de>.
On 07.11.2017 at 16:21, Allison, Timothy B. wrote:
> Great! Thank you, Tilman!
>
> I updated the wiki based on your feedback. Let me know if I should add anything else while the experience is fresh.
Please change "Run the PDFParser tests..." into "Build tika-parsers
separately to make sure that this version is added to the repository and
will be used by the tika-app build. Run the PDFParser tests...."
This is because building tika-app does not trigger a rebuild of
tika-parsers.
Tilman
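The build order described above might look like the following sketch. It assumes a Tika source checkout with `tika-parsers` and `tika-app` module directories and Maven on the path; the function name `build_tika_app` and the `RUN` variable are hypothetical. By default the function only prints the commands (dry run).

```shell
# Dry run by default: commands are printed, not executed.
# Set RUN="" in the environment to actually invoke Maven.
RUN="${RUN-echo}"

build_tika_app() {
    # 1. Install tika-parsers into the local Maven repository first, so the
    #    installed artifact reflects the PDFBox version set in the pom.xml.
    $RUN mvn -f tika-parsers/pom.xml clean install || return 1
    # 2. Build tika-app, which resolves tika-parsers from that repository.
    $RUN mvn -f tika-app/pom.xml clean install
}
```

Running Maven from the top-level directory with `-pl tika-app -am` (also-make) may be another way to have the upstream modules rebuilt in the same invocation.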
>
> Best,
>
> Tim
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Monday, November 6, 2017 3:00 PM
> To: dev@pdfbox.apache.org
> Subject: Re: Running tika-eval on the Rackspace vm
>
> I think I was successful; the report now makes sense, as if Tim had created it himself :-) The two issues I just created are related to a comparison between 2.0.8 and 2.0.4.
>
> So for the next board report, we can now (in addition to the existing
> text) say that there is a second committer who can run the tests.
>
> Tilman
>
> On 05.11.2017 at 22:06, Tilman Hausherr wrote:
>> I've come closer to finding out what's happening. I found out that
>> tika-app was running with PDFBox 2.0.7 all the time, regardless of what
>> pdfbox version is in the pom.xml.
>>
>> Apparently, building tika-app uses tika-parsers from the repository
>> (instead of building tika-parsers again), which needs 2.0.7.
>> Explicitly building tika-parsers before building tika-app helps.
>>
>> This is new to me, in PDFBox if one builds the app all dependencies
>> are built as well.
>>
>> Tilman
>>
>>> On 04.11.2017 at 14:48, Tilman Hausherr wrote:
>>> So it's done:
>>> /work/eval/pdfbox_2_0_4_Vs_2_0_8-SNAPSHOT_reports_03112017
>>>
>>> I wonder why the differences are so few, especially in meta where I
>>> KNOW that there are differences, due to the handling of empty strings
>>> with BOM. Maybe it is because I skipped the "A" phase and used
>>> existing data from a 2.0.4 run that I found, or because I use a
>>> current tika trunk and not the existing binary that was on the server.
>>>
>>> I'm thinking of creating a new "A" with 2.0.4 and the current tika trunk
>>> and then comparing it with the "B" I did.
>>>
>>> Tilman
>>>
>>>
>>>> On 03.11.2017 at 22:14, Tilman Hausherr wrote:
>>>> On 03.11.2017 at 21:38, Allison, Timothy B. wrote:
>>>>> I'm not sure what you mean by...sorry
>>>>>> - "H" is missing, which is identical to "C"
>>>>
>>>> I just meant the steps in https://wiki.apache.org/tika/TikaEvalOnVM
>>>>
>>>> In segment 3, "execute: nohup ./appBatchExecutor.sh &" is missing.
>>>> Of course it is obvious that it has to be done, but I am a
>>>> perfectionist. I'd like to have this documentation for the "me" in a
>>>> few months when I have forgotten what I did over the last few days. Or for
>>>> the next person.
>>>>
>>>> Thanks for the fixes you did. I wonder why writing to /tmp didn't
>>>> work - it did work from the command line. I've started the command
>>>> again, I'm not sure when I will report about it. I'm a bit exhausted
>>>> from non-software activities :-(
>>>>
>>>> Tilman
RE: Running tika-eval on the Rackspace vm
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Great! Thank you, Tilman!
I updated the wiki based on your feedback. Let me know if I should add anything else while the experience is fresh.
Best,
Tim
Re: Running tika-eval on the Rackspace vm
Posted by Tilman Hausherr <TH...@t-online.de>.
I think I was successful; the report now makes sense, as if Tim had
created it himself :-) The two issues I just created are related to a
comparison between 2.0.8 and 2.0.4.
So for the next board report, we can now (in addition to the existing
text) say that there is a second committer who can run the tests.
Tilman
Re: Running tika-eval on the Rackspace vm
Posted by Tilman Hausherr <TH...@t-online.de>.
I've come closer to finding out what's happening. I found out that tika-app
was running with PDFBox 2.0.7 all the time, regardless of what pdfbox
version is in the pom.xml.
Apparently, building tika-app uses tika-parsers from the repository
(instead of building tika-parsers again), which needs 2.0.7. Explicitly
building tika-parsers before building tika-app helps.
This is new to me, in PDFBox if one builds the app all dependencies are
built as well.
Tilman
Re: Running tika-eval on the Rackspace vm
Posted by Tilman Hausherr <TH...@t-online.de>.
So it's done:
/work/eval/pdfbox_2_0_4_Vs_2_0_8-SNAPSHOT_reports_03112017
I wonder why the differences are so few, especially in meta where I KNOW
that there are differences, due to the handling of empty strings with
BOM. Maybe it is because I skipped the "A" phase and used existing data
from a 2.0.4 run that I found, or because I use a current tika trunk and
not the existing binary that was on the server.
I'm thinking of creating a new "A" with 2.0.4 and the current tika trunk
and then comparing it with the "B" I did.
Tilman
Re: Running tika-eval on the Rackspace vm
Posted by Tilman Hausherr <TH...@t-online.de>.
On 03.11.2017 at 21:38, Allison, Timothy B. wrote:
> I'm not sure what you mean by...sorry
>> - "H" is missing, which is identical to "C"
I just meant the steps in https://wiki.apache.org/tika/TikaEvalOnVM
In segment 3, "execute: nohup ./appBatchExecutor.sh &" is missing. Of
course it is obvious that it has to be done, but I am a perfectionist.
I'd like to have this documentation for the "me" in a few months when I
have forgotten what I did over the last few days. Or for the next person.
Thanks for the fixes you did. I wonder why writing to /tmp didn't work -
it did work from the command line. I've started the command again, I'm
not sure when I will report about it. I'm a bit exhausted from
non-software activities :-(
Tilman
RE: Running tika-eval on the Rackspace vm
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Tilman,
Thank you for the toe-stubbing. I'm sorry that it wasn't easier...
I created a new user with collab permissions and ran through the process.
You are right about the privileges on the tmp directory... POI needs a tmp directory to write xlsx. I created a tmp directory in /work/eval and added an instruction to set the tmp dir via -Djava.io.tmpdir=tmp.
I'm not sure what you mean by...sorry
>- "H" is missing, which is identical to "C"
I updated the permissions on appBatchExecutor.sh
I also added a recommendation to umask g+rw before starting.
Let me know if I need to fix anything else or if I missed something you've already identified. ☹
Thank you, again.
Best,
Tim
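The setup described above can be sketched roughly as follows. The `umask g+rw` call, the tmp directory location, and the jar name come from the thread; the function name `prepare_eval_tmp` is hypothetical, and the final `java` call is shown only as a comment.

```shell
# Hypothetical helper: prepare a tmp directory for report runs.
prepare_eval_tmp() {
    eval_dir="$1"
    # As recommended in the thread: allow group read/write on newly
    # created files and directories.
    umask g+rw
    mkdir -p "$eval_dir/tmp" || return 1
    echo "$eval_dir/tmp"
}

# Example, run from /work/eval as in the thread:
#   prepare_eval_tmp .
#   java -Djava.io.tmpdir=tmp -jar tika-eval-1.17-SNAPSHOT.jar Report -db pdfboxAvsB
```

Note that `umask` changes the setting of the calling shell, so it stays in effect for later commands in the same session.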
-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Thursday, November 2, 2017 5:47 PM
To: dev@pdfbox.apache.org
Subject: Re: Running tika-eval on the Rackspace vm
Re: Running tika-eval on the Rackspace vm
Posted by Tilman Hausherr <TH...@t-online.de>.
I'm almost done... then I got this when doing the last step:
[tilman@cloud-server-02 eval]$ java -jar tika-eval-1.17-SNAPSHOT.jar Report -db pdfboxAvsB
0 [main] INFO org.apache.tika.eval.reports.Report - Writing report: All Mimes In A to mimes/all_mimes_A.xlsx
Exception in thread "main" java.io.IOException: Permission denied
    at java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.io.File.createTempFile(File.java:2024)
    at org.apache.poi.util.DefaultTempFileCreationStrategy.createTempFile(DefaultTempFileCreationStrategy.java:110)
    at org.apache.poi.util.TempFile.createTempFile(TempFile.java:66)
    at org.apache.poi.xssf.streaming.SXSSFWorkbook.write(SXSSFWorkbook.java:924)
    at org.apache.tika.eval.reports.Report.dumpXLSX(Report.java:85)
    at org.apache.tika.eval.reports.Report.writeReport(Report.java:64)
    at org.apache.tika.eval.reports.ResultsReporter.execute(ResultsReporter.java:305)
    at org.apache.tika.eval.reports.ResultsReporter.main(ResultsReporter.java:266)
    at org.apache.tika.eval.TikaEvalCLI.handleReport(TikaEvalCLI.java:264)
    at org.apache.tika.eval.TikaEvalCLI.execute(TikaEvalCLI.java:52)
    at org.apache.tika.eval.TikaEvalCLI.main(TikaEvalCLI.java:273)
I changed the source, and now I have the path: it is
/work/eval/reports/mimes/all_mimes_A.xlsx . The file exists and it is empty.
I tried with a 1.16 version and the same happened.
Then I thought, maybe the file with the permission problem isn't the
target at all; could this be some temp file / temp directory where I
don't have permission?
Smaller improvements for the documentation:
- appBatchExecutor.sh should have 775 permission or the documentation
should have "nohup sh ./appBatchExecutor.sh &"
- "H" is missing, which is identical to "C"
- mention that "pdfboxAvsB" db files are to be removed before starting?
I had accidentally aborted a run and couldn't restart.
Tilman
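The first point in the list can be illustrated with a small sketch: a script without the execute bit cannot be started as `./script`, but it can still be run by passing it to `sh`. The helper name `how_to_start` is hypothetical and only prints the command line it would suggest.

```shell
# Hypothetical helper: print a suitable nohup command line for a script,
# depending on whether it is executable for the current user.
how_to_start() {
    script="$1"
    if [ -x "$script" ]; then
        echo "nohup $script &"
    else
        # Without the execute bit, hand the script to the shell explicitly.
        echo "nohup sh $script &"
    fi
}

# Alternative fix by the file owner, as suggested in the list above:
#   chmod 775 appBatchExecutor.sh
```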
Memo for me:
java -jar tika-eval-1.17-SNAPSHOT.jar Compare -extractsA /data4/batch_runs/pdfbox_2_0_4 -extractsB /data4/batch_runs/pdfbox_2_0_9-SNAPSHOT1 -db pdfboxAvsB
java -jar tika-eval-1.17-SNAPSHOT.jar Report -db pdfboxAvsB