Posted to dev@poi.apache.org by bu...@apache.org on 2020/05/10 19:49:10 UTC
[Bug 64418] New: Finding text in textfields is very slow
https://bz.apache.org/bugzilla/show_bug.cgi?id=64418
Bug ID: 64418
Summary: Finding text in textfields is very slow
Product: POI
Version: unspecified
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: XWPF
Assignee: dev@poi.apache.org
Reporter: jenskutschke@gmx.de
Target Milestone: ---
I am scanning docx documents for occurrences of specific words / search terms.
The code I am using is shown below.
The search terms can literally be anywhere: in headers, footers, paragraphs,
tables, text fields, ...
Even for a complex document that uses no or very few text fields, parsing
takes a few seconds. As soon as multiple text fields are involved, parsing
takes a considerable amount of time, e.g. 30 seconds or even more than a minute.
Is there anything I am doing wrong in how I use the API, or is there an issue
with XWPF?
Thanks,
Jens
private static void findInBodyElements(String key, List<IBodyElement> bodyElements,
        ArrayList<String> resultList) {
    if (resultList.contains(key)) {
        return;
    }
    for (IBodyElement bodyElement : bodyElements) {
        if (bodyElement.getElementType().compareTo(BodyElementType.PARAGRAPH) == 0) {
            findInParagraph(key, (XWPFParagraph) bodyElement, resultList);
            if (resultList.contains(key)) {
                return;
            }
            findInTextfield(key, (XWPFParagraph) bodyElement, resultList);
            if (resultList.contains(key)) {
                return;
            }
        }
        if (bodyElement.getElementType().compareTo(BodyElementType.TABLE) == 0) {
            findInTable(key, (XWPFTable) bodyElement, resultList);
        }
    }
}
private static void findInParagraph(String key, XWPFParagraph xwpfParagraph,
        ArrayList<String> resultList) {
    if (resultList.contains(key)) {
        return;
    }
    //for (XWPFParagraph paragraph : xwpfParagraphs) {
    List<XWPFRun> runs = xwpfParagraph.getRuns();
    String find = key;
    TextSegment found = xwpfParagraph.searchText(find, new PositionInParagraph());
    if (found != null) {
        if (!resultList.contains(key)) {
            resultList.add(key);
            return;
        }
    }
}
private static void findInTextfield(String key, XWPFParagraph xwpfParagraph,
        ArrayList<String> resultList) {
    if (resultList.contains(key)) {
        return;
    }
    XmlCursor cursor = xwpfParagraph.getCTP().newCursor();
    cursor.selectPath("declare namespace "
            + "w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
            + ".//*/w:txbxContent/w:p/w:r");
    List<XmlObject> ctrsintxtbx = new ArrayList<XmlObject>();
    while (cursor.hasNextSelection()) {
        cursor.toNextSelection();
        XmlObject obj = cursor.getObject();
        ctrsintxtbx.add(obj);
    }
    for (XmlObject obj : ctrsintxtbx) {
        try {
            CTR ctr = CTR.Factory.parse(obj.xmlText());
            XWPFRun bufferrun = new XWPFRun(ctr, (IRunBody) xwpfParagraph);
            String text = bufferrun.getText(0);
            if (text != null && text.contains(key)) {
                if (!resultList.contains(key)) {
                    resultList.add(key);
                    return;
                }
            }
        } catch (Exception ex) {
            log.error("Unable to iterate text fields", ex);
        }
    }
}
--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
[Bug 64418] Finding text in textfields is very slow
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=64418
Dominik Stadler <do...@gmx.at> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |NEEDINFO
--- Comment #3 from Dominik Stadler <do...@gmx.at> ---
Thanks, but unfortunately there is lots of code which is not related to the
problem and thus makes reproducing and analyzing this very hard. The app seems
to not finish for a very long time for me. It also looks a bit like you are
iterating over the contents of the document many times with all the
placeholders and some of the loops in your application.
Can you reduce the code in the sample project as much as possible so that it
still shows the problem, but does not do all the things that are only needed
for your application?
[Bug 64418] Finding text in textfields is very slow
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=64418
--- Comment #4 from j-lawyer.org <je...@gmx.de> ---
Thanks Dominik for looking into this. I have stripped down the test case, the
URL is still the same: https://www.j-lawyer.org/temp/DocXShowCase.zip
- has a list of 50 strings to be searched for in the documents
- has two documents, both just 1 page: (a) has no text fields and (b) has 10
text fields
- each of the 50 strings is searched for using a loop, so I am iterating over
each document fifty times
Basically I just want to know which of the 50 strings are contained in the
documents.
Thanks,
Jens
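The goal stated above (finding out which of the 50 strings occur in a document) does not by itself require walking the document once per string. The following is only a minimal sketch of that idea, using JDK classes alone. It assumes the text of the paragraphs and text boxes has already been extracted once into a list of strings. The class SinglePassKeySearch, the method findContainedKeys and the sample strings are hypothetical and not part of POI. A term that is split across two fragments would not be found this way.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of POI: checks many search terms in one pass. */
final class SinglePassKeySearch {

    /**
     * Returns every key that occurs in at least one text fragment.
     * The fragments are assumed to have been extracted from the document once
     * (for example one string per paragraph or text box) before searching.
     */
    static Set<String> findContainedKeys(Collection<String> keys, List<String> fragments) {
        Set<String> remaining = new LinkedHashSet<>(keys);
        Set<String> found = new LinkedHashSet<>();
        for (String fragment : fragments) {
            if (remaining.isEmpty()) {
                break; // every key has been located, no need to read further
            }
            for (Iterator<String> it = remaining.iterator(); it.hasNext();) {
                String key = it.next();
                if (fragment.contains(key)) {
                    found.add(key);
                    it.remove(); // a key is never tested again once it was found
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the extracted document text.
        List<String> fragments = Arrays.asList("Dear {{NAME}},", "Invoice {{INVOICE_NO}}");
        List<String> keys = Arrays.asList("{{NAME}}", "{{DATE}}", "{{INVOICE_NO}}");
        System.out.println(findContainedKeys(keys, fragments));
    }
}
```

With this shape, the expensive extraction step runs once per document instead of once per search term, and the remaining cost is plain substring matching.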
[Bug 64418] Finding text in textfields is very slow
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=64418
j-lawyer.org <je...@gmx.de> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEEDINFO |NEW
--- Comment #2 from j-lawyer.org <je...@gmx.de> ---
Thank you Dominik for the reply.
I just created a fully runnable example:
https://www.j-lawyer.org/temp/DocXShowCase.zip
It is a NetBeans project that includes a runnable test case as well as example
documents. Both docx documents are comparable in complexity; one has no text
fields, the other one has 10 text fields.
When running the code, these are the performance numbers:
without textfields, search: 676
with textfields, search: 15678
So, when text fields are involved, execution time increases by a factor of
about 23.
Let me know if I can provide anything else and I will be on top of it in no
time.
Thanks!
Jens / j-lawyer.org
[Bug 64418] Finding text in textfields is very slow
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=64418
--- Comment #7 from PJ Fanning <fa...@yahoo.com> ---
Instead of `CTP.Factory.parse(embeddedParagraph.xmlText())`, could you try
`CTP.Factory.parse(embeddedParagraph.getDomNode())`?
This might lower the overhead of the parse call.
[Bug 64418] Finding text in textfields is very slow
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=64418
j-lawyer.org <je...@gmx.de> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEEDINFO |NEW
--- Comment #6 from j-lawyer.org <je...@gmx.de> ---
Well, I would love to get rid of the expensive XML handling. However, I do not
see how I could avoid it given POI's API.
Is there an alternative approach for "getting all text content of text fields /
text boxes"?
Even Apache Tika seems to use the exact same approach in their
XWPFWordExtractorDecorator.java:
// Also extract any paragraphs embedded in text boxes
//Note "w:txbxContent//"...must look for all descendant paragraphs
//not just the immediate children of txbxContent -- TIKA-2807
if (config.getIncludeShapeBasedContent()) {
    for (XmlObject embeddedParagraph : paragraph.getCTP().selectPath(
            "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
            + "declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' "
            + ".//*/wps:txbx/w:txbxContent//w:p")) {
        extractParagraph(new XWPFParagraph(CTP.Factory.parse(embeddedParagraph.xmlText()),
                paragraph.getBody()), listManager, xhtml);
    }
}
Am I missing something?
Thanks,
Jens
[Bug 64418] Finding text in textfields is very slow
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=64418
--- Comment #8 from j-lawyer.org <je...@gmx.de> ---
Thanks for the suggestion PJ!
I am not too familiar with the more low level APIs of POI.
In the code I initially posted (findInTextfield method), I am using an XWPFRun,
which cannot be fed with a CTP:
CTR ctr = CTR.Factory.parse(obj.xmlText());
XWPFRun bufferrun = new XWPFRun(ctr, (IRunBody) xwpfParagraph);
String text = bufferrun.getText(0);
if (text != null && text.contains(key)) {
    if (!resultList.contains(key)) {
        resultList.add(key);
        return;
    }
}
When replacing
CTR ctr = CTR.Factory.parse(obj.xmlText());
with
CTR ctr = CTR.Factory.parse(obj.getDomNode());
my code no longer works: the retrieved text no longer contains my search
strings. Using the first line, however (which involves re-parsing XML),
works as expected.
I am having trouble finding proper Javadocs for CTP and CTR; I assume they
represent disjoint sets of XML complex types.
Do you have any hints on why the two variations above behave differently?
Thanks,
Jens
[Bug 64418] Finding text in textfields is very slow
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=64418
j-lawyer.org <je...@gmx.de> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEEDINFO |NEW
[Bug 64418] Finding text in textfields is very slow
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=64418
Dominik Stadler <do...@gmx.at> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |NEEDINFO
--- Comment #1 from Dominik Stadler <do...@gmx.at> ---
Can you provide a sample file which shows the slowdown? It would make it much
easier to try to analyze/reproduce it.
[Bug 64418] Finding text in textfields is very slow
Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=64418
Dominik Stadler <do...@gmx.at> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |NEEDINFO
--- Comment #5 from Dominik Stadler <do...@gmx.at> ---
The following line is taking most of the CPU by far, so you likely need to
rework your code to not have to produce XML and then parse it in again
afterwards.
CTR.Factory.parse(obj.xmlText())
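To illustrate the general idea of reading text from an XML tree that is already in memory, rather than serializing a node and parsing it again, here is a minimal sketch that uses only the JDK's DOM and XPath classes. It is not POI or XmlBeans code: the class TextBoxTextReader and its methods are hypothetical. Whether the DOM view returned by getDomNode() (mentioned elsewhere in this thread) can be passed in as the root, and whether that would actually be faster, are assumptions that have not been verified here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper built on JDK classes only; it is not a POI or XmlBeans API. */
final class TextBoxTextReader {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Binds only the "w" prefix used in the XPath expressions below. */
    private static final NamespaceContext W_CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return W_NS.equals(uri) ? "w" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return W_NS.equals(uri)
                    ? Collections.singletonList("w").iterator()
                    : Collections.<String>emptyIterator();
        }
    };

    /** Returns one string per paragraph found inside any w:txbxContent below root. */
    static List<String> textBoxParagraphTexts(Node root) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(W_CONTEXT);
            NodeList paragraphs = (NodeList) xpath.evaluate(
                    ".//w:txbxContent//w:p", root, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                // Concatenate all w:t nodes so terms split across runs still match.
                NodeList textNodes = (NodeList) xpath.evaluate(
                        ".//w:t", paragraphs.item(i), XPathConstants.NODESET);
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < textNodes.getLength(); j++) {
                    sb.append(textNodes.item(j).getTextContent());
                }
                texts.add(sb.toString());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    /** Parses XML text into a namespace-aware DOM tree (only needed for demo input). */
    static Node parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }
}
```

The strings returned by textBoxParagraphTexts can then be searched with String.contains, so no XWPFRun has to be constructed from re-parsed XML just to read its text.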