Posted to dev@poi.apache.org by bu...@apache.org on 2020/05/10 19:49:10 UTC

[Bug 64418] New: Finding text in textfields is very slow

https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

            Bug ID: 64418
           Summary: Finding text in textfields is very slow
           Product: POI
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XWPF
          Assignee: dev@poi.apache.org
          Reporter: jenskutschke@gmx.de
  Target Milestone: ---

I am scanning docx documents for occurrences of specific words / search terms.

The code I am using is seen below.

The search terms can literally be anywhere: in header, footer, paragraphs,
tables, text fields, ...

Even for a complex document that uses no or very few text fields, parsing
takes a few seconds. As soon as multiple text fields are involved, parsing
takes a considerable amount of time, e.g. 30 seconds or even more than a minute.


Is there anything I am doing wrong in how I use the API, or is there an issue
with XWPF?

Thanks,
Jens



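    import java.util.ArrayList;
    import java.util.List;

    import org.apache.poi.xwpf.usermodel.BodyElementType;
    import org.apache.poi.xwpf.usermodel.IBodyElement;
    import org.apache.poi.xwpf.usermodel.IRunBody;
    import org.apache.poi.xwpf.usermodel.PositionInParagraph;
    import org.apache.poi.xwpf.usermodel.TextSegment;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFRun;
    import org.apache.poi.xwpf.usermodel.XWPFTable;
    import org.apache.xmlbeans.XmlCursor;
    import org.apache.xmlbeans.XmlObject;
    import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;

    private static void findInBodyElements(String key, List<IBodyElement> bodyElements,
            ArrayList<String> resultList) {
        // Nothing to do if the key was already found elsewhere in the document.
        if (resultList.contains(key)) {
            return;
        }

        for (IBodyElement bodyElement : bodyElements) {
            if (bodyElement.getElementType() == BodyElementType.PARAGRAPH) {
                findInParagraph(key, (XWPFParagraph) bodyElement, resultList);
                if (resultList.contains(key)) {
                    return;
                }
                findInTextfield(key, (XWPFParagraph) bodyElement, resultList);
                if (resultList.contains(key)) {
                    return;
                }
            }
            if (bodyElement.getElementType() == BodyElementType.TABLE) {
                // findInTable is not shown in this report.
                findInTable(key, (XWPFTable) bodyElement, resultList);
            }
        }
    }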

    private static void findInParagraph(String key, XWPFParagraph xwpfParagraph,
            ArrayList<String> resultList) {
        if (resultList.contains(key)) {
            return;
        }

        // Search the paragraph's text, starting at the beginning of the paragraph.
        TextSegment found = xwpfParagraph.searchText(key, new PositionInParagraph());
        if (found != null) {
            resultList.add(key);
        }
    }

    private static void findInTextfield(String key, XWPFParagraph xwpfParagraph,
            ArrayList<String> resultList) {
        if (resultList.contains(key)) {
            return;
        }

        // Select every run that sits directly in a paragraph of a text box.
        XmlCursor cursor = xwpfParagraph.getCTP().newCursor();
        cursor.selectPath("declare namespace "
                + "w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
                + ".//*/w:txbxContent/w:p/w:r");

        List<XmlObject> ctrsintxtbx = new ArrayList<XmlObject>();
        while (cursor.hasNextSelection()) {
            cursor.toNextSelection();
            ctrsintxtbx.add(cursor.getObject());
        }

        for (XmlObject obj : ctrsintxtbx) {
            try {
                // Serializes the run to XML text and parses it again for every match.
                CTR ctr = CTR.Factory.parse(obj.xmlText());
                XWPFRun bufferrun = new XWPFRun(ctr, (IRunBody) xwpfParagraph);
                String text = bufferrun.getText(0);
                if (text != null && text.contains(key)) {
                    resultList.add(key);
                    return;
                }
            } catch (Exception ex) {
                log.error("Unable to iterate text fields", ex);
            }
        }
    }

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org

[Bug 64418] Finding text in textfields is very slow

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #3 from Dominik Stadler <do...@gmx.at> ---
Thanks, but unfortunately there is lots of code which is not related to the
problem and thus makes reproducing and analyzing this very hard. The app seems
to not finish for a very long time for me. It also looks a bit like you are
iterating over the contents of the document many times with all the
placeholders and some of the loops in your application.

Can you reduce the code in the sample project as much as possible so that it
still shows the problem, but does not do all the things that are only needed
for your application?


[Bug 64418] Finding text in textfields is very slow

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

--- Comment #4 from j-lawyer.org <je...@gmx.de> ---
Thanks Dominik for looking into this. I have stripped down the test case, the
URL is still the same: https://www.j-lawyer.org/temp/DocXShowCase.zip

- has a list of 50 strings to be searched in documents
- has two documents, both just 1 page - (a) has no textfields and (b) has 10
text fields
- each of the 50 strings is searched for using a loop, so I am iterating over
each document fifty times

Basically I just want to know which of the 50 strings are contained in the
documents.

Thanks,
Jens
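
The goal stated in comment #4 (finding out which of 50 strings occur in a
document) does not in itself require walking the document once per string. A
minimal sketch, assuming the document text has already been collected once into
a single String: the class TermScanner and the method findTerms are
hypothetical names used only for illustration, and the sample text stands in
for whatever a single pass over body, headers, footers and text boxes yields.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: checks many search terms against text that was extracted once. */
class TermScanner {

    /** Returns the terms that occur in the text, in the order they were supplied. */
    static Set<String> findTerms(String text, Collection<String> terms) {
        Set<String> found = new LinkedHashSet<>();
        for (String term : terms) {
            if (text.contains(term)) {
                found.add(term);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumption: this string stands in for text gathered in one pass over the document.
        String documentText = "Dear {{NAME}}, your file number is {{FILE_NO}}.";
        List<String> terms = Arrays.asList("{{NAME}}", "{{FILE_NO}}", "{{DATE}}");
        System.out.println(findTerms(documentText, terms));
    }
}
```

With this shape the expensive extraction happens once per document and the 50
lookups run against an in-memory string. When joining the text of several
paragraphs, a separator such as a newline keeps a term from matching across a
paragraph boundary by accident.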


[Bug 64418] Finding text in textfields is very slow

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

j-lawyer.org <je...@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #2 from j-lawyer.org <je...@gmx.de> ---
Thank you Dominik for the reply. 

I just created a fully runnable example:
https://www.j-lawyer.org/temp/DocXShowCase.zip

It is a NetBeans project that includes a runnable test case as well as example
documents. Both docx documents are comparable in complexity; one has no text
fields, the other one has 10 text fields.

When running the code, these are the performance numbers:

without textfields, search: 676
with textfields, search: 15678

So, when text fields are involved, execution time increases by a factor of
about 23.

Let me know if I can provide anything else and I will be on top of it in no
time.

Thanks!
Jens / j-lawyer.org


[Bug 64418] Finding text in textfields is very slow

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

--- Comment #7 from PJ Fanning <fa...@yahoo.com> ---
Instead of `CTP.Factory.parse(embeddedParagraph.xmlText())`, could you try
`CTP.Factory.parse(embeddedParagraph.getDomNode())`?

This might lower the overhead of the parse call.


[Bug 64418] Finding text in textfields is very slow

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

j-lawyer.org <je...@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #6 from j-lawyer.org <je...@gmx.de> ---
Well, I would love to get rid of the expensive XML handling. However, I do not
see how I could avoid it given POI's API.

Is there an alternative approach for "getting all text content of text fields /
text boxes"?

Even Apache Tika seems to use the exact same approach in their
XWPFWordExtractorDecorator.java:

        // Also extract any paragraphs embedded in text boxes
        // Note "w:txbxContent//"...must look for all descendant paragraphs
        // not just the immediate children of txbxContent -- TIKA-2807
        if (config.getIncludeShapeBasedContent()) {
            for (XmlObject embeddedParagraph : paragraph.getCTP().selectPath(
                    "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
                    + "declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' "
                    + ".//*/wps:txbx/w:txbxContent//w:p")) {
                extractParagraph(new XWPFParagraph(CTP.Factory.parse(embeddedParagraph.xmlText()),
                        paragraph.getBody()), listManager, xhtml);
            }
        }


Am I missing something?

Thanks,
Jens
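
The note in the Tika snippet about "w:txbxContent//" concerns the difference
between the child step "/w:p" (as in the path from the original report) and the
descendant step "//w:p". A small sketch of that difference, using the JDK's own
XPath engine as a stand-in rather than XMLBeans: the class TextBoxPaths, the
method count and the simplified SAMPLE fragment are assumptions made for
illustration and do not reproduce the full markup Word writes.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical demo: child vs. descendant steps below w:txbxContent. */
class TextBoxPaths {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Simplified fragment: one paragraph directly in the text box, one nested in a table.
    static final String SAMPLE =
            "<w:p xmlns:w='" + W_NS + "'><w:r><w:pict><w:txbxContent>"
            + "<w:p><w:r><w:t>direct</w:t></w:r></w:p>"
            + "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>in table</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
            + "</w:txbxContent></w:pict></w:r></w:p>";

    /** Counts the nodes that the XPath expression selects in the given XML. */
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the prefix "w" so the expressions can use it.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return W_NS.equals(uri) ? "w" : null;
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return W_NS.equals(uri)
                            ? Collections.singletonList("w").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(SAMPLE, "//w:txbxContent/w:p"));   // direct children only
        System.out.println(count(SAMPLE, "//w:txbxContent//w:p"));  // all descendants
    }
}
```

In this fragment the child step selects only the paragraph that sits directly
in the text box, while the descendant step also selects the one inside the
table cell.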


[Bug 64418] Finding text in textfields is very slow

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

--- Comment #8 from j-lawyer.org <je...@gmx.de> ---
Thanks for the suggestion PJ!

I am not too familiar with the more low level APIs of POI. 

In the code I initially posted (findInTextfield method), I am using an XWPFRun,
which cannot be fed with a CTP:


                CTR ctr = CTR.Factory.parse(obj.xmlText());
                XWPFRun bufferrun = new XWPFRun(ctr, (IRunBody) xwpfParagraph);
                String text = bufferrun.getText(0);
                if (text != null && text.contains(key)) {
                    if (!resultList.contains(key)) {
                        resultList.add(key);
                        return;
                    }
                }


When replacing 

CTR ctr = CTR.Factory.parse(obj.xmlText());

with

CTR ctr = CTR.Factory.parse(obj.getDomNode());

my code no longer works: the retrieved text no longer contains my search
strings. Using the first line, however (which involves re-parsing XML), works
as expected.
I have trouble finding proper Javadocs for CTP and CTR; I assume they
represent distinct XML complex types.

Do you have any hints on why the two variations above have different behaviour?

Thanks,
Jens


[Bug 64418] Finding text in textfields is very slow

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

j-lawyer.org <je...@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW


[Bug 64418] Finding text in textfields is very slow

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Dominik Stadler <do...@gmx.at> ---
Can you provide a sample file which shows the slowdown? That would make it much
easier to try to analyze/reproduce it.


[Bug 64418] Finding text in textfields is very slow

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #5 from Dominik Stadler <do...@gmx.at> ---
The following line is taking most of the CPU by far, so you likely need to
rework your code to not have to produce XML and then parse it in again
afterwards. 

CTR.Factory.parse(obj.xmlText())
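
One way to read text-box content without producing XML text and parsing it
again is to walk a DOM view of the already-parsed tree. The sketch below uses
the JDK's DOM API on a simplified fragment as a stand-in; the class
TextBoxText, the method textBoxTexts and the SAMPLE markup are assumptions made
for illustration. XMLBeans objects expose a DOM view through getDomNode(), so a
similar walk may be possible on the node of a paragraph's CTP, but that is
untested here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical demo: reading text-box text straight from a DOM tree. */
class TextBoxText {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Simplified fragment: one run in the body and one run inside a text box.
    static final String SAMPLE =
            "<w:p xmlns:w='" + W_NS + "'><w:r><w:t>body</w:t></w:r>"
            + "<w:r><w:pict><w:txbxContent><w:p><w:r><w:t>inside box</w:t></w:r></w:p>"
            + "</w:txbxContent></w:pict></w:r></w:p>";

    /** Parses XML into a namespace-aware DOM tree and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    /**
     * Collects the text of every w:t below any w:txbxContent under root.
     * No serialization or second parse is involved; a text box nested in
     * another text box would be visited twice in this simple version.
     */
    static List<String> textBoxTexts(Element root) {
        List<String> texts = new ArrayList<>();
        NodeList boxes = root.getElementsByTagNameNS(W_NS, "txbxContent");
        for (int i = 0; i < boxes.getLength(); i++) {
            NodeList textNodes = ((Element) boxes.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < textNodes.getLength(); j++) {
                texts.add(textNodes.item(j).getTextContent());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(textBoxTexts(parseRoot(SAMPLE)));
    }
}
```

The design point is that each text node is read from the tree that is already
in memory, so the cost per run is a lookup rather than a serialize and parse
cycle.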
