Posted to dev@pdfbox.apache.org by "Panayiotis Vlissidis (JIRA)" <ji...@apache.org> on 2010/11/19 13:23:14 UTC

[jira] Created: (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs causes

Infinite recursion when trying to extract text from specific types of PDFs causes
---------------------------------------------------------------------------------

                 Key: PDFBOX-895
                 URL: https://issues.apache.org/jira/browse/PDFBOX-895
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.3.1
            Reporter: Panayiotis Vlissidis
            Priority: Critical
         Attachments: test.pdf

Hello and thanks for PDFBox.

We just started using PDFBox for text extraction (through Tika),
and it fails to finish: it appears to fall into an infinite loop
and never returns the text.

Please note that this happens only for a specific type of PDF
document (used for handwriting recognition), such as the one attached.
I am not sure whether this is a bug in PDFBox or due to the nature of the PDFs,
but I think that PDFBox should at least break out if extraction is not possible.

I wish I could give you more information, but I know nothing about the PDF format, parsing, etc.
Please let me know if you need any information or my help in any way.

Thanks a lot for your time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Reopened: (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler reopened PDFBOX-895:
---------------------------------------


Reopened because PDFBOX-956 was reopened.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs

Posted by "Martijn Brinkers (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966086#action_12966086 ] 

Martijn Brinkers commented on PDFBOX-895:
-----------------------------------------

If you disable SuppressDuplicateOverlappingText (i.e., set it to false), text extraction takes only a few seconds. I guess that removing duplicate text takes so long because the background is built from a very small set of characters (d, r, l, u), so the overlap detection has a very large number of candidates to check for each character. The PDF format is not designed for text extraction, so detecting whether a character overlaps another can be time-consuming in cases like this. In this particular situation I think it is better to disable overlap detection.
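
The guess above can be illustrated with a simplified model. This is only a sketch and not PDFBox's actual implementation: the class `OverlapScanSketch`, the `Glyph` type, and the strategy of scanning every earlier glyph of the same character are assumptions made for illustration. It counts how many position comparisons such a scan needs for the same number of glyphs drawn from a 4-letter and from a 40-letter alphabet.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OverlapScanSketch {
    /** Hypothetical stand-in for a positioned character on a page. */
    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Number of position comparisons made by the last call to dedupe(). */
    static long comparisons;

    /** Drops a glyph if an earlier glyph with the same character lies within the tolerance. */
    static List<Glyph> dedupe(List<Glyph> glyphs, float tolerance) {
        comparisons = 0;
        Map<Character, List<Glyph>> seenByChar = new HashMap<>();
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            List<Glyph> same = seenByChar.computeIfAbsent(g.ch, k -> new ArrayList<>());
            boolean duplicate = false;
            // Linear scan: the fewer distinct characters, the longer each list.
            for (Glyph s : same) {
                comparisons++;
                if (Math.abs(s.x - g.x) <= tolerance && Math.abs(s.y - g.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                same.add(g);
                kept.add(g);
            }
        }
        return kept;
    }

    /** Builds a row of non-overlapping glyphs that cycles through the given alphabet. */
    static List<Glyph> page(String alphabet, int count) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            glyphs.add(new Glyph(alphabet.charAt(i % alphabet.length()), i * 10f, 0f));
        }
        return glyphs;
    }

    public static void main(String[] args) {
        dedupe(page("drlu", 4000), 1f);
        long small = comparisons;
        dedupe(page("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN", 4000), 1f);
        long large = comparisons;
        System.out.println("4-letter alphabet:  " + small + " comparisons");
        System.out.println("40-letter alphabet: " + large + " comparisons");
    }
}
```

In this model the 4-letter page needs roughly ten times as many comparisons as the 40-letter page for the same number of glyphs, and the cost grows quadratically with the glyph count. In PDFBox itself, the setting referred to above is exposed as `PDFTextStripper.setSuppressDuplicateOverlappingText(boolean)`.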

[jira] Updated: (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs causes

Posted by "Panayiotis Vlissidis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Panayiotis Vlissidis updated PDFBOX-895:
----------------------------------------

    Attachment: test.pdf

[jira] Commented: (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs

Posted by "Panayiotis Vlissidis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966081#action_12966081 ] 

Panayiotis Vlissidis commented on PDFBOX-895:
---------------------------------------------

Hello Martijn,

First of all, I would like to thank you for looking into this issue.

Yes, you are right: the text is extracted with no exceptions.

However, I did not say that it throws an exception, but rather that it seems
to loop infinitely. I was wrong about that, since the text does actually
get extracted, although it takes a very long time to finish.

I left it running today and it took about 79 minutes to finish!
This is unacceptable for us and I hope that you agree.
My current workaround is to run the extraction in a separate thread
and interrupt it if it has not finished within a fixed amount of time.
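
A workaround of this kind could look roughly like the following. It is a minimal sketch using only `java.util.concurrent`; the class `TimedExtraction`, the method `runWithTimeout`, and the `Callable<String>` that stands in for the actual extraction call are hypothetical names, not part of PDFBox or Tika.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedExtraction {
    /**
     * Runs the task on a worker thread and returns its result,
     * or null if it does not finish within the time limit.
     */
    static String runWithTimeout(Callable<String> task, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Requests interruption; the task stops only if it reacts to it.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Callable<String> fast = () -> "extracted text";
        Callable<String> slow = () -> {
            Thread.sleep(60_000); // stands in for a very slow extraction
            return "never returned";
        };
        System.out.println(runWithTimeout(fast, 1, TimeUnit.SECONDS));
        System.out.println(runWithTimeout(slow, 200, TimeUnit.MILLISECONDS));
    }
}
```

Note that `cancel(true)` only sets the worker's interrupt flag. A CPU-bound extraction that never checks that flag would keep running in the background even though the caller has stopped waiting, so this limits the waiting time rather than the work done.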

You are also right about the background characters. That is because,
as I already mentioned, it is a PDF specially constructed for handwriting
recognition. As such, the background text information is really of no use to us.

To sum up, the problem seems to be different from what I initially thought.
I guess that a better alternative to the current workaround would be
the ability to disable extraction of the background text through some kind of
property of the PDFTextStripper class.

Does anyone know if this is feasible and if so how difficult would it be 
to implement such a feature (if not already implemented)?

Any help or ideas about this issue would be greatly appreciated.

Thanks once more for your time.

[jira] [Resolved] (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs

Posted by "Andreas Lehmkühler (Resolved JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-895.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0

I found a suitable solution for PDFBOX-956 and now the performance is back to normal.
Setting to resolved.
                
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Resolved: (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-895.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.5.0
         Assignee: Andreas Lehmkühler

This works fine after resolving PDFBOX-956, even if the suppress duplicates algorithm is enabled.

[jira] Updated: (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-895:
--------------------------------------

    Fix Version/s:     (was: 1.5.0)

[jira] Updated: (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs

Posted by "Panayiotis Vlissidis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Panayiotis Vlissidis updated PDFBOX-895:
----------------------------------------

    Summary: Infinite recursion when trying to extract text from specific types of PDFs  (was: Infinite recursion when trying to extract text from specific types of PDFs causes)

[jira] Commented: (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs

Posted by "Martijn Brinkers (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964646#action_12964646 ] 

Martijn Brinkers commented on PDFBOX-895:
-----------------------------------------

I'm able to extract the text without getting any exception. The background of the PDF, however, seems to be built from a huge number of characters like "rdrruludrluulluduudd". The extracted text therefore contains a large number of "random looking" characters.



[jira] Commented: (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs

Posted by "Panayiotis Vlissidis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994218#comment-12994218 ] 

Panayiotis Vlissidis commented on PDFBOX-895:
---------------------------------------------

Excellent!!!

Thanks again for all your hard work and the time invested in PDFBox.
