Posted to dev@pdfbox.apache.org by "Panayiotis Vlissidis (JIRA)" <ji...@apache.org> on 2010/11/19 13:23:14 UTC
[jira] Created: (PDFBOX-895) Infinite recursion when trying to
extract text from specific types of PDFs causes
Infinite recursion when trying to extract text from specific types of PDFs causes
---------------------------------------------------------------------------------
Key: PDFBOX-895
URL: https://issues.apache.org/jira/browse/PDFBOX-895
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.3.1
Reporter: Panayiotis Vlissidis
Priority: Critical
Attachments: test.pdf
Hello and thanks for PDFBox.
We just started using PDFBox for text extraction (through Tika),
and it fails to finish: it falls into an infinite loop
and never returns the text.
Please note that this happens only for a specific type of PDF
document (used for handwriting recognition), such as the one attached.
I am not sure whether this is a bug in PDFBox or due to the nature of the PDFs,
but I think that PDFBox should at least break out if extraction is not possible.
I wish I could give you more information, but I know nothing about the PDF format, parsing, etc.
Please let me know if you need any information or my help in any way.
Thanks a lot for your time.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Reopened: (PDFBOX-895) Infinite recursion when trying to
extract text from specific types of PDFs
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler reopened PDFBOX-895:
---------------------------------------
Reopened because PDFBOX-956 was reopened.
[jira] Commented: (PDFBOX-895) Infinite recursion when trying to
extract text from specific types of PDFs
Posted by "Martijn Brinkers (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966086#action_12966086 ]
Martijn Brinkers commented on PDFBOX-895:
-----------------------------------------
If you disable SuppressDuplicateOverlappingText (i.e., set it to false), text extraction takes only a few seconds. I guess removing duplicate text takes so long because the background is built from a small set of characters (d, r, l, u), so the algorithm that detects overlap has a very large number of same-character candidates to compare. The PDF format is not well suited to text extraction, so detecting whether a character overlaps another can be time-consuming in cases like this. In this particular situation I think it's better to disable overlap detection.
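To illustrate why a small character set could make duplicate detection so slow, here is a minimal Java sketch. It is not PDFBox's actual algorithm: the class OverlapCostSketch, its Glyph type, and the tolerance value are hypothetical. It assumes each new glyph is compared against every previously kept glyph with the same text, so the work grows roughly quadratically per distinct character.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of overlap detection; NOT the PDFBox implementation. */
final class OverlapCostSketch {

    /** A glyph and its position on the page (illustrative type). */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Glyphs kept so far, grouped by their text (assumed bookkeeping).
    private final Map<String, List<Glyph>> seenByText = new HashMap<>();
    private long comparisons = 0;

    /** Returns true if the glyph is kept, false if it overlaps an earlier one. */
    boolean accept(Glyph g, float tolerance) {
        List<Glyph> sameText = seenByText.computeIfAbsent(g.text, k -> new ArrayList<>());
        for (Glyph other : sameText) {
            comparisons++;
            if (Math.abs(other.x - g.x) < tolerance && Math.abs(other.y - g.y) < tolerance) {
                return false; // same text at (nearly) the same place: duplicate
            }
        }
        sameText.add(g);
        return true;
    }

    long comparisons() {
        return comparisons;
    }

    /** Feeds n non-overlapping glyphs, cycling through the alphabet, and counts comparisons. */
    static long cost(int n, String[] alphabet) {
        OverlapCostSketch sketch = new OverlapCostSketch();
        for (int i = 0; i < n; i++) {
            sketch.accept(new Glyph(alphabet[i % alphabet.length], i * 10f, 0f), 1f);
        }
        return sketch.comparisons();
    }

    public static void main(String[] args) {
        String[] small = {"d", "r", "l", "u"};
        String[] large = new String[64];
        for (int i = 0; i < large.length; i++) {
            large[i] = "c" + i;
        }
        System.out.println("4 distinct characters:  " + cost(4096, small) + " comparisons");
        System.out.println("64 distinct characters: " + cost(4096, large) + " comparisons");
    }
}
```

In this model, 4096 glyphs drawn from 4 characters need roughly 16 times as many comparisons as the same number drawn from 64 characters. In PDFBox itself the check is switched off by calling setSuppressDuplicateOverlappingText(false) on the PDFTextStripper.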
[jira] Updated: (PDFBOX-895) Infinite recursion when trying to
extract text from specific types of PDFs causes
Posted by "Panayiotis Vlissidis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Panayiotis Vlissidis updated PDFBOX-895:
----------------------------------------
Attachment: test.pdf
[jira] Commented: (PDFBOX-895) Infinite recursion when trying to
extract text from specific types of PDFs
Posted by "Panayiotis Vlissidis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966081#action_12966081 ]
Panayiotis Vlissidis commented on PDFBOX-895:
---------------------------------------------
Hello Martijn,
First of all, I would like to thank you for looking into this issue.
Yes, you are right: the text is extracted with no exceptions.
But I did not say that it throws an exception, rather that it seems
to loop infinitely. I was wrong about that, since the text does actually
get extracted, although it takes a very long time to finish.
I left it running today and it took about 79 minutes to finish!
This is unacceptable for us and I hope that you agree.
My current workaround is to run the extraction in a separate thread
and let it run for a fixed amount of time before I interrupt it.
You are also right about the background characters. That is because,
as I already mentioned, it is a PDF specially constructed for handwriting
recognition. As such, the background text information is of no use to us.
To sum up, the problem seems to be different from the one I initially thought,
and I guess that a better alternative to the current workaround would be
to disable extraction of the background text through some kind of
property of the PDFTextStripper class.
Does anyone know if this is feasible and, if so, how difficult it would be
to implement such a feature (if it is not already implemented)?
Any help or ideas about this issue would be greatly appreciated.
Thanks once more for your time.
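The thread-with-timeout workaround described above could look roughly like the following Java sketch. It uses only java.util.concurrent; the class TimedExtraction, the method runWithTimeout, and the fallback parameter are hypothetical names, and the Callable stands in for whatever call actually performs the extraction.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs an extraction task and gives up after a time limit. */
final class TimedExtraction {

    static String runWithTimeout(Callable<String> extraction, long timeout, TimeUnit unit,
                                 String fallback) {
        // Daemon thread, so a task that ignores the interrupt cannot block JVM exit.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true);
            return t;
        });
        Future<String> future = executor.submit(extraction);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // request interruption of the worker thread
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for a real extraction call: one fast task, one that takes too long.
        String fast = runWithTimeout(() -> "extracted text", 2, TimeUnit.SECONDS, "");
        String slow = runWithTimeout(() -> {
            Thread.sleep(10_000);
            return "never returned";
        }, 100, TimeUnit.MILLISECONDS, "");
        System.out.println("fast: '" + fast + "', slow: '" + slow + "'");
    }
}
```

One caveat with this design: cancel(true) only interrupts the worker. A CPU-bound task that never checks its interrupt status keeps running in the background until it finishes, so the timeout bounds the caller's wait, not the work itself.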
[jira] [Resolved] (PDFBOX-895) Infinite recursion when trying to
extract text from specific types of PDFs
Posted by "Andreas Lehmkühler (Resolved JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-895.
---------------------------------------
Resolution: Fixed
Fix Version/s: 1.7.0
I found a suitable solution for PDFBOX-956 and now the performance is back.
Set to resolved.
[jira] Resolved: (PDFBOX-895) Infinite recursion when trying to
extract text from specific types of PDFs
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-895.
---------------------------------------
Resolution: Fixed
Fix Version/s: 1.5.0
Assignee: Andreas Lehmkühler
This works fine after resolving PDFBOX-956, even if the suppress-duplicates algorithm is enabled.
[jira] Updated: (PDFBOX-895) Infinite recursion when trying to
extract text from specific types of PDFs
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-895:
--------------------------------------
Fix Version/s: (was: 1.5.0)
[jira] Updated: (PDFBOX-895) Infinite recursion when trying to
extract text from specific types of PDFs
Posted by "Panayiotis Vlissidis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Panayiotis Vlissidis updated PDFBOX-895:
----------------------------------------
Summary: Infinite recursion when trying to extract text from specific types of PDFs (was: Infinite recursion when trying to extract text from specific types of PDFs causes)
[jira] Commented: (PDFBOX-895) Infinite recursion when trying to
extract text from specific types of PDFs
Posted by "Martijn Brinkers (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964646#action_12964646 ]
Martijn Brinkers commented on PDFBOX-895:
-----------------------------------------
I'm able to extract text without getting any exception. The background of the PDF, however, seems to be created from a huge number of characters like "rdrruludrluulluduudd". The extracted text therefore contains a large number of "random looking" characters.
[jira] Commented: (PDFBOX-895) Infinite recursion when trying to
extract text from specific types of PDFs
Posted by "Panayiotis Vlissidis (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994218#comment-12994218 ]
Panayiotis Vlissidis commented on PDFBOX-895:
---------------------------------------------
Excellent!!!
Thanks again for all the hard work and time you have invested in PDFBox.