Posted to dev@pdfbox.apache.org by "Panayiotis Vlissidis (JIRA)" <ji...@apache.org> on 2010/12/02 13:04:11 UTC
[jira] Commented: (PDFBOX-895) Infinite recursion when trying to extract text from specific types of PDFs

    [ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966081#action_12966081 ] 

Panayiotis Vlissidis commented on PDFBOX-895:
---------------------------------------------

Hello Martijn,

First of all, I would like to thank you for looking into this issue.

Yes, you are right: the text is extracted with no exceptions.

But I did not say that it throws an exception, rather that it seems
to loop infinitely. I was wrong about that, since the text does actually
get extracted, although it takes a very long time to finish.

I left it running today and it took about 79 minutes to finish!!!!
This is unacceptable for us and I hope that you agree too.
My current workaround is to run the extraction in a different thread
and allow it to run for a specific amount of time before I interrupt it.
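
For reference, here is a minimal sketch of that kind of time-limited
workaround. It is not our actual code: the class name TimedExtraction and
the helper runWithTimeout are made up for illustration, and the task passed
in would be whatever performs the extraction. It assumes Java 8 or later.
Note that cancelling only sets the interrupt flag on the worker thread; if
the extraction code never checks that flag, the thread may keep running in
the background, which is why the sketch uses a daemon thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedExtraction {

    /**
     * Runs the task on a separate worker thread and waits at most the
     * given time for it. Returns the task's result, or null if the time
     * limit was exceeded (so a task that legitimately returns null cannot
     * be told apart from a timeout in this simple sketch).
     */
    public static <T> T runWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        // Daemon thread: a worker that ignores interruption will not
        // keep the JVM alive after the caller has given up on it.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-extraction");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Requests interruption of the worker; this only stops the
            // task if the task itself responds to interrupts.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The caller would wrap the extraction call in the Callable and treat a null
result as "gave up after the time limit".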

You are also right about the background characters. That is because,
as I already mentioned, it is a PDF specially constructed for handwriting
recognition. As such, the background text information is really of no use to us.

To sum up, the problem seems to be different from the one I initially thought,
and I guess that a better alternative to the current workaround would be
to be able to disable extraction of the background text through some kind of
property of the PDFTextStripper class.

Does anyone know if this is feasible and, if so, how difficult it would be
to implement such a feature (if not already implemented)?

Any help or ideas about this issue would be greatly appreciated.

Thanks once more for your time.

> Infinite recursion when trying to extract text from specific types of PDFs
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-895
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-895
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.1
>            Reporter: Panayiotis Vlissidis
>            Priority: Critical
>         Attachments: test.pdf
>
>
> Hello and thanks for PDFBox.
> We just started using PDFBox for text extraction (through Tika),
> and it fails to finish text extraction, falling into an infinite loop
> and never returning the text.
> Please note that this happens only for a specific type of PDF
> document (used for handwriting recognition) such as the one attached.
> Not sure if this is a bug of PDFBox or due to the nature of the PDFs,
> but I think that PDFBox should at least break out if extraction is not possible.
> I wish I could give you more information but I know nothing about PDF format, parsing, etc. 
> Please let me know if you need any information or my help in any way.
> Thanks a lot for your time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.