Posted to dev@pdfbox.apache.org by "Niraj Bhawnani (JIRA)" <ji...@apache.org> on 2010/07/14 03:27:51 UTC

[jira] Created: (PDFBOX-774) convertToImage causes JVM crash on certain PDFs

convertToImage causes JVM crash on certain PDFs
-----------------------------------------------

                 Key: PDFBOX-774
                 URL: https://issues.apache.org/jira/browse/PDFBOX-774
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 1.2.1, 1.2.0
            Reporter: Niraj Bhawnani


I'm evaluating PDFBox and, as part of the process, I tried it on several PDFs. One issue I found is that converting certain PDFs to images crashes the JVM with this message (Ubuntu Lucid Lynx 64-bit):

{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe5b6be1a37, pid=2133, tid=140628023412496
#
# JRE version: 6.0_20-b02
# Java VM: Java HotSpot(TM) 64-Bit Server VM (16.3-b01 mixed mode linux-amd64 )
# Problematic frame:
# C  [libfontmanager.so+0x27a37]
#
# An error report file with more information is saved as:
# /home/xxxxxx/hs_err_pid2133.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
{noformat}

Of course, this looks like a bug in Java itself (the crash is in native code, in libfontmanager.so), but it would be nice if PDFBox could work around it. I tested this on two separate 64-bit Linux boxes as well as a 32-bit Windows box, and got much the same error on both platforms.
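
Because the crash happens in native code, a try/catch in the calling Java code cannot intercept it. One possible way to protect a long-running application, sketched below, is to do the rendering in a child JVM and look at its exit code. This is only an illustration, not something PDFBox provides: the class name IsolatedRun, the method run and the timeout value are all made up for the example, and the demo merely starts "java -version" instead of a real rendering class.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a task in a child JVM so a native crash cannot kill the caller. */
final class IsolatedRun {

    /** Returns the child's exit code, or -1 if it was killed after the timeout. */
    static int run(List<String> jvmArgs, long timeoutSeconds)
            throws IOException, InterruptedException {
        List<String> command = new ArrayList<>();
        // Reuse the java launcher of the JVM we are currently running in.
        command.add(System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java");
        command.addAll(jvmArgs);

        Process child = new ProcessBuilder(command)
                .redirectErrorStream(true)                        // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // requires Java 9 or later
                .start();

        if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            child.destroyForcibly();
            return -1;
        }
        // A child JVM that died from a fatal native error exits with a non-zero code.
        return child.exitValue();
    }

    public static void main(String[] args) throws Exception {
        // Demo only: in real use the arguments would name a (hypothetical) class
        // that loads the PDF and renders one page.
        int exit = run(List.of("-version"), 30);
        System.out.println(exit == 0
                ? "child JVM finished normally"
                : "child JVM failed with exit code " + exit);
    }
}
```

The cost of this design is one JVM start-up per task, so it is mainly useful when untrusted or unknown PDFs are processed in a server that must stay up.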

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-774) convertToImage causes JVM crash on certain PDFs

Posted by "Niraj Bhawnani (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niraj Bhawnani updated PDFBOX-774:
----------------------------------

    Attachment: IC_bp_strategy_presentation_march_2010_slides.pdf

Attached an example PDF that triggers the crash. I found it through a Google search for "pdf presentation slides".



Re: Text extraction : do we need those files ?

Posted by Bernard Segonnes <bs...@free.fr>.
Thanks for the answer.

PDFBOX-586 was filed by me :-)

So, as I expect to have customers in Asian and right-to-left-language
countries, I will keep those files :-(

(I sometimes get an OutOfMemoryError, which I should catch since my app runs
on mobile devices/phones.) I will optimize elsewhere.

Quoting Jukka Zitting <ju...@gmail.com>:

> Hi,
>
> On Mon, Aug 9, 2010 at 3:41 PM, Bernard Segonnes <bs...@free.fr> wrote:
> > I have ported PDFBox 1.1.0 to Android (text extraction only). The binary is
> > too big and too slow (probably due to memory constraints): around 5 MB (9 MB
> > once installed on a mobile device, which is too much).
>
> See PDFBOX-586 [1] for some related progress.
>
> > Are the files in:
> > 1) cmap required? (78-EUC_H, Adobe-CNS-5, GBK-EUC-V, UniKS-UTF8-H, ...)
> > I would be pleased to remove all those files :-)
>
> These are only needed for processing PDF documents that use CJK
> (Chinese, Japanese, Korean) fonts. These CMaps are needed to translate
> from the internal font-specific character identification codes to
> Unicode.
>
> > 2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml,
> > pdf_zh_Hant.xml, ...)
>
> These are part of the ICU4J library. You only need ICU4J for handling
> Arabic and other right-to-left languages.
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-586
>
> BR,
>
> Jukka Zitting
>


Bernard SEGONNES
-------------------------------------
bsegonnes@free.fr
http://bsegonnes.free.fr
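
On the out-of-memory concern mentioned in the message above: in Java this surfaces as java.lang.OutOfMemoryError, which is an Error rather than an Exception, so a catch (Exception e) block will not see it. A minimal sketch of a guard follows. The class GuardedExtraction and the method extractOrFallback are hypothetical names, and the Supplier stands in for whatever extraction call the application makes. Whether the process is still healthy after such an error depends on the situation, so this is a last resort rather than a fix.

```java
import java.util.function.Supplier;

/** Hypothetical guard around a memory-hungry extraction step. */
final class GuardedExtraction {

    /** Runs the extraction and returns the fallback if the heap is exhausted. */
    static String extractOrFallback(Supplier<String> extraction, String fallback) {
        try {
            return extraction.get();
        } catch (OutOfMemoryError e) {
            // Objects allocated by the failed attempt become unreachable once
            // the call has unwound, so the garbage collector can reclaim them.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real extraction call.
        String text = extractOrFallback(() -> "extracted text", "");
        System.out.println(text);
    }
}
```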

Re: Text extraction : do we need those files ?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Aug 9, 2010 at 3:41 PM, Bernard Segonnes <bs...@free.fr> wrote:
> I have ported PDFBox 1.1.0 to Android (text extraction only). The binary is
> too big and too slow (probably due to memory constraints): around 5 MB (9 MB
> once installed on a mobile device, which is too much).

See PDFBOX-586 [1] for some related progress.

> Are the files in:
> 1) cmap required? (78-EUC_H, Adobe-CNS-5, GBK-EUC-V, UniKS-UTF8-H, ...)
> I would be pleased to remove all those files :-)

These are only needed for processing PDF documents that use CJK
(Chinese, Japanese, Korean) fonts. These CMaps are needed to translate
from the internal font-specific character identification codes to
Unicode.

> 2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml,
> pdf_zh_Hant.xml, ...)

These are part of the ICU4J library. You only need ICU4J for handling
Arabic and other right-to-left languages.

[1] https://issues.apache.org/jira/browse/PDFBOX-586

BR,

Jukka Zitting
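
To make the role of the CMaps described in the reply above more concrete: conceptually, a CMap is a lookup table from font-specific character codes to Unicode. The sketch below is a toy model only. ToyCMap is a made-up class, the two-byte codes are invented, and none of it reflects PDFBox's actual CMap parser or API.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a CMap: maps two-byte font-specific codes to Unicode text. */
final class ToyCMap {
    private final Map<Integer, String> codeToUnicode = new HashMap<>();

    void addMapping(int code, String unicode) {
        codeToUnicode.put(code, unicode);
    }

    /** Decodes a sequence of two-byte codes; unmapped codes become U+FFFD. */
    String decode(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 1 < data.length; i += 2) {
            // Combine two bytes into one big-endian character code.
            int code = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            out.append(codeToUnicode.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ToyCMap cmap = new ToyCMap();
        cmap.addMapping(0x0001, "\u65E5"); // invented code mapped to U+65E5
        cmap.addMapping(0x0002, "\u672C"); // invented code mapped to U+672C
        byte[] stringFromPage = {0x00, 0x01, 0x00, 0x02};
        System.out.println(cmap.decode(stringFromPage));
    }
}
```

Without the table, the bytes in the page content are just opaque codes, which is why removing the CMap files breaks text extraction for documents that use CJK fonts while leaving other documents unaffected.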

Text extraction : do we need those files ?

Posted by Bernard Segonnes <bs...@free.fr>.
Hi,

I have ported PDFBox 1.1.0 to Android (text extraction only). The binary is
too big and too slow (probably due to memory constraints): around 5 MB (9 MB
once installed on a mobile device, which is too much).

So I'm looking for files I can delete. I only need to extract text.

Are the files in:
1) cmap required? (78-EUC_H, Adobe-CNS-5, GBK-EUC-V, UniKS-UTF8-H, ...)
I would be pleased to remove all those files :-)


2) pdf_*.xml: are they required for text extraction? (pdf_he_IL.xml,
pdf_zh_Hant.xml, ...)


3) Are there other resource files I can remove?

Thanks for the help.

[jira] Resolved: (PDFBOX-774) convertToImage causes JVM crash on certain PDFs

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved PDFBOX-774.
----------------------------------

      Assignee: Jukka Zitting
    Resolution: Duplicate

The fix to PDFBOX-780 works around this issue.
