Posted to dev@pdfbox.apache.org by "Jonathan (JIRA)" <ji...@apache.org> on 2019/05/09 11:57:00 UTC

[jira] [Updated] (PDFBOX-4539) Cache CharsetDecoder

     [ https://issues.apache.org/jira/browse/PDFBOX-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan updated PDFBOX-4539:
-----------------------------
    Description: 
We were using PDFBox to parse and process a large number of PDFs, which could contain thousands of pages in total, so performance mattered to us.

We therefore suggest caching the CharsetDecoder, which is currently instantiated on every call to `isValidUTF8(byte[])`.

Our suggestion for BaseParser.java:
{code:java}
private static final CharsetDecoder csUTF_8 = Charsets.UTF_8.newDecoder();

/**
 * Returns true if a byte sequence is valid UTF-8.
 */
private boolean isValidUTF8(byte[] input)
{
    try
    {
        csUTF_8.decode(ByteBuffer.wrap(input));
        return true;
    }
    catch (CharacterCodingException e)
    {
        return false;
    }
}
{code}
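
One caveat with a shared static decoder: the JDK documentation states that CharsetDecoder instances are not safe for use by multiple concurrent threads. If parsers can run in parallel, a per-thread cache may be a safer variant of the same idea. The following is only a sketch under assumptions that are not part of the suggestion above: the class name `Utf8Validator` and the field name `UTF8_DECODER` are hypothetical, the method is static and standalone rather than part of BaseParser, and it uses Java 8 APIs (`ThreadLocal.withInitial`, `StandardCharsets`) instead of the PDFBox `Charsets` helper.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone helper; not part of PDFBox.
final class Utf8Validator
{
    // One cached decoder per thread, because CharsetDecoder is stateful
    // and must not be shared between concurrently running threads.
    private static final ThreadLocal<CharsetDecoder> UTF8_DECODER =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newDecoder());

    private Utf8Validator()
    {
    }

    /**
     * Returns true if a byte sequence is valid UTF-8.
     */
    static boolean isValidUtf8(byte[] input)
    {
        try
        {
            // decode(ByteBuffer) performs a complete decoding operation and
            // resets the decoder first, so the cached instance can be reused.
            UTF8_DECODER.get().decode(ByteBuffer.wrap(input));
            return true;
        }
        catch (CharacterCodingException e)
        {
            // Malformed or unmappable input is reported as an exception
            // because a new decoder defaults to CodingErrorAction.REPORT.
            return false;
        }
    }
}
```

A call such as `Utf8Validator.isValidUtf8(bytes)` then reuses the decoder that belongs to the calling thread. Whether the extra ThreadLocal lookup is worthwhile compared with a single static decoder depends on whether parsing is actually done from several threads.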
 

  was:
We were using PDFBox to parse and process a large number of PDFs, which could contain thousands of pages in total, so performance mattered to us.

We therefore suggest caching the CharsetDecoder, which is currently instantiated on every call to `isValidUTF8(byte[])`.

Our suggestion for BaseParser.java:
{code:java}
private static final CharsetDecoder csUTF_8 = Charsets.UTF_8.newDecoder();

/**
 * Returns true if a byte sequence is valid UTF-8.
 */
private boolean isValidUTF8(byte[] input)
{
    CharsetDecoder cs = Charsets.UTF_8.newDecoder();
    try
    {
        cs.decode(ByteBuffer.wrap(input));
        csUTF_8.decode(ByteBuffer.wrap(input));
        return true;
    }
    catch (CharacterCodingException e)
    {
        return false;
    }
}
{code}
 


> Cache CharsetDecoder
> --------------------
>
>                 Key: PDFBOX-4539
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4539
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 2.0.14
>            Reporter: Jonathan
>            Priority: Major
>              Labels: performance
>             Fix For: 2.0.16
>
>
> We were using PDFBox to parse and process a large number of PDFs, which could contain thousands of pages in total, so performance mattered to us.
> We therefore suggest caching the CharsetDecoder, which is currently instantiated on every call to `isValidUTF8(byte[])`.
> Our suggestion for BaseParser.java:
> {code:java}
> private static final CharsetDecoder csUTF_8 = Charsets.UTF_8.newDecoder();
> /**
>  * Returns true if a byte sequence is valid UTF-8.
>  */
> private boolean isValidUTF8(byte[] input)
> {
>     try
>     {
>         csUTF_8.decode(ByteBuffer.wrap(input));
>         return true;
>     }
>     catch (CharacterCodingException e)
>     {
>         return false;
>     }
> }
> {code}
>  


