
Posted to dev@pdfbox.apache.org by "Takashi Komatsubara (JIRA)" <ji...@apache.org> on 2009/08/31 12:08:32 UTC

[jira] Created: (PDFBOX-509) Japanese Characters are garbled.

Japanese Characters are garbled.
--------------------------------

                 Key: PDFBOX-509
                 URL: https://issues.apache.org/jira/browse/PDFBOX-509
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 0.8.0-incubator
            Reporter: Takashi Komatsubara
             Fix For: 0.8.0-incubator
         Attachments: mat10.pdf

The text extracted from the attached Japanese PDF is completely garbled.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-509) Japanese Characters are garbled.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750879#action_12750879 ] 

Andreas Lehmkühler commented on PDFBOX-509:
-------------------------------------------

This issue isn't limited to CJK users; there are a lot of documents using this mapping too, especially in conjunction with CID fonts. There are some sample files in the PDFBox testing area.

IMHO:
- sooner or later (I hope sooner) we have to solve this issue for everyone, not only for CJK users
- it doesn't make sense to add a compile option or something like that
- we can't stop you from forking your own version of PDFBox, but I advise against doing so over such a "little" issue

I suggest creating your own private patch for your needs until we solve this issue for every situation.


Re: Should we be able to extract text from an owner-password protected PDF file?

Posted by Takashi Komatsubara <ta...@gmail.com>.

Adam,

Sorry if I am wrong, but let me explain a little.

From an owner-password protected PDF file, I could extract the text.
From a user-password protected PDF file, I could NOT extract the text.

When you open an owner-password protected PDF file, you can see the content
without specifying a password.
That's the point.

Takashi.



Re: Should we be able to extract text from an owner-password protected PDF file?

Posted by Ad...@swmc.com.

I tested your patch and confirmed that it does NOT work for encrypted
files. Here's the stack trace:

Exception in thread "main" org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
        at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:231)
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1014)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:184)

In case my line numbers are off, line 184 is: document.openProtection( sdm );
which happens before the lines that were commented out by your patch.

I believe you're saying that the text can be extracted from password
protected, non-encrypted files. If it's possible to password protect PDFs
without using encryption, that's news to me. I'm not sure what the point
would be of password protecting something if you're not going to encrypt
it, since that would only give a false sense of security, not any actual
security. So, I just wanted to clear that up so people don't read your
post and think that all PDF security is completely broken. When I first
read it, I thought you were implying that any password protected document
could be read without the password.

As for whether we "should" be able to do this or not, I'd say the
ExtractText program which comes with PDFBox should respect the permissions
by default, and perhaps have an option to extract password protected,
unencrypted documents (without a password). I'm not sure what one would
call that option... -bypassPassword ?

--Adam
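
For reference, the default-respecting behaviour suggested above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names are hypothetical and not part of PDFBox, the bypass flag merely stands in for the "-bypassPassword" idea, and the only outside fact used is that the PDF standard security handler stores permissions as an integer (/P) whose bit 5 controls copying or extracting text and graphics.

```java
// Hypothetical helper, NOT part of PDFBox: decides whether a text extractor
// should proceed, given the permission flags (the /P integer) of a PDF.
final class ExtractionPolicy {
    // Bit 5 (counting from 1) of /P: "copy or otherwise extract text and graphics".
    static final int EXTRACT_BIT = 1 << 4;

    // True when the document owner left the extract permission enabled.
    static boolean ownerAllowsExtraction(int permissionFlags) {
        return (permissionFlags & EXTRACT_BIT) != 0;
    }

    // The bypass parameter models the opt-in override floated in the thread
    // (an assumption for illustration, not an existing ExtractText option).
    static boolean mayExtract(int permissionFlags, boolean bypass) {
        return bypass || ownerAllowsExtraction(permissionFlags);
    }

    public static void main(String[] args) {
        int everythingAllowed = -4;                          // all permission bits set
        int extractDisabled = everythingAllowed & ~EXTRACT_BIT;

        System.out.println(mayExtract(everythingAllowed, false));
        System.out.println(mayExtract(extractDisabled, false));
        System.out.println(mayExtract(extractDisabled, true));
    }
}
```

With this shape the permissive behaviour is opt-in: an auditing tool would have to pass the bypass flag explicitly, while the default call honours the owner's setting.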

This email and any content within or attached hereto from Sun West Mortgage Company, Inc. is confidential and/or legally privileged. The information is intended only for the use of the individual or entity named on this email. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or the taking of any action in reliance on the contents of this email information is strictly prohibited, and that the documents should be returned to this office immediately by email. Receipt by anyone other than the intended recipient is not a waiver of any privilege. Please do not include your social security number, account number, or any other personal or financial information in the content of the email. Should you have any questions, please call (800) 453 7884.

Should we be able to extract text from an owner-password protected PDF file?

Posted by Takashi Komatsubara <ta...@gmail.com>.

Hi team,

Technically, we can extract text from an "owner" password protected PDF file
without specifying the "owner" password, right?

Should we be able to do that, or not?

The reason I'm asking is that I am using PDFBox to audit the content
of PDF files.
So, whether or not the user has disabled the "text extract" permission,
I need to look into the content of an "owner password" protected PDF file.

The old PDFBox could do this.

What do you think?

Takashi

[jira] Updated: (PDFBOX-509) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takashi Komatsubara updated PDFBOX-509:
---------------------------------------

    Attachment: Support_Substitutions.patch

Here is the patch.
This one can be applied to the latest repository (as of today, 31/08/2009).

PLEASE REVIEW THIS PATCH.


[jira] Updated: (PDFBOX-509) Japanese Characters are garbled.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-509:
--------------------------------------

    Fix Version/s:     (was: 0.8.0-incubator)


[jira] Commented: (PDFBOX-509) Japanese Characters are garbled.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749551#action_12749551 ] 

Andreas Lehmkühler commented on PDFBOX-509:
-------------------------------------------

Your patch is similar to the one you contributed some time ago, and I'm afraid it has the same unwanted side effects because of the substitution of the "Identity-H" CMap.
See PDFBOX-420 for more details.
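
To illustrate why an "Identity-H" CMap by itself does not yield readable text, here is a small Java sketch. It is not the attached patch and not PDFBox code; the class name, the sample bytes, and the CID-to-Unicode table are made up for the example. What it assumes from the PDF specification is only that Identity-H treats each character code as two big-endian bytes and maps it to the CID of the same value, so Unicode has to come from a separate table such as a ToUnicode CMap.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, NOT PDFBox code.
final class IdentityHSketch {
    // Under Identity-H each character code is two bytes, big-endian, and the
    // CID equals the code. A trailing odd byte is ignored in this sketch.
    static int[] toCids(byte[] shown) {
        int[] cids = new int[shown.length / 2];
        for (int i = 0; i < cids.length; i++) {
            cids[i] = ((shown[2 * i] & 0xFF) << 8) | (shown[2 * i + 1] & 0xFF);
        }
        return cids;
    }

    // Text requires a separate CID-to-Unicode table (for example one built
    // from a ToUnicode CMap). Unmapped CIDs become U+FFFD here.
    static String toText(int[] cids, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            sb.append(toUnicode.getOrDefault(cid, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two made-up CIDs; real values depend on the embedded font.
        byte[] shown = {0x03, (byte) 0xA9, 0x07, 0x1E};

        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x03A9, "\u65E5"); // assumed mapping for the example
        toUnicode.put(0x071E, "\u672C"); // assumed mapping for the example

        System.out.println(toText(toCids(shown), toUnicode));
    }
}
```

The point of the sketch is that the CIDs are font-specific numbers: substituting some fixed table for Identity-H may decode one family of documents correctly and garble others, which is one way side effects of the kind mentioned above could arise.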


[jira] Updated: (PDFBOX-509) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takashi Komatsubara updated PDFBOX-509:
---------------------------------------

    Attachment: mat10.pdf

This file is from a Japanese government website.



[jira] Commented: (PDFBOX-509) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751257#action_12751257 ] 

Takashi Komatsubara commented on PDFBOX-509:
--------------------------------------------

Hi Andreas,

OK, I see!
I really understand.

I will continue to post patches of this kind to this issue, because many CJK developers are watching the status of PDFBox development.
They are highly skilled developers, so it should be quite easy for them to apply the patches themselves.


[jira] Commented: (PDFBOX-509) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749680#action_12749680 ] 

Takashi Komatsubara commented on PDFBOX-509:
--------------------------------------------

Hi Andreas.
Sorry, I forgot our past communications...

Fonts that use "Identity-H" seem to be essential for CJK users.

http://www.talkgraphics.com/showthread.php?t=33648

What about adding a compile option that enables the portion I posted?
I could create a japanese-pdfbox project on sourceforge.net, for example.
But I want PDFBox to always remain a single binary.

Takashi.
