Posted to dev@pdfbox.apache.org by "Takashi Komatsubara (JIRA)" <ji...@apache.org> on 2009/02/09 11:56:59 UTC

[jira] Created: (PDFBOX-420) Japanese Characters are garbled.

Japanese Characters are garbled.
--------------------------------

                 Key: PDFBOX-420
                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.8.0-incubator
            Reporter: Takashi Komatsubara
            Priority: Critical


The extracted Japanese characters are completely garbled.
This issue is very critical for Japanese users.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi,

> Let me know when I can test the CJK support improvements.
> I will confirm what is lacking and what should be improved.
For the moment I'm short of time, but I'll try to include the patch soon.
I've already patched my local installation. Some more improvements are
needed to support Chinese and Korean as well. But I'm on it ...

Andreas Lehmkühler

RE: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Takashi Komatsubara <ta...@sendmail.com>.
Jukka,

Thank you so much.

Let me know when I can test the CJK support improvements.
I will confirm what is lacking and what should be improved.

Takashi.


-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitting@gmail.com] 
Sent: Tuesday, March 03, 2009 11:09 PM
To: pdfbox-dev@incubator.apache.org; takashi@sendmail.com
Subject: Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Hi,

On Tue, Mar 3, 2009 at 2:53 PM, Takashi Komatsubara
<ta...@sendmail.com> wrote:
> Here is the URL from which we can get the patch.
> https://sourceforge.net/tracker/index.php?func=detail&aid=1640071&group_id=78314&atid=552834

OK, good. That issue was imported to Jira as issue PDFBOX-238. I've
added the reference to PDFBOX-420, so as far as I'm concerned we now
have the required paper trail in place for applying this patch to
trunk.

BR,

Jukka Zitting


Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Mar 3, 2009 at 2:53 PM, Takashi Komatsubara
<ta...@sendmail.com> wrote:
> Here is the URL from which we can get the patch.
> https://sourceforge.net/tracker/index.php?func=detail&aid=1640071&group_id=78314&atid=552834

OK, good. That issue was imported to Jira as issue PDFBOX-238. I've
added the reference to PDFBOX-420, so as far as I'm concerned we now
have the required paper trail in place for applying this patch to
trunk.

BR,

Jukka Zitting

RE: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Takashi Komatsubara <ta...@sendmail.com>.
Hi Jukka,

Thank you for your kindness.

Here is the URL from which we can get the patch.
https://sourceforge.net/tracker/index.php?func=detail&aid=1640071&group_id=78314&atid=552834

Regards,
Takashi.

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitting@gmail.com] 
Sent: Tuesday, March 03, 2009 5:52 AM
To: pdfbox-dev@incubator.apache.org
Subject: Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Hi,

On Mon, Mar 2, 2009 at 4:21 PM, Takashi Komatsubara
<ta...@sendmail.com> wrote:
> I have tried to contact Pin Xue but have never received any response from him.
> The patches he provided were on SourceForge, and I have no relationship with him.
> My work was to collect the CJK patches and fix some tiny issues.

OK, thanks for the background.

> There is no possibility of getting his permission. So please ignore these
> patches and forget about supporting CJK in Apache PDFBox.

Do you have references to the original submissions in SourceForge? We
already have a generic entry in the legal documentation of Apache
PDFBox that covers all BSD-licensed contributions to past PDFBox
versions. This should also cover Pin Xue's work, but it would be good
to have a record of what was contributed and where before we include
the code.

BR,

Jukka Zitting


Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Mar 2, 2009 at 4:21 PM, Takashi Komatsubara
<ta...@sendmail.com> wrote:
> I have tried to contact Pin Xue but have never received any response from him.
> The patches he provided were on SourceForge, and I have no relationship with him.
> My work was to collect the CJK patches and fix some tiny issues.

OK, thanks for the background.

> There is no possibility of getting his permission. So please ignore these
> patches and forget about supporting CJK in Apache PDFBox.

Do you have references to the original submissions in SourceForge? We
already have a generic entry in the legal documentation of Apache
PDFBox that covers all BSD-licensed contributions to past PDFBox
versions. This should also cover Pin Xue's work, but it would be good
to have a record of what was contributed and where before we include
the code.

BR,

Jukka Zitting

RE: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Takashi Komatsubara <ta...@sendmail.com>.
Jukka,

I have tried to contact Pin Xue but have never received any response from him.
The patches he provided were on SourceForge, and I have no relationship with him.
My work was to collect the CJK patches and fix some tiny issues.

There is no possibility of getting his permission.
So please ignore these patches and forget about supporting CJK in Apache PDFBox.

If I have some time to create my own original CJK patches, or someone contributes that kind of patch, that will be the time to start thinking about supporting CJK for Korean/Chinese/Japanese people.

Thank you,
Takashi.



-----Original Message-----
From: Jukka Zitting (JIRA) [mailto:jira@apache.org] 
Sent: Tuesday, February 17, 2009 7:23 PM
To: pdfbox-dev@incubator.apache.org
Subject: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.


    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674167#action_12674167 ] 

Jukka Zitting commented on PDFBOX-420:
--------------------------------------

If you can, please try contacting Pin Xue again to get his confirmation on contributing this under the Apache license.

How/why did you work together on this? If you share an employer and wrote this patch at work, then an OK from your manager is typically enough and often necessary, depending on the copyright terms of the employment contract.

How much of the code is written by Pin Xue? We can include the code under the BSD license, but that requires special mentioning in the LICENSE and NOTICE files that we and all the downstream projects need to worry about. So if it's just a few lines of code then we're probably better off by just rewriting them if we can't get an OK for the Apache license.




[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673826#action_12673826 ] 

Takashi Komatsubara commented on PDFBOX-420:
--------------------------------------------

Hi Andreas,

I have been waiting for his response, but nothing has arrived yet.

In this case, what should my next action be?

Takashi.





Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Andreas Lehmkühler <an...@lehmi.de>.
>> I ran some tests yesterday. After solving a UTF issue with my
>> environment (we've already talked about that) I was able to run the
>> tests. I found the same differences as you did, and for now I still don't
>> know whether it is a problem or an improvement, but I'm still on it.
> 
> Hi Andreas,
> 
> Was it a unicode issue or a line termination problem?  We should try to
> modify the test setup so that it will work in all environments.  Can you
> give more details on what your previous failures were?
I guess it was a Unicode issue. I'll try to determine where the real
problem is when I'm back from ApacheCon. Perhaps I'll find some time
during the conference.

Andreas


Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Brian Carrier <ca...@digital-evidence.org>.
On Mar 24, 2009, at 2:25 AM, Andreas Lehmkühler wrote:

> Hi Brian,
>
> I ran some tests yesterday. After solving a UTF issue with my
> environment (we've already talked about that) I was able to run the
> tests. I found the same differences as you did, and for now I still don't
> know whether it is a problem or an improvement, but I'm still on it.

Hi Andreas,

Was it a unicode issue or a line termination problem?  We should try  
to modify the test setup so that it will work in all environments.   
Can you give more details on what your previous failures were?

thanks,
brian
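
The question above separates two ways an expected-output comparison can fail across environments: the platform default encoding and the line terminator. A minimal sketch of a comparison that is insensitive to both, assuming the expected and extracted text are stored as plain files; the class and method names are hypothetical and this is not the actual PDFBox regression test code:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFBox.
class ExpectedTextComparison {

    /** Converts CRLF and lone CR line endings to LF. */
    static String normalizeLineEndings(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Reads a file as UTF-8, ignoring the platform default encoding. */
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    /** Compares two text files after decoding as UTF-8 and normalizing line endings. */
    static boolean sameText(Path expected, Path actual) throws IOException {
        return normalizeLineEndings(readUtf8(expected))
                .equals(normalizeLineEndings(readUtf8(actual)));
    }
}
```

Fixing the charset explicitly means the result no longer depends on the JVM's default encoding, and normalizing line endings removes differences between Windows and Unix checkouts. Whether either was the real cause of the failures discussed here is not established in the thread.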

RE: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Takashi Komatsubara <ta...@sendmail.com>.
Andreas,

Would you share your changes with me?
I will check them as soon as possible.

Takashi.


-----Original Message-----
From: Andreas Lehmkühler [mailto:andreas@lehmi.de] 
Sent: Tuesday, March 24, 2009 3:25 PM
To: pdfbox-dev@incubator.apache.org
Subject: Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Hi Brian,

I ran some tests yesterday. After solving a UTF issue with my
environment (we've already talked about that) I was able to run the
tests. I found the same differences as you did, and for now I still don't
know whether it is a problem or an improvement, but I'm still on it.

Andreas



Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi Brian,

I ran some tests yesterday. After solving a UTF issue with my
environment (we've already talked about that) I was able to run the
tests. I found the same differences as you did, and for now I still don't
know whether it is a problem or an improvement, but I'm still on it.

Andreas

Brian Carrier wrote:
> Unless someone who was involved with the creation of the patches that
> broke the regression test can help resolve them soon, I propose that the
> patches be reverted.  This is making it difficult for us to develop new
> patches that do not break the tests.
> 
> thanks,
> brian


RE: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Takashi Komatsubara <ta...@sendmail.com>.
Sorry Brian, I haven't had enough time to work on this.
Please give me a few days. I will investigate what the problem is.

Regards,
Takashi.


-----Original Message-----
From: Brian Carrier [mailto:carrier@digital-evidence.org] 
Sent: Tuesday, March 24, 2009 1:01 PM
To: pdfbox-dev@incubator.apache.org
Subject: Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Unless someone who was involved with the creation of the patches that  
broke the regression test can help resolve them soon, I propose that  
the patches be reverted.  This is making it difficult for us to  
develop new patches that do not break the tests.

thanks,
brian






Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Brian Carrier <ca...@digital-evidence.org>.
Unless someone who was involved with the creation of the patches that  
broke the regression test can help resolve them soon, I propose that  
the patches be reverted.  This is making it difficult for us to  
develop new patches that do not break the tests.

thanks,
brian



On Mar 20, 2009, at 8:48 AM, Brian Carrier (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683868#action_12683868 ]
>
> Brian Carrier commented on PDFBOX-420:
> --------------------------------------
>
> Can someone who worked on this patch please look at the regression  
> test failures to see if we should update the files in the  
> regression test (i.e. the patch improves the results in the  
> regression test) or if the patch introduces some new issues? We  
> have some patches that we want to commit, but this issue needs to  
> be resolved first so that our patch can include updated regression  
> test files.
>
> See: http://www.mail-archive.com/pdfbox-dev@incubator.apache.org/msg00726.html


[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683868#action_12683868 ] 

Brian Carrier commented on PDFBOX-420:
--------------------------------------

Can someone who worked on this patch please look at the regression test failures to see if we should update the files in the regression test (i.e. the patch improves the results in the regression test) or if the patch introduces some new issues? We have some patches that we want to commit, but this issue needs to be resolved first so that our patch can include updated regression test files. 

See: http://www.mail-archive.com/pdfbox-dev@incubator.apache.org/msg00726.html



[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671908#action_12671908 ] 

Brian Carrier commented on PDFBOX-420:
--------------------------------------

Do you have a sample PDF file that can be used to verify the fix (i.e. one that is garbled with the current version and that is clean with the patch)?



[jira] Updated: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takashi Komatsubara updated PDFBOX-420:
---------------------------------------

    Attachment: textextract._20090326_01.zip

JUnit test failure log.
I have confirmed that there are a lot of issues caused by this patch.

> Japanese Characters are garbled.
> --------------------------------
>
>                 Key: PDFBOX-420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Takashi Komatsubara
>            Priority: Critical
>         Attachments: supportJapanese-fontbox.patch, supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip, textextract._20090326_01.zip
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.



[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694170#action_12694170 ] 

Brian Carrier commented on PDFBOX-420:
--------------------------------------

I just reverted the changes to PDFont.java so that the regression tests would pass. The new Conversion class still exists in the trunk though. 

Sending        trunk/src/main/java/org/apache/pdfbox/pdmodel/font/PDFont.java
Transmitting file data .
Committed revision 760505.



[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680138#action_12680138 ] 

Andreas Lehmkühler commented on PDFBOX-420:
-------------------------------------------

I've committed the remaining part of the patch in revision 751664.

Please perform some tests on the enhancement. I suggest leaving this issue open until we have positive test results from our Japanese PDFBox users.

Thanks in advance for that.



[jira] Updated: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takashi Komatsubara updated PDFBOX-420:
---------------------------------------

    Attachment: supportJapanese.patch

Here is the patch for PDFBox.

Some of the code is part of FontBox.

Takashi.



RE: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Takashi Komatsubara <ta...@sendmail.com>.
OK, Thank you!

Takashi.


-----Original Message-----
From: Andreas Lehmkuhler [mailto:andreas@lehmi.de]
Sent: 2009/03/07 (Sat) 11:25
To: pdfbox-dev@incubator.apache.org
Subject: Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.
 
Hi Takashi,


please wait, I've committed only the fontbox-part of the patch. The
pdfbox-part is still missing. I fear you have to wait a few more days.

Andreas Lehmkuhler





Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi Takashi,


please wait, I've committed only the fontbox-part of the patch. The
pdfbox-part is still missing. I fear you have to wait a few more days.

Andreas Lehmkühler

> Takashi Komatsubara commented on PDFBOX-420:
> --------------------------------------------
> 
> Thank you so much, Andreas.
> 
> Many Japanese Java developers are watching this.
> Great job! I will try to compile and do some testing !
> 
> Thank you, again.
> Takashi.



[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679869#action_12679869 ] 

Takashi Komatsubara commented on PDFBOX-420:
--------------------------------------------

Thank you so much, Andreas.

Many Japanese Java developers are watching this.
Great job! I will try to compile and do some testing !

Thank you, again.
Takashi.





[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673193#action_12673193 ] 

Takashi Komatsubara commented on PDFBOX-420:
--------------------------------------------

Hi Andreas,

>1. Your patch contains a package with 4 new classes. All of them have an old pdfbox license header. If I add these changes to pdfbox, we have to change the license to the Apache License 2.0 [1]. Is that ok for you and the author Pin Xue who is mentioned in 2 of these files?

I am asking him. Please give me a few days.


>2. Is the cmapSubstitutions mapping in PDFont complete or do you only add the mappings you are interested in? I ask because if I add the code, I'd like to use a complete mapping. As far as I understand the CharCode2Unicode mapping, there are some unicode files missing in your mapping, e.g. the Korean files.

Sorry, I was only focusing on the Japanese language.

It would be great to support the other CJK languages as well.
Please go ahead. I really appreciate it.

Regards,
Takashi.



> Japanese Characters are garbled.
> --------------------------------
>
>                 Key: PDFBOX-420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Takashi Komatsubara
>            Priority: Critical
>         Attachments: supportJapanese-fontbox.patch, supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.



[jira] Updated: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takashi Komatsubara updated PDFBOX-420:
---------------------------------------

    Attachment: supportJapanese-fontbox.patch

This is the patch for FontBox.

> Japanese Characters are garbled.
> --------------------------------
>
>                 Key: PDFBOX-420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Takashi Komatsubara
>            Priority: Critical
>         Attachments: supportJapanese-fontbox.patch, supportJapanese.patch
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.



Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Brian Carrier <ca...@digital-evidence.org>.
Hi Takashi,

Thanks.  I left the conversion packages in the trunk to help make the  
debugging and later commits a little easier.  Hopefully we can get it  
checked in again soon.

thanks,
brian

On Mar 30, 2009, at 10:34 PM, Takashi Komatsubara wrote:

> Brian,
>
> Unfortunately, I have to agree with your proposal.
> (Actually, there are no problems for Japanese people using this patch
> in a Japanese environment.)
>
> Thank you for your great effort.
>
> Takashi.
>
> -----Original Message-----
> From: Brian Carrier [mailto:carrier@digital-evidence.org]
> Sent: Monday, March 30, 2009 11:45 PM
> To: pdfbox-dev@incubator.apache.org
> Subject: Re: [jira] Commented: (PDFBOX-420) Japanese Characters are  
> garbled.
>
> Hello,
>
> The regression tests have been broken for over 2 weeks and this is
> greatly impacting our ability to check in some bug fixes because we
> need to regenerate new regression test files (for example, we fixed
> some spacing issues that can be seen in the regression tests).  I
> think this patch needs to be reverted until it either passes the
> regression tests or only has failures that are because the patch
> fixed some bugs that existed in the regression tests (in which case
> we can simply fix the regression tests).
>
> Unless I hear otherwise, I will plan to revert the patch tomorrow
> (Tuesday).
>
> thanks,
> brian
>
>
>
> On Mar 26, 2009, at 3:03 AM, Takashi Komatsubara (JIRA) wrote:
>
>>
>>     [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689376#action_12689376 ]
>>
>> Takashi Komatsubara commented on PDFBOX-420:
>> --------------------------------------------
>>
>> Andreas,
>>
>> The mapping of "Identity-H" to "JIS" is not a problem in itself.
>>
>> However, I have confirmed that there are wrongly extracted characters
>> in the output txt file that is created dynamically when running the
>> "ant testextract" command.
>> One example is the character ™.
>>
>> (This "™" character cannot be typed on a Japanese keyboard in normal
>> operation, but it can be copied and pasted.)
>>
>> Let me take a look further.
>>
>>> Japanese Characters are garbled.
>>> --------------------------------
>>>
>>>                 Key: PDFBOX-420
>>>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>>>             Project: PDFBox
>>>          Issue Type: Bug
>>>          Components: Text extraction
>>>    Affects Versions: 0.8.0-incubator
>>>            Reporter: Takashi Komatsubara
>>>            Priority: Critical
>>>         Attachments: supportJapanese-fontbox.patch,
>>> supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip,
>>> textextract._20090326_01.zip
>>>
>>>
>>> The extracted Japanese characters are completely garbled.
>>> This issue is very critical for Japanese users.
>>
>>
>
>


RE: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Takashi Komatsubara <ta...@sendmail.com>.
Brian,

Unfortunately, I have to agree with your proposal.
(Actually, there are no problems for Japanese people using this patch
in a Japanese environment.)

Thank you for your great effort.

Takashi.

-----Original Message-----
From: Brian Carrier [mailto:carrier@digital-evidence.org] 
Sent: Monday, March 30, 2009 11:45 PM
To: pdfbox-dev@incubator.apache.org
Subject: Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Hello,

The regression tests have been broken for over 2 weeks and this is  
greatly impacting our ability to check in some bug fixes because we  
need to regenerate new regression test files (for example, we fixed  
some spacing issues that can be seen in the regression tests).  I  
think this patch needs to be reverted until it either passes the  
regression tests or only has failures that are because the patch  
fixed some bugs that existed in the regression tests (in which case  
we can simply fix the regression tests).

Unless I hear otherwise, I will plan to revert the patch tomorrow  
(Tuesday).

thanks,
brian



On Mar 26, 2009, at 3:03 AM, Takashi Komatsubara (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689376#action_12689376 ]
>
> Takashi Komatsubara commented on PDFBOX-420:
> --------------------------------------------
>
> Andreas,
>
> The mapping of "Identity-H" to "JIS" is not a problem in itself.
>
> However, I have confirmed that there are wrongly extracted characters
> in the output txt file that is created dynamically when running the
> "ant testextract" command.
> One example is the character ™.
>
> (This "™" character cannot be typed on a Japanese keyboard in normal
> operation, but it can be copied and pasted.)
>
> Let me take a look further.
>
>> Japanese Characters are garbled.
>> --------------------------------
>>
>>                 Key: PDFBOX-420
>>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>>             Project: PDFBox
>>          Issue Type: Bug
>>          Components: Text extraction
>>    Affects Versions: 0.8.0-incubator
>>            Reporter: Takashi Komatsubara
>>            Priority: Critical
>>         Attachments: supportJapanese-fontbox.patch,  
>> supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip,  
>> textextract._20090326_01.zip
>>
>>
>> The extracted Japanese characters are completely garbled.
>> This issue is very critical for Japanese users.
>
>



Re: [jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by Brian Carrier <ca...@digital-evidence.org>.
Hello,

The regression tests have been broken for over 2 weeks and this is  
greatly impacting our ability to check in some bug fixes because we  
need to regenerate new regression test files (for example, we fixed  
some spacing issues that can be seen in the regression tests).  I  
think this patch needs to be reverted until it either passes the  
regression tests or only has failures that are because the patch  
fixed some bugs that existed in the regression tests (in which case  
we can simply fix the regression tests).

Unless I hear otherwise, I will plan to revert the patch tomorrow  
(Tuesday).

thanks,
brian



On Mar 26, 2009, at 3:03 AM, Takashi Komatsubara (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689376#action_12689376 ]
>
> Takashi Komatsubara commented on PDFBOX-420:
> --------------------------------------------
>
> Andreas,
>
> The mapping of "Identity-H" to "JIS" is not a problem in itself.
>
> However, I have confirmed that there are wrongly extracted characters
> in the output txt file that is created dynamically when running the
> "ant testextract" command.
> One example is the character ™.
>
> (This "™" character cannot be typed on a Japanese keyboard in normal
> operation, but it can be copied and pasted.)
>
> Let me take a look further.
>
>> Japanese Characters are garbled.
>> --------------------------------
>>
>>                 Key: PDFBOX-420
>>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>>             Project: PDFBox
>>          Issue Type: Bug
>>          Components: Text extraction
>>    Affects Versions: 0.8.0-incubator
>>            Reporter: Takashi Komatsubara
>>            Priority: Critical
>>         Attachments: supportJapanese-fontbox.patch,  
>> supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip,  
>> textextract._20090326_01.zip
>>
>>
>> The extracted Japanese characters are completely garbled.
>> This issue is very critical for Japanese users.
>
>


[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689376#action_12689376 ] 

Takashi Komatsubara commented on PDFBOX-420:
--------------------------------------------

Andreas,

The mapping of "Identity-H" to "JIS" is not a problem in itself.

However, I have confirmed that there are wrongly extracted characters in the output txt file that is created dynamically when running the "ant testextract" command.
One example is the character ™.

(This "™" character cannot be typed on a Japanese keyboard in normal operation, but it can be copied and pasted.)

Let me take a look further.

> Japanese Characters are garbled.
> --------------------------------
>
>                 Key: PDFBOX-420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Takashi Komatsubara
>            Priority: Critical
>         Attachments: supportJapanese-fontbox.patch, supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip, textextract._20090326_01.zip
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.



[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673196#action_12673196 ] 

Takashi Komatsubara commented on PDFBOX-420:
--------------------------------------------

Hi Andreas,


I am asking him. Please give me a few days.



Sorry, I was only focusing on the Japanese language.

It would be great to support the other CJK languages as well.
Please go ahead. I really appreciate it.

Regards,
Takashi.




> Japanese Characters are garbled.
> --------------------------------
>
>                 Key: PDFBOX-420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Takashi Komatsubara
>            Priority: Critical
>         Attachments: supportJapanese-fontbox.patch, supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.



[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689132#action_12689132 ] 

Andreas Lehmkühler commented on PDFBOX-420:
-------------------------------------------

As far as I understand the whole encoding stuff, the issue comes up every time TrueType CID fonts are used. Whenever this kind of font is used, "Identity-H" is used as the encoding. The patch maps this encoding to the character set "JIS", which stands for ISO-2022-JP, a Japanese mapping (see org.apache.pdfbox.encoding.conversion.CJKEncodings.java).
So in the end I don't know where to find the solution. Is it wrong to simply map "Identity-H" to "JIS", or is the reason for this problem the missing support for CID fonts?

Any suggestions or hints for solving this issue? 
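
As background to the question above: in the PDF specification, Identity-H is the predefined CMap that reads a string as 2-byte codes (high-order byte first) and uses each code directly as the CID. The codes are therefore not necessarily JIS code points, which may be why a fixed "Identity-H" to "JIS" mapping works for some files and not for others. The following is only a minimal sketch of that reading, assuming the font supplies a table from CID to Unicode (for example via a ToUnicode CMap). The class name, method names and table contents are hypothetical and are not PDFBox API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdentityHSketch {

    /** Identity-H: every two bytes form one code, high-order byte first, and the CID equals the code. */
    public static List<Integer> toCids(byte[] shownString) {
        if (shownString.length % 2 != 0) {
            throw new IllegalArgumentException("Identity-H strings use 2-byte codes");
        }
        List<Integer> cids = new ArrayList<>();
        for (int i = 0; i < shownString.length; i += 2) {
            int high = shownString[i] & 0xFF;
            int low = shownString[i + 1] & 0xFF;
            cids.add((high << 8) | low);
        }
        return cids;
    }

    /** Resolves CIDs through a font-specific table; unmapped CIDs become U+FFFD. */
    public static String toText(List<Integer> cids, Map<Integer, String> cidToUnicode) {
        StringBuilder text = new StringBuilder();
        for (int cid : cids) {
            text.append(cidToUnicode.getOrDefault(cid, "\uFFFD"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical table standing in for what a font's ToUnicode CMap would provide.
        Map<Integer, String> cidToUnicode = new HashMap<>();
        cidToUnicode.put(0x0003, "\u65E5");
        cidToUnicode.put(0x0004, "\u672C");

        // Three 2-byte codes; the last one has no entry in the table.
        byte[] shown = {0x00, 0x03, 0x00, 0x04, 0x00, 0x05};
        System.out.println(toText(toCids(shown), cidToUnicode));
    }
}
```

The point of the sketch is that the lookup table belongs to the individual font, so the same 2-byte code can stand for different characters in different documents.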

> Japanese Characters are garbled.
> --------------------------------
>
>                 Key: PDFBOX-420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Takashi Komatsubara
>            Priority: Critical
>         Attachments: supportJapanese-fontbox.patch, supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.



[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674167#action_12674167 ] 

Jukka Zitting commented on PDFBOX-420:
--------------------------------------

If you can, please try contacting Pin Xue again to get his confirmation on contributing this under the Apache license.

How/why did you work together on this? If you share an employer and wrote this patch at work, then an OK from your manager is typically enough and often necessary, depending on the copyright terms of the employment contract.

How much of the code is written by Pin Xue? We can include the code under the BSD license, but that requires special mentioning in the LICENSE and NOTICE files that we and all the downstream projects need to worry about. So if it's just a few lines of code then we're probably better off by just rewriting them if we can't get an OK for the Apache license.

> Japanese Characters are garbled.
> --------------------------------
>
>                 Key: PDFBOX-420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Takashi Komatsubara
>            Priority: Critical
>         Attachments: supportJapanese-fontbox.patch, supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.



[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680159#action_12680159 ] 

Andreas Lehmkühler commented on PDFBOX-420:
-------------------------------------------

I forgot that the first patch is part of the fontbox-lib, so I had to commit a new version of the fontbox-lib including the needed patch. You'll find the new lib in version 751690.

> Japanese Characters are garbled.
> --------------------------------
>
>                 Key: PDFBOX-420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Takashi Komatsubara
>            Priority: Critical
>         Attachments: supportJapanese-fontbox.patch, supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.



[jira] Updated: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takashi Komatsubara updated PDFBOX-420:
---------------------------------------

    Attachment: TestFilesForJapaneseGarbledIssue.zip

Hi Brian,

Before my change, that is, with the current PDFBox, Japanese characters are handled correctly if the pdf file is version 1.2 or 1.3.
If the pdf file is version 1.4, the garbling issue occurs.

My change corrects this issue for version 1.4.

If the pdf file is version 1.5 or 1.6, the issue sometimes still occurs.
I'm trying to fix that as well.

BTW, the attached file is a good sample.

Thank you, again.

Takashi.
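
Regarding the PDF versions mentioned above: when sorting test files by version, the version can be read from the "%PDF-x.y" header at the start of the file. The following is only a minimal sketch with a hypothetical class and method name, not PDFBox API. It assumes the header appears within the bytes passed in, and it ignores the fact that from PDF 1.4 on the document catalog may carry a Version entry that overrides the header.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfHeaderVersion {

    // Matches the header marker, e.g. "%PDF-1.4".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d)\\.(\\d)");

    /** Returns the header version such as "1.4", or null if no header is found. */
    public static String headerVersion(byte[] leadingBytes) {
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String head = new String(leadingBytes, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) + "." + m.group(2) : null;
    }

    public static void main(String[] args) {
        // Sample bytes standing in for the first bytes of a real file.
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(headerVersion(sample));
    }
}
```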




> Japanese Characters are garbled.
> --------------------------------
>
>                 Key: PDFBOX-420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Takashi Komatsubara
>            Priority: Critical
>         Attachments: supportJapanese-fontbox.patch, supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.



[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679846#action_12679846 ] 

Andreas Lehmkühler commented on PDFBOX-420:
-------------------------------------------

I've committed the fontbox part of the patch with version 751206, so it will be part of the upcoming fontbox release.

> Japanese Characters are garbled.
> --------------------------------
>
>                 Key: PDFBOX-420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Takashi Komatsubara
>            Priority: Critical
>         Attachments: supportJapanese-fontbox.patch, supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.



[jira] Commented: (PDFBOX-420) Japanese Characters are garbled.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672967#action_12672967 ] 

Andreas Lehmkühler commented on PDFBOX-420:
-------------------------------------------

I have two questions before I try to add your code to the trunk:

1. Your patch contains a package with 4 new classes. All of them have an old pdfbox license header. If I add these changes to pdfbox, we have to change the license to the Apache License 2.0 [1]. Is that ok for you and the author Pin Xue who is mentioned in 2 of these files?

2. Is the cmapSubstitutions mapping in PDFont complete or do you only add the mappings you are interested in? I ask because if I add the code, I'd like to use a complete mapping. As far as I understand the CharCode2Unicode mapping, there are some unicode files missing in your mapping, e.g. the Korean files.


[1] http://www.apache.org/licenses/LICENSE-2.0

> Japanese Characters are garbled.
> --------------------------------
>
>                 Key: PDFBOX-420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-420
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Takashi Komatsubara
>            Priority: Critical
>         Attachments: supportJapanese-fontbox.patch, supportJapanese.patch, TestFilesForJapaneseGarbledIssue.zip
>
>
> The extracted Japanese characters are completely garbled.
> This issue is very critical for Japanese users.
