Posted to dev@pdfbox.apache.org by "MRIT64 (JIRA)" <ji...@apache.org> on 2009/11/26 10:06:39 UTC

[jira] Created: (PDFBOX-570) Windings font recognition + spacing issue

Windings font recognition + spacing issue
-----------------------------------------

Key: PDFBOX-570
URL: https://issues.apache.org/jira/browse/PDFBOX-570
Project: PDFBox
Issue Type: Wish
Affects Versions: 0.7.3
Environment: Windows XP / Java JDK 1.6.0_15 / Tika 0.4 with PDFbox-0.7.3.jar and fontbox-0.1.0.jar embedded
Reporter: MRIT64

Wingdings characters issue
--------------------------

I first filed this question in Tika's wish list (TIKA-331), but Ken Krugler suggested it was a PDFBox issue.

I have PDF files that include some characters in the Wingdings font.
The Tika parser replaces them with Unicode characters that have nothing to do with the originals and, in some cases, with alphabetic characters. That is expected given the character codes these glyphs have inside the Wingdings font, but when pictures of hands are replaced by alphabetic characters such as A, B, etc., it disturbs further lexical analysis.

Would it be possible to improve the parsing and replace these characters with more accurate Unicode characters?
(see http://www.alanwood.net/demos/wingdings.html for possible correspondences).
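To illustrate the kind of remapping requested, here is a minimal sketch in Java (8 or later). It is not PDFBox or Tika code: the class name WingdingsMapper and the method toUnicode are hypothetical, the table holds only three entries, and it assumes the caller already knows that a run of text was set in Wingdings and that the extracted characters equal the font's character codes. A complete table would have to be checked against the page linked above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of PDFBox or Tika.
final class WingdingsMapper {

    // Partial table: Wingdings character code -> Unicode code point.
    // Only three sample entries; a real table needs every glyph of the font.
    private static final Map<Integer, Integer> TABLE = new HashMap<>();
    static {
        TABLE.put(0x41, 0x270C); // extracted as 'A' -> VICTORY HAND
        TABLE.put(0x46, 0x261E); // extracted as 'F' -> WHITE RIGHT POINTING INDEX
        TABLE.put(0xF0, 0x21E8); // extracted as eth -> RIGHTWARDS WHITE ARROW
    }

    private WingdingsMapper() {
    }

    /**
     * Remaps a run of text that is assumed to come from a Wingdings font.
     * Codes missing from the table are passed through unchanged.
     */
    static String toUnicode(String wingdingsRun) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < wingdingsRun.length(); i++) {
            int code = wingdingsRun.charAt(i);
            out.appendCodePoint(TABLE.getOrDefault(code, code));
        }
        return out.toString();
    }
}
```

The remapping must only be applied to text runs whose font is Wingdings; applying it to ordinary text would turn every real A or F into a symbol.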

Attached files:

test1.pdf is a PDF file including Wingdings characters. Some are commonly used, others less frequently.

Parsing_result1.txt is the text file produced by Tika.

test2.pdf is another example, produced from the same Word source file converted to PDF with another tool, and Parsing_result2.txt is the Tika parsing result. Wingdings characters are translated into different Unicode characters than in test1.pdf.

Spacing issue
-------------

Look at lines 10 and 11 in test2.pdf.
Look at lines 11 and 12 in the Tika parsing result (Parsing_result2.txt):

ðLocalisation des zones de livraison et de stockage
ðLocalisation des zones dangereuses

There is no space between ð and Localisation (ð is how Tika renders the Wingdings "Rightwards white arrow").

If you copy and paste lines 10 and 11 of test2.pdf into a Notepad window, you get:

ð Localisation des zones de livraison et de stockage
ð Localisation des zones dangereuses

...with a space between ð and Localisation.

In my case, the missing space in the Tika parsing result causes "ðLocalisation" to be treated as a single word in subsequent analysis.
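As a stopgap on the consumer side, the extracted text could be post-processed before tokenizing. The sketch below is only an illustration under stated assumptions: the class SymbolSpacing and its method are hypothetical, and the bullet character has to be passed in explicitly, because ð (U+00F0) is itself a Unicode letter and cannot be told apart from ordinary text by character class alone.

```java
// Hypothetical post-processing helper for illustration; not part of PDFBox or Tika.
final class SymbolSpacing {

    private SymbolSpacing() {
    }

    /**
     * If the line starts with the given bullet symbol immediately followed
     * by a letter, inserts a single space between them. Other lines are
     * returned unchanged.
     */
    static String separateLeadingSymbol(String line, char symbol) {
        if (line.length() > 1
                && line.charAt(0) == symbol
                && Character.isLetter(line.charAt(1))) {
            return symbol + " " + line.substring(1);
        }
        return line;
    }
}
```

After this step, splitting the line on whitespace yields the bullet and "Localisation" as separate tokens.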

Regards

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-570) Windings font recognition + spacing issue

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783228#action_12783228 ] 

Andreas Lehmkühler commented on PDFBOX-570:
-------------------------------------------

PDFBox 0.7.3 is quite an old version. We recommend using PDFBox 0.8.0, which contains a lot of improvements. Have you tried Tika 0.5? AFAIK it uses the current version of PDFBox.

[jira] Commented: (PDFBOX-570) Windings font recognition + spacing issue

Posted by "MRIT64 (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783281#action_12783281 ] 

MRIT64 commented on PDFBOX-570:
-------------------------------

I have tried Tika 0.5:
- The spacing issue is solved for the two test files.
- Nothing has changed for the Wingdings characters. The results are the same.

[jira] Updated: (PDFBOX-570) Windings font recognition + spacing issue

Posted by "MRIT64 (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MRIT64 updated PDFBOX-570:
--------------------------

    Attachment: Parsing_Result2.txt
                test2.pdf

[jira] Updated: (PDFBOX-570) Windings font recognition + spacing issue

Posted by "MRIT64 (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MRIT64 updated PDFBOX-570:
--------------------------

    Attachment: Parsing_Result1.txt
                test1.pdf
