
Posted to dev@pdfbox.apache.org by "Manos Karampasis (JIRA)" <ji...@apache.org> on 2010/07/04 19:40:50 UTC

[jira] Created: (PDFBOX-770) Greek text extraction

Greek text extraction
---------------------

                 Key: PDFBOX-770
                 URL: https://issues.apache.org/jira/browse/PDFBOX-770
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.2.0
         Environment: Ubuntu 10.04
            Reporter: Manos Karampasis


Greek text extraction error
I have a Greek PDF, but after extraction the Greek letter π is extracted as "pi".

For example:
Original text in the PDF:
"φυσικών προσώπων"

Extracted text:
"φυσικών piροσώpiων"

Because of this problem, Solr does not index the documents correctly.

Is there any configuration I can make to fix this?
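
Until the extraction itself is fixed, one possible stopgap is to post-process the extracted text before it reaches Solr. The sketch below is only a heuristic and not part of PDFBox: the class and method names (GreekPiWorkaround, fixPi) are hypothetical, and the rule it applies is an assumption, namely that a Latin "pi" which directly touches a Greek letter and is not part of a longer Latin word is really a mis-extracted π. Greek characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

final class GreekPiWorkaround {
    // Heuristic (assumption): Latin "pi" that is either
    //  - not preceded by a Latin letter and directly followed by a Greek letter, or
    //  - directly preceded by a Greek letter and not followed by a Latin letter.
    private static final Pattern STRAY_PI = Pattern.compile(
        "(?<![A-Za-z])pi(?=\\p{IsGreek})|(?<=\\p{IsGreek})pi(?![A-Za-z])");

    // Replaces each stray "pi" with GREEK SMALL LETTER PI (U+03C0).
    static String fixPi(String extracted) {
        return STRAY_PI.matcher(extracted).replaceAll("\u03C0");
    }

    public static void main(String[] args) {
        // The extracted sample from this report, with "pi" in place of the Greek letter.
        String extracted =
            "\u03C6\u03C5\u03C3\u03B9\u03BA\u03CE\u03BD pi\u03C1\u03BF\u03C3\u03CEpi\u03C9\u03BD";
        System.out.println(fixPi(extracted));
    }
}
```

Latin words such as "pixel" are left alone because no Greek letter is adjacent to the "pi". The rule can still misfire on mixed-script text, so it is a workaround for indexing rather than a fix for the extraction.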

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PDFBOX-770) Greek text extraction

Posted by "Manos Karampasis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manos Karampasis updated PDFBOX-770:
------------------------------------

    Affects Version/s: 1.2.1
                       1.3.0
        Fix Version/s:     (was: 1.3.0)

> Greek text extraction
> ---------------------
>
>                 Key: PDFBOX-770
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-770
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0
>         Environment: Ubuntu 10.04
>            Reporter: Manos Karampasis
>         Attachments: 3842.html, 3842.pdf
>
>
> Greek text extraction error
> I have a Greek PDF, but:
> a) After extraction, the Greek letter π is extracted as "pi".
> For example:
> Original text in the PDF:
> "φυσικών προσώπων"
> Extracted text:
> "φυσικών piροσώpiων"
> b) The Greek letter μ (U+03BC) is extracted as the micro sign µ (U+00B5).
> The two look the same when displayed, but they are encoded differently, so searching for μ does not find the word (only the uppercase Μ is found).
> If you copy the µ as displayed and search for that, the search works fine.
> E.g. the word is displayed as "κλίµακας", but it differs from the typed word κλίμακα because of the letter μ.
> Because of this problem, Solr does not index the documents correctly.
> Is there any configuration I can make to fix this?
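
For point b), Unicode defines a compatibility mapping from the micro sign (U+00B5) to the Greek small letter mu (U+03BC), so NFKC normalization of the extracted text makes the two compare equal. The sketch below uses the standard java.text.Normalizer class; the class and method names (MicroSignNormalizer, normalizeGreek) are hypothetical, and applying this step before indexing is an assumption about the setup, not a PDFBox option.

```java
import java.text.Normalizer;

final class MicroSignNormalizer {
    // NFKC applies compatibility mappings, including MICRO SIGN (U+00B5)
    // to GREEK SMALL LETTER MU (U+03BC), and then recomposes accented letters.
    static String normalizeGreek(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The word from this report as extracted, with U+00B5 in fourth position.
        String extracted = "\u03BA\u03BB\u03AF\u00B5\u03B1\u03BA\u03B1\u03C2";
        // The same word as typed on a Greek keyboard, with U+03BC.
        String typed = "\u03BA\u03BB\u03AF\u03BC\u03B1\u03BA\u03B1\u03C2";
        System.out.println(extracted.equals(typed));
        System.out.println(normalizeGreek(extracted).equals(typed));
    }
}
```

NFKC also rewrites other compatibility characters (ligatures, for example), which is usually acceptable for a search index. If only this one character should change, `extracted.replace('\u00B5', '\u03BC')` is a narrower alternative.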


[jira] Updated: (PDFBOX-770) Greek text extraction

Posted by "Manos Karampasis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manos Karampasis updated PDFBOX-770:
------------------------------------

    Attachment: 3842.pdf
                3842.html

I have posted the original file and the extraction output in HTML format.



[jira] Updated: (PDFBOX-770) Greek text extraction

Posted by "Manos Karampasis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manos Karampasis updated PDFBOX-770:
------------------------------------

      Description: 
Greek text extraction error
I have a Greek PDF, but:
a) After extraction, the Greek letter π is extracted as "pi".

For example:
Original text in the PDF:
"φυσικών προσώπων"

Extracted text:
"φυσικών piροσώpiων"

b) The Greek letter μ (U+03BC) is extracted as the micro sign µ (U+00B5).
The two look the same when displayed, but they are encoded differently, so searching for μ does not find the word (only the uppercase Μ is found).
If you copy the µ as displayed and search for that, the search works fine.

E.g. the word is displayed as "κλίµακας", but it differs from the typed word κλίμακα because of the letter μ.


Because of this problem, Solr does not index the documents correctly.

Is there any configuration I can make to fix this?

  was:
Greek text extraction error
I have a Greek PDF, but after extraction the Greek letter π is extracted as "pi".

For example:
Original text in the PDF:
"φυσικών προσώπων"

Extracted text:
"φυσικών piροσώpiων"

Because of this problem, Solr does not index the documents correctly.

Is there any configuration I can make to fix this?

    Fix Version/s: 1.3.0

