Posted to users@pdfbox.apache.org by Fran Rojas <fr...@gmail.com> on 2020/04/14 21:08:31 UTC

Detection of chess figure characters

Hello,
    I am a user of your wonderful library.
    I use it in a program that extracts chess games from PDF books.
    The library is very good, but it would be great if it could detect chess figure characters.

    They exist in Unicode:
♔♕♖♗♘♙♚♛♜♝♞♟

    I wonder if you are planning to detect them in a future version of the library.

    Good job with the library!
kind regards,
Fran.

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Detection of chess figure characters

Posted by Tilman Hausherr <TH...@t-online.de>.
On 15.04.2020 at 23:12, Peter Murray-Rust wrote:
> Do you know whether the figures are characters or bitmap images? (We find a
> lot of this in scientific publications.) If they are characters with
> non-standard codes, then there's probably only a small number of
> characters in each font. We mapped this for some thousands of maths symbols,
> and I'd guess it's a smaller problem for chess. Alternatively, the pieces
> may be small bitmapped images. Our AMI3 tool uses PDFBox; it stores all
> images and removes duplicates, recording the coordinates. It's open source
> and you are welcome to try it.
> Mail me if so.


Or share such a file (upload to a sharehoster).

If you can't, open it with PDFDebugger, look around until you find
the font resources, and tell us what you see.

Tilman







Re: Detection of chess figure characters

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.
Do you know whether the figures are characters or bitmap images? (We find a
lot of this in scientific publications.) If they are characters with
non-standard codes, then there's probably only a small number of
characters in each font. We mapped this for some thousands of maths symbols,
and I'd guess it's a smaller problem for chess. Alternatively, the pieces
may be small bitmapped images. Our AMI3 tool uses PDFBox; it stores all
images and removes duplicates, recording the coordinates. It's open source
and you are welcome to try it.
Mail me if so.
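
The per-font mapping idea can be sketched as a small lookup table. This is
only an illustration under stated assumptions, not the AMI3 implementation:
the class name FontGlyphMap, the font name "HypotheticalChessFont", and the
code 0x4B are all made up. In a PDFBox-based program, the font name and
character code would presumably be taken from the TextPosition objects
that PDFTextStripper hands to writeString.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical lookup table from (font name, character code) to a Unicode
 * code point. Font names and codes must be collected per book by hand.
 */
final class FontGlyphMap {
    private final Map<String, Integer> table = new HashMap<>();

    private static String key(String fontName, int code) {
        return fontName + "/" + code;
    }

    /** Registers that the given code in the given font means this code point. */
    void put(String fontName, int code, int unicodeCodePoint) {
        table.put(key(fontName, code), unicodeCodePoint);
    }

    /** Returns the mapped character, or the fallback if nothing is registered. */
    String resolve(String fontName, int code, String fallback) {
        Integer codePoint = table.get(key(fontName, code));
        return codePoint == null ? fallback : new String(Character.toChars(codePoint));
    }
}
```

For example, after put("HypotheticalChessFont", 0x4B, 0x2654), a call to
resolve("HypotheticalChessFont", 0x4B, "K") returns the white king
character, while the same code in an unregistered font falls back to "K".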

P.


On Wed, Apr 15, 2020 at 9:19 PM Fran Rojas <fr...@gmail.com> wrote:

> Hello Tilman,
>
> I have just tested the PDF with Adobe Reader, and it did not recognize the
> characters either.
>
> Then, what would the strategy be?
> Is there any way for the library to return the images of unrecognized
> characters so that the application could make an effort to interpret them
> (via a specialized OCR or something like that)?

-- 
"I always retain copyright in my papers, and nothing in any contract I sign
with any publisher will override that fact. You should do the same".

Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Detection of chess figure characters

Posted by Fran Rojas <fr...@gmail.com>.
Hello Tilman,

I have just tested the PDF with Adobe Reader, and it did not recognize the characters either.

Then, what would the strategy be?
Is there any way for the library to return the images of unrecognized characters so that the application could make an effort to interpret them (via a specialized OCR or something like that)?



Re: Detection of chess figure characters

Posted by Tilman Hausherr <TH...@t-online.de>.
On 14.04.2020 at 23:08, Fran Rojas wrote:
> Hello,
>      I am a user of your wonderful library.
>      I use it in a program that extracts chess games from PDF books.
>      The library is very good, but it would be great if it could detect chess figure characters.
>
>      They exist in Unicode:
> ♔♕♖♗♘♙♚♛♜♝♞♟
>
>      I wonder if you are planning to detect them in a future version of the library.


No, but if these are in a PDF as text (with correct encoding), then they
are already extracted today. Check whether they extract with Adobe Reader.
If they do, then PDFBox can extract them too.

Tilman
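
If the pieces are encoded as real text, an application could check the
extracted string for them directly. The following is a minimal sketch, not
part of PDFBox: the class name ChessFigurines and its methods are
hypothetical, and the input is assumed to be whatever string the
application obtained from text extraction (for example from
PDFTextStripper.getText). It relies only on the fact that the twelve chess
symbols occupy the Unicode range U+2654 to U+265F.

```java
/** Hypothetical helper for spotting Unicode chess figurines in extracted text. */
final class ChessFigurines {
    // English piece letters in code point order:
    // U+2654..U+2659 are the white pieces, U+265A..U+265F the black ones.
    private static final String LETTERS = "KQRBNP";

    static boolean isFigurine(int codePoint) {
        return codePoint >= 0x2654 && codePoint <= 0x265F;
    }

    /** True if the text contains at least one chess figurine character. */
    static boolean containsFigurine(String text) {
        return text.codePoints().anyMatch(ChessFigurines::isFigurine);
    }

    /** Replaces each figurine with its piece letter; U+2658 then "f3" gives "Nf3". */
    static String toLetters(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (isFigurine(cp)) {
                out.append(LETTERS.charAt((cp - 0x2654) % 6));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

If containsFigurine returns false for a page that visibly shows pieces,
that suggests the glyphs are images or use a non-standard encoding, which
is the case discussed elsewhere in this thread.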



>
>      Good job with the library!
> kind regards,
> Fran.

