Posted to dev@pdfbox.apache.org by Maruan Sahyoun <sa...@fileaffairs.de> on 2016/03/29 20:12:30 UTC

Custom glyphlist for text extraction

Hi,

I was wondering if we lost the capability to supply a custom glyph list file as discussed here: http://stackoverflow.com/questions/35972788/how-to-read-control-characters-in-a-pdf-using-java/36034529#36034529

PDFTextStreamEngine seems to have the path hardcoded ("org/apache/pdfbox/resources/glyphlist/additional.txt"), and I couldn't find a way to override it.

Am I missing something?

BR
Maruan

Re: Custom glyphlist for text extraction

Posted by John Hewson <jo...@jahewson.com>.

> On 30 Mar 2016, at 01:59, John Hewson <jo...@jahewson.com> wrote:
> 
> 
> 
> -- John
> 
>> On 29 Mar 2016, at 21:31, Daniel Persson <ma...@gmail.com> wrote:
>> 
>> Hi Maruan
>> 
>> I extended the class to override that. Then again I extended the
>> PDFStreamEngine because I required more extensive changes but the principle
>> should be sound.
> 
> That's right, but subclasses of PDFTextStreamEngine such as PDFTextStripper don't have access to that. So yes, we've lost that capability for PDFTextStripper.
> 
> What's needed is for the glyphList in PDFTextStripper to be overridable, either by making it protected or by adding a getter/setter (the latter is probably a bit easier for users). Note that GlyphLists are immutable and may be arbitrarily chained by wrapping with another GlyphList, as the constructor of PDFTextStripper does.

Correction: "as the constructor of PDFTextStreamEngine does".

-- John

> 
> -- John
> 
>> best regards
>> Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: Custom glyphlist for text extraction

Posted by John Hewson <jo...@jahewson.com>.

-- John

> On 29 Mar 2016, at 21:31, Daniel Persson <ma...@gmail.com> wrote:
> 
> Hi Maruan
> 
> I extended the class to override that. Then again I extended the
> PDFStreamEngine because I required more extensive changes but the principle
> should be sound.

That's right, but subclasses of PDFTextStreamEngine such as PDFTextStripper don't have access to that. So yes, we've lost that capability for PDFTextStripper.

What's needed is for the glyphList in PDFTextStripper to be overridable, either by making it protected or by adding a getter/setter (the latter is probably a bit easier for users). Note that GlyphLists are immutable and may be arbitrarily chained by wrapping with another GlyphList, as the constructor of PDFTextStripper does.
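
To illustrate the idea, here is a rough, library-independent sketch of what "immutable and chainable" plus a getter/setter could look like. None of these classes are PDFBox classes: ChainedGlyphList, SketchTextEngine and GlyphListDemo are hypothetical names made up for this example, the "name;HEX HEX" line format is assumed to follow the usual glyph list files, and letting later entries override the parent's is an assumption of the sketch, not a statement about PDFBox behaviour.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an immutable, chainable glyph list (not the PDFBox class). */
final class ChainedGlyphList {
    private final Map<String, String> nameToUnicode;

    private ChainedGlyphList(Map<String, String> entries) {
        this.nameToUnicode = Collections.unmodifiableMap(entries);
    }

    /**
     * Builds a new list from "name;HEX [HEX ...]" lines layered on top of an
     * optional parent. The parent is copied, never modified. In this sketch,
     * entries in the new text override entries of the same name in the parent.
     */
    static ChainedGlyphList parse(ChainedGlyphList parent, String text) {
        Map<String, String> merged = new HashMap<>();
        if (parent != null) {
            merged.putAll(parent.nameToUnicode);
        }
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = line.split(";");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Invalid glyph list entry: " + line);
            }
            StringBuilder unicode = new StringBuilder();
            for (String hex : parts[1].trim().split("\\s+")) {
                unicode.appendCodePoint(Integer.parseInt(hex, 16));
            }
            merged.put(parts[0].trim(), unicode.toString());
        }
        return new ChainedGlyphList(merged);
    }

    /** Returns the Unicode string for a glyph name, or null if the name is unknown. */
    String toUnicode(String name) {
        return nameToUnicode.get(name);
    }
}

/** Hypothetical engine showing the proposed getter/setter for the glyph list. */
class SketchTextEngine {
    private ChainedGlyphList glyphList;

    SketchTextEngine(ChainedGlyphList defaults) {
        this.glyphList = defaults;
    }

    ChainedGlyphList getGlyphList() {
        return glyphList;
    }

    void setGlyphList(ChainedGlyphList glyphList) {
        this.glyphList = glyphList;
    }
}

class GlyphListDemo {
    public static void main(String[] args) {
        // Defaults, as an engine might load them from a bundled resource.
        ChainedGlyphList defaults = ChainedGlyphList.parse(null, "A;0041\nfi;0066 0069\n");
        SketchTextEngine engine = new SketchTextEngine(defaults);

        // The user wraps the current list with custom entries and sets the result.
        String customEntries = "# user-supplied entries\nsnowman;2603\nfi;FB01\n";
        engine.setGlyphList(ChainedGlyphList.parse(engine.getGlyphList(), customEntries));

        System.out.println(engine.getGlyphList().toUnicode("snowman"));
        System.out.println(engine.getGlyphList().toUnicode("A"));
    }
}
```

With a setter like this, a user would not need to subclass anything: they wrap whatever list the engine currently holds and hand the wrapped list back, and the defaults stay untouched because each list is immutable.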

-- John

> best regards
> Daniel



Re: Custom glyphlist for text extraction

Posted by Daniel Persson <ma...@gmail.com>.

Hi Maruan

I extended the class to override that. Then again I extended the
PDFStreamEngine because I required more extensive changes but the principle
should be sound.

best regards
Daniel
