Posted to users@pdfbox.apache.org by "Flynn, Peter" <pf...@ucc.ie> on 2018/01/25 15:47:09 UTC

Mismatch between XeLaTeX fontspec and Apache PDFBox

I have a very large number of bibliographic references in BibTeX format which we need to make available individually in formal reference formats within web pages (as HTML, not as embedded images).

I experimented a couple of years ago with Apache PDFBox and found that it could extract the text from a PDF and preserve bold and italics. This would let us use LaTeX to typeset each PDF in the required format, and then have PDFBox extract the text with bold and italics in all the right places.

Regular pdflatex with old-style bibtex is insufficient, as it doesn't handle all the UTF-8 characters we need, and the reference formats supported are out of date; XeLaTeX with biblatex and biber do all this just fine...but...

...if I do this using the fontspec package (the standard way to provide XeLaTeX with the font data for handling UTF-8 diacritics), the output has all accented characters, but PDFBox doesn't recognise the bold or italic. If I omit the fontspec package, PDFBox can get the bold and italics, but XeLaTeX will omit the diacritics.

Examples of both PDFs and both HTML files are at http://epu.ucc.ie/latex/pdfbox-xelatex-fontspec-error.zip

As I don't know the internals either of fontspec or of PDFBox, I am hoping that someone on the pdfbox mailing list or the comp.text.tex newsgroup may have a lead.

///Peter

Re: Mismatch between XeLaTeX fontspec and Apache PDFBox

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 29.01.2018 at 11:02, Flynn, Peter wrote:
> 
> /bin/java -jar /usr/local/src/pdfbox-app-1.8.4.jar \
>                      ExtractText -html -force V$pubno-crop.pdf \
>                      V$pubno-crop.html

You are using an ancient version of PDFBox. Please update to a more recent
version such as 1.8.13 or, better, 2.0.8.

Andreas
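
Editorial sketch, not part of the original message: when the conversion is run over many files, it can help to check the jar version before building each command. The Python below is a minimal sketch under stated assumptions. The helper names jar_version and extract_command are hypothetical, the jar file name is assumed to follow the pdfbox-app-X.Y.Z.jar pattern seen in the thread, and the ExtractText flags are copied unchanged from the quoted 1.8.4 command (they may differ in newer releases). It only builds the argument list and does not run Java.

```python
import re
from pathlib import Path


def jar_version(jar_path):
    """Return the version in a name like pdfbox-app-1.8.4.jar as a tuple of ints."""
    match = re.search(r"pdfbox-app-(\d+(?:\.\d+)*)\.jar$", Path(jar_path).name)
    if match is None:
        raise ValueError(f"no version found in {jar_path!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def extract_command(jar_path, pubno):
    """Build the ExtractText argument list from the thread for one publication number."""
    return [
        "java", "-jar", str(jar_path),
        "ExtractText", "-html", "-force",  # flags as quoted in the thread (1.8.x)
        f"V{pubno}-crop.pdf", f"V{pubno}-crop.html",
    ]


if __name__ == "__main__":
    jar = "/usr/local/src/pdfbox-app-1.8.4.jar"
    # Compare as integer tuples: a plain string comparison would rank "1.8.4" above "1.8.13".
    if jar_version(jar) < (1, 8, 13):
        print("jar is older than the 1.8.13 suggested in the reply")
    print(" ".join(extract_command(jar, "1234")))
```

The version is compared as a tuple of integers on purpose, since 1.8.13 is newer than 1.8.4 even though it sorts earlier as text.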



RE: Mismatch between XeLaTeX fontspec and Apache PDFBox

Posted by "Flynn, Peter" <pf...@ucc.ie>.

Sorry, forgot to edit off the stuff at the top.

P

--
Peter Flynn | Academic & Collaborative Technologies | University College Cork IT Services | ☎ +353 21 490 2609 | ✉ pflynn@ucc.ie | 🌍 www.ucc.ie




RE: Mismatch between XeLaTeX fontspec and Apache PDFBox

Posted by "Flynn, Peter" <pf...@ucc.ie>.

/bin/java -jar /usr/local/src/pdfbox-app-1.8.4.jar \
                    ExtractText -html -force V$pubno-crop.pdf \
                    V$pubno-crop.html
--
Peter Flynn | Academic & Collaborative Technologies | University College Cork IT Services | ☎ +353 21 490 2609 | ✉ pflynn@ucc.ie | 🌍 www.ucc.ie



On 2018-01-28 12:30:58+00:00 Tilman Hausherr wrote:

> Hi,
> I can only answer about PDFBox... no PDF has anything bold. Both have
> something italic.

Yes, sorry about that. I picked an example that only has italic.

> The PDF without fontspec doesn't have the "é".

Correct. But it does convert with PDFBox and identifies the italics.

> The PDF with fontspec can be converted to HTML with "ExtractText -html"

I convert with the command

/bin/java -jar /usr/local/src/pdfbox-app-1.8.4.jar ExtractText -html -force filename.pdf filename.html

and the results were as given in the .zip file: no italics. What version are you using?

P

Re: Mismatch between XeLaTeX fontspec and Apache PDFBox

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,
I can only answer about PDFBox... no PDF has anything bold. Both have 
something italic. The PDF without fontspec doesn't have the "é". The PDF 
with fontspec can be converted to HTML with "ExtractText -html" and this 
is the HTML I get:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head><title></title>
<meta http-equiv="Content-Type" content="text/html; charset="UTF-8">
</head>
<body>
<div style="page-break-before:always; page-break-after:always"><div><p>Wiel, J&#233;r&#244;me aan de (2018). &#8216;Irish intelligence, 1880s-1922&#8217;. In: <i>Cultures of Intelligence in the Era of the World Wars</i>. Ed. by Simon Ball et al. Oxford: Oxford University Press.</p>
</div></div>
</body></html>

So the italic is there.

Tilman
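
Editorial sketch, not part of the original message: since the stated goal is to place each reference in a web page, the paragraph inside the ExtractText output could be pulled out with the italic and bold markup kept and the numeric character references decoded. The Python below is one possible way to do that with the standard library only. The names ParagraphExtractor and reference_fragments are hypothetical, it assumes each reference sits in a single <p> element as in the output shown above, and it has not been run against the files in the .zip.

```python
from html import escape
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the contents of each <p>, keeping only <i> and <b> markup."""

    KEEP = {"i", "b"}

    def __init__(self):
        # convert_charrefs=True delivers references such as &#233; as "é".
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._parts = None  # None while the parser is outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []
        elif self._parts is not None and tag in self.KEEP:
            self._parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        if tag == "p":
            # Collapse line breaks and runs of spaces left by the extraction.
            self.paragraphs.append(" ".join("".join(self._parts).split()))
            self._parts = None
        elif tag in self.KEEP:
            self._parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._parts is not None:
            # Re-escape &, < and > so the fragment stays valid HTML.
            self._parts.append(escape(data, quote=False))


def reference_fragments(html_text):
    """Return one HTML fragment per <p> element found in html_text."""
    parser = ParagraphExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    sample = (
        "<html><body><div><p>Wiel, J&#233;r&#244;me aan de (2018). In: "
        "<i>Cultures of\nIntelligence</i>. Ed. by Simon Ball et al.</p></div></body></html>"
    )
    for fragment in reference_fragments(sample):
        print(fragment)
```

Tags other than <i> and <b> are dropped, so the page-break <div> wrappers do not reach the web page.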


