Posted to users@pdfbox.apache.org by Mohit Goyal <Mo...@pb.com> on 2016/05/13 06:28:50 UTC
PdfParser giving garbage character
Hi,
I have a PDF that contains text in Malayalam (an Indian language). When I parse it with Apache Tika, I get the garbage character '?' in the output.
I checked the PDF with the pdffonts utility, and it seems that a ToUnicode table is missing.
Output of pdffonts:
Config Error: No display font for 'Symbol'
Config Error: No display font for 'ZapfDingbats'
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
YTLJPR+AnjaliOldLipi                 CID TrueType      yes yes yes   1671  0
Times-Roman                          Type 1            no  no  no    1672  0
Times-Bold                           Type 1            no  no  no     127  0
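The `uni` column in this output can also be checked mechanically. The following is a minimal sketch and not part of the original post: it assumes the seven-column pdffonts layout shown above (other versions of the tool may print additional columns) and that font names contain no spaces. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: list the fonts whose "uni" column is "no" in pdffonts-style output. */
final class PdfFontsReport {

    /**
     * Expects rows laid out as: name, type, emb, sub, uni, object number, generation.
     * The type may contain spaces ("CID TrueType", "Type 1"), so the fields are
     * read from the right-hand end of each line.
     */
    static List<String> fontsWithoutToUnicode(String pdffontsOutput) {
        List<String> names = new ArrayList<>();
        for (String line : pdffontsOutput.split("\\R")) {
            String[] t = line.trim().split("\\s+");
            int n = t.length;
            if (n < 7) {
                continue; // too short to be a font row (e.g. the dashed separator)
            }
            // Font rows end in two numbers; header and "Config Error" lines do not.
            boolean endsWithObjectId = t[n - 2].matches("\\d+") && t[n - 1].matches("\\d+");
            if (!endsWithObjectId) {
                continue;
            }
            if (t[n - 3].equals("no")) { // third field from the right is "uni"
                names.add(t[0]);
            }
        }
        return names;
    }
}
```

On the output above this returns Times-Roman and Times-Bold, while AnjaliOldLipi is listed with uni = yes. A "no" for a simple Type 1 font with a standard encoding does not necessarily prevent text extraction.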
Please find the PDF attached.
Code:
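// Tika classes come from org.apache.tika.*; Files.newWriter is presumably Guava's com.google.common.io.Files.
BufferedWriter writer = Files.newWriter(new File("file-output.txt"), Charset.forName("UTF-8"));
BodyContentHandler handler = new BodyContentHandler(writer);
ParseContext pcontext = new ParseContext();
Metadata metadata = new Metadata();
PDFParser pdfparser = new PDFParser();
// inputstream is the InputStream of the PDF, opened elsewhere.
pdfparser.parse(inputstream, handler, metadata, pcontext);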
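To narrow down where the '?' comes from, it can help to check what the output file actually contains. A literal '?' (U+003F), a replacement character (U+FFFD), and correctly extracted Malayalam text that a viewer or console cannot display may all look alike on screen. The following is a minimal sketch and not part of the original post; it assumes the content of file-output.txt has already been read back into a String as UTF-8, and the class and method names are hypothetical.

```java
/** Sketch: tally which kinds of characters ended up in the extracted text. */
final class ExtractedTextCheck {

    /** Returns {malayalam, questionMarks, replacementChars, total}, counted by code point. */
    static int[] tally(String text) {
        int malayalam = 0;
        int questionMarks = 0;
        int replacementChars = 0;
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs as one code point
            total++;
            if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.MALAYALAM) {
                malayalam++;          // U+0D00..U+0D7F
            } else if (cp == '?') {
                questionMarks++;      // literal question mark
            } else if (cp == 0xFFFD) {
                replacementChars++;   // Unicode replacement character
            }
        }
        return new int[] {malayalam, questionMarks, replacementChars, total};
    }
}
```

If the Malayalam count is high and the other two counts are low, the extraction probably worked and the tool used to view the file is the more likely culprit. If the file really is full of '?' or U+FFFD, the problem is more likely on the extraction side.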
Any suggestions?
Thanks
Mohit Goyal
________________________________
RE: PdfParser giving garbage character
Posted by "Allison, Timothy B." <ta...@mitre.org>.
> Are you sure that you are using PDFBox? The code doesn't look like ours.
That’s Tika.
-----Original Message-----
From: Andreas Lehmkühler [mailto:andreas@lehmi.de]
Sent: Friday, May 13, 2016 5:53 AM
To: Mohit Goyal <Mo...@pb.com>; users@pdfbox.apache.org
Subject: Re: PdfParser giving garbage character
Re: PdfParser giving garbage character
Posted by Andreas Lehmkühler <an...@lehmi.de>.
> Mohit Goyal <Mo...@pb.com> wrote on 13 May 2016 at 08:28:
>
>
> Hi,
>
> I have a PDF that contains text in Malayalam (an Indian language). When I parse
> it with Apache Tika, I get the garbage character '?' in the output.
>
>
> I checked the PDF with the pdffonts utility, and it seems that a ToUnicode table is missing.
> Output of pdffonts:
> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> YTLJPR+AnjaliOldLipi                 CID TrueType      yes yes yes   1671  0
> Times-Roman                          Type 1            no  no  no    1672  0
> Times-Bold                           Type 1            no  no  no     127  0
>
>
> Please find attached pdf.
The PDF didn't make it through because of mailing list restrictions on attachments. Please
provide a link to a public download.
>
> Code:
>
> BufferedWriter writer = Files.newWriter(new File("file-output.txt"), Charset.forName("UTF-8"));
> BodyContentHandler handler = new BodyContentHandler(writer);
> ParseContext pcontext = new ParseContext();
> Metadata metadata = new Metadata();
> PDFParser pdfparser = new PDFParser();
> pdfparser.parse(inputstream, handler, metadata, pcontext);
>
> Any suggestions?
Are you sure that you are using PDFBox? The code doesn't look like ours.
>
> Thanks
> Mohit Goyal
BR
Andreas
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org