Posted to users@pdfbox.apache.org by Mohit Goyal <Mo...@pb.com> on 2016/05/13 06:28:50 UTC

PdfParser giving garbage character

Hi,

I have a PDF that contains text in Malayalam (an Indian language). When I tried to parse it with Apache Tika, I got the garbage character '?' in the output.


I checked the PDF with the pdffonts utility, and it seems like a ToUnicode table is missing.
Output of pdffonts:
Config Error: No display font for 'Symbol'
Config Error: No display font for 'ZapfDingbats'
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
YTLJPR+AnjaliOldLipi                 CID TrueType      yes yes yes   1671  0
Times-Roman                          Type 1            no  no  no    1672  0
Times-Bold                           Type 1            no  no  no     127  0


Please find the PDF attached.

Code:

BufferedWriter writer = Files.newWriter(new File("file-output.txt"), Charset.forName("UTF-8"));
BodyContentHandler handler = new BodyContentHandler(writer);
ParseContext pcontext = new ParseContext();
Metadata metadata = new Metadata();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);

Any suggestions??

Thanks
Mohit Goyal
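
A '?' in extracted text can come from two different places: the extractor failing to map glyphs to Unicode, or a later encoding step that cannot represent the characters. As a rough way to rule out the second, here is a minimal Java sketch that uses only the standard library. The class name EncodingCheck and the method roundTrip are made up for illustration and are not part of Tika or PDFBox. It shows that Malayalam text survives a UTF-8 round trip but turns into '?' characters under ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
class EncodingCheck {

    // Encode the text to bytes with the given charset, then decode it again,
    // the way a writer/reader pair using that charset would.
    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // byte ('?' for ISO-8859-1) for every character it cannot encode.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The word "Malayalam" in Malayalam script, as Unicode escapes.
        String sample = "\u0D2E\u0D32\u0D2F\u0D3E\u0D33\u0D02";

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample)); // prints "true"

        // ISO-8859-1 has no Malayalam characters, so each one becomes '?'.
        System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1)); // prints "??????"
    }
}
```

If the output file is written as UTF-8, as in the code above, and still contains '?', then the replacement most likely happened before writing (for example during text extraction) or when the file was viewed with a tool that does not read UTF-8.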

________________________________

RE: PdfParser giving garbage character

Posted by "Allison, Timothy B." <ta...@mitre.org>.

> Are you sure that you are using PDFBox? The code doesn't look like ours.

That’s Tika.

-----Original Message-----
From: Andreas Lehmkühler [mailto:andreas@lehmi.de] 
Sent: Friday, May 13, 2016 5:53 AM
To: Mohit Goyal <Mo...@pb.com>; users@pdfbox.apache.org
Subject: Re: PdfParser giving garbage character

> Mohit Goyal <Mo...@pb.com> wrote on 13 May 2016 at 08:28:
> 
> 
> Hi,
> 
> I have one pdf which has data in Malyalam(Indian Language). I tried to 
> parse this data using apache Tika I got garbage character '?' in output.
> 
> 
> I checked Pdf using pdffont utility seems like some tounicodetable is missing.
> Output of pdffont
> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> YTLJPR+AnjaliOldLipi                 CID TrueType      yes yes yes   1671  0
> Times-Roman                          Type 1            no  no  no    1672  0
> Times-Bold                           Type 1            no  no  no     127  0
> 
> 
> Please find attached pdf.
The PDF didn't make it through because of attachment restrictions on the mailing list. You'll have to provide a link to a public download.
> 
> Code:
> 
> BufferedWriter writer = Files.newWriter(new File("file-output.txt"), Charset.forName("UTF-8"));
> BodyContentHandler handler = new BodyContentHandler(writer);
> ParseContext pcontext = new ParseContext();
> Metadata metadata = new Metadata();
> PDFParser pdfparser = new PDFParser();
> pdfparser.parse(inputstream, handler, metadata, pcontext);
> 
> Any suggestions??
Are you sure that you are using PDFBox? The code doesn't look like ours.
> 
> Thanks
> Mohit Goyal

BR
Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org



Re: PdfParser giving garbage character

Posted by Andreas Lehmkühler <an...@lehmi.de>.

> Mohit Goyal <Mo...@pb.com> wrote on 13 May 2016 at 08:28:
> 
> 
> Hi,
> 
> I have one pdf which has data in Malyalam(Indian Language). I tried to parse
> this data using apache Tika I got garbage character '?' in output.
> 
> 
> I checked Pdf using pdffont utility seems like some tounicodetable is missing.
> Output of pdffont
> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> YTLJPR+AnjaliOldLipi                 CID TrueType      yes yes yes   1671  0
> Times-Roman                          Type 1            no  no  no    1672  0
> Times-Bold                           Type 1            no  no  no     127  0
> 
> 
> Please find attached pdf.
The PDF didn't make it through because of attachment restrictions on the mailing list. You'll have to
provide a link to a public download.
> 
> Code:
> 
> BufferedWriter writer = Files.newWriter(new File("file-output.txt"), Charset.forName("UTF-8"));
> BodyContentHandler handler = new BodyContentHandler(writer);
> ParseContext pcontext = new ParseContext();
> Metadata metadata = new Metadata();
> PDFParser pdfparser = new PDFParser();
> pdfparser.parse(inputstream, handler, metadata, pcontext);
> 
> Any suggestions??
Are you sure that you are using PDFBox? The code doesn't look like ours.
> 
> Thanks
> Mohit Goyal

BR
Andreas
