Posted to dev@pdfbox.apache.org by Laxmi Narayan <co...@gmail.com> on 2018/01/25 03:22:26 UTC

FW: Word Merging Problem

Hi Team,

I have a problem while extracting text from a PDF. When we extract the text,
words merge together. Can you suggest what we should do about this?

I have attached the PDF file from which I am extracting the text, and I am
using the code below to extract it.

Please help me as soon as possible.

        private static string GetTextByArea_Orgnal(PDDocument doc, int x, int y, int w, int h)
        {
            PDFTextStripperByArea stripper = new PDFTextStripperByArea("UTF-8");
            stripper.setLineSeparator(" ");
            stripper.setDropThreshold(3);
            stripper.setWordSeparator(" ");
            stripper.setParagraphStart("<p>");
            stripper.setParagraphEnd("</p>");
            stripper.setIndentThreshold(1);
            stripper.setSortByPosition(true);
            //==================

            //==================

            Dimension d = new Dimension(w, h);
            Rectangle rect = new Rectangle(new Point(x, y), d);
            stripper.addRegion("class1", rect);
            java.util.List allPages = doc.getDocumentCatalog().getAllPages();
            PDPage firstPage = (PDPage)allPages.get(0);
            //// overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
            PDPageContentStream contentStream = new PDPageContentStream(doc, firstPage, true, true);
            contentStream.setNonStrokingColor(Color.CYAN);
            contentStream.fillRect(x, y, w, h);
            contentStream.close();
            ////=============
            stripper.extractRegions(firstPage);
            return stripper.getTextForRegion("class1");
        }

Thanks,

Laxmi Narayan

Re: FW: Word Merging Problem

Posted by Tilman Hausherr <TH...@t-online.de>.

I could not run your code as posted: it was written for an older version of
PDFBox (probably 1.8), it has a syntax error, and the parameters are
missing, so I doubt it ever ran in that form. I then ran ExtractText with
PDFBox 1.8 and yes, many blanks are missing. So please use the current
version, 2.0.8. With it, I found one occurrence where the blank was
missing ("Wewould"), but Adobe Reader has the same problem.

Tilman
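
For readers wondering why blanks can go missing at all: PDF content often
positions glyphs without storing explicit space characters, so an extractor
has to infer word breaks from the horizontal gaps between glyphs. The sketch
below is a simplified illustration of that idea only. It is not PDFBox's
actual algorithm, and the names used (Glyph, joinGlyphs, layout, gapFactor)
are hypothetical and do not come from the thread or from the library.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: infers blanks from glyph gaps; not PDFBox code. */
final class GapBasedSpacing {

    /** One positioned glyph: its character, left edge and width (page units). */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the glyphs of one line, inserting a blank wherever the gap to the
     * previous glyph exceeds gapFactor times the previous glyph's width.
     */
    static String joinGlyphs(List<Glyph> line, double gapFactor) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    /** Lays out words left to right with fixed-width glyphs and a fixed gap between words. */
    static List<Glyph> layout(String[] words, double glyphWidth, double wordGap) {
        List<Glyph> line = new ArrayList<>();
        double x = 0.0;
        for (String word : words) {
            for (char c : word.toCharArray()) {
                line.add(new Glyph(c, x, glyphWidth));
                x += glyphWidth;
            }
            x += wordGap; // tightly typeset text leaves only a small gap here
        }
        return line;
    }

    public static void main(String[] args) {
        // Glyphs are 5 units wide, but the gap between the two words is only 1 unit.
        List<Glyph> line = layout(new String[] {"We", "would"}, 5.0, 1.0);
        System.out.println(joinGlyphs(line, 0.5)); // threshold 2.5: the gap is too small, words merge
        System.out.println(joinGlyphs(line, 0.1)); // threshold 0.5: the gap counts as a blank
    }
}
```

With tightly set text, a gap that a human reads as a word break can fall
below whatever threshold the extractor uses, which is consistent with one
blank still being lost even in a viewer such as Adobe Reader.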


> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org

Re: FW: Word Merging Problem

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,
Please upload your file to a file-sharing site; PDF attachments don't go
through. Please also tell us which PDFBox version you're using (hopefully
2.0.8), and please post to the users mailing list, not the dev list.

I was able to access your file because your post was stuck in
moderation. I don't have time to try your code now (I will tonight). I
tried the ExtractText command-line utility, and its output did have
blanks.

Tilman


