You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Justin LeFebvre (JIRA)" <ji...@apache.org> on 2009/03/04 22:19:56 UTC
[jira] Commented: (PDFBOX-77) PDF-Extraction splits words by spaces

    [ https://issues.apache.org/jira/browse/PDFBOX-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678926#action_12678926 ] 

Justin LeFebvre commented on PDFBOX-77:
---------------------------------------

The test file when run with ExtractText, now doesn't seem to have a spacing issue, though diacritics still seem to be a problem.


> PDF-Extraction splits words by spaces
> -------------------------------------
>
>                 Key: PDFBOX-77
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-77
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1251041
> Originally submitted by mkrebs on 2005-08-03 06:14.
> I'm currently working on an indexing-service using 
> jakarta lucene. Momentarily in a first step i have to 
> index about 80 PDF-Documents (later approximatly 
> 1000 documents). Ca. 60 documents (of the 80) are 
> extracted by PDFBox without any problems. But for 
> around 20 documents, PDFBox generates too many 
> spaces, which means, that words like "hostname" are 
> extracted to "ho stna me". This happens by nearly all 
> words contained in these 20 documents.
> Because i have to extract page-information i am using 
> PDFTextStripper with the methods setPageSeparator
> (..), setStartPage(..), setEndPage(..) and writeText():
> StringWriter s = new StringWriter();
> PDDocument pddoc = null;
> try {
>  pddoc = PDDocument.load(f);
>  int pageCount = pddoc.getPageCount();
>  PDFTextStripper stripper = new 
> PDFTextStripper();
>  stripper.setPageSeparator
> (IndexFiles.DOCUMENT_PAGE_SEPARATOR);
>  stripper.setStartPage(1);
>  stripper.setEndPage(pageCount);
>  stripper.writeText(pddoc, s);
> } finally {
>  if (pddoc != null)
>   pddoc.close();
> }
> StringBuffer contents = s.getBuffer();
> In respect for my indexing-service it is impossible to 
> index these documents correctly.
> I have tried to BugFix PDFBox 
> (PDFTextStripper.flushText()) and established, that the 
> width returned by TextPosition.getWidth() is incorrect. 
> When i multiply this width with TextPosition.getXScale
> (), these documents are extracted correctly. But other 
> before correctly extracted documents loose nearly all 
> spaces, which means, that a complete sentence dont 
> contain any spaces between words.
> I have tried: PDFBox-0.7.2-dev-20050730.jar, but the 
> problem still remains.
> Example Text-Output:
> 3.5 SSL V erbindunge n
>        JSSE (Java Sec ure So ck et E xt en sion)
>        im po rt ja vax .n et .ss l. *
>        W ese nt lich e ¨And erung im Cl ient Pr ogr 
> amm : (F act o ry
>       P att e rn )
>       Erse tze
>       Soc ket s = new Soc ket (ho stna me, port numb er)
>       du rc h
>       SSL Soc ketF acto ry s f = (SS LSoc ketF acto ry)
> S SLS ocke tFac tory .get Def ault ();
>       SSL Soc ket s = (SSL Sock et) sf.c reat eSoc ket( 
> hos tnam e, p ortn umbe r);
>        Daf ¨u r mus s SSL k o n fig u riert sein (si ehe 
> unten).
>       V o rl ¨aufige V.......
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1251041&file_id=144276
> 03_2_SSL.pdf (application/pdf), 189828 bytes
> university-lecture

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.