Posted to dev@pdfbox.apache.org by "Justin LeFebvre (JIRA)" <ji...@apache.org> on 2009/03/04 22:19:56 UTC
[jira] Commented: (PDFBOX-77) PDF-Extraction splits words by spaces
[ https://issues.apache.org/jira/browse/PDFBOX-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678926#action_12678926 ]
Justin LeFebvre commented on PDFBOX-77:
---------------------------------------
When run through ExtractText, the test file no longer seems to have a spacing issue, though diacritics still seem to be a problem.
> PDF-Extraction splits words by spaces
> -------------------------------------
>
> Key: PDFBOX-77
> URL: https://issues.apache.org/jira/browse/PDFBOX-77
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1251041
> Originally submitted by mkrebs on 2005-08-03 06:14.
> I'm currently working on an indexing service using
> Jakarta Lucene. As a first step, I have to
> index about 80 PDF documents (later approximately
> 1000 documents). About 60 of the 80 documents are
> extracted by PDFBox without any problems. But for
> around 20 documents, PDFBox generates too many
> spaces, which means that words like "hostname" are
> extracted as "ho stna me". This happens to nearly all
> words contained in these 20 documents.
> Because I have to extract page information, I am using
> PDFTextStripper with the methods setPageSeparator(..),
> setStartPage(..), setEndPage(..) and writeText():
> StringWriter s = new StringWriter();
> PDDocument pddoc = null;
> try {
>     pddoc = PDDocument.load(f);
>     int pageCount = pddoc.getPageCount();
>     PDFTextStripper stripper = new PDFTextStripper();
>     stripper.setPageSeparator(IndexFiles.DOCUMENT_PAGE_SEPARATOR);
>     stripper.setStartPage(1);
>     stripper.setEndPage(pageCount);
>     stripper.writeText(pddoc, s);
> } finally {
>     if (pddoc != null)
>         pddoc.close();
> }
> StringBuffer contents = s.getBuffer();
> For my indexing service, this makes it impossible to
> index these documents correctly.
> I have tried to fix the bug in PDFBox
> (PDFTextStripper.flushText()) and established that the
> width returned by TextPosition.getWidth() is incorrect.
> When I multiply this width by TextPosition.getXScale(),
> these documents are extracted correctly. But other
> documents that were previously extracted correctly lose
> nearly all spaces, which means that a complete sentence
> doesn't contain any spaces between words.
> I have tried PDFBox-0.7.2-dev-20050730.jar, but the
> problem remains.
> Example Text-Output:
> 3.5 SSL V erbindunge n
> JSSE (Java Sec ure So ck et E xt en sion)
> im po rt ja vax .n et .ss l. *
> W ese nt lich e ¨And erung im Cl ient Pr ogr
> amm : (F act o ry
> P att e rn )
> Erse tze
> Soc ket s = new Soc ket (ho stna me, port numb er)
> du rc h
> SSL Soc ketF acto ry s f = (SS LSoc ketF acto ry)
> S SLS ocke tFac tory .get Def ault ();
> SSL Soc ket s = (SSL Sock et) sf.c reat eSoc ket(
> hos tnam e, p ortn umbe r);
> Daf ¨u r mus s SSL k o n fig u riert sein (si ehe
> unten).
> V o rl ¨aufige V.......
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1251041&file_id=144276
> 03_2_SSL.pdf (application/pdf), 189828 bytes
> university-lecture
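The width observation in the quoted report can be illustrated with a small standalone sketch. It is not the code in PDFTextStripper.flushText(); it only assumes a generic gap-based heuristic in which a space is emitted when the next glyph starts noticeably to the right of where the previous glyph is believed to end. The names Glyph, layOut, joinGlyphs, widthScale and spaceThreshold are hypothetical and do not come from PDFBox. Under that assumption, an under-reported width produces spurious spaces, scaling the width repairs such a document, and applying the same scaling to a document whose widths were already correct removes the real spaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names are PDFBox API.
class SpaceHeuristicSketch {

    /** One extracted glyph: its text, left edge and reported width (same units). */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Places each character of the text at a fixed advance. Space characters
     * take up room but produce no glyph, so a real word gap shows up only as
     * a larger distance between two glyphs.
     */
    static List<Glyph> layOut(String text, double advance, double reportedWidth) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ' ') {
                glyphs.add(new Glyph(String.valueOf(c), i * advance, reportedWidth));
            }
        }
        return glyphs;
    }

    /**
     * Joins glyphs left to right and inserts a space whenever the gap between
     * the assumed end of the previous glyph (x + width * widthScale) and the
     * start of the next glyph exceeds spaceThreshold.
     */
    static String joinGlyphs(List<Glyph> glyphs, double widthScale, double spaceThreshold) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double prevEnd = prev.x + prev.width * widthScale;
                if (g.x - prevEnd > spaceThreshold) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Document A: true advance is 10, but the width is reported as 5.
        List<Glyph> underReported = layOut("ab cd", 10.0, 5.0);
        // Document B: the width is reported correctly as 10.
        List<Glyph> correct = layOut("ab cd", 10.0, 10.0);

        System.out.println(joinGlyphs(underReported, 1.0, 3.0)); // a b c d
        System.out.println(joinGlyphs(underReported, 2.0, 3.0)); // ab cd
        System.out.println(joinGlyphs(correct, 1.0, 3.0));       // ab cd
        System.out.println(joinGlyphs(correct, 2.0, 3.0));       // abcd
    }
}
```

In this toy model no single global scale factor works for both documents, which matches the behaviour described in the report. A fix would need to work out per font or per document whether the reported width is already in the expected units.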