You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2009/10/21 12:05:59 UTC

[jira] Updated: (PDFBOX-77) PDF-Extraction splits words by spaces

     [ https://issues.apache.org/jira/browse/PDFBOX-77?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated PDFBOX-77:
--------------------------------

      Description: 
[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1251041
Originally submitted by mkrebs on 2005-08-03 06:14.

I'm currently working on an indexing-service using 
jakarta lucene. Momentarily in a first step i have to 
index about 80 PDF-Documents (later approximatly 
1000 documents). Ca. 60 documents (of the 80) are 
extracted by PDFBox without any problems. But for 
around 20 documents, PDFBox generates too many 
spaces, which means, that words like "hostname" are 
extracted to "ho stna me". This happens by nearly all 
words contained in these 20 documents.
Because i have to extract page-information i am using 
PDFTextStripper with the methods setPageSeparator
(..), setStartPage(..), setEndPage(..) and writeText():

StringWriter s = new StringWriter();
PDDocument pddoc = null;
try {
 pddoc = PDDocument.load(f);
 int pageCount = pddoc.getPageCount();
 PDFTextStripper stripper = new 
PDFTextStripper();
 stripper.setPageSeparator
(IndexFiles.DOCUMENT_PAGE_SEPARATOR);
 stripper.setStartPage(1);
 stripper.setEndPage(pageCount);
 stripper.writeText(pddoc, s);
} finally {
 if (pddoc != null)
  pddoc.close();
}
StringBuffer contents = s.getBuffer();

In respect for my indexing-service it is impossible to 
index these documents correctly.

I have tried to BugFix PDFBox 
(PDFTextStripper.flushText()) and established, that the 
width returned by TextPosition.getWidth() is incorrect. 
When i multiply this width with TextPosition.getXScale
(), these documents are extracted correctly. But other 
before correctly extracted documents loose nearly all 
spaces, which means, that a complete sentence dont 
contain any spaces between words.

I have tried: PDFBox-0.7.2-dev-20050730.jar, but the 
problem still remains.

Example Text-Output:

3.5 SSL V erbindunge n
      • JSSE (Java Sec ure So ck et E xt en sion)
      • im po rt ja vax .n et .ss l. *
      • W ese nt lich e ¨And erung im Cl ient Pr ogr 
amm : (F act o ry
      P att e rn )
      Erse tze
      Soc ket s = new Soc ket (ho stna me, port numb er)
      du rc h
      SSL Soc ketF acto ry s f = (SS LSoc ketF acto ry)
S SLS ocke tFac tory .get Def ault ();
      SSL Soc ket s = (SSL Sock et) sf.c reat eSoc ket( 
hos tnam e, p ortn umbe r);
      • Daf ¨u r mus s SSL k o n fig u riert sein (si ehe 
unten).
      V o rl ¨aufige V.......


[attachment on SourceForge]
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1251041&file_id=144276
03_2_SSL.pdf (application/pdf), 189828 bytes
university-lecture

  was:
[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1251041
Originally submitted by mkrebs on 2005-08-03 06:14.

I'm currently working on an indexing-service using 
jakarta lucene. Momentarily in a first step i have to 
index about 80 PDF-Documents (later approximatly 
1000 documents). Ca. 60 documents (of the 80) are 
extracted by PDFBox without any problems. But for 
around 20 documents, PDFBox generates too many 
spaces, which means, that words like "hostname" are 
extracted to "ho stna me". This happens by nearly all 
words contained in these 20 documents.
Because i have to extract page-information i am using 
PDFTextStripper with the methods setPageSeparator
(..), setStartPage(..), setEndPage(..) and writeText():

StringWriter s = new StringWriter();
PDDocument pddoc = null;
try {
 pddoc = PDDocument.load(f);
 int pageCount = pddoc.getPageCount();
 PDFTextStripper stripper = new 
PDFTextStripper();
 stripper.setPageSeparator
(IndexFiles.DOCUMENT_PAGE_SEPARATOR);
 stripper.setStartPage(1);
 stripper.setEndPage(pageCount);
 stripper.writeText(pddoc, s);
} finally {
 if (pddoc != null)
  pddoc.close();
}
StringBuffer contents = s.getBuffer();

In respect for my indexing-service it is impossible to 
index these documents correctly.

I have tried to BugFix PDFBox 
(PDFTextStripper.flushText()) and established, that the 
width returned by TextPosition.getWidth() is incorrect. 
When i multiply this width with TextPosition.getXScale
(), these documents are extracted correctly. But other 
before correctly extracted documents loose nearly all 
spaces, which means, that a complete sentence dont 
contain any spaces between words.

I have tried: PDFBox-0.7.2-dev-20050730.jar, but the 
problem still remains.

Example Text-Output:

3.5 SSL V erbindunge n
      • JSSE (Java Sec ure So ck et E xt en sion)
      • im po rt ja vax .n et .ss l. *
      • W ese nt lich e ¨And erung im Cl ient Pr ogr 
amm : (F act o ry
      P att e rn )
      Erse tze
      Soc ket s = new Soc ket (ho stna me, port numb er)
      du rc h
      SSL Soc ketF acto ry s f = (SS LSoc ketF acto ry)
S SLS ocke tFac tory .get Def ault ();
      SSL Soc ket s = (SSL Sock et) sf.c reat eSoc ket( 
hos tnam e, p ortn umbe r);
      • Daf ¨u r mus s SSL k o n fig u riert sein (si ehe 
unten).
      V o rl ¨aufige V.......


[attachment on SourceForge]
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1251041&file_id=144276
03_2_SSL.pdf (application/pdf), 189828 bytes
university-lecture

         Priority: Blocker
         Reporter: Jukka Zitting
    Fix Version/s: 0.8.0-incubator

> PDF-Extraction splits words by spaces
> -------------------------------------
>
>                 Key: PDFBOX-77
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-77
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Jukka Zitting
>            Priority: Blocker
>             Fix For: 0.8.0-incubator
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1251041
> Originally submitted by mkrebs on 2005-08-03 06:14.
> I'm currently working on an indexing-service using 
> jakarta lucene. Momentarily in a first step i have to 
> index about 80 PDF-Documents (later approximatly 
> 1000 documents). Ca. 60 documents (of the 80) are 
> extracted by PDFBox without any problems. But for 
> around 20 documents, PDFBox generates too many 
> spaces, which means, that words like "hostname" are 
> extracted to "ho stna me". This happens by nearly all 
> words contained in these 20 documents.
> Because i have to extract page-information i am using 
> PDFTextStripper with the methods setPageSeparator
> (..), setStartPage(..), setEndPage(..) and writeText():
> StringWriter s = new StringWriter();
> PDDocument pddoc = null;
> try {
>  pddoc = PDDocument.load(f);
>  int pageCount = pddoc.getPageCount();
>  PDFTextStripper stripper = new 
> PDFTextStripper();
>  stripper.setPageSeparator
> (IndexFiles.DOCUMENT_PAGE_SEPARATOR);
>  stripper.setStartPage(1);
>  stripper.setEndPage(pageCount);
>  stripper.writeText(pddoc, s);
> } finally {
>  if (pddoc != null)
>   pddoc.close();
> }
> StringBuffer contents = s.getBuffer();
> In respect for my indexing-service it is impossible to 
> index these documents correctly.
> I have tried to BugFix PDFBox 
> (PDFTextStripper.flushText()) and established, that the 
> width returned by TextPosition.getWidth() is incorrect. 
> When i multiply this width with TextPosition.getXScale
> (), these documents are extracted correctly. But other 
> before correctly extracted documents loose nearly all 
> spaces, which means, that a complete sentence dont 
> contain any spaces between words.
> I have tried: PDFBox-0.7.2-dev-20050730.jar, but the 
> problem still remains.
> Example Text-Output:
> 3.5 SSL V erbindunge n
>       • JSSE (Java Sec ure So ck et E xt en sion)
>       • im po rt ja vax .n et .ss l. *
>       • W ese nt lich e ¨And erung im Cl ient Pr ogr 
> amm : (F act o ry
>       P att e rn )
>       Erse tze
>       Soc ket s = new Soc ket (ho stna me, port numb er)
>       du rc h
>       SSL Soc ketF acto ry s f = (SS LSoc ketF acto ry)
> S SLS ocke tFac tory .get Def ault ();
>       SSL Soc ket s = (SSL Sock et) sf.c reat eSoc ket( 
> hos tnam e, p ortn umbe r);
>       • Daf ¨u r mus s SSL k o n fig u riert sein (si ehe 
> unten).
>       V o rl ¨aufige V.......
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1251041&file_id=144276
> 03_2_SSL.pdf (application/pdf), 189828 bytes
> university-lecture

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.