Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2009/10/21 12:05:59 UTC
[jira] Updated: (PDFBOX-77) PDF-Extraction splits words by spaces
[ https://issues.apache.org/jira/browse/PDFBOX-77?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting updated PDFBOX-77:
--------------------------------
Description:
[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1251041
Originally submitted by mkrebs on 2005-08-03 06:14.
I'm currently working on an indexing service using
Jakarta Lucene. As a first step I have to
index about 80 PDF documents (later approximately
1000 documents). About 60 of the 80 documents are
extracted by PDFBox without any problems. But for
around 20 documents, PDFBox generates too many
spaces, so that words like "hostname" are
extracted as "ho stna me". This happens to nearly all
words in these 20 documents.
Because I have to extract page information, I am using
PDFTextStripper with the methods setPageSeparator(..),
setStartPage(..), setEndPage(..) and writeText():
    // Collects the extracted text of all pages.
    StringWriter s = new StringWriter();
    PDDocument pddoc = null;
    try {
        pddoc = PDDocument.load(f);
        int pageCount = pddoc.getPageCount();
        PDFTextStripper stripper = new PDFTextStripper();
        // Mark page boundaries so page numbers can be recovered later.
        stripper.setPageSeparator(IndexFiles.DOCUMENT_PAGE_SEPARATOR);
        stripper.setStartPage(1);
        stripper.setEndPage(pageCount);
        stripper.writeText(pddoc, s);
    } finally {
        // Always release the document, even if extraction fails.
        if (pddoc != null) {
            pddoc.close();
        }
    }
    StringBuffer contents = s.getBuffer();
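To illustrate why the page separator matters here, the following is a minimal sketch of how the extracted buffer could be split back into per-page strings. The class PageSplitter, the method splitPages and the form-feed separator value are assumptions made for illustration; the actual value of IndexFiles.DOCUMENT_PAGE_SEPARATOR is not shown in the report, and the sketch does not depend on whether the separator is written after every page or only between pages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or of the reporter's code.
final class PageSplitter {
    // Assumed separator value; the report does not show the real one.
    static final String PAGE_SEPARATOR = "\f";

    // Splits the extracted text at every occurrence of the separator.
    // Text after the last separator is kept only if it is non-empty,
    // so a trailing separator does not produce an empty final page.
    static List<String> splitPages(CharSequence contents, String separator) {
        List<String> pages = new ArrayList<>();
        String text = contents.toString();
        int start = 0;
        int idx;
        while ((idx = text.indexOf(separator, start)) >= 0) {
            pages.add(text.substring(start, idx));
            start = idx + separator.length();
        }
        if (start < text.length()) {
            pages.add(text.substring(start));
        }
        return pages;
    }
}
```

With this, PageSplitter.splitPages(contents, PageSplitter.PAGE_SEPARATOR).get(0) would hold the text of page 1, which can then be indexed together with its page number.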
For my indexing service, this makes it impossible to
index these documents correctly.
I have tried to fix this in PDFBox
(PDFTextStripper.flushText()) and found that the
width returned by TextPosition.getWidth() is incorrect.
When I multiply this width by TextPosition.getXScale(),
these documents are extracted correctly. But other
documents that were previously extracted correctly then
lose nearly all spaces, so that a complete sentence no
longer contains any spaces between words.
I have also tried PDFBox-0.7.2-dev-20050730.jar, but the
problem remains.
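The behaviour described above can be reproduced with a simplified gap heuristic. This is only a sketch under stated assumptions, not the actual logic of PDFTextStripper.flushText(): the class GapSpacing, its Glyph type, the widthScale parameter and the threshold value are all hypothetical. A space is inserted whenever the gap between the end of one glyph (x + width) and the start of the next exceeds a threshold, so an under-reported width creates spurious spaces, while scaling a width that was already correct swallows real ones.

```java
import java.util.List;

// Hypothetical model of a gap-based space heuristic; not PDFBox code.
final class GapSpacing {
    // Simplified stand-in for a positioned character.
    static final class Glyph {
        final String text;
        final double x;      // left edge of the glyph
        final double width;  // width as reported by the extractor
        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Joins glyphs, inserting a space when the gap after the previous
    // glyph (using its width multiplied by widthScale) exceeds the threshold.
    static String join(List<Glyph> glyphs, double widthScale, double spaceThreshold) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double prevEnd = prev.x + prev.width * widthScale;
                if (g.x - prevEnd > spaceThreshold) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }
}
```

For glyphs placed 10 units apart whose width is reported as 5 instead of 10, join(..., 1.0, 2.0) puts a space after every letter, while a scale of 2.0 joins them into one word. For glyphs whose reported width is already correct, the same scale of 2.0 hides a genuine word gap, which matches the regression seen in the other documents.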
Example Text-Output:
3.5 SSL V erbindunge n
• JSSE (Java Sec ure So ck et E xt en sion)
• im po rt ja vax .n et .ss l. *
• W ese nt lich e ¨And erung im Cl ient Pr ogr
amm : (F act o ry
P att e rn )
Erse tze
Soc ket s = new Soc ket (ho stna me, port numb er)
du rc h
SSL Soc ketF acto ry s f = (SS LSoc ketF acto ry)
S SLS ocke tFac tory .get Def ault ();
SSL Soc ket s = (SSL Sock et) sf.c reat eSoc ket(
hos tnam e, p ortn umbe r);
• Daf ¨u r mus s SSL k o n fig u riert sein (si ehe
unten).
V o rl ¨aufige V.......
[attachment on SourceForge]
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1251041&file_id=144276
03_2_SSL.pdf (application/pdf), 189828 bytes
university-lecture
Priority: Blocker
Reporter: Jukka Zitting
Fix Version/s: 0.8.0-incubator
> PDF-Extraction splits words by spaces
> -------------------------------------
>
> Key: PDFBOX-77
> URL: https://issues.apache.org/jira/browse/PDFBOX-77
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Reporter: Jukka Zitting
> Priority: Blocker
> Fix For: 0.8.0-incubator