Posted to users@pdfbox.apache.org by "Strein, Mark C CIV USARMY TRADOC ANALYSIS CTR (US)" <ma...@mail.mil> on 2014/08/04 13:22:24 UTC
RE: How to find the position of a specific paragraph in the input PDF? (UNCLASSIFIED)
Classification: UNCLASSIFIED
Caveats: NONE
Morning Sir,
The basic construct for extracting the value in a field is to match on the field name and then read the value:
    if (field.getFullyQualifiedName().equalsIgnoreCase(fullyQualifiedName))
        value = field.getValue();
Note: I use fully qualified names (FQN) to prevent errors.
My way of extracting the FQN is as follows (the short version):
private void processField(PDField field, boolean buildPDList) throws IOException
{
    // A non-null kids list means this field has children: recurse into each one.
    List kids = field.getKids();
    if (kids != null)
    {
        Iterator kidsIter = kids.iterator();
        while (kidsIter.hasNext())
        {
            Object pdfObj = kidsIter.next();
            if (pdfObj instanceof PDField)
            {
                PDField kid = (PDField) pdfObj;
                processField(kid, buildPDList);
            }
        }
    }
    else
    {
        // Terminal field: print its fully qualified name, or do other processing.
        if (!buildPDList)
        {
            System.err.println(field.getFullyQualifiedName());
        }
        else
        {
            // other processing
        }
    }
}
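To illustrate why fully qualified names help, here is a small self-contained sketch of the same recursion that does not use PDFBox at all. The Node class and the collectLeafNames method are hypothetical stand-ins for PDField and processField; the only thing assumed about PDF forms is that a field's fully qualified name is its ancestors' partial names joined with periods.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldNameSketch {

    // Hypothetical stand-in for a form field: a partial name plus optional children.
    public static final class Node {
        final String partialName;
        final List<Node> kids = new ArrayList<>();

        public Node(String partialName) {
            this.partialName = partialName;
        }

        public Node add(Node kid) {
            kids.add(kid);
            return this;
        }
    }

    // Walks the tree depth-first and returns the dotted name of every leaf.
    public static List<String> collectLeafNames(Node root) {
        List<String> names = new ArrayList<>();
        walk(root, "", names);
        return names;
    }

    private static void walk(Node node, String prefix, List<String> out) {
        String fqn = prefix.isEmpty() ? node.partialName : prefix + "." + node.partialName;
        if (node.kids.isEmpty()) {
            out.add(fqn);              // leaf: record the full dotted name
        } else {
            for (Node kid : node.kids) {
                walk(kid, fqn, out);   // non-leaf: descend with the extended prefix
            }
        }
    }

    public static void main(String[] args) {
        // Two leaves share the partial name "name" but differ in their full names.
        Node form = new Node("form")
                .add(new Node("applicant").add(new Node("name")))
                .add(new Node("sponsor").add(new Node("name")));
        for (String name : collectLeafNames(form)) {
            System.out.println(name);
        }
    }
}
```

Both leaves here have the partial name "name", so matching on the partial name alone would be ambiguous; the dotted names stay distinct.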
Hope that helps.
V/R,
Mark Strein
-----Original Message-----
From: Amir H. Jadidinejad [mailto:amir.jadidi@yahoo.com.INVALID]
Sent: Sunday, August 03, 2014 8:53 PM
To: user pdfbox
Subject: How to find the position of a specific paragraph in the input PDF?
I'm going to extract the content of a PDF file using the PDFBox library. The
content should be processed paragraph by paragraph, and for each paragraph I
need its position for follow-up processing. Using the following code, I can
extract the whole content of an input PDF:
    PDDocument doc = PDDocument.load(file);
    PDFTextStripper stripper = new PDFTextStripper();
    String txt = stripper.getText(doc);
    doc.close();
I have two problems:
1. I don't know how to extract the content paragraph by paragraph.
2. I don't know how to store the position of a paragraph for follow-up
processing (for example, highlighting).
Thanks.
Re: How to find the position of a specific paragraph in the input PDF? (UNCLASSIFIED)
Posted by "Amir H. Jadidinejad" <am...@yahoo.com.INVALID>.
Dear Mark,
Thanks for your reply. Unfortunately, I don't understand how your post relates to my question.
I'm new to PDFBox. Would you please explain how to extract the position of a specific paragraph using the attached code?
It seems to work with "fields" in the input PDF file. I'm looking for paragraphs; how are the two related?
Kind regards,
Amir
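On the paragraph question itself, one possible direction is to subclass PDFTextStripper and override its writeString(String, List<TextPosition>) hook, which receives each run of text together with its glyph positions. The grouping step that follows is plain geometry, so the sketch below leaves PDFBox out and works on a hypothetical Line class (text plus a box, with y measured from the top of the page). The rule used, starting a new paragraph when the vertical gap exceeds a fraction of the previous line's height, and the gapFactor value are assumptions for illustration, not something PDFBox prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParagraphSketch {

    // Hypothetical line of text with its box; y is the top edge and grows downward.
    public static final class Line {
        final String text;
        final double x, y, width, height;

        public Line(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // A paragraph: the joined text plus the bounding box of all its lines.
    public static final class Paragraph {
        public final StringBuilder text = new StringBuilder();
        public double minX, minY, maxX, maxY;

        Paragraph(Line first) {
            text.append(first.text);
            minX = first.x;
            minY = first.y;
            maxX = first.x + first.width;
            maxY = first.y + first.height;
        }

        void append(Line line) {
            text.append(' ').append(line.text);
            minX = Math.min(minX, line.x);
            minY = Math.min(minY, line.y);
            maxX = Math.max(maxX, line.x + line.width);
            maxY = Math.max(maxY, line.y + line.height);
        }
    }

    // Groups lines (already in reading order) into paragraphs by vertical gap.
    public static List<Paragraph> group(List<Line> lines, double gapFactor) {
        List<Paragraph> result = new ArrayList<>();
        Paragraph current = null;
        Line prev = null;
        for (Line line : lines) {
            // Gap between the bottom of the previous line and the top of this one.
            boolean startNew = prev == null
                    || line.y - (prev.y + prev.height) > gapFactor * prev.height;
            if (startNew) {
                current = new Paragraph(line);
                result.add(current);
            } else {
                current.append(line);
            }
            prev = line;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Line> lines = Arrays.asList(
                new Line("First line of paragraph one", 72, 100, 300, 12),
                new Line("second line", 72, 114, 250, 12),
                new Line("Paragraph two", 72, 140, 200, 12));
        for (Paragraph p : group(lines, 0.5)) {
            System.out.println(p.text + " @ [" + p.minX + ", " + p.minY
                    + ", " + p.maxX + ", " + p.maxY + "]");
        }
    }
}
```

The stored box per paragraph is what a later step such as highlighting would need. Real documents with multiple columns, indented first lines, or rotated text would need a more careful rule than this single gap test.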