Posted to users@pdfbox.apache.org by Mac P <po...@hotmail.com> on 2012/09/01 04:24:27 UTC
How can I manipulate text in PDF'd by using PDFBox
Hello Forum
Is there a way to split a master PDF file with many pages into separate single-page files, based on the content or keywords on each page?
Each page contains a person's first and last name. I would like to grep the last name and write a script that saves each page as a new PDF file whose name includes that last name, instead of a sequential page number appended to the file name.
I know PDFs are binary documents. Are there any tools that can look up the last names so I can process the pages that way?
Thanks
Mac
Re: How can I manipulate text in PDF'd by using PDFBox
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,
On 01.09.2012 04:24, Mac P wrote:
>
> Hello Forum
>
> Is there a way to split a master PDF file with many pages into separate single-page files, based on the content or keywords on each page?
>
> Each page contains a person's first and last name. I would like to grep the last name and write a script that saves each page as a new PDF file whose name includes that last name, instead of a sequential page number appended to the file name.
>
> I know PDFs are binary documents. Are there any tools that can look up the last names so I can process the pages that way?
Use PDFSplit [1] to split your PDF into single pages and ExtractText [2] to get
the string you're looking for. The first step should work out of the box; the
second could be complicated, depending on the fonts used etc. Just give it a try.
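
Once the last name has been pulled out of the extracted text, it still has to become a usable file name. The following is only a minimal sketch of that step in plain Java, with no PDFBox involved. The class name PageFileNamer, the set of allowed characters, and the "-2" suffix for repeated names are all assumptions made for illustration, not anything PDFBox provides.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: turns an extracted last name into a unique file name.
final class PageFileNamer {

    // How often each (case-insensitive) base name has been handed out so far.
    private final Map<String, Integer> seen = new HashMap<>();

    String fileNameFor(String lastName, int pageNumber) {
        // Assumption: keep only letters, digits, '_' and '-' so the result
        // is a safe file name on common file systems.
        String safe = lastName.replaceAll("[^\\p{L}\\p{N}_-]", "");

        // Fall back to the page number when nothing usable was extracted.
        if (safe.isEmpty()) {
            safe = "page-" + pageNumber;
        }

        // Two people may share a last name: number the repeats.
        int count = seen.merge(safe.toLowerCase(), 1, Integer::sum);
        return count == 1 ? safe + ".pdf" : safe + "-" + count + ".pdf";
    }

    public static void main(String[] args) {
        PageFileNamer namer = new PageFileNamer();
        System.out.println(namer.fileNameFor("Smith", 1));
        System.out.println(namer.fileNameFor("O'Brien", 2));
        System.out.println(namer.fileNameFor("Smith", 3));
    }
}
```

Repeats are tracked case-insensitively because some file systems treat Smith.pdf and smith.pdf as the same file.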
> Thanks
>
> Mac
BR
Andreas Lehmkühler
P.S.: Subscribe yourself correctly to the mailing-list [3], otherwise you won't
get any answer.
[1] http://pdfbox.apache.org/commandlineutilities/PDFSplit.html
[2] http://pdfbox.apache.org/commandlineutilities/ExtractText.html
[3] http://pdfbox.apache.org/mail-lists.html
Re: How can I manipulate text in PDF'd by using PDFBox
Posted by jl...@gi-bon.sk.
Hi Mac,
You can use PDFTextStripper for this.
It returns all the text from the pages.
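
As a rough illustration of this suggestion, the sketch below reads the text of one page at a time with PDFTextStripper and saves each page under the name found on it. It assumes the PDFBox 2.x package layout (in 1.x, PDFTextStripper and Splitter lived in org.apache.pdfbox.util, and 3.x loads documents through Loader.loadPDF instead of PDDocument.load). The class name SplitByLastName and the lastNameOf rule (last word of the first non-blank line) are hypothetical and have to be adapted to the real page layout.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class SplitByLastName {

    // Hypothetical rule: the person's name is the first non-blank line of
    // the page, and the last name is the final word on that line.
    static String lastNameOf(String pageText) {
        for (String line : pageText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                String[] words = trimmed.split("\\s+");
                return words[words.length - 1];
            }
        }
        return "";
    }

    public static void main(String[] args) throws IOException {
        File input = new File(args[0]);   // master PDF
        File outDir = new File(args[1]);  // existing output directory

        try (PDDocument master = PDDocument.load(input)) {
            PDFTextStripper stripper = new PDFTextStripper();
            // By default Splitter produces one document per page.
            List<PDDocument> pages = new Splitter().split(master);

            for (int i = 0; i < pages.size(); i++) {
                int pageNumber = i + 1; // PDFTextStripper pages are 1-based
                stripper.setStartPage(pageNumber);
                stripper.setEndPage(pageNumber);
                String lastName = lastNameOf(stripper.getText(master));

                // Fall back to the page number when no name was found.
                String base = lastName.isEmpty() ? "page-" + pageNumber : lastName;

                // Save while the master document is still open.
                try (PDDocument page = pages.get(i)) {
                    page.save(new File(outDir, base + ".pdf"));
                }
            }
        }
    }
}
```

The sketch uses the extracted name as-is, so two pages with the same last name would overwrite each other, and a name containing characters that are invalid in file names would fail to save.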
Best regards
Juraj Lonc