Posted to users@pdfbox.apache.org by Mac P <po...@hotmail.com> on 2012/09/01 04:24:27 UTC

How can I manipulate text in PDFs using PDFBox?

Hello Forum

Is there any way to split a master PDF file consisting of many pages into separate pages based on the content or keywords on each page?

Each page has the person's first and last name. I would like to grep the last name and write a script that separates each page and turns it into a new PDF file with the last name as part of the file name, instead of a sequential number (up to the total page count) at the end of each file name.

I know PDFs are binary documents. Are there any tools to look up the last names and manipulate them that way?

Thanks

Mac

Re: How can I manipulate text in PDFs using PDFBox?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

On 01.09.2012 04:24, Mac P wrote:
>
> Hello Forum
>
> Is there any way to split a master PDF file consisting of many pages into separate pages based on the content or keywords on each page?
>
> Each page has the person's first and last name. I would like to grep the last name and write a script that separates each page and turns it into a new PDF file with the last name as part of the file name, instead of a sequential number (up to the total page count) at the end of each file name.
>
> I know PDFs are binary documents. Are there any tools to look up the last names and manipulate them that way?

Use PDFSplit [1] to split your PDF into single pages and ExtractText [2] to get
the string you're looking for. The first step should work out of the box; the
latter could be complicated, depending on the fonts used, etc. Just give it a try.
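
Once ExtractText has given you the text of a page, picking out the last name and
building the file name is ordinary string handling. Here is a minimal Java sketch
of that step only. It assumes each page contains a line like "Name: Jane Doe";
your post does not describe the real layout, so the pattern, the class name and
the method names are all hypothetical and need adapting.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Derives an output file name from the text extracted from one PDF page. */
class LastNameFileNamer {

    // ASSUMPTION: the page has a line such as "Name: Jane Doe" and the last
    // word on that line is the last name. Adapt this to the real page layout.
    private static final Pattern NAME_LINE =
            Pattern.compile("(?m)^[ \\t]*Name:[ \\t]*.*?(\\S+)[ \\t]*$");

    /** Returns the last name found in the page text, if any. */
    static Optional<String> extractLastName(String pageText) {
        Matcher m = NAME_LINE.matcher(pageText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Builds "<LastName>.pdf", or "page-<n>.pdf" when no name line is found. */
    static String fileNameFor(String pageText, int pageNumber) {
        return extractLastName(pageText)
                // Replace characters that are awkward in file names.
                .map(name -> name.replaceAll("[^\\p{L}\\p{N}_-]", "_") + ".pdf")
                .orElse("page-" + pageNumber + ".pdf");
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("Name: Jane Doe\nDepartment: Sales", 1));
        System.out.println(fileNameFor("no name on this page", 2));
    }
}
```

If two pages carry the same last name, the second file would overwrite the
first, so you may want to keep the page number in the name as well.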

> Thanks
>
> Mac

BR
Andreas Lehmkühler

P.S.: Subscribe to the mailing list [3] properly; otherwise you won't
receive any replies.

[1] http://pdfbox.apache.org/commandlineutilities/PDFSplit.html
[2] http://pdfbox.apache.org/commandlineutilities/ExtractText.html
[3] http://pdfbox.apache.org/mail-lists.html

Re: How can I manipulate text in PDFs using PDFBox?

Posted by jl...@gi-bon.sk.

Hi Mac,

You can use PDFTextStripper for this.
It returns all the text from the pages.
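
A rough sketch of how that could look, in case it helps. It assumes the PDFBox
2.x API on the classpath (PDDocument.load and org.apache.pdfbox.text.PDFTextStripper;
in 1.x the stripper lives in org.apache.pdfbox.util, and in 3.x loading goes
through Loader.loadPDF). The class name, the nameFrom helper and the rule "the
last word of the first non-blank line is the last name" are made up for
illustration, so adjust them to your pages.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

/** Saves every page of a PDF as its own file, named after text on that page. */
class SplitByPageText {

    /** Extracts the text of one page; PDFTextStripper page numbers are 1-based. */
    static String textOfPage(PDDocument doc, int pageNumber) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(pageNumber);
        stripper.setEndPage(pageNumber);
        return stripper.getText(doc);
    }

    // ASSUMPTION (hypothetical rule): the last word of the first non-blank
    // line is the last name. Falls back to "page-<n>.pdf" for empty pages.
    static String nameFrom(String pageText, int pageNumber) {
        for (String line : pageText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                String[] words = trimmed.split("\\s+");
                String last = words[words.length - 1];
                return last.replaceAll("[^\\p{L}\\p{N}_-]", "_") + ".pdf";
            }
        }
        return "page-" + pageNumber + ".pdf";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SplitByPageText <input.pdf> <output-dir>");
            return;
        }
        File outDir = new File(args[1]);
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            for (int i = 1; i <= doc.getNumberOfPages(); i++) {
                String text = textOfPage(doc, i);
                try (PDDocument single = new PDDocument()) {
                    // getPage is 0-based, unlike the stripper's page numbers.
                    single.importPage(doc.getPage(i - 1));
                    single.save(new File(outDir, nameFrom(text, i)));
                }
            }
        }
    }
}
```

Pages that yield the same name would overwrite each other, so adding the page
number to the file name may be safer. Whether the extracted text is usable at
all depends on how the PDF was produced.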


Best regards
Juraj Lonc


GI-BÓN, spol. s r.o.
Management Systems

Bratislavská 11
SK - 010 01 Žilina
Tel: +421-41-564 3437-8
Mobil: +421-907-815 147
Fax: +421-41-564 3439
e-mail: jlonc@gi-bon.sk
homepage: http://www.gi-bon.sk 





From:   Mac P <po...@hotmail.com>
To:     pdfbox <us...@pdfbox.apache.org>, 
Date:   01. 09. 2012 10:02
Subject:        How can I manipulate text in PDFs using PDFBox?




Hello Forum

Is there any way to split a master PDF file consisting of many pages
into separate pages based on the content or keywords on each page?

Each page has the person's first and last name. I would like to grep the
last name and write a script that separates each page and turns it into a
new PDF file with the last name as part of the file name, instead of a
sequential number (up to the total page count) at the end of each
file name.

I know PDFs are binary documents. Are there any tools to look up the last 
names and manipulate them that way?

Thanks

Mac