Posted to users@pdfbox.apache.org by win harrington <wi...@yahoo.com.INVALID> on 2016/09/29 13:08:51 UTC
extract bullet points from a PDF
I would like to extract all the lists of bullet points from a PDF file and put them into an XML format.
The items are indented. I want the text and the indentation level.
The input is like this:
- abc
- def
- xyz
- ghi
- 123
- 456
Can I convert that to: abc def xyz ghi 123 456
The last step will be to add tags. I have code to do this:
<abc></abc><def></def> <xyz></xyz> <ghi></ghi> <123></123>
<456></456>
Thank you. Win Harrington
Re: extract bullet points from a PDF
Posted by Tilman Hausherr <TH...@t-online.de>.
On 29.09.2016 at 21:11, Harrington, Ferdinand B wrote:
> I found PDFText2HTML.java. Is there an example of how to call it?
Yes, see TestPDFText2HTML.java
I doubt that it can do indents.
Tilman
> Outlook distorted my message. The data is indented like this, as bullets:
>
> Abc
> Def
> Xyz
> Ghi
> 123
> 456
>
> Thank you.
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
Re: extract bullet points from a PDF
Posted by John Hewson <jo...@jahewson.com>.
> On 29 Sep 2016, at 12:11, Harrington, Ferdinand B <Fe...@ManTech.com> wrote:
>
> I found PDFText2HTML.java. Is there an example of how to call it?
> Outlook distorted my message. The data is indented like this, as bullets:
>
> Abc
> Def
> Xyz
> Ghi
> 123
> 456
Text in a PDF is just placed at (x, y) coordinates. It's not like HTML, where
markup such as <li> and <ul> describes the nesting.
If you want to figure out the nesting from the placement, you'll have to write some
code which does that.
— John
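A rough sketch of the kind of code John means, in Java and under stated assumptions: take the left x coordinate of each extracted line, merge nearly equal values into "indent stops", and use the index of the nearest stop as the nesting level. With PDFBox, the x values could probably be collected by overriding PDFTextStripper.writeString(String, List<TextPosition>) and reading getXDirAdj() from the first TextPosition of each line; that part is left out here so the example runs without the library. The names IndentLevels and Line, the sample coordinates, and the tolerance of 2 units are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class IndentLevels {
    /** One extracted line: its left edge (hypothetical PDF units) and its text. */
    static final class Line {
        final float x;
        final String text;
        Line(float x, String text) { this.x = x; this.text = text; }
    }

    /** Distinct left edges in ascending order; values within `tolerance` of the previous stop are merged. */
    static List<Float> indentStops(List<Line> lines, float tolerance) {
        TreeSet<Float> sorted = new TreeSet<>();
        for (Line line : lines) {
            sorted.add(line.x);
        }
        List<Float> stops = new ArrayList<>();
        for (float x : sorted) {
            if (stops.isEmpty() || x - stops.get(stops.size() - 1) > tolerance) {
                stops.add(x);
            }
        }
        return stops;
    }

    /** 0-based nesting level: the index of the stop closest to x. */
    static int levelOf(float x, List<Float> stops) {
        int best = 0;
        for (int i = 1; i < stops.size(); i++) {
            if (Math.abs(stops.get(i) - x) < Math.abs(stops.get(best) - x)) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up coordinates; real ones would come from the PDF text positions.
        List<Line> lines = Arrays.asList(
                new Line(72.0f, "abc"), new Line(90.1f, "def"),
                new Line(72.2f, "xyz"), new Line(90.0f, "ghi"));
        List<Float> stops = indentStops(lines, 2.0f);
        for (Line line : lines) {
            System.out.println(levelOf(line.x, stops) + " " + line.text);
        }
    }
}
```

The tolerance matters because the same visual indent rarely has exactly the same x value on every line. A value that is too large would merge genuinely different levels, so it needs tuning per document.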
RE: extract bullet points from a PDF
Posted by "Harrington, Ferdinand B" <Fe...@ManTech.com>.
I found PDFText2HTML.java. Is there an example of how to call it?
Outlook distorted my message. The data is indented like this, as bullets:
Abc
Def
Xyz
Ghi
123
456
Thank you.
________________________________
This e-mail and any attachments are intended only for the use of the addressee(s) named herein and may contain proprietary information. If you are not the intended recipient of this e-mail or believe that you received this email in error, please take immediate action to notify the sender of the apparent error by reply e-mail; permanently delete the e-mail and any attachments from your computer; and do not disseminate, distribute, use, or copy this message and any attachments.
Re: extract bullet points from a PDF
Posted by Tilman Hausherr <TH...@t-online.de>.
On 29.09.2016 at 15:08, win harrington wrote:
> I would like to extract all the lists of bullet points from a PDF file and put them into an XML format.
> The items are indented. I want the text and the indentation level.
> The input is like this:
> - abc
> - def
>
> - xyz
> - ghi
>
> - 123
> - 456
>
>
> Can I convert that to: abc def xyz ghi 123 456
> The last step will be to add tags. I have code to do this:
> <abc></abc><def></def> <xyz></xyz> <ghi></ghi> <123></123>
> <456></456>
This sounds like an ordinary Java question, i.e. parsing some text. PDFBox
does have some rudimentary paragraph detection, but I don't know if it
works. Try the PDFText2HTML tool in the source download.
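For the "parse some text" part, here is a minimal Java sketch. It assumes the extracted text still has leading spaces in front of each bullet, which plain text extraction from a PDF may well not preserve. It counts the leading spaces, strips the bullet marker, and uses a stack of indent widths to open and close nested elements. The class name BulletsToXml and the <list>/<item> element names are invented for this example. The bullet text goes into a text attribute rather than becoming the tag name, because XML element names cannot begin with a digit, so <123> would not be well-formed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BulletsToXml {
    /** Converts indented bullet lines into nested <item> elements. */
    static String toXml(String text) {
        StringBuilder out = new StringBuilder("<list>\n");
        // Indent widths of the <item> elements that are currently open.
        Deque<Integer> open = new ArrayDeque<>();
        for (String raw : text.split("\\R")) {
            if (raw.trim().isEmpty()) {
                continue; // skip blank lines between groups
            }
            int indent = 0;
            while (indent < raw.length() && raw.charAt(indent) == ' ') {
                indent++;
            }
            // Drop a leading "-", "*" or bullet character plus following spaces.
            String label = raw.trim().replaceFirst("^[-*\\u2022]\\s*", "");
            // Close every open item that is at the same depth or deeper.
            while (!open.isEmpty() && open.peek() >= indent) {
                open.pop();
                out.append("</item>\n");
            }
            out.append("<item text=\"").append(escape(label)).append("\">\n");
            open.push(indent);
        }
        while (!open.isEmpty()) {
            open.pop();
            out.append("</item>\n");
        }
        return out.append("</list>\n").toString();
    }

    /** Escapes the characters that are unsafe inside a double-quoted attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // Hypothetical input; the real indentation in the PDF is not known from this thread.
        String sample = "- abc\n  - def\n- xyz\n  - ghi\n- 123\n  - 456\n";
        System.out.print(toXml(sample));
    }
}
```

The stack keeps the output well-formed even if a line jumps back several levels at once, since every deeper item is closed before the next one is opened.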