Posted to users@pdfbox.apache.org by win harrington <wi...@yahoo.com.INVALID> on 2016/09/29 13:08:51 UTC

extract bullet points from a PDF

I would like to extract all the lists of bullet points from a PDF file and put them into an XML format.
The items are indented. I want the text and the indentation level.
The input is like this:   
   - abc
   - def
   
   - xyz
   - ghi
   
   - 123
   - 456


Can I convert that to: abc def   xyz   ghi      123      456
The last step will be to add tags. I have code to do this:
<abc></abc><def></def>    <xyz></xyz>    <ghi></ghi>        <123></123>
        <456></456>

Thank you. Win Harrington



Re: extract bullet points from a PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
On 29.09.2016 at 21:11, Harrington, Ferdinand B wrote:
> I found PDFText2HTML.java. Is there an example of how to call it?

Yes, see TestPDFText2HTML.java

I doubt that it can do indents.

Tilman

> Outlook distorted my message. The data is indented like this,
> as bullets:
>
> Abc
> Def
>       Xyz
>       Ghi
>            123
>            456
>
> Thank you.


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: extract bullet points from a PDF

Posted by John Hewson <jo...@jahewson.com>.
> On 29 Sep 2016, at 12:11, Harrington, Ferdinand B <Fe...@ManTech.com> wrote:
> 
> I found PDFText2HTML.java. Is there an example of how to call it?
> Outlook distorted my message. The data is indented like this,
> as bullets:
> 
> Abc
> Def
>     Xyz
>     Ghi
>          123
>          456

Text in a PDF is just placed using (x, y) coordinates. It’s not like HTML, where
markup such as <ul> and <li> describes the nesting.

If you want to figure out the nesting from the placement, you’ll have to write some
code which does that.

— John
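
A minimal sketch of the kind of code described above, not a PDFBox feature. It assumes you already have the left x coordinate of each extracted line; in PDFBox these could plausibly be collected from the TextPosition objects handed to a PDFTextStripper subclass. The class name IndentLevels, the method levelsFromX, the tolerance parameter, and the sample coordinates are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper: maps the left x coordinate of each text line to a nesting level. */
final class IndentLevels {

    /**
     * Returns one nesting level per input line (0 = outermost).
     * lineStartX holds the left edge of each line; tolerance is how far an
     * x value may lie from an indent stop and still count as that indent.
     */
    static int[] levelsFromX(float[] lineStartX, float tolerance) {
        // Collect the distinct x values, sorted from left to right.
        TreeSet<Float> sorted = new TreeSet<>();
        for (float x : lineStartX) {
            sorted.add(x);
        }

        // Keep an x value as a new indent stop only if it is clearly
        // to the right of the previous stop.
        List<Float> stops = new ArrayList<>();
        for (float x : sorted) {
            if (stops.isEmpty() || x - stops.get(stops.size() - 1) > tolerance) {
                stops.add(x);
            }
        }

        // The level of a line is the index of its nearest indent stop.
        int[] levels = new int[lineStartX.length];
        for (int i = 0; i < lineStartX.length; i++) {
            int best = 0;
            for (int s = 1; s < stops.size(); s++) {
                float candidate = Math.abs(lineStartX[i] - stops.get(s));
                float current = Math.abs(lineStartX[i] - stops.get(best));
                if (candidate < current) {
                    best = s;
                }
            }
            levels[i] = best;
        }
        return levels;
    }

    public static void main(String[] args) {
        // Invented coordinates for the six items in the question.
        float[] x = {72f, 72.3f, 90f, 90f, 108f, 108.2f};
        System.out.println(Arrays.toString(levelsFromX(x, 2f)));
    }
}
```

The tolerance is there because lines at the same visual indent rarely share exactly the same x value. Multi-column pages or wrapped bullet text would need more care than this sketch takes.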



RE: extract bullet points from a PDF

Posted by "Harrington, Ferdinand B" <Fe...@ManTech.com>.
I found PDFText2HTML.java. Is there an example of how to call it?
Outlook distorted my message. The data is indented like this,
as bullets:

Abc
Def
     Xyz
     Ghi
          123
          456

Thank you.



Re: extract bullet points from a PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
On 29.09.2016 at 15:08, win harrington wrote:
> I would like to extract all the lists of bullet points from a PDF file and put them into an XML format.
> The items are indented. I want the text and the indentation level.
> The input is like this:
>     - abc
>     - def
>     
>     - xyz
>     - ghi
>     
>     - 123
>     - 456
>
>
> Can I convert that to: abc def   xyz   ghi      123      456
> The last step will be to add tags. I have code to do this:
> <abc></abc><def></def>    <xyz></xyz>    <ghi></ghi>        <123></123>
>          <456></456>

This sounds like an ordinary Java question, i.e. parsing some text. PDFBox
does have some rudimentary paragraph detection, but I don't know if it
works. Try the PDFText2HTML tool in the source download.
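
A minimal sketch of the text-parsing step, assuming the extracted text keeps the indentation as leading spaces, as in the sample data. It is plain Java and does not use PDFBox. The class BulletsToXml, the method toXml, and the <list>/<item level="..."> output format are assumptions chosen for illustration. A generic item element is used because names such as 123 are not valid XML element names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: turns indented bullet text into flat XML items with a level attribute. */
final class BulletsToXml {

    static String toXml(String text) {
        StringBuilder out = new StringBuilder("<list>\n");
        // Stack of the indent widths that are currently "open", outermost at the bottom.
        Deque<Integer> indents = new ArrayDeque<>();

        for (String line : text.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines between groups
            }

            // Count leading whitespace characters (a tab counts as one).
            int indent = 0;
            while (indent < line.length() && Character.isWhitespace(line.charAt(indent))) {
                indent++;
            }

            // Close deeper levels, then open a new level if this line is indented further.
            while (!indents.isEmpty() && indents.peek() > indent) {
                indents.pop();
            }
            if (indents.isEmpty() || indents.peek() < indent) {
                indents.push(indent);
            }
            int level = indents.size() - 1;

            // Drop a leading "-" or "*" bullet marker, if present.
            String item = line.trim().replaceFirst("^[-*]\\s+", "");

            out.append("<item level=\"").append(level).append("\">")
               .append(escape(item))
               .append("</item>\n");
        }
        return out.append("</list>\n").toString();
    }

    /** Escapes the characters that are special in XML text content. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String sample = "Abc\nDef\n     Xyz\n     Ghi\n          123\n          456\n";
        System.out.print(toXml(sample));
    }
}
```

Levels are derived from the relative indent widths, so the exact number of spaces per level does not matter as long as deeper items are indented further than their parents.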
