Posted to users@pdfbox.apache.org by flywire <fl...@gmail.com> on 2021/08/23 23:28:00 UTC

PDF2MD - Paragraphs

https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf shows a
clear break between paragraphs. I'm on Win10 using:

java -jar pdfbox-app-2.0.24.jar ExtractText %1

Each line is extracted, but no blank line is inserted between paragraphs.
How can I insert one during text extraction?

I've read that paragraph detection is a best guess with edge cases. I've
never compiled Java before, so please take it steady.

Re: PDF2MD - Codeblocks

Posted by Tilman Hausherr <TH...@t-online.de>.

On 24.08.2021 at 01:55, flywire wrote:
> https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
> codeblocks identified by a change of font and no other fonts on those
> lines. I'd like to insert control codes before and after them while I'm
> extracting text.
>
> I'm on Win10 using:
>
> java -jar pdfbox-app-2.0.24.jar ExtractText %1
>
> Required code before and after codeblocks is: %newline%```%newline%
>
> I've never compiled java before so take it steady.
>

Each TextPosition object has its own font, so you could try to detect 
the font changes.
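
A minimal sketch of the font-change idea, kept free of PDFBox so it can be run on its own. It assumes the font name for each line has already been collected, for example from TextPosition.getFont().getName() inside an overridden PDFTextStripper.writeString(String, List<TextPosition>). The class CodeBlockFencer, the Line type, and the font names used with it are hypothetical and only illustrate the approach.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of PDFBox.
class CodeBlockFencer {

    // Three backticks, written as unicode escapes.
    static final String FENCE = "\u0060\u0060\u0060";

    /** One extracted line plus the font name seen on it (illustrative type). */
    static final class Line {
        final String fontName;
        final String text;

        Line(String fontName, String text) {
            this.fontName = fontName;
            this.text = text;
        }
    }

    /** Wraps each run of consecutive lines set in codeFont in fence lines. */
    static List<String> fence(List<Line> lines, String codeFont) {
        List<String> out = new ArrayList<>();
        boolean inCode = false;
        for (Line line : lines) {
            boolean isCode = codeFont.equals(line.fontName);
            if (isCode != inCode) {
                out.add(FENCE); // font changed: open or close the block
                inCode = isCode;
            }
            out.add(line.text);
        }
        if (inCode) {
            out.add(FENCE); // close a block that runs to the end of the input
        }
        return out;
    }
}
```

The fence is emitted on its own line, which matches the requested newline before and after it. Which font name marks a code block in a given PDF is something you would have to find out by printing the names first.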

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: PDF2MD - Codeblocks

Posted by Tilman Hausherr <TH...@t-online.de>.

On 24.08.2021 at 06:17, flywire wrote:
> With a bit of customisation, PDFBox should be able to parse pdf to md
> <https://www.markdownguide.org/cheat-sheet/>. This probably involves a
> process like PDFText2HTML.java
> <https://svn.apache.org/repos/asf/pdfbox/branches/2.0/tools/src/main/java/org/apache/pdfbox/tools/PDFText2HTML.java>,
> possibly just modifying that processor, but I'm open to advice.
>
> I can find tutorials on how to program in Java but I'd like to know the
> approach (how to go about it) with PDFBox. A lot of syntax is just matching
> patterns, an approach that lets me use leaflet.js without knowing js.
>
> Hopefully, any code given in an answer is explained clearly enough so I can
> understand it.
>

The problem is that this would require code changes. The command-line
utilities only cover mainstream requirements.

Some of your requirements (separating lines) can probably be met by 
using the methods of PDFTextStripper. Have a look at its javadoc. And 
yes, PDFText2HTML.java does some of this.

Tilman



Re: PDF2MD - Codeblocks

Posted by flywire <fl...@gmail.com>.

With a bit of customisation, PDFBox should be able to parse pdf to md
<https://www.markdownguide.org/cheat-sheet/>. This probably involves a
process like PDFText2HTML.java
<https://svn.apache.org/repos/asf/pdfbox/branches/2.0/tools/src/main/java/org/apache/pdfbox/tools/PDFText2HTML.java>,
possibly just modifying that processor, but I'm open to advice.

I can find tutorials on how to program in Java but I'd like to know the
approach (how to go about it) with PDFBox. A lot of syntax is just matching
patterns, an approach that lets me use leaflet.js without knowing js.

Hopefully, any code given in an answer is explained clearly enough so I can
understand it.

Re: PDF2MD - Codeblocks

Posted by Tilman Hausherr <TH...@t-online.de>.

Are you asking how to program in Java, or what is this about?

Tilman

On 24.08.2021 at 01:55, flywire wrote:
> https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
> codeblocks identified by a change of font and no other fonts on those
> lines. I'd like to insert control codes before and after them while I'm
> extracting text.
>
> I'm on Win10 using:
>
> java -jar pdfbox-app-2.0.24.jar ExtractText %1
>
> Required code before and after codeblocks is: %newline%```%newline%
>
> I've never compiled java before so take it steady.
>



Re: ExtractImages Ignoring Textboxes

Posted by Tilman Hausherr <TH...@t-online.de>.

On 24.08.2021 at 02:03, flywire wrote:
> https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
> textboxes which are extracted as images containing a solid black box. How
> can I ignore those text boxes while extracting images and not increment
> image number contained in the filename. They always occur as the last two
> images on page 1.
>
> I'm on Win10 using:
>
> java -jar pdfbox-app-2.0.24.jar ExtractImages %1
>
> I've never compiled java before so take it steady.
>
These images are fully transparent, so you'd have to detect that (check
the alpha channel).
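
A minimal sketch of such an alpha-channel check, using only the Java standard library. It assumes the image is already available as a BufferedImage (in PDFBox, PDImageXObject.getImage() returns one). The class name TransparencyCheck is hypothetical, and the ExtractImages command-line tool would not call it by itself; this would belong in custom extraction code that skips such images before numbering them.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: not part of PDFBox.
class TransparencyCheck {

    /** Returns true when every pixel of the image has an alpha value of 0. */
    static boolean isFullyTransparent(BufferedImage image) {
        if (!image.getColorModel().hasAlpha()) {
            return false; // no alpha channel, so nothing can be transparent
        }
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int alpha = image.getRGB(x, y) >>> 24; // top byte of ARGB
                if (alpha != 0) {
                    return false; // found a visible pixel
                }
            }
        }
        return true;
    }
}
```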

Tilman



ExtractImages Ignoring Textboxes

Posted by flywire <fl...@gmail.com>.

https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
textboxes which are extracted as images containing a solid black box. How
can I ignore those text boxes while extracting images, without incrementing
the image number contained in the filename? They always occur as the last
two images on page 1.

I'm on Win10 using:

java -jar pdfbox-app-2.0.24.jar ExtractImages %1

I've never compiled java before so take it steady.

PDF2MD - Codeblocks

Posted by flywire <fl...@gmail.com>.

https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
codeblocks identified by a change of font and no other fonts on those
lines. I'd like to insert control codes before and after them while I'm
extracting text.

I'm on Win10 using:

java -jar pdfbox-app-2.0.24.jar ExtractText %1

Required code before and after codeblocks is: %newline%```%newline%

I've never compiled java before so take it steady.

Re: PDF2MD - Images

Posted by flywire <fl...@gmail.com>.

Figures, tables, etc. often have a unique caption line, e.g. Figure N:
Description...

After extracting the text, I used this workaround to post-process the
Markdown files on Win10 with GNU sed (hence ^^, because cmd treats ^ as an
escape character):

======= display proposed changes
for %f in (*.md) do sed -n 's/\(^^Figure \)\([0-9]\+\)\(\: .*\)/\n![](%~nf-\2.png)\n\1\2\3/p' %f

======= change in-place
for %f in (*.md) do sed -i 's/\(^^Figure \)\([0-9]\+\)\(\: .*\)/\n![](%~nf-\2.png)\n\1\2\3/' %f
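
For anyone who would rather do this step in Java than in sed, here is a rough equivalent of the substitution using java.util.regex. The class and method names are hypothetical. Like the sed version, it assumes the figure number in the caption equals the image number in the extracted file name, which may not hold for every PDF.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: mirrors the sed substitution above.
class FigureLinker {

    // A caption line such as "Figure 3: Description", matched line by line.
    private static final Pattern CAPTION =
            Pattern.compile("^(Figure )([0-9]+)(: .*)$", Pattern.MULTILINE);

    /** Puts an image link on its own line above each caption line. */
    static String addImageLinks(String markdown, String baseName) {
        // $2 is the figure number, $0 is the whole original caption line.
        String replacement =
                "\n![](" + Matcher.quoteReplacement(baseName) + "-$2.png)\n$0";
        return CAPTION.matcher(markdown).replaceAll(replacement);
    }
}
```

Here baseName plays the role of %~nf, the file name without path or extension.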

Re: PDF2MD - Images

Posted by Tilman Hausherr <TH...@t-online.de>.

On 24.08.2021 at 01:43, flywire wrote:
> https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
> images and I'd like to replace them with code while I'm extracting text.
>
> I'm on Win10 using:
>
> java -jar pdfbox-app-2.0.24.jar ExtractText %1
>
> Required code is: %newline%[](%filename%-%image-no%.png)%newline%
>
> %filename% is without path or extension and %image-no% is the sequential
> number given by:
>
> java -jar pdfbox-app-2.0.24.jar ExtractImages %1
>
> I've never compiled java before so take it steady.
>

Likely tricky: you would have to write a modified text stripper that
generates pseudo-text when it hits an image.

Tilman



PDF2MD - Images

Posted by flywire <fl...@gmail.com>.

https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
images and I'd like to replace them with code while I'm extracting text.

I'm on Win10 using:

java -jar pdfbox-app-2.0.24.jar ExtractText %1

Required code is: %newline%[](%filename%-%image-no%.png)%newline%

%filename% is without path or extension and %image-no% is the sequential
number given by:

java -jar pdfbox-app-2.0.24.jar ExtractImages %1

I've never compiled java before so take it steady.