Posted to users@pdfbox.apache.org by flywire <fl...@gmail.com> on 2021/08/23 23:55:24 UTC
PDF2MD - Codeblocks
https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
codeblocks identified by a change of font and no other fonts on those
lines. I'd like to insert control codes before and after them while I'm
extracting text.
I'm on Win10 using:
java -jar pdfbox-app-2.0.24.jar ExtractText %1
Required code before and after codeblocks is: %newline%```%newline%
I've never compiled java before so take it steady.
Re: PDF2MD - Codeblocks
Posted by Tilman Hausherr <TH...@t-online.de>.
On 24.08.2021 at 01:55, flywire wrote:
> https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
> codeblocks identified by a change of font and no other fonts on those
> lines. I'd like to insert control codes before and after them while I'm
> extracting text.
>
> I'm on Win10 using:
>
> java -jar pdfbox-app-2.0.24.jar ExtractText %1
>
> Required code before and after codeblocks is: %newline%```%newline%
>
> I've never compiled java before so take it steady.
>
Each TextPosition object has its own font, so you could try to detect
the font changes.
Tilman
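A minimal sketch of the font-change idea, kept free of PDFBox so that it compiles on its own. The class name CodeFenceTracker, the method names, and the "Courier" font hint are assumptions made for illustration; the real name of the code font in the PDF has to be checked. In PDFBox 2.0.x the place to call it would probably be an override of PDFTextStripper.writeString(String, List<TextPosition>), passing in textPositions.get(0).getFont().getName().

```java
/**
 * Tracks whether the text currently being written is in the "code" font
 * and emits a Markdown fence whenever that state flips.
 */
final class CodeFenceTracker {
    // Fence on its own line, as asked for in the question.
    private static final String FENCE = "\n```\n";

    // Assumption: code lines use a font whose name contains this hint.
    // Embedded fonts often carry a subset prefix such as "ABCDEF+",
    // so the check uses contains() rather than equals().
    private final String codeFontHint;
    private boolean inCode = false;

    CodeFenceTracker(String codeFontHint) {
        this.codeFontHint = codeFontHint;
    }

    /** Returns the text, preceded by a fence if the font just switched. */
    String mark(String fontName, String text) {
        boolean isCode = fontName != null && fontName.contains(codeFontHint);
        if (isCode == inCode) {
            return text;          // same kind of font as before: no fence
        }
        inCode = isCode;          // entering or leaving a code block
        return FENCE + text;
    }

    /** Call once at the end of the document to close an open block. */
    String finish() {
        if (!inCode) {
            return "";
        }
        inCode = false;
        return FENCE;
    }
}
```

A stripper subclass would keep one tracker per document and write the result of mark(...) instead of the raw text. Because this needs a subclass, it has to be compiled against the PDFBox jar and cannot be done with the ExtractText command line alone.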
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
Re: PDF2MD - Codeblocks
Posted by Tilman Hausherr <TH...@t-online.de>.
On 24.08.2021 at 06:17, flywire wrote:
> With a bit of customisation, PDFBox should be able to parse pdf to md
> <https://www.markdownguide.org/cheat-sheet/>. This probably involves a
> process like PDFText2HTML.java
> <https://svn.apache.org/repos/asf/pdfbox/branches/2.0/tools/src/main/java/org/apache/pdfbox/tools/PDFText2HTML.java>,
> possibly just modifying that processor, but I'm open to advice.
>
> I can find tutorials on how to program in Java but I'd like to know the
> approach (how to go about it) with PDFBox. A lot of syntax is just matching
> patterns, an approach that lets me use leaflet.js without knowing js.
>
> Hopefully, any code given in an answer is explained clearly enough so I can
> understand it.
>
The problem is that this would require code changes. The command line
utilities are for some mainstream requirements.
Some of your requirements (separating lines) can probably be met by
using the methods of PDFTextStripper. Have a look at its javadoc. And
yes, PDFText2HTML.java does some of this.
Tilman
Re: PDF2MD - Codeblocks
Posted by flywire <fl...@gmail.com>.
With a bit of customisation, PDFBox should be able to parse pdf to md
<https://www.markdownguide.org/cheat-sheet/>. This probably involves a
process like PDFText2HTML.java
<https://svn.apache.org/repos/asf/pdfbox/branches/2.0/tools/src/main/java/org/apache/pdfbox/tools/PDFText2HTML.java>,
possibly just modifying that processor, but I'm open to advice.
I can find tutorials on how to program in Java but I'd like to know the
approach (how to go about it) with PDFBox. A lot of syntax is just matching
patterns, an approach that lets me use leaflet.js without knowing js.
Hopefully, any code given in an answer is explained clearly enough so I can
understand it.
Re: PDF2MD - Codeblocks
Posted by Tilman Hausherr <TH...@t-online.de>.
Are you asking how to program in Java, or what is this about?
Tilman
On 24.08.2021 at 01:55, flywire wrote:
> https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
> codeblocks identified by a change of font and no other fonts on those
> lines. I'd like to insert control codes before and after them while I'm
> extracting text.
>
> I'm on Win10 using:
>
> java -jar pdfbox-app-2.0.24.jar ExtractText %1
>
> Required code before and after codeblocks is: %newline%```%newline%
>
> I've never compiled java before so take it steady.
>
Re: ExtractImages Ignoring Textboxes
Posted by Tilman Hausherr <TH...@t-online.de>.
On 24.08.2021 at 02:03, flywire wrote:
> https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
> textboxes which are extracted as images containing a solid black box. How
> can I ignore those text boxes while extracting images and not increment
> image number contained in the filename. They always occur as the last two
> images on page 1.
>
> I'm on Win10 using:
>
> java -jar pdfbox-app-2.0.24.jar ExtractImages %1
>
> I've never compiled java before so take it steady.
>
These images are fully transparent, so you'd have to detect that. (check
the alpha channel)
Tilman
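A minimal sketch of such an alpha-channel check, using only the JDK. The class and method names are made up for illustration. It assumes the image is already available as a java.awt.image.BufferedImage; in PDFBox 2.0.x that would probably come from PDImageXObject.getImage(), which as far as I know returns the image with any mask applied.

```java
import java.awt.image.BufferedImage;

final class TransparencyCheck {

    /** True if the image has an alpha channel and every pixel's alpha is 0. */
    static boolean isFullyTransparent(BufferedImage image) {
        if (!image.getColorModel().hasAlpha()) {
            return false; // no alpha channel, so every pixel is opaque
        }
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                // getRGB returns ARGB; the top byte is the alpha value
                int alpha = image.getRGB(x, y) >>> 24;
                if (alpha != 0) {
                    return false; // found a visible pixel
                }
            }
        }
        return true;
    }
}
```

A custom extractor could call isFullyTransparent(...) before writing each file and only increment its image counter when the result is false. The stock ExtractImages command has no such option as far as this thread shows, so this would again mean code changes.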
ExtractImages Ignoring Textboxes
Posted by flywire <fl...@gmail.com>.
https://fivedots.coe.psu.ac.th/~ad/jlop/chaps/46.%20Addons.pdf contains
textboxes which are extracted as images containing a solid black box. How
can I ignore those text boxes while extracting images and not increment the
image number contained in the filename? They always occur as the last two
images on page 1.
I'm on Win10 using:
java -jar pdfbox-app-2.0.24.jar ExtractImages %1
I've never compiled java before so take it steady.