Posted to user@tika.apache.org by Sergey Beryozkin <sb...@gmail.com> on 2019/07/11 17:25:27 UTC

How to parse PDF more effectively

Hi

I've used Tika to parse this invoice PDF:

https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf

(AutoDetectParser, ToTextContentHandler); see below for what is returned.
The numbers such as (1), (2) were added by me; they show the preferred order
(approximately).

Is it possible to give Tika a hint about how to order the reported content?

Thanks Sergey

PDF Invoice Example
Invoice

(5)Payment is due within 30 days from date of invoice. Late payment is
subject to fees of 5% per month.

Thanks for choosing DEMO - Sliced Invoices | admin@slicedinvoices.com

Page 1/1

(2)From:

DEMO - Sliced Invoices

Suite 5A-1204

123 Somewhere Street

Your City AZ 12345

admin@slicedinvoices.com

(1)Invoice Number INV-3337

Order Number 12345

Invoice Date January 25, 2016

Due Date January 31, 2016

Total Due $93.50

(3)To:

Test Business

123 Somewhere St

Melbourne, VIC 3000

test@test.com

(4) Hrs/Qty Service Rate/Price Adjust Sub Total

1.00
Web Design
This is a sample description...

$85.00 0.00% $85.00

Sub Total $85.00

Tax $8.50

Total $93.50

(5) ANZ Bank

ACC # 1234 1234

BSB # 4321 432 Pa
id

Re: How to parse PDF more effectively

Posted by Sergey Beryozkin <sb...@gmail.com>.
If it is not possible yet, then please create an issue and assign it to me
(I can do that myself as well); I will take care of it a bit later on.
(I wouldn't mind having Tika config supported in JSON format as well.)

Sergey

On Wed, Jul 17, 2019 at 5:20 PM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Hi Tim,
>
> How does one configure PDFParserConfig in tika-config.xml? Maybe as one
> of the PDFParser properties?
> PDFParser.setSortByPosition (and other simple setters) are deprecated, so
> setting 'sortByPosition' as one of the PDFParser properties goes via the
> deprecated call path (probably not a big deal though :-))
> I also looked at the source and I'm still not sure which ContentHandler
> you used to get the HTML tags added.
> (I may experiment with a custom one sitting on top of it, maybe adding the
> table tags...)
> Sergey

Re: How to parse PDF more effectively

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim
This will help for sure, will try after my PTO
Thanks Sergey

On Thu 18 Jul 2019, 13:14 Tim Allison, <ta...@apache.org> wrote:

> Hi Sergey,
>
>   Sorry, I thought I hit send on this yesterday...
>
>   In reverse order, I used the ToXMLContentHandler:
>
> https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L208
>
>   For configuring via tika-config.xml, see, e.g.:
>
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config.xml
>
>   The trick, though, is to exclude the PDFParser from the default
> parser and then add the custom configured one back in:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
>
>   Let me know if you have any surprises.
>
>           Best,
>
>                     Tim

Re: How to parse PDF more effectively

Posted by Tim Allison <ta...@apache.org>.
Hi Sergey,

  Sorry, I thought I hit send on this yesterday...

  In reverse order, I used the ToXMLContentHandler:
https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaTest.java#L208

  For configuring via tika-config.xml, see, e.g.:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config.xml

  The trick, though, is to exclude the PDFParser from the default
parser and then add the custom configured one back in:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066

  Let me know if you have any surprises.

          Best,

                    Tim

On Wed, Jul 17, 2019 at 12:20 PM Sergey Beryozkin <sb...@gmail.com> wrote:
>
> Hi Tim,
>
> How does one configure PDFParserConfig in tika-config.xml? Maybe as one of the PDFParser properties?
> PDFParser.setSortByPosition (and other simple setters) are deprecated, so setting 'sortByPosition' as one of the PDFParser properties goes via the deprecated call path (probably not a big deal though :-))
> I also looked at the source and I'm still not sure which ContentHandler you used to get the HTML tags added.
> (I may experiment with a custom one sitting on top of it, maybe adding the table tags...)
> Sergey

Re: How to parse PDF more effectively

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim,

How does one configure PDFParserConfig in tika-config.xml? Maybe as one
of the PDFParser properties?
PDFParser.setSortByPosition (and other simple setters) are deprecated, so
setting 'sortByPosition' as one of the PDFParser properties goes via the
deprecated call path (probably not a big deal though :-))
I also looked at the source and I'm still not sure which ContentHandler
you used to get the HTML tags added.
(I may experiment with a custom one sitting on top of it, maybe adding the
table tags...)
Sergey

On Thu, Jul 11, 2019 at 9:52 PM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Hi Tim
>
> Thanks, I'm going to experiment with different, sufficiently complex PDFs
> in order to figure out how to enhance the Quarkus Tika extension, what to
> let users customize, etc. (I'll link to it in a follow-up email).
> Your output looks better :-). Which ContentHandler did you use?
>
> Sergey

Re: How to parse PDF more effectively

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim

Thanks, I'm going to experiment with different, sufficiently complex PDFs
in order to figure out how to enhance the Quarkus Tika extension, what to
let users customize, etc. (I'll link to it in a follow-up email).
Your output looks better :-). Which ContentHandler did you use?

Sergey

On Thu, Jul 11, 2019 at 7:23 PM Tim Allison <ta...@apache.org> wrote:

> Might not need to break out the neural nets just yet...try turning on
> sortByPosition via the PDFParserConfig and/or tika-config.xml.
>
> This is what you get:
>
>
>
> <title>PDF Invoice Example</title>
> </head>
> <body><div class="page"><p />
> <p>Invoice
> </p>
> <p>From: Invoice Number INV-3337
> </p>
> <p>DEMO - Sliced Invoices Order Number 12345
> Suite 5A-1204 Invoice Date January 25, 2016
> 123 Somewhere Street Due Date January 31, 2016
> Your City AZ 12345
> admin@slicedinvoices.com Total Due $93.50
> </p>
> <p>To:
> Test Business
> 123 Somewhere St
> Melbourne, VIC 3000
> test@test.com
> </p>
> <p>Hrs/Qty Service Rate/Price Adjust Sub Total
> </p>
> <p>1.00 Web DesignThis is a sample description... $85.00 0.00% $85.00
> </p>
> <p>Pa
> idSub Total $85.00
> </p>
> <p>Tax $8.50
> Total $93.50
> </p>
> <p>ANZ Bank
> ACC # 1234 1234
> BSB # 4321 432
> </p>
> <p>Payment is due within 30 days from date of invoice. Late payment is
> subject to fees of 5% per month.
> Thanks for choosing DEMO - Sliced Invoices | admin@slicedinvoices.com
> Page 1/1</p>
> <p />
> <div class="annotation"><a
> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
> </a></div>
> <div class="annotation"><a
> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
> </a></div>
> <div class="annotation"><a
> href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo
> </a></div>
> <div class="annotation"><a
> href="mailto:admin@slicedinvoices.com">mailto:admin@slicedinvoices.com
> </a></div>
> </div>
> </body></html>

Re: How to parse PDF more effectively

Posted by Tim Allison <ta...@apache.org>.
Might not need to break out the neural nets just yet...try turning on
sortByPosition via the PDFParserConfig and/or tika-config.xml.

This is what you get:



<title>PDF Invoice Example</title>
</head>
<body><div class="page"><p />
<p>Invoice
</p>
<p>From: Invoice Number INV-3337
</p>
<p>DEMO - Sliced Invoices Order Number 12345
Suite 5A-1204 Invoice Date January 25, 2016
123 Somewhere Street Due Date January 31, 2016
Your City AZ 12345
admin@slicedinvoices.com Total Due $93.50
</p>
<p>To:
Test Business
123 Somewhere St
Melbourne, VIC 3000
test@test.com
</p>
<p>Hrs/Qty Service Rate/Price Adjust Sub Total
</p>
<p>1.00 Web DesignThis is a sample description... $85.00 0.00% $85.00
</p>
<p>Pa
idSub Total $85.00
</p>
<p>Tax $8.50
Total $93.50
</p>
<p>ANZ Bank
ACC # 1234 1234
BSB # 4321 432
</p>
<p>Payment is due within 30 days from date of invoice. Late payment is
subject to fees of 5% per month.
Thanks for choosing DEMO - Sliced Invoices | admin@slicedinvoices.com
Page 1/1</p>
<p />
<div class="annotation"><a
href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
<div class="annotation"><a
href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
<div class="annotation"><a
href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a></div>
<div class="annotation"><a
href="mailto:admin@slicedinvoices.com">mailto:admin@slicedinvoices.com</a></div>
</div>
</body></html>
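
For readers wondering why the lines above come out interleaved ("From:" next to "Invoice Number INV-3337"), here is a minimal sketch of position-based ordering. It is not Tika or PDFBox code: the class name, the Fragment type and the coordinates are all invented for illustration, and a real implementation would compare baselines with a tolerance rather than exact equality. In Tika itself the switch is PDFParserConfig.setSortByPosition(true), with the config object handed to the parser through the ParseContext.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; not Tika or PDFBox code. */
class PositionSortSketch {

    /** A text fragment with the page coordinates of its start (y grows downwards). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Orders fragments top-to-bottom, then left-to-right, and joins
     * fragments that share the same y value into one output line.
     */
    static List<String> linesByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                              .thenComparingDouble(f -> f.x));

        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        double currentY = Double.NaN;
        for (Fragment f : sorted) {
            // A change in y means a new visual line: flush what we have.
            if (line.length() > 0 && f.y != currentY) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(f.text);
            currentY = f.y;
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Fragments in the order a PDF might store them: left column first,
        // then right column. The coordinates are made up.
        List<Fragment> stored = Arrays.asList(
            new Fragment("From:", 50, 100),
            new Fragment("DEMO - Sliced Invoices", 50, 120),
            new Fragment("Invoice Number INV-3337", 350, 100),
            new Fragment("Order Number 12345", 350, 120));
        for (String s : linesByPosition(stored)) {
            System.out.println(s);
        }
    }
}
```

With these made-up coordinates the two columns are merged row by row, which is the same effect visible in the output above: reading order follows the page layout rather than the order in which the text was stored.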


Re: [EXTERNAL] How to parse PDF more effectively

Posted by Ralph Soika <ra...@imixs.com>.
Hi,

I have also been looking for some time for a solution to interpret the
text content of a PDF invoice, but I don't think there's an easy
solution for now. Deep learning and neural networks are too complex to
quickly categorize the contents of an invoice. Cloud solutions such as
Rossum <https://rossum.ai/> do this quite well, but all data is sent to
AWS first, which is quite questionable for business data....


===
Ralph

On 11.07.19 19:26, Chris Mattmann wrote:
>
> Tabula PDF is something I have been looking at for this as well as doing
> like Deep Neural Nets…

Re: [EXTERNAL] How to parse PDF more effectively

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Chris

Interesting. I was wondering if it would make sense to add a
TikaTabularPDFParser wrapper; for example, it would accumulate a given
table's headers, report them as a single ContentHandler line, etc...
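
One possible reading of this idea, as a rough sketch only: a SAX handler that buffers the text inside each <p> element and reports it as a single line. It uses only the JDK's SAX classes; the name ParagraphLineHandler is hypothetical, and nothing here detects tables, which a real wrapper would still have to do.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch; the class name is invented for illustration. */
class ParagraphLineHandler extends DefaultHandler {

    private final List<String> lines = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inParagraph;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            inParagraph = true;
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so accumulate until </p>.
        if (inParagraph) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName)) {
            inParagraph = false;
            // Collapse line breaks and runs of spaces into single spaces.
            String line = current.toString().trim().replaceAll("\\s+", " ");
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
    }

    List<String> getLines() {
        return lines;
    }

    /** Parses a well-formed XHTML string and returns one line per non-empty paragraph. */
    static List<String> linesOf(String xhtml) {
        ParagraphLineHandler handler = new ParagraphLineHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
        return handler.getLines();
    }

    public static void main(String[] args) {
        String sample = "<body><p>Hrs/Qty Service Rate/Price Adjust Sub Total\n</p>"
            + "<p>Tax $8.50\nTotal $93.50\n</p><p /></body>";
        for (String s : linesOf(sample)) {
            System.out.println(s);
        }
    }
}
```

Because the handler is an ordinary org.xml.sax.ContentHandler, it could in principle be passed to a Tika parser in place of ToTextContentHandler, though that has not been tried here.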

Sergey

On Thu, Jul 11, 2019 at 6:26 PM Chris Mattmann <ma...@apache.org> wrote:

> Tabula PDF is something I have been looking at for this as well as doing
> like Deep Neural Nets…

Re: [EXTERNAL] How to parse PDF more effectively

Posted by Chris Mattmann <ma...@apache.org>.
Tabula PDF is something I have been looking at for this as well as doing
like Deep Neural Nets…

 

 

 
