Posted to user@tika.apache.org by Chris Mattmann <ma...@apache.org> on 2019/07/11 17:26:42 UTC

Re: [EXTERNAL] How to parse PDF more effectively

Tabula PDF is something I have been looking at for this, as well as
trying things like deep neural nets…

From: Sergey Beryozkin <sb...@gmail.com>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Thursday, July 11, 2019 at 10:25 AM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: [EXTERNAL] How to parse PDF more effectively

 

Hi

I've used Tika to parse this invoice PDF:

https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf

(AutoDetectParser, ToTextContentHandler); see below for what is returned.

I added the numbers like (1), (2) myself; they mark the preferred order (approximately).

Is it possible to hint to Tika somehow how to report the content?

Thanks, Sergey

 

PDF Invoice Example
Invoice

(5)Payment is due within 30 days from date of invoice. Late payment is subject to fees of 5% per month.

Thanks for choosing DEMO - Sliced Invoices | admin@slicedinvoices.com

Page 1/1

(2)From:

DEMO - Sliced Invoices

Suite 5A-1204

123 Somewhere Street

Your City AZ 12345

admin@slicedinvoices.com

(1)Invoice Number INV-3337

Order Number 12345

Invoice Date January 25, 2016

Due Date January 31, 2016

Total Due $93.50

(3)To:

Test Business

123 Somewhere St

Melbourne, VIC 3000

test@test.com

(4) Hrs/Qty Service Rate/Price Adjust Sub Total

1.00
Web Design
This is a sample description...

$85.00 0.00% $85.00

Sub Total $85.00

Tax $8.50

Total $93.50

(5) ANZ Bank

ACC # 1234 1234

BSB # 4321 432 Pa
id


Re: [EXTERNAL] How to parse PDF more effectively

Posted by Ralph Soika <ra...@imixs.com>.
Hi,

I have also been looking for some time for a way to interpret the
text content of a PDF invoice, but I don't think there's an easy
solution for now. Deep learning and neural networks are too complex to
quickly categorize the contents of an invoice. Cloud solutions such as
Rossum <https://rossum.ai/> do this quite well, but all data is sent to
AWS first, which is quite questionable for business data....


===
Ralph


Re: [EXTERNAL] How to parse PDF more effectively

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Chris

Interesting. I was wondering if it would make sense to add a
TikaTabularPDFParser wrapper; for example, it would accumulate a given
table's headers, report them as a single ContentHandler line, etc.
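
To make the wrapper idea a bit more concrete, here is a rough sketch of
the "accumulate headers, report them as one line" part. It only uses the
JDK's SAX classes. Everything named here (HeaderRowCollapsingHandler,
collapseHeaderRow, the " | " separator) is made up for illustration and
is not an existing Tika or Tabula API. It also assumes that some
table-aware parser upstream emits XHTML `tr`/`th` events, which is an
assumption and not something the thread confirms.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical sketch (not part of Tika): wraps a ContentHandler, buffers the
 * text of <th> cells, and forwards each header row as one line of characters.
 */
final class HeaderRowCollapsingHandler extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private static final String SEPARATOR = " | "; // arbitrary choice for this sketch

    private final ContentHandler delegate;
    private final List<String> headerCells = new ArrayList<>();
    private StringBuilder currentCell; // non-null only while inside a <th>

    HeaderRowCollapsingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("th".equals(localName)) {
            currentCell = new StringBuilder(); // start buffering this header cell
        } else {
            delegate.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (currentCell != null) {
            currentCell.append(ch, start, length); // hold header text back for now
        } else {
            delegate.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("th".equals(localName)) {
            headerCells.add(currentCell.toString().trim());
            currentCell = null;
            return;
        }
        if ("tr".equals(localName) && !headerCells.isEmpty()) {
            // Emit all buffered header cells as a single line before closing the row.
            String line = String.join(SEPARATOR, headerCells) + "\n";
            delegate.characters(line.toCharArray(), 0, line.length());
            headerCells.clear();
        }
        delegate.endElement(uri, localName, qName);
    }

    /** Demo helper: feeds one header row through the wrapper and returns what the delegate saw. */
    static String collapseHeaderRow(String... cells) {
        StringBuilder received = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                received.append(ch, start, length);
            }
        };
        HeaderRowCollapsingHandler handler = new HeaderRowCollapsingHandler(sink);
        Attributes none = new AttributesImpl();
        try {
            handler.startElement(XHTML, "tr", "tr", none);
            for (String cell : cells) {
                handler.startElement(XHTML, "th", "th", none);
                handler.characters(cell.toCharArray(), 0, cell.length());
                handler.endElement(XHTML, "th", "th");
            }
            handler.endElement(XHTML, "tr", "tr");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return received.toString();
    }
}
```

In a real setup the delegate would presumably be something like the
ToTextContentHandler used above, and the table events would have to come
from a Tabula-style extraction step. As far as I know, plain PDF text
extraction does not produce table markup on its own, so that part is the
actual work.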

Sergey
