Posted to dev@whimsical.apache.org by sebb <se...@gmail.com> on 2019/11/26 15:16:06 UTC

How to plug PDF scraping into Secretary workbench

I have committed some code to extract the form data from ICLAs.

For example:

https://whimsy.apache.org/secretary/icla-parse/yyyymm/hash/icla.pdf

It would be useful if this could somehow be plugged into the workbench,
for example when a PDF is classified as an ICLA.

However, I cannot work out how to do this.

S.

Re: How to plug PDF scraping into Secretary workbench

Posted by sebb <se...@gmail.com>.
On Tue, 26 Nov 2019 at 15:21, Dave Fisher <wa...@comcast.net> wrote:

> Have you looked at Apache Tika?
>
>
[This is tangential to my query.
The Whimsy host does not currently include a JRE, so I did not look at Java
solutions.
The code now exists, and works well enough.]

I would still have the same issue with Tika: how to wire it up in the
Secretary workbench?


Re: How to plug PDF scraping into Secretary workbench

Posted by Dave Fisher <wa...@comcast.net>.
Have you looked at Apache Tika?


Re: How to plug PDF scraping into Secretary workbench

Posted by sebb <se...@gmail.com>.
Thanks a lot!

Re: How to plug PDF scraping into Secretary workbench

Posted by Sam Ruby <ru...@intertwingly.net>.
On Tue, Nov 26, 2019 at 10:16 AM sebb <se...@gmail.com> wrote:
>
> I have committed some code to extract the form data from ICLAs.
>
> For example:
>
> https://whimsy.apache.org/secretary/icla-parse/yyyymm/hash/icla.pdf
>
> It would be useful if this could somehow be plugged into the workbench.
> For example when a PDF is classified as an ICLA.
>
> However I cannot work out how to do this.

The first thing to understand is that inside the workbench, the host
is not involved except when explicitly asked.  For example, when you
click the icla button in the Categorize tab, there is no server
action.  Everything happens on the client.

Now look at whimsy/www/secretary/workbench/views/forms/icla.js.rb.  In
there is a method called "mounted" that is invoked whenever this form
is displayed.  It currently sets a few form fields.

At the bottom of this method you will want to add code that does a
POST request to the server.  These days it is safe to assume that the
browser implements the fetch function, but if you like you can use the
HTTP.post method that is defined (or even jQuery.ajax).  The arguments
you will need to pass to the server are the message
(window.parent.location.pathname) and the attachment (@@selected).
You can see that these values are added as hidden fields to the
existing form.
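
As a rough illustration of the request described above: the real call
would live in the "mounted" method of icla.js.rb and run in the browser
(via fetch or HTTP.post), so the plain-Ruby sketch below only shows the
shape of the POST.  The action name "icla_parse", the JSON key names,
the example.invalid host and the placeholder values are all assumptions
made for illustration; none of them come from the workbench code.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build (but do not send) a JSON POST carrying the two values named in
# the thread: the message and the selected attachment.
# "actions/icla_parse" is a hypothetical action name.
def build_parse_request(base_url, message, selected)
  uri = URI.join(base_url, 'actions/icla_parse')
  request = Net::HTTP::Post.new(uri, { 'Content-Type' => 'application/json' })
  request.body = JSON.generate(message: message, selected: selected)
  request
end

request = build_parse_request(
  'https://example.invalid/secretary/workbench/', # placeholder host
  '/yyyymm/hash/',                                # placeholder for the message path
  'icla.pdf'                                      # placeholder for @@selected
)

puts request.path
puts request.body
```

In the browser the same two values would simply be passed as the body
of the fetch or HTTP.post call.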

The parsing of the PDF will be implemented in a new file in the
whimsy/www/secretary/workbench/views/actions directory.  If your
post request is to /actions/name, then views/actions/name will be
invoked.  If you need something different (and you likely don't), the
routing of requests is done in server.rb.
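
The naming convention above can be pictured with a tiny helper.  This
is only a sketch of the mapping from request path to view name; it is
not the code in server.rb, which the thread does not show, and the
helper name and the "icla_parse" action are hypothetical.

```ruby
# Hypothetical helper: map a POST path of the form /actions/name to the
# view that would handle it.  Returns nil for any other path.
def action_view_for(request_path)
  match = %r{\A/actions/(\w+)\z}.match(request_path)
  match && File.join('views', 'actions', match[1])
end

puts action_view_for('/actions/icla_parse')
```

Restricting the name to word characters keeps a request from naming
anything outside views/actions, which seems a sensible property for any
such mapping.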

Use the arguments to get the attachment, parse the PDF and construct a
JSON object to be returned.  You can look at other files in this
directory to see how you do this (e.g., rotate_attachment).   It is as
easy as message=...; selected=...; File.read(selected.path).  The last
line of the file defines the object to be returned.
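
A minimal sketch of what such an action might look like, under several
assumptions: parse_icla_form is a stand-in for the committed extraction
code (not shown in this thread), the directory layout used to find the
attachment is invented for the example, and the key names in the
returned object are hypothetical.  The existing actions (such as
rotate_attachment) remain the reference for how the message and
attachment are really looked up.

```ruby
require 'json'
require 'tmpdir'
require 'fileutils'

# Stand-in for the committed ICLA extraction code, which the thread does
# not show.  It only reports the size so that the sketch runs anywhere.
def parse_icla_form(pdf_bytes)
  { 'bytes' => pdf_bytes.bytesize }
end

# Hypothetical action body.  The layout mail_root/message/selected is an
# assumption for illustration, not the workbench's real storage scheme.
def icla_parse_action(params, mail_root)
  path = File.join(mail_root, params.fetch('message'), params.fetch('selected'))
  return { 'error' => 'attachment not found' } unless File.file?(path)

  # The last expression is the object that would be returned as JSON.
  { 'parsed' => parse_icla_form(File.binread(path)) }
end

# Exercise the sketch against a throwaway directory.
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'yyyymm', 'hash'))
  File.binwrite(File.join(root, 'yyyymm', 'hash', 'icla.pdf'), '%PDF-1.4 placeholder')
  params = { 'message' => 'yyyymm/hash', 'selected' => 'icla.pdf' }
  puts JSON.generate(icla_parse_action(params, root))
end
```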

Back on the client side, use the JSON object as you like: setting
fields will cause them to be updated in the form.  For best results,
disable the input fields that you expect to be setting when you issue
the POST request and re-enable them when you get the response - that
will ensure that the secretary cannot start typing and then have
their work overwritten.  You could use the existing @filed variable
for this, but it would be clearer if you defined a new @disabled
variable and changed the fields that may be overwritten from
specifying "disabled: @filed" to "disabled: @filed or @disabled".
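
The disable/re-enable sequence can be modelled in a few lines of plain
Ruby.  This is a sketch of the state handling only, not the form
component itself: the class, its method names and the "pubname" field
are hypothetical, while @filed and @disabled follow the names used
above.

```ruby
# Hypothetical model of the form state around the parse request.
class IclaFormState
  attr_reader :fields

  def initialize
    @filed = false     # existing flag mentioned in the thread
    @disabled = false  # new flag suggested in the thread
    @fields = {}
  end

  # Value for the "disabled:" attribute of fields the parser may overwrite.
  def input_disabled?
    @filed || @disabled
  end

  # Call just before issuing the POST request.
  def begin_parse
    @disabled = true
  end

  # Call when the response arrives: copy the parsed values, then re-enable.
  def finish_parse(parsed)
    @fields.merge!(parsed)
    @disabled = false
  end
end

state = IclaFormState.new
state.begin_parse
puts state.input_disabled?
state.finish_parse('pubname' => 'Example Name')
puts state.input_disabled?
```

Keeping @disabled separate from @filed means that re-enabling the
inputs after the response cannot accidentally undo the "already filed"
state.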

- Sam Ruby