Posted to user@nutch.apache.org by John Dhabolt <my...@yahoo.com> on 2013/02/13 14:53:50 UTC

How do I pass a password to Tika from Nutch for encrypted PDFs?

Hi,

We have PDFs we need to crawl that are password protected. I don't see a way to pass this password to Tika. Apparently prior to Tika 1.1 the password would have been passed in Tika metadata. In Tika 1.1 and later, they've added a new interface, PasswordProvider, which is set on the ParseContext and exposes a getPassword method. Is either of these mechanisms available to Nutch 1.6 through a property setting?

Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

Posted by John Dhabolt <my...@yahoo.com>.
Just to touch on our use case: the search is for a public financial website whose PDFs are "locked" (password protected) so that users cannot change the financial information in them. They're not password protected to limit access, only to limit the ability to modify them with tools like Acrobat. Since they are all created by the financial company, they also share a single password, so this is a fairly simple use case.

Thanks for the discussion!

John


________________________________
 From: Jorge Luis Betancourt Gonzalez <jl...@uci.cu>
To: user@nutch.apache.org 
Sent: Wednesday, February 13, 2013 6:13 PM
Subject: Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
 
That's precisely my point. I think the modification should support regular expressions when specifying which files a password applies to; that would be a good addition to Nutch.

----- Original Message -----
From: "Tejas Patil" <te...@gmail.com>
To: user@nutch.apache.org
Sent: Wednesday, February 13, 2013 16:54:58
Subject: Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

Absolutely. Normally crawlers are expected to gather pages which are
publicly accessible. On the internet or an intranet, if a PDF file is
protected, then it is expected that it's only for a small subset of users
who know the password, and so it should not pop up in search results. From
an information security perspective, it's fair if the crawler doesn't parse
these files.

Also, the percentage of such files relative to normal pages is small. The
scenario of people crawling where a majority of PDF files are protected
is rare. If that happens, it makes sense to assume that they know the files
and their corresponding passwords beforehand. If the password is common,
say "xyx.com/docs/pages/abc/*" has the same password for all PDF files, then
a facility to provide a pattern would be convenient instead of listing
every URL of that host.

Thanks,
Tejas Patil


On Wed, Feb 13, 2013 at 12:57 PM, Jorge Luis Betancourt Gonzalez <
jlbetancourt@uci.cu> wrote:

> I get that, but it would be really tedious work to list a password for
> each PDF file that will be crawled, don't you think?
>
> ----- Original Message -----
> From: "Tejas Patil" <te...@gmail.com>
> To: user@nutch.apache.org
> Sent: Wednesday, February 13, 2013 14:03:21
> Subject: Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
>
> There can be PDF files with the same name on different hosts, so using the
> URL would be better than using the name. All this info can be in an XML
> file which will be read by the PDF plugin.
>
> Thanks,
> Tejas Patil
>
>
> On Wed, Feb 13, 2013 at 10:35 AM, Jorge Luis Betancourt Gonzalez <
> jlbetancourt@uci.cu> wrote:
>
> > What would be a good way of specifying which password goes with which
> > PDF file: by full URI, by filename, or something else?
> >
> > ----- Original Message -----
> > From: "Julien Nioche" <li...@gmail.com>
> > To: user@nutch.apache.org, "John Dhabolt" <my...@yahoo.com>
> > Sent: Wednesday, February 13, 2013 13:04:27
> > Subject: Re: How do I pass a password to Tika from Nutch for encrypted PDFs?
> > Hi John,
> >
> > Currently not, but it should be relatively straightforward to modify
> > parse-tika to do so, and it would be a nice contribution to Nutch.
> >
> > Julien
> >
> > On 13 February 2013 13:53, John Dhabolt <my...@yahoo.com> wrote:
> >
> > > Hi,
> > >
> > > We have PDFs we need to crawl that are password protected. I don't see
> > > a way to pass this password to Tika. Apparently prior to Tika 1.1 the
> > > password would have been passed in Tika metadata. In Tika 1.1 and
> > > later, they've added a new interface, PasswordProvider, which is set on
> > > the ParseContext and exposes a getPassword method. Is either of these
> > > mechanisms available to Nutch 1.6 through a property setting?
> > >
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>

Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

Posted by Jorge Luis Betancourt Gonzalez <jl...@uci.cu>.
That's precisely my point. I think the modification should support regular expressions when specifying which files a password applies to; that would be a good addition to Nutch.


Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

Posted by Tejas Patil <te...@gmail.com>.
Absolutely. Normally crawlers are expected to gather pages which are
publicly accessible. On the internet or an intranet, if a PDF file is
protected, then it is expected that it's only for a small subset of users
who know the password, and so it should not pop up in search results. From
an information security perspective, it's fair if the crawler doesn't parse
these files.

Also, the percentage of such files relative to normal pages is small. The
scenario of people crawling where a majority of PDF files are protected
is rare. If that happens, it makes sense to assume that they know the files
and their corresponding passwords beforehand. If the password is common,
say "xyx.com/docs/pages/abc/*" has the same password for all PDF files, then
a facility to provide a pattern would be convenient instead of listing
every URL of that host.

Thanks,
Tejas Patil
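
As a rough illustration of the pattern idea above (this is not existing Nutch functionality), the following Java sketch maps URL regular expressions to passwords and returns the password of the first rule that matches. The class name UrlPasswordRules, its methods, and the sample password are all hypothetical. How such a lookup would be wired into parse-tika, for instance behind Tika's PasswordProvider, is left open here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch or Tika: maps URL patterns to PDF passwords. */
class UrlPasswordRules {

    /** One rule: a compiled URL pattern and the password it implies. */
    private static final class Rule {
        final Pattern urlPattern;
        final String password;

        Rule(Pattern urlPattern, String password) {
            this.urlPattern = urlPattern;
            this.password = password;
        }
    }

    // Rules are checked in insertion order, so list specific patterns first.
    private final List<Rule> rules = new ArrayList<>();

    /** Registers a regular expression that must match the whole URL. */
    UrlPasswordRules add(String urlRegex, String password) {
        rules.add(new Rule(Pattern.compile(urlRegex), password));
        return this;
    }

    /** Returns the password of the first rule matching the URL, if any. */
    Optional<String> lookup(String url) {
        for (Rule rule : rules) {
            if (rule.urlPattern.matcher(url).matches()) {
                return Optional.of(rule.password);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed example rule: every PDF under one path shares a password.
        UrlPasswordRules rules = new UrlPasswordRules()
            .add("https?://xyx\\.com/docs/pages/abc/.*\\.pdf", "example-password");

        System.out.println(rules.lookup("http://xyx.com/docs/pages/abc/report.pdf"));
        System.out.println(rules.lookup("http://other.example/report.pdf"));
    }
}
```

Checking rules in order keeps the behaviour predictable when patterns overlap: a narrow rule for one directory can be listed ahead of a broad fallback for the whole host.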



Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

Posted by Jorge Luis Betancourt Gonzalez <jl...@uci.cu>.
I get that, but it would be really tedious work to list a password for each PDF file that will be crawled, don't you think?


Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

Posted by Tejas Patil <te...@gmail.com>.
There can be PDF files with the same name on different hosts, so using the
URL would be better than using the name. All this info can be in an XML
file which will be read by the PDF plugin.

Thanks,
Tejas Patil
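
To make the XML file suggestion above a little more concrete, here is a minimal Java sketch that reads URL-to-password entries using only the JDK's built-in XML parser. The file layout (a pdf-passwords root with pdf elements carrying url and password attributes), the class name PdfPasswordFile, and the sample hosts are invented for illustration. Nutch defines no such format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical loader for a URL-to-password XML file; the format is assumed. */
class PdfPasswordFile {

    /** Parses <pdf url="..." password="..."/> elements into a URL-keyed map. */
    static Map<String, String> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Ask the parser to apply its secure-processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            NodeList entries = doc.getElementsByTagName("pdf");
            Map<String, String> passwords = new LinkedHashMap<>();
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                passwords.put(entry.getAttribute("url"), entry.getAttribute("password"));
            }
            return passwords;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read password file", e);
        }
    }

    public static void main(String[] args) {
        // Same file name on two hosts, different passwords: the full URL tells them apart.
        String xml =
            "<pdf-passwords>"
            + "<pdf url=\"http://host-a.example/docs/report.pdf\" password=\"secret-a\"/>"
            + "<pdf url=\"http://host-b.example/docs/report.pdf\" password=\"secret-b\"/>"
            + "</pdf-passwords>";

        Map<String, String> passwords = parse(xml);
        System.out.println(passwords.get("http://host-b.example/docs/report.pdf"));
    }
}
```

Keying on the full URL rather than the file name is what lets the two report.pdf entries coexist, which is the point made in the message above.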



Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

Posted by Jorge Luis Betancourt Gonzalez <jl...@uci.cu>.
What would be a good way of specifying which password goes with which PDF file: by full URI, by filename, or something else?


Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

Posted by Julien Nioche <li...@gmail.com>.
Hi John,

Currently not, but it should be relatively straightforward to modify
parse-tika to do so, and it would be a nice contribution to Nutch.

Julien




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble