Posted to dev@tika.apache.org by Giuseppe Totaro <to...@gmail.com> on 2017/09/28 20:35:22 UTC
[DISCUSS] Enable specific ContentHandler for tika-server
Hi folks,
if I am not wrong, currently you cannot configure a specific ContentHandler
while using tika-server. I mean that you can configure your own parser [0]
but you cannot control which ContentHandler the parser leverages to extract
text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
StandardsExtractingContentHandler, etc).
If it is correct, it would be nice to enable the use of specific
ContentHandlers within tika-server and I would like to discuss how to solve
this issue generally.
I propose two solutions:
1. augment the TikaConfig class so that a specific ContentHandler can be
used in tika-config.xml;
2. determine the ContentHandler to use for parsing through HTTP headers,
for example:
curl -T filename.pdf http://localhost:9998/meta --header
"X-Content-Handler: PhoneExtractingContentHandler"
This should also affect the TikaResource.java class.
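For a Java client, the curl call in option 2 might be built as below. This is only a sketch: X-Content-Handler is the header proposed in this thread and is not an existing tika-server feature, MetaRequest is a made-up class name, and java.net.http requires Java 11 or later.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: the X-Content-Handler header is the proposal under
// discussion, and MetaRequest is a hypothetical helper class.
final class MetaRequest {
    private MetaRequest() {}

    // Builds the equivalent of:
    //   curl -T <file> http://localhost:9998/meta --header "X-Content-Handler: <handler>"
    // (curl -T uploads with an HTTP PUT).
    static HttpRequest build(byte[] document, String handler) {
        return HttpRequest.newBuilder(URI.create("http://localhost:9998/meta"))
                .header("X-Content-Handler", handler)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

Sending the request would then be a matter of HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).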
I look forward to having your feedback. I strongly believe that every user
who wants to use Tika as a service through tika-server and needs to extract
content and metadata like phone numbers, standard references, etc would be
very happy.
Thanks a lot,
Giuseppe
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 29 Sep 2017, Giuseppe Totaro wrote:
> To sum up, I would like to quickly discuss the following aspects:
>
> - As you all mentioned, the HTTP headers for configuring the
> ContentHandler to be used are better suited for the dynamic cases.
> Specifically, a ContentHandler can be given through an ad-hoc header, e.g.
> -H "X-Content-Handler: StandardsExtractingContentHandler", parsed and used
> at run time within tika-server.
> - Nick, I believe that providing the ability to determine the
> ContentHandler through a command-line option is a great idea. It could be
> better also for users.
To make for shorter headers / options, I'd suggest that you test the value
given for a ".". If it has one, treat it as a class name. If it doesn't, try
prefixing it with org.apache.tika.sax, so that just short class names can be
used for Tika built-in handlers.
Nick
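Nick's rule could be sketched as follows. HandlerNames and DEFAULT_PACKAGE are illustrative names only, and the sketch assumes the built-in handlers live directly in org.apache.tika.sax.

```java
// Sketch only: HandlerNames and DEFAULT_PACKAGE are hypothetical names.
final class HandlerNames {
    // Package assumed to hold Tika's built-in handlers, per the suggestion above.
    static final String DEFAULT_PACKAGE = "org.apache.tika.sax";

    private HandlerNames() {}

    // A value containing a dot is taken as a fully-qualified class name;
    // anything else is treated as a short name in the default package.
    static String resolve(String value) {
        String name = value.trim();
        if (name.isEmpty()) {
            throw new IllegalArgumentException("empty handler name");
        }
        return name.contains(".") ? name : DEFAULT_PACKAGE + "." + name;
    }

    public static void main(String[] args) {
        System.out.println(resolve("PhoneExtractingContentHandler"));
        System.out.println(resolve("com.example.CustomHandler"));
    }
}
```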
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Giuseppe Totaro <to...@gmail.com>.
Hi folks,
first of all, I want to express my gratitude for your feedback and
insightful suggestions.
To sum up, I would like to quickly discuss the following aspects:
- As you all mentioned, the HTTP headers for configuring the
ContentHandler to be used are better suited for the dynamic cases.
Specifically, a ContentHandler can be given through an ad-hoc header, e.g.
-H "X-Content-Handler: StandardsExtractingContentHandler", parsed and used
at run time within tika-server.
- Nick, I believe that providing the ability to determine the
ContentHandler through a command-line option is a great idea. It could be
better also for users.
Please let me implement both solutions and provide an example in the next
few days that we can discuss.
Thanks again for your kind availability,
Giuseppe
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 28 Sep 2017, Giuseppe Totaro wrote:
> if I am not wrong, currently you cannot configure a specific ContentHandler
> while using tika-server. I mean that you can configure your own parser [0]
> but you cannot control which ContentHandler the parser leverages to extract
> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
> StandardsExtractingContentHandler, etc).
I think the long-term plan was to work out a viable plan for layering
multiple parsers on top of each other, then change some of these to be
"enhancing parsers" on top. However, that's still on the "TODO" list for
Tika 2.0, as we've yet to come up with a good way to allow it to
happen within the SAX / ContentHandler structure.
> I propose two solutions:
>
> 1. augment the TikaConfig class so that a specific ContentHandler can be
> used in tika-config.xml;
That feels a bit wrong to me, because in almost all Tika use-cases, the
value from the Config would be ignored.
Trying to explain to a new user which were the cases where it'd be used,
and which ones it was ignored, seems hard and confusing too...
> 2. determine the ContentHandler to use for parsing through HTTP headers,
> for example:
We do allow setting of parser config via headers, so this would have
precedent. It would also allow per-request changes.
Otherwise, if server-wide is OK (which your config idea would require
anyway), might it not be better to make it an option when you start the
server? I see it as being a bit more like picking a port, in terms of
something specific to how you run that server instance
eg java -jar tika-server.jar --port 1234 --content-handler PhoneExtractingContentHandler
eg java -jar tika-server.jar --port 1234 --content-handler com.example.CustomHandler
Nick
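One way the start-up option and a per-request header could interact is sketched below. The --content-handler option and the X-Content-Handler header are both proposals from this thread, HandlerChoice is a hypothetical class, and the rule that a header overrides the server-wide value is an assumption rather than something settled here.

```java
import java.util.Optional;

// Sketch only: HandlerChoice and its method names are hypothetical.
final class HandlerChoice {
    private HandlerChoice() {}

    // Scans arguments such as "--port 1234 --content-handler Foo" for the
    // proposed --content-handler option and returns its value, if present.
    static Optional<String> fromArgs(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("--content-handler".equals(args[i])) {
                return Optional.of(args[i + 1]);
            }
        }
        return Optional.empty();
    }

    // Assumption: a non-blank per-request header value wins; otherwise the
    // server-wide choice made at start-up (if any) is used.
    static Optional<String> effective(String headerValue, Optional<String> serverWide) {
        if (headerValue != null && !headerValue.trim().isEmpty()) {
            return Optional.of(headerValue.trim());
        }
        return serverWide;
    }
}
```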
RE: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Not sure what the best way forward on this is, but this could help with, say, SOLR-7632.
A user could specify a handler that sends the document to Solr on endDocument().
This would cut out one leg of the transmission.
client sends bytes -> tika-server -> client -> Solr
client sends bytes -> tika-server -> Solr
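A handler of the kind Tim describes might look roughly like this. ForwardingHandler is a hypothetical name, and the Consumer is a stand-in for a real Solr client, which is deliberately left out; the sketch relies only on the SAX classes shipped with the JDK.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: ForwardingHandler is a hypothetical name, and the Consumer
// stands in for whatever client would actually post the text to Solr.
class ForwardingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final Consumer<String> sink;

    ForwardingHandler(Consumer<String> sink) {
        this.sink = sink;
    }

    // Accumulates character data as the parser emits it.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Hands the collected text to the sink once the document is complete.
    @Override
    public void endDocument() {
        sink.accept(text.toString());
    }

    public static void main(String[] args) {
        ForwardingHandler handler = new ForwardingHandler(System.out::println);
        char[] chars = "extracted text".toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
    }
}
```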
-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org]
Sent: Tuesday, October 24, 2017 9:50 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] Enable specific ContentHandler for tika-server
This makes sense to me, +1 Giuseppe!
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Chris Mattmann <ma...@apache.org>.
This makes sense to me, +1 Giuseppe!
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Giuseppe Totaro <to...@gmail.com>.
Hi folks,
I am developing the proposed solutions within tika-server for enabling
specific ContentHandlers. Basically, I am working to provide the ability to
specify the name of the ContentHandler to be used, via either a command-line
option or an HTTP header.
In order to complete my work, I would like to get your feedback about the
following aspects:
1. To create and use the given ContentHandler, should I modify each
method within the TikaResource class (as well as the other classes
within org.apache.tika.server.resource) where the parse method is
performed by wrapping the ContentHandler currently used? Alternatively, I
could create a new method (therefore a new REST API) specifically focused
on creating a ContentHandler from the list provided by the user. Of course,
I am totally open to other solutions.
2. As ContentHandlers often provide different types of constructors, we
would need a mechanism to determine via reflection the constructor and the
parameters to be used. I think we could get the ContentHandler class by using
the static method Class.forName(String className) [0] with the
fully-qualified name of the given class, and then use the method
getConstructor(Class<?>... parameterTypes) [1] to determine the constructor
to be used and instantiate the ContentHandler.
3. If you agree with the above, I think that we can allow users to
provide the parameters according to RFC 822 [3] so that they can give the
name of the ContentHandler to be used, followed by its parameters as a
semicolon-separated list:
<header> = X-Content-Handler: <entry> *[, <entry>]
<entry> = <content handler> *[; <param>]
<param> = <java type> = <value>
Consistently, I would enable the same syntax when using the command-line
option:
java -jar tika-server-X.jar -contentHandler <entry>*[,<entry>]
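The grammar in point 3 could be parsed along these lines. HandlerSpec and Param are hypothetical names, and the sketch assumes no quoting, so a value cannot itself contain a comma or a semicolon.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: HandlerSpec and Param are hypothetical names.
final class HandlerSpec {
    // One "<java type> = <value>" pair.
    static final class Param {
        final String type;
        final String value;

        Param(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    final String className;
    final List<Param> params;

    HandlerSpec(String className, List<Param> params) {
        this.className = className;
        this.params = params;
    }

    // Parses "<entry> *[, <entry>]" where each entry is
    // "<content handler> *[; <java type> = <value>]".
    static List<HandlerSpec> parse(String headerValue) {
        List<HandlerSpec> specs = new ArrayList<>();
        for (String entry : headerValue.split(",")) {
            String[] parts = entry.trim().split(";");
            String className = parts[0].trim();
            if (className.isEmpty()) {
                throw new IllegalArgumentException("missing handler name in: " + entry);
            }
            List<Param> params = new ArrayList<>();
            for (int i = 1; i < parts.length; i++) {
                // Split on the first '=' only, so the value may contain '='.
                String[] kv = parts[i].split("=", 2);
                if (kv.length != 2) {
                    throw new IllegalArgumentException("expected <type>=<value>: " + parts[i]);
                }
                params.add(new Param(kv[0].trim(), kv[1].trim()));
            }
            specs.add(new HandlerSpec(className, params));
        }
        return specs;
    }
}
```

A real implementation would probably need a quoting rule before values such as regular expressions could be passed safely.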
I look forward to having your feedback.
Thanks a lot,
Giuseppe
[0]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#forName-java.lang.String-
[1]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getConstructor-java.lang.Class...-
[3] https://www.w3.org/Protocols/rfc822/
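Point 2 could be sketched with the two reflection calls referenced in [0] and [1]. HandlerFactory is a hypothetical name, only int and java.lang.String parameters are handled, and the demo in main uses the JDK's DefaultHandler rather than a Tika handler.

```java
import java.lang.reflect.Constructor;
import org.xml.sax.ContentHandler;

// Sketch only: HandlerFactory is a hypothetical name; only int and
// java.lang.String constructor parameters are supported here.
final class HandlerFactory {
    private HandlerFactory() {}

    // types[i] names the Java type of values[i], e.g. "int" or "java.lang.String".
    static ContentHandler create(String className, String[] types, String[] values) {
        if (types.length != values.length) {
            throw new IllegalArgumentException("types and values differ in length");
        }
        Class<?>[] paramTypes = new Class<?>[types.length];
        Object[] args = new Object[types.length];
        for (int i = 0; i < types.length; i++) {
            switch (types[i]) {
                case "int":
                    paramTypes[i] = int.class;
                    args[i] = Integer.valueOf(values[i]);
                    break;
                case "java.lang.String":
                    paramTypes[i] = String.class;
                    args[i] = values[i];
                    break;
                default:
                    throw new IllegalArgumentException("unsupported type: " + types[i]);
            }
        }
        try {
            // asSubclass throws ClassCastException for classes that are not
            // ContentHandlers, before any constructor is invoked.
            Class<? extends ContentHandler> type =
                    Class.forName(className).asSubclass(ContentHandler.class);
            Constructor<? extends ContentHandler> ctor = type.getConstructor(paramTypes);
            return ctor.newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        ContentHandler h = create("org.xml.sax.helpers.DefaultHandler", new String[0], new String[0]);
        System.out.println(h.getClass().getName());
    }
}
```

Because the class name would arrive in a request header, a real implementation would probably want to restrict it to a known set of handler classes.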
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Sergey Beryozkin <sb...@gmail.com>.
Konstantin, by the way, if you are interested in having a good
discussion about using serialized lambdas, you are welcome to
comment on the relevant text in the Tika Concerns Beam
thread, though maybe Beam knows how to take care of the issues you
raised...
Thanks, Sergey
--
Sergey Beryozkin
Talend Community Coders
http://coders.talend.com/
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Sergey Beryozkin <sb...@gmail.com>.
On 06/10/17 18:08, Konstantin Gribov wrote:
> My +1 to this idea.
>
> IMHO, second option is more flexible. I also like Nick's suggestion about
> using default package for handlers and interpret dot-separated string as
> fqcn. Solr does similar thing and it's very convenient to use (but they use
> prefix `solr.` for their classes in predefined package and any other is
> interpreted as fqcn).
>
> I'll add that you could allow user to pass several comma-separated handlers
> to allow build content-handler stack if user wants to.
>
> I would disagree with Sergey about serialized lambdas for 2 reasons:
> - it's useful only for java-clients;
> - it could bring very nasty bugs leading to RCE class vulnerabilities, so
> it's very controversial from security PoV.
Sure. I was not actually suggesting to use them in Tika natively, I only
referred to it as the alternative mentioned in the context of the Beam
integration work
Sergey
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Konstantin Gribov <gr...@gmail.com>.
My +1 to this idea.
IMHO, the second option is more flexible. I also like Nick's suggestion about
using a default package for handlers and interpreting a dot-separated string as
an FQCN. Solr does a similar thing and it's very convenient to use (but they use
the prefix `solr.` for their classes in a predefined package, and any other name
is interpreted as an FQCN).
I'll add that you could let the user pass several comma-separated handlers
to build a content-handler stack if they want to.
I would disagree with Sergey about serialized lambdas for 2 reasons:
- it's useful only for java-clients;
- it could bring very nasty bugs leading to RCE class vulnerabilities, so
it's very controversial from security PoV.
--
Best regards,
Konstantin Gribov
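The comma-separated stack Konstantin suggests could be assembled as follows. HandlerStack and the registry of wrappers are hypothetical, and the ordering (first name outermost) is an assumption; a registry is used here instead of reflection so that only known names are accepted.

```java
import java.util.Map;
import java.util.function.UnaryOperator;
import org.xml.sax.ContentHandler;

// Sketch only: HandlerStack and the registry are hypothetical; each registry
// entry wraps an inner handler the way a decorating handler would.
final class HandlerStack {
    private HandlerStack() {}

    // Builds the stack for "A,B,C" as A(B(C(base))): the first name listed
    // ends up outermost, so wrapping starts from the last name.
    static ContentHandler build(String commaSeparated,
                                Map<String, UnaryOperator<ContentHandler>> registry,
                                ContentHandler base) {
        String[] names = commaSeparated.split(",");
        ContentHandler current = base;
        for (int i = names.length - 1; i >= 0; i--) {
            String name = names[i].trim();
            UnaryOperator<ContentHandler> wrapper = registry.get(name);
            if (wrapper == null) {
                throw new IllegalArgumentException("unknown handler: " + name);
            }
            current = wrapper.apply(current);
        }
        return current;
    }
}
```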
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Chris
Another option (for Beam) was passing a custom content handler via the
serialized lambda expression - which sounds like black magic to me at
the moment but I'm curious :-)
I thought, assuming TikaConfig is only used once to bootstrap, then
passing a ContentHandler class name might work. You are right, the
headers are better suited for the dynamic cases...
Cheers, Sergey
On 28/09/17 22:35, Chris Mattmann wrote:
> Hmm, cool.
>
> Can we support both? If I don’t have to modify/ship a Tika config (which is a runtime
> configuration) and I can, on a per call invocation, change the ContentHandler, it would
> be MUCH easier in downstream libraries like Tika Python that rely on the REST server.
> These are documented here:
>
> https://wiki.apache.org/tika/API%20Bindings%20for%20Tika
>
> Cheers,
> Chris
>
>
>
>
> On 9/28/17, 2:26 PM, "Sergey Beryozkin" <sb...@gmail.com> wrote:
>
> Hi
>
> Option #1 is also good - a question how to pass a ContentHandler to a
> Beam function was open, and given that passing TikaConfig is needed
> anyway, having a way to specify a handler there can be handy too...
>
> Cheers, Sergey
> On 28/09/17 22:17, Chris Mattmann wrote:
> > I am +1 for this. Option #2 sounds like a slick way to handle this for me that would
> > remain back compat with tika-python which is of strong interest to me.
> >
> > Cheers,
> > Chris
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Chris Mattmann <ma...@apache.org>.
Hmm, cool.
Can we support both? If I don’t have to modify/ship a Tika config (which is a runtime
configuration) and I can change the ContentHandler on a per-call basis, it would
be MUCH easier in downstream libraries like Tika Python that rely on the REST server.
These are documented here:
https://wiki.apache.org/tika/API%20Bindings%20for%20Tika
Cheers,
Chris
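To illustrate the per-call idea, here is a minimal client-side sketch of the request a binding could send. It assumes the X-Content-Handler header proposed in this thread, which is not an existing tika-server feature; the class and method names are hypothetical, and only the JDK's java.net.http API is used.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class PerCallHandlerRequest {
    // Header name proposed in this thread; an assumption, not a shipped tika-server option.
    static final String HANDLER_HEADER = "X-Content-Handler";

    // Builds the PUT request that "curl -T" would send, plus the proposed header.
    static HttpRequest build(byte[] document, String handlerName) {
        return HttpRequest.newBuilder(URI.create("http://localhost:9998/meta"))
                .header(HANDLER_HEADER, handlerName)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        // The request is only built and printed here, never sent.
        HttpRequest request = build(new byte[0], "PhoneExtractingContentHandler");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

Since the handler travels with each request, a client could switch handlers between calls without touching the server's configuration.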
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi
Option #1 is also good - the question of how to pass a ContentHandler to a
Beam function was open, and given that passing TikaConfig is needed
anyway, having a way to specify a handler there can be handy too...
Cheers, Sergey
Re: [DISCUSS] Enable specific ContentHandler for tika-server
Posted by Chris Mattmann <ma...@apache.org>.
I am +1 for this. Option #2 sounds to me like a slick way to handle this, and it would
remain backward compatible with tika-python, which is of strong interest to me.
Cheers,
Chris
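For Option #2, a rough server-side sketch of how the proposed X-Content-Handler value might be mapped to a handler. Everything here is an assumption for illustration: the class names are hypothetical, the nested handler classes are stand-ins for Tika's real handlers, and this is not how TikaResource.java works today.

```java
import java.util.Map;
import java.util.function.Supplier;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerSelector {
    // Hypothetical stand-ins; a real server would construct Tika's handlers here.
    static class PhoneHandler extends DefaultHandler {}
    static class StandardsHandler extends DefaultHandler {}

    // Allow-list keyed by the header value, so clients cannot name arbitrary classes.
    private static final Map<String, Supplier<ContentHandler>> HANDLERS = Map.of(
            "PhoneExtractingContentHandler", PhoneHandler::new,
            "StandardsExtractingContentHandler", StandardsHandler::new);

    // Returns a plain default handler when the header is absent, keeping old clients working.
    static ContentHandler select(String headerValue) {
        if (headerValue == null || headerValue.isBlank()) {
            return new DefaultHandler();
        }
        Supplier<ContentHandler> factory = HANDLERS.get(headerValue.trim());
        if (factory == null) {
            throw new IllegalArgumentException("Unknown content handler: " + headerValue);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        System.out.println(select("PhoneExtractingContentHandler").getClass().getSimpleName());
        System.out.println(select(null).getClass().getSimpleName());
    }
}
```

The allow-list is a deliberate choice in this sketch: loading whatever class name arrives in a header would let any client instantiate arbitrary classes on the server.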