Posted to dev@tika.apache.org by Giuseppe Totaro <to...@gmail.com> on 2017/09/28 20:35:22 UTC

[DISCUSS] Enable specific ContentHandler for tika-server

Hi folks,

if I am not wrong, currently you cannot configure a specific ContentHandler
while using tika-server. I mean that you can configure your own parser [0]
but you cannot control which ContentHandler the parser leverages to extract
text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
StandardsExtractingContentHandler, etc).
If it is correct, it would be nice to enable the use of specific
ContentHandlers within tika-server and I would like to discuss how to solve
this issue generally.

I propose two solutions:

   1. augment the TikaConfig class so that a specific ContentHandler can be
   used in tika-config.xml;
   2. determine the ContentHandler to use for parsing through HTTP headers,
   for example:
   curl -T filename.pdf http://localhost:9998/meta --header
   "X-Content-Handler: PhoneExtractingContentHandler"
   This should affect also the TikaResource.java class.
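
For clients other than curl, the request in option 2 could be built roughly
as follows. This is only a sketch: X-Content-Handler is the header proposed
in this message, not an existing tika-server feature, and the names
ContentHandlerRequests and metaRequest are made up for illustration. It uses
the JDK's java.net.http API (Java 11 or later) and only builds the request;
sending it is left out.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class ContentHandlerRequests {
    /**
     * Builds a PUT to <serverBase>/meta, mirroring the "curl -T" call above.
     * The X-Content-Handler header is the proposed (hypothetical) header.
     */
    static HttpRequest metaRequest(String serverBase, String handlerName, byte[] document) {
        return HttpRequest.newBuilder()
                .uri(URI.create(serverBase + "/meta"))
                .header("X-Content-Handler", handlerName)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

A caller would pass the result to HttpClient.send(...) together with a body
handler of its choice.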

I look forward to having your feedback. I strongly believe that every user
who wants to use Tika as a service through tika-server and needs to extract
content and metadata like phone numbers, standard references, etc would be
very happy.

Thanks a lot,
Giuseppe

Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 29 Sep 2017, Giuseppe Totaro wrote:
> To sum up, I would like to quickly discuss the following aspects:
>
>   - As you all mentioned, the HTTP headers for configuring the
>   ContentHandler to be used are better suited for the dynamic cases.
>   Specifically, a ContentHandler can be given through an ad-hoc header, e.g.
>   -H "X-Content-Handler: StandardsExtractingContentHandler", parsed and used
>   run-time within tika-server.
>   - Nick, I believe that providing the ability to determine the
>   ContentHandler through a command-line option is a great idea. It could be
>   better also for users.

To make for shorter headers / options, I'd suggest that you test the value
given for a ".". If it has one, treat it as a fully qualified class name.
If it doesn't, try prefixing it with org.apache.tika.sax, so that just
short class names can be used for Tika's built-in handlers.

Nick
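
The naming rule suggested above can be sketched in a few lines. This is an
illustration rather than code from tika-server: the class and method names
(HandlerNames, toClassName) are hypothetical, and only the "contains a dot"
test and the org.apache.tika.sax prefix come from the message.

```java
final class HandlerNames {
    /** Package assumed to hold Tika's built-in handlers, as named in the message. */
    static final String DEFAULT_PACKAGE = "org.apache.tika.sax";

    /** Returns the value unchanged if it contains a dot, else prefixes the default package. */
    static String toClassName(String value) {
        String name = value.trim();
        if (name.isEmpty()) {
            throw new IllegalArgumentException("Empty handler name");
        }
        return name.contains(".") ? name : DEFAULT_PACKAGE + "." + name;
    }
}
```

With this rule, "PhoneExtractingContentHandler" becomes
"org.apache.tika.sax.PhoneExtractingContentHandler", while
"com.example.CustomHandler" is left as it is.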

Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Giuseppe Totaro <to...@gmail.com>.
Hi folks,

first of all, I want to express my gratitude for your feedback and
insightful suggestions.

To sum up, I would like to quickly discuss the following aspects:

   - As you all mentioned, the HTTP headers for configuring the
   ContentHandler to be used are better suited for the dynamic cases.
   Specifically, a ContentHandler can be given through an ad-hoc header, e.g.
   -H "X-Content-Handler: StandardsExtractingContentHandler", then parsed and
   used at run time within tika-server.
   - Nick, I believe that providing the ability to determine the
   ContentHandler through a command-line option is a great idea. It could be
   better also for users.

Let me implement both solutions and provide an example in the next few
days that we can discuss.

Thanks again for your kind availability,
Giuseppe


On Thu, Sep 28, 2017 at 10:08 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Thu, 28 Sep 2017, Giuseppe Totaro wrote:
>
>> if I am not wrong, currently you cannot configure a specific
>> ContentHandler
>> while using tika-server. I mean that you can configure your own parser [0]
>> but you cannot control which ContentHandler the parser leverages to
>> extract
>> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
>> StandardsExtractingContentHandler, etc).
>>
>
> I think the long-term plan was to work out a viable way of layering
> multiple parsers on top of each other, then change some of these to be
> "enhancing parsers" on top. However, that's still on the "TODO" list for
> Tika 2.0, as we have yet to come up with a good way to allow it to
> happen within the SAX / ContentHandler structure.
>
>
> I propose two solutions:
>>
>>   1. augment the TikaConfig class so that a specific ContentHandler can be
>>   used in tika-config.xml;
>>
>
> That feels a bit wrong to me, because in almost all Tika use-cases, the
> value from the Config would be ignored.
>
> Trying to explain to a new user which were the cases where it'd be used,
> and which ones it was ignored, seems hard and confusing too...
>
>
>   2. determine the ContentHandler to use for parsing through HTTP headers,
>>   for example:
>>
>
> We do allow setting of parser config via headers, so this would have
> precedent. It would also allow per-request changing.
>
> Otherwise, if server-wide is OK (which your config idea would require
> anyway), might it not be better to make it an option when you start the
> server? I see it as being a bit more like picking a port, in terms of
> something specific to how you run that server instance
>
> eg java -jar tika-server.jar --port 1234 --content-handler
> PhoneExtractingContentHandler
> eg java -jar tika-server.jar --port 1234 --content-handler
> com.example.CustomHandler
>
> Nick
>

Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 28 Sep 2017, Giuseppe Totaro wrote:
> if I am not wrong, currently you cannot configure a specific ContentHandler
> while using tika-server. I mean that you can configure your own parser [0]
> but you cannot control which ContentHandler the parser leverages to extract
> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
> StandardsExtractingContentHandler, etc).

I think the long-term plan was to work out a viable way of layering
multiple parsers on top of each other, then change some of these to be
"enhancing parsers" on top. However, that's still on the "TODO" list for
Tika 2.0, as we have yet to come up with a good way to allow it to
happen within the SAX / ContentHandler structure.


> I propose two solutions:
>
>   1. augment the TikaConfig class so that a specific ContentHandler can be
>   used in tika-config.xml;

That feels a bit wrong to me, because in almost all Tika use-cases, the 
value from the Config would be ignored.

Trying to explain to a new user which were the cases where it'd be used, 
and which ones it was ignored, seems hard and confusing too...


>   2. determine the ContentHandler to use for parsing through HTTP headers,
>   for example:

We do allow setting of parser config via headers, so this would have
precedent. It would also allow per-request changing.

Otherwise, if server-wide is OK (which your config idea would require 
anyway), might it not be better to make it an option when you start the 
server? I see it as being a bit more like picking a port, in terms of 
something specific to how you run that server instance

eg java -jar tika-server.jar --port 1234 --content-handler PhoneExtractingContentHandler
eg java -jar tika-server.jar --port 1234 --content-handler com.example.CustomHandler

Nick
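
A minimal sketch of how a start-up option like the one above might be read.
Everything here is an assumption for illustration: --content-handler is the
option being suggested, not an existing tika-server flag, ServerOptions is a
made-up class, and the default port 9998 is simply the one used in the curl
example in this thread.

```java
final class ServerOptions {
    final int port;
    final String contentHandler; // null when --content-handler is absent

    private ServerOptions(int port, String contentHandler) {
        this.port = port;
        this.contentHandler = contentHandler;
    }

    /** Parses "--port <n>" and "--content-handler <name>" pairs in any order. */
    static ServerOptions parse(String[] args) {
        int port = 9998; // port used in the curl example in this thread
        String handler = null;
        for (int i = 0; i < args.length; i++) {
            String option = args[i];
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException(option + " needs a value");
            }
            String value = args[++i]; // consume the option's value
            switch (option) {
                case "--port":
                    port = Integer.parseInt(value);
                    break;
                case "--content-handler":
                    handler = value;
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + option);
            }
        }
        return new ServerOptions(port, handler);
    }
}
```

The handler name is kept as a plain string here; turning it into a class is
a separate step, so the same value could come from a header or the command
line.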

RE: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Not sure what the best way forward on this is, but this could help with, say, SOLR-7632.

A user could specify a handler that sends the document to Solr on endDocument().

This would cut out one leg of the transmission.

client sends bytes -> tika-server -> client -> Solr

client sends bytes -> tika-server -> Solr
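
To make the endDocument() idea concrete, here is a rough sketch using only
the JDK's SAX classes. It is not an implementation of SOLR-7632: the class
name ForwardingTextHandler is invented, and the Consumer<String> "sink" is a
stand-in for whatever would actually post the document to Solr, which is
deliberately not shown.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

final class ForwardingTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final Consumer<String> sink; // placeholder for "send to Solr"

    ForwardingTextHandler(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void startDocument() {
        text.setLength(0); // start each document with an empty buffer
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // collect the extracted text
    }

    @Override
    public void endDocument() {
        sink.accept(text.toString()); // hand the whole document over once parsing ends
    }
}
```

Because the hand-off happens inside the handler, the extracted text never has
to travel back to the original client first.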

-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org] 
Sent: Tuesday, October 24, 2017 9:50 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] Enable specific ContentHandler for tika-server

This makes sense to me, +1 Giuseppe!




Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Chris Mattmann <ma...@apache.org>.
This makes sense to me, +1 Giuseppe!



On 10/24/17, 6:12 PM, "Giuseppe Totaro" <to...@gmail.com> wrote:

    Hi folks,
    
    I am developing the proposed solutions within tika-server for enabling
    specific ContentHandlers. Basically, I am working to provide the ability of
    giving the name of the ContentHandler to be used by either command-line or
    HTTP header.
    In order to complete my work, I would like to get your feedback about the
    following aspects:
    
       1. To create and use the given ContentHandler, should I modify each
       method within the TikaResource class (as well as the other classes
       within org.apache.tika.server.resource) where the parse method is
       performed by wrapping the ContentHandler currently used? Alternatively, I
       could create a new method (therefore a new REST API) specifically focused
       on creating a ContentHandler from the list provided by the user. Of course,
       I am totally open to other solutions.
    
       2. As ContentHandlers often provide different types of constructors, we
       would need a mechanism to determine via reflection the constructor and the
       parameters to be used. I think we could get the ContentHandler by using the
       static method Class.forName(String className) [0] with the
       fully-qualified name of the given class and then using the method
       getConstructor(Class<?>... parameterTypes) [1] to determine the
       constructor to be used and instantiate the ContentHandler.
    
       3. If you agree with the above, I think that we can allow users to
       provide the parameters according to RFC 822 [3] so that they can give the
       name of the ContentHandler to be used and the parameter as a
       semicolon-separated list of entries:
    
       <header> = X-Content-Handler: <entry> *[, <entry>]
       <entry> = <content handler> *[; <param>]
       <param> = <java type> = <value>
    
       Consistently, I would enable the same syntax when using the command-line
       option:
    
       java -jar tika-server-X.jar -contentHandler <entry>*[,<entry>]
    
    I look forward to having your feedback.
    
    Thanks a lot,
    Giuseppe
    
    [0]
    https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#forName-java.lang.String-
    [1]
    https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getConstructor-java.lang.Class...-
    [3] https://www.w3.org/Protocols/rfc822/
    



Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Giuseppe Totaro <to...@gmail.com>.
Hi folks,

I am developing the proposed solutions within tika-server for enabling
specific ContentHandlers. Basically, I am working to provide the ability to
give the name of the ContentHandler to be used via either the command line
or an HTTP header.
In order to complete my work, I would like to get your feedback about the
following aspects:

   1. To create and use the given ContentHandler, should I modify each
   method within the TikaResource class (as well as the other classes
   within org.apache.tika.server.resource) where the parse method is
   invoked, wrapping the ContentHandler currently used? Alternatively, I
   could create a new method (therefore a new REST API) specifically focused
   on creating a ContentHandler from the list provided by the user. Of course,
   I am totally open to other solutions.

   2. As ContentHandlers often provide different types of constructors, we
   would need a mechanism to determine via reflection the constructor and the
   parameters to be used. I think we could get the ContentHandler by using the
   static method Class.forName(String className) [0] with the
   fully-qualified name of the given class and then using the method
   getConstructor(Class<?>... parameterTypes) [1] to determine the
   constructor to be used and instantiate the ContentHandler.

   3. If you agree with the above, I think that we can allow users to
   provide the parameters according to RFC 822 [3] so that they can give the
   name of the ContentHandler to be used and its parameters as a
   semicolon-separated list of entries:

   <header> = X-Content-Handler: <entry> *[, <entry>]
   <entry> = <content handler> *[; <param>]
   <param> = <java type> = <value>

   Consistently, I would enable the same syntax when using the command-line
   option:

   java -jar tika-server-X.jar -contentHandler <entry>*[,<entry>]
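
As a sketch of how a value written in the syntax from point 3 could be taken
apart, assuming no quoting or escaping of "," ";" and "=" is needed (the
grammar above does not say). HandlerSpec and its fields are hypothetical
names, and com.example.CustomHandler with a String parameter is an invented
example, not a real handler.

```java
import java.util.ArrayList;
import java.util.List;

final class HandlerSpec {
    final String handler;
    /** Each element is a two-item array: {javaType, value}. */
    final List<String[]> params = new ArrayList<>();

    HandlerSpec(String handler) {
        this.handler = handler;
    }

    /** Parses "<entry> *[, <entry>]" where <entry> = "<content handler> *[; <type>=<value>]". */
    static List<HandlerSpec> parse(String headerValue) {
        List<HandlerSpec> specs = new ArrayList<>();
        for (String entry : headerValue.split(",")) {
            String[] parts = entry.split(";");
            String name = parts[0].trim();
            if (name.isEmpty()) {
                throw new IllegalArgumentException("Missing handler name in: " + entry);
            }
            HandlerSpec spec = new HandlerSpec(name);
            for (int i = 1; i < parts.length; i++) {
                int eq = parts[i].indexOf('=');
                if (eq < 0) {
                    throw new IllegalArgumentException("Expected <java type>=<value>: " + parts[i]);
                }
                spec.params.add(new String[] {
                        parts[i].substring(0, eq).trim(),
                        parts[i].substring(eq + 1).trim()});
            }
            specs.add(spec);
        }
        return specs;
    }
}
```

For example, "com.example.CustomHandler; java.lang.String=US,
StandardsExtractingContentHandler" yields two specs, the first with one
parameter and the second with none. The same parser could serve both the
header and the command-line option.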

I look forward to having your feedback.

Thanks a lot,
Giuseppe

[0]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#forName-java.lang.String-
[1]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getConstructor-java.lang.Class...-
[3] https://www.w3.org/Protocols/rfc822/
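
The reflection step from point 2 might look roughly like this. It is a sketch
under assumptions: HandlerFactory and instantiate are invented names, and the
caller is assumed to already know the constructor's parameter types (for
instance from the <java type> part of the syntax in point 3).

```java
import java.lang.reflect.Constructor;

final class HandlerFactory {
    /**
     * Loads className, checks that it is a subtype of expectedType, and calls
     * the public constructor whose declared parameter types equal paramTypes.
     */
    static <T> T instantiate(String className, Class<T> expectedType,
                             Class<?>[] paramTypes, Object[] args)
            throws ReflectiveOperationException {
        Class<?> raw = Class.forName(className);
        // Throws ClassCastException if the named class is not an expectedType,
        // e.g. when the name does not denote a ContentHandler at all.
        Class<? extends T> type = raw.asSubclass(expectedType);
        Constructor<? extends T> ctor = type.getConstructor(paramTypes);
        return ctor.newInstance(args);
    }
}
```

Note that getConstructor matches the declared parameter types exactly, so a
constructor declared to take a ContentHandler has to be looked up with
ContentHandler.class rather than the runtime class of the argument. Since the
class name would come from a request or the command line, a real server would
probably also want to restrict which classes may be named.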

On Fri, Oct 6, 2017 at 3:06 PM, Sergey Beryozkin <sb...@gmail.com>
wrote:

> Konstantin, by the way, if you are interested in having a good discussion
> about using serialized lambdas, you are welcome to comment on the
> relevant text in the Tika Concerns Beam thread, though maybe Beam
> knows how to take care of the issues you raised...
>
> Thanks, Sergey
>
> --
> Sergey Beryozkin
>
> Talend Community Coders
> http://coders.talend.com/
>

Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Sergey Beryozkin <sb...@gmail.com>.
Konstantin, by the way, if you are interested in having a good
discussion about using serialized lambdas, you are welcome to comment
on the relevant text in the Tika Concerns Beam thread, though maybe
Beam knows how to take care of the issues you raised...

Thanks, Sergey
On 06/10/17 18:27, Sergey Beryozkin wrote:
> On 06/10/17 18:08, Konstantin Gribov wrote:
>> My +1 to this idea.
>>
>> IMHO, second option is more flexible. I also like Nick's suggestion about
>> using default package for handlers and interpret dot-separated string as
>> fqcn. Solr does similar thing and it's very convenient to use (but 
>> they use
>> prefix `solr.` for their classes in predefined package and any other is
>> interpreted as fqcn).
>>
>> I'll add that you could allow user to pass several comma-separated 
>> handlers
>> to allow build content-handler stack if user wants to.
>>
>> I would disagree with Sergey about serialized lambdas for 2 reasons:
>> - it's useful only for java-clients;
>> - it could bring very nasty bugs leading to RCE class vulnerabilities, so
>> it's very controversial from security PoV.
> Sure. I was not actually suggesting to use them in Tika natively, I only 
> referred to it as the alternative mentioned in the context of the Beam 
> integration work
> 
> Sergey


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Sergey Beryozkin <sb...@gmail.com>.
On 06/10/17 18:08, Konstantin Gribov wrote:
> My +1 to this idea.
> 
> IMHO, second option is more flexible. I also like Nick's suggestion about
> using default package for handlers and interpret dot-separated string as
> fqcn. Solr does similar thing and it's very convenient to use (but they use
> prefix `solr.` for their classes in predefined package and any other is
> interpreted as fqcn).
> 
> I'll add that you could allow user to pass several comma-separated handlers
> to allow build content-handler stack if user wants to.
> 
> I would disagree with Sergey about serialized lambdas for 2 reasons:
> - it's useful only for java-clients;
> - it could bring very nasty bugs leading to RCE class vulnerabilities, so
> it's very controversial from security PoV.
Sure. I was not actually suggesting to use them in Tika natively, I only 
referred to it as the alternative mentioned in the context of the Beam 
integration work

Sergey

Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Konstantin Gribov <gr...@gmail.com>.
My +1 to this idea.

IMHO, the second option is more flexible. I also like Nick's suggestion about
using a default package for handlers and interpreting a dot-separated string
as an FQCN. Solr does a similar thing and it's very convenient to use (but
they use the prefix `solr.` for their classes in a predefined package and
anything else is interpreted as an FQCN).

I'll add that you could allow the user to pass several comma-separated
handlers to build a content-handler stack if the user wants to.
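
One possible reading of a "content-handler stack" is a fan-out, where every
SAX event is passed to each configured handler in turn. The sketch below
assumes that reading and uses only JDK classes; FanOutHandler is an invented
name, and only three events are forwarded to keep it short (a complete
version would forward all ContentHandler callbacks). tika-core's
TeeContentHandler appears to play a similar role.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class FanOutHandler extends DefaultHandler {
    private final List<ContentHandler> delegates;

    FanOutHandler(List<ContentHandler> delegates) {
        this.delegates = new ArrayList<>(delegates); // defensive copy, order preserved
    }

    @Override
    public void startDocument() throws SAXException {
        for (ContentHandler h : delegates) {
            h.startDocument();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler h : delegates) {
            h.characters(ch, start, length);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        for (ContentHandler h : delegates) {
            h.endDocument();
        }
    }
}
```

If "stack" instead means handlers wrapping one another, the comma-separated
order would decide which handler is outermost; the thread leaves that open.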

I would disagree with Sergey about serialized lambdas for 2 reasons:
- it's useful only for java-clients;
- it could bring very nasty bugs leading to RCE class vulnerabilities, so
it's very controversial from security PoV.

On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro <to...@gmail.com>
wrote:

> Hi folks,
>
> if I am not wrong, currently you cannot configure a specific ContentHandler
> while using tika-server. I mean that you can configure your own parser [0]
> but you cannot control which ContentHandler the parser leverages to extract
> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
> StandardsExtractingContentHandler, etc).
> If it is correct, it would be nice to enable the use of specific
> ContentHandlers within tika-server and I would like to discuss how to solve
> this issue generally.
>
> I propose two solutions:
>
>    1. augment the TikaConfig class so that a specific ContentHandler can be
>    used in tika-config.xml;
>    2. determine the ContentHandler to use for parsing through HTTP headers,
>    for example:
>    curl -T filename.pdf http://localhost:9998/meta --header
>    "X-Content-Handler: PhoneExtractingContentHandler"
>    This should affect also the TikaResource.java class.
>
> I look forward to having your feedback. I strongly believe that every user
> who wants to use Tika as a service through tika-server and needs to extract
> content and metadata like phone numbers, standard references, etc would be
> very happy.
>
> Thanks a lot,
> Giuseppe
>
-- 

Best regards,
Konstantin Gribov

Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Chris

Another option (for Beam) was passing a custom content handler via a
serialized lambda expression - which sounds like black magic to me at
the moment but I'm curious :-)

I thought, assuming TikaConfig is only used once to bootstrap, then 
passing a ContentHandler class name might work. You are right, the 
headers are better suited for the dynamic cases...

Cheers, Sergey
On 28/09/17 22:35, Chris Mattmann wrote:
> Hmm, cool.
> 
> Can we support both? If I don’t have to modify/ship a Tika config (which is a runtime
> configuration) and I can, on a per call invocation, change the ContentHandler, it would
> be MUCH easier in downstream libraries like Tika Python that rely on the REST server.
> These are documented here:
> 
> https://wiki.apache.org/tika/API%20Bindings%20for%20Tika
> 
> Cheers,
> Chris

Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Chris Mattmann <ma...@apache.org>.
Hmm, cool.

Can we support both? If I don’t have to modify/ship a Tika config (which is a runtime
configuration) and I can change the ContentHandler on a per-call basis, it would
be MUCH easier in downstream libraries like Tika Python that rely on the REST server.
Those libraries are documented here:

https://wiki.apache.org/tika/API%20Bindings%20for%20Tika
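
A minimal sketch of what a per-call handler choice might look like from a
JVM client, assuming the X-Content-Handler header proposed in this thread
(tika-server does not define it; the header name, the class
PerCallHandlerRequest and its build method are hypothetical). It only
constructs the request, it does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class PerCallHandlerRequest {
    // Hypothetical header name taken from the proposal in this thread.
    static final String HANDLER_HEADER = "X-Content-Handler";

    // Builds a PUT request that carries the document as the body and names
    // the desired ContentHandler in a header, so no config file is shipped.
    static HttpRequest build(String endpoint, String handlerName, byte[] document) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header(HANDLER_HEADER, handlerName)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        byte[] body = "sample document text".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = build("http://localhost:9998/meta",
                "PhoneExtractingContentHandler", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

Because the handler is chosen per request, a client library would only need
to add one header to its existing calls.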

Cheers,
Chris




On 9/28/17, 2:26 PM, "Sergey Beryozkin" <sb...@gmail.com> wrote:

    Hi
    
    Option #1 is also good - a question how to pass a ContentHandler to a 
    Beam function was open, and given that passing TikaConfig is needed 
    anyway, having a way to specify a handler there can be handy too...
    
    Cheers, Sergey

Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi

Option #1 is also good: the question of how to pass a ContentHandler to a
Beam function was open, and given that passing TikaConfig is needed
anyway, having a way to specify a handler there can be handy too...

Cheers, Sergey
On 28/09/17 22:17, Chris Mattmann wrote:
> I am +1 for this. Option #2 sounds like a slick way to handle this for me that would
> remain back compat with tika-python which is of strong interest to me.
> 
> Cheers,
> Chris

Re: [DISCUSS] Enable specific ContentHandler for tika-server

Posted by Chris Mattmann <ma...@apache.org>.
I am +1 for this. Option #2 sounds to me like a slick way to handle this, and it would
remain backward compatible with tika-python, which is of strong interest to me.

Cheers,
Chris




On 9/28/17, 1:35 PM, "Giuseppe Totaro" <to...@gmail.com> wrote:

    Hi folks,
    
    if I am not wrong, currently you cannot configure a specific ContentHandler
    while using tika-server. I mean that you can configure your own parser [0]
    but you cannot control which ContentHandler the parser leverages to extract
    text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
    StandardsExtractingContentHandler, etc).
    If it is correct, it would be nice to enable the use of specific
    ContentHandlers within tika-server and I would like to discuss how to solve
    this issue generally.
    
    I propose two solutions:
    
       1. augment the TikaConfig class so that a specific ContentHandler can be
       used in tika-config.xml;
       2. determine the ContentHandler to use for parsing through HTTP headers,
       for example:
       curl -T filename.pdf http://localhost:9998/meta --header
       "X-Content-Handler: PhoneExtractingContentHandler"
       This should affect also the TikaResource.java class.
    
    I look forward to having your feedback. I strongly believe that every user
    who wants to use Tika as a service through tika-server and needs to extract
    content and metadata like phone numbers, standard references, etc would be
    very happy.
    
    Thanks a lot,
    Giuseppe