
Posted to dev@opennlp.apache.org by Aliaksandr Autayeu <al...@autayeu.com> on 2011/12/01 18:32:30 UTC

Re: proposal to move CLI tools into separate jar

>
>> The primary reason is to have proper modularity, out of which follow other
>> things. The CLI does not belong to the library, and this should be
>> reflected in the structure, to avoid, for example, CLI classes being used
>> by chance somewhere in the library. For example, while working on this, I
>> found a duplicated procedure. Proper code modularization makes one aware
>> of such things.
>>
>> James, thank you for input!
>>
>
> Do you have a sample here? As far as I know we have almost no dependencies
> on the CLI package from other classes, except for the formats package.
>
Yes. POSTaggerCrossValidator.java

import opennlp.tools.cmdline.CmdLineUtil;
import opennlp.tools.cmdline.TerminateToolException;

Why does it do that? To throw a TerminateToolException that it does not need
to throw anyway. If it is easy to create a dependency, dependencies will
creep in.
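
A minimal sketch of how such a dependency could be avoided, assuming the
library reports problems with its own exception type and only the CLI layer
turns them into process exit codes. All class names here (EvaluationException,
ToolTerminationException, CrossValidatorSketch, CrossValidatorTool) are
hypothetical and are not OpenNLP classes.

```java
// Hypothetical names throughout; none of these are OpenNLP classes.

/** Library-side exception: knows nothing about command-line tools. */
class EvaluationException extends Exception {
    EvaluationException(String message) {
        super(message);
    }
}

/** CLI-side exception: carries an exit code and would live only in the CLI module. */
class ToolTerminationException extends RuntimeException {
    final int exitCode;

    ToolTerminationException(int exitCode, String message, Throwable cause) {
        super(message, cause);
        this.exitCode = exitCode;
    }
}

/** Library class: validates input and reports problems with a library exception. */
final class CrossValidatorSketch {
    static int checkFolds(int folds) throws EvaluationException {
        if (folds < 2) {
            throw new EvaluationException("need at least 2 folds, got " + folds);
        }
        return folds;
    }
}

/** CLI wrapper: the only place where library errors become exit codes. */
final class CrossValidatorTool {
    static int run(int folds) {
        try {
            CrossValidatorSketch.checkFolds(folds);
            return 0;
        } catch (EvaluationException e) {
            throw new ToolTerminationException(1, "evaluation failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        int folds = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        try {
            System.exit(run(folds));
        } catch (ToolTerminationException e) {
            System.err.println(e.getMessage());
            System.exit(e.exitCode);
        }
    }
}
```

With this split the dependency points one way only: the CLI wrapper imports
the library class, and the library class compiles without the CLI module on
the class path.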



> I usually go to opennlp-tools, do a "mvn clean install" after a code change
> there and then type "bin/opennlp ..." to run the cli tools.
> With an additional module I need to build it as well, or I run the risk that
> things are out of sync. This happens once in a while with maxent.

OK, for you this is not comfortable. However, you are a developer, while
this refactoring will mostly benefit users, in terms of smaller
dependencies.


> >  and cause more issues;
>>>
>> Please, can you elaborate on this too? Maybe there is something I don't
>> know. Which issues can this cause?
>>
>
> Having three separate jar files (maxent, tools, cli) can cause issues when
> the versions are incompatible (maybe someone forgot to update maxent),
> you need to put three jars on your class path, you can get issues
> with inter-module code changes, there is an additional step in the build
> which can go wrong, etc.
>
> Additionally I once in a while use the cli stuff to do testing in my
> UIMA-AS system.
> There it is just handy that the cli stuff is on my server by default.
>
OK, so that's one more inconvenience. By the way, is it good practice to
do it this way? I would install a proper binary installation, which is
provided, and use a script. An eat-your-own-dog-food kind of thing.

I do not think that deploying command-line utilities in a web application
lib folder is a good idea. It's like having an executable script in a
webapp, and hoping nobody will ever discover and use it. That might seem
difficult to exploit, because you don't know how, but if you don't have
such tools on the classpath, there will not be even a potential exploit.

> Well, we are still missing a few commands I wish to have on my server, e.g.
> show me the version of a model.

I just go inside and look into the description. I use Far, whose
transparent handling of archives makes this easy to do; from a console
that might be less convenient. Again, differences in tools and development
practices. By the way, why not create a JIRA issue for this idea? :)
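
For the "show me the version of a model" idea, a minimal sketch of what such
a command might do. It assumes that a model file is a zip archive containing
a manifest.properties entry with an OpenNLP-Version key; both names are
assumptions here and should be checked against the model format of the
release in use. The class name ModelManifestSketch is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class ModelManifestSketch {
    // Assumed entry name and property key; verify against the actual model format.
    static final String MANIFEST_ENTRY = "manifest.properties";
    static final String VERSION_KEY = "OpenNLP-Version";

    /** Returns the manifest properties, or empty Properties if the entry is missing. */
    static Properties readManifest(InputStream modelStream) throws IOException {
        Properties manifest = new Properties();
        try (ZipInputStream zip = new ZipInputStream(modelStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (MANIFEST_ENTRY.equals(entry.getName())) {
                    // ZipInputStream reports end-of-stream at the end of the current entry.
                    manifest.load(zip);
                    break;
                }
            }
        }
        return manifest;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ModelManifestSketch <model file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Properties manifest = readManifest(in);
            System.out.println(manifest.getProperty(VERSION_KEY, "(no version entry found)"));
        }
    }
}
```

The sketch uses only java.util.zip, so it would not need any OpenNLP classes
on the class path.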



> We don't have any examples, but if we had some, they should be
> distributed as source files. Because the whole point of having them is
> that people can read and copy them.

Yes, I understand this. And in addition it could be helpful to have a
script, which executes them and does a demo.

Aliaksandr

Re: proposal to move CLI tools into separate jar

Posted by Jörn Kottmann <ko...@gmail.com>.

On 12/3/11 5:11 AM, Jason Baldridge wrote:
> Another way to look at this would be to have adaptors and command line
> support for running OpenNLP on the data formats and such that those other
> packages expect.

This would mean just to add support for these to our formats package.

+1 for that

Jörn

Re: proposal to move CLI tools into separate jar

Posted by James Kosin <ja...@gmail.com>.

I have batch files for some of them, that I use to re-train and parse
the data formats...

James

On 12/2/2011 11:11 PM, Jason Baldridge wrote:
> Yeah, I would not propose to go full-GATE style with this. I just meant it
> would be easy to write little wrappers that could produce data for those
> other packages that would make it easy to run them -- they wouldn't even
> necessarily be included with the OpenNLP distro (for the obvious reason of
> not being Apache-clean).
>
> Another way to look at this would be to have adaptors and command line
> support for running OpenNLP on the data formats and such that those other
> packages expect.

Re: proposal to move CLI tools into separate jar

Posted by Jason Baldridge <ja...@gmail.com>.

Yeah, I would not propose to go full-GATE style with this. I just meant it
would be easy to write little wrappers that could produce data for those
other packages that would make it easy to run them -- they wouldn't even
necessarily be included with the OpenNLP distro (for the obvious reason of
not being Apache-clean).

Another way to look at this would be to have adaptors and command line
support for running OpenNLP on the data formats and such that those other
packages expect.

On Fri, Dec 2, 2011 at 6:21 AM, Aliaksandr Autayeu
<al...@autayeu.com> wrote:

> Nice idea! And I agree with Jörn that a good plugin strategy will be key to
> handling this with ease. An interesting task, actually.
>
> Another point is that this seems to move us towards a GATE-like way :) Just
> a little bit. And reminds me of a need to focus.
>
> Aliaksandr



-- 
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Re: proposal to move CLI tools into separate jar

Posted by Aliaksandr Autayeu <al...@autayeu.com>.

Nice idea! And I agree with Jörn that a good plugin strategy will be key to
handling this with ease. An interesting task, actually.

Another point is that this seems to move us towards a GATE-like way :) Just
a little bit. And reminds me of a need to focus.

Aliaksandr

On Fri, Dec 2, 2011 at 10:37 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 12/2/11 6:20 AM, Jason Baldridge wrote:
>
>> I've only very quickly read through this, and don't have a strong opinion
>> on whether to keep CLI as is or move it out. However, one interesting
>> possibility with moving it out would be that we could provide interfaces
>> to other packages, e.g. Stanford tools, Lingpipe, Mallet, etc, for
>> performance comparisons. That would be useful for our purposes, and would
>> be a useful capability for others (especially in research), and it is
>> something we wouldn't want to have as part of the core library (obviously).
>>
>
> If you do that you get dependencies on all these tools. Some of the
> mentioned tools are not compatible with Apache license handling which makes
> it impossible for us to distribute them.
>
> Anyway it is a nice idea and I believe if you do that you should make one
> small sub-project per project. If necessary you can place them at an
> external code hosting site.
>
> For this we need a good plugin strategy for our CLI tools, so that people
> can easily add their own tools.
>
> Jörn
>
>
>

Re: proposal to move CLI tools into separate jar

Posted by Jörn Kottmann <ko...@gmail.com>.

On 12/2/11 6:20 AM, Jason Baldridge wrote:
> I've only very quickly read through this, and don't have a strong opinion on
> whether to keep CLI as is or move it out. However, one interesting
> possibility with moving it out would be that we could provide interfaces to
> other packages, e.g. Stanford tools, Lingpipe, Mallet, etc, for performance
> comparisons. That would be useful for our purposes, and would be a useful
> capability for others (especially in research), and it is something we
> wouldn't want to have as part of the core library (obviously).

If you do that you get dependencies on all these tools. Some of the
mentioned tools are not compatible with Apache license handling which makes
it impossible for us to distribute them.

Anyway it is a nice idea and I believe if you do that you should make one
small sub-project per project. If necessary you can place them at an
external code hosting site.

For this we need a good plugin strategy for our CLI tools, so that
people can easily add their own tools.
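
One possible shape for such a plugin strategy, purely as an illustration: a
small tool interface plus a registry that dispatches on the first argument.
The names CliToolPlugin and ToolRegistrySketch are hypothetical and do not
exist in OpenNLP; java.util.ServiceLoader is the standard JDK mechanism that
could discover implementations shipped in separate jars.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.ServiceLoader;

/** Hypothetical plugin contract; not an existing OpenNLP interface. */
interface CliToolPlugin {
    /** Name used on the command line to select this tool. */
    String name();

    /** Runs the tool with the remaining arguments and returns an exit code. */
    int run(String[] args);
}

final class ToolRegistrySketch {
    private final Map<String, CliToolPlugin> tools = new LinkedHashMap<>();

    ToolRegistrySketch(Iterable<? extends CliToolPlugin> plugins) {
        for (CliToolPlugin plugin : plugins) {
            tools.put(plugin.name(), plugin);
        }
    }

    /** Dispatches to the tool named by args[0]; returns 1 if no such tool is registered. */
    int dispatch(String[] args) {
        if (args.length == 0 || !tools.containsKey(args[0])) {
            System.err.println("available tools: " + tools.keySet());
            return 1;
        }
        String[] rest = Arrays.copyOfRange(args, 1, args.length);
        return tools.get(args[0]).run(rest);
    }

    public static void main(String[] args) {
        // ServiceLoader picks up implementations declared under META-INF/services
        // in any jar on the class path, so external tools need no change here.
        ToolRegistrySketch registry = new ToolRegistrySketch(ServiceLoader.load(CliToolPlugin.class));
        System.exit(registry.dispatch(args));
    }
}
```

With a layout like this, a wrapper for an external package could live in its
own small sub-project and only needs to be put on the class path to show up
as a tool.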

Jörn

Re: proposal to move CLI tools into separate jar

Posted by Jason Baldridge <ja...@gmail.com>.

I've only very quickly read through this, and don't have a strong opinion on
whether to keep CLI as is or move it out. However, one interesting
possibility with moving it out would be that we could provide interfaces to
other packages, e.g. Stanford tools, Lingpipe, Mallet, etc, for performance
comparisons. That would be useful for our purposes, and would be a useful
capability for others (especially in research), and it is something we
wouldn't want to have as part of the core library (obviously).

Jason




Re: proposal to move CLI tools into separate jar

Posted by Jörn Kottmann <ko...@gmail.com>.

On 12/1/11 6:32 PM, Aliaksandr Autayeu wrote:
>> Do you have a sample here? As far as I know we have almost no dependencies
>> on the CLI package from other classes, except for the formats package.
>>
> Yes. POSTaggerCrossValidator.java
>
> import opennlp.tools.cmdline.CmdLineUtil;
> import opennlp.tools.cmdline.TerminateToolException;
>
> Why does it do that? To throw a TerminateToolException that it does not need
> to throw anyway. If it is easy to create a dependency, dependencies will
> creep in.


Nice that you fixed that one :)

>> Having three separate jar files (maxent, tools, cli) can cause issues when
>> the versions are incompatible (maybe someone forgot to update maxent),
>> you need to put three jars on your class path, you can get issues
>> with inter-module code changes, there is an additional step in the build
>> which can go wrong, etc.
>>
>> Additionally I once in a while use the cli stuff to do testing in my
>> UIMA-AS system.
>> There it is just handy that the cli stuff is on my server by default.
>>
> OK, so that's one more inconvenience. By the way, is it good practice to
> do it this way? I would install a proper binary installation, which is
> provided, and use a script. An eat-your-own-dog-food kind of thing.
>
> I do not think that deploying command-line utilities in a web application
> lib folder is a good idea. It's like having an executable script in a
> webapp, and hoping nobody will ever discover and use it. That might seem
> difficult to exploit, because you don't know how, but if you don't have
> such tools on the classpath, there will not be even a potential exploit.

I don't see the security issue here. If you are able to inject code into a
JVM, you will probably do other things than use our command line tools, e.g.
stealing something which gets processed by OpenNLP, reading/writing files,
trying to get root rights on that machine, etc.

>
>> Well, we are still missing a few commands I wish to have on my server, e.g.
>> show me the version of a model.
> I just go inside and look into the description. I use Far, whose
> transparent handling of archives makes this easy to do; from a console
> that might be less convenient. Again, differences in tools and development
> practices. By the way, why not create a JIRA issue for this idea? :)
>

+1
>
>> We don't have any examples, but if we had some, they should be
>> distributed as source files. Because the whole point of having them is
>> that people can read and copy them.
> Yes, I understand this. And in addition it could be helpful to have a
> script, which executes them and does a demo.

Yes, but for a demo we have the tools which can load a model and read
text from the console.

Anyway, we are really missing a web demo, which we could add to our website
(a JIRA issue already exists).

Jörn