Posted to dev@openoffice.apache.org by "Olivier R." <ol...@gmail.com> on 2011/11/07 12:05:16 UTC

Hunspell dictionaries are not just words lists (+ other matters)

Hello everyone,

I don’t like mailing lists, so I have subscribed here just to explain 
a few things about dictionaries. Then I’ll vanish.

Rob Weir wrote:
> Just make sure that you explain what a spell checking dictionary is.
> Otherwise any legal types will be confused.  This is not a dictionary
> like Webster's, with words and definitions, where the definitions are
> creative content.  A spell checking dictionary is more of a word list.
>   I'm not sure what the creative expression is in a list of all common
> words in a language and how that could be copyrighted.  Of course, I
> am not a lawyer.

Few dictionaries are just word lists. Most of them are lists of 
words tagged with flags, and those flags are described in an affixation 
file which specifies the rules used to generate inflexions. This 
affixation file can be quite simple or very complex, and writing one 
can be difficult.
   It looks easy at first, but once you dig deeper into the matter, 
there are often a lot of issues to handle. Creating a proper 
affixation file can be really hard work. And even when the difficulty 
is not high, it can be a very long job.
   So, no, Hunspell dictionaries are not just word lists.
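
To make the idea of "words tagged with flags" concrete, here is a minimal Python sketch of how a flagged entry plus suffix rules expands into inflected forms. It is only loosely modelled on Hunspell's "SFX flag strip add condition" rule lines; the flag name S, the rules and the function name expand are invented for this illustration and do not come from any real dictionary.

```python
import re

# Toy suffix rules in the spirit of Hunspell's "SFX flag strip add condition".
# The flag "S" and every rule below are invented for this illustration.
SUFFIX_RULES = {
    "S": [
        # (characters to strip, characters to add, condition on the stem)
        ("y", "ies", r"[^aeiou]y$"),  # party -> parties
        ("", "s", r"[aeiou]y$"),      # day -> days
        ("", "s", r"[^sxzhy]$"),      # word -> words
        ("", "es", r"[sxzh]$"),       # box -> boxes
    ],
}


def expand(entry):
    """Expand one 'stem/FLAGS' entry into the stem plus its generated forms."""
    stem, _, flags = entry.partition("/")
    forms = [stem]
    for flag in flags:
        for strip, add, condition in SUFFIX_RULES.get(flag, []):
            if re.search(condition, stem):
                base = stem[: len(stem) - len(strip)] if strip else stem
                forms.append(base + add)
    return forms


for entry in ("party/S", "day/S", "box/S", "word/S", "the"):
    print(entry, "->", expand(entry))
```

Even this toy rule set has a side effect: the last rule turns month/S into "monthes". Tracking down and fixing that kind of interaction between rules is a large part of the work on a real affixation file.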

For example, it took me one year and countless hours of work to rewrite 
the affixation file of the French dictionaries from scratch. Even after 
that, there were still a lot of bugs (not spelling mistakes). For a 
year, I had to patch the affixation file regularly. Even after a few 
years, there is still sometimes something to fix. The French 
dictionaries contain approximately 13,000 rules.
   Here is an example of one of the most complex flags:
http://www.dicollecte.org/affixes.php?prj=fr&flag=c2

(AFAIK, there is only one dictionary which has a more complex affixation 
file, the Hungarian one.)

I also tagged the affixation file in order to generate 4 different 
dictionaries with a script, so that users can write according to their 
preferences regarding the optional and controversial French spelling 
reform of 1990.

Besides, 99 % of entries have been grammatically tagged by hand.
   Several contributors did a tremendous job by adding lexical tags, 
adding many words, moving entries into different subdictionaries according 
to our policy, handling special cases, and reporting mistakes and issues. 
Spelling matters are much more complex than you think, 
especially if you want to use your dictionary for grammar checking.
   We often have to handle old, new or variant spellings of a single 
word, and there are decisions to make about what to do with special 
cases, which are actually very numerous. Managing dictionaries is not a 
trivial task.
   Here is the "bugtracker" where we work on the French dictionaries.
http://www.dicollecte.org/propositions.php?prj=fr&tab=E [fr]
   (This bugtracker also allows us to commit in the dictionary in the 
database.)
   The changelog:
http://www.dicollecte.org/log.php?prj=fr

This dictionary is used by both French grammar checkers.

What you said about copyright could be right for a list generated by 
a script from a corpus, but that’s not true for dictionaries which are 
designed by humans with their knowledge, their work and their choices.

> But we'll never resolve this on legal grounds.  At Apache we would not
> bundle a dictionary under a legal theory if the compiler of the
> dictionary did not want us to.  I think we should respect the
> dictionary compiler's wishes and intent,
> _even if legally we're not obligated to_.

Wow... That’s really not encouraging for people who may consider 
changing the license of their work... Does IBM think the same way?
   A few years ago, when I began to contribute to FLOSS, I thought the 
less restrictive licenses were the better ones, only because I didn’t 
care and was ignorant about licensing and political matters.
   As time goes by, I think more and more the opposite. And when I read 
you, I’m beginning to think I was still too soft on that topic.

> 3) We could contact the compilers of the dictionary and ask if they
> would make them available under a different license.   Generally
> people make things available under an OSS license because they want to
> see other projects use them.  If we tell them that a leading
> application like OpenOffice can no longer use their dictionary, this
> might persuade them to change their license.

Here is the situation for the French dictionaries:

1. The Hunspell spelling dictionaries
   Licenses: MPL/LGPL/GPL

   As I am the sole author of the affixation file, and as I grammatically 
tagged about 90 % of all entries myself (without copying another lexicon 
with a script), I can say for sure that I do not intend to change the 
licenses to the Apache one.
   When I built Dicollecte, my goal was to encourage people to 
contribute for the benefit of all and to give back the improvements 
they made. Switching to the Apache license would contradict everything 
I did.

   By the way, these dictionaries _require_ Hunspell. They won’t work 
properly with Myspell. I have seen a lot of people assume that Hunspell 
dictionaries will work with Myspell. That assumption is wrong. Hunspell 
can use Myspell dictionaries, but Hunspell also offers a lot of new 
features which make it possible to improve the structure of dictionaries.
   And Myspell does not recognize double suffixation or double 
prefixation, cannot handle duplicate lemmas, does not handle 
morphological tags, has a limited number of flags, does not recognize 
Hunspell compound commands, etc. (I am not even sure that Myspell can 
use UTF-8 files.)
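
As a rough illustration of what "double suffixation" means, the following Python sketch models suffix rules whose output can itself carry flags, so that a second suffix may be applied to an already derived form. This is a toy model, not Hunspell's or Myspell's actual algorithm: the flags A and P, the rules and the function derive are all invented here.

```python
# Toy model only: flag names, stems and suffixes are invented for illustration.
# Each rule is (suffix, continuation_flags); the continuation flags are the
# flags that the derived form is allowed to take next.
RULES = {
    "A": [("able", "P")],
    "P": [("s", "")],
}


def derive(stem, flags, max_levels):
    """Return all forms reachable by stacking at most max_levels suffixes."""
    forms = [stem]
    if max_levels == 0:
        return forms
    for flag in flags:
        for suffix, next_flags in RULES.get(flag, []):
            forms.extend(derive(stem + suffix, next_flags, max_levels - 1))
    return forms


print(derive("drink", "A", 1))  # single level of suffixation
print(derive("drink", "A", 2))  # a second suffix stacked on the first
```

With a single level, the form "drinkables" is never generated, so a dictionary written to rely on the second level silently loses words when it is read by an engine that only applies one.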

   But, good for you, AFAIK, many dictionaries still have a Myspell 
structure. But not the French ones and some others.

2. The thesaurus
   The initial and main author released it under the LGPL license.
   Now he’s dead. AFAIK, there is no way to change the license before 
his work enters the public domain, and there have also been 
several improvements to the initial work.
   At the moment, I am working on transforming it into a list of 
"synsets" which could be used to generate a better thesaurus. A list of 
synsets would be a far better basis to work on. I don’t know if I will 
succeed. This is a difficult matter and it requires a lot of work.

3. Hyphenation rules
   License: LGPL
   This is a dictionary converted from the hyphenation rules for TeX,
modified somewhat to handle several issues.
   I did nothing on it. I’m just packaging it in the extensions for
OOo/LibO. You'll have to contact the people who created it.

> 4) We could convert another word list or dictionary, one that has a
> better license,  into Hunspell format.

Hmmm...
   You may generate affixation rules for Myspell with a script… but 
then these dictionaries will probably be such a mess that you’ll be 
very lucky to find someone with enough abnegation to improve them. The 
main issues with such dictionaries are:
   - if you just create a list of words, you can only improve it with 
a text parser or other lexicons; it will be hard and tedious to 
improve it manually, as the list will be very, very long, and it will 
waste memory. And each time you regenerate it with your script, 
you’ll have to redo by hand the fixes you made before.
   - if you create an affixation file with a script, your dictionary will 
be a mess with no easy way to improve it, as its structure will 
not be intuitive for a human being. And again, you cannot really mix 
improvements made by script with improvements made by hand.
   The best way is to find a good, already tagged lexicon under a 
non-restrictive license. Even then, you’ll have to write a proper 
affixation file by hand… and then you may discover it is not the easy 
task you thought it was, unless your language is very logical, 
with neither exceptions nor weird stuff…

Regards,
Olivier R.

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Nóirín Plunkett <no...@apache.org>.

On Mon, Nov 7, 2011 at 8:27 AM, Dave Fisher <da...@comcast.net> wrote:
>
> On Nov 7, 2011, at 8:14 AM, Olivier R. wrote:
>
>> Maybe just because you are an Apache member and you make a strong statement on an Apache list about FLOSS you are willing to bundle in your software.
>> I’d prefer an official statement about this point, if you don’t mind.
>
> An official opinion is a reasonable request.
>

Note that you won't likely get an official opinion on this list, and
I'd discourage seeking an official opinion except where the outcome
makes a difference to the work of an Apache project. (Our Legal
Affairs team are also volunteers, and they support a huge number of
projects.)

In this case, as Rob has already said, it doesn't matter whether
claims about copyright are valid or not. Some authors clearly have a
strong preference that we not use their work, and we're generally not
in the business of being jerks.

Noirin

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

On Mon, Nov 7, 2011 at 12:14 PM, Dave Fisher <da...@comcast.net> wrote:
>
> On Nov 7, 2011, at 8:32 AM, Rob Weir wrote:
>
>> On Mon, Nov 7, 2011 at 11:27 AM, Dave Fisher <da...@comcast.net> wrote:
>>> Hi Olivier,
>>>
>>> Thanks for bringing your experienced perspective to the list!
>>>
>>> On Nov 7, 2011, at 8:14 AM, Olivier R. wrote:
>>>
>>>> Le 07/11/2011 16:53, Rob Weir a écrit :
>>>>
>>>>> Why would Apache care about that?
>>>>
>>>> Maybe just because you are an Apache member and you make a strong statement on an Apache list about FLOSS you are willing to bundle in your software.
>>>> I’d prefer an official statement about this point, if you don’t mind.
>>>
>>> Rob is not an Apache Member, neither am I. We are Apache Committers and on the Apache OpenOffice.org (incubating) PPMC.
>>>
>>> An official opinion is a reasonable request.
>>>
>>
>> Whether a question is reasonable depends on the question.
>>
>> Andrea was asking an Apache policy question.  Olivier was asking an
>> abstract legal question.   I think we will receive a real answer to
>> only one of these questions.
>
> True, but these are facets of the same problem, how to incorporate language packs into AOOo in a way that is compatible with each Language pack's license and copyright.
>
> Andrea has been participating on ooo-dev long enough to know how to pose the question and Olivier is experienced in OOo but new on this list.
>
> I don't think it is helpful to throw abstract cabbages into the discussion. It might be silly, but it doesn't help find an answer.
>
> You stated IBM has spelling dictionaries. How does this contribute to the discussion? Is IBM willing to donate these cabbages via SGA to the ASF under the AL2?
>

Consider that to be "Plan B".  I cannot guarantee an outcome, but if
we're unable to find a way to make reasonable use of the Hunspell
dictionaries, I can pursue this internally at IBM.  Keep in mind that
aside from our own dictionaries, we also have our own translation
strings and documentation sets.  Of course, these are for Symphony,
but there is much overlap.

-Rob

> Regards,
> Dave
>
>>
>> -Rob
>>
>>> On the other thread Andrea Pescetti had an interesting point of view that I think is the basis for seeking an opinion from the Apache Legal team (made up of Apache Members)
>>>
>>>
>>> Re: GPL'd dictionaries (was Re: ftp.services.openoffice.org?)
>>>
>>> On Nov 6, 2011, at 11:06 AM, Andrea Pescetti wrote:
>>>
>>>> On 05/11/2011 Gianluca Turconi wrote:
>>>>> 2011/11/5 Pedro Giffuni
>>>>>> I have been looking at the situation of the dictionaries,
>>>>>> and particular the italian dictionary.
>>>>>> You are right that it will not be covered by the SGA.
>>>>
>>>
>>> <big snip>
>>>
>>>>> An AOOo without a native language GUI and linguistic tools would be just
>>>>> useless outside the anglosaxon world and, indeed, a rather disastrous
>>>>> presentation of the new project for people who don't speak English.
>>>>
>>>> Sure, especially considering that the project description says that OpenOffice.org supports 110 languages...
>>>>
>>>> What I would recommend is:
>>>>
>>>> 1) Recheck the Apache policy and find out the rationale behind it; I have nothing to teach to the legal team, but this is a very rare case where the "virality" of GPL does not apply.
>>>>
>>>> 2) See if we can find a way to keep dictionaries as they are; note that no dictionary is developed in the OOo trunk, they are synchronized from time to time, usually before a release; the Italian dictionary SVN trunk, for example, is not in the OOo sources. Even just the possibility to provide an extension that can be included in binary releases would be OK for me.
>>>>
>>>> 3) If there is really no way to include a GPL extension this way, then we should think about downloading the extension at installation time. But we managed to get Sun and the FSF agree to ship dictionaries in the most convenient way (i.e., included in the installer), so we might succeed this time as well.
>>>>
>>>> Regards,
>>>>  Andrea.
>>>
>>> Regards,
>>> Dave
>>>
>>>
>>>>
>>>> Regards,
>>>> Olivier
>>>
>>>
>
>

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Dave Fisher <da...@comcast.net>.

On Nov 7, 2011, at 8:32 AM, Rob Weir wrote:

> On Mon, Nov 7, 2011 at 11:27 AM, Dave Fisher <da...@comcast.net> wrote:
>> Hi Olivier,
>> 
>> Thanks for bringing your experienced perspective to the list!
>> 
>> On Nov 7, 2011, at 8:14 AM, Olivier R. wrote:
>> 
>>> Le 07/11/2011 16:53, Rob Weir a écrit :
>>> 
>>>> Why would Apache care about that?
>>> 
>>> Maybe just because you are an Apache member and you make a strong statement on an Apache list about FLOSS you are willing to bundle in your software.
>>> I’d prefer an official statement about this point, if you don’t mind.
>> 
>> Rob is not an Apache Member, neither am I. We are Apache Committers and on the Apache OpenOffice.org (incubating) PPMC.
>> 
>> An official opinion is a reasonable request.
>> 
> 
> Whether a question is reasonable depends on the question.
> 
> Andrea was asking an Apache policy question.  Olivier was asking an
> abstract legal question.   I think we will receive a real answer to
> only one of these questions.

True, but these are facets of the same problem: how to incorporate language packs into AOOo in a way that is compatible with each language pack's license and copyright.

Andrea has been participating on ooo-dev long enough to know how to pose the question and Olivier is experienced in OOo but new on this list.

I don't think it is helpful to throw abstract cabbages into the discussion. It might be silly, but it doesn't help find an answer. 

You stated IBM has spelling dictionaries. How does this contribute to the discussion? Is IBM willing to donate these cabbages via SGA to the ASF under the AL2?

Regards,
Dave

> 
> -Rob
> 
>> On the other thread Andrea Pescetti had an interesting point of view that I think is the basis for seeking an opinion from the Apache Legal team (made up of Apache Members)
>> 
>> 
>> Re: GPL'd dictionaries (was Re: ftp.services.openoffice.org?)
>> 
>> On Nov 6, 2011, at 11:06 AM, Andrea Pescetti wrote:
>> 
>>> On 05/11/2011 Gianluca Turconi wrote:
>>>> 2011/11/5 Pedro Giffuni
>>>>> I have been looking at the situation of the dictionaries,
>>>>> and particular the italian dictionary.
>>>>> You are right that it will not be covered by the SGA.
>>> 
>> 
>> <big snip>
>> 
>>>> An AOOo without a native language GUI and linguistic tools would be just
>>>> useless outside the anglosaxon world and, indeed, a rather disastrous
>>>> presentation of the new project for people who don't speak English.
>>> 
>>> Sure, especially considering that the project description says that OpenOffice.org supports 110 languages...
>>> 
>>> What I would recommend is:
>>> 
>>> 1) Recheck the Apache policy and find out the rationale behind it; I have nothing to teach to the legal team, but this is a very rare case where the "virality" of GPL does not apply.
>>> 
>>> 2) See if we can find a way to keep dictionaries as they are; note that no dictionary is developed in the OOo trunk, they are synchronized from time to time, usually before a release; the Italian dictionary SVN trunk, for example, is not in the OOo sources. Even just the possibility to provide an extension that can be included in binary releases would be OK for me.
>>> 
>>> 3) If there is really no way to include a GPL extension this way, then we should think about downloading the extension at installation time. But we managed to get Sun and the FSF agree to ship dictionaries in the most convenient way (i.e., included in the installer), so we might succeed this time as well.
>>> 
>>> Regards,
>>>  Andrea.
>> 
>> Regards,
>> Dave
>> 
>> 
>>> 
>>> Regards,
>>> Olivier
>> 
>>

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

On Mon, Nov 7, 2011 at 11:27 AM, Dave Fisher <da...@comcast.net> wrote:
> Hi Olivier,
>
> Thanks for bringing your experienced perspective to the list!
>
> On Nov 7, 2011, at 8:14 AM, Olivier R. wrote:
>
>> Le 07/11/2011 16:53, Rob Weir a écrit :
>>
>>> Why would Apache care about that?
>>
>> Maybe just because you are an Apache member and you make a strong statement on an Apache list about FLOSS you are willing to bundle in your software.
>> I’d prefer an official statement about this point, if you don’t mind.
>
> Rob is not an Apache Member, neither am I. We are Apache Committers and on the Apache OpenOffice.org (incubating) PPMC.
>
> An official opinion is a reasonable request.
>

Whether a question is reasonable depends on the question.

Andrea was asking an Apache policy question.  Olivier was asking an
abstract legal question.   I think we will receive a real answer to
only one of these questions.

-Rob

> On the other thread Andrea Pescetti had an interesting point of view that I think is the basis for seeking an opinion from the Apache Legal team (made up of Apache Members)
>
>
> Re: GPL'd dictionaries (was Re: ftp.services.openoffice.org?)
>
> On Nov 6, 2011, at 11:06 AM, Andrea Pescetti wrote:
>
>> On 05/11/2011 Gianluca Turconi wrote:
>>> 2011/11/5 Pedro Giffuni
>>>> I have been looking at the situation of the dictionaries,
>>>> and particular the italian dictionary.
>>>> You are right that it will not be covered by the SGA.
>>
>
> <big snip>
>
>>> An AOOo without a native language GUI and linguistic tools would be just
>>> useless outside the anglosaxon world and, indeed, a rather disastrous
>>> presentation of the new project for people who don't speak English.
>>
>> Sure, especially considering that the project description says that OpenOffice.org supports 110 languages...
>>
>> What I would recommend is:
>>
>> 1) Recheck the Apache policy and find out the rationale behind it; I have nothing to teach to the legal team, but this is a very rare case where the "virality" of GPL does not apply.
>>
>> 2) See if we can find a way to keep dictionaries as they are; note that no dictionary is developed in the OOo trunk, they are synchronized from time to time, usually before a release; the Italian dictionary SVN trunk, for example, is not in the OOo sources. Even just the possibility to provide an extension that can be included in binary releases would be OK for me.
>>
>> 3) If there is really no way to include a GPL extension this way, then we should think about downloading the extension at installation time. But we managed to get Sun and the FSF agree to ship dictionaries in the most convenient way (i.e., included in the installer), so we might succeed this time as well.
>>
>> Regards,
>>  Andrea.
>
> Regards,
> Dave
>
>
>>
>> Regards,
>> Olivier
>
>

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Dave Fisher <da...@comcast.net>.

On Nov 8, 2011, at 7:44 AM, Shane Curcuru wrote:

> Without commenting on the merits of Hunspell or how to - or how not to - incorporate it or related dictionaries into AOOo builds, I will note that this thread has strayed off topic for the developers here on ooo-dev@.
> 
> If an Apache project needs legal advice to move forward with their project's work, it should form a *specific* question and take it to legal-discuss@.  Otherwise, we should focus on figuring out how to make a great AOOo build and organizing the many migration tasks from the *.oo.o handover here on ooo-dev@ and leave complex and non-specific legal discussions elsewhere.
> 
> - Shane, having seen far too many discussions between developers about the law distract far too much attention from writing useful code

It seems that the other thread - "GPL'd Dictionaries ..." has returned back to the beginning and is focused on an answer - at least for Installation.

Someone, I don't know who, will need to focus on the extensions and templates servers at OSUOSL which are Drupal and can be hammered at times.

See you in Vancouver later today.

Regards,
Dave

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by "Olivier R." <ol...@gmail.com>.

You are right. I'm leaving now. I'll just read whatever conclusion is 
coming out of all of this.

Goodbye.

Olivier

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Shane Curcuru <as...@shanecurcuru.org>.

Without commenting on the merits of Hunspell or how to - or how not to - 
incorporate it or related dictionaries into AOOo builds, I will note 
that this thread has strayed off topic for the developers here on ooo-dev@.

If an Apache project needs legal advice to move forward with their 
project's work, it should form a *specific* question and take it to 
legal-discuss@.  Otherwise, we should focus on figuring out how to make 
a great AOOo build and organizing the many migration tasks from the 
*.oo.o handover here on ooo-dev@ and leave complex and non-specific 
legal discussions elsewhere.

- Shane, having seen far too many discussions between developers about 
the law distract far too much attention from writing useful code

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Christian Lohmaier <cl...@openoffice.org>.

Hi Rob, *,

On Tue, Nov 8, 2011 at 1:58 AM, Rob Weir <ro...@apache.org> wrote:
> On Mon, Nov 7, 2011 at 7:29 PM, Christian Lohmaier <cl...@openoffice.org> wrote:
>> On Tue, Nov 8, 2011 at 12:34 AM, Rob Weir <ro...@apache.org> wrote:
>> [...]
>> Why don't you just admit that you have absolutely no clue about how
>> the dictionaries (or for that matter hunspell/affix compression as a
>> whole) works?
>
> Actually, I know quite a lot about spell checking and dictionaries.
> And copyright.  How a work is created is totally irrelevant. A
> painting is not copyrightable because of what paint the artist uses or
> how they hold the brush.

Yes, and  nobody claims the words that a dictionary represents or
whatever fragment of grammar it entails would be copyrightable.

> Spell checking dictionaries are just compilations of facts

As is any other software. Following that logic, you could not put a
copyright on *anything*, as the laws of physics and math are the same
for everyone.

> that are
> constrained by the preexisting external facts of the language.  The
> compiler of the dictionary does not create these facts.

No computer dictionary in the world is a perfect representation of the
"facts" that make up a language. There is no dictionary with 100%
accuracy. There is no way to take a dictionary and reverse-engineer the
language from it.

Unmunge a hunspell dictionary, especially one with compounds enabled,
and you will get gigabyte after gigabyte of "valid" (by the rules of
the dictionary, not by the rules of the language) words.
Claiming that a dictionary represents external *facts* of the language
just doesn't make any sense.
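
To give a feel for why expanding a compounding dictionary explodes, here is a small Python sketch that simply concatenates stems with no linguistic filtering at all. It is a deliberately naive assumption: real Hunspell compounding is restricted by directives such as COMPOUNDFLAG and COMPOUNDMIN, and the stem list, the stem count of 50,000 and both function names are made up for the example.

```python
from itertools import product


def compound_forms(stems, max_parts):
    """Yield every concatenation of 1..max_parts stems, with no filtering."""
    for n in range(1, max_parts + 1):
        for parts in product(stems, repeat=n):
            yield "".join(parts)


def compound_count(num_stems, max_parts):
    """Number of forms compound_forms would yield for num_stems stems."""
    return sum(num_stems ** n for n in range(1, max_parts + 1))


# Two stems, at most two parts: already six "valid" words.
print(list(compound_forms(["book", "shelf"], 2)))

# A hypothetical 50,000-stem dictionary with compounds of up to three parts.
print(compound_count(50_000, 3))
```

Most of the generated strings (such as "shelfshelf") are words only by the rules of the toy dictionary, not by the rules of any language.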

> He  merely
> encodes them.

No, this is not true. If it were merely an encoding of facts, you would
get a perfect dictionary. But which affix transformations are
created depends on the creator of the dictionary, the stems that are
included in the dictionary, and the level of accuracy that is targeted.
The existing affix rules affect other rules in complex ways. These are
not just "outside facts".

>  The particular dictionary might be copyrightable as a
> specific selection, coordination and arrangement of these facts, but
> fair use would allow me to extract the  same facts from the
> dictionaries, via reverse engineering, and make my own selection,
> coordination and arrangement of these same facts and distribute them
> as my own dictionary.

Of course you are free to create your own dictionary. But once again
your conclusion is silly to the point where I cannot take you
seriously.

Your logic really means that you cannot copyright any kind of software,
because you are still able to write your own copy that does the same,
since the fundamental math that makes up software is the same and
you're just rearranging some keywords, the fundamental facts of the
programming language.

This is stupid.

While it is true that you can rewrite software to do the same, and
copyright does not hinder you from doing so (as Tor implied, there might
be other means like patents or other stuff that have nothing to do with
copyright), you all still put copyright statements in source code.

>  In other words, you might be able to protect
> the compilation of facts, but you cannot protect the underlying facts,

Yes, with that I agree. (Everyone does I guess). Except the
"compilation of facts" part. It is not a compilation of facts. It is
"guesswork", closing the gaps to the actual facts.

You might be able to do a "just a compilation of facts" style
dictionary for an artificial language, but not for a language that
people are actually using in real life.

> or prevent people from copying your encoding of these facts and
> distributing a different arrangement of them.

Here I (and others) strongly disagree with you. Copying the encoding
of the facts and just altering it is no different from taking
source code from any software, putting your nametag on it and shuffling
things around.

The important matter is *different arrangement* here. Once again: You
are free to (attempt to) create a dictionary for the same language by
yourself. Language itself of course is not protected. But you will not
end up with the same "encoding" (approximation) of the language since
it is not just a matter of collecting facts. It is a creative process.
And once again I challenge your knowledge about the dictionaries. I
just cannot explain otherwise how you can claim it is just a
collection of facts with no creative effort behind it.

Once again: Following your path of thought, you could not put a
copyright on any software, as the rules of math are the same for
anyone, and you're just applying those rules to create the same
result. And this is nonsense.

Copyright is not there to prevent others from creating stuff that does or
behaves the same.

Copyright covers the actual way it is done; it applies to the
concrete solution to the given problem.

> This should not be hard to understand.  Free software advocates argue
> all the time that software cannot be patented because it is "just
> math".  Books have been written about it.  So why is it so hard to
> understand that linguistic facts cannot be copyrighted?

Because you don't understand that a dictionary doesn't represent
linguistic facts. There is no such thing as a linguistic fact.
If there were, you could create a perfect dictionary.
There is an approximation at best. A computer dictionary is not a list
of words, is not a list of hard facts.

> In any case, these are important concepts to understand.  If it is not
> clearer, after reading this response, then try doing a Google query on
> terms such as copyright, compilation and facts.

No, it is pointless to search for those terms when the basic assumption
that a dictionary is a mere compilation of facts is already wrong.

ciao
Christian

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by "Olivier R." <ol...@gmail.com>.

Hello all,

Le 08/11/2011 01:58, Rob Weir a écrit :

> Spell checking dictionaries are just compilations of facts that are
> constrained by the preexisting external facts of the language.  The
> compiler of the dictionary does not create these facts.  He  merely
> encodes them.  The particular dictionary might be copyrightable as a
> specific selection, coordination and arrangement of these facts, but
> fair use would allow me to extract the  same facts from the
> dictionaries, via reverse engineering, and make my own selection,
> coordination and arrangement of these same facts and distribute them
> as my own dictionary.  In other words, you might be able to protect
> the compilation of facts, but you cannot protect the underlying facts,
> or prevent people from copying your encoding of these facts and
> distributing a different arrangement of them.  Copyright protection on
> a compilation of facts is extremely thin.  It is that simple.

I am no expert on legal matters, and I think you might get different 
legal answers in different countries.

So I’ll try to stay on technical ground.

Let’s assume that someone wants to create a Hunspell dictionary from 
scratch. He finds a huge lexicon of well-organized information about 
his language, a proper list of words with morphological data, tags, etc. 
Let’s assume this is just a compilation of facts.

(Actually, even saying this lexicon is a mere compilation of facts is 
arguable, because there can also be a lot of specific classification, 
personal tags, interpretation data, etc. Otherwise, we wouldn’t have had 
so many arguments when we tagged the French dictionary. But let’s 
assume it anyway.)

Would this list _tell_ him to create an affixation file? No.
Would this list _help_ him to create an affixation file? No.
Is there just one way to create an affixation file from this list? No.

Actually, even if I had had such a lexicon of all facts about the French 
language when I began the work on the affixation file, it would have 
required as much time, as much reflexion and as many personal choices.

Creating an affixation file is on a higher level than just collecting 
data. It’s not a way of classifying or tagging or selecting data.

So, what is an affixation file? It is a description of a compression 
algorithm: a human-understandable logic for factorizing the data of a 
specific language.

The lexicon could have been compressed with zip, rar, 7z or whatever 
algorithm. In the same way, there are many ways to factorize a lexicon 
with a human-understandable logic.
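
The point that one lexicon admits many factorizations can be sketched in a few lines of Python. The three "dictionaries" below are invented toy examples (English verbs, made-up flags V, S and D, suffix-only rules), and expand is a hypothetical helper, not a Hunspell function; they differ in structure yet generate exactly the same set of words.

```python
def expand(stems, rules):
    """Expand {stem: flags} with {flag: [suffix, ...]} into a set of words."""
    words = set()
    for stem, flags in stems.items():
        words.add(stem)
        for flag in flags:
            for suffix in rules[flag]:
                words.add(stem + suffix)
    return words


WORDS = {"walk", "walks", "walked", "talk", "talks", "talked"}

# Factorization A: no rules at all, every form is its own entry.
stems_a = {word: "" for word in WORDS}
rules_a = {}

# Factorization B: two stems, one flag carrying both suffixes.
stems_b = {"walk": "V", "talk": "V"}
rules_b = {"V": ["s", "ed"]}

# Factorization C: two stems, two flags with one suffix each.
stems_c = {"walk": "SD", "talk": "SD"}
rules_c = {"S": ["s"], "D": ["ed"]}

for stems, rules in ((stems_a, rules_a), (stems_b, rules_b), (stems_c, rules_c)):
    print(len(stems), "entries,", sum(len(r) for r in rules.values()), "rules:",
          expand(stems, rules) == WORDS)
```

Which of these layouts is easiest for a human to maintain and extend is a design decision; nothing in the word list itself dictates it.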

When I created the French affixation file, there was already an 
existing one, but I was really not satisfied with it, so I rewrote it.
With the previous French dictionary, there were approximately 600 rules 
in the affixation file, and 92,000 entries in the word list.
After one year of work on the new affixation file, there were 
approximately 12,000 rules and 60,000 entries, but this new dictionary 
generates more inflexions than the previous one, and also far fewer 
mistakes (because affixation files can also have a lot of side effects 
and can generate a lot of wrong inflexions).

Even now, the compression method could be really different from what it 
is, while the data set would stay the same. And, actually, I’m 
considering modifying it to better fit the grammar checker, which 
retrieves this data from Hunspell.

So, can a very specific description of a compression algorithm for 
language data be copyrighted? I don’t know, but I think this is a 
creative matter.

HTH.

Regards,
Olivier

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

On Mon, Nov 7, 2011 at 7:29 PM, Christian Lohmaier <cl...@openoffice.org> wrote:
> Hi Rob,
>
> On Tue, Nov 8, 2011 at 12:34 AM, Rob Weir <ro...@apache.org> wrote:
>>
>> The complexity of the language is irrelevant.  The point is that the
>> complexity is not created or invented by the person who compiles the
>> dictionary.  The complexity is not the creative expression of an
>> author.  The compiler of a spell checking dictionary is just recording
>> facts about the language.
>
> This is complete <censored/>.
>
> Why don't you just admit that you have absolutely no clue about how
> the dictionaries (or for that matter hunspell/affix compression as a
> whole) works?
>

Actually, I know quite a lot about spell checking and dictionaries.
And copyright.  How a work is created is totally irrelevant. A
painting is not copyrightable because of what paint the artist uses or
how they hold the brush.

Spell checking dictionaries are just compilations of facts that are
constrained by the preexisting external facts of the language.  The
compiler of the dictionary does not create these facts.  He merely
encodes them.  The particular dictionary might be copyrightable as a
specific selection, coordination and arrangement of these facts, but
fair use would allow me to extract the same facts from the
dictionaries, via reverse engineering, and make my own selection,
coordination and arrangement of these same facts and distribute them
as my own dictionary.  In other words, you might be able to protect
the compilation of facts, but you cannot protect the underlying facts,
or prevent people from copying your encoding of these facts and
distributing a different arrangement of them.  Copyright protection on
a compilation of facts is extremely thin.  It is that simple.

Now Apache might decide to honor the dictionary compilers' wishes
despite the above, but that is by Apache policy, not because of any
copyright.

This should not be hard to understand.  Free software advocates argue
all the time that software cannot be patented because it is "just
math".  Books have been written about it.  So why is it so hard to
understand that linguistic facts cannot be copyrighted?  And why
should you be offended by me pointing out that your work as a
dictionary compiler enlarges an intellectual commons? That should not
be something you should be offended by.  Do you think your work is
only valuable if you can draw a box around it and enforce exclusionary
access?

In any case, these are important concepts to understand.  If it is not
clear after reading this response, then try doing a Google query on
terms such as copyright, compilation and facts.

Regards,

-Rob

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Tor Lillqvist <tm...@iki.fi>.

> Unless you provide the spellchecking dictionaries that IBM uses and
> prove your point,

IBM might well think/realise that the spellchecking dictionaries their
products use are not protected by copyright. That doesn't mean Rob, or
some of IBM's customers, would have any right to redistribute them.
You seem to be thinking copyright is the only form of intellectual
property protection.

> But I bet you can't. As they are copyrighted.

Or because he has entered into a contract that forbids him.

(Obviously, I am not a lawyer, so the above is probably complete crack.)

--tml

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Christian Lohmaier <cl...@openoffice.org>.

Hi Rob,

On Tue, Nov 8, 2011 at 12:34 AM, Rob Weir <ro...@apache.org> wrote:
>
> The complexity of the language is irrelevant.  The point is that the
> complexity is not created or invented by the person who compiles the
> dictionary.  The complexity is not the creative expression of an
> author.  The compiler of a spell checking dictionary is just recording
> facts about the language.

This is complete <censored/>.

Why don't you just admit that you have absolutely no clue about how
the dictionaries (or for that matter hunspell/affix compression as a
whole) work?

Unless you provide the spellchecking dictionaries that IBM uses and
prove your point, you should really shut up and let the topic rest.
And it's irrelevant whether IBM uses a different spellchecking engine.
If the dictionaries are not copyrightable, then show them, provide
them to the project. And we'll see how to hook them up to
OOo/LibreOffice/other projects.

But I bet you can't. As they are copyrighted. Because you're talking
about "wordlists", "phonebooks" and the like, but not about
dictionaries.

So once again: Either stop wasting everybody's time, or provide the
dictionaries that IBM uses in Symphony.

30 messages in this thread already. I cannot believe it. Get
productive instead of playing with words!

ciao
Christian

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

On Mon, Nov 7, 2011 at 5:24 PM, Pedro Giffuni <pf...@apache.org> wrote:
> Hi Olivier;
>
> --- On Mon, 11/7/11, Olivier R. <ol...@gmail.com> wrote:
>
>>
>> I read that a legal issue would be raised about the GPL
>> dictionaries. Then Rob was wondering how dictionaries could
>> be copyrighted. I thought that if lawyers knew that
>> copyrights on such matter were irrelevant, they would not
>> forget to mention it. So that would concern every
>> dictionaries, whatever rights authors may claim on their
>> work, whatever license they chose. That was my assumption.
>>
>
> There were some doubts on the impact of having a dictionary
> under the GPL and there was a line of reasoning mentioning
> it's not really code so the effect is minimal.
>
> If it's not code I was thinking a documentation license should
> be used instead, but thinking at it better, the grammatical and
> syntax rules make such dictionaries behave more like scripts
> than as mere data so a code license is appropriate.
>
> FWIW, I think Rob is talking about a completely different
> concept: IBM uses a NLP tool so perhaps for them the simpler
> structure of English is appropriately managed by an AI
> software package and a dumb list of words.
>

The complexity of the language is irrelevant.  The point is that the
complexity is not created or invented by the person who compiles the
dictionary.  The complexity is not the creative expression of an
author.  The compiler of a spell checking dictionary is just recording
facts about the language.  This may be difficult work.  This might
require skill. This might take a lot of time.  This might have value.
But that does not necessarily mean that it is copyright-able.  IMHO,
no single person can claim copyright on the language rules that the
French people have collectively and organically developed over 1000
years.

> Sorry on our side too for the confusion.
>
> Cheers,
>
> Pedro.
>
>
>
>> Anyway, I’ll leave soon, as there is probably nothing
>> more to say.
>>
>> Best regards,
>> Olivier
>>
>>
>

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Pedro Giffuni <pf...@apache.org>.

Hi Olivier;

--- On Mon, 11/7/11, Olivier R. <ol...@gmail.com> wrote:

> 
> I read that a legal issue would be raised about the GPL
> dictionaries. Then Rob was wondering how dictionaries could
> be copyrighted. I thought that if lawyers knew that
> copyrights on such matter were irrelevant, they would not
> forget to mention it. So that would concern every
> dictionaries, whatever rights authors may claim on their
> work, whatever license they chose. That was my assumption.
> 

There were some doubts on the impact of having a dictionary
under the GPL and there was a line of reasoning mentioning
it's not really code so the effect is minimal.

If it's not code, I was thinking a documentation license should
be used instead, but thinking about it more, the grammatical and
syntax rules make such dictionaries behave more like scripts
than mere data, so a code license is appropriate.

FWIW, I think Rob is talking about a completely different
concept: IBM uses an NLP tool, so perhaps for them the simpler
structure of English is appropriately managed by an AI
software package and a dumb list of words.

Sorry on our side too for the confusion.

Cheers,

Pedro.

> Anyway, I’ll leave soon, as there is probably nothing
> more to say.
> 
> Best regards,
> Olivier
> 
>

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by "Olivier R." <ol...@gmail.com>.

Hello Dave,

Le 07/11/2011 17:27, Dave Fisher a écrit :

> Rob is not an Apache Member, neither am I. We are Apache Committers
> and on the Apache OpenOffice.org (incubating) PPMC.

Thanks for the clarification.
To all, please accept my apologies for my confusion.


> An official opinion is a reasonable request.
>
> On the other thread Andrea Pescetti had an interesting point of view
> that I think is the basis for seeking an opinion from the Apache
> Legal team (made up of Apache Members)

I read that a legal issue would be raised about the GPL dictionaries. 
Then Rob was wondering how dictionaries could be copyrighted. I thought 
that if lawyers knew that copyrights on such matters were irrelevant, 
they would not forget to mention it. So that would concern all 
dictionaries, whatever rights authors may claim on their work, whatever 
license they chose. That was my assumption.

Anyway, I’ll leave soon, as there is probably nothing more to say.

Best regards,
Olivier

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Shane Curcuru <as...@shanecurcuru.org>.

On 2011-11-07 8:27 AM, Dave Fisher wrote:
> Hi Olivier,
>
> Thanks for bringing your experienced perspective to the list!
>
> On Nov 7, 2011, at 8:14 AM, Olivier R. wrote:
>
>> Le 07/11/2011 16:53, Rob Weir a écrit :
>>
>>> Why would Apache care about that?
>>
>> Maybe just because you are an Apache member and you make a strong statement on an Apache list about FLOSS you are willing to bundle in your software.
>> I’d prefer an official statement about this point, if you don’t mind.
>
> Rob is not an Apache Member, neither am I. We are Apache Committers and on the Apache OpenOffice.org (incubating) PPMC.

The list of Apache Members is here:
   http://www.apache.org/foundation/members.html

A very brief description of roles is here:
   http://www.apache.org/foundation/how-it-works.html#roles

> An official opinion is a reasonable request.

If the PPMC does have specific legal (or other) questions about Apache 
policy, then the PPMC should take those specific questions or proposals 
to legal-internal@ or the appropriate mailing list.  But you need to 
have a specific question for some purpose for the project.  The ASF's 
Foundation-level activities are designed to support its projects in 
their direct work in building software for the public good. Hypothetical 
questions are unlikely to get official policy answers from VP, Legal or 
other officers, so it's important to be specific.

- Shane

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Dave Fisher <da...@comcast.net>.

Hi Olivier,

Thanks for bringing your experienced perspective to the list!

On Nov 7, 2011, at 8:14 AM, Olivier R. wrote:

> Le 07/11/2011 16:53, Rob Weir a écrit :
> 
>> Why would Apache care about that?
> 
> Maybe just because you are an Apache member and you make a strong statement on an Apache list about FLOSS you are willing to bundle in your software.
> I’d prefer an official statement about this point, if you don’t mind.

Rob is not an Apache Member, neither am I. We are Apache Committers and on the Apache OpenOffice.org (incubating) PPMC.

An official opinion is a reasonable request.

On the other thread Andrea Pescetti had an interesting point of view that I think is the basis for seeking an opinion from the Apache Legal team (made up of Apache Members)


Re: GPL'd dictionaries (was Re: ftp.services.openoffice.org?)

On Nov 6, 2011, at 11:06 AM, Andrea Pescetti wrote:

> On 05/11/2011 Gianluca Turconi wrote:
>> 2011/11/5 Pedro Giffuni
>>> I have been looking at the situation of the dictionaries,
>>> and particular the italian dictionary.
>>> You are right that it will not be covered by the SGA.
> 

<big snip>

>> An AOOo without a native language GUI and linguistic tools would be just
>> useless outside the anglosaxon world and, indeed, a rather disastrous
>> presentation of the new project for people who don't speak English.
> 
> Sure, especially considering that the project description says that OpenOffice.org supports 110 languages...
> 
> What I would recommend is:
> 
> 1) Recheck the Apache policy and find out the rationale behind it; I have nothing to teach to the legal team, but this is a very rare case where the "virality" of GPL does not apply.
> 
> 2) See if we can find a way to keep dictionaries as they are; note that no dictionary is developed in the OOo trunk, they are synchronized from time to time, usually before a release; the Italian dictionary SVN trunk, for example, is not in the OOo sources. Even just the possibility to provide an extension that can be included in binary releases would be OK for me.
> 
> 3) If there is really no way to include a GPL extension this way, then we should think about downloading the extension at installation time. But we managed to get Sun and the FSF agree to ship dictionaries in the most convenient way (i.e., included in the installer), so we might succeed this time as well.
> 
> Regards,
>  Andrea.

Regards,
Dave


> 
> Regards,
> Olivier

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Andrew Rist <an...@oracle.com>.


On 11/7/2011 8:57 AM, Andre Schnabel wrote:
> Hi again ...
>
>> Datum: Mon, 7 Nov 2011 11:45:46 -0500
>> Von: "Louis Suárez-Potts"<ls...@gmail.com>
>> An: ooo-dev@incubator.apache.org
>> Betreff: Re: Hunspell dictionaries are not just words lists (+ other matters)
>> Rob, probably I am a cabbage, but kings do make odd laws and so my
>> query was not about logic and your opinion but about what we have
>> actually encountered in the doing of this. OOo has been engaged in
>> this activity for 10 years, and André for much of that time has been
>> deeply involved. So my question is: what have we learned in this
>> matter?
>
> Ok - just forgot. I learned at OOo, that Sun required a SCA for l10n
> contributions. So either Sun was paranoid or there might be some legal
> reason.
Or they might have been contributed.  ;-)
Andrew
>
> regards,
>
> André

-- 

Andrew Rist | Interoperability Architect
Oracle Corporate Architecture Group
Redwood Shores, CA | 650.506.9847

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Andre Schnabel <An...@gmx.net>.

Hi again ...

> Datum: Mon, 7 Nov 2011 11:45:46 -0500
> Von: "Louis Suárez-Potts" <ls...@gmail.com>
> An: ooo-dev@incubator.apache.org
> Betreff: Re: Hunspell dictionaries are not just words lists (+ other matters)

> Rob, probably I am a cabbage, but kings do make odd laws and so my
> query was not about logic and your opinion but about what we have
> actually encountered in the doing of this. OOo has been engaged in
> this activity for 10 years, and André for much of that time has been
> deeply involved. So my question is: what have we learned in this
> matter?


Ok - just forgot. I learned at OOo that Sun required a SCA for l10n 
contributions. So either Sun was paranoid or there might be some legal
reason.

regards,

André

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Louis Suárez-Potts <ls...@gmail.com>.

Rob, probably I am a cabbage, but kings do make odd laws and so my
query was not about logic and your opinion but about what we have
actually encountered in the doing of this. OOo has been engaged in
this activity for 10 years, and André for much of that time has been
deeply involved. So my question is: what have we learned in this
matter?

Louis

On 7 November 2011 11:43, Rob Weir <ro...@apache.org> wrote:
> On Mon, Nov 7, 2011 at 11:28 AM, Louis Suárez-Potts <lo...@apache.org> wrote:
>> André,
>> Do we have an account of the difficulties encountered by the
>> localizers of OOo language packs and related data? As to Rob's point,
>> I think a relevant issue is that in translations such as those
>> required by localizations, the word chosen to translate the original
>> is an interpretation, and its quality (value) depends on the skill of
>> the localizer.
>>
>
> And creating a catalog of birds observed in a park also requires
> skill.  But that does not make it a creative work.
>
> Compare the following catalogs:
>
> A phone book:
>
> Abel, George W, 212-332-3294
> Abel, Thomas S. 212-433-2322
>
> etc.
>
> A catalog of weather observations:
>
> 1970-12-01, Boston, 42.3, 18.2, 5, NE
> 1970-12-02, Miami, 74.2,  52.6, 10, SW
>
> A list of biographical information:
>
> Napoleon Bonaparte, French, 1769-1821
> Frederick the Great, German, 1712-1786
>
> A tagged list of words in a language:
>
> agricola, agricolae (m) (noun)
> amo, amare, amavi, amatus sum (verb)
>
> These are all just lists of facts. You might try to claim a copyright
> on the particular selection and arrangement of these facts, but the
> underlying facts cannot be protected.  So, for example, someone could
> extract (reverse engineer) the underlying facts from the work, and
> arrange them differently or with a different selection, and it would
> be perfectly fine.  Copyright does not protect the underlying facts,
> no matter how hard it was for someone to collect the facts or to type
> them in.
>
> -Rob
>
>
>> Or so I understand.
>>
>> Louis
>>
>> On 7 November 2011 11:20, Rob Weir <ro...@apache.org> wrote:
>>> On Mon, Nov 7, 2011 at 11:14 AM, Olivier R. <ol...@gmail.com> wrote:
>>>> Le 07/11/2011 16:53, Rob Weir a écrit :
>>>>
>>>>> Why would Apache care about that?
>>>>
>>>> Maybe just because you are an Apache member and you make a strong statement
>>>> on an Apache list about FLOSS you are willing to bundle in your software.
>>>> I’d prefer an official statement about this point, if you don’t mind.
>>>>
>>>
>>> I think it would be obvious to a cabbage that no one is going to
>>> recognize copyright claims on things that cannot be validly claimed
>>> under copyright law.  It is also clear that the determination of this
>>> for any specific artifact, like a specific spell checking dictionary
>>> would require detailed analysis.  Since Apache does not hand out free
>>> legal advice, I don't think you will get an official response to your
>>> hypothetical question.
>>>
>>> -Rob
>>>
>>>
>>>> Regards,
>>>> Olivier
>>>>
>>>
>>
>

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

On Mon, Nov 7, 2011 at 11:28 AM, Louis Suárez-Potts <lo...@apache.org> wrote:
> André,
> Do we have an account of the difficulties encountered by the
> localizers of OOo language packs and related data? As to Rob's point,
> I think a relevant issue is that in translations such as those
> required by localizations, the word chosen to translate the original
> is an interpretation, and its quality (value) depends on the skill of
> the localizer.
>

And creating a catalog of birds observed in a park also requires
skill.  But that does not make it a creative work.

Compare the following catalogs:

A phone book:

Abel, George W, 212-332-3294
Abel, Thomas S. 212-433-2322

etc.

A catalog of weather observations:

1970-12-01, Boston, 42.3, 18.2, 5, NE
1970-12-02, Miami, 74.2,  52.6, 10, SW

A list of biographical information:

Napoleon Bonaparte, French, 1769-1821
Frederick the Great, German, 1712-1786

A tagged list of words in a language:

agricola, agricolae (m) (noun)
amo, amare, amavi, amatus sum (verb)

These are all just lists of facts. You might try to claim a copyright
on the particular selection and arrangement of these facts, but the
underlying facts cannot be protected.  So, for example, someone could
extract (reverse engineer) the underlying facts from the work, and
arrange them differently or with a different selection, and it would
be perfectly fine.  Copyright does not protect the underlying facts,
no matter how hard it was for someone to collect the facts or to type
them in.

-Rob


> Or so I understand.
>
> Louis
>
> On 7 November 2011 11:20, Rob Weir <ro...@apache.org> wrote:
>> On Mon, Nov 7, 2011 at 11:14 AM, Olivier R. <ol...@gmail.com> wrote:
>>> Le 07/11/2011 16:53, Rob Weir a écrit :
>>>
>>>> Why would Apache care about that?
>>>
>>> Maybe just because you are an Apache member and you make a strong statement
>>> on an Apache list about FLOSS you are willing to bundle in your software.
>>> I’d prefer an official statement about this point, if you don’t mind.
>>>
>>
>> I think it would be obvious to a cabbage that no one is going to
>> recognize copyright claims on things that cannot be validly claimed
>> under copyright law.  It is also clear that the determination of this
>> for any specific artifact, like a specific spell checking dictionary
>> would require detailed analysis.  Since Apache does not hand out free
>> legal advice, I don't think you will get an official response to your
>> hypothetical question.
>>
>> -Rob
>>
>>
>>> Regards,
>>> Olivier
>>>
>>
>

RE: Hunspell dictionaries are not just words lists (+ other matters)

Posted by "Dennis E. Hamilton" <de...@acm.org>.

I don't want to get into this particular discussion.  I think viable 
approaches are on the table and I have nothing to add.

As an armchair copyright maven, here is something on the specific case of 
translations:

In US Copyright, one of the exclusive rights of a copyright holder is the 
creation of translations of the copyrighted work.  (Translation is included 
under "derivative work" by definition).

I can't tell from the brief statement Gianluca provides whether the Italian 
law is addressing the same point, but it appears to be.

Since a translation is explicitly included as a possible "work made for hire" 
in the US Copyright regime, a licensed (i.e., non-infringing) translation is 
presumably copyrightable subject matter.  Who holds the copyright is probably 
subject to the license agreement.

In the various open source licenses that apply to literary works (including 
software but certainly documentation), the conditions on derivative works 
apply transitively to translations of those licensed works as well.


 - Dennis E. Hamilton
   tools for document interoperability,  <http://nfoWorks.org/>
   dennis.hamilton@acm.org  gsm: +1-206-779-9430  @orcmid

PS:  When a different programming-language implementation of the same function 
achieved in another work fails to be a derivative work of the other work is a 
touchy subject and I'll not go there.  That's what judges are for and the 
prudent approach is to avoid ever having to come under such judgment.


-----Original Message-----
From: Gianluca Turconi [mailto:inbox@letturefantastiche.com]
Sent: Tuesday, November 08, 2011 01:26
To: ooo-dev@incubator.apache.org
Subject: Re: Hunspell dictionaries are not just words lists (+ other matters)

Andre Schnabel ha scritto:
> Yes, but this is very likely some gray area in copyright law. It would
> be a very long and likely not very friendly discussion, to see if
> translation is a creative process (producing an original work).
Just to give some free advice to common law citizens or other people
living in other countries, the Italian copyright law (law 633/41,
Section VI, article 64-bis, first paragraph, point b) *expressly
*includes software translations among the rights protected by law.

Regards,

Gianluca

-- 
Lettura gratuita o acquisto di libri e racconti di fantascienza,
fantasy, horror, noir, narrativa fantastica e tradizionale:
http://www.letturefantastiche.com/

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Gianluca Turconi <in...@letturefantastiche.com>.

Andre Schnabel ha scritto:
> Yes, but this is very likely some gray area in copyright law. It would
> be a very long and likely not very friendly discussion, to see if
> translation is a creative process (producing an original work).
Just to give some free advice to common law citizens or other people 
living in other countries, the Italian copyright law (law 633/41, 
Section VI, article 64-bis, first paragraph, point b) *expressly* 
includes software translations among the rights protected by law.

Regards,

Gianluca

-- 
Lettura gratuita o acquisto di libri e racconti di fantascienza,
fantasy, horror, noir, narrativa fantastica e tradizionale:
http://www.letturefantastiche.com/

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Andre Schnabel <An...@gmx.net>.

Hi Louis,

> Datum: Mon, 7 Nov 2011 11:28:43 -0500
> Von: "Louis Suárez-Potts" <lo...@apache.org>
> An: ooo-dev@incubator.apache.org
> Betreff: Re: Hunspell dictionaries are not just words lists (+ other matters)

> André,
> Do we have an account of the difficulties encountered by the
> localizers of OOo language packs and related data? As to Rob's point,
> I think a relevant issue is that in translations such as those
> required by localizations, the word chosen to translate the original
> is an interpretation, and its quality (value) depends on the skill of
> the localizer.

Yes, but this is very likely some gray area in copyright law. It would
be a very long and likely not very friendly discussion, to see if
translation is a creative process (producing an original work).

I would simply expect from communities such as ours that we all respect
the will of the people who did the work (so if translators want to
have a copyleft license on their work we should respect that ... if
they don't want .. fine as well). Not making any legal claims here.

But copyright imho applies to terminologies / glossaries. Many companies
(including Microsoft) claim copyright on their glossaries, and have
different licenses for their usage. But (unfortunately) this is not
relevant for OOo, as the glossaries are lost (afaik).

In case Rob has a copy of any glossary around, you may ask him to
publish it.

regards,

André

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Louis Suárez-Potts <lo...@apache.org>.

André,
Do we have an account of the difficulties encountered by the
localizers of OOo language packs and related data? As to Rob's point,
I think a relevant issue is that in translations such as those
required by localizations, the word chosen to translate the original
is an interpretation, and its quality (value) depends on the skill of
the localizer.

Or so I understand.

Louis

On 7 November 2011 11:20, Rob Weir <ro...@apache.org> wrote:
> On Mon, Nov 7, 2011 at 11:14 AM, Olivier R. <ol...@gmail.com> wrote:
>> Le 07/11/2011 16:53, Rob Weir a écrit :
>>
>>> Why would Apache care about that?
>>
>> Maybe just because you are an Apache member and you make a strong statement
>> on an Apache list about FLOSS you are willing to bundle in your software.
>> I’d prefer an official statement about this point, if you don’t mind.
>>
>
> I think it would be obvious to a cabbage that no one is going to
> recognize copyright claims on things that cannot be validly claimed
> under copyright law.  It is also clear that the determination of this
> for any specific artifact, like a specific spell checking dictionary
> would require detailed analysis.  Since Apache does not hand out free
> legal advice, I don't think you will get an official response to your
> hypothetical question.
>
> -Rob
>
>
>> Regards,
>> Olivier
>>
>

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

On Mon, Nov 7, 2011 at 11:14 AM, Olivier R. <ol...@gmail.com> wrote:
> Le 07/11/2011 16:53, Rob Weir a écrit :
>
>> Why would Apache care about that?
>
> Maybe just because you are an Apache member and you make a strong statement
> on an Apache list about FLOSS you are willing to bundle in your software.
> I’d prefer an official statement about this point, if you don’t mind.
>

I think it would be obvious to a cabbage that no one is going to
recognize copyright claims on things that cannot be validly claimed
under copyright law.  It is also clear that the determination of this
for any specific artifact, like a specific spell checking dictionary
would require detailed analysis.  Since Apache does not hand out free
legal advice, I don't think you will get an official response to your
hypothetical question.

-Rob

> Regards,
> Olivier
>

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

On Mon, Nov 7, 2011 at 12:23 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Olivier R. wrote on Mon, Nov 07, 2011 at 17:14:03 +0100:
>> Le 07/11/2011 16:53, Rob Weir a écrit :
>>
>> >Why would Apache care about that?
>>
>> Maybe just because you are an Apache member
>
> Rob Weir is not a Member.
>

Correct, I'm an AOOo committer and PPMC member.  I am not an "Apache
Member".  That is part of the secret code that Apache uses to confuse
outsiders.  It means I do not vote for the Apache Board.  But I'd be
extremely surprised if being anointed as an Apache Member made you
more informed about the status of sui generis intellectual property
protection for databases.

-Rob

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Olivier R. wrote on Mon, Nov 07, 2011 at 17:14:03 +0100:
> Le 07/11/2011 16:53, Rob Weir a écrit :
> 
> >Why would Apache care about that?
> 
> Maybe just because you are an Apache member

Rob Weir is not a Member.

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by "Olivier R." <ol...@gmail.com>.

Le 07/11/2011 16:53, Rob Weir a écrit :

> Why would Apache care about that?

Maybe just because you are an Apache member and you make a strong 
statement on an Apache list about FLOSS you are willing to bundle in 
your software.
I’d prefer an official statement about this point, if you don’t mind.

Regards,
Olivier

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

On Mon, Nov 7, 2011 at 9:33 AM, Olivier R. <ol...@gmail.com> wrote:
> Le 07/11/2011 14:13, Rob Weir a écrit :
>
>> I said something slightly different.  I said we should respect the
>> author's intentions/wishes, regardless of the copyright.  In other
>> words, even if they made something available under an Apache 2.0
>> license, if the author came to us and said, "I really don't want you
>> to use my library", then we should respect that.  Of course, this is
>> within reason.
>
> Thanks for clarifying.
>
>
> Now, I am eager to know if Apache Foundation also states that dictionaries
> authors cannot claim to have any copyright on their work.
>

Why would Apache care about that?  You can claim a copyright on the
alphabet if you want.  Enforcing your claim is another thing.

-Rob

>
> Regards,
> Olivier
>

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by "Olivier R." <ol...@gmail.com>.

Le 07/11/2011 14:13, Rob Weir a écrit :

> I said something slightly different.  I said we should respect the
> author's intentions/wishes, regardless of the copyright.  In other
> words, even if they made something available under an Apache 2.0
> license, if the author came to us and said, "I really don't want you
> to use my library", then we should respect that.  Of course, this is
> within reason.

Thanks for clarifying.


Now, I am eager to know if Apache Foundation also states that 
dictionaries authors cannot claim to have any copyright on their work.


Regards,
Olivier

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

2011/11/7 André Schnabel <an...@gmx.net>:
> Hi Rob,
>
> Am 07.11.2011 16:51, schrieb Rob Weir:
>>
>> On Mon, Nov 7, 2011 at 9:58 AM, Andre Schnabel<An...@gmx.net>
>>  wrote:
>>
>> The jurisdiction of the creator only matters in the case of local
>> infringement or in the context of international treaties.  And I don't
>> believe any treaties have recognized sui generis IP rights for
>> collections of facts, i.e., databases.  It has been discussed but
>> there is no agreement.  See the WIPO statement on this:
>>
>> http://www.wipo.int/copyright/en/activities/databases.html
>
> This is not a statement on IP rights for databases - it is a statement on IP
> rights for *Non-Original Databases*.
>
> We obviously disagree on this part of the text:
> "The originality requirement that a database must constitute an
> intellectual creation by reason of the selection or arrangement of its
> contents in order to enjoy copyright protection means that some databases
> are not protected ..."
>
> So obviously some databases actually are protected. Of course, if you
> think that a dictionary is just a mere collection of words, you would
> obviously come to the conclusion that this is no intellectual creation.
>
>

For example, an anthology of poetry, where an editor has selected and
ordered specific works.  The anthology can have a copyright.  Put the
anthology into a database and it can still be protected.  MS Encarta
CD encyclopedia, an example from years ago.

But I don't think a selection of "all French words in alphabetical
order, along with well-known, non-original facts about these words",
would receive that same protection.  And if it did, anyone could
extract the same facts and put them in a different arrangement with a
different selection, and use that.

That is the key difference.  With the anthology, the underlying poems
are still individually copyrighted.  But with a dictionary, I think
the underlying facts are not.  So the practical protection is far
less, since fair use would allow the extraction and reuse of the
underlying facts.

> btw ... if IBM does have dictionaries available, why don't you just publish
> those, if there is no copyright protection in place? Doing so would end this
> discussion very quickly and would be a great contribution to the project.
>

This was already discussed in another part of this thread.  Our
dictionaries are not in Hunspell format.  We use a different spell
checker altogether.  So to make use of these dictionaries would
require a format conversion.  I'm not sure how hard that would be.

-Rob

> regards,
>
> André
>

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by André Schnabel <an...@gmx.net>.

Hi Rob,

On 07.11.2011 16:51, Rob Weir wrote:
> On Mon, Nov 7, 2011 at 9:58 AM, Andre Schnabel<An...@gmx.net>  wrote:
>
> The jurisdiction of the creator only matters in the case of local
> infringement or in the context of international treaties.  And I don't
> believe any treaties have recognized sui generis IP rights for
> collections of facts, i.e., databases.  It has been discussed but
> there is no agreement.  See the WIPO statement on this:
>
> http://www.wipo.int/copyright/en/activities/databases.html

This is not a statement on IP rights for databases - it is a statement
on IP rights for *Non-Original Databases*.

We obviously disagree on this part of the text:
"The originality requirement that a database must constitute an
intellectual creation by reason of the selection or arrangement of its
contents in order to enjoy copyright protection means that some
databases are not protected ..."

So obviously some databases actually are protected. Of course, if you
think that a dictionary is just a mere collection of words, you would
obviously come to the conclusion that this is no intellectual creation.

btw ... if IBM does have dictionaries available, why don't you just
publish those, if there is no copyright protection in place? Doing so
would end this discussion very quickly and would be a great
contribution to the project.

regards,

André

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

On Mon, Nov 7, 2011 at 9:58 AM, Andre Schnabel <An...@gmx.net> wrote:
> Hi Rob,*
>
> -------- Original Message --------
>> Date: Mon, 7 Nov 2011 08:13:40 -0500
>> From: Rob Weir <ro...@apache.org>
>
>>
>> And I'm sure there is choice in how one structures a telephone
>> book as well. Alphabetical versus topical.  Font choices.  Consistent
>> ways of abbreviating given names. But the courts have held that the
>> facts expressed in a telephone directory are not copyrightable.
>
> I tried to not comment - but it is really a shame to see people being so
> ignorant.
>
> US copyright law is not the only law that is relevant here. Relevant is
> the local law of the creator of the dictionary - and actually there *are*
> countries where such work is protected. German law is at least one
> example for that. (This may sound stupid to you, but it is a simple
> fact.)
>

The jurisdiction of the creator only matters in the case of local
infringement or in the context of international treaties.  And I don't
believe any treaties have recognized sui generis IP rights for
collections of facts, i.e., databases.  It has been discussed but
there is no agreement.  See the WIPO statement on this:

http://www.wipo.int/copyright/en/activities/databases.html

-Rob

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Andre Schnabel <An...@gmx.net>.

Hi Rob,*

-------- Original Message --------
> Date: Mon, 7 Nov 2011 08:13:40 -0500
> From: Rob Weir <ro...@apache.org>

> 
> And I'm sure there is choice in how one structures a telephone
> book as well. Alphabetical versus topical.  Font choices.  Consistent
> ways of abbreviating given names. But the courts have held that the
> facts expressed in a telephone directory are not copyrightable.

I tried to not comment - but it is really a shame to see people being so
ignorant.

US copyright law is not the only law that is relevant here. Relevant is
the local law of the creator of the dictionary - and actually there *are*
countries where such work is protected. German law is at least one
example for that. (This may sound stupid to you, but it is a simple
fact.)

All this discussion is indeed futile, as long as you want to accept your
local views only.

regards,

André

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

On Mon, Nov 7, 2011 at 7:58 AM, Olivier R. <ol...@gmail.com> wrote:
> Hello Rob,
>
> On 07/11/2011 12:51, Rob Weir wrote:
>
>> What you say above does not really make a legal difference.  What
>> makes something copyright-able is creative expression, not hard work.
>> You could spend decades collecting data on bird populations, measuring
>> the positions of stars, recording the names of everyone in your town,
>> etc., and this could all be very hard work.  But in the end what you
>> have is just a set of facts.  It is not a creative work.
>
> OK, I understand this point of view, but there are different ways to do what
> I and others did. Creating an affixation file is not just collecting
> information. This is a way of structuring data and generating inflexions.
> It could have been done in a very different way.
> Even tagging, classification, lemmatization and distribution of entries can
> be a matter of choice, otherwise there would not have been people trying to
> convince me to do it differently. It’s not a mathematical matter there.
> Obviously, part of the work is collecting data, but there is also, mostly at
> first, a creation of the mind (even if I won’t pretend it’s a piece of art).

And I'm sure there is choice in how one structures a telephone
book as well. Alphabetical versus topical.  Font choices.  Consistent
ways of abbreviating given names. But the courts have held that the
facts expressed in a telephone directory are not copyrightable.

However, the actual page of the phone book -- the layout, fonts, etc.,
could be copyrighted.  But it would also be fair use to extract the
non creative "facts" contained on that page and then to include them
in another work.  Reverse engineering to extract the facts is fine.
That's my understanding, at least.

> But I won’t bother you for long on that topic, we are not in court trying to
> defend our respective points of view. I’m sure you get my point, even if you
> may not agree.
>

I think we understand each other.

> What I don’t understand is why you are saying you should respect the
> copyright of something that is not copyright-able according to you. It seems
> to me that’s defending a point of view and the opposite one. (I wonder if we
> should blame you or thank you.)
>

I said something slightly different.  I said we should respect the
author's intentions/wishes, regardless of the copyright.  In other
words, even if they made something available under an Apache 2.0
license, if the author came to us and said, "I really don't want you
to use my library", then we should respect that.  Of course, this is
within reason.  But if someone claims to have invented the numbers
1-10 and asks us not to use them, then we can ignore them and
recommend professional counseling.

-Rob

> Regards,
> Olivier
>

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by "Olivier R." <ol...@gmail.com>.

Hello Rob,

On 07/11/2011 12:51, Rob Weir wrote:

> What you say above does not really make a legal difference.  What
> makes something copyright-able is creative expression, not hard work.
> You could spend decades collecting data on bird populations, measuring
> the positions of stars, recording the names of everyone in your town,
> etc., and this could all be very hard work.  But in the end what you
> have is just a set of facts.  It is not a creative work.

OK, I understand this point of view, but there are different ways to do
what I and others did. Creating an affixation file is not just collecting
information. This is a way of structuring data and generating
inflexions. It could have been done in a very different way.
Even tagging, classification, lemmatization and distribution of entries
can be a matter of choice, otherwise there would not have been people
trying to convince me to do it differently. It’s not a mathematical
matter there.
Obviously, part of the work is collecting data, but there is also,
mostly at first, a creation of the mind (even if I won’t pretend it’s a
piece of art).

But I won’t bother you for long on that topic, we are not in court
trying to defend our respective points of view. I’m sure you get my
point, even if you may not agree.

What I don’t understand is why you are saying you should respect the 
copyright of something that is not copyright-able according to you. It 
seems to me that’s defending a point of view and the opposite one. (I 
wonder if we should blame you or thank you.)

Regards,
Olivier

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Rob Weir <ro...@apache.org>.

On Mon, Nov 7, 2011 at 6:05 AM, Olivier R. <ol...@gmail.com> wrote:
> Hello everyone,
>
> I don’t like mailing-lists, so I have subscribed here just to explain few
> things about dictionaries. Then I’ll vanish.
>
> Rob Weir wrote:
>>
>> Just make sure that you explain what a spell checking dictionary is.
>> Otherwise any legal types will be confused.  This is not a dictionary
>> like Webster's, with words and definitions, where the definitions are
>> creative content.  A spell checking dictionary is more of a word list.
>>  I'm not sure what the creative expression is in a list of all common
>> words in a language and how that could be copyrighted.  Of course, I
>> am not a lawyer.
>
> A few dictionaries are just word lists, but most of them are lists of words
> tagged with flags described in an affixation file, which specifies the
> rules to generate inflexions. This affixation file can be quite simple or
> very complex. And this can be a difficult matter.
>  It looks easy at first, but when you begin to get deeper into this matter,
> there are often a lot of issues to handle. Creating a proper affixation file
> can really be hard work. And even if the difficulty is not high, this can be
> a very long job.
>  So, no, Hunspell dictionaries are not just word lists.
>

What you say above does not really make a legal difference.  What
makes something copyright-able is creative expression, not hard work.
You could spend decades collecting data on bird populations, measuring
the positions of stars, recording the names of everyone in your town,
etc., and this could all be very hard work.  But in the end what you
have is just a set of facts.  It is not a creative work.

If you read the Wikipedia link I sent before, you can see how the
courts have rejected the "sweat of the brow" theory for copyright.

> For example, it took me one year and countless hours of work to rewrite the
> affixation file of the French dictionaries from scratch. Even after that,
> there were still a lot of bugs (not spelling mistakes). For one year, I had
> to patch the affixation file regularly. Even after a few years, there is still
> sometimes something to fix. The French dictionaries contain approximately
> 13000 rules.

But if you are just expressing the facts of the language, encoding the
well-known affix rules that already exist, then it may not be creative
expression.

I don't mean to diminish the effort.  Here is an analogy.  A
mathematician can work his entire life to prove a new theorem, but
cannot get a patent for that proof, because it is not a patentable
subject matter.  But someone else could come up with a trivial idea
and get a patent for it.  Effort and difficulty are not the primary
criteria for what defines intellectual property.  Sorry.

>  Here is an example of one of the most complex flags:
> http://www.dicollecte.org/affixes.php?prj=fr&flag=c2
>
> (AFAIK, there is only one dictionary which has a more complex affixation
> file, the Hungarian one.)
>
> I also tagged the affixation file in order to generate 4 different
> dictionaries with a script, to offer users the means to write according to
> their preferences regarding the optional and controversial French spelling
> reform of 1990.
>
> Besides, 99 % of entries have been manually grammatically tagged.
>  Several contributors did a tremendous job by adding lexical tags, adding
> many words, moving entries into different subdictionaries according to our
> policy, handling special cases, reporting mistakes and issues. Spelling
> matters are much more complex than you think, especially if you want to use
> your dictionary for grammar checking.
>  We often have to handle old, new or variant spellings just for one word, and
> there are decisions to make about what to do with special cases, which are
> actually very numerous. Managing dictionaries is not a trivial task.

No one is saying the effort was trivial.  It would also not be trivial
to catalog the positions of all the visible stars.  But that does not
make it a creative effort that can be protected by copyright.

>  Here is the "bugtracker" where we work on the French dictionaries.
> http://www.dicollecte.org/propositions.php?prj=fr&tab=E [fr]
>  (This bugtracker also allows us to commit in the dictionary in the
> database.)
>  The changelog:
> http://www.dicollecte.org/log.php?prj=fr
>
> This dictionary is used by both French grammar checkers.
>
> What you said about copyright could be right for a list generated by script
> from a corpus, but that’s not true for dictionaries that are conceived by
> humans with their knowledge, their work and their choices.
>

It depends on whether there is creative expression or not.  If it is
just fact collection and encoding with a quality checking process
behind it, then I'm not so sure.

>
>> But we'll never resolve this on legal grounds.  At Apache we would not
>> bundle a dictionary under a legal theory if the compiler of the
>> dictionary did not want us to.  I think we should respect the
>> dictionary compiler's wishes and intent,
>> _even if legally we're not obligated to_.
>
> Wow... That’s really not encouraging for people who may consider changing
> the license of their work... Does IBM think the same way?

Maybe it was unclear what I said here.  I said that even if we thought
a work was not copyright-able, we would still not distribute it if the
dictionary compiler did not want us to.

As for IBM, we have our own spell checking dictionaries, so this is
not an issue.

>  A few years ago, when I began to contribute to FLOSS, I thought the less
> restrictive licenses were the better ones, only because I didn’t care and I
> was ignorant about licensing and political matters.
>  As time goes by, I think more and more the opposite. And when I read you, I’m
> beginning to think I was still too soft on that topic.
>
>
>> 3) We could contact the compilers of the dictionary and ask if they
>> would make them available under a different license.   Generally
>> people make things available under an OSS license because they want to
>> see other projects use them.  If we tell them that a leading
>> application like OpenOffice can no longer use their dictionary, this
>> might persuade them to change their license.
>
> Here is the situation for the French dictionaries:
>
> 1. The Hunspell spelling dictionaries
>  Licenses: MPL/LGPL/GPL
>

That's fine.  If the French dictionaries are MPL we can use them.  Thanks!

<snip>

> 2. The thesaurus
>  The initial and main author released it under license LGPL.
>  Now he’s dead. AFAIK, there is no way to change the license before his work


I think you can make a better argument for a thesaurus being a creative work.

> is considered as public domain, but there have also been several
> improvements on the initial work.

Copyright lasts many years beyond an author's death.

>  At the moment, I am working on it to transform it as a list of "synsets"
> which could be used to generate a better thesaurus. A list of synsets would
> be a far better basis to work on. I don’t know if I will succeed. This is a
> difficult matter and it requires a lot of work.
>
>
> 3. Hyphenation rules
>  Licence LGPL.
>  This is a dictionary converted from the hyphenation rules for TeX,
> modified somehow to handle several issues.
>  I did nothing on it. I’m just packaging it in the extensions for
> OOo/LibO. You'll have to contact the people who created it.
>
>
>> 4) We could convert another word list or dictionary, one that has a
>> better license,  into Hunspell format.
>
> Hmmm...
>  You may generate affixation rules for Myspell with a script… but then,
> these dictionaries will probably be such a mess that you’ll be very lucky if
> you find someone with enough abnegation to improve it. The main issues of
> dictionaries are:
>  - if you just create a list of words, you may only improve it with text

<snip>

Yes, it would be difficult.  That is why I put it last on the list.
It is not our first preference.

-Rob

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Gianluca Turconi <in...@letturefantastiche.com>.

Olivier R. wrote:
> (I am not even sure that Myspell can use UTF-8 files.) 
AFAIK, Myspell doesn't support UTF-8. At least this is what Kevin 
Hendricks said several years ago:

http://markmail.org/message/wya3mihuqmmqjxle#query:+page:1+mid:jxvevm2ugemrjwpq+state:results

-- 
Free reading or purchase of science fiction, fantasy, horror, noir,
speculative and traditional fiction books and short stories:
http://www.letturefantastiche.com/

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Armin Le Grand <Ar...@me.com>.

	Hi Olivier,

On 07.11.2011 12:05, Olivier R. wrote:
> Hello everyone,
>
> I don’t like mailing-lists, so I have subscribed here just to explain
> few things about dictionaries. Then I’ll vanish.

I know what you mean :-) You may use/read (nearly) any public mailing 
list by using gmane.org and a newsreader (e.g. TB) without having to 
subscribe. No need to filter your inbox, etc. For following newsgroups 
this is sometimes preferable over subscription (IMHO)

[..]

Sincerely,
	Armin
--
ALG

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Mathias Bauer <Ma...@gmx.net>.

On 07.11.2011 12:37, Pedro Giffuni wrote:
> For the record,
>
> I respect that this type of work takes a *lot* of time and
> hard work, and that people do have the right to make their
> work copyleft.
>
> There is however, for practical purposes, a huge difference
> for us between MPL/LGPL (the French case) and GPL-only (the
> Italian case).

More precisely (as the useless discussion started in this thread
distracted from the real topics):

Apache OOo could include even Hunspell dictionaries under (L)GPL from a
legal perspective, as according to the FSF packaging dictionaries into
an application does not make this a derivative work and so the
application that packs the dictionary does not need to follow the same
license as the dictionary. This allowed us to use GPLed dictionaries in
the past in our LGPLed office application.

But from the Apache perspective we can only package dictionaries
released under compatible licenses, including MPL. And in the latter
case we can't provide the sources for them in our svn repository. This
is not enforced by the copyright of the dictionaries, but by the Apache
rules, as far as I understood. But at the end that doesn't make a big
difference in practice.

Regards,
Mathias

(Who thinks that it doesn't matter if the copyright of a dictionary can
be enforced in court or just applies because we respect the will of the
author that he has expressed by choosing a particular license)

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Gianluca Turconi <in...@letturefantastiche.com>.

Pedro Giffuni wrote:
> There is however, for practical purposes, a huge difference
> for us between MPL/LGPL (the French case) and GPL-only (the
> Italian case).
Well, the Italian dictionary in its long history has been distributed
under SSIL, LGPL+GPL, LGPL only, GPL only, and as you know, recently
under Apache license v. 2.0 too.

I think there are two different issues here:

a) for the authors: changing licenses, finding another legal way to
include the linguistic tools, or creating everything from scratch again;

b) for the users: having an open source office suite *product* with
functional linguistic tools.

Under a), I have done everything I know to facilitate the inclusion of 
what I've created and, be sure, I won't start another dictionary from 
scratch *again*. ;-)

Under b), since I'm a power OOo user too, I have to say that an office
suite without functional linguistic tools in my native language would
be simply like crap for me.

I know perfectly well that the Apache project has its own rules and
produces code according to those rules without looking at functionality.
That's OK for me.

Nevertheless, OOo users have to be productive in their daily jobs and
look *a lot* at functionality.

So, if the choice is between a full-Apache-compliant OOo product and
productivity, I'll always choose productivity.

I think this is an issue where compromise has to be pursued as the first
solution.

Regards,

Gianluca

-- 
Free reading or purchase of science fiction, fantasy, horror, noir,
speculative and traditional fiction books and short stories:
http://www.letturefantastiche.com/

Re: Hunspell dictionaries are not just words lists (+ other matters)

Posted by Pedro Giffuni <pf...@apache.org>.

For the record,

I respect that this type of work takes a *lot* of time and
hard work, and that people do have the right to make their
work copyleft.

There is however, for practical purposes, a huge difference
for us between MPL/LGPL (the French case) and GPL-only (the
Italian case).

Pedro.

--- On Mon, 11/7/11, Olivier R. <ol...@gmail.com> wrote:


> Hello everyone,
> 
> I don’t like mailing-lists, so I have subscribed here
> just to explain few things about dictionaries. Then I’ll
> vanish.
> 
> Rob Weir wrote:
> > Just make sure that you explain what a spell checking dictionary is.
> > Otherwise any legal types will be confused.  This is not a dictionary
> > like Webster's, with words and definitions, where the definitions are
> > creative content.  A spell checking dictionary is more of a word list.
> > I'm not sure what the creative expression is in a list of all common
> > words in a language and how that could be copyrighted.  Of course, I
> > am not a lawyer.
> 
> A few dictionaries are just word lists, but most of them are
> lists of words tagged with flags described in an affixation
> file, which specifies the rules to generate inflexions. This
> affixation file can be quite simple or very complex. And this
> can be a difficult matter.
>   It looks easy at first, but when you begin to get
> deeper into this matter, there are often a lot of issues to
> handle. Creating a proper affixation file can really be hard
> work. And even if the difficulty is not high, this can be a
> very long job.
>   So, no, Hunspell dictionaries are not just word lists.
> 
> For example, it took me one year and countless hours of
> work to rewrite the affixation file of the French
> dictionaries from scratch. Even after that, there were still
> a lot of bugs (not spelling mistakes). For one year, I had
> to patch the affixation file regularly. Even after a few
> years, there is still sometimes something to fix. The French
> dictionaries contain approximately 13000 rules.
>   Here is an example of one of the most complex flags:
> http://www.dicollecte.org/affixes.php?prj=fr&flag=c2
> 
> (AFAIK, there is only one dictionary which has a more
> complex affixation file, the Hungarian one.)
> 
> I also tagged the affixation file in order to generate 4
> different dictionaries with a script, to offer users the
> means to write according to their preferences regarding the
> optional and controversial French spelling reform of 1990.
> 
> Besides, 99 % of entries have been manually grammatically
> tagged.
>   Several contributors did a tremendous job by adding
> lexical tags, adding many words, moving entries into different
> subdictionaries according to our policy, handling special
> cases, reporting mistakes and issues. Spelling matters are
> much more complex than you think, especially if you want to
> use your dictionary for grammar checking.
>   We often have to handle old, new or variant spellings
> just for one word, and there are decisions to make about
> what to do with special cases, which are actually very
> numerous. Managing dictionaries is not a trivial task.
>   Here is the "bugtracker" where we work on the French
> dictionaries.
> http://www.dicollecte.org/propositions.php?prj=fr&tab=E
> [fr]
>   (This bugtracker also allows us to commit in the
> dictionary in the database.)
>   The changelog:
> http://www.dicollecte.org/log.php?prj=fr
> 
> This dictionary is used by both French grammar checkers.
> 
> What you said about copyright could be right for a list
> generated by script from a corpus, but that’s not true for
> dictionaries that are conceived by humans with their
> knowledge, their work and their choices.
> 
> 
> > But we'll never resolve this on legal grounds.  At Apache we would not
> > bundle a dictionary under a legal theory if the compiler of the
> > dictionary did not want us to.  I think we should respect the
> > dictionary compiler's wishes and intent,
> > _even if legally we're not obligated to_.
> 
> Wow... That’s really not encouraging for people who may
> consider changing the license of their work... Does IBM
> think the same way?
>   A few years ago, when I began to contribute to FLOSS,
> I thought the less restrictive licenses were the better
> ones, only because I didn’t care and I was ignorant about
> licensing and political matters.
>   As time goes by, I think more and more the opposite.
> And when I read you, I’m beginning to think I was still
> too soft on that topic.
> 
> 
> > 3) We could contact the compilers of the dictionary and ask if they
> > would make them available under a different license.  Generally
> > people make things available under an OSS license because they want to
> > see other projects use them.  If we tell them that a leading
> > application like OpenOffice can no longer use their dictionary, this
> > might persuade them to change their license.
> 
> Here is the situation for the French dictionaries:
> 
> 1. The Hunspell spelling dictionaries
>   Licenses: MPL/LGPL/GPL
> 
>   As I am the sole author of the affixation file, as I
> grammatically tagged myself about 90 % of all entries
> (without copying another lexicon with a script), I can say
> for sure that I do not intend to change the licenses for the
> Apache one.
>   When I built Dicollecte, my goal was to encourage
> people to contribute for all and give back the improvements
> they did. Switching to the Apache license would be a
> contradiction with everything I did.
> 
>   By the way, these dictionaries _require_ Hunspell.
> They won’t work properly with Myspell. I saw a lot of
> people think Hunspell dictionaries will work with Myspell.
> That’s a wrong assumption. Hunspell can use Myspell
> dictionaries, but Hunspell also offers a lot of new features
> which allow to improve the dictionaries structure.
>   And Myspell does not recognize double suffixation or
> double prefixation, cannot handle duplicate lemmas, does not
> handle morphological tags, has a limited amount of flags,
> does not recognize Hunspell compound commands, etc. (I am
> not even sure that Myspell can use UTF-8 files.)
> 
>   But, good for you, AFAIK, many dictionaries still
> have a Myspell structure. But not the French ones and some
> others.
> 
> 
> 2. The thesaurus
>   The initial and main author released it under
> license LGPL.
>   Now he’s dead. AFAIK, there is no way to change
> the license before his work is considered as public domain,
> but there have also been several improvements on the initial
> work.
>   At the moment, I am working on it to transform it as
> a list of "synsets" which could be used to generate a better
> thesaurus. A list of synsets would be a far better basis to
> work on. I don’t know if I will succeed. This is a
> difficult matter and it requires a lot of work.
> 
> 
> 3. Hyphenation rules
>   Licence LGPL.
>   This is a dictionary converted from the hyphenation
> rules for TeX, modified somehow to handle several issues.
>   I did nothing on it. I’m just packaging it in the
> extensions for OOo/LibO. You'll have to contact the people
> who created it.
> 
> 
> > 4) We could convert another word list or dictionary, one that has a
> > better license, into Hunspell format.
> 
> Hmmm...
>   You may generate affixation rules for Myspell with a
> script… but then, these dictionaries will probably be such
> a mess that you’ll be very lucky if you find someone with
> enough abnegation to improve it. The main issues of
> dictionaries are:
>   - if you just create a list of words, you may only
> improve it with text parser or other lexicons, but it will
> be hard and annoying to improve it manually, as the list
> will be very, very long, and it will be a memory waste. And
> each times you will regenerate it with your script, you’ll
> have to fix again manually what you did before.
>   - if you create an affixation file with script, your
> dictionary will be a mess, no easy way to improve it, as the
> dictionary structure will not be intuitive for a human
> being. And again, you cannot really mix improvements by
> scripting and improvements by human being.
>   The best way is to get somewhere a good lexicon
> already tagged with a non-restrictive license. Even then,
> you’ll have to write manually a proper affixation file…
> and then, you may discover it is not the easy task you may
> think it is, unless your language is somehow very logical,
> with neither exceptions, neither weird stuff…
> 
> 
> Regards,
> Olivier R.
>
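
For readers unfamiliar with the mechanism discussed in this thread (word
entries tagged with flags, plus a separate affixation file whose rules
generate the inflected forms), here is a minimal sketch in Python. It is
only an illustration: the rule lines are loosely modelled on Hunspell's
"SFX flag strip add condition" layout, the sample data (AFF, DIC) and the
helper names (parse_aff, expand, expand_all) are invented for this example,
and real affix files such as the French one support far more (prefixes,
double affixation, compounding, morphological tags).

```python
import re

# Toy affix rules: "SFX <flag> <strip> <add> <condition>".
# "0" in the strip column means "strip nothing"; the condition is a
# regex fragment that must match the end of the stem.
AFF = """\
SFX S 0 s [^sxz]
SFX S 0 es [sxz]
SFX D 0 ed [^e]
SFX D 0 d e
"""

# Toy dictionary: one stem per line, followed by "/" and its flags.
DIC = """\
cat/S
box/S
walk/SD
bake/SD
"""


def parse_aff(text):
    """Return {flag: [(strip, add, compiled_condition), ...]}."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5 or parts[0] != "SFX":
            continue  # ignore anything that is not a simple suffix rule
        _, flag, strip, add, cond = parts
        strip = "" if strip == "0" else strip
        rules.setdefault(flag, []).append((strip, add, re.compile(cond + "$")))
    return rules


def expand(entry, rules):
    """Expand one 'stem/FLAGS' entry into the sorted list of its forms."""
    stem, _, flags = entry.partition("/")
    forms = {stem}
    for flag in flags:  # single-character flags only, in this sketch
        for strip, add, cond in rules.get(flag, []):
            if cond.search(stem) and stem.endswith(strip):
                base = stem[:len(stem) - len(strip)] if strip else stem
                forms.add(base + add)
    return sorted(forms)


def expand_all(dic_text, aff_text):
    """Expand every entry of the toy dictionary into a flat word list."""
    rules = parse_aff(aff_text)
    words = set()
    for line in dic_text.splitlines():
        if line.strip():
            words.update(expand(line.strip(), rules))
    return sorted(words)


if __name__ == "__main__":
    for word in expand_all(DIC, AFF):
        print(word)
```

Even in this toy form, the choices of which flags exist, which conditions
they carry, and which stems receive which flags are made by whoever writes
the files, which is the kind of structuring work the message above
describes.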