Posted to java-user@lucene.apache.org by parag dave <ph...@gmail.com> on 2009/10/21 08:05:20 UTC

Parsing Error while indexing in Lucene WordNet package

While using the Lucene WordNet package, we found that the Syns2Index program
indexes some synsets incorrectly. For example, looking up the synsets for the
word "king", we get:

java SynLookup wnindex king
baron
magnate
mogul
power
queen
rex
scrofula
struma
tycoon

Here, "scrofula" and "struma" are extraneous. This happens because the line
parser in Syns2Index.java interprets the two consecutive single quotes in the
entry s(114144247,3,'king''s evil',n,1,1) in the wn_s.pl file as the end of
the string, and so extracts the word "king". This entry actually belongs to
the synset of "scrofula" and "struma", so those words get inserted into the
synset of "king". *There are 1382 such entries in wn_s.pl*, and more in the
other WordNet Prolog database files, where two consecutive single quotes
appear.
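
The failure can be reproduced in isolation. The following is a minimal
sketch, not the actual Syns2Index code: the class name NaiveQuoteDemo and the
method naiveWord are hypothetical, and the logic assumes the word is cut at
the first single quote after the opening one, which is what the behavior
described above implies.

```java
// Hypothetical demo class. The extraction logic is an assumption that
// reproduces the reported behavior; it is not copied from Syns2Index.java.
class NaiveQuoteDemo {

    /** Returns the text between the opening quote and the very next quote. */
    static String naiveWord(String line) {
        int open = line.indexOf('\'');
        int close = line.indexOf('\'', open + 1); // stops at the first quote of ''
        return line.substring(open + 1, close);
    }

    public static void main(String[] args) {
        String entry = "s(114144247,3,'king''s evil',n,1,1)";
        // The doubled quote is mistaken for the closing quote.
        System.out.println(naiveWord(entry)); // prints: king
    }
}
```

With this kind of extraction, synset 114144247 is recorded under "king"
instead of "king's evil", which is how "scrofula" and "struma" end up in the
lookup result.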

We have resolved this by adding a statement in the line parsing portion of
Syns2Index.java, as follows:

            // parse line
            line = line.substring(2);
            line = line.replaceAll("\'\'", "`"); // added statement
            int comma = line.indexOf(',');
            String num = line.substring(0, comma);
            // ... etc.

In short, we replace "''" with "`" (a backquote). After recreating the
index, we get:

java SynLookup zwnindex king
baron
magnate
mogul
power
queen
rex
tycoon
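
In Prolog, a doubled single quote inside a quoted atom stands for one literal
quote, so 'king''s evil' denotes king's evil. An alternative to substituting
a backquote would be to read the atom with that rule, which keeps the
apostrophe in the word. The sketch below only illustrates that idea:
PrologAtomDemo and quotedAtom are hypothetical names, and this is neither the
change made to Syns2Index.java above nor necessarily how Lucene addresses the
issue.

```java
// Hypothetical demo class: reads the first quoted atom on a line and
// unescapes doubled single quotes according to Prolog's quoting rule.
class PrologAtomDemo {

    static String quotedAtom(String line) {
        int start = line.indexOf('\'');
        if (start < 0) {
            throw new IllegalArgumentException("no quoted atom: " + line);
        }
        StringBuilder word = new StringBuilder();
        for (int i = start + 1; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c != '\'') {
                word.append(c);
            } else if (i + 1 < line.length() && line.charAt(i + 1) == '\'') {
                word.append('\''); // '' inside the atom is one literal quote
                i++;               // skip the second quote of the pair
            } else {
                return word.toString(); // a lone quote closes the atom
            }
        }
        throw new IllegalArgumentException("unterminated quoted atom: " + line);
    }

    public static void main(String[] args) {
        String entry = "s(114144247,3,'king''s evil',n,1,1)";
        System.out.println(quotedAtom(entry)); // prints: king's evil
    }
}
```

Either approach stops the entry from being read as "king". The difference is
that the backquote substitution yields the word "king`s evil", while this
sketch yields "king's evil".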

*Lucene 2.9.0 was released recently, but the WordNet package included in it
still has the same problem.*

-- Parag H. Dave

Re: Parsing Error while indexing in Lucene WordNet package

Posted by Robert Muir <rc...@gmail.com>.
Hi, thanks again for reporting this.

I created an issue here: http://issues.apache.org/jira/browse/LUCENE-2001

-- 
Robert Muir
rcmuir@gmail.com

Re: Parsing Error while indexing in Lucene WordNet package

Posted by Robert Muir <rc...@gmail.com>.
Thanks, this sounds like a bug. I'll play with this today.

-- 
Robert Muir
rcmuir@gmail.com