
Posted to user@mahout.apache.org by Vasil Vasilev <va...@gmail.com> on 2011/03/31 18:36:38 UTC

2 bugs in seq2sparse

Hi all,

I was recently experimenting with seq2sparse and found 2 problems with it:

1. The minLLR parameter is not taken into account. The problem is that in
the CollocDriver class
Job job = new Job(conf);

is executed before

conf.setFloat(LLRReducer.MIN_LLR, minLLRValue);

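As far as I can tell, Job takes a copy of the Configuration when it is
constructed, so the value set afterwards never reaches the job. See the
CollocDriver.computeNGramsPruneByLLR method.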
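
The ordering problem can be illustrated without Hadoop. The sketch below uses hypothetical stand-in classes (FakeConf, FakeJob) that only mimic the relevant behavior, namely that the job takes a copy of the configuration when it is constructed. They are not Hadoop's Configuration or Job, and the key name and default value are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class ConfOrderingDemo {
  static final String MIN_LLR = "minLLR";     // illustrative key, not Mahout's constant
  static final float DEFAULT_MIN_LLR = 1.0f;  // illustrative default

  // Hypothetical stand-in for a configuration object: a simple key/value map.
  static final class FakeConf {
    private final Map<String, String> values = new HashMap<>();

    FakeConf() {}

    // Copy constructor: later changes to 'other' do not affect this instance.
    FakeConf(FakeConf other) {
      values.putAll(other.values);
    }

    void setFloat(String key, float value) {
      values.put(key, Float.toString(value));
    }

    float getFloat(String key, float defaultValue) {
      String v = values.get(key);
      return v == null ? defaultValue : Float.parseFloat(v);
    }
  }

  // Hypothetical stand-in for a job that snapshots its configuration.
  static final class FakeJob {
    private final FakeConf conf;

    FakeJob(FakeConf conf) {
      this.conf = new FakeConf(conf);
    }

    FakeConf getConfiguration() {
      return conf;
    }
  }

  // Problematic order: the job is created first, so it never sees the setting.
  static float minLlrSeenWhenSetAfter(float minLlr) {
    FakeConf conf = new FakeConf();
    FakeJob job = new FakeJob(conf);
    conf.setFloat(MIN_LLR, minLlr);
    return job.getConfiguration().getFloat(MIN_LLR, DEFAULT_MIN_LLR);
  }

  // Corrected order: the value is set before the job copies the configuration.
  static float minLlrSeenWhenSetBefore(float minLlr) {
    FakeConf conf = new FakeConf();
    conf.setFloat(MIN_LLR, minLlr);
    FakeJob job = new FakeJob(conf);
    return job.getConfiguration().getFloat(MIN_LLR, DEFAULT_MIN_LLR);
  }

  public static void main(String[] args) {
    System.out.println("set after job creation:  " + minLlrSeenWhenSetAfter(50.0f));
    System.out.println("set before job creation: " + minLlrSeenWhenSetBefore(50.0f));
  }
}
```

In the first ordering the job falls back to the default, which matches the symptom of minLLR being ignored. Moving the set call above the job construction, or setting the value on the job's own configuration, would be two possible ways to address it.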

2. maxDFPercent is not taken into account. The problem is that in
TFIDFPartialVectorReducer.reduce the check is

if (df / vectorCount > maxDfPercent) {
  if (log.isInfoEnabled()) {
    log.info("ommiting {}", e.index());
  }
  continue;
}

Since maxDfPercent is a percentage while df / vectorCount is a fraction, the
check should be:

if (df * 100 / vectorCount > maxDfPercent) {
  if (log.isInfoEnabled()) {
    log.info("ommiting {}", e.index());
  }
  continue;
}
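
To see why the first form of the comparison never fires, here is a small self-contained sketch. It assumes that the counts and the threshold are integer-typed and that maxDfPercent is given on a 0-100 scale. The class and method names are hypothetical and not part of Mahout, and the floating-point variant is an optional extra rather than part of the proposed change.

```java
final class DfPercentCheck {
  // Comparison without scaling: df / vectorCount is at most 1 (and with
  // integer division it is 0 whenever df < vectorCount), so it cannot
  // exceed a threshold expressed on a 0-100 scale.
  static boolean exceedsUnscaled(long df, long vectorCount, long maxDfPercent) {
    return df / vectorCount > maxDfPercent;
  }

  // Comparison with scaling: convert to a percentage before dividing.
  static boolean exceedsScaled(long df, long vectorCount, long maxDfPercent) {
    return df * 100 / vectorCount > maxDfPercent;
  }

  // Optional variant: floating-point division avoids truncating values
  // such as 50.5% down to 50%.
  static boolean exceedsExact(long df, long vectorCount, long maxDfPercent) {
    return df * 100.0 / vectorCount > maxDfPercent;
  }

  public static void main(String[] args) {
    long vectorCount = 100;
    long maxDfPercent = 50;
    for (long df : new long[] {40, 60}) {
      System.out.println("df=" + df
          + " unscaled=" + exceedsUnscaled(df, vectorCount, maxDfPercent)
          + " scaled=" + exceedsScaled(df, vectorCount, maxDfPercent));
    }
  }
}
```

With 100 documents and a threshold of 50, a term found in 60 documents is kept by the unscaled check and dropped by the scaled one.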

Shall I file Jiras for these issues? I can also supply a patch.

Regards, Vasil

Re: 2 bugs in seq2sparse

Posted by Sean Owen <sr...@gmail.com>.

It's so small, I'll just file the JIRA and resolve it.

On Thu, Mar 31, 2011 at 6:36 PM, Ted Dunning <te...@gmail.com> wrote:

> Please do.
>
> Can you build tests that demonstrate the problem as part of your patches?
>
> On Thu, Mar 31, 2011 at 9:36 AM, Vasil Vasilev <va...@gmail.com>
> wrote:
>
> > Shall I file Jiras for these issues? I can also supply a patch.
> >
>

Re: 2 bugs in seq2sparse

Posted by Ted Dunning <te...@gmail.com>.

Please do.

Can you build tests that demonstrate the problem as part of your patches?

On Thu, Mar 31, 2011 at 9:36 AM, Vasil Vasilev <va...@gmail.com> wrote:

> Shall I file Jiras for these issues? I can also supply a patch.
>