Posted to dev@mahout.apache.org by Frank Scholten <fr...@frankscholten.nl> on 2012/02/01 08:51:04 UTC

Re: Extending mahout lucene.vector driver

Hi Michael,

Can you compare what you are doing with the code from the
testRun_query() unit test in LuceneIndexToSequenceFilesTest? The unit
test passes, so I am curious where the difference is.

Cheers,

Frank

On Fri, Jan 27, 2012 at 4:41 PM, Michael Kazekin
<Mi...@mediainsight.info> wrote:
> Frank, I tried this code with a Solr 3.5 index (and changed all the
> dependencies in the pom file), but it still doesn't work:
>
> Directory directory = FSDirectory.open(file);
> IndexReader reader = IndexReader.open(directory, true);
> IndexSearcher searcher = new IndexSearcher(reader);
>
> I try to get a Scorer with this TermQuery (the "lang" field is indexed
> and stored, and all data is available):
>
> TermQuery atomQuery = new TermQuery(new Term("lang", "ru"));
>
> Weight weight = atomQuery.createWeight(searcher);
> Scorer scorer = weight.scorer(reader, true, false);
>
> // scorer == null here
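[Editorial aside: a hedged sketch against the Lucene 3.x API, not code from the thread. Rather than creating the Weight and Scorer by hand, one can let IndexSearcher drive per-segment scoring through a Collector, which sidesteps the top-level-reader scorer problem discussed below. The class name and the index-path argument are invented for illustration.]

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class CollectMatchingDocs {
  public static void main(String[] args) throws IOException {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])), true);
    IndexSearcher searcher = new IndexSearcher(reader);

    TermQuery atomQuery = new TermQuery(new Term("lang", "ru"));

    // Collect the ids of all matching documents. The searcher creates
    // one scorer per index segment internally, so there is no manual
    // Weight/Scorer handling that could return null.
    final List<Integer> matching = new ArrayList<Integer>();
    searcher.search(atomQuery, new Collector() {
      private int docBase;
      @Override public void setScorer(Scorer scorer) { /* scores not needed */ }
      @Override public void collect(int doc) { matching.add(docBase + doc); }
      @Override public void setNextReader(IndexReader r, int base) { docBase = base; }
      @Override public boolean acceptsDocsOutOfOrder() { return true; }
    });

    System.out.println(matching.size() + " documents match");
    searcher.close();
    reader.close();
  }
}
```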
>
>
>
>
> On 01/25/2012 07:04 PM, Frank Scholten wrote:
>>
>> Are you using Lucene 3.4? I had this problem as well and I believe
>> this was because of https://issues.apache.org/jira/browse/LUCENE-3442
>> which is fixed in Lucene 3.5.
>>
>> On Wed, Jan 25, 2012 at 1:42 PM, Michael Kazekin
>> <Mi...@mediainsight.info>  wrote:
>>>
>>> Frank, I tried to use a BooleanQuery comprising several TermQueries
>>> (these represent key:value constraints, where the key is a field name,
>>> for example "lang:en"), but the Scorer created by the Weight in your
>>> code is null. Do you know what could be wrong here?
>>>
>>> Sorry to bother you on the dev list with such questions, but I am
>>> trying to make a CLI util for this code, so I think it would be
>>> helpful for everybody.
>>
>> Great! Let me know if you need more help.
>>
>> Cheers,
>>
>> Frank
>>
>>>
>>> On 01/20/2012 02:15 AM, Frank Scholten wrote:
>>>>
>>>> LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
>>>>
>>>> Configuration configuration = ... ;
>>>> IndexDirectory indexDirectory = ... ;
>>>> Path seqPath = ... ;
>>>> String idField = ... ;
>>>> String field = ... ;
>>>> List<String> extraFields = asList( ... );
>>>> Query query = ... ;
>>>>
>>>> LuceneIndexToSequenceFilesConfiguration lucene2SeqConf =
>>>>     new LuceneIndexToSequenceFilesConfiguration(configuration,
>>>>         indexDirectory.getFile(), seqPath, idField, field);
>>>> lucene2SeqConf.setExtraFields(extraFields);
>>>> lucene2SeqConf.setQuery(query);
>>>>
>>>> lucene2Seq.run(lucene2SeqConf);
>>>>
>>>> The seqPath variable can be passed into seq2sparse.
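[Editorial aside: for illustration only, the placeholders above might be filled in as below. The paths, field names, and extra fields are invented, and the lucene2seq classes come from the MAHOUT-944 patch, whose package is not shown in the thread, so those imports are omitted.]

```java
import static java.util.Arrays.asList;

import java.io.File;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class Lucene2SeqExample {
  public static void main(String[] args) throws Exception {
    Configuration configuration = new Configuration();
    File indexDir = new File("/path/to/solr/data/index");   // invented path
    Path seqPath = new Path("/tmp/lucene2seq-output");      // invented path
    String idField = "id";                                  // assumed unique-key field
    String field = "text";                                  // assumed content field
    List<String> extraFields = asList("title", "lang");     // invented extra fields
    Query query = new TermQuery(new Term("lang", "en"));    // constrain to a subset

    LuceneIndexToSequenceFilesConfiguration lucene2SeqConf =
        new LuceneIndexToSequenceFilesConfiguration(
            configuration, indexDir, seqPath, idField, field);
    lucene2SeqConf.setExtraFields(extraFields);
    lucene2SeqConf.setQuery(query);

    new LuceneIndexToSequenceFiles().run(lucene2SeqConf);
  }
}
```

A MatchAllDocsQuery can be substituted for the TermQuery to export every document, as noted later in the thread.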
>>>>
>>>> Cheers,
>>>>
>>>> Frank
>>>>
>>>> On Thu, Jan 19, 2012 at 2:03 PM, Michael Kazekin
>>>> <Mi...@mediainsight.info>    wrote:
>>>>>
>>>>> Frank, could you please tell me how to use your lucene2seq tool?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 01/18/2012 04:57 PM, Frank Scholten wrote:
>>>>>>
>>>>>> You can use a MatchAllDocsQuery if you want to fetch all documents.
>>>>>>
>>>>>> On Wed, Jan 18, 2012 at 10:36 AM, Michael Kazekin
>>>>>> <Mi...@mediainsight.info>      wrote:
>>>>>>>
>>>>>>> Thank you, Frank! I'll definitely have a look on it.
>>>>>>>
>>>>>>> As far as I can see, the problem with using Lucene for clustering
>>>>>>> tasks is that even with queries you only get access to the
>>>>>>> "tip-of-the-iceberg" results, while clustering tasks need to deal
>>>>>>> with the results as a whole.
>>>>>>>
>>>>>>>
>>>>>>> On 01/17/2012 09:56 PM, Frank Scholten wrote:
>>>>>>>>
>>>>>>>> Hi Michael,
>>>>>>>>
>>>>>>>> Check out https://issues.apache.org/jira/browse/MAHOUT-944
>>>>>>>>
>>>>>>>>
>>>>>>>> This is the lucene2seq tool. You can pass in fields and a Lucene
>>>>>>>> query, and it generates text sequence files.
>>>>>>>>
>>>>>>>> From there you can use seq2sparse.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Frank
>>>>>>>>
>>>>>>>> Sorry for brevity, sent from phone
>>>>>>>>
>>>>>>>> On Jan 17, 2012, at 17:37, Michael Kazekin
>>>>>>>> <Mi...@mediainsight.info> wrote:
>>>>>>>>
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> I am trying to extend the "mahout lucene.vector" driver so that
>>>>>>>>> it can be fed arbitrary key-value constraints on Solr schema
>>>>>>>>> fields (and generate Mahout vectors for only a subset, which
>>>>>>>>> seems to be a regular use case).
>>>>>>>>>
>>>>>>>>> The best (easiest) way I see is to create an IndexReader
>>>>>>>>> implementation that would allow reading only that subset.
>>>>>>>>>
>>>>>>>>> The problem is that I don't know the correct way to do this.
>>>>>>>>>
>>>>>>>>> Maybe subclassing FilterIndexReader would solve the problem, but
>>>>>>>>> I don't know which methods to override to get a consistent object
>>>>>>>>> representation.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The driver code includes the following:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> IndexReader reader = IndexReader.open(dir, true);
>>>>>>>>>
>>>>>>>>> Weight weight;
>>>>>>>>> if ("tf".equalsIgnoreCase(weightType)) {
>>>>>>>>>   weight = new TF();
>>>>>>>>> } else if ("tfidf".equalsIgnoreCase(weightType)) {
>>>>>>>>>   weight = new TFIDF();
>>>>>>>>> } else {
>>>>>>>>>   throw new IllegalArgumentException("Weight type " + weightType
>>>>>>>>>       + " is not supported");
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> TermInfo termInfo = new CachedTermInfo(reader, field, minDf,
>>>>>>>>>     maxDFPercent);
>>>>>>>>> VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>>>>>>>>
>>>>>>>>> LuceneIterable iterable;
>>>>>>>>> if (norm == LuceneIterable.NO_NORMALIZING) {
>>>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>>>>       LuceneIterable.NO_NORMALIZING, maxPercentErrorDocs);
>>>>>>>>> } else {
>>>>>>>>>   iterable = new LuceneIterable(reader, idField, field, mapper,
>>>>>>>>>       norm, maxPercentErrorDocs);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It then creates a SequenceFile.Writer and writes out the
>>>>>>>>> "iterable" variable.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Do you have any thoughts on the simplest way to inject this code?
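[Editorial aside: one possible answer, sketched against the Lucene 3.x API. This is an illustration, not code from the thread; the class name is invented. It collects the documents matching a constraint query and then reports every other document as "deleted", so the wrapped reader can be passed to the driver code above in place of the plain reader. Whether this is sufficient depends on how LuceneIterable walks the reader, and note that document-frequency statistics (CachedTermInfo) would still be computed over the whole index.]

```java
import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;

/**
 * Hypothetical sketch: expose only the documents matching a constraint
 * query by reporting all other documents as deleted.
 */
public class SubsetIndexReader extends FilterIndexReader {

  private final BitSet matching;

  public SubsetIndexReader(IndexReader in, Query constraint) throws IOException {
    super(in);
    matching = new BitSet(in.maxDoc());
    IndexSearcher searcher = new IndexSearcher(in);
    // Record the id of every document that satisfies the constraint.
    searcher.search(constraint, new Collector() {
      private int docBase;
      @Override public void setScorer(Scorer scorer) { /* scores not needed */ }
      @Override public void collect(int doc) { matching.set(docBase + doc); }
      @Override public void setNextReader(IndexReader r, int base) { docBase = base; }
      @Override public boolean acceptsDocsOutOfOrder() { return true; }
    });
    searcher.close();
  }

  @Override public boolean isDeleted(int n) { return !matching.get(n) || in.isDeleted(n); }
  @Override public boolean hasDeletions() { return true; }
  @Override public int numDocs() { return matching.cardinality(); }
}
```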
>>>>>>>>>
>>>
>>
>

Re: Extending mahout lucene.vector driver

Posted by Michael Kazekin <Mi...@mediainsight.info>.
Hi, Frank!

Sorry for being silent. I'll try to run the unit tests as soon as I have
some free time (for now we use a workaround for this problem in our
solution).

I saw that you started the CLI program. Are you going to let the CLI
user constrain just the "columns" (fields), or also the "rows" (values)
in the index?

On 02/01/2012 11:51 AM, Frank Scholten wrote:
> Hi Michael,
>
> Can you compare what you are doing with the code from the
> testRun_query() unit test in LuceneIndexToSequenceFilesTest? The unit
> test works, so I am curious where there is a difference.
>
> Cheers,
>
> Frank