Posted to user@mahout.apache.org by Karl Wettin <ka...@gmail.com> on 2008/06/05 21:09:37 UTC

Re: Clustering Demo

Any more thoughts on this subject? I'll start coding this Tuesday.

           karl

On 24 May 2008, at 18:10, Karl Wettin wrote:
>
> On 24 May 2008, at 13:13, Grant Ingersoll wrote:
>> These are interesting. Perhaps you want to commit LUCENE-725?
>
> If I end up using it for this, then I will. I have never tried it out and
> there are no test cases, so I have no clue how well it works. Nor
> are there any demonstrations of the features in the patch, but I
> suppose our demo could be used to produce that.
>
> I'll train it with the last few paragraphs on a per-author basis to
> see how well it works.
>
>
> We might want to wash out stuff like "24 maj 2008 kl. 13.13 skrev
> Grant Ingersoll" too. That should not be too hard to figure out using
> the headers if the data is stored in a way that allows for
> navigation in the thread.
>
>
> But I'm honestly not sure whether this is preemptive overkill.
> Perhaps algorithms automatically penalise unrelated text when given
> enough semiotic data. Perhaps attribute selection does the same job
> in a shorter time.
>
>> I was wondering whether we should consider asking Lucene to put up
>> an Analyzer-only jar (i.e. a separate jar that combines the
>> Analyzer/TokenStream definitions with the contrib Analyzers
>> package). Of course, we may have uses for the rest of Lucene as
>> well, so maybe not.
>
>
> To me that just sounds like more work for both projects.
>
> It'd be great if we managed to put all future text analysis
> improvements as patches in Lucene rather than Mahout, but in the
> long run I think we'll be branching quite a bit of the Lucene
> analysis code to avoid spending time writing backwards-compatible
> code to support Lucene users rather than Mahout users. See LUCENE-889.
>
>
>     karl

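As an illustration of the "wash out" idea in the message above, here is a minimal sketch that drops reply-attribution lines from a message body before it is fed to a clusterer. It uses plain pattern matching on the body rather than the header-based approach Karl mentions, and the two patterns are assumptions modelled only on the attribution lines visible in this thread. The names `ATTRIBUTION_PATTERNS`, `QUOTE_PREFIX`, and `strip_attributions` are hypothetical and not part of Mahout or Lucene.

```python
import re

# Hypothetical patterns modelled on attribution lines in this thread; not exhaustive.
ATTRIBUTION_PATTERNS = [
    # Swedish mail client: "24 maj 2008 kl. 13.13 skrev Grant Ingersoll:"
    re.compile(r"^\d{1,2} \w+ \d{4} kl\. \d{1,2}\.\d{2} skrev .+:$"),
    # English mail clients: "On Jun 5, 2008, at 3:09 PM, Karl Wettin wrote:"
    re.compile(r"^On .+ wrote:$"),
]

# One or more leading quote markers such as ">", "> >" or ">>".
QUOTE_PREFIX = re.compile(r"^(?:>\s?)+")


def strip_attributions(body: str) -> str:
    """Return the body without lines that look like reply attributions."""
    kept = []
    for line in body.splitlines():
        # Ignore quote markers so that quoted attributions are matched too.
        text = QUOTE_PREFIX.sub("", line).strip()
        if any(pattern.match(text) for pattern in ATTRIBUTION_PATTERNS):
            continue
        kept.append(line)
    return "\n".join(kept)
```

This sketch only handles attributions that fit on one line. An attribution wrapped across two lines would slip through, and an ordinary sentence of the form "On ... wrote:" would be removed by mistake, which is one reason the header-based approach may be more robust.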

Re: Clustering Demo

Posted by Isabel Drost <ap...@isabel-drost.de>.
On Thursday 05 June 2008, Karl Wettin wrote:
> Any more thoughts on this subject? I'll start coding this Tuesday.

+1 from me as well.

Isabel

-- 
Most people feel that everyone is entitled to their opinion.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Clustering Demo

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 5, 2008, at 3:09 PM, Karl Wettin wrote:

> Any more thoughts on this subject? I'll start coding this Tuesday.

+1.  Much easier to have thoughts on a patch.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com