Posted to user@mahout.apache.org by Karl Wettin <ka...@gmail.com> on 2008/06/05 21:09:37 UTC
Re: Clustering Demo
Any more thoughts on this subject? I'll start coding this Tuesday.
karl
24 maj 2008 kl. 18.10 skrev Karl Wettin:
>
> 24 maj 2008 kl. 13.13 skrev Grant Ingersoll:
>> These are interesting. Perhaps you want to commit LUCENE-725?
>
> If I end up using it for this, then I will. I have never tried it out
> and there are no test cases, so I have no idea how well it works. Nor
> are there any demonstrations of the features in the patch, but I
> suppose our demo could be used to produce that.
>
> I'll train it with the last few paragraphs on a per-author basis to
> see how well it works.
>
>
> We might want to wash out stuff like "24 maj 2008 kl. 13.13 skrev
> Grant Ingersoll" too. That should not be too hard to figure out using
> the headers if the data is stored in a way that allows for
> navigation in the thread.
>
>
> But I'm honestly not sure whether this is preemptive overkill.
> Perhaps the algorithms automatically penalise unrelated text when
> given enough semiotic data. Perhaps attribute selection does the same
> job in a shorter time.
>
>> I was wondering whether we should consider asking Lucene to put up
>> an Analyzer-only jar (i.e. a separate jar that combines the
>> Analyzer/TokenStream definitions with the contrib Analyzers
>> package). Of course, we may have uses for the rest of Lucene as
>> well, so maybe not.
>
>
> To me that just sounds like more work for both projects.
>
> It'd be great if we managed to put all future text analysis
> improvements as patches in Lucene rather than Mahout, but in the
> long run I think we'll be branching quite a bit of the Lucene
> analysis code to avoid spending time writing backwards-compatible
> code to support Lucene users rather than Mahout users. See LUCENE-889.
>
>
> karl
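The "washing out" idea in the message above could be sketched roughly as follows. This is only an illustration, not anything from LUCENE-725 or the planned demo: the function name `strip_quotes_and_attributions` and the `thread_authors` list are hypothetical, and it assumes the author names have already been collected from the headers of the messages in the thread.

```python
def strip_quotes_and_attributions(body, thread_authors):
    """Drop quoted lines and attribution lines from one message body.

    thread_authors is assumed to hold the display names taken from the
    headers of the other messages in the same thread (hypothetical input).
    """
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        # Lines starting with '>' are text quoted from an earlier message.
        if stripped.startswith(">"):
            continue
        # Treat a line as an attribution if it ends with ':' and names a
        # known thread author, e.g. "24 maj 2008 kl. 18.10 skrev Karl Wettin:".
        if stripped.endswith(":") and any(a in stripped for a in thread_authors):
            continue
        kept.append(line)
    # Trim the blank lines left behind at the start and end.
    return "\n".join(kept).strip()


message = (
    "Any more thoughts on this subject?\n"
    "\n"
    "karl\n"
    "\n"
    "24 maj 2008 kl. 18.10 skrev Karl Wettin:\n"
    "> If I end up using it for this, then I will.\n"
)
print(strip_quotes_and_attributions(message, ["Karl Wettin", "Grant Ingersoll"]))
```

Matching on author names rather than on the wording of the attribution keeps the check independent of the mail client's language, which is the reason for leaning on the headers. It would still miss attributions that wrap across two lines or that only give an e-mail address.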
Re: Clustering Demo
Posted by Isabel Drost <ap...@isabel-drost.de>.
On Thursday 05 June 2008, Karl Wettin wrote:
> Any more thoughts on this subject? I'll start coding this Tuesday.
+1 from me as well.
Isabel
--
Most people feel that everyone is entitled to their opinion.
Web: <http://www.isabel-drost.de>
IM: <xm...@spaceboyz.net>
Re: Clustering Demo
Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 5, 2008, at 3:09 PM, Karl Wettin wrote:
> Any more thoughts on this subject? I'll start coding this Tuesday.
+1. Much easier to have thoughts on a patch.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com