Posted to dev@mahout.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2008/03/06 14:00:58 UTC

[jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).

     [ https://issues.apache.org/jira/browse/MAHOUT-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss reassigned MAHOUT-12:
---------------------------------

    Assignee: Dawid Weiss

> Point formatting and parsing improved (StringBuilder, no need for trailing comma).
> ----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-12
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-12
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Trivial
>         Attachments: mah-12.patch
>
>
> Added test case to point class, improved parsing (no need to recompile the pattern all over again) and concatenation of points (stringbuilder used internally).
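
For illustration, the two changes named in this description could look roughly like the sketch below. This is not the content of mah-12.patch: the class name PointFormat, its methods, and the exact "[v1, v2]" text format are assumptions made for the example. The separator pattern is compiled once into a static field instead of on every parse call, and formatting appends to a single StringBuilder.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical helper; names and text format are illustrative only. */
final class PointFormat {
    // Compiled once and reused by every parse() call.
    private static final Pattern SEPARATOR = Pattern.compile("\\s*,\\s*");

    private PointFormat() {}

    /** Formats values as "[v1, v2, ...]" without a trailing comma. */
    static String format(double[] point) {
        StringBuilder out = new StringBuilder("[");
        for (int i = 0; i < point.length; i++) {
            if (i > 0) {
                out.append(", ");
            }
            out.append(point[i]);
        }
        return out.append(']').toString();
    }

    /** Parses "[v1, v2, ...]"; empty tokens from stray commas are skipped. */
    static double[] parse(String text) {
        String body = text.trim();
        if (!body.startsWith("[") || !body.endsWith("]")) {
            throw new IllegalArgumentException("Expected [..]: " + text);
        }
        body = body.substring(1, body.length() - 1).trim();
        if (body.isEmpty()) {
            return new double[0];
        }
        String[] tokens = SEPARATOR.split(body);
        double[] values = new double[tokens.length];
        int count = 0;
        for (String token : tokens) {
            if (!token.isEmpty()) {
                values[count++] = Double.parseDouble(token);
            }
        }
        return Arrays.copyOf(values, count);
    }
}
```

Skipping empty tokens means input written in the older style, with a comma before the closing bracket, still parses.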

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> I concur that we ought to have additional Writable representations to
> make intra-Hadoop transfers more streamlined. This is certainly *not*
> too late to pursue. I would encourage you to propose a record for Point
> (which is in trunk) and these could be added to Vector and Matrix later
> (once we get past the diff-diffing stage).

I am going to go through the Mahout issues tomorrow -- let's get the minor things
out of the way first, then we can focus on further refactorings. I think it makes a
lot of sense to have custom Writables (and a way to pass user subclasses to our
jobs, so that these types can be extended further). The way I imagine this could look
is something like this (pseudo code, of course):

KMeansJob job = new KMeansJob();
job.setInputKeyClass(? extends MahoutDataType);
job.setInputValueClass(? extends OtherMahoutDataType);
...

Note that this way the user can pass arbitrary records that subclass 
Mahout-defined classes and the job can still freely manipulate them. I think 
this would be pretty neat.
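
A rough Java sketch of that idea follows. Everything in it is hypothetical: MahoutDataType is only the placeholder name from the pseudo code, and LabeledPoint, newInputValue and the single value-class setter are invented for the example (a key-class setter would look the same). The bounded type Class<? extends MahoutDataType> is what lets the job accept any user subclass while still treating instances as the Mahout-defined type.

```java
/** Hypothetical base type standing in for the pseudo code's MahoutDataType. */
abstract class MahoutDataType {
    abstract double[] coordinates();
}

/** A user-defined record: the Mahout type plus a payload the job ignores. */
final class LabeledPoint extends MahoutDataType {
    double[] values = new double[0];
    String label = "";

    @Override
    double[] coordinates() {
        return values;
    }
}

/** Hypothetical job; only the class-passing part is sketched here. */
final class KMeansJob {
    private Class<? extends MahoutDataType> inputValueClass;

    /** Accepts MahoutDataType itself (if concrete) or any subclass of it. */
    void setInputValueClass(Class<? extends MahoutDataType> valueClass) {
        this.inputValueClass = valueClass;
    }

    /** Creates an empty record of the user's type via its no-arg constructor. */
    MahoutDataType newInputValue() {
        try {
            return inputValueClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + inputValueClass, e);
        }
    }
}
```

With this shape, job.setInputValueClass(LabeledPoint.class) compiles, while passing String.class is rejected by the compiler.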

D.

RE: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).

Posted by Jeff Eastman <je...@collab.net>.
Ted noted an easy fix to my Excel use case that I wasn't aware of, so my
point is agreeably moot. 

I concur that we ought to have additional Writable representations to
make intra-Hadoop transfers more streamlined. This is certainly *not*
too late to pursue. I would encourage you to propose a record for Point
(which is in trunk) and these could be added to Vector and Matrix later
(once we get past the diff-diffing stage).
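
As one possible shape for such a representation, the sketch below serializes a point as a length followed by raw doubles. The two-method Writable interface declared here is a local stand-in that mirrors the write/readFields contract of Hadoop's org.apache.hadoop.io.Writable, so that the example compiles without Hadoop on the classpath; PointWritable and its toBytes/fromBytes helpers are assumed names, not existing Mahout classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Local stand-in with the same two methods as Hadoop's Writable. */
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

/** Hypothetical binary record for a point: an int length, then the doubles. */
final class PointWritable implements Writable {
    private double[] values = new double[0];

    PointWritable() {}  // no-arg constructor, used when deserializing

    PointWritable(double[] values) {
        this.values = values.clone();
    }

    double[] get() {
        return values.clone();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.readDouble();
        }
    }

    /** Convenience for the example: serialize to a byte array. */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            write(new DataOutputStream(buffer));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Convenience for the example: deserialize from a byte array. */
    static PointWritable fromBytes(byte[] bytes) {
        PointWritable point = new PointWritable();
        try {
            point.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return point;
    }
}
```

A two-dimensional point takes 4 + 2 * 8 = 20 bytes in this layout, and the reading side needs no text parsing.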

Jeff

-----Original Message-----
From: Dawid Weiss [mailto:dawid.weiss@cs.put.poznan.pl] 
Sent: Friday, March 07, 2008 12:27 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Assigned: (MAHOUT-12) Point formatting and parsing
improved (StringBuilder, no need for trailing comma).


The Excel scenario doesn't really convince me, Jeff. For one thing, I don't
have Excel, though that is a minor issue. For another, I don't think anyone will
actually import data that's supposed to be very large (that's why we do it in
Hadoop, isn't it) into a spreadsheet.

In fact, I did have more thoughts about keeping the data as strings in general...
I would much prefer to have records (Hadoop records) or their subclasses
instead -- they offer good flexibility, and you could pass in your own subrecords
if you wished to attach some payload to the data points... but I decided
it was too late to pursue this.

D.


Re: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
The Excel scenario doesn't really convince me, Jeff. For one thing, I don't
have Excel, though that is a minor issue. For another, I don't think anyone will
actually import data that's supposed to be very large (that's why we do it in
Hadoop, isn't it) into a spreadsheet.

In fact, I did have more thoughts about keeping the data as strings in general...
I would much prefer to have records (Hadoop records) or their subclasses
instead -- they offer good flexibility, and you could pass in your own subrecords
if you wished to attach some payload to the data points... but I decided
it was too late to pursue this.

D.

Jeff Eastman wrote:
> The main reason I put in the trailing comma (and also the leading comma
> after [) is so that it is easy to slurp the resulting data into Excel
> spreadsheets. Without the extra delimiters, the [] characters mix with
> the data values and manual editing is required.
> 
> That said, the whole issue of formatting for Point (to be replaced with
> Vector soon) and Matrix is a minimalist hack and begs for more
> consideration. I do think the Excel use case is something that ought to
> be addressed as we move forward.
> 
> Jeff

RE: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).

Posted by Jeff Eastman <je...@collab.net>.
The main reason I put in the trailing comma (and also the leading comma
after [) is so that it is easy to slurp the resulting data into Excel
spreadsheets. Without the extra delimiters, the [] characters mix with
the data values and manual editing is required.

That said, the whole issue of formatting for Point (to be replaced with
Vector soon) and Matrix is a minimalist hack and begs for more
consideration. I do think the Excel use case is something that ought to
be addressed as we move forward.

Jeff

-----Original Message-----
From: Dawid Weiss (JIRA) [mailto:jira@apache.org] 
Sent: Thursday, March 06, 2008 5:01 AM
To: mahout-dev@lucene.apache.org
Subject: [jira] Assigned: (MAHOUT-12) Point formatting and parsing
improved (StringBuilder, no need for trailing comma).


     [ https://issues.apache.org/jira/browse/MAHOUT-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss reassigned MAHOUT-12:
---------------------------------

    Assignee: Dawid Weiss

> Point formatting and parsing improved (StringBuilder, no need for trailing comma).
> ----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-12
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-12
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Trivial
>         Attachments: mah-12.patch
>
>
> Added test case to point class, improved parsing (no need to recompile the pattern all over again) and concatenation of points (stringbuilder used internally).
