Posted to dev@mahout.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2008/03/06 14:00:58 UTC
[jira] Assigned: (MAHOUT-12) Point formatting and parsing improved
(StringBuilder, no need for trailing comma).
[ https://issues.apache.org/jira/browse/MAHOUT-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dawid Weiss reassigned MAHOUT-12:
---------------------------------
Assignee: Dawid Weiss
> Point formatting and parsing improved (StringBuilder, no need for trailing comma).
> ----------------------------------------------------------------------------------
>
> Key: MAHOUT-12
> URL: https://issues.apache.org/jira/browse/MAHOUT-12
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.1
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Trivial
> Attachments: mah-12.patch
>
>
> Added a test case for the Point class, improved parsing (the pattern is no longer recompiled on every call) and concatenation of points (a StringBuilder is used internally).
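To make the two changes concrete, here is a minimal sketch of what is described above, not the contents of mah-12.patch. The class name PointFormat, the "[1.0, 2.5, 3.0]" text format and the separator pattern are assumptions made for illustration only. It shows a Pattern compiled once as a static constant and a StringBuilder used to join the coordinates without a trailing comma.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, not the actual Mahout Point class.
final class PointFormat {
    // Compiled once and shared, instead of being rebuilt on every parse call.
    private static final Pattern SEPARATOR = Pattern.compile("\\s*,\\s*");

    private PointFormat() { }

    /** Formats a point as "[1.0, 2.5, 3.0]" (assumed format, no trailing comma). */
    static String format(double[] point) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < point.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(point[i]);
        }
        return sb.append(']').toString();
    }

    /** Parses bracketed, comma-separated values; empty tokens are skipped. */
    static double[] parse(String text) {
        String s = text.trim();
        if (s.length() < 2 || s.charAt(0) != '[' || s.charAt(s.length() - 1) != ']') {
            throw new IllegalArgumentException("Not a bracketed point: " + text);
        }
        String[] tokens = SEPARATOR.split(s.substring(1, s.length() - 1).trim());
        double[] values = new double[tokens.length];
        int n = 0;
        for (String token : tokens) {
            if (!token.isEmpty()) {
                values[n++] = Double.parseDouble(token);
            }
        }
        return Arrays.copyOf(values, n);
    }

    public static void main(String[] args) {
        String text = format(new double[] {1.0, 2.5, 3.0});
        System.out.println(text);
        System.out.println(Arrays.toString(parse(text)));
    }
}
```

Because empty tokens are skipped, this sketch still reads input written with leading or trailing commas, so a parser along these lines would not depend on the extra delimiters.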
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved
(StringBuilder, no need for trailing comma).
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> I concur that we ought to have additional Writable representations to
> make intra-Hadoop transfers more streamlined. This is certainly *not*
> too late to pursue. I would encourage you to propose a record for Point
> (which is in trunk) and these could be added to Vector and Matrix later
> (once we get past the diff-diffing stage).
I am going to go through the Mahout issues tomorrow -- let's get the minor things
out of the way first, then we will focus on further refactorings. I think it makes a
lot of sense to have custom Writables (and a way to pass a user subclass to our
jobs, so that these can be further inherited). The way I imagine this could look
is something like this (pseudo code of course):
KMeansJob job = new KMeansJob();
job.setInputKeyClass(? extends MahoutDataType);
job.setInputValueClass(? extends OtherMahoutDataType);
...
Note that this way the user can pass arbitrary records that subclass
Mahout-defined classes and the job can still freely manipulate them. I think
this would be pretty neat.
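One way the pseudo code above could be expressed in plain Java is with bounded class tokens. This is only a sketch under stated assumptions: MahoutDataType, OtherMahoutDataType and KMeansJob are stand-ins taken from the pseudo code, while LabeledPoint, WeightedValue and newKey() are hypothetical names invented here. In a real job the base types would presumably also implement Hadoop's Writable, which is left out to keep the example self-contained.

```java
// Hypothetical base types standing in for Mahout-defined records.
abstract class MahoutDataType { }
abstract class OtherMahoutDataType { }

// Hypothetical user subclasses that attach extra payload to the records.
class LabeledPoint extends MahoutDataType {
    String label = "";
}

class WeightedValue extends OtherMahoutDataType {
    double weight = 1.0;
}

// Sketch of a job that accepts any subclass of the Mahout-defined types.
class KMeansJob {
    private Class<? extends MahoutDataType> inputKeyClass;
    private Class<? extends OtherMahoutDataType> inputValueClass;

    void setInputKeyClass(Class<? extends MahoutDataType> keyClass) {
        this.inputKeyClass = keyClass;
    }

    void setInputValueClass(Class<? extends OtherMahoutDataType> valueClass) {
        this.inputValueClass = valueClass;
    }

    Class<? extends OtherMahoutDataType> getInputValueClass() {
        return inputValueClass;
    }

    /** The job only knows the base type, yet can create the user's subclass. */
    MahoutDataType newKey() {
        try {
            return inputKeyClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + inputKeyClass, e);
        }
    }

    public static void main(String[] args) {
        KMeansJob job = new KMeansJob();
        job.setInputKeyClass(LabeledPoint.class);
        job.setInputValueClass(WeightedValue.class);
        System.out.println(job.newKey().getClass().getSimpleName());
    }
}
```

The bounded wildcard means that passing a class outside the hierarchy, for example String.class, is rejected at compile time.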
D.
RE: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).
Posted by Jeff Eastman <je...@collab.net>.
Ted noted an easy fix to my Excel use case that I wasn't aware of, so my
point is agreeably moot.
I concur that we ought to have additional Writable representations to
make intra-Hadoop transfers more streamlined. This is certainly *not*
too late to pursue. I would encourage you to propose a record for Point
(which is in trunk) and these could be added to Vector and Matrix later
(once we get past the diff-diffing stage).
Jeff
-----Original Message-----
From: Dawid Weiss [mailto:dawid.weiss@cs.put.poznan.pl]
Sent: Friday, March 07, 2008 12:27 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Assigned: (MAHOUT-12) Point formatting and parsing
improved (StringBuilder, no need for trailing comma).
The Excel scenario doesn't really convince me much, Jeff. For one thing, I don't
have Excel, but this is a minor issue; for another -- I don't think anyone will
actually import stuff that's supposed to be very large (that's why we do it in
Hadoop, don't we) into a spreadsheet.
In fact I did have more thoughts about keeping the data as strings in general...
I would much prefer to have records (Hadoop records) or their subclasses
instead -- they offer good flexibility and you could pass in your own subrecords
if you wished to have some payload attached to the data points... but I decided
it's too late to pursue this.
D.
Re: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved
(StringBuilder, no need for trailing comma).
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
The Excel scenario doesn't really convince me much, Jeff. For one thing, I don't
have Excel, but this is a minor issue; for another -- I don't think anyone will
actually import stuff that's supposed to be very large (that's why we do it in
Hadoop, don't we) into a spreadsheet.
In fact I did have more thoughts about keeping the data as strings in general...
I would much prefer to have records (Hadoop records) or their subclasses
instead -- they offer good flexibility and you could pass in your own subrecords
if you wished to have some payload attached to the data points... but I decided
it's too late to pursue this.
D.
Jeff Eastman wrote:
> The main reason I put in the trailing comma (and also the leading comma
> after [) is so that it is easy to slurp the resulting data into Excel
> spreadsheets. Without the extra delimiters, the [] characters mix with
> the data values and manual editing is required.
>
> That said, the whole issue of formatting for Point (to be replaced with
> Vector soon) and Matrix is a minimalist hack and begs for more
> consideration. I do think the Excel use case is something that ought to
> be addressed as we move forward.
>
> Jeff
RE: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).
Posted by Jeff Eastman <je...@collab.net>.
The main reason I put in the trailing comma (and also the leading comma
after [) is so that it is easy to slurp the resulting data into Excel
spreadsheets. Without the extra delimiters, the [] characters mix with
the data values and manual editing is required.
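The effect of the extra delimiters can be illustrated with a small sketch. The exact output format is an assumption here: "[, 1.0, 2.0, ]" stands for a point written with the leading and trailing commas, and "[1.0, 2.0]" for one written without them. The class CsvCells and its cells() method are hypothetical and only mimic a comma-delimited import.

```java
import java.util.Arrays;

// Hypothetical helper that mimics a comma-delimited spreadsheet import.
final class CsvCells {
    private CsvCells() { }

    /** Splits a line on commas and trims each resulting cell. */
    static String[] cells(String line) {
        String[] parts = line.split(",", -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    public static void main(String[] args) {
        // Assumed format with extra delimiters: brackets get cells of their own.
        System.out.println(Arrays.toString(cells("[, 1.0, 2.0, ]")));
        // Assumed format without them: brackets stay glued to the first and last value.
        System.out.println(Arrays.toString(cells("[1.0, 2.0]")));
    }
}
```

With the extra commas every numeric cell contains only a number, while without them the first and last cells come out as "[1.0" and "2.0]" and need manual editing.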
That said, the whole issue of formatting for Point (to be replaced with
Vector soon) and Matrix is a minimalist hack and begs for more
consideration. I do think the Excel use case is something that ought to
be addressed as we move forward.
Jeff