
Posted to dev@mahout.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2011/03/09 14:06:59 UTC

[jira] Created: (MAHOUT-621) Support more data import mechanisms

Support more data import mechanisms
-----------------------------------

                 Key: MAHOUT-621
                 URL: https://issues.apache.org/jira/browse/MAHOUT-621
             Project: Mahout
          Issue Type: Improvement
            Reporter: Grant Ingersoll


We should have more ways of getting data in:

1. ARFF (MAHOUT-155)
2. CSV (MAHOUT-548)
3. Databases
4. Behemoth (Tika, Map-Reduce)
5. Other

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (MAHOUT-621) Support more data import mechanisms

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004651#comment-13004651 ] 

Ted Dunning commented on MAHOUT-621:
------------------------------------

The data sources that I have mostly seen include:

- document-like things that have semi-structured fields.  This includes most of our recommendation-style inputs if you do a group by user id and collect the values of the item being rated.  It also includes document inputs, where the Lucene document is an excellent example.

- SQL queries which ultimately produce something that looks like a document, possibly by denormalizing the final query result.

- time series.  The OpenTSDB project has the nicest time series schema that I have seen.
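
A minimal sketch of the "group by user id and collect the values" step from the first bullet, assuming ratings arrive as (user id, item id, rating) triples. The Rating and UserDocuments names are hypothetical and are not part of Mahout; each user ends up as one document-like map from item id to rating.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record of one preference: (userId, itemId, rating value).
final class Rating {
    final long userId;
    final long itemId;
    final double value;

    Rating(long userId, long itemId, double value) {
        this.userId = userId;
        this.itemId = itemId;
        this.value = value;
    }
}

final class UserDocuments {
    // Groups ratings by user id so that each user becomes one "document"
    // whose fields are item ids and whose field values are the ratings.
    // A later rating for the same (user, item) pair overwrites an earlier one.
    static Map<Long, Map<Long, Double>> groupByUser(List<Rating> ratings) {
        Map<Long, Map<Long, Double>> docs = new LinkedHashMap<>();
        for (Rating r : ratings) {
            docs.computeIfAbsent(r.userId, id -> new LinkedHashMap<>())
                .put(r.itemId, r.value);
        }
        return docs;
    }
}
```

The same shape would also fit a denormalized SQL result, with one row per (document id, field, value).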



> Support more data import mechanisms
> -----------------------------------
>
>                 Key: MAHOUT-621
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-621
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>              Labels: gsoc2011,, mahout-gsoc-11
>
> We should have more ways of getting data in:
> 1. ARFF (MAHOUT-155)
> 2. CSV (MAHOUT-548)
> 3. Databases
> 4. Behemoth (Tika, Map-Reduce)
> 5. Other


[jira] [Commented] (MAHOUT-621) Support more data import mechanisms

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013017#comment-13013017 ] 

Ted Dunning commented on MAHOUT-621:
------------------------------------

{quote}
Makes sense?
{quote}

No.

I don't understand why declaring Mahout as an ivy/maven dependency didn't bring in all transitive dependencies.  Why do you have a lib directory at all?




[jira] [Commented] (MAHOUT-621) Support more data import mechanisms

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013542#comment-13013542 ] 

Julien Nioche commented on MAHOUT-621:
--------------------------------------

bc. I don't understand why declaring Mahout as an ivy/maven dependency didn't bring in all transitive dependencies. 

you've obviously not understood the explanations above or Han Hui Wen's

bc. Why do you have a lib directory at all?

this is within the job file that I generate and is used to store the dependencies. AFAIK this is a pattern used in other Hadoop-related projects and is not particularly unusual or stupid






[jira] [Commented] (MAHOUT-621) Support more data import mechanisms

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012960#comment-13012960 ] 

Julien Nioche commented on MAHOUT-621:
--------------------------------------

From https://issues.apache.org/jira/browse/MAHOUT-368

{quote} > Why not have a bundle artifact where all the Mahout submodules would be put in a single jar? 

 How is this not trivial for you to handle with maven?
 If you are writing your own maven project (recommended), then jar-with-dependencies will do what you want.
 If you are extending Mahout (ok for prototypes), just put your code in the examples job jar and all will be good.
{quote}

I am not extending Mahout and, as you've probably seen in the comments above, the point is to be able to generate Mahout data structures from Behemoth, so putting the code in examples is not an option anyway.

Back to the original problem. I generate a job file for my Mahout module in Behemoth (https://github.com/jnioche/behemoth/tree/master/modules/mahout) and manage the dependencies with Ivy. The main class (SparseVectorsFromBehemoth) is a slightly modified version of SparseVectorsFromSequenceFiles which gets the Tokens from Behemoth documents instead of using Lucene and generates the data structures expected by the classifiers and clusterers.

The job file contains:
 * the Behemoth classes for the Mahout module
 * the dependencies in /lib including
  ** mahout-math-0.4.jar
  ** mahout-core-0.4.jar

The problem I had was the same as Han Hui Wen's (MAHOUT-368), i.e. I was getting a class-not-found exception on org.apache.mahout.math.VectorWritable. My understanding of the problem is that my main class calls DictionaryVectorizer, which in my job file was in lib/mahout-core-0.4.jar, and this has a dependency on VectorWritable, which is in lib/mahout-math-0.4.jar.  For some reason MapReduce was not able to find VectorWritable, which I assume has to do with the jobs in DictionaryVectorizer calling 'job.setJarByClass(DictionaryVectorizer.class)'.

I could of course use jar-with-dependencies on the Mahout code, generate a single jar, and then manage the jar locally. However, this means that I would have very little control over the dependencies used by Mahout (e.g. potentially conflicting versions with other components in my job files), and I'd rather rely on external publicised jars anyway. A better option would be to simply unpack the content of the Mahout core and math jars into the root of my job file. At least the Mahout dependencies would be handled and versioned normally. 

I've tried with Hadoop 0.21.0 and did not get this issue so I suppose that something must have changed in the way the classloader handles dependencies within a job file. 

Makes sense?
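
When chasing this kind of class-not-found error, it can help to confirm which nested jar under lib/ in the job file actually contains the missing class. The following is only a diagnostic sketch using the standard java.util.zip classes. JobFileInspector and findClass are hypothetical names, and the code says nothing about how Hadoop itself resolves classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class JobFileInspector {
    // Returns where the given class was found inside a job file:
    // "(job file root)" for a class unpacked at the root, or the
    // name of each nested jar under lib/ that contains it.
    static List<String> findClass(InputStream jobFile, String className)
            throws IOException {
        String wanted = className.replace('.', '/') + ".class";
        List<String> hits = new ArrayList<>();
        ZipInputStream outer = new ZipInputStream(jobFile);
        ZipEntry entry;
        while ((entry = outer.getNextEntry()) != null) {
            String name = entry.getName();
            if (name.equals(wanted)) {
                hits.add("(job file root)");
            } else if (name.startsWith("lib/") && name.endsWith(".jar")) {
                // Copy the nested jar into memory, then scan its entries.
                byte[] nested = readAll(outer);
                ZipInputStream inner =
                        new ZipInputStream(new ByteArrayInputStream(nested));
                ZipEntry innerEntry;
                while ((innerEntry = inner.getNextEntry()) != null) {
                    if (innerEntry.getName().equals(wanted)) {
                        hits.add(name);
                        break;
                    }
                }
            }
        }
        return hits;
    }

    // Reads the remainder of the current zip entry into a byte array.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }
}
```

Calling findClass with a FileInputStream on the job file and "org.apache.mahout.math.VectorWritable" would show whether the class sits in a lib/ jar or at the root of the job file.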



[jira] [Resolved] (MAHOUT-621) Support more data import mechanisms

Posted by "Sean Owen (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-621.
------------------------------

    Resolution: Won't Fix

No action on this issue, and hasn't been touched in 7 months. I think it's a bit too high-level, and could come back with more specific sub-issues. Note that we did refactor, package and improve some integration into the integration/ module, so it's kind of been addressed.
                

[jira] Updated: (MAHOUT-621) Support more data import mechanisms

Posted by "Ulrich Stärk (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ulrich Stärk updated MAHOUT-621:
--------------------------------

    Labels: gsoc2011 mahout-gsoc-11  (was: gsoc2011, mahout-gsoc-11)


[jira] [Commented] (MAHOUT-621) Support more data import mechanisms

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012029#comment-13012029 ] 

Julien Nioche commented on MAHOUT-621:
--------------------------------------

Re Behemoth: I've started working on a Mahout module (https://github.com/jnioche/behemoth/tree/master/modules/mahout) which will help convert the Behemoth sequence files into vectors, as done by seq2sparse.

Am searching for a way to get round https://issues.apache.org/jira/browse/MAHOUT-368 but I think this is the last hurdle in the way before the module is fully functional.


[jira] [Commented] (MAHOUT-621) Support more data import mechanisms

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012099#comment-13012099 ] 

Ted Dunning commented on MAHOUT-621:
------------------------------------

{quote}
Am searching for a way to get round https://issues.apache.org/jira/browse/MAHOUT-368 but I think this is the last hurdle in the way before the module is fully functional.
{quote}
How is this different from having more than one dependency?

Can't you just use jar-with-dependencies (with maven) or the ant-ish equivalent?




[jira] Commented: (MAHOUT-621) Support more data import mechanisms

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004735#comment-13004735 ] 

Ted Dunning commented on MAHOUT-621:
------------------------------------

It would be easy to do an HBase query and pass the data to Mahout.  It would not be easy for Mahout to use the data without the good offices of HBase.



[jira] Commented: (MAHOUT-621) Support more data import mechanisms

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004543#comment-13004543 ] 

Sean Owen commented on MAHOUT-621:
----------------------------------

FWIW I envision this as a series of support utilities in mahout-utils, perhaps, that make it very easy to import data into Vectors. Is that about right? It'd be good to have one theory of what data looks like coming in, and provide means to ingest data from m sources into that format for use in n algorithms, rather than support m*n source/algo combinations.
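
The "m sources, one format, n algorithms" idea can be sketched roughly as below. Everything here is an illustrative assumption: VectorSource, CsvLinesSource and Centroid are hypothetical names, and a plain double[] stands in for Mahout's own Vector type so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical common format: every source yields dense double[] vectors.
interface VectorSource {
    List<double[]> readVectors();
}

// One adapter per source; this one parses comma-separated numeric lines.
final class CsvLinesSource implements VectorSource {
    private final List<String> lines;

    CsvLinesSource(List<String> lines) {
        this.lines = lines;
    }

    @Override
    public List<double[]> readVectors() {
        List<double[]> out = new ArrayList<>();
        for (String line : lines) {
            String[] cells = line.trim().split(",");
            double[] v = new double[cells.length];
            for (int i = 0; i < cells.length; i++) {
                v[i] = Double.parseDouble(cells[i].trim());
            }
            out.add(v);
        }
        return out;
    }
}

// An "algorithm" that only knows the common format, never the source.
final class Centroid {
    // Computes the component-wise mean; assumes all vectors have equal length.
    static double[] of(VectorSource source) {
        List<double[]> vectors = source.readVectors();
        if (vectors.isEmpty()) {
            throw new IllegalArgumentException("no vectors to average");
        }
        double[] mean = new double[vectors.get(0).length];
        for (double[] v : vectors) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += v[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= vectors.size();
        }
        return mean;
    }
}
```

Adding an ARFF or database source would then mean writing one more VectorSource, with no change to the algorithms.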


[jira] Commented: (MAHOUT-621) Support more data import mechanisms

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004732#comment-13004732 ] 

Shannon Quinn commented on MAHOUT-621:
--------------------------------------

On the note of OpenTSDB, is that something Hadoop supports or could support? I understand it's built on top of HBase, but could Mahout theoretically use this data transparently?


[jira] [Issue Comment Edited] (MAHOUT-621) Support more data import mechanisms

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013542#comment-13013542 ] 

Julien Nioche edited comment on MAHOUT-621 at 3/30/11 4:57 PM:
---------------------------------------------------------------

bq. I don't understand why declaring Mahout as an ivy/maven dependency didn't bring in all transitive dependencies. 

you've obviously not understood the explanations above or Han Hui Wen's

bq. Why do you have a lib directory at all?

this is within the job file that I generate and is used to store the dependencies. AFAIK this is a pattern used in other Hadoop-related projects and is not particularly unusual or stupid









  

[jira] [Issue Comment Edited] (MAHOUT-621) Support more data import mechanisms

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013542#comment-13013542 ] 

Julien Nioche edited comment on MAHOUT-621 at 3/30/11 4:57 PM:
---------------------------------------------------------------

bq. I don't understand why declaring Mahout as an ivy/maven dependency didn't bring in all transitive dependencies. 

you've obviously not understood the explanations above or Han Hui Wen's: we do get the dependencies, OK?

bq. Why do you have a lib directory at all?

this is within the job file that I generate and is used to store the dependencies. AFAIK this is a pattern used in other Hadoop-related projects and is not particularly unusual or stupid









  

[jira] Commented: (MAHOUT-621) Support more data import mechanisms

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004978#comment-13004978 ] 

Lance Norskog commented on MAHOUT-621:
--------------------------------------

Can there be export mechanisms too?
