Posted to user@mahout.apache.org by Neil Chaudhuri <nc...@potomacfusion.com> on 2011/12/19 21:01:15 UTC

Riffle and Mahout

Does anyone have any code to share about how to use Riffle (and Cascading) with Mahout? I have a class wrapping a Mahout operation, but I am getting a NullPointerException when I add this class to my Cascade. I think the key line is this:

11/12/19 14:50:14 INFO flow.Flow: [mahoutVectorizer] atleast one sink does not exist

This is despite having a method annotated as follows:

    @DependencyOutgoing
    public Path getOutgoing() {
        return outputFilePath;
    }

Any insight is appreciated.

Thanks.


Re: Riffle and Mahout

Posted by Neil Chaudhuri <nc...@potomacfusion.com>.
The Mahout wrapper class is as follows:

@riffle.process.Process
public class VectorSequenceFileGenerator {
    private Path inputFilePath;
    private Path outputFilePath;
    private Configuration configuration;

    public VectorSequenceFileGenerator(String inputFileName, String outputFileName, Configuration configuration) {
        this(new Path(inputFileName), new Path(outputFileName), configuration);
    }

    public VectorSequenceFileGenerator(Path inputFilePath, Path outputFilePath, Configuration configuration) {
        this.inputFilePath = inputFilePath;
        this.outputFilePath = outputFilePath;
        this.configuration = configuration;
    }

    @DependencyOutgoing
    public Path getOutgoing() {
        return outputFilePath;
    }

    @DependencyIncoming
    public Path getIncoming() {
        return inputFilePath;
    }

    @ProcessStart
    public void start() {
        try {
            DictionaryVectorizer.createTermFrequencyVectors(
                inputFilePath,
                outputFilePath,
                configuration,
                1,
                2,
                0.5f,
                100,
                true,
                0,
                4000,
                true,
                true);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @ProcessComplete
    public void complete() {
        start();
    }
}
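
Since the constructors accept the paths and the configuration without checking them, a null argument would only show up later inside start(). A minimal sketch of failing fast instead, using only the JDK: the class name CheckedProcessArgs is hypothetical, and plain String and Object fields stand in for Path and Configuration so the example runs without Hadoop on the classpath. Whether a null argument is actually the cause of the exception here is an assumption, not something this thread confirms.

```java
import java.util.Objects;

// Hypothetical stand-in for the wrapper class: String/Object replace
// Hadoop's Path and Configuration so the sketch is self-contained.
public class CheckedProcessArgs {
    private final String inputFilePath;
    private final String outputFilePath;
    private final Object configuration;

    public CheckedProcessArgs(String inputFilePath, String outputFilePath, Object configuration) {
        // Each check throws a NullPointerException that names the bad argument,
        // at construction time rather than deep inside the wrapped job.
        this.inputFilePath = Objects.requireNonNull(inputFilePath, "inputFilePath must not be null");
        this.outputFilePath = Objects.requireNonNull(outputFilePath, "outputFilePath must not be null");
        this.configuration = Objects.requireNonNull(configuration, "configuration must not be null");
    }

    public String describe() {
        return inputFilePath + " -> " + outputFilePath;
    }
}
```

With checks like these, a missing configuration would be reported where the object is built (for example in the Groovy script) rather than when the Cascade runs.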



Meanwhile, this is integrated into a Cascade in Groovy as follows:

def vectorizerFlow = new ProcessFlow("mahoutVectorizer", binding.variables.get("vectorSequenceFileGenerator"))
def connector = new CascadeConnector()
def cascade = connector.connect( firstFlow, vectorizerFlow )
cascade.complete();





The stack trace goes like this:

java.lang.RuntimeException: java.lang.NullPointerException
	at com.myapp.VectorSequenceFileGenerator.start(VectorSequenceFileGenerator.java:66)
	at com.myapp.VectorSequenceFileGenerator.complete(VectorSequenceFileGenerator.java:72)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at riffle.process.scheduler.ProcessWrapper.invokeMethod(ProcessWrapper.java:178)
	at riffle.process.scheduler.ProcessWrapper.findInvoke(ProcessWrapper.java:166)
	at riffle.process.scheduler.ProcessWrapper.complete(ProcessWrapper.java:147)
	at cascading.flow.ProcessFlow.complete(ProcessFlow.java:172)
	at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:705)
	at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:653)
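
The frames above belong to the wrapping RuntimeException only. Because start() rethrows with new RuntimeException(e), the original NullPointerException is kept as the cause, and its own frames are what would show where the null was dereferenced. A minimal sketch of digging that out, using only the JDK: the class name RootCause and its method of() are hypothetical helpers, and the null dereference in main() is just a stand-in for whatever fails inside the Mahout call.

```java
public class RootCause {
    /** Walks the getCause() chain and returns the innermost throwable. */
    public static Throwable of(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        try {
            try {
                Object missing = null;
                missing.toString(); // deliberately triggers a NullPointerException
            } catch (Exception e) {
                throw new RuntimeException(e); // same wrapping as in start()
            }
        } catch (RuntimeException wrapped) {
            Throwable root = of(wrapped);
            // Print the root cause with its own stack trace, not the wrapper's.
            root.printStackTrace();
        }
    }
}
```

Catching the exception around cascade.complete() and printing the root cause this way should give the frame inside the wrapped call, which is the part missing from the trace as posted.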


Thanks.





On 12/19/11 3:10 PM, "Ken Krugler" <kk...@transpac.com> wrote:

>I haven't fooled around with Riffle, but I have done some extraction in
>the past of Mahout components to use with Cascading.
>
>And I'm interested in using Cascading 2.0 (APL vs. GPLv3 license) with
>Mahout, so if you can share more details I'd be happy to take a look.
>
>Regards,
>
>-- Ken


Re: Riffle and Mahout

Posted by Ken Krugler <kk...@transpac.com>.
I haven't fooled around with Riffle, but I have done some extraction in the past of Mahout components to use with Cascading.

And I'm interested in using Cascading 2.0 (APL vs. GPLv3 license) with Mahout, so if you can share more details I'd be happy to take a look.

Regards,

-- Ken

On Dec 19, 2011, at 12:01pm, Neil Chaudhuri wrote:

> Does anyone have any code to share about how to use Riffle (and Cascading) with Mahout? I have a class wrapping a Mahout operation, but I am getting a NullPointerException when I add this class to my Cascade. I think the key line is this:
> 
> 11/12/19 14:50:14 INFO flow.Flow: [mahoutVectorizer] atleast one sink does not exist
> 
> This is despite having a method annotated as follows:
> 
> @DependencyOutgoing
>    public Path getOutgoing() {
>        return outputFilePath;
>    }
> 
> Any insight is appreciated.
> 
> Thanks.
> 

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr