Posted to user@mahout.apache.org by Stanley Xu <we...@gmail.com> on 2011/05/06 05:16:09 UTC

Re: how to get input in parallel FPGrowth

PFP in Mahout accepts text input, and you can specify the splitter used to
separate the columns. For other data sources, the easiest way is to convert
the data to text, separate the columns with a tab ('\t'), and put the file
into HDFS as the PFP input.
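
As a rough illustration of the conversion step, here is a minimal Python sketch that dumps rows from a database into tab-separated text, one transaction per line. Everything specific in it is an assumption made for the example: the table name `purchases`, its columns `transaction_id` and `item`, and the use of SQLite as a stand-in for MySQL or another source. It is not part of Mahout.

```python
import sqlite3
from collections import defaultdict


def export_transactions(conn, out_path):
    """Write one transaction per line, with its items separated by tabs.

    Assumes a hypothetical table purchases(transaction_id, item).
    Items must not contain tab characters themselves.
    Returns the number of transactions written.
    """
    baskets = defaultdict(list)
    query = (
        "SELECT transaction_id, item FROM purchases "
        "ORDER BY transaction_id, item"
    )
    # Group the items of each transaction together.
    for tx_id, item in conn.execute(query):
        baskets[tx_id].append(str(item))

    # newline="" keeps "\n" as the line ending on every platform.
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for tx_id in sorted(baskets):
            out.write("\t".join(baskets[tx_id]) + "\n")
    return len(baskets)
```

The resulting file can then be copied into HDFS (for example with `hadoop fs -put`) and used as the PFP input.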

On Fri, May 6, 2011 at 9:35 AM, hustnn <nz...@gmail.com> wrote:

> I saw a topic of yours about "converting data in databases (flat files,
> XML dumps, MySQL, Cassandra, different formats on HDFS, HBase) into an
> intermediate form (say, vectors)".
>
> I know that parallel FPGrowth can use Hadoop to distribute computation
> across different tasktrackers easily in a map-reduce way, but I want to know
> how parallel FPGrowth works with other databases such as MySQL, Cassandra,
> and HBase. How does it get its input, and how does it distribute the
> computation so that it runs in parallel?
>
> Thanks.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-get-input-in-parallel-FPGrowth-tp2906536p2906536.html
> Sent from the Mahout Developer List mailing list archive at Nabble.com.
>

Re: how to get input in parallel FPGrowth

Posted by Stanley Xu <we...@gmail.com>.

1. A killed task attempt is normal behavior. By default, Hadoop enables
speculative execution, which means it may create two attempts for the same
mapper; once one of the attempts is done, it kills the one that has not
finished.

2. There are many possible reasons for one mapper to take much longer than
the others. Maybe its input file is much larger, or the data in that mapper
consumes more CPU. Or the cluster node handling the mapper is under heavy
load. It is hard to identify the root cause without more context.
You could check the inputs to figure out the reason, or simply re-run
the task to see whether it takes much longer again.
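
One simple way to check the inputs is to compare file sizes and look for outliers. The Python sketch below is only an illustration under stated assumptions: the input files are reachable through a local or mounted path, each mapper reads roughly one file, and the `ratio` threshold of 2.0 is an arbitrary choice for the example.

```python
import os
from statistics import median


def find_oversized_inputs(paths, ratio=2.0):
    """Return (path, size_in_bytes) pairs for unusually large files.

    A file is reported when its size exceeds `ratio` times the median
    size of all given files. The result is sorted, largest first.
    """
    sizes = {path: os.path.getsize(path) for path in paths}
    if not sizes:
        return []
    typical = median(sizes.values())
    oversized = [(p, s) for p, s in sizes.items() if s > ratio * typical]
    return sorted(oversized, key=lambda pair: pair[1], reverse=True)
```

If no file stands out, the slowdown is more likely caused by the data content or by the load on the node than by the input size.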

BTW, posting the question to the Mahout mailing list will probably get more
feedback, and might help others who have the same problem, compared to
sending it directly to me. :-)

Best wishes,
Stanley Xu

On Tue, May 24, 2011 at 3:15 PM, nn hust <nz...@gmail.com> wrote:

> Hi, when I use pfp-growth I ran into a problem: the first map
> takes much more time than the others, and one task gets killed. I
> don't find any error info in the log file. Do you know the cause?
>
> You can see the picture of the Hadoop web tools I sent to you.
>
>
> Thanks.
>