
Posted to user@mahout.apache.org by Xiaobo Gu <gu...@gmail.com> on 2011/07/12 15:19:42 UTC

File format question about Random forest.

Hi,

The Random Forest partial implementation described at
https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
uses the ARFF file format. Is ARFF the only supported file format when
using the BuildForest and TestForest programs? And are BuildForest and
TestForest the official tools for building Random Forest models
from the command line?

Regards,

Xiaobo Gu

Re: File format question about Random forest.

Posted by deneche abdelhakim <ad...@gmail.com>.

I think Hadoop can read files from the local file system if you put
"file:///" before the path.
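
As a rough illustration of the "file:///" prefix, here is a minimal Python
sketch that turns an absolute local path into such a URI. The helper name
to_file_uri and the example path are made up for this illustration; whether a
particular Hadoop or Mahout tool accepts the resulting URI depends on the
cluster configuration and is not confirmed in this thread.

```python
from urllib.parse import quote


def to_file_uri(local_path: str) -> str:
    """Turn an absolute POSIX path into a file:/// URI (hypothetical helper)."""
    if not local_path.startswith("/"):
        raise ValueError("expected an absolute path such as /data/train.csv")
    # quote() percent-encodes spaces and other unsafe characters but keeps "/".
    return "file://" + quote(local_path)


print(to_file_uri("/data/train.csv"))  # prints file:///data/train.csv
```

The three slashes come from the empty host part ("file://") followed by the
leading "/" of the absolute path.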

On Fri, Jul 15, 2011 at 3:55 PM, Xiaobo Gu <gu...@gmail.com> wrote:

> Do the -p and -f options of org.apache.mahout.df.tools.Describe have to
> be HDFS URLs, or can they be local file system paths?

Re: File format question about Random forest.

Posted by Xiaobo Gu <gu...@gmail.com>.

Do the -p and -f options of org.apache.mahout.df.tools.Describe have to
be HDFS URLs, or can they be local file system paths?


On Fri, Jul 15, 2011 at 9:28 PM, Xiaobo Gu <gu...@gmail.com> wrote:
> Can we create the descriptor file as follows?
>
> 1. Make a small CSV file with the same format as the actual dataset,
> say a CSV file with a header and only one record.
> 2. Use java weka.core.converters.CSVLoader filename.csv >
> filename.arff to convert the small CSV into an ARFF file; see
> http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
> 3. Use org.apache.mahout.df.tools.Describe to generate a descriptor.
>
>
> The only concern here is: is a small CSV file with one record
> sufficient to generate the ARFF file header, or do we have to
> use the whole file to avoid losing information?
>
>
> Xiaobo Gu

Re: File format question about Random forest.

Posted by deneche abdelhakim <ad...@gmail.com>.

Use the Describe tool; the Partial Implementation wiki page explains how
to use it. And yes, the descriptor file must be supplied.
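
To give a feel for what a descriptor has to capture, here is a small Python
sketch that guesses whether each column of a headerless CSV looks numerical or
categorical. This is only an illustration and not how the Describe tool works;
the function name and the "N"/"C" labels are chosen for this example.

```python
import csv
import io


def guess_attribute_kinds(csv_text: str) -> list:
    """Label each column "N" if every value parses as a float, else "C"."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    kinds = []
    for column in zip(*rows):  # transpose rows into columns
        try:
            for value in column:
                float(value)
            kinds.append("N")
        except ValueError:
            kinds.append("C")
    return kinds


sample = "5.1,red,1\n4.9,blue,0\n"
print(guess_attribute_kinds(sample))  # prints ['N', 'C', 'N']
```

Note that a category encoded with digits (such as the 0/1 column above) is
guessed as numerical, so a guess like this still needs to be reviewed by hand.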

On Sat, Jul 16, 2011 at 5:57 AM, Xiaobo Gu <gu...@gmail.com> wrote:

> But if I just use a CSV file, how can I generate the descriptor file?
> Must a descriptor file be supplied for BuildForest and TestForest?

Re: File format question about Random forest.

Posted by Xiaobo Gu <gu...@gmail.com>.

But if I just use a CSV file, how can I generate the descriptor file?
Must a descriptor file be supplied for BuildForest and TestForest?


On Sat, Jul 16, 2011 at 5:39 AM, deneche abdelhakim <ad...@gmail.com> wrote:
> You don't need to convert the CSV file to ARFF; you can use it right away.
>
> You can use a small dataset as long as every value of each categorical
> attribute appears in it.

Re: File format question about Random forest.

Posted by deneche abdelhakim <ad...@gmail.com>.

You don't need to convert the CSV file to ARFF; you can use it right away.

You can use a small dataset as long as every value of each categorical
attribute appears in it.
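
The condition on categorical values can be checked before generating a
descriptor from a sample. The following Python sketch is one way to do it,
assuming headerless, unquoted CSV text and that you already know which column
indices are categorical; the function name and the column indices are
hypothetical.

```python
import csv
import io


def missing_categories(full_csv: str, sample_csv: str, categorical_cols: list) -> dict:
    """Return {column index: values present in the full data but not in the sample}."""

    def values_per_column(text):
        seen = {col: set() for col in categorical_cols}
        for row in csv.reader(io.StringIO(text)):
            if not row:  # skip blank lines
                continue
            for col in categorical_cols:
                seen[col].add(row[col])
        return seen

    full = values_per_column(full_csv)
    sample = values_per_column(sample_csv)
    # Keep only the columns where the sample lacks at least one value.
    return {col: full[col] - sample[col]
            for col in categorical_cols if full[col] - sample[col]}


full_data = "1.0,red,yes\n2.5,blue,no\n3.1,green,yes\n"
small_sample = "1.0,red,yes\n2.5,blue,no\n"
print(missing_categories(full_data, small_sample, [1, 2]))  # prints {1: {'green'}}
```

An empty result means the sample covers every categorical value found in the
full dataset; a one-record sample would rarely pass this check.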

On Fri, Jul 15, 2011 at 2:28 PM, Xiaobo Gu <gu...@gmail.com> wrote:

> Can we create the descriptor file as follows?
>
> 1. Make a small CSV file with the same format as the actual dataset,
> say a CSV file with a header and only one record.
> 2. Use java weka.core.converters.CSVLoader filename.csv >
> filename.arff to convert the small CSV into an ARFF file; see
> http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
> 3. Use org.apache.mahout.df.tools.Describe to generate a descriptor.
>
>
> The only concern here is: is a small CSV file with one record
> sufficient to generate the ARFF file header, or do we have to
> use the whole file to avoid losing information?
>
>
> Xiaobo Gu

Re: File format question about Random forest.

Posted by Xiaobo Gu <gu...@gmail.com>.

Can we create the descriptor file as follows?

1. Make a small CSV file with the same format as the actual dataset,
say a CSV file with a header and only one record.
2. Use java weka.core.converters.CSVLoader filename.csv >
filename.arff to convert the small CSV into an ARFF file; see
http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
3. Use org.apache.mahout.df.tools.Describe to generate a descriptor.


The only concern here is: is a small CSV file with one record
sufficient to generate the ARFF file header, or do we have to
use the whole file to avoid losing information?


Xiaobo Gu




On Fri, Jul 15, 2011 at 9:10 PM, Xiaobo Gu <gu...@gmail.com> wrote:
> But if we use CSV files, how can we generate descriptors for datasets?
>
> Cheers
>
> Xiaobo Gu

Re: File format question about Random forest.

Posted by Xiaobo Gu <gu...@gmail.com>.

But if we use CSV files, how can we generate descriptors for datasets?

Cheers

Xiaobo Gu

On Thu, Jul 14, 2011 at 1:27 AM, deneche abdelhakim <ad...@gmail.com> wrote:
> I guess yes, as long as you don't use single or double quotes to enclose
> the fields.

Re: File format question about Random forest.

Posted by deneche abdelhakim <ad...@gmail.com>.

I guess yes, as long as you don't use single or double quotes to enclose
the fields.
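
A quick way to spot fields that would break this rule is to scan the file for
quote characters before feeding it to the tools. Here is a minimal Python
sketch; the function name is made up for this example, and it simply flags any
line containing a single or double quote.

```python
def lines_with_quotes(csv_text: str) -> list:
    """Return the 1-based numbers of lines containing ' or " characters."""
    return [number
            for number, line in enumerate(csv_text.splitlines(), start=1)
            if '"' in line or "'" in line]


data = '1,a\n2,"b c"\n3,d\n'
print(lines_with_quotes(data))  # prints [2]
```

An empty list suggests the file uses plain, unquoted fields.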

On Wed, Jul 13, 2011 at 2:58 PM, Xiaobo Gu <gu...@gmail.com> wrote:

> So for simple datasets that have only numeric columns and categorical
> columns with character labels (without blanks), can we just use CSV tools
> to save them as standard CSV files without a header?

Re: File format question about Random forest.

Posted by Xiaobo Gu <gu...@gmail.com>.

So for simple datasets that have only numeric columns and categorical
columns with character labels (without blanks), can we just use CSV tools
to save them as standard CSV files without a header?


On Wed, Jul 13, 2011 at 3:53 AM, deneche abdelhakim <ad...@gmail.com> wrote:
> The current implementation doesn't support the ARFF format out of the box.
> As described in the wiki, you need to remove the header of the file and leave
> only the data. In fact, this implementation is fully compatible with UCI's
> datasets, which are comma-separated text files. You'll also need to run the
> dataset description tool (see the wiki) to generate a proper
> description file (it records the nature of each attribute: Numerical or
> Categorical).
>
> Yes, you can use BuildForest and TestForest to build and use Random Forest
> models from the command line.

Re: File format question about Random forest.

Posted by deneche abdelhakim <ad...@gmail.com>.

The current implementation doesn't support the ARFF format out of the box.
As described in the wiki, you need to remove the header of the file and leave
only the data. In fact, this implementation is fully compatible with UCI's
datasets, which are comma-separated text files. You'll also need to run the
dataset description tool (see the wiki) to generate a proper
description file (it records the nature of each attribute: Numerical or
Categorical).

Yes, you can use BuildForest and TestForest to build and use Random Forest
models from the command line.
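
The header removal described above could be scripted roughly as follows. This
is a minimal Python sketch, assuming a dense ARFF file in which everything up
to and including the @data line is header and lines starting with % are
comments; the function name is hypothetical and sparse ARFF or quoted values
are not handled.

```python
def arff_data_rows(arff_text: str) -> list:
    """Return only the data rows of a dense ARFF file, dropping header and comments."""
    rows = []
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        if not in_data:
            # ARFF keywords are case-insensitive, so compare in lower case.
            if stripped.lower() == "@data":
                in_data = True
            continue
        if stripped and not stripped.startswith("%"):
            rows.append(stripped)
    return rows


example = (
    "% toy example\n"
    "@relation weather\n"
    "@attribute temperature numeric\n"
    "@attribute play {yes,no}\n"
    "@data\n"
    "71,yes\n"
    "65,no\n"
)
print("\n".join(arff_data_rows(example)))
```

The output is the comma-separated data section only, which matches the kind of
headerless text file the reply says the implementation expects.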

On Tue, Jul 12, 2011 at 2:19 PM, Xiaobo Gu <gu...@gmail.com> wrote:

> Hi,
>
> The Random Forest partial implementation described at
> https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
> uses the ARFF file format. Is ARFF the only supported file format when
> using the BuildForest and TestForest programs? And are BuildForest and
> TestForest the official tools for building Random Forest models
> from the command line?
>
> Regards,
>
> Xiaobo Gu
>