Posted to dev@hama.apache.org by Praveen Sripati <pr...@gmail.com> on 2012/03/25 04:01:03 UTC

InputFormats for Hama

Hi,

Hama currently has only a limited set of input formats:

CombineFileInputFormat, FileInputFormat, NullInputFormat,
SequenceFileInputFormat, TextInputFormat

Does it make sense to have more input formats? I was thinking of
InputFormats for graph databases.

Any feedback on which input formats to add is welcome.

I had a quick look at Giraph and Hadoop, and they have more InputFormats,
which makes it easy to plug them into external systems.

Praveen

Re: InputFormats for Hama

Posted by Thomas Jungblut <th...@googlemail.com>.

Yeah, that's great. Someone from Neo4J was here a long time ago, but I never
heard from him again.

We should also support some common formats like XML or CSV. We can port them
directly from Hadoop to our package.
Building on XML, we can implement a format that parses DMOZ or CommonCrawl on
Amazon S3.

Would someone please open JIRAs for that?

On 25 March 2012 09:41, Edward J. Yoon <ed...@apache.org> wrote:

> +1
>
> Sent from my iPhone
>
> On 25 March 2012 at 11:01 AM, Praveen Sripati <pr...@gmail.com> wrote:
>
> > Hi,
> >
> > For Hama there are limited input formats
> >
> > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > SequenceFileInputFormat, TextInputFormat
> >
> > Does it make sense to have to have more input formats? I was thinking
> > InputFormats for Graph Databases.
> >
> > Any feedback for the different input formats is welcome.
> >
> > I quickly glanced Giraph and Hadoop and they have more InputFormats which
> > makes it easy to plug them with external systems.
> >
> > Praveen
>



-- 
Thomas Jungblut
Berlin <th...@gmail.com>

Re: InputFormats for Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

+1

Sent from my iPhone

On 25 March 2012 at 11:01 AM, Praveen Sripati <pr...@gmail.com> wrote:

> Hi,
> 
> For Hama there are limited input formats
> 
> CombineFileInputFormat, FileInputFormat, NullInputFormat,
> SequenceFileInputFormat, TextInputFormat
> 
> Does it make sense to have to have more input formats? I was thinking
> InputFormats for Graph Databases.
> 
> Any feedback for the different input formats is welcome.
> 
> I quickly glanced Giraph and Hadoop and they have more InputFormats which
> makes it easy to plug them with external systems.
> 
> Praveen

Re: InputFormats for Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

Great Praveen!

On Wed, Mar 28, 2012 at 10:33 AM, Praveen Sripati
<pr...@gmail.com> wrote:
> Ed,
>
> After I have done porting Hadoop formats to Hama, I can work on it.
>
> I have created a sub-task HAMA-544 for HBase InputFormat.
>
> Praveen
>
> On Wed, Mar 28, 2012 at 4:33 AM, Edward J. Yoon <ed...@apache.org>wrote:
>
>> Nice discussion!
>>
>> BTW, Anyone interested in contributing HBase table input/output formatters?
>>
>> On Mon, Mar 26, 2012 at 2:27 AM, Thomas Jungblut
>> <th...@googlemail.com> wrote:
>> > Thanks for your time.
>> > I have tweeted about the graph db formats, I know some of my followers
>> are
>> > working with them, so they might be interested.
>> >
>> > On 25 March 2012 19:25, Praveen Sripati <praveensripati@gmail.com> wrote:
>> >
>> >> I have created Umbrella JIRA HAMA-536 for creating the
>> >> InputFormats/OutputFormats with three sub-tasks. For now I have assigned
>> >> the tasks to me, let me know if anyone is interested.
>> >>
>> >> Praveen
>> >>
>> >> On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <
>> >> thomas.jungblut@googlemail.com> wrote:
>> >>
>> >> > >
>> >> > > I can open a JIRA. I need input on what all InputFormat makes sense
>> and
>> >> > the
>> >> > > their priority. Some we can port from Hadoop.
>> >> >
>> >> >
>> >> > Yep, you're right. I guess a single JIRA would be enough for the
>> already
>> >> > implemented formats in Hadoop, for the others we need subclasses.
>> >> > Formats that I really wanted to have would be:
>> >> >
>> >> >   - DBInputFormat[1]
>> >> >   - XMLInputFormat
>> >> >   - NLineInputFormat
>> >> >   - CSVInputFormat (we could use OpenCSV for that in conjunction with
>> >> >   TextInputFormat)
>> >> >   - JSONInputFormat (for OpenGraph stuff)
>> >> >   - The graph DB formats Neo4J and how the others are called
>> >> >
>> >> > Anything I missed for a "full" coverage?
>> >> >
>> >> > Could you please elaborate on this?
>> >> >
>> >> >
>> >> > Sure, DMOZ is some kind of crawled website database. It is used in
>> some
>> >> > pagerank examples to test it, don't know if it was in Mahout. We could
>> >> also
>> >> > use it since we have pagerank as well.
>> >> > CommonCrawl is a new up-coming DMOZ-like database of many crawled
>> sites,
>> >> it
>> >> > is hosted on S3 in Amazon Cloud. We run on EC2 via Whirr so this could
>> >> be a
>> >> > cool example as well.
>> >> >
>> >> > [1]
>> >> >
>> >> >
>> >>
>> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
>> >> >
>> >> >
>> >> > On 25 March 2012 14:56, Praveen Sripati <praveensripati@gmail.com> wrote:
>> >> >
>> >> > > Thomas et al,
>> >> > >
>> >> > > > Would someone please open JIRAs for that?
>> >> > >
>> >> > > I can open a JIRA. I need input on what all InputFormat makes sense
>> and
>> >> > the
>> >> > > their priority. Some we can port from Hadoop.
>> >> > >
>> >> > > > Based on XML we can implement a format that parses DMOZ or
>> >> commoncrawl
>> >> > on
>> >> > > Amzon S3.
>> >> > >
>> >> > > Could you please elaborate on this?
>> >> > >
>> >> > > Praveen
>> >> > >
>> >> > >
>> >> > > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <
>> clin4j@googlemail.com
>> >> > > >wrote:
>> >> > >
>> >> > > > As I understand, many iterative applications don't require key
>> value
>> >> > > > input/ output and additionally need random access (read/ write) to
>> >> > > > particular file. I/O interface e.g. mpi may increase flexibility
>> >> here.
>> >> > > >
>> >> > > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
>> >> > > >
>> >> > > > On 25 March 2012 10:01, Praveen Sripati <praveensripati@gmail.com
>> >
>> >> > > wrote:
>> >> > > > > Hi,
>> >> > > > >
>> >> > > > > For Hama there are limited input formats
>> >> > > > >
>> >> > > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
>> >> > > > > SequenceFileInputFormat, TextInputFormat
>> >> > > > >
>> >> > > > > Does it make sense to have to have more input formats? I was
>> >> thinking
>> >> > > > > InputFormats for Graph Databases.
>> >> > > > >
>> >> > > > > Any feedback for the different input formats is welcome.
>> >> > > > >
>> >> > > > > I quickly glanced Giraph and Hadoop and they have more
>> InputFormats
>> >> > > which
>> >> > > > > makes it easy to plug them with external systems.
>> >> > > > >
>> >> > > > > Praveen
>> >> > > >
>> >> > >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Thomas Jungblut
>> >> > Berlin <th...@gmail.com>
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Thomas Jungblut
>> > Berlin <th...@gmail.com>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: InputFormats for Hama

Posted by Praveen Sripati <pr...@gmail.com>.

Ed,

Once I have finished porting the Hadoop formats to Hama, I can work on it.

I have created a sub-task, HAMA-544, for the HBase InputFormat.

Praveen

On Wed, Mar 28, 2012 at 4:33 AM, Edward J. Yoon <ed...@apache.org> wrote:

> Nice discussion!
>
> BTW, Anyone interested in contributing HBase table input/output formatters?
>
> On Mon, Mar 26, 2012 at 2:27 AM, Thomas Jungblut
> <th...@googlemail.com> wrote:
> > Thanks for your time.
> > I have tweeted about the graph db formats, I know some of my followers
> are
> > working with them, so they might be interested.
> >
> > On 25 March 2012 19:25, Praveen Sripati <praveensripati@gmail.com> wrote:
> >
> >> I have created Umbrella JIRA HAMA-536 for creating the
> >> InputFormats/OutputFormats with three sub-tasks. For now I have assigned
> >> the tasks to me, let me know if anyone is interested.
> >>
> >> Praveen
> >>
> >> On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <
> >> thomas.jungblut@googlemail.com> wrote:
> >>
> >> > >
> >> > > I can open a JIRA. I need input on what all InputFormat makes sense
> and
> >> > the
> >> > > their priority. Some we can port from Hadoop.
> >> >
> >> >
> >> > Yep, you're right. I guess a single JIRA would be enough for the
> already
> >> > implemented formats in Hadoop, for the others we need subclasses.
> >> > Formats that I really wanted to have would be:
> >> >
> >> >   - DBInputFormat[1]
> >> >   - XMLInputFormat
> >> >   - NLineInputFormat
> >> >   - CSVInputFormat (we could use OpenCSV for that in conjunction with
> >> >   TextInputFormat)
> >> >   - JSONInputFormat (for OpenGraph stuff)
> >> >   - The graph DB formats Neo4J and how the others are called
> >> >
> >> > Anything I missed for a "full" coverage?
> >> >
> >> > Could you please elaborate on this?
> >> >
> >> >
> >> > Sure, DMOZ is some kind of crawled website database. It is used in
> some
> >> > pagerank examples to test it, don't know if it was in Mahout. We could
> >> also
> >> > use it since we have pagerank as well.
> >> > CommonCrawl is a new up-coming DMOZ-like database of many crawled
> sites,
> >> it
> >> > is hosted on S3 in Amazon Cloud. We run on EC2 via Whirr so this could
> >> be a
> >> > cool example as well.
> >> >
> >> > [1]
> >> >
> >> >
> >>
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
> >> >
> >> >
> >> > On 25 March 2012 14:56, Praveen Sripati <praveensripati@gmail.com> wrote:
> >> >
> >> > > Thomas et al,
> >> > >
> >> > > > Would someone please open JIRAs for that?
> >> > >
> >> > > I can open a JIRA. I need input on what all InputFormat makes sense
> and
> >> > the
> >> > > their priority. Some we can port from Hadoop.
> >> > >
> >> > > > Based on XML we can implement a format that parses DMOZ or
> >> commoncrawl
> >> > on
> >> > > Amzon S3.
> >> > >
> >> > > Could you please elaborate on this?
> >> > >
> >> > > Praveen
> >> > >
> >> > >
> >> > > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <
> clin4j@googlemail.com
> >> > > >wrote:
> >> > >
> >> > > > As I understand, many iterative applications don't require key
> value
> >> > > > input/ output and additionally need random access (read/ write) to
> >> > > > particular file. I/O interface e.g. mpi may increase flexibility
> >> here.
> >> > > >
> >> > > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
> >> > > >
> >> > > > On 25 March 2012 10:01, Praveen Sripati <praveensripati@gmail.com
> >
> >> > > wrote:
> >> > > > > Hi,
> >> > > > >
> >> > > > > For Hama there are limited input formats
> >> > > > >
> >> > > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> >> > > > > SequenceFileInputFormat, TextInputFormat
> >> > > > >
> >> > > > > Does it make sense to have to have more input formats? I was
> >> thinking
> >> > > > > InputFormats for Graph Databases.
> >> > > > >
> >> > > > > Any feedback for the different input formats is welcome.
> >> > > > >
> >> > > > > I quickly glanced Giraph and Hadoop and they have more
> InputFormats
> >> > > which
> >> > > > > makes it easy to plug them with external systems.
> >> > > > >
> >> > > > > Praveen
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Thomas Jungblut
> >> > Berlin <th...@gmail.com>
> >> >
> >>
> >
> >
> >
> > --
> > Thomas Jungblut
> > Berlin <th...@gmail.com>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: InputFormats for Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

Nice discussion!

BTW, is anyone interested in contributing HBase table input/output formatters?

On Mon, Mar 26, 2012 at 2:27 AM, Thomas Jungblut
<th...@googlemail.com> wrote:
> Thanks for your time.
> I have tweeted about the graph db formats, I know some of my followers are
> working with them, so they might be interested.
>
> On 25 March 2012 19:25, Praveen Sripati <pr...@gmail.com> wrote:
>
>> I have created Umbrella JIRA HAMA-536 for creating the
>> InputFormats/OutputFormats with three sub-tasks. For now I have assigned
>> the tasks to me, let me know if anyone is interested.
>>
>> Praveen
>>
>> On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <
>> thomas.jungblut@googlemail.com> wrote:
>>
>> > >
>> > > I can open a JIRA. I need input on what all InputFormat makes sense and
>> > the
>> > > their priority. Some we can port from Hadoop.
>> >
>> >
>> > Yep, you're right. I guess a single JIRA would be enough for the already
>> > implemented formats in Hadoop, for the others we need subclasses.
>> > Formats that I really wanted to have would be:
>> >
>> >   - DBInputFormat[1]
>> >   - XMLInputFormat
>> >   - NLineInputFormat
>> >   - CSVInputFormat (we could use OpenCSV for that in conjunction with
>> >   TextInputFormat)
>> >   - JSONInputFormat (for OpenGraph stuff)
>> >   - The graph DB formats Neo4J and how the others are called
>> >
>> > Anything I missed for a "full" coverage?
>> >
>> > Could you please elaborate on this?
>> >
>> >
>> > Sure, DMOZ is some kind of crawled website database. It is used in some
>> > pagerank examples to test it, don't know if it was in Mahout. We could
>> also
>> > use it since we have pagerank as well.
>> > CommonCrawl is a new up-coming DMOZ-like database of many crawled sites,
>> it
>> > is hosted on S3 in Amazon Cloud. We run on EC2 via Whirr so this could
>> be a
>> > cool example as well.
>> >
>> > [1]
>> >
>> >
>> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
>> >
>> >
>> > On 25 March 2012 14:56, Praveen Sripati <praveensripati@gmail.com> wrote:
>> >
>> > > Thomas et al,
>> > >
>> > > > Would someone please open JIRAs for that?
>> > >
>> > > I can open a JIRA. I need input on what all InputFormat makes sense and
>> > the
>> > > their priority. Some we can port from Hadoop.
>> > >
>> > > > Based on XML we can implement a format that parses DMOZ or
>> commoncrawl
>> > on
>> > > Amzon S3.
>> > >
>> > > Could you please elaborate on this?
>> > >
>> > > Praveen
>> > >
>> > >
>> > > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <clin4j@googlemail.com
>> > > >wrote:
>> > >
>> > > > As I understand, many iterative applications don't require key value
>> > > > input/ output and additionally need random access (read/ write) to
>> > > > particular file. I/O interface e.g. mpi may increase flexibility
>> here.
>> > > >
>> > > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
>> > > >
>> > > > On 25 March 2012 10:01, Praveen Sripati <pr...@gmail.com>
>> > > wrote:
>> > > > > Hi,
>> > > > >
>> > > > > For Hama there are limited input formats
>> > > > >
>> > > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
>> > > > > SequenceFileInputFormat, TextInputFormat
>> > > > >
>> > > > > Does it make sense to have to have more input formats? I was
>> thinking
>> > > > > InputFormats for Graph Databases.
>> > > > >
>> > > > > Any feedback for the different input formats is welcome.
>> > > > >
>> > > > > I quickly glanced Giraph and Hadoop and they have more InputFormats
>> > > which
>> > > > > makes it easy to plug them with external systems.
>> > > > >
>> > > > > Praveen
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Thomas Jungblut
>> > Berlin <th...@gmail.com>
>> >
>>
>
>
>
> --
> Thomas Jungblut
> Berlin <th...@gmail.com>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: InputFormats for Hama

Posted by Thomas Jungblut <th...@googlemail.com>.

Thanks for your time.
I have tweeted about the graph DB formats. I know some of my followers work
with them, so they might be interested.

On 25 March 2012 19:25, Praveen Sripati <pr...@gmail.com> wrote:

> I have created Umbrella JIRA HAMA-536 for creating the
> InputFormats/OutputFormats with three sub-tasks. For now I have assigned
> the tasks to me, let me know if anyone is interested.
>
> Praveen
>
> On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <
> thomas.jungblut@googlemail.com> wrote:
>
> > >
> > > I can open a JIRA. I need input on what all InputFormat makes sense and
> > the
> > > their priority. Some we can port from Hadoop.
> >
> >
> > Yep, you're right. I guess a single JIRA would be enough for the already
> > implemented formats in Hadoop, for the others we need subclasses.
> > Formats that I really wanted to have would be:
> >
> >   - DBInputFormat[1]
> >   - XMLInputFormat
> >   - NLineInputFormat
> >   - CSVInputFormat (we could use OpenCSV for that in conjunction with
> >   TextInputFormat)
> >   - JSONInputFormat (for OpenGraph stuff)
> >   - The graph DB formats Neo4J and how the others are called
> >
> > Anything I missed for a "full" coverage?
> >
> > Could you please elaborate on this?
> >
> >
> > Sure, DMOZ is some kind of crawled website database. It is used in some
> > pagerank examples to test it, don't know if it was in Mahout. We could
> also
> > use it since we have pagerank as well.
> > CommonCrawl is a new up-coming DMOZ-like database of many crawled sites,
> it
> > is hosted on S3 in Amazon Cloud. We run on EC2 via Whirr so this could
> be a
> > cool example as well.
> >
> > [1]
> >
> >
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
> >
> >
> > On 25 March 2012 14:56, Praveen Sripati <praveensripati@gmail.com> wrote:
> >
> > > Thomas et al,
> > >
> > > > Would someone please open JIRAs for that?
> > >
> > > I can open a JIRA. I need input on what all InputFormat makes sense and
> > the
> > > their priority. Some we can port from Hadoop.
> > >
> > > > Based on XML we can implement a format that parses DMOZ or
> commoncrawl
> > on
> > > Amzon S3.
> > >
> > > Could you please elaborate on this?
> > >
> > > Praveen
> > >
> > >
> > > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <clin4j@googlemail.com
> > > >wrote:
> > >
> > > > As I understand, many iterative applications don't require key value
> > > > input/ output and additionally need random access (read/ write) to
> > > > particular file. I/O interface e.g. mpi may increase flexibility
> here.
> > > >
> > > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
> > > >
> > > > On 25 March 2012 10:01, Praveen Sripati <pr...@gmail.com>
> > > wrote:
> > > > > Hi,
> > > > >
> > > > > For Hama there are limited input formats
> > > > >
> > > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > > > > SequenceFileInputFormat, TextInputFormat
> > > > >
> > > > > Does it make sense to have to have more input formats? I was
> thinking
> > > > > InputFormats for Graph Databases.
> > > > >
> > > > > Any feedback for the different input formats is welcome.
> > > > >
> > > > > I quickly glanced Giraph and Hadoop and they have more InputFormats
> > > which
> > > > > makes it easy to plug them with external systems.
> > > > >
> > > > > Praveen
> > > >
> > >
> >
> >
> >
> > --
> > Thomas Jungblut
> > Berlin <th...@gmail.com>
> >
>



-- 
Thomas Jungblut
Berlin <th...@gmail.com>

Re: InputFormats for Hama

Posted by Praveen Sripati <pr...@gmail.com>.

I have created an umbrella JIRA, HAMA-536, for the InputFormats/OutputFormats,
with three sub-tasks. For now I have assigned the tasks to myself; let me know
if anyone is interested in taking one.

Praveen

On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <
thomas.jungblut@googlemail.com> wrote:

> >
> > I can open a JIRA. I need input on what all InputFormat makes sense and
> the
> > their priority. Some we can port from Hadoop.
>
>
> Yep, you're right. I guess a single JIRA would be enough for the already
> implemented formats in Hadoop, for the others we need subclasses.
> Formats that I really wanted to have would be:
>
>   - DBInputFormat[1]
>   - XMLInputFormat
>   - NLineInputFormat
>   - CSVInputFormat (we could use OpenCSV for that in conjunction with
>   TextInputFormat)
>   - JSONInputFormat (for OpenGraph stuff)
>   - The graph DB formats Neo4J and how the others are called
>
> Anything I missed for a "full" coverage?
>
> Could you please elaborate on this?
>
>
> Sure, DMOZ is some kind of crawled website database. It is used in some
> pagerank examples to test it, don't know if it was in Mahout. We could also
> use it since we have pagerank as well.
> CommonCrawl is a new up-coming DMOZ-like database of many crawled sites, it
> is hosted on S3 in Amazon Cloud. We run on EC2 via Whirr so this could be a
> cool example as well.
>
> [1]
>
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
>
>
> On 25 March 2012 14:56, Praveen Sripati <pr...@gmail.com> wrote:
>
> > Thomas et al,
> >
> > > Would someone please open JIRAs for that?
> >
> > I can open a JIRA. I need input on what all InputFormat makes sense and
> the
> > their priority. Some we can port from Hadoop.
> >
> > > Based on XML we can implement a format that parses DMOZ or commoncrawl
> on
> > Amzon S3.
> >
> > Could you please elaborate on this?
> >
> > Praveen
> >
> >
> > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <clin4j@googlemail.com
> > >wrote:
> >
> > > As I understand, many iterative applications don't require key value
> > > input/ output and additionally need random access (read/ write) to
> > > particular file. I/O interface e.g. mpi may increase flexibility here.
> > >
> > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
> > >
> > > On 25 March 2012 10:01, Praveen Sripati <pr...@gmail.com>
> > wrote:
> > > > Hi,
> > > >
> > > > For Hama there are limited input formats
> > > >
> > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > > > SequenceFileInputFormat, TextInputFormat
> > > >
> > > > Does it make sense to have to have more input formats? I was thinking
> > > > InputFormats for Graph Databases.
> > > >
> > > > Any feedback for the different input formats is welcome.
> > > >
> > > > I quickly glanced Giraph and Hadoop and they have more InputFormats
> > which
> > > > makes it easy to plug them with external systems.
> > > >
> > > > Praveen
> > >
> >
>
>
>
> --
> Thomas Jungblut
> Berlin <th...@gmail.com>
>

Re: InputFormats for Hama

Posted by Praveen Sripati <pr...@gmail.com>.

It would be nice to reuse the Hadoop core classes directly instead of copying
the code into Hama. The same applies to InputFormat and other classes. Hama
would then get updates with no extra effort.

A generic input/output format could apply to MR, BSP and other distributed
models as well.
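
As a rough illustration of what a framework-neutral format might look like,
here is a minimal sketch in plain Java. All names in it (RecordSource,
GenericInputFormat, LinesFormat, GenericFormatDemo) are made up for this mail
and are not Hadoop or Hama API; the point is only that the format knows how to
split and read, and nothing about who drives it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical framework-neutral reader over one split of the input. */
interface RecordSource<K, V> {
  boolean next(); // advance to the next record; false when the split is exhausted
  K key();
  V value();
}

/** Hypothetical framework-neutral format: it only knows how to split and read. */
interface GenericInputFormat<K, V> {
  int splitCount(int requested);
  RecordSource<K, V> open(int splitIndex, int splitCount);
}

/** Toy format over in-memory lines: key = line index, value = line text. */
final class LinesFormat implements GenericInputFormat<Integer, String> {
  private final List<String> lines;

  LinesFormat(List<String> lines) {
    this.lines = lines;
  }

  @Override
  public int splitCount(int requested) {
    // Never more splits than lines, and always at least one.
    return Math.max(1, Math.min(requested, lines.size()));
  }

  @Override
  public RecordSource<Integer, String> open(int splitIndex, int splitCount) {
    // Each split covers a contiguous range of line indices.
    final int from = (int) ((long) lines.size() * splitIndex / splitCount);
    final int to = (int) ((long) lines.size() * (splitIndex + 1) / splitCount);
    return new RecordSource<Integer, String>() {
      private int pos = from - 1;

      @Override
      public boolean next() {
        if (pos + 1 >= to) {
          return false;
        }
        pos++;
        return true;
      }

      @Override
      public Integer key() {
        return pos;
      }

      @Override
      public String value() {
        return lines.get(pos);
      }
    };
  }
}

/** Any runtime (an MR task, a BSP peer, ...) could drive the same format like this. */
final class GenericFormatDemo {
  private GenericFormatDemo() {}

  static <K, V> List<V> readSplit(GenericInputFormat<K, V> format, int index, int count) {
    List<V> values = new ArrayList<>();
    RecordSource<K, V> source = format.open(index, count);
    while (source.next()) {
      values.add(source.value());
    }
    return values;
  }

  public static void main(String[] args) {
    GenericInputFormat<Integer, String> format =
        new LinesFormat(Arrays.asList("a", "b", "c"));
    int count = format.splitCount(2);
    for (int i = 0; i < count; i++) {
      System.out.println("split " + i + ": " + readSplit(format, i, count));
    }
  }
}
```

Whether the real Hadoop classes can be reused this cleanly depends on how
tightly they are bound to the MR job classes, which this sketch does not try
to answer.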

Praveen

On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <
thomas.jungblut@googlemail.com> wrote:

> >
> > I can open a JIRA. I need input on what all InputFormat makes sense and
> the
> > their priority. Some we can port from Hadoop.
>
>
> Yep, you're right. I guess a single JIRA would be enough for the already
> implemented formats in Hadoop, for the others we need subclasses.
> Formats that I really wanted to have would be:
>
>   - DBInputFormat[1]
>   - XMLInputFormat
>   - NLineInputFormat
>   - CSVInputFormat (we could use OpenCSV for that in conjunction with
>   TextInputFormat)
>   - JSONInputFormat (for OpenGraph stuff)
>   - The graph DB formats Neo4J and how the others are called
>
> Anything I missed for a "full" coverage?
>
> Could you please elaborate on this?
>
>
> Sure, DMOZ is some kind of crawled website database. It is used in some
> pagerank examples to test it, don't know if it was in Mahout. We could also
> use it since we have pagerank as well.
> CommonCrawl is a new up-coming DMOZ-like database of many crawled sites, it
> is hosted on S3 in Amazon Cloud. We run on EC2 via Whirr so this could be a
> cool example as well.
>
> [1]
>
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
>
>
> On 25 March 2012 14:56, Praveen Sripati <pr...@gmail.com> wrote:
>
> > Thomas et al,
> >
> > > Would someone please open JIRAs for that?
> >
> > I can open a JIRA. I need input on what all InputFormat makes sense and
> the
> > their priority. Some we can port from Hadoop.
> >
> > > Based on XML we can implement a format that parses DMOZ or commoncrawl
> on
> > Amzon S3.
> >
> > Could you please elaborate on this?
> >
> > Praveen
> >
> >
> > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <clin4j@googlemail.com
> > >wrote:
> >
> > > As I understand, many iterative applications don't require key value
> > > input/ output and additionally need random access (read/ write) to
> > > particular file. I/O interface e.g. mpi may increase flexibility here.
> > >
> > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
> > >
> > > On 25 March 2012 10:01, Praveen Sripati <pr...@gmail.com>
> > wrote:
> > > > Hi,
> > > >
> > > > For Hama there are limited input formats
> > > >
> > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > > > SequenceFileInputFormat, TextInputFormat
> > > >
> > > > Does it make sense to have to have more input formats? I was thinking
> > > > InputFormats for Graph Databases.
> > > >
> > > > Any feedback for the different input formats is welcome.
> > > >
> > > > I quickly glanced Giraph and Hadoop and they have more InputFormats
> > which
> > > > makes it easy to plug them with external systems.
> > > >
> > > > Praveen
> > >
> >
>
>
>
> --
> Thomas Jungblut
> Berlin <th...@gmail.com>
>

Re: InputFormats for Hama

Posted by Thomas Jungblut <th...@googlemail.com>.

>
> I can open a JIRA. I need input on what all InputFormat makes sense and the
> their priority. Some we can port from Hadoop.


Yep, you're right. I guess a single JIRA would be enough for the formats
already implemented in Hadoop; for the others we need subclasses.
Formats that I would really like to have:

   - DBInputFormat[1]
   - XMLInputFormat
   - NLineInputFormat
   - CSVInputFormat (we could use OpenCSV for that in conjunction with
   TextInputFormat)
   - JSONInputFormat (for OpenGraph stuff)
   - The graph DB formats: Neo4J and whatever the others are called
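
To make the CSVInputFormat item a bit more concrete: the idea is that
TextInputFormat already delivers one line per record, so a CSV format only
has to turn each line into fields. Below is a minimal sketch in plain Java.
The class and method names are hypothetical, and a small hand-written
splitter stands in for OpenCSV so that the example has no dependencies; it
is not OpenCSV or Hama API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns one text line (a line-oriented record) into CSV fields. */
final class CsvLineSketch {

  private CsvLineSketch() {}

  /**
   * Splits a single line on commas, honouring double-quoted fields and
   * doubled quotes ("") inside them. Stand-in for a real CSV library.
   */
  static List<String> parseLine(String line) {
    List<String> fields = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
            current.append('"'); // doubled quote inside a quoted field
            i++;
          } else {
            inQuotes = false; // closing quote
          }
        } else {
          current.append(c);
        }
      } else if (c == '"') {
        inQuotes = true; // opening quote
      } else if (c == ',') {
        fields.add(current.toString()); // field separator
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    fields.add(current.toString()); // last field
    return fields;
  }

  public static void main(String[] args) {
    System.out.println(parseLine("1,\"Berlin, DE\",0.85"));
  }
}
```

Note that a line-based approach like this cannot handle quoted fields that
contain newlines; that case would need a record reader that is aware of
quoting.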

Anything I missed for a "full" coverage?

> Could you please elaborate on this?


Sure. DMOZ is a kind of database of crawled websites. It is used as test data
in some PageRank examples; I don't know whether that was in Mahout. We could
also use it, since we have PageRank as well.
CommonCrawl is a new, up-and-coming DMOZ-like database of many crawled sites;
it is hosted on S3 in the Amazon cloud. We run on EC2 via Whirr, so this could
be a cool example as well.
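
For an XML-based format over dumps like these, the core of a record reader
is usually just "scan for a start tag, read up to the matching end tag, emit
that as one record". Here is a minimal sketch of that step in plain Java,
working on a String for simplicity. The class name and the <page> tag are
assumptions for illustration only; the actual element names in the DMOZ or
CommonCrawl data would have to be checked.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cuts an XML text into records delimited by a tag pair. */
final class XmlRecordSketch {

  private XmlRecordSketch() {}

  /** Returns every substring that starts with startTag and ends with endTag, in order. */
  static List<String> extractRecords(String xml, String startTag, String endTag) {
    List<String> records = new ArrayList<>();
    int from = 0;
    while (true) {
      int start = xml.indexOf(startTag, from);
      if (start < 0) {
        break; // no further records
      }
      int end = xml.indexOf(endTag, start + startTag.length());
      if (end < 0) {
        break; // unterminated record: drop it
      }
      records.add(xml.substring(start, end + endTag.length()));
      from = end + endTag.length();
    }
    return records;
  }

  public static void main(String[] args) {
    String xml = "<dump><page>a</page><other/><page>b</page></dump>";
    for (String record : extractRecords(xml, "<page>", "</page>")) {
      System.out.println(record);
    }
  }
}
```

This is plain string matching, not XML parsing: it does not handle nested
elements of the same name, and a real format would also have to deal with
records that straddle split boundaries.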

[1]
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html


On 25 March 2012 14:56, Praveen Sripati <pr...@gmail.com> wrote:

> Thomas et al,
>
> > Would someone please open JIRAs for that?
>
> I can open a JIRA. I need input on what all InputFormat makes sense and the
> their priority. Some we can port from Hadoop.
>
> > Based on XML we can implement a format that parses DMOZ or commoncrawl on
> Amzon S3.
>
> Could you please elaborate on this?
>
> Praveen
>
>
> On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <clin4j@googlemail.com
> >wrote:
>
> > As I understand, many iterative applications don't require key value
> > input/ output and additionally need random access (read/ write) to
> > particular file. I/O interface e.g. mpi may increase flexibility here.
> >
> > https://issues.apache.org/jira/browse/MAPREDUCE-2911
> >
> > On 25 March 2012 10:01, Praveen Sripati <pr...@gmail.com>
> wrote:
> > > Hi,
> > >
> > > For Hama there are limited input formats
> > >
> > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > > SequenceFileInputFormat, TextInputFormat
> > >
> > > Does it make sense to have to have more input formats? I was thinking
> > > InputFormats for Graph Databases.
> > >
> > > Any feedback for the different input formats is welcome.
> > >
> > > I quickly glanced Giraph and Hadoop and they have more InputFormats
> which
> > > makes it easy to plug them with external systems.
> > >
> > > Praveen
> >
>



-- 
Thomas Jungblut
Berlin <th...@gmail.com>

Re: InputFormats for Hama

Posted by Praveen Sripati <pr...@gmail.com>.

Thomas et al,

> Would someone please open JIRAs for that?

I can open a JIRA. I need input on which InputFormats make sense and their
priority. Some we can port from Hadoop.

> Based on XML we can implement a format that parses DMOZ or commoncrawl on
> Amazon S3.

Could you please elaborate on this?

Praveen


On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <cl...@googlemail.com> wrote:

> As I understand, many iterative applications don't require key value
> input/ output and additionally need random access (read/ write) to
> particular file. I/O interface e.g. mpi may increase flexibility here.
>
> https://issues.apache.org/jira/browse/MAPREDUCE-2911
>
> On 25 March 2012 10:01, Praveen Sripati <pr...@gmail.com> wrote:
> > Hi,
> >
> > For Hama there are limited input formats
> >
> > CombineFileInputFormat, FileInputFormat, NullInputFormat,
> > SequenceFileInputFormat, TextInputFormat
> >
> > Does it make sense to have to have more input formats? I was thinking
> > InputFormats for Graph Databases.
> >
> > Any feedback for the different input formats is welcome.
> >
> > I quickly glanced Giraph and Hadoop and they have more InputFormats which
> > makes it easy to plug them with external systems.
> >
> > Praveen
>

Re: InputFormats for Hama

Posted by Chia-Hung Lin <cl...@googlemail.com>.

As I understand it, many iterative applications don't require key/value
input/output, and additionally need random access (read/write) to a
particular file. An I/O interface such as the one in MPI may increase
flexibility here.

https://issues.apache.org/jira/browse/MAPREDUCE-2911
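
To illustrate what I mean by random access, as opposed to streaming
key/value records, here is a small sketch using java.io.RandomAccessFile on
a local file. It treats the file as an array of fixed-width doubles that can
be read and overwritten by index. The class and method names are
hypothetical, and this is a local-filesystem illustration only; it says
nothing about what a distributed file system would support.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Hypothetical helper: a local file treated as an array of fixed-width doubles. */
final class RandomAccessSketch {
  private static final int RECORD_BYTES = Double.BYTES;

  private RandomAccessSketch() {}

  /** Overwrites the record at the given index without touching the others. */
  static void write(File file, long index, double value) {
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
      raf.seek(index * RECORD_BYTES);
      raf.writeDouble(value);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads the record at the given index directly, without scanning the file. */
  static double read(File file, long index) {
    try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
      raf.seek(index * RECORD_BYTES);
      return raf.readDouble();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Creates an empty scratch file that is removed when the JVM exits. */
  static File newTempFile() {
    try {
      File file = File.createTempFile("ranks", ".bin");
      file.deleteOnExit();
      return file;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    File file = newTempFile();
    write(file, 0, 0.25);
    write(file, 1, 0.75);
    write(file, 0, 0.5); // update record 0 in place between iterations
    System.out.println(read(file, 0) + " " + read(file, 1));
  }
}
```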

On 25 March 2012 10:01, Praveen Sripati <pr...@gmail.com> wrote:
> Hi,
>
> For Hama there are limited input formats
>
> CombineFileInputFormat, FileInputFormat, NullInputFormat,
> SequenceFileInputFormat, TextInputFormat
>
> Does it make sense to have to have more input formats? I was thinking
> InputFormats for Graph Databases.
>
> Any feedback for the different input formats is welcome.
>
> I quickly glanced Giraph and Hadoop and they have more InputFormats which
> makes it easy to plug them with external systems.
>
> Praveen