Posted to dev@hbase.apache.org by "Edward J. Yoon" <ed...@apache.org> on 2012/03/28 01:03:29 UTC

Re: InputFormats for Hama

Nice discussion!

BTW, is anyone interested in contributing HBase table input/output formatters?

On Mon, Mar 26, 2012 at 2:27 AM, Thomas Jungblut
<th...@googlemail.com> wrote:
> Thanks for your time.
> I have tweeted about the graph DB formats. I know some of my followers are
> working with them, so they might be interested.
>
> On 25 March 2012 at 19:25, Praveen Sripati <pr...@gmail.com> wrote:
>
>> I have created umbrella JIRA HAMA-536 for creating the
>> InputFormats/OutputFormats, with three sub-tasks. For now I have assigned
>> the tasks to myself; let me know if anyone is interested.
>>
>> Praveen
>>
>> On Sun, Mar 25, 2012 at 6:40 PM, Thomas Jungblut <thomas.jungblut@googlemail.com> wrote:
>>
>> > > I can open a JIRA. I need input on which InputFormats make sense and
>> > > their priority. Some we can port from Hadoop.
>> >
>> >
>> > Yep, you're right. I guess a single JIRA would be enough for the already
>> > implemented formats in Hadoop, for the others we need subclasses.
>> > Formats that I really wanted to have would be:
>> >
>> >   - DBInputFormat[1]
>> >   - XMLInputFormat
>> >   - NLineInputFormat
>> >   - CSVInputFormat (we could use OpenCSV for that in conjunction with
>> >   TextInputFormat)
>> >   - JSONInputFormat (for OpenGraph stuff)
>> >   - The graph DB formats: Neo4j and whatever the others are called
>> >
>> > Anything I missed for "full" coverage?
>> >
>> > > Could you please elaborate on this?
>> >
>> >
>> > Sure, DMOZ is some kind of crawled website database. It is used in some
>> > PageRank examples for testing; I don't know if it was in Mahout. We could
>> > also use it since we have PageRank as well.
>> > CommonCrawl is a new, upcoming DMOZ-like database of many crawled sites;
>> > it is hosted on S3 in the Amazon cloud. We run on EC2 via Whirr, so this
>> > could be a cool example as well.
>> >
>> > [1] http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.html
>> >
>> >
>> > On 25 March 2012 at 14:56, Praveen Sripati <praveensripati@gmail.com> wrote:
>> >
>> > > Thomas et al,
>> > >
>> > > > Would someone please open JIRAs for that?
>> > >
>> > > I can open a JIRA. I need input on which InputFormats make sense and
>> > > their priority. Some we can port from Hadoop.
>> > >
>> > > > Based on XML we can implement a format that parses DMOZ or CommonCrawl
>> > > > on Amazon S3.
>> > >
>> > > Could you please elaborate on this?
>> > >
>> > > Praveen
>> > >
>> > >
>> > > On Sun, Mar 25, 2012 at 5:14 PM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
>> > >
>> > > > As I understand it, many iterative applications don't require key/value
>> > > > input/output and additionally need random access (read/write) to a
>> > > > particular file. An I/O interface, e.g. MPI's, may increase flexibility
>> > > > here.
>> > > >
>> > > > https://issues.apache.org/jira/browse/MAPREDUCE-2911
>> > > >
>> > > > On 25 March 2012 10:01, Praveen Sripati <pr...@gmail.com> wrote:
>> > > > > Hi,
>> > > > >
>> > > > > Hama currently has a limited set of input formats:
>> > > > >
>> > > > > CombineFileInputFormat, FileInputFormat, NullInputFormat,
>> > > > > SequenceFileInputFormat, TextInputFormat
>> > > > >
>> > > > > Does it make sense to have more input formats? I was thinking of
>> > > > > InputFormats for graph databases.
>> > > > >
>> > > > > Any feedback for the different input formats is welcome.
>> > > > >
>> > > > > I quickly glanced at Giraph and Hadoop, and they have more InputFormats,
>> > > > > which makes it easy to plug them into external systems.
>> > > > >
>> > > > > Praveen
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Thomas Jungblut
>> > Berlin <th...@gmail.com>
>> >
>>
>
>
>
> --
> Thomas Jungblut
> Berlin <th...@gmail.com>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: InputFormats for Hama

Posted by "Edward J. Yoon" <ed...@apache.org>.

Great, Praveen!

On Wed, Mar 28, 2012 at 10:33 AM, Praveen Sripati
<pr...@gmail.com> wrote:
> Ed,
>
> After I am done porting the Hadoop formats to Hama, I can work on it.
>
> I have created a sub-task HAMA-544 for HBase InputFormat.
>
> Praveen



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: InputFormats for Hama

Posted by Praveen Sripati <pr...@gmail.com>.

Ed,

After I am done porting the Hadoop formats to Hama, I can work on it.

I have created a sub-task HAMA-544 for HBase InputFormat.

Praveen

On Wed, Mar 28, 2012 at 4:33 AM, Edward J. Yoon <ed...@apache.org> wrote:

> Nice discussion!
>
> BTW, is anyone interested in contributing HBase table input/output formatters?