Posted to common-user@hadoop.apache.org by Alan Ho <ka...@yahoo.ca> on 2007/12/10 10:12:28 UTC

Re: MapReduce Job on XML input

I've written an XML input splitter based on a StAX parser. It's much better than StreamXmlRecordReader.
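For anyone curious about the approach: a pull parser streams the file and emits an event as each element closes, so complete record elements can be cut out one at a time without loading the whole document. A minimal Python analogue of the StAX-style technique (not Alan's actual code; the <record> tag name and the helper function are hypothetical):

```python
# Sketch of pull-parser record splitting (a Python analogue of a
# StAX-based reader). Each complete <record> element becomes one input
# record; elements are cleared as we go so memory stays bounded.
import io
import xml.etree.ElementTree as ET

def iter_records(stream, record_tag="record"):
    """Yield the serialized text of each <record> element in the stream."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == record_tag:
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # free memory for records we have already emitted

xml = "<docs><record>a</record><record>b</record></docs>"
records = list(iter_records(io.StringIO(xml)))
```

In a real Hadoop RecordReader the same loop would run over one input split at a time, emitting each record string as a map input value.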

----- Original Message ----
From: Peter Thygesen <th...@infopaq.dk>
To: hadoop-user@lucene.apache.org
Sent: Monday, November 26, 2007 8:49:52 AM
Subject: MapReduce Job on XML input

I would like to run some MapReduce jobs on some XML files I have (approx.
100,000 compressed files).
The XML files are not that big, about 1 MB compressed, each containing
about 1000 records.

Do I have to write my own InputSplitter? Should I use
MultiFileInputFormat or StreamInputFormat? Can I use the
StreamXmlRecordReader, and how? By sub-classing some input class?
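For reference, StreamXmlRecordReader splits its input by scanning for a configured begin tag and end tag and treating everything between them as one record. A minimal Python sketch of that scanning idea (not the Hadoop implementation; the tag names are hypothetical):

```python
# Rough sketch of the begin/end-tag scanning idea behind
# StreamXmlRecordReader: everything between a configured begin marker
# and end marker is treated as one record.
def split_records(text, begin="<record>", end="</record>"):
    records, pos = [], 0
    while True:
        start = text.find(begin, pos)
        if start == -1:
            break
        stop = text.find(end, start)
        if stop == -1:
            break  # an incomplete trailing record is dropped
        stop += len(end)
        records.append(text[start:stop])
        pos = stop
    return records

recs = split_records("<docs><record>a</record><record>b</record></docs>")
```

The real reader does this over byte ranges within an input split, but the record-boundary logic is the same idea.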

The tutorials and examples I've read are all very straightforward,
reading simple text files, but I miss a more complex example, especially
one that reads XML files ;)

thx. 
Peter








Re: MapReduce Job on XML input

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
Alan,

On Mon, Dec 10, 2007 at 01:12:28AM -0800, Alan Ho wrote:
>I've written an XML input splitter based on a StAX parser. It's much better than StreamXmlRecordReader.
>

We'd definitely like to see something like this in Hadoop; would you mind contributing it?

Details: http://wiki.apache.org/lucene-hadoop/HowToContribute

thanks,
Arun

>----- Original Message ----
>From: Peter Thygesen <th...@infopaq.dk>
>To: hadoop-user@lucene.apache.org
>Sent: Monday, November 26, 2007 8:49:52 AM
>Subject: MapReduce Job on XML input
>
>I would like to run some MapReduce jobs on some XML files I have (approx.
>100,000 compressed files).
>The XML files are not that big, about 1 MB compressed, each containing
>about 1000 records.
>
>Do I have to write my own InputSplitter? Should I use
>MultiFileInputFormat or StreamInputFormat? Can I use the
>StreamXmlRecordReader, and how? By sub-classing some input class?
>
>The tutorials and examples I've read are all very straightforward,
>reading simple text files, but I miss a more complex example, especially
>one that reads XML files ;)
>
>thx. 
>Peter
>

RE: About relational algebra operators

Posted by edward yoon <we...@udanax.org>.
OK, let me clear up the inaccuracy issues.

------------------------------

B. Regards,

Edward yoon @ NHN, corp.
Home : http://www.udanax.org


> From: webmaster@udanax.org
> To: hadoop-dev@lucene.apache.org
> Subject: RE: About relational algebra operators
> Date: Tue, 11 Dec 2007 03:25:19 +0000
>
>
> I asked the Pig community about merging the HBase shell with their work, but they suggested another idea: keep developing each project separately for now and discuss it later.
>
> I think Pig would probably lose its original identity once its work is built on top of HBase.
> Also, Pig will need administrative tools for HBase (table create/alter/drop, etc.).
>
> So I suggested, on PIG-6, composing an abstract 2-D table containing only certain data filtered from the HBase array structure using arbitrary HQL.
> It would be a useful option for Pig's extended storage.
>
> Anyway, I got your reply.
>
> Edward.
> ------------------------------
> B. Regards,
>
> Edward yoon @ NHN, corp.
> Home : http://www.udanax.org
>
>
>> From: michael@powerset.com
>> To: hadoop-dev@lucene.apache.org; webmaster@udanax.org
>> Date: Mon, 10 Dec 2007 18:42:40 -0800
>> Subject: Re: About relational algebra operators
>>
>> If you have a large table, then using hbase shell is probably not a good idea because it will take too long to run the job. Instead, you need to start a mapreduce job to do the processing for you.
>>
>> Since pig already exists as a shell that can create mapreduce jobs (like sawzall), I think it is probably best to work on pig until/unless you find something in their approach you fundamentally disagree with.
>>
>> -Michael
>>
>> On 12/10/07 6:22 PM, "edward yoon" wrote:
>>
>>
>>
>> Suppose a data set of HBase relations undergoes many changes over time.
>>
>> We need a better way to derive new relations as things change, so I think formal relational algebra operators would make a good administrative tool in the HBase shell.
>>
>> They would also be helpful for ad hoc experimentation. Consider these tables:
>>
>> 1. huge-webTable ( URL, title, content, image, language, metatag, color, ..., etc. )
>> 2. huge-clickLogTable ( URL, userIP, search_Keyword, ..., etc. )
>>
>> Suppose we want to categorize web documents by language and userIP.
>>
>> Then we can build a temporary table using relational algebra operators:
>> π_{language, userIP}(webTable ⋈_{webTable.row = clickLogTable.row} clickLogTable)
>>
>> Users could then easily develop numeric analysis applications that categorize document sets using the new relation.
>> Formal relational algebra operators could be one of the most useful features in HBase.
>>
>> What do you think?
>>
>> ------------------------------
>>
>> B. Regards,
>>
>> Edward yoon @ NHN, corp.
>> Home : http://www.udanax.org
>


RE: About relational algebra operators

Posted by edward yoon <we...@udanax.org>.
I asked the Pig community about merging the HBase shell with their work, but they suggested another idea: keep developing each project separately for now and discuss it later.

I think Pig would probably lose its original identity once its work is built on top of HBase.
Also, Pig will need administrative tools for HBase (table create/alter/drop, etc.).

So I suggested, on PIG-6, composing an abstract 2-D table containing only certain data filtered from the HBase array structure using arbitrary HQL.
It would be a useful option for Pig's extended storage.

Anyway, I got your reply.

Edward.
------------------------------
B. Regards,

Edward yoon @ NHN, corp.
Home : http://www.udanax.org


> From: michael@powerset.com
> To: hadoop-dev@lucene.apache.org; webmaster@udanax.org
> Date: Mon, 10 Dec 2007 18:42:40 -0800
> Subject: Re: About relational algebra operators
>
> If you have a large table, then using hbase shell is probably not a good idea because it will take too long to run the job. Instead, you need to start a mapreduce job to do the processing for you.
>
> Since pig already exists as a shell that can create mapreduce jobs (like sawzall), I think it is probably best to work on pig until/unless you find something in their approach you fundamentally disagree with.
>
> -Michael
>
> On 12/10/07 6:22 PM, "edward yoon"  wrote:
>
>
>
> Suppose a data set of HBase relations undergoes many changes over time.
>
> We need a better way to derive new relations as things change, so I think formal relational algebra operators would make a good administrative tool in the HBase shell.
>
> They would also be helpful for ad hoc experimentation. Consider these tables:
>
> 1. huge-webTable ( URL, title, content, image, language, metatag, color, ..., etc. )
> 2. huge-clickLogTable ( URL, userIP, search_Keyword, ..., etc. )
>
> Suppose we want to categorize web documents by language and userIP.
>
> Then we can build a temporary table using relational algebra operators:
> π_{language, userIP}(webTable ⋈_{webTable.row = clickLogTable.row} clickLogTable)
>
> Users could then easily develop numeric analysis applications that categorize document sets using the new relation.
> Formal relational algebra operators could be one of the most useful features in HBase.
>
> What do you think?
>
> ------------------------------
>
> B. Regards,
>
> Edward yoon @ NHN, corp.
> Home : http://www.udanax.org
>
>


Re: About relational algebra operators

Posted by Michael Bieniosek <mi...@powerset.com>.
If you have a large table, then using hbase shell is probably not a good idea because it will take too long to run the job.  Instead, you need to start a mapreduce job to do the processing for you.

Since pig already exists as a shell that can create mapreduce jobs (like sawzall), I think it is probably best to work on pig until/unless you find something in their approach you fundamentally disagree with.

-Michael

On 12/10/07 6:22 PM, "edward yoon" <we...@udanax.org> wrote:



Suppose a data set of HBase relations undergoes many changes over time.

We need a better way to derive new relations as things change, so I think formal relational algebra operators would make a good administrative tool in the HBase shell.

They would also be helpful for ad hoc experimentation. Consider these tables:

1. huge-webTable ( URL, title, content, image, language, metatag, color, ..., etc. )
2. huge-clickLogTable ( URL, userIP, search_Keyword, ..., etc. )

Suppose we want to categorize web documents by language and userIP.

Then we can build a temporary table using relational algebra operators:
π_{language, userIP}(webTable ⋈_{webTable.row = clickLogTable.row} clickLogTable)

Users could then easily develop numeric analysis applications that categorize document sets using the new relation.
Formal relational algebra operators could be one of the most useful features in HBase.

What do you think?

------------------------------

B. Regards,

Edward yoon @ NHN, corp.
Home : http://www.udanax.org



About relational algebra operators

Posted by edward yoon <we...@udanax.org>.
Suppose a data set of HBase relations undergoes many changes over time.

We need a better way to derive new relations as things change, so I think formal relational algebra operators would make a good administrative tool in the HBase shell.

They would also be helpful for ad hoc experimentation. Consider these tables:

1. huge-webTable ( URL, title, content, image, language, metatag, color, ..., etc. )
2. huge-clickLogTable ( URL, userIP, search_Keyword, ..., etc. )

Suppose we want to categorize web documents by language and userIP.

Then we can build a temporary table using relational algebra operators:
π_{language, userIP}(webTable ⋈_{webTable.row = clickLogTable.row} clickLogTable)

Users could then easily develop numeric analysis applications that categorize document sets using the new relation.
Formal relational algebra operators could be one of the most useful features in HBase.

What do you think?
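As a concrete illustration, the projection-over-join expression above can be sketched over toy in-memory tables in Python (the data values and helper names are hypothetical; HBase itself is not involved):

```python
# Toy sketch of pi_{language, userIP}(webTable JOIN clickLogTable),
# joining on the row key, with lists of dicts standing in for HBase tables.
webTable = [
    {"row": "u1", "title": "Welcome", "language": "en"},
    {"row": "u2", "title": "Bienvenue", "language": "fr"},
]
clickLogTable = [
    {"row": "u1", "userIP": "10.0.0.1", "search_Keyword": "hadoop"},
    {"row": "u2", "userIP": "10.0.0.2", "search_Keyword": "hbase"},
]

def join(left, right, key):
    """Equi-join two relations on a shared key column."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def project(relation, cols):
    """Relational projection: keep only the named columns."""
    return [{c: t[c] for c in cols} for t in relation]

result = project(join(webTable, clickLogTable, "row"), ["language", "userIP"])
```

Here result is exactly the (language, userIP) relation the expression describes.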

------------------------------

B. Regards,

Edward yoon @ NHN, corp.
Home : http://www.udanax.org

Re: MapReduce Job on XML input

Posted by Ted Dunning <td...@veoh.com>.
Can you post a Jira and a patch?


On 12/10/07 1:12 AM, "Alan Ho" <ka...@yahoo.ca> wrote:

> I've written an XML input splitter based on a StAX parser. It's much better than
> StreamXmlRecordReader.
> 
> ----- Original Message ----
> From: Peter Thygesen <th...@infopaq.dk>
> To: hadoop-user@lucene.apache.org
> Sent: Monday, November 26, 2007 8:49:52 AM
> Subject: MapReduce Job on XML input
> 
> I would like to run some MapReduce jobs on some XML files I have (approx.
> 100,000 compressed files).
> The XML files are not that big, about 1 MB compressed, each containing
> about 1000 records.
> 
> Do I have to write my own InputSplitter? Should I use
> MultiFileInputFormat or StreamInputFormat? Can I use the
> StreamXmlRecordReader, and how? By sub-classing some input class?
> 
> The tutorials and examples I've read are all very straightforward,
> reading simple text files, but I miss a more complex example, especially
> one that reads XML files ;)
> 
> thx. 
> Peter
> 