Posted to common-user@hadoop.apache.org by Marco Didonna <m....@gmail.com> on 2011/01/28 11:49:56 UTC

Distributed indexing with Hadoop

Hello everyone,
I am building a Hadoop "app" to quickly index a corpus of documents.
This app will accept one or more XML files containing the corpus.
Each document is made up of several sections: title, authors,
body... These sections are not fixed and depend on the collection. Here
is a glimpse of what the XML input file looks like:

<document id='1'>
<field name='title'> the divine comedy </field>
<field name='author'>Dante</field>
<field name='body'>halfway along our life's path.......</field>
</document>
<document id='2'>

...

</document>

I would like to discuss some implementation choices:

- what is the best way to "tell" my Hadoop app which sections to expect
between the <document> and </document> tags? (See the first sketch
after this list.)

- is it more appropriate to implement a record reader that passes the
whole content of the document tag to the mapper, or one that passes it
section by section? I was also wondering which parser to use, a
DOM-like one or a SAX-like one... any efficient library to recommend?
(See the second sketch after this list.)

- do you know of any library I could use to process text? By text
processing I mean common preprocessing operations like tokenization and
stopword elimination... I was thinking of using Lucene's engine; could
it be a bottleneck? (See the third sketch after this list.)
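
To make the first point concrete, here is a minimal sketch of one
possible approach, assuming the field names are simply passed through
the job Configuration; the property name "indexer.fields" is made up
for illustration:

import org.apache.hadoop.conf.Configuration;

public class FieldConfigSketch {
    public static void main(String[] args) {
        // Driver side: declare which <field name='...'> sections the
        // mappers should expect. "indexer.fields" is a hypothetical
        // property name chosen for this sketch.
        Configuration conf = new Configuration();
        conf.setStrings("indexer.fields", "title", "author", "body");

        // Mapper side (typically in Mapper.setup(), via
        // context.getConfiguration()): read the same list back.
        String[] expected = conf.getStrings("indexer.fields");
        for (String field : expected) {
            System.out.println("expecting section: " + field);
        }
    }
}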
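On the second point, since each <document> element is small, one option
is a record reader that hands the whole <document>...</document> block
to the mapper as a single Text value (I believe Mahout ships an
XmlInputFormat along these lines, driven by "xmlinput.start" and
"xmlinput.end" properties), and then to parse each record with a
SAX-style handler inside the mapper, avoiding building a DOM per
document. A self-contained sketch of that parsing step, using only the
JDK's SAX support:

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DocumentSaxSketch {

    // Collects field name -> field text for one <document> record.
    static class FieldHandler extends DefaultHandler {
        final Map<String, String> fields = new HashMap<String, String>();
        private String currentField;
        private StringBuilder text;

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes attrs) {
            if ("field".equals(qName)) {
                currentField = attrs.getValue("name");
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("field".equals(qName)) {
                fields.put(currentField, text.toString().trim());
                text = null;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // One record, exactly as a record reader might hand it to map().
        String record = "<document id='1'>"
            + "<field name='title'> the divine comedy </field>"
            + "<field name='author'>Dante</field>"
            + "<field name='body'>halfway along our life's path</field>"
            + "</document>";
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        FieldHandler handler = new FieldHandler();
        parser.parse(new InputSource(new StringReader(record)), handler);
        System.out.println(handler.fields);
    }
}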
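And for the third point, a minimal sketch of pushing text through a
Lucene analyzer, assuming Lucene 3.x: StandardAnalyzer tokenizes,
lower-cases and removes English stopwords in a single pass. Since each
mapper would run its own analyzer instance, the analysis itself scales
out with the cluster:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        // Analyze one field's text, as a mapper might do per section.
        TokenStream stream = analyzer.tokenStream(
            "body", new StringReader("halfway along our life's path"));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(term.term());
        }
        stream.close();
        analyzer.close();
    }
}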

I am looking forward to reading your opinions.

Thanks,

Marco


Re: Distributed indexing with Hadoop

Posted by James Seigel <ja...@tynt.com>.
Has anyone tried to run the Reuters example with both approaches? I seem to have problems getting them to run.

Cheers
James.


On 2011-01-29, at 9:25 AM, Ted Yu wrote:

> $MAHOUT_HOME/examples/bin/build-reuters.sh
> FYI
> 
> On Sat, Jan 29, 2011 at 12:57 AM, Marco Didonna <m....@gmail.com> wrote:
> 
>> On 01/29/2011 05:17 AM, Lance Norskog wrote:
>>> Look at the Reuters example in the Mahout project:
>>> http://mahout.apache.org
>> 
>> Ehm, could you point me to it? I cannot find it.
>> 
>> Thanks
>> 
>> 
>> 


Re: Distributed indexing with Hadoop

Posted by Ted Yu <yu...@gmail.com>.
$MAHOUT_HOME/examples/bin/build-reuters.sh
FYI

On Sat, Jan 29, 2011 at 12:57 AM, Marco Didonna <m....@gmail.com> wrote:

> On 01/29/2011 05:17 AM, Lance Norskog wrote:
> > Look at the Reuters example in the Mahout project:
> > http://mahout.apache.org
>
> Ehm, could you point me to it? I cannot find it.
>
> Thanks
>
>
>

Re: Distributed indexing with Hadoop

Posted by Marco Didonna <m....@gmail.com>.
On 01/29/2011 05:17 AM, Lance Norskog wrote:
> Look at the Reuters example in the Mahout project: http://mahout.apache.org

Ehm, could you point me to it? I cannot find it.

Thanks



Re: Distributed indexing with Hadoop

Posted by Lance Norskog <go...@gmail.com>.
Look at the Reuters example in the Mahout project: http://mahout.apache.org

On Fri, Jan 28, 2011 at 2:49 AM, Marco Didonna <m....@gmail.com> wrote:
> Hello everyone,
> I am building a Hadoop "app" to quickly index a corpus of documents.
> This app will accept one or more XML files containing the corpus.
> Each document is made up of several sections: title, authors,
> body... These sections are not fixed and depend on the collection. Here
> is a glimpse of what the XML input file looks like:
>
> <document id='1'>
> <field name='title'> the divine comedy </field>
> <field name='author'>Dante</field>
> <field name='body'>halfway along our life's path.......</field>
> </document>
> <document id='2'>
>
> ...
>
> </document>
>
> I would like to discuss some implementation choices:
>
> - what is the best way to "tell" my Hadoop app which sections to expect
> between the <document> and </document> tags?
>
> - is it more appropriate to implement a record reader that passes the
> whole content of the document tag to the mapper, or one that passes it
> section by section? I was also wondering which parser to use, a
> DOM-like one or a SAX-like one... any efficient library to recommend?
>
> - do you know of any library I could use to process text? By text
> processing I mean common preprocessing operations like tokenization and
> stopword elimination... I was thinking of using Lucene's engine; could
> it be a bottleneck?
>
> I am looking forward to reading your opinions.
>
> Thanks,
>
> Marco
>
>



-- 
Lance Norskog
goksron@gmail.com