Posted to common-user@hadoop.apache.org by Marco Didonna <m....@gmail.com> on 2011/01/28 11:49:56 UTC
Distributed indexing with Hadoop
Hello everyone,
I am building a Hadoop "app" to quickly index a corpus of documents.
The app will accept one or more XML files containing the corpus.
Each document is made up of several sections: title, authors,
body... these sections are not fixed and depend on the collection. Here's
a sample of what the XML input file looks like:
<document id='1'>
<field name='title'> the divine comedy </field>
<field name='author'>Dante</field>
<field name='body'>halfway along our life's path.......</field>
</document>
<document id='2'>
...
</document>
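A chunk like the one above can be handed to the mapper whole by the record reader and parsed there; a SAX-style parser avoids building a full DOM per record. Below is a minimal sketch using only the JDK's built-in SAX support (the class name and the one-Map-per-document idea are illustrative assumptions, not part of any existing library):

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX handler: collects field name -> text for one <document> chunk. */
public class DocumentHandler extends DefaultHandler {
    private final Map<String, String> fields = new LinkedHashMap<String, String>();
    private String currentField;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("field".equals(qName)) {
            // Remember which section we are in and reset the text buffer.
            currentField = atts.getValue("name");
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver the text of one element in several calls, so append.
        if (currentField != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("field".equals(qName)) {
            fields.put(currentField, text.toString().trim());
            currentField = null;
        }
    }

    /** Parses one <document> chunk, as a record reader would emit it. */
    public static Map<String, String> parse(String xml) {
        try {
            DocumentHandler h = new DocumentHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), h);
            return h.fields;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

A mapper could then call DocumentHandler.parse(value.toString()) and emit one key/value pair per field, so the job never needs to know the section names in advance.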
I would like to discuss some implementation choices:
- what is the best way to "tell" my Hadoop app which sections to expect
between the <document> and </document> tags?
- is it more appropriate to implement a record reader that passes the
mapper the whole content of the document tag, or one section at a time?
I was also wondering which parser to use, a DOM-like one or a SAX-like
one... can anyone recommend an efficient library?
- do you know any library I could use to process text? By text
processing I mean common preprocessing operations like tokenization and
stopword elimination... I was thinking of using Lucene's engine... could
it become a bottleneck?
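On the last point: tokenization and stopword removal are cheap compared with the I/O a Hadoop job already pays for, so an analyzer is unlikely to be the bottleneck. For illustration, here is what such a stage boils down to in plain JDK code (the class name and the stopword list are a made-up fragment for this sketch, not Lucene's actual analyzer or list):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy analyzer: lowercase, split on non-letters, drop stopwords. */
public class SimpleAnalyzer {
    // Illustrative fragment of an English stopword list (hypothetical).
    private static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("the", "a", "an", "of", "our", "along"));

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        // Split on anything that is not a lowercase letter or apostrophe.
        for (String t : text.toLowerCase().split("[^a-z']+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

Lucene's real analyzers do the same work with proper Unicode handling and a curated stopword list, so reusing them inside a mapper is a reasonable choice.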
I am looking forward to reading your opinions.
Thanks,
Marco
Re: Distributed indexing with Hadoop
Posted by James Seigel <ja...@tynt.com>.
Has anyone tried to run the Reuters example with both approaches? I seem to have problems getting them to run.
Cheers
James.
On 2011-01-29, at 9:25 AM, Ted Yu wrote:
> $MAHOUT_HOME/examples/bin/build-reuters.sh
> FYI
Re: Distributed indexing with Hadoop
Posted by Ted Yu <yu...@gmail.com>.
$MAHOUT_HOME/examples/bin/build-reuters.sh
FYI
Re: Distributed indexing with Hadoop
Posted by Marco Didonna <m....@gmail.com>.
On 01/29/2011 05:17 AM, Lance Norskog wrote:
> Look at the Reuters example in the Mahout project: http://mahout.apache.org
Ehm, could you point me to it? I cannot find it.
Thanks
Re: Distributed indexing with Hadoop
Posted by Lance Norskog <go...@gmail.com>.
Look at the Reuters example in the Mahout project: http://mahout.apache.org
--
Lance Norskog
goksron@gmail.com