Posted to java-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2007/12/12 06:35:05 UTC

Indexing Wikipedia dumps

Hi,

I need to index a Wikipedia dump.  I know there is code in contrib/benchmark for indexing *English* Wikipedia for benchmarking purposes.  However, I'd like to index a non-English dump, and I don't actually need it for benchmarking; I just want to end up with a Lucene index.

Any suggestions on where I should start?  That is, can anything in contrib/benchmark already do this, or is there anything there that I should use as a starting point, as opposed to writing my own Wikipedia XML dump parser+indexer?

Thanks,
Otis





Re: Indexing Wikipedia dumps

Posted by Matt Kangas <ka...@gmail.com>.
Otis, if you're willing to use some non-Java code for your task...

1) Wikipedia uses Lucene for its full-text search, and the module
is part of MediaWiki. You could use this as follows:
- Install MediaWiki
- Load your Wikipedia dump into MediaWiki (and MySQL)
- Build a search index for the Lucene Search extension:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/README.txt?revision=8535&view=markup

2) Alternately, use MediaWiki's native import parser (in PHP) and use
that to feed Solr, etc. The code is a bit hairy, though.
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/SpecialImport.php?revision=27686&view=markup

--Matt


--
Matt Kangas / kangas@gmail.com





Re: Indexing Wikipedia dumps

Posted by Chris Lu <ch...@gmail.com>.
For a quick Java approach, give yourself 3 minutes and try using
DBSight to access the database. You can simply use "select * from
mw_searchindex" as a starting point, and it will build the index for you.
However, you may need to plug in a custom analyzer for MediaWiki's
format (or maybe not).
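
If you'd rather do the same thing in plain Java without DBSight, a rough
sketch might look like this (Lucene 2.x API; the JDBC URL, credentials, and
the mw_ table prefix are assumptions about a stock MediaWiki/MySQL install):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class SearchindexIndexer {
      public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost/wikidb", "wikiuser", "secret");
        // true = create a fresh index; swap in a language-specific analyzer
        IndexWriter writer = new IndexWriter("/tmp/wiki-index",
            new StandardAnalyzer(), true);

        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
            "select si_page, si_title, si_text from mw_searchindex");
        while (rs.next()) {
          Document doc = new Document();
          doc.add(new Field("id", rs.getString("si_page"),
              Field.Store.YES, Field.Index.UN_TOKENIZED));
          doc.add(new Field("title", rs.getString("si_title"),
              Field.Store.YES, Field.Index.TOKENIZED));
          doc.add(new Field("text", rs.getString("si_text"),
              Field.Store.NO, Field.Index.TOKENIZED));
          writer.addDocument(doc);
        }
        rs.close();
        stmt.close();
        conn.close();
        writer.optimize();
        writer.close();
      }
    }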

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
A DBSight customer (anonymous per request) got 2.6 million Euro funding!





RE: Indexing Wikipedia dumps

Posted by Steven Parkes <st...@esseff.org>.
You probably want a combination of extractWikipedia.alg and wikipedia.alg.

You want the EnwikiDocMaker from extractWikipedia.alg, which reads the
uncompressed XML file, but rather than using WriteLineDoc, you want to go
ahead and index as wikipedia.alg does. (Ditch the query part.)

You'll need an acceptable analyzer, which StandardAnalyzer might not be.
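
An untested sketch of what that combined .alg file might look like (property
names follow the 2.3-era contrib/benchmark conventions; the docs.file path is
a placeholder, and the analyzer line is where a language-appropriate analyzer
would go):

    # index-writing properties
    analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
    directory=FSDirectory
    work.dir=work

    # read pages straight from the uncompressed XML dump
    doc.maker=org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker
    docs.file=temp/dewiki-20071210-pages-articles.xml
    doc.maker.forever=false

    # task sequence: just build the index, no queries
    ResetSystemErase
    CreateIndex
    { AddDoc }: *
    Optimize
    CloseIndex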





Re: Indexing Wikipedia dumps

Posted by Grant Ingersoll <gs...@apache.org>.
I put up a patch and would appreciate testing/feedback.  It's not  
perfect, but it handles most things, I think.

-Grant

On Dec 28, 2007, at 12:19 PM, Grant Ingersoll wrote:

> See https://issues.apache.org/jira/browse/LUCENE-1103

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: Indexing Wikipedia dumps

Posted by Grant Ingersoll <gs...@apache.org>.
See https://issues.apache.org/jira/browse/LUCENE-1103



--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






Re: Indexing Wikipedia dumps

Posted by Marcelo Ochoa <ma...@gmail.com>.
Hi All:
  Just to add a simple hack: I posted on my blog an entry named
"Uploading WikiPedia Dumps to Oracle databases":
http://marceloochoa.blogspot.com/2007_12_01_archive.html
  with instructions for uploading Wikipedia dumps to Oracle XMLDB, which
means transforming the XML file into object-relational storage.
  Finally, I added instructions for indexing it with Lucene Domain Index.
  Best regards, Marcelo.




-- 
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/



Re: Indexing Wikipedia dumps

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Good pointers, thanks. I asked because I did have a problem like this a few 
months ago -- none of the existing parsers solved it for me (back then).

D.




Re: Indexing Wikipedia dumps

Posted by Petite Abeille <pe...@mac.com>.
On Dec 13, 2007, at 8:39 AM, Dawid Weiss wrote:

> Just incidentally -- do you know of something that would parse the
> Wikipedia markup (to plain text, for example)?

If you find out, let us know :)

You may want to check the partial ANTLR grammar for Wikitext:

http://www.mediawiki.org/wiki/User:Stevage/ANTLR
http://lists.wikimedia.org/pipermail/wikitext-l/2007-December/000117.html

This also might be of interest:

http://www.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html

"the nice people over at woc.fslab.de have created a standalone wiki- 
markup parser which is ready for use"
http://fslab.de/svn/wpofflineclient/trunk/mediawiki_sa
There is also Text::MediawikiFormat:
http://search.cpan.org/~dprice/Text-MediawikiFormat-0.05/lib/Text/MediawikiFormat.pm
Perhaps you will be better off processing the Wikipedia static HTML  
dump, instead of the XML one:
http://static.wikipedia.org/
Not a piece of cake one way or another :(
Cheers,
PA.




Re: Indexing Wikipedia dumps

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Note that the current code doesn't actually do anything with the wiki 
> syntax, but I would think as long as the other language is in the same 
> format you should be fine.

Just incidentally -- do you know of something that would parse the Wikipedia
markup (to plain text, for example)?

D.



Re: Indexing Wikipedia dumps

Posted by Grant Ingersoll <gs...@apache.org>.
Note that the current code doesn't actually do anything with the wiki
syntax, but I would think that as long as the other language is in the same
format, you should be fine.

-Grant





Re: Indexing Wikipedia dumps

Posted by Michael McCandless <lu...@mikemccandless.com>.
I haven't actually tried it, but I think it's very likely that the current
code in contrib/benchmark can handle a non-English Wikipedia dump as well.

Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think if you
just change docs.file to reference your downloaded XML file, it could just
work.
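
For example (the non-English dump filename below is just a placeholder):

    # in contrib/benchmark/conf/extractWikipedia.alg
    docs.file=temp/frwiki-20071210-pages-articles.xml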

Mike





Re: Indexing Wikipedia dumps

Posted by Andy Goodell <ag...@discoverymining.com>.
My firm uses a parser based on javax.xml.stream.XMLStreamReader to
break (English and non-English) Wikipedia XML dumps into Lucene-style
"documents and fields."  We use Wikipedia to test our
language-specific code, so we've probably indexed 20 Wikipedia dumps.
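
Their code isn't shown, but a minimal sketch of that kind of StAX loop might
look like this (element names come from the Wikipedia export schema; Lucene
2.x API; the field names are my own):

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class WikiDumpIndexer {
      public static void main(String[] args) throws Exception {
        // args[0] = uncompressed pages-articles.xml, args[1] = index dir
        XMLStreamReader xml = XMLInputFactory.newInstance()
            .createXMLStreamReader(new FileInputStream(args[0]));
        IndexWriter writer =
            new IndexWriter(args[1], new StandardAnalyzer(), true);
        StringBuffer title = null, text = null, target = null;
        while (xml.hasNext()) {
          switch (xml.next()) {
            case XMLStreamConstants.START_ELEMENT:
              String name = xml.getLocalName();
              if ("title".equals(name))     target = title = new StringBuffer();
              else if ("text".equals(name)) target = text  = new StringBuffer();
              else                          target = null;
              break;
            case XMLStreamConstants.CHARACTERS:
              if (target != null) target.append(xml.getText());
              break;
            case XMLStreamConstants.END_ELEMENT:
              target = null;
              // one Lucene document per completed <page>
              if ("page".equals(xml.getLocalName())
                  && title != null && text != null) {
                Document doc = new Document();
                doc.add(new Field("title", title.toString(),
                    Field.Store.YES, Field.Index.TOKENIZED));
                doc.add(new Field("body", text.toString(),
                    Field.Store.NO, Field.Index.TOKENIZED));
                writer.addDocument(doc);
                title = null;
                text = null;
              }
              break;
          }
        }
        writer.optimize();
        writer.close();
      }
    }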

- andy g




Re: Indexing Wikipedia dumps

Posted by Karl Wettin <ka...@gmail.com>.


Here is one more alternative, the way I did it way back.

Get the tarballs containing the rendered HTML. Using NekoHTML (or similar),
find the DOM node that contains the text content, and there you go: plain
text.
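
A rough sketch of that extraction step (assumptions: NekoHTML's DOMParser,
which upper-cases element names by default, and MediaWiki's "bodyContent"
div as the node holding the article text):

    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class WikiHtmlText {
      // Returns the plain text of the article body, or null if not found.
      public static String extract(String htmlFile) throws Exception {
        DOMParser parser = new DOMParser();   // NekoHTML's lenient parser
        parser.parse(new InputSource(htmlFile));
        // NekoHTML reports element names in upper case by default
        NodeList divs = parser.getDocument().getElementsByTagName("DIV");
        for (int i = 0; i < divs.getLength(); i++) {
          Element div = (Element) divs.item(i);
          if ("bodyContent".equals(div.getAttribute("id"))) {
            return div.getTextContent();      // DOM Level 3 text extraction
          }
        }
        return null;
      }
    }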


-- 
karl


