Posted to solr-user@lucene.apache.org by Ken Krugler <kk...@transpac.com> on 2013/10/04 03:03:07 UTC

WikipediaTokenizer documentation

Hi all,

Where's the documentation on the WikipediaTokenizer?

Specifically I'm wondering how pieces from the source XML get mapped to field names in the Solr schema.

For example, <revision><timestamp> seems to be going into the "date" field for an example schema I've got.

And <revision><text> goes into "body".

But is there any way to get <revision><contributor><username>, for example?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

Re: WikipediaTokenizer documentation

Posted by Jack Krupansky <ja...@basetechnology.com>.

I have some info and examples for the WikipediaTokenizer in my book, but a 
tokenizer does not direct tokens to a field. Rather, you would use the 
tokenizer in the analyzer for whatever field you wish to store values in. 
You could use the same input for multiple fields and then filter the tokens 
to keep only some token types.
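
As a rough illustration of that last point, here is a schema sketch, not a tested configuration. It assumes the article text is already indexed into a "body" field, and it keeps only category tokens in a second field. The field type name "text_wiki_categories", the field name "categories", and the file name "wiki_keep_types.txt" are made up for this example. The "c" token type is what I believe WikipediaTokenizer assigns to category tokens, but check the source for the full list of types.

```xml
<!-- Hypothetical field type: tokenize wiki markup, keep only selected token types -->
<fieldType name="text_wiki_categories" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <!-- wiki_keep_types.txt (hypothetical file in the conf directory) lists one
         token type per line; here it would contain the single line: c -->
    <filter class="solr.TypeTokenFilterFactory" types="wiki_keep_types.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Same input, second field: copy the raw body text and let the analyzer filter it -->
<field name="categories" type="text_wiki_categories" indexed="true" stored="false"/>
<copyField source="body" dest="categories"/>
```

Note that this only filters tokens found inside the wiki text itself. It does not pull other elements of the dump XML, such as the contributor username, into a field.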

Besides my book, the best reference is going to be... the source code.

-- Jack Krupansky

Re: WikipediaTokenizer documentation

Posted by Furkan KAMACI <fu...@gmail.com>.

I suggest you look here:
http://www.javadocexamples.com/java_source/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerTest.java.html

