Posted to solr-user@lucene.apache.org by Sergey Bartunov <sb...@gmail.com> on 2010/10/22 18:07:50 UTC

How to index long words with StandardTokenizerFactory?

I'm trying to force Solr to index words longer than 255
characters (this constant is DEFAULT_MAX_TOKEN_LENGTH in Lucene's
StandardAnalyzer.java) using StandardTokenizerFactory as a 'filter'
tag in the schema configuration XML. Specifying the maxTokenLength
attribute doesn't work.

I tried a dirty hack: I downloaded the lucene-core-2.9.3 source,
changed DEFAULT_MAX_TOKEN_LENGTH to 1000000, built it into a jar,
and replaced the original lucene-core jar in solr/lib. But it seems
to have had no effect.
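The visible symptom can be sketched outside Lucene. The following toy Python tokenizer is an illustration only, not Lucene's actual implementation; it assumes, for the sake of the sketch, that tokens longer than the limit are simply not emitted, which matches what the thread observes: a 300-character word never becomes a searchable token, so a prefix query on it finds nothing.

```python
# Toy tokenizer with a maximum token length (illustration only; NOT Lucene's code).
import re

DEFAULT_MAX_TOKEN_LENGTH = 255  # mirrors the constant in StandardAnalyzer.java

def tokenize(text, max_token_length=DEFAULT_MAX_TOKEN_LENGTH):
    tokens = []
    for match in re.finditer(r"\w+", text):
        token = match.group()
        if len(token) <= max_token_length:
            tokens.append(token.lower())
        # overlong tokens are dropped in this sketch, so they are never indexed
    return tokens

short_doc = "a normal word"
long_doc = "big" + "a" * 300  # a single 303-character "word"

print(tokenize(short_doc))                             # every token survives
print(tokenize(long_doc))                              # the long word is gone
print(tokenize(long_doc, max_token_length=1000000))    # a raised limit keeps it
```

With the default limit, the long word disappears from the token stream entirely, which is why a query like `body:big*` matches nothing.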

Re: How to index long words with StandardTokenizerFactory?

Posted by Sergey Bartunov <sb...@gmail.com>.
This is exactly what I did. Look:

>> 3) I replace lucene-core-2.9.3.jar in solr/lib/ with my
>> lucene-core-2.9.3-dev.jar that I'd just compiled
>> 4) then I do "ant compile" and "ant dist" in the solr folder
>> 5) after that I rebuild solr/example/webapps/solr.war

Re: How to index long words with StandardTokenizerFactory?

Posted by Ahmet Arslan <io...@yahoo.com>.
I think you should put your new lucene-core-2.9.3-dev.jar in \apache-solr-1.4.1\lib, then create a new solr.war under \apache-solr-1.4.1\dist, and copy that new solr.war to solr/example/webapps/solr.war.



Re: How to index long words with StandardTokenizerFactory?

Posted by Sergey Bartunov <sb...@gmail.com>.
Yes, I did. It didn't help.


Re: How to index long words with StandardTokenizerFactory?

Posted by Ahmet Arslan <io...@yahoo.com>.
Did you delete the folder Jetty_0_0_0_0_8983_solr.war_** under apache-solr-1.4.1\example\work?



Re: How to index long words with StandardTokenizerFactory?

Posted by Sergey Bartunov <sb...@gmail.com>.
Here are all the files: http://rghost.net/3016862

1) StandardAnalyzer.java, StandardTokenizer.java - patched files from
lucene-2.9.3
2) I patch these files and build lucene by typing "ant"
3) I replace lucene-core-2.9.3.jar in solr/lib/ with my
lucene-core-2.9.3-dev.jar that I'd just compiled
4) then I do "ant compile" and "ant dist" in the solr folder
5) after that I rebuild solr/example/webapps/solr.war with my new
solr and lucene-core jars
6) I put my schema.xml in solr/example/solr/conf/
7) then I do "java -jar start.jar" in solr/example
8) I index big_post.xml
9) I try to find this document with "curl
http://localhost:8983/solr/select?q=body:big*" (big_post.xml contains
a long word bigaaaaa...aaaa)
10) solr returns nothing


RE: How to index long words with StandardTokenizerFactory?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Sergey,

What does your ~34kb field value look like?  Does StandardTokenizer think it's just one token?

What doesn't work?  What happens?

Steve


Re: How to index long words with StandardTokenizerFactory?

Posted by Sergey Bartunov <sb...@gmail.com>.
I'm using Solr 1.4.1. I've now succeeded in replacing the lucene-core
jar, but the max token length seems to be used in a very strange way.
Currently it's set to 1024*1024 for me, but I still couldn't index a
field of just ~34 KB. I understand that it's a little weird to index
such big data, but I just want to know why it doesn't work.


RE: How to index long words with StandardTokenizerFactory?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Sergey,

I've opened an issue to add a maxTokenLength param to the StandardTokenizerFactory configuration:

	https://issues.apache.org/jira/browse/SOLR-2188

I'll work on it this weekend.

Are you using Solr 1.4.1?  I ask because of your mention of Lucene 2.9.3.  I'm not sure there will ever be a Solr 1.4.2 release.  I plan on targeting Solr 3.1 and 4.0 for the SOLR-2188 fix.
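(The SOLR-2188 fix did ship in Solr 3.1, after which StandardTokenizerFactory accepts the limit directly in schema.xml. A sketch of the resulting field type; the type name and limit value here are illustrative, not from this thread:

```xml
<fieldType name="text_long_tokens" class="solr.TextField">
  <analyzer>
    <!-- maxTokenLength is honored by StandardTokenizerFactory from Solr 3.1 onward -->
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1000000"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

No Lucene patching or jar swapping is needed with this attribute in place.)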

I'm not sure why you didn't get the results you wanted with your Lucene hack - is it possible you have other Lucene jars in your Solr classpath?

Steve


Re: How to index long words with StandardTokenizerFactory?

Posted by Sergey Bartunov <sb...@gmail.com>.
Look at the schema.xml that I provided. I use my own "text_block"
type, which is derived from "TextField", and I force the use of
StandardTokenizerFactory via the tokenizer tag.

If I use the StrField type there are no problems with indexing big
data. The problem is in the tokenizer.
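(The schema.xml itself is only in the download link above, but a field type of the shape described here, a TextField with an explicit StandardTokenizerFactory, would look roughly like this; the names are illustrative, not the actual file:

```xml
<fieldType name="text_block" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
<!-- a StrField, by contrast, indexes the entire value verbatim, with no tokenizer -->
```

A StrField never tokenizes, which is consistent with big values indexing fine there.)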


Re: How to index long words with StandardTokenizerFactory?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Oct 22, 2010 at 12:07 PM, Sergey Bartunov <sb...@gmail.com> wrote:
> I'm trying to force solr to index words which length is more than 255

If the field is not a text field, Solr's default analyzer is used,
which currently limits tokens to 256 bytes.
Out of curiosity, what's your use case that you really need a single 34 KB token?

-Yonik
http://www.lucidimagination.com
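The 256-byte limit mentioned above is a byte limit, not a character limit, so multi-byte UTF-8 text hits it sooner than its character count suggests. A small Python sketch of the distinction (illustrative only; this is not Solr's code):

```python
# A byte-based cap sees encoded length, not character count.
LIMIT_BYTES = 256

def exceeds_byte_limit(token: str, limit: int = LIMIT_BYTES) -> bool:
    """True if the token's UTF-8 encoding is longer than the byte limit."""
    return len(token.encode("utf-8")) > limit

ascii_token = "a" * 200     # 200 chars, 200 bytes -> fits under 256 bytes
cyrillic_token = "я" * 200  # 200 chars, 400 bytes -> exceeds 256 bytes

print(exceeds_byte_limit(ascii_token))     # False
print(exceeds_byte_limit(cyrillic_token))  # True
```

A 200-character Cyrillic word is already over a 256-byte cap even though it is well under 256 characters.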