
Posted to dev@lucene.apache.org by Tom Burton-West <tb...@umich.edu> on 2013/03/04 21:57:32 UTC

Ability to specify 2 different query analyzers for same indexed field in Solr

Hello,

We would like to be able to specify two different fields that both use the
same indexed field but use different analyzers.   An example use-case for
this might be doing query-time synonym expansion with the synonyms weighted
lower than an exact match.

q=exact_field^10 OR synonyms^1
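
To make the request concrete, here is a minimal Python sketch (a toy model,
not Solr code) of a single inverted index queried through two different
query-time analysis chains, with exact matches boosted above synonym matches.
The synonym table, the function names, and the boost values are all
assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical synonym table; in Solr this would live in synonyms.txt.
SYNONYMS = {"car": ["automobile"], "automobile": ["car"]}


def analyze_exact(text):
    """Query/index chain without synonyms: lowercase + whitespace split."""
    return text.lower().split()


def analyze_synonyms(text):
    """Query chain that keeps each token and adds its synonyms."""
    out = []
    for tok in analyze_exact(text):
        out.append(tok)
        out.extend(SYNONYMS.get(tok, []))
    return out


def build_index(docs):
    """Index every document exactly once, with the exact chain."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in analyze_exact(text):
            index[tok].add(doc_id)
    return index


def search(index, query, exact_boost=10.0, synonym_boost=1.0):
    """Score = exact_boost per exact hit + synonym_boost per expanded hit."""
    scores = defaultdict(float)
    for tok in set(analyze_exact(query)):
        for doc_id in index.get(tok, ()):
            scores[doc_id] += exact_boost
    for tok in set(analyze_synonyms(query)):
        for doc_id in index.get(tok, ()):
            scores[doc_id] += synonym_boost
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With the documents {1: "red car", 2: "red automobile"} and the query "car",
document 1 matches both clauses and document 2 only the synonym clause, so
document 1 ranks first. The content is indexed only once.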

The normal way to do this in Solr, which is just to set up separate
analyzer chains and use a copyfield, will not work for us because the field
in question is huge.  It is about 7 TB of OCR.

Is there a way to do this currently in Solr?   If not:

1) should I open a JIRA issue?
2) can someone point me towards the part of the code I might need to modify?

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
http://www.hathitrust.org/blogs/large-scale-search

Re: Ability to specify 2 different query analyzers for same indexed field in Solr

Posted by Chris Hostetter <ho...@fucit.org>.

:   I would still like the ability to specify two different query analysis
: chains with one index, rather than having to write a custom parser for each

I'm not sure if this is a good idea, and I certainly haven't thought it
through very hard, but ...

I wonder if you could create a new FieldType subclassing TextField in
which you would not only specify an analyzer, but also another field name
(or prefix or something), and that FieldType would use its analyzer to
build queries against the other field.

So for example you might configure...

  <fieldType name="no_sym_ft" class="SpoofingTextField" prefix="nosym_">
    <analyzer ....  />
  </fieldType>
  <dynamicField type="no_sym_ft" name="nosym_*" indexed="false" />

...and then at query time, any use of a field name like "nosym_foo" would
cause the "no_sym_ft" field type to use its analyzer to build queries
against the "foo" field.

I think the implementation could be fairly simple, but I'm not
certain.
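
The field-name redirection described above can be modelled in a few lines of
Python. This is only a sketch of the idea, not an implementation of the
hypothetical SpoofingTextField: the registry, the analyzers, and the
"field:term" output format are assumptions chosen for illustration.

```python
# Hypothetical synonym table used by the default query analyzer.
SYNONYMS = {"tv": ["television"]}


def plain_analyzer(text):
    """Lowercase + whitespace split, no synonym expansion."""
    return text.lower().split()


def synonym_analyzer(text):
    """Default chain for the real field: expands synonyms at query time."""
    out = []
    for tok in plain_analyzer(text):
        out.append(tok)
        out.extend(SYNONYMS.get(tok, []))
    return out


# Hypothetical registry: field-name prefix -> query-time-only analyzer.
PREFIX_ANALYZERS = {"nosym_": plain_analyzer}


def resolve(field_name):
    """Return (real_field, analyzer); analyzer is None if no prefix matches."""
    for prefix, analyzer in PREFIX_ANALYZERS.items():
        if field_name.startswith(prefix) and len(field_name) > len(prefix):
            return field_name[len(prefix):], analyzer
    return field_name, None


def build_term_queries(field_name, text):
    """Analyze with the prefix's analyzer but target the underlying field."""
    real_field, analyzer = resolve(field_name)
    if analyzer is None:
        analyzer = synonym_analyzer
    return [f"{real_field}:{tok}" for tok in analyzer(text)]
```

Here a query on "nosym_foo" and a query on "foo" both produce terms against
the single indexed field "foo"; only the analysis applied to the query text
differs.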


-Hoss


Re: Ability to specify 2 different query analyzers for same indexed field in Solr

Posted by Tom Burton-West <tb...@umich.edu>.

Thanks Jan,

The blog post is very good; I hadn't quite realized all those various
pitfalls with synonyms.

I would still like the ability to specify two different query analysis
chains with one index, rather than having to write a custom parser for each
use case.  For example, the Traditional/Simplified Chinese use case in my
previous message could probably be solved with a custom query parser along
the lines of the synonym solution in the blog post.  But if there were a way
to specify two different query analysis chains for the same indexed field,
I would not have to write a custom query parser.

Tom



On Tue, Mar 5, 2013 at 5:39 PM, Jan Høydahl <ja...@cominvent.com> wrote:

> Hi,
>
> Please have a look at
> http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ and a
> working plugin to Solr to deboost the expanded synonyms. The plugin code
> currently lacks ability to configure different dictionaries for each field,
> but that could be added. Also see SOLR-4381 for eventual inclusion in Solr.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>

Re: Ability to specify 2 different query analyzers for same indexed field in Solr

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi,

Please have a look at http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ and its working Solr plugin for deboosting the expanded synonyms. The plugin code currently lacks the ability to configure different dictionaries for each field, but that could be added. Also see SOLR-4381 for eventual inclusion in Solr.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 5 March 2013 at 17:26, Tom Burton-West <tb...@umich.edu> wrote:

> Thanks Erick,
> 
> Payloads might work but I'm looking at a more general problem
> 
> Here is another use case:
> 
> We have a mix of Traditional and Simplified Chinese documents indexed in the same OCR field.  
>  When a user searches using Traditional Chinese, I would like to also search in Simplified Chinese, but rank the results matching Traditional Chinese higher.   Similarly, if a user enters a query in Simplified Chinese, I want to also search in Traditional Chinese but rank matches of the Simplified Chinese query terms higher.
> 
> Since it is not always possible to determine whether a short query is in Simplified or Traditional Chinese here is what I would like to do.
> 
> 1) Convert the query to Traditional Chinese
> 2) Convert the query to Simplified Chinese
> (One of these two steps would not be necessary if I could reliably determine the nature of the query)
> 
> q1=QueryAsEntered^10 OR QueryTraditional^1 OR QuerySimplified^1
> 
> Again, this could be done with copy fields, but that would increase my index size too much.  What I really want to be able to do is to query the same index (i.e. document as created ) with the user's query processed/analyzed in 3 different ways.
> 
> I could do this myself in the app layer, but I would really like to be able to use Solr.
> 
> 
> Tom

Re: Ability to specify 2 different query analyzers for same indexed field in Solr

Posted by Tom Burton-West <tb...@umich.edu>.

Thanks Erick,

Payloads might work, but I'm looking at a more general problem.

Here is another use case:

We have a mix of Traditional and Simplified Chinese documents indexed in
the same OCR field.
When a user searches using Traditional Chinese, I would like to also
search in Simplified Chinese, but rank the results matching Traditional
Chinese higher.  Similarly, if a user enters a query in Simplified
Chinese, I want to also search in Traditional Chinese but rank matches of
the Simplified Chinese query terms higher.

Since it is not always possible to determine whether a short query is in
Simplified or Traditional Chinese, here is what I would like to do:

1) Convert the query to Traditional Chinese
2) Convert the query to Simplified Chinese
(One of these two steps would not be necessary if I could reliably
determine the nature of the query)

q1=QueryAsEntered^10 OR QueryTraditional^1 OR QuerySimplified^1

Again, this could be done with copy fields, but that would increase my
index size too much.  What I really want to be able to do is to query the
same index (i.e. the document as created) with the user's query
processed/analyzed in 3 different ways.
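
The three-way expansion above could be sketched in the app layer roughly as
follows. This is a minimal Python illustration, not a Solr feature: the field
name "ocr", the three-character mapping table, and the function names are
assumptions, and a real converter would need a full table and would have to
handle one-to-many character mappings.

```python
# Tiny illustrative mapping (Traditional -> Simplified); NOT a complete table.
TRAD_TO_SIMP = {"書": "书", "圖": "图", "館": "馆"}
SIMP_TO_TRAD = {simp: trad for trad, simp in TRAD_TO_SIMP.items()}


def to_simplified(query):
    """Map each known Traditional character to its Simplified form."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in query)


def to_traditional(query):
    """Map each known Simplified character to its Traditional form."""
    return "".join(SIMP_TO_TRAD.get(ch, ch) for ch in query)


def build_query(field, user_query, entered_boost=10, variant_boost=1):
    """Build: as-entered^10 OR traditional^1 OR simplified^1, all on one field."""
    clauses = [f'{field}:"{user_query}"^{entered_boost}']
    for variant in (to_traditional(user_query), to_simplified(user_query)):
        clauses.append(f'{field}:"{variant}"^{variant_boost}')
    return " OR ".join(clauses)
```

All three clauses target the same indexed field, so no copy field is needed;
whichever script the user typed in gets the higher boost automatically
because the as-entered clause carries it.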

I could do this myself in the app layer, but I would really like to be able
to use Solr.


Tom



On Mon, Mar 4, 2013 at 8:19 PM, Erick Erickson <er...@gmail.com> wrote:

> Tom:
>
> I wonder if you could do something with payloads here. Index all terms
> with payloads of 10, but synonyms with 1?
>
> Random thought off the top of my head.
>
> Erick

Re: Ability to specify 2 different query analyzers for same indexed field in Solr

Posted by Erick Erickson <er...@gmail.com>.

Tom:

I wonder if you could do something with payloads here. Index all terms with
payloads of 10, but synonyms with 1?
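
One possible reading of this suggestion, as a toy Python model rather than
Lucene code: synonyms are injected at index time, every indexed term carries
a weight (10 for original terms, 1 for injected synonyms), and the score of a
match is the stored weight. The synonym table, weights, and function names
are assumptions for illustration; in Lucene, payloads are stored per term
position and need payload-aware scoring to have any effect.

```python
from collections import defaultdict

# Hypothetical synonym table applied at index time.
SYNONYMS = {"car": ["automobile"]}


def index_tokens(text):
    """Yield (term, payload): 10.0 for original terms, 1.0 for synonyms."""
    for tok in text.lower().split():
        yield tok, 10.0
        for syn in SYNONYMS.get(tok, []):
            yield syn, 1.0


def build_index(docs):
    """term -> {doc_id: highest payload seen for that term in the doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, payload in index_tokens(text):
            previous = index[term].get(doc_id, 0.0)
            index[term][doc_id] = max(previous, payload)
    return index


def score(index, query):
    """Sum the stored payloads of every query term that matches a doc."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, payload in index.get(term, {}).items():
            scores[doc_id] += payload
    return dict(scores)
```

Note that in this model the weighting is fixed when the documents are
indexed, and the injected synonym terms add to the index.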

Random thought off the top of my head.

Erick


On Mon, Mar 4, 2013 at 6:25 PM, Tom Burton-West <tb...@umich.edu> wrote:

> Hi Jack,
>
> Sorry the example is not clear.  Below is the normal way to accomplish
> what I am trying to do using a copyField and two separate fieldTypes with
> the index analyzer the same but the query time analyzer different.
>
> So the query would be something like q=plain:foobar^10 OR syn:foobar^1  to
> get synonyms but scored much lower than an exact match.
>
> The problem with this is since the analysis chain used for indexing is the
> same in both cases, I would rather not have to actually index the exact
> same content in the exact same way twice.
>
> Does that make it any clearer or do I need a more compelling use case?
>
> Tom

Re: Ability to specify 2 different query analyzers for same indexed field in Solr

Posted by Tom Burton-West <tb...@umich.edu>.

Hi Jack,

Sorry the example is not clear.  Below is the normal way to accomplish what
I am trying to do, using a copyField and two separate fieldTypes whose index
analyzers are the same but whose query-time analyzers differ.

So the query would be something like q=plain:foobar^10 OR syn:foobar^1 to
get synonyms, but scored much lower than an exact match.

The problem with this is that, since the analysis chain used for indexing
is the same in both cases, I would rather not have to actually index the
exact same content in the exact same way twice.

Does that make it any clearer, or do I need a more compelling use case?

Tom

<!-- Exact matching: identical index-time and query-time chains -->
<fieldType name="plain" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Same index-time chain as "plain"; synonyms are expanded only at query time -->
<fieldType name="syn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- copyField works on field names: this assumes fields named "plain" and
     "syn" are declared with the matching types above -->
<copyField source="plain" dest="syn"/>

On Mon, Mar 4, 2013 at 4:43 PM, Jack Krupansky <ja...@basetechnology.com> wrote:

>   Please clarify, and try providing a couple more use cases. I mean, the
> case you provided suggests that the contents of the index will be different
> between the two fields, while you told us that you wanted to share the same
> indexed field. In other words, it sounds like you will have two copies of
> similar data anyway.
>
> Maybe you simply want one copy of the stored value for the field and then
> have one or more copyfields that index the same source data differently,
> but don’t re-store the copied source data.
>
> -- Jack Krupansky

Re: Ability to specify 2 different query analyzers for same indexed field in Solr

Posted by Jack Krupansky <ja...@basetechnology.com>.

Please clarify, and try providing a couple more use cases. I mean, the case you provided suggests that the contents of the index will be different between the two fields, while you told us that you wanted to share the same indexed field. In other words, it sounds like you will have two copies of similar data anyway.

Maybe you simply want one copy of the stored value for the field and then have one or more copyfields that index the same source data differently, but don’t re-store the copied source data.

-- Jack Krupansky

From: Tom Burton-West 
Sent: Monday, March 04, 2013 3:57 PM
To: dev@lucene.apache.org 
Subject: Ability to specify 2 different query analyzers for same indexed field in Solr

Hello, 

We would like to be able to specify two different fields that both use the same indexed field but use different analyzers.   An example use-case for this might be doing query-time synonym expansion with the synonyms weighted lower than an exact match.   

q=exact_field^10 OR synonyms^1

The normal way to do this in Solr, which is just to set up separate analyzer chains and use a copyfield, will not work for us because the field in question is huge.  It is about 7 TB of OCR.

Is there a way to do this currently in Solr?   If not:

1) should I open a JIRA issue?
2) can someone point me towards the part of the code I might need to modify?

Tom 

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
http://www.hathitrust.org/blogs/large-scale-search