Posted to solr-user@lucene.apache.org by Kevin Lopez <ke...@gmail.com> on 2015/12/28 15:45:15 UTC

Solr - facet fields that contain other facet fields

*What I am trying to accomplish: *
Generate a facet based on the documents uploaded and a text file containing
terms from a domain/ontology such that a facet is shown if a term is in the
text file and in a document (key phrase extraction).

*The problem:*
When I select the facet for the term "*not necessarily*" (note the space),
I get the results for the term "*not*". The field is tokenized and
multivalued, which leads me to believe that I cannot use a tokenized field
as a facet field. I tried to copy the values of the field to a text field
with a keywordtokenizer, but the schema browser then tells me:
"Sorry, no Term Info available :(" This is after I delete the old index and
upload the documents again. The facet is coming from a field that is
already copied from another field, so I cannot copy this field to a text
field with a keywordtokenizer or strfield. What can I do to fix this? Is
there an alternate way to accomplish this?

*Here is my configuration:*

<copyField source="ColonCancerField" dest="cytokineField"/>

<field name="cytokineField" indexed="true" stored="true"
       multiValued="true" type="Cytokine_Pass"/>
<fieldType name="Cytokine_Pass" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="ColonCancerField" type="ColonCancer" indexed="true"
       stored="true" multiValued="true"
       termPositions="true"
       termVectors="true"
       termOffsets="true"/>
<fieldType name="ColonCancer" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="5"
            outputUnigramsIfNoShingles="true"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms_ColonCancer.txt" ignoreCase="true" expand="true"
            tokenizerFactory="solr.KeywordTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory"
            words="prefLabels_ColonCancer.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
<copyField source="content" dest="ColonCancerField"/>

Regards,

Kevin

Re: Solr - facet fields that contain other facet fields

Posted by Kevin Lopez <ke...@gmail.com>.

Hi Erick,

I believe I have found a solution, and I am including plenty of detail for
future reference. I have taken your previous advice and added a field
(cancerTerms) to hold the terms, but I am not doing this outside of Solr.
I run the content through the analysis chain inside a
ScriptUpdateProcessor. There I can take the output of the analysis chain
and store it on the document (as a strField). Then I facet on this field
(cancerTerms). This gives me the correct results: I no longer have the
problem with "not" and "not necessarily", or any other similar issue. I am
also no longer storing the analysis chain field (I was previously). It
makes no sense to store it because it is a copy field (apparently copy
fields only copy the source text and then pipe it to the analyzer, and
cannot be chained). I am only storing the results of the chain (which is
what is useful for faceting).

Here is a simplified view as to what I am doing:

*content* [is copied to] -> *ColonCancerField* (analysis chain [not stored,
and will produce tokenized strings]) -> *passed to update-script* (processes
each token as a string) [added to] -> *cancerTerms* (strField)

Here is an example document:

id: 2040ee23-c5dc-459c-969f-2ebf6c728184
title: Immune profile modulation of blood and mucosal eosinophils in nasal
polyposis with concomitant asthma.
content: BACKGROUND: Chronic rhinosinusitis with nasal polyps (CRSwNP) is
frequently associated with asthma. Mucosal eosinophil (EO) infiltrate has
been found to correlate with asthma and disease severity but not
necessarily in ......SNIP...... and could explain the low benefit of
anti-IL-5 therapy for some patients with asthma and nasal polyposis.
cytokineTerms: t cell replacing factor, type ii interferon, c7, chemokine,
interleukin 17 precursor, leukocyte mediator, interleukins,
t cell replacing factor, t cell replacing factor, il9 protein,
interferon alpha-5, cytokines, il9 protein
cancerTerms: but, not, not necessarily, although
_version_: 1522116540216901632
score: 1.0

Here is some of the code (please forgive the mess. I have included changes
for Solr ver. 5):

/***************************UpdateScript*********************************/
> // Runs fieldValue through the given analyzer and returns the emitted terms.
> function getAnalyzerResult(analyzer, fieldName, fieldValue) {
>   var result = [];
>   // fieldName is passed as null by the caller below.
>   var token_stream = analyzer.tokenStream(fieldName,
>       new java.io.StringReader(fieldValue));
>   var term_att = token_stream.getAttribute(
>       Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute.class);
>   try {
>     token_stream.reset();
>     while (token_stream.incrementToken()) {
>       result.push(term_att.toString());
>     }
>     token_stream.end();
>   } finally {
>     token_stream.close(); // always release the stream, even on error
>   }
>   return result;
> }
>
> function processAdd(cmd) {
>   var doc = cmd.solrDoc;  // org.apache.solr.common.SolrInputDocument
>   var id = doc.getFieldValue("id");
>   logger.warn("update-script#processAdd: id=" + id);
>
>   var content = doc.getFieldValue("content"); // Comes from /update/extract
>   if (content == null) {
>     return; // added guard: nothing to analyze for this document
>   }
>   // facetList contains the fields that receive the actual facet terms.
>   // facetAnalyzerName contains the analyzer name (i.e. the field type)
>   // used to produce the terms for the field at the same index.
>   var facetList = ["cytokineTerms", "cancerTerms"];
>   var facetAnalyzerName = ["key_phrases", "ColonCancer"];
>   // Loop through all of the facets, get the analyzer for each one,
>   // then add the resulting terms to the document.
>   for (var i = 0; i < facetList.length; i++) {
>     var analyzer = req.getCore().getLatestSchema()
>         .getFieldTypeByName(facetAnalyzerName[i]).getIndexAnalyzer();
>     var terms = getAnalyzerResult(analyzer, null, content);
>     for (var index = 0; index < terms.length; index++) {
>       doc.addField(facetList[i], terms[index]);
>     }
>   }
> }
>
> // The functions below must be defined, but there's rarely a need to
> // implement anything in these.
> function processDelete(cmd) {
>   // no-op
> }
> function processMergeIndexes(cmd) {
>   // no-op
> }
> function processCommit(cmd) {
>   // no-op
> }
> function processRollback(cmd) {
>   // no-op
> }
> function finish() {
>   // no-op
> }
> /***************************UpdateScript*********************************/
> /****************updateRequestProcessorChain ***********************/
>     <updateRequestProcessorChain name="script" default="true">
>       <processor class="solr.StatelessScriptUpdateProcessorFactory">
>         <str name="script">update-script.js</str>
>         <lst name="params">
>           <str name="config_param">example config parameter</str>
>         </lst>
>       </processor>
>       <processor class="solr.LogUpdateProcessorFactory"/>
>       <processor class="solr.RunUpdateProcessorFactory"/>
>     </updateRequestProcessorChain>
> /****************updateRequestProcessorChain ***********************/



>  java -Durl=http://localhost:8983/solr/Cytokine/update -Dauto
> -Dparams=update.chain=script -jar bin/post.jar
> C:/Users/Kevin/Downloads/pubmed_result.json
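
The example document above shows the same term added more than once (for
instance "il9 protein"). A possible refinement, sketched here under the
assumption that repeated values in the facet field are not needed, is to
de-duplicate the analyzer output before calling doc.addField. The helper
name uniqueTerms is made up for this illustration and is not part of the
original script; the sample terms are taken from the example document.

```javascript
// Hypothetical helper (not in the original script): returns the terms in
// first-seen order with repeats removed.
function uniqueTerms(terms) {
  var seen = {};
  var out = [];
  for (var i = 0; i < terms.length; i++) {
    var key = "t:" + terms[i]; // prefix avoids clashes with built-in keys
    if (!Object.prototype.hasOwnProperty.call(seen, key)) {
      seen[key] = true;
      out.push(terms[i]);
    }
  }
  return out;
}

// Sample values copied from the example document.
var sample = ["il9 protein", "cytokines", "il9 protein", "not", "not necessarily"];
var deduped = uniqueTerms(sample);
console.log(deduped);
```

Inside processAdd this would wrap the analyzer call, e.g.
uniqueTerms(getAnalyzerResult(analyzer, null, content)), with the
console.log lines dropped since they are only there for trying the helper
outside Solr.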


Sources:

   1. http://lucidworks.com/blog/2013/06/27/poor-mans-entity-extraction-with-solr/
   2. https://www.youtube.com/watch?v=AXSK2RvVJsk
   3. https://wiki.apache.org/solr/ScriptUpdateProcessor
   4. https://lucene.apache.org/solr/5_0_0/changes/Changes.html#v5.0.0.upgrading_from_solr_4.x
   5. https://gist.github.com/erikhatcher/50e653c1c09abb68e068

One issue I see is that I would like to highlight the selected terms in
the document. Currently I am using the positions from the term vectors and
overlaying them onto the content. Is there a way to highlight the terms
without getting the term vectors?

Thank you for all of your help!

Regards,

Kevin

On Tue, Dec 29, 2015 at 2:14 PM, Kevin Lopez <ke...@gmail.com>
wrote:

> Erick,
>
> I am not sure that your statement "the only available terms are "not" and
> "necessarily"" is totally correct. I go into the schema browser and I can
> see that there are two terms, "not" and "not necessarily", with the correct
> counts. Unless these are not the terms you are talking about? Can you
> explain to me what these are exactly?
>
> http://imgur.com/m82CH2f
>
> I see what you are saying, it may be best for me to do the entity
> extraction separately, and put the terms into a special field, although I
> would like the terms to be highlighted (or have some type of position so I
> can highlight it).
>
> Regards,
>
> Kevin
>
> On Mon, Dec 28, 2015 at 12:49 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> bq:  so I cannot copy this field to a text field with a
>> keywordtokenizer or strfield
>>
>> 1> There is no restriction on whether a field is analyzed or not as far as
>> faceting is concerned. You can freely facet on an analyzed field
>> or String field or KeywordTokenized field. As Binoy says, though,
>> faceting on large analyzed text fields is dangerous.
>>
>> 2> copyField directives are not chained. As soon as the
>> field is received, before _anything_ is done the raw contents are
>> pushed to the copyField destinations. So in your case the source
>> for both copyField directives should be "content". Otherwise you
>> get into interesting behavior if you, say,  copyField from A to B and
>> have another copyField from B to A. I _suspect_ this is
>> why you have no term info available, but check....
>>
>> 3> This is not going to work as you're trying to implement it. If you
>> tokenize, the only available terms are "not" and "necessarily". There
>> is no "not necessarily" _token_ to facet on. If you use a String
>> or KeywordAnalyzed field, likewise there is no "not necessarily"
>> token, there will be a _single_ token that's the entire content of the
>> field (I'm leaving aside, for instance, WordDelimiterFilterFactory
>> modifications...).
>>
>> One way to approach this would be to recognize and index synthetic
>> tokens representing the concepts. You'd pre-analyze the text, do your
>> entity recognition and add those entities to a special "entity" field or
>> some such. This would be an unanalyzed field that you facet on. Let's
>> say your entity was "colon cancer". Whenever you recognized that in
>> the text during indexing, you'd index "colon_cancer", or "disease_234"
>> in your special field.
>>
>> Of course your app would then have to present this pleasingly, and
>> rather than the app needing access to your dictionary the "colon_cancer"
>> form would be easier to unpack.
>>
>> The fragility here is that changing your text file of entities would
>> require you to re-index to re-inject them into documents.
>>
>> You could also, assuming you know all the entities that should match
>> a given query, form facet _queries_ on the phrases. This could get to be
>> quite a large query, but has the advantage of not requiring re-indexing.
>> So you'd have something like
>> facet.query=field:"not necessarily"&facet.query=field:certainly
>> etc.
>>
>> Best,
>> Erick
>>
>>
>> On Mon, Dec 28, 2015 at 9:13 AM, Binoy Dalal <bi...@gmail.com>
>> wrote:
>> > 1) When faceting use field of type string. That'll rid you of your
>> > tokenization problems.
>> > Alternatively do not use any tokenizers.
>> > Also turn doc values on for the field. It'll improve performance.
>> > 2) If however you do need to use a tokenized field for faceting, make
>> > sure that they're pretty short in terms of number of tokens or else
>> > your app will die real soon.
>> >
>> > On Mon, 28 Dec 2015, 22:24 Kevin Lopez <ke...@gmail.com> wrote:
>> >
>> >> I am not sure I am following correctly. The field I upload the
>> >> document to would be "content"; the analyzed field is
>> >> "ColonCancerField". The "content" field contains the entire text of
>> >> the document, in my case a pubmed abstract. This is a tokenized field.
>> >> I made this field untokenized and I still received the same results
>> >> [the results for not instead of not necessarily (in my current example
>> >> I have 2 docs with not and 1 doc with not necessarily {not is of
>> >> course in the document that contains not necessarily})]:
>> >>
>> >> http://imgur.com/a/1bfXT
>> >>
>> >> I also tried this:
>> >>
>> >> http://localhost:8983/solr/Cytokine/select?&q=ColonCancerField:"not+necessarily"
>> >>
>> >> I still receive the two documents, which is the same as doing
>> >> ColonCancerField:"not"
>> >>
>> >> Just to clarify, the structure looks like this: *content (untokenized,
>> >> unanalyzed)* [copied to]==> *ColonCancerField *(tokenized, analyzed).
>> >> Then I browse the ColonCancerField and the facets state that there is
>> >> 1 document for not necessarily, but when selecting it, solr returns 2
>> >> results.
>> >>
>> >> -Kevin
>> >>
>> >> On Mon, Dec 28, 2015 at 10:22 AM, Jamie Johnson <je...@gmail.com> wrote:
>> >>
>> >> > Can you do the opposite?  Index into an unanalyzed field and copy
>> >> > into the analyzed?
>> >> >
>> >> > If I remember correctly facets are based off of indexed values so if
>> >> > you tokenize the field then the facets will be as you are seeing now.
>> >>
>> > --
>> > Regards,
>> > Binoy Dalal
>>
>
>

Re: Solr - facet fields that contain other facet fields

Posted by Erick Erickson <er...@gmail.com>.

Sorry, I overlooked the ShingleFilterFactory.
You're presumably getting the "not necessarily" term from
your ShingleFilterFactory. Note that minShingleSize=2
does not mean that only shingles of two or more words are output;
there's yet another parameter, "outputUnigrams", that controls
that in combination with outputUnigramsIfNoShingles.
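
To make the shingle point concrete, here is a small standalone JavaScript
sketch. It is not Solr code: the shingles function is a made-up helper that
only imitates which terms a shingle filter would emit for a short run of
whitespace tokens, with and without unigrams. It assumes shingles are
joined with a single space and ignores positions, lowercasing, synonyms and
the keep-word list.

```javascript
// Hypothetical helper, not part of Solr: imitates the terms a shingle
// filter would emit for a list of whitespace tokens.
function shingles(tokens, minSize, maxSize, outputUnigrams) {
  var out = [];
  for (var start = 0; start < tokens.length; start++) {
    if (outputUnigrams) {
      out.push(tokens[start]); // the single-word term, e.g. "not"
    }
    for (var size = minSize; size <= maxSize; size++) {
      if (start + size > tokens.length) {
        break; // not enough tokens left for a shingle of this size
      }
      out.push(tokens.slice(start, start + size).join(" "));
    }
  }
  return out;
}

var tokens = ["but", "not", "necessarily"];
var withUnigrams = shingles(tokens, 2, 5, true);
var withoutUnigrams = shingles(tokens, 2, 5, false);
console.log(withUnigrams);
console.log(withoutUnigrams);
```

With unigrams enabled, both "not" and "not necessarily" come out as
separate terms, which matches what the schema browser shows. With unigrams
disabled, only the multi-word shingles remain.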

I suspect that the shingle factory is making things
not quite meet your expectations. It's actually unclear to me
why the search for "not necessarily" with quotes is matching
the doc with "not". Can we see the output with

debug=true&debug.explain.structured=true

?

In particular I've been assuming that your fq clause is a _phrase_
search as (with quotes) fq:"not necessarily". Look in the parsed-query
of the above (ignore the scoring) to see if the fq clause is a phrase
clause. If it's not, with a default operator of OR then your results
are understandable.
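
One way the quotes can get lost is when the request URL is assembled by
hand. The following is a minimal sketch, assuming the request is built in
JavaScript on the client side; phraseFilterUrl is a hypothetical helper,
and the core URL and field name are simply the ones used earlier in this
thread. It wraps the phrase in double quotes and URL-encodes the whole fq
value so that the quotes and the space reach Solr intact, and it appends
the debug parameters mentioned above.

```javascript
// Hypothetical helper: builds a select URL whose fq is a quoted phrase.
function phraseFilterUrl(baseUrl, field, phrase) {
  // Escape backslashes and double quotes inside the phrase itself.
  var escaped = phrase.replace(/(["\\])/g, "\\$1");
  var fq = field + ':"' + escaped + '"';
  return baseUrl + "/select?q=" + encodeURIComponent("*:*") +
      "&fq=" + encodeURIComponent(fq) +
      "&debug=true&debug.explain.structured=true";
}

var url = phraseFilterUrl("http://localhost:8983/solr/Cytokine",
    "ColonCancerField", "not necessarily");
console.log(url);
```

The parsed query in the debug output then shows whether the clause was
treated as a phrase or as separate terms.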

BTW, just to be paranoid I'd start with some two-word phrase
that doesn't contain "not" as that can be an operator.... It
shouldn't be in this case since it's lower case, but just to be safe...

Best,
Erick




Re: Solr - facet fields that contain other facet fields

Posted by Kevin Lopez <ke...@gmail.com>.

Erick,

I am not sure that your statement "the only available terms are "not" and
"necessarily"" is totally correct. I go into the schema browser and I can
see that there are two terms, "not" and "not necessarily", with the correct
counts. Unless these are not the terms you are talking about? Can you
explain to me what these are exactly?

http://imgur.com/m82CH2f

I see what you are saying, it may be best for me to do the entity
extraction separately, and put the terms into a special field, although I
would like the terms to be highlighted (or have some type of position so I
can highlight it).

Regards,

Kevin


Re: Solr - facet fields that contain other facet fields

Posted by Erick Erickson <er...@gmail.com>.

bq:  so I cannot copy this field to a text field with a
keywordtokenizer or strfield

1> There is no restriction on whether a field is analyzed or not as far as
faceting is concerned. You can freely facet on an analyzed field
or String field or KeywordTokenized field. As Binoy says, though,
faceting on large analyzed text fields is dangerous.

2> copyField directives are not chained. As soon as the
field is received, before _anything_ else is done, the raw contents are
pushed to the copyField destinations. So in your case the source
for both copyField directives should be "content". Otherwise you
get into interesting behavior if you, say, copyField from A to B and
have another copyField from B to A. I _suspect_ this is
why you have no term info available, but check....

3> This is not going to work as you're trying to implement it. If you
tokenize, the only available terms are "not" and "necessarily". There
is no "not necessarily" _token_ to facet on. If you use a String
or KeywordTokenized field, likewise there is no "not necessarily"
token; there will be a _single_ token that's the entire content of the field
(I'm leaving aside, for instance, WordDelimiterFilterFactory
modifications...).
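
To make point 2 concrete, here is a minimal sketch. It assumes a managed
schema, so that the Schema API's add-copy-field command is available; with a
hand-edited schema.xml like the one in this thread the equivalent is simply
two copyField lines that both have source="content". The helper name
copy_field_command is hypothetical, and the sketch only builds the request
body rather than sending it.

```python
import json


def copy_field_command(source, destinations):
    """Build a Schema API 'add-copy-field' command body.

    Every destination is fed from the same raw source field, because
    copyField directives are not chained.
    """
    return {"add-copy-field": {"source": source, "dest": list(destinations)}}


# Both destination fields take their input directly from "content".
payload = copy_field_command("content", ["ColonCancerField", "cytokineField"])
body = json.dumps(payload)
print(body)
```

The body would be POSTed to the core's /schema endpoint. Note that this only
addresses the missing term info; per point 3, a KeywordTokenized copy of
"content" still holds the whole abstract as one token.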

One way to approach this would be to recognize and index synthetic
tokens representing the concepts. You'd pre-analyze the text, do your
entity recognition and add those entities to a special "entity" field or
some such. This would be an unanalyzed field that you facet on. Let's
say your entity was "colon cancer". Whenever you recognized that in
the text during indexing, you'd index "colon_cancer", or "disease_234"
in your special field.

Of course your app would then have to present this pleasingly. The
"colon_cancer" form would be easier to unpack than "disease_234", since
the app would not need access to your dictionary to display it.

The fragility here is that changing your text file of entities would require
you to re-index to re-inject them into documents.
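
A rough sketch of that pre-analysis step, assuming the entity list is a
plain list of phrases and that matching is a case-insensitive whole-word
search. The function names, the sample phrases and the "entity" field name
are only illustrative; real entity recognition would likely need more than
this.

```python
import re


def load_entities(lines):
    """Map each lower-cased phrase to a synthetic token,
    e.g. 'colon cancer' -> 'colon_cancer'."""
    entities = {}
    for line in lines:
        phrase = " ".join(line.lower().split())
        if phrase:
            entities[phrase] = phrase.replace(" ", "_")
    return entities


def extract_entities(text, entities):
    """Return the sorted synthetic tokens whose phrase occurs in the
    text as whole words."""
    normalized = " ".join(text.lower().split())
    found = set()
    for phrase, token in entities.items():
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        if re.search(pattern, normalized):
            found.add(token)
    return sorted(found)


entities = load_entities(["colon cancer", "not necessarily", "not"])
doc = {"id": "1", "content": "This is not necessarily linked to colon cancer."}
# The tokens go into an unanalyzed, multivalued field that is faceted on.
doc["entity"] = extract_entities(doc["content"], entities)
print(doc["entity"])
```

Because each recognized phrase becomes exactly one value in the unanalyzed
field, selecting the "not_necessarily" facet only matches documents where
that phrase was recognized, not every document containing "not".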

You could also, assuming you know all the entities that should match
a given query, form facet _queries_ on the phrases. This could get to be
quite a large query, but has the advantage of not requiring re-indexing.
So you'd have something like
facet.query=field:"not necessarily"&facet.query=field:certainly
etc.
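
A small sketch of building such a request, assuming the phrases come from
the same term list and that the field searched ("content" here, as an
assumption) is a tokenized text field on which phrase queries behave as
expected. The q=*:* and rows=0 parameters and the helper name are my own
additions; the host and core name are taken from the URL earlier in the
thread.

```python
from urllib.parse import urlencode


def facet_query_params(field, phrases, q="*:*"):
    """Build request parameters with one facet.query per phrase."""
    params = [("q", q), ("rows", "0"), ("facet", "true")]
    for phrase in phrases:
        # Quote every phrase so multi-word entries are searched as phrases.
        escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("facet.query", f'{field}:"{escaped}"'))
    return params


params = facet_query_params("content", ["not necessarily", "certainly"])
url = "http://localhost:8983/solr/Cytokine/select?" + urlencode(params)
print(url)
```

Each facet.query comes back with its own count, so the application can show
one facet entry per phrase and reuse the same clause as a filter when the
user selects it.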

Best,
Erick


Re: Solr - facet fields that contain other facet fields

Posted by Binoy Dalal <bi...@gmail.com>.

1) When faceting, use a field of type string. That'll rid you of your
tokenization problems.
Alternatively, do not use any tokenizers.
Also turn doc values on for the field. It'll improve performance.
2) If, however, you do need to use a tokenized field for faceting, make sure
that the field values are pretty short in terms of number of tokens, or else
your app will die real soon.
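
As an illustration of point 1, a minimal sketch of such a field definition.
It assumes a managed schema with the Schema API available and a fieldType
named "string" (solr.StrField), as in the stock configsets; the field name
"entity" and the helper name are hypothetical. In schema.xml this would be
an ordinary field element with type="string" and docValues="true".

```python
import json


def facet_field_command(name):
    """Build a Schema API 'add-field' command for an unanalyzed facet field."""
    return {
        "add-field": {
            "name": name,
            "type": "string",      # assumes a fieldType named "string" exists
            "indexed": True,
            "stored": True,
            "multiValued": True,   # a document can mention several terms
            "docValues": True,     # doc values turned on, as suggested above
        }
    }


command = facet_field_command("entity")
print(json.dumps(command, indent=2))
```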

-- 
Regards,
Binoy Dalal

Re: Solr - facet fields that contain other facet fields

Posted by Kevin Lopez <ke...@gmail.com>.

I am not sure I am following correctly. The field I upload the document to
is "content"; the analyzed field is "ColonCancerField". The "content"
field contains the entire text of the document, in my case a PubMed
abstract. This was a tokenized field. I made it untokenized and I
still received the same results [the results for "not" instead of "not
necessarily" (in my current example I have 2 docs with "not" and 1 doc with
"not necessarily" {"not" is of course in the document that contains "not
necessarily"})]:

http://imgur.com/a/1bfXT

I also tried this:

http://localhost:8983/solr/Cytokine/select?&q=ColonCancerField:"not+necessarily"

I still receive the two documents, which is the same as doing
ColonCancerField:"not"

Just to clarify, the structure looks like this: *content (untokenized,
unanalyzed)* [copied to]==> *ColonCancerField* (tokenized, analyzed). Then I
browse the ColonCancerField and the facets state that there is 1 document
for "not necessarily", but when selecting it, Solr returns 2 results.

-Kevin


Re: Solr - facet fields that contain other facet fields

Posted by Jamie Johnson <je...@gmail.com>.

Can you do the opposite? Index into an unanalyzed field and copy into the
analyzed one?

If I remember correctly, facets are based on indexed values, so if you
tokenize the field then the facets will be as you are seeing now.