Posted to solr-user@lucene.apache.org by "G, Rajesh" <rg...@cebglobal.com> on 2016/04/29 10:55:53 UTC

Facet ignoring repeated word

Hi,

I am trying to implement a word cloud<https://www.google.co.uk/imgres?imgurl=https%3A%2F%2Fwww.whitehouse.gov%2Fsites%2Fdefault%2Ffiles%2Fother%2Fsotu_wordle.png&imgrefurl=https%3A%2F%2Fwww.whitehouse.gov%2Fblog%2F2011%2F01%2F26%2Fstate-union-word-cloud-jobs-america-people-new&docid=eZ_HvQpd9FRBKM&tbnid=qyIc-elv6z-0iM%3A&w=895&h=406&bih=643&biw=1366&ved=0ahUKEwie_8XjurPMAhXLaRQKHWiFDFAQMwgyKAAwAA&iact=mrc&uact=8> using Solr. The problem is that a Solr facet query ignores repeated words within a document, e.g.:

I have indexed the text :
It seems that the harder I work, the more work I get for the same compensation and reward. The more work I take on gets absorbed into my "normal" workload and I'm not recognized for working harder than my peers, which makes me not want to work to my potential. I am very underwhelmed by the evaluation process and bonus structure. I don't believe the current structure rewards strong performers. I am confident that the company could not hire someone with my talent to replace me if I left, but I don't think the company realizes that.

The indexed content contains the word "my" and its count is 3, but when I run the query http://localhost:8182/solr/dev/select?facet=true&facet.field=comments&rows=0&indent=on&q=questionid:3956&wt=json the count for "my" is 1 and not 3. Can you please help?

Also, please suggest whether there is a better way to implement a word cloud in Solr than faceting.

    "facet_fields":{
      "comments":[
        "absorbed",1,
        "am",1,
        "believe",1,
        "bonus",1,
        "company",1,
        "compensation",1,
        "confident",1,
        "could",1,
        "current",1,
        "don't",1,
        "evaluation",1,
        "get",1,
        "gets",1,
        "harder",1,
        "hire",1,
        "i",1,
        "i'm",1,
        "left",1,
        "makes",1,
        "me",1,
        "more",1,
        "my",1,
        "normal",1,
        "peers",1,
        "performers",1,
        "potential",1,
        "process",1,
        "realizes",1,
        "recognized",1,
        "replace",1,
        "reward",1,
        "rewards",1,
        "same",1,
        "seems",1,
        "someone",1,
        "strong",1,
        "structure",1,
        "take",1,
        "talent",1,
        "than",1,
        "think",1,
        "underwhelmed",1,
        "very",1,
        "want",1,
        "which",1,
        "work",1,
        "working",1,
        "workload",1]
    }




CEB India Private Limited. Registration No: U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building No.10 DLF Cyber City, Gurgaon, Haryana-122002, India..



This e-mail and/or its attachments are intended only for the use of the addressee(s) and may contain confidential and legally privileged information belonging to CEB and/or its subsidiaries, including CEB subsidiaries that offer SHL Talent Measurement products and services. If you have received this e-mail in error, please notify the sender and immediately, destroy all copies of this email and its attachments. The publication, copying, in whole or in part, or use or dissemination in any other way of this e-mail and attachments by anyone other than the intended person(s) is prohibited.



Re: Facet ignoring repeated word

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,


I understand the word cloud part. 
It looks like you want to use within-resultList term frequency information. In your first mail, I thought you wanted within-document term frequency.

TermsComponent reports within-collection term frequency.

I am not sure how to retrieve within-resultList term frequency.
Traversing the result list and collecting term vector data seems plausible.
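
One way to start on that suggestion is to ask the term vector handler only for the documents of a single question, rather than `q=*:*`. The sketch below just builds such a request URL. It reuses the `/tvrh` handler, the `comments` field, and the `tv`, `tv.fl`, `tv.tf` parameters that appear elsewhere in this thread; the helper name `term_vector_url` and the `wt=xml` parameter are assumptions added for illustration, and it presumes the field stores term vectors.

```python
from urllib.parse import urlencode


def term_vector_url(base, question_id, field="comments", rows=1000):
    """Build a term vector request limited to one question (hypothetical helper)."""
    params = {
        "q": f"questionid:{question_id}",  # restrict the result list to one question
        "tv": "true",                      # enable the term vector component
        "tv.fl": field,                    # field whose term vectors are returned
        "tv.tf": "true",                   # include per-document term frequency
        "fl": field,
        "rows": rows,
        "wt": "xml",                       # assumption: ask for the XML response writer
    }
    return f"{base}/tvrh?{urlencode(params)}"


url = term_vector_url("http://localhost:8182/solr/dev", 3956)
print(url)
```

The per-document `tf` values in the response still have to be summed by the client.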

Ahmet

 



On Monday, May 9, 2016 11:55 AM, "G, Rajesh" <rg...@cebglobal.com> wrote:
Hi Ahmet,

Please let me know if I am not clear

Thanks
Rajesh




-----Original Message-----
From: G, Rajesh [mailto:rg@cebglobal.com]
Sent: Friday, May 6, 2016 1:08 PM
To: Ahmet Arslan <io...@yahoo.com>; solr-user@lucene.apache.org
Subject: RE: Facet ignoring repeated word

Hi Ahmet,



Sorry, it is Word Cloud: https://www.google.co.uk/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#newwindow=1&q=word+cloud




We have comments from a survey. We want to build a word cloud from the comments field.

E.g., for question 1 the comments are:

    Comment 1. Projects, technology, features, performance
    Comment 2. Too many projects and technology, not enough people to run projects

I want to run a query for question 1 that will produce the result below:



projects: 3
technology: 2
features: 1
performance: 1
Too: 1
Many: 1
Enough: 1
People: 1
Run: 1
....



Faceting produces the result but ignores repeated words within a document [the count for projects will be 2 instead of 3].

projects: 2
technology: 2
features: 1
performance: 1
Too: 1
Many: 1
Enough: 1
People: 1
Run: 1

TermVectorComponent produces the expected counts, but they are grouped by document id rather than by word.



<lst name="1">
  <str name="uniqueKey">1</str>
  <lst name="comments">
    <lst name="projects">
      <int name="tf">1</int>
    </lst>
  </lst>
</lst>

<lst name="2">
  <str name="uniqueKey">2</str>
  <lst name="comments">
    <lst name="projects">
      <int name="tf">2</int>
    </lst>
  </lst>
</lst>



I wanted to know if it is possible to produce a result that is grouped by word and also does not ignore repeated words in a document. If it is not possible, then I have to write a script that takes the above result from Solr, groups it by word, and sums the counts.
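
A minimal sketch of such a script, assuming the term vector response is available as XML shaped like the fragment above. The outer `<lst name="termVectors">` wrapper in the sample and the function name `sum_term_frequencies` are assumptions for illustration; only the per-document structure is taken from the thread.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample shaped like the fragment above; the outer wrapper element is an assumption.
sample_response = """
<lst name="termVectors">
  <lst name="1">
    <str name="uniqueKey">1</str>
    <lst name="comments">
      <lst name="projects"><int name="tf">1</int></lst>
    </lst>
  </lst>
  <lst name="2">
    <str name="uniqueKey">2</str>
    <lst name="comments">
      <lst name="projects"><int name="tf">2</int></lst>
    </lst>
  </lst>
</lst>
"""


def sum_term_frequencies(xml_text, field="comments"):
    """Sum the per-document tf values of each term across all documents."""
    root = ET.fromstring(xml_text)
    totals = Counter()
    for field_node in root.iter("lst"):
        if field_node.get("name") != field:
            continue
        # Every child <lst> of the field node is one term with its tf.
        for term_node in field_node.findall("lst"):
            tf = term_node.find("int[@name='tf']")
            if tf is not None:
                totals[term_node.get("name")] += int(tf.text)
    return totals


totals = sum_term_frequencies(sample_response)
print(totals.most_common())
```

The resulting counter is keyed by word, so `totals.most_common()` gives the word/weight pairs a word cloud renderer would need.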



Thanks

Rajesh












-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com]
Sent: Friday, May 6, 2016 12:39 PM
To: G, Rajesh <rg...@cebglobal.com>; solr-user@lucene.apache.org
Subject: Re: Facet ignoring repeated word



Hi Rajesh,



Can you please explain what you mean by "tag cloud"?
How is it related to a query?
Please explain your requirements.



Ahmet







On Friday, May 6, 2016 8:44 AM, "G," <rg...@cebglobal.com> wrote:

Hi,



Can you please help? If there is a solution in Solr, it will be easy; otherwise I have to write a Python script that processes the results from TermVectorComponent and groups them by word across documents to find the word counts. The Python script will accept the exported Solr result as input.



Thanks

Rajesh












-----Original Message-----
From: G, Rajesh [mailto:rg@cebglobal.com]
Sent: Thursday, May 5, 2016 4:29 PM
To: Ahmet Arslan <io...@yahoo.com>; solr-user@lucene.apache.org; erickerickson@gmail.com
Subject: RE: Facet ignoring repeated word



Hi,



TermVectorComponent works. I am able to find the repeated words within the same document, which faceting could not do. The problem I see is that TermVectorComponent produces results per document, as in the example below, and I have to combine the counts myself, i.e. the count of the word "my" is 6 across the list of documents. Can you please suggest a way to group the counts by word across documents? Basically, we want to build a word cloud from the Solr result.



<lst name="1675">
  <str name="uniqueKey">1675</str>
  <lst name="comments">
    <lst name="my">
      <int name="tf">4</int>
    </lst>
  </lst>
</lst>

<lst name="1781">
  <str name="uniqueKey">1781</str>
  <lst name="comments">
    <lst name="my">
      <int name="tf">2</int>
    </lst>
  </lst>
</lst>



http://localhost:8182/solr/dev/tvrh?q=*:*&tv=true&tv.fl=comments&tv.tf=true&fl=comments&rows=1000





Hi Erick,

I need the count of repeated words to build word cloud



Thanks

Rajesh










-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com]
Sent: Tuesday, May 3, 2016 6:19 AM
To: solr-user@lucene.apache.org; G, Rajesh <rg...@cebglobal.com>
Subject: Re: Facet ignoring repeated word



Hi,



StatsComponent does not respect the query parameter. However you can feed a function query (e.g., termfreq) to it.



Instead consider using TermVectors or MLT's interesting terms.





https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
https://cwiki.apache.org/confluence/display/solr/MoreLikeThis



Ahmet





On Monday, May 2, 2016 9:31 AM, "G, Rajesh" <rg...@cebglobal.com> wrote:

Hi Erick/ Ahmet,



Thanks for your suggestion. Can TermsComponent be restricted by a query? I need the word counts of the comments for one question id, not for all of them. When I include the query q=questionid=123 I still see the counts for all.

http://localhost:8182/solr/dev/terms?terms.fl=comments&terms=true&terms.limit=1000&q=questionid=123

StatsComponent does not support text fields:

Field type textcloud_en{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, class=solr.TextField}} is not currently supported



  <fieldType name="textcloud_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>



Thanks

Rajesh












-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Friday, April 29, 2016 9:16 PM
To: solr-user <so...@lucene.apache.org>; Ahmet Arslan <io...@yahoo.com>
Subject: Re: Facet ignoring repeated word



That's the way faceting is designed to work. It counts the _documents_ that a term appears in that satisfy your query; if a word appears multiple times in a doc, it'll only be counted once.



For the general use-case it'd be unsettling for a user to see a facet count of 500, then click on it and discover that the number of docs in the corpus was really 345 or something.



Ahmet's hints might help, but I'd really ask if counting words multiple times really satisfies the use case.
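
The difference between the two kinds of count can be illustrated outside Solr. The sketch below uses the two example comments quoted earlier in this thread and a crude regular-expression tokenizer; the tokenizer is only an assumed stand-in for the real analyzer chain (no stop-word removal), and the function names are hypothetical.

```python
import re
from collections import Counter

# The two example comments for question 1 from this thread.
comments = [
    "Projects, technology, features, performance",
    "Too many projects and technology, not enough people to run projects",
]


def tokenize(text):
    """Crude stand-in for the analyzer: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def document_frequency(docs):
    """Count each word at most once per document, as field faceting does."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


def total_term_frequency(docs):
    """Count every occurrence of each word across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


df = document_frequency(comments)
ttf = total_term_frequency(comments)
print(df["projects"], ttf["projects"])  # 2 3
```

Here "projects" appears in two documents but three times in total, which is exactly the gap between the facet count and the count wanted for the word cloud.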



Best,

Erick



On Fri, Apr 29, 2016 at 7:10 AM, Ahmet Arslan <io...@yahoo.com.invalid> wrote:

> Hi,
>
> Depending on your requirements, StatsComponent, TermsComponent, or LukeRequestHandler can also be used.
>
> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> https://wiki.apache.org/solr/LukeRequestHandler
> https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
>
> Ahmet


RE: Facet ignoring repeated word

Posted by "G, Rajesh" <rg...@cebglobal.com>.
Hi Ahmet,

Please let me know if I am not clear

Thanks
Rajesh




>

> Also please suggest If there is a better way to implement word cloud in Solr other than using facet?

>

>     "facet_fields":{

>       "comments":[

>         "absorbed",1,

>         "am",1,

>         "believe",1,

>         "bonus",1,

>         "company",1,

>         "compensation",1,

>         "confident",1,

>         "could",1,

>         "current",1,

>         "don't",1,

>         "evaluation",1,

>         "get",1,

>         "gets",1,

>         "harder",1,

>         "hire",1,

>         "i",1,

>         "i'm",1,

>         "left",1,

>         "makes",1,

>         "me",1,

>         "more",1,

>         "my",1,

>         "normal",1,

>         "peers",1,

>         "performers",1,

>         "potential",1,

>         "process",1,

>         "realizes",1,

>         "recognized",1,

>         "replace",1,

>         "reward",1,

>         "rewards",1,

>         "same",1,

>         "seems",1,

>         "someone",1,

>         "strong",1,

>         "structure",1,

>         "take",1,

>         "talent",1,

>         "than",1,

>         "think",1,

>         "underwhelmed",1,

>         "very",1,

>         "want",1,

>         "which",1,

>         "work",1,

>         "working",1,

>         "workload",1]

>     }

>

>

>

>

> CEB India Private Limited. Registration No: U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building No.10 DLF Cyber City, Gurgaon, Haryana-122002, India..

>

>

>

> This e-mail and/or its attachments are intended only for the use of the addressee(s) and may contain confidential and legally privileged information belonging to CEB and/or its subsidiaries, including CEB subsidiaries that offer SHL Talent Measurement products and services. If you have received this e-mail in error, please notify the sender and immediately, destroy all copies of this email and its attachments. The publication, copying, in whole or in part, or use or dissemination in any other way of this e-mail and attachments by anyone other than the intended person(s) is prohibited.


RE: Facet ignoring repeated word

Posted by "G, Rajesh" <rg...@cebglobal.com>.
Hi Ahmet,

Sorry, I meant word cloud: https://www.google.co.uk/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#newwindow=1&q=word+cloud

We have comments from a survey. We want to build a word cloud using the field comments.

E.g., for question 1 the comments are:

    Comment 1.Projects, technology, features, performance
    Comment 2.Too many projects and technology, not enough people to run projects

I want to run a query for question 1 that produces the result below:

projects: 3
technology: 2
features: 1
performance: 1
Too: 1
Many: 1
Enough: 1
People: 1
Run: 1
....

Faceting produces the result but ignores repeated words within a document (the count for projects is 2 instead of 3):

projects: 2
technology: 2
features: 1
performance: 1
Too: 1
Many: 1
Enough: 1
People: 1
Run: 1

TermVectorComponent produces the expected counts, but they are grouped by document id rather than by word:

<lst name="1">
<str name="uniqueKey">1</str>
        <lst name="comments">
                <lst name="projects">
                        <int name="tf">1</int>
                </lst>
        </lst>
</lst>

<lst name="2">
<str name="uniqueKey">2</str>
        <lst name="comments">
                <lst name="projects">
                        <int name="tf">2</int>
                </lst>
        </lst>
</lst>

I wanted to know if it is possible to produce a result that is grouped by word and also does not ignore repeated words in a document. If it is not possible, then I have to write a script that takes the above result from Solr, groups it by word, and sums the counts.

Thanks
Rajesh
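
The client-side aggregation described above could be sketched in Python roughly as follows. This is only a minimal sketch, not a built-in Solr feature: it assumes the TermVectorComponent response was saved as XML, the surrounding <response> and termVectors wrapper elements are assumed, and the function name sum_term_frequencies is made up for illustration. The sample reuses the two per-document entries shown above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample shaped like the per-document output above; the <response> and
# "termVectors" wrappers are assumptions about the saved XML.
SAMPLE = """<response>
<lst name="termVectors">
  <lst name="1">
    <str name="uniqueKey">1</str>
    <lst name="comments">
      <lst name="projects"><int name="tf">1</int></lst>
    </lst>
  </lst>
  <lst name="2">
    <str name="uniqueKey">2</str>
    <lst name="comments">
      <lst name="projects"><int name="tf">2</int></lst>
    </lst>
  </lst>
</lst>
</response>"""


def sum_term_frequencies(xml_text):
    """Sum the per-document tf values for every term across all documents."""
    totals = Counter()
    root = ET.fromstring(xml_text)
    for lst in root.iter("lst"):
        # A term entry is any <lst> with a direct <int name="tf"> child.
        tf = lst.find("int[@name='tf']")
        if tf is not None:
            totals[lst.get("name")] += int(tf.text)
    return totals


totals = sum_term_frequencies(SAMPLE)
print(totals.most_common())
```

The result is a word-to-count mapping (projects adds up to 1 + 2 here) that a word-cloud renderer can consume directly. Restricting the request to one question id would keep the response small.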





Re: Facet ignoring repeated word

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Rajesh,

Can you please explain what you mean by "tag cloud"?
How is it related to a query?
Please explain your requirements.

Ahmet



On Friday, May 6, 2016 8:44 AM, "G, Rajesh" <rg...@cebglobal.com> wrote:
Hi,

Can you please help? If there is a solution, it will be easy; otherwise I have to create a Python script that processes the results from TermVectorComponent and groups them by word across documents to find the word count. The script will take the exported Solr result as input.

Thanks
Rajesh





-----Original Message-----
From: G, Rajesh [mailto:rg@cebglobal.com]
Sent: Thursday, May 5, 2016 4:29 PM
To: Ahmet Arslan <io...@yahoo.com>; solr-user@lucene.apache.org; erickerickson@gmail.com
Subject: RE: Facet ignoring repeated word

Hi,

TermVectorComponent works. I am able to find repeated words within the same document, which faceting could not do. The problem I see is that TermVectorComponent produces results per document (see below), so I have to combine the counts myself, i.e. the count of the word my is 6 across the list of documents. Can you please suggest a way to group counts by word across documents? Basically, we want to build a word cloud from the Solr result.

<lst name="1675">
<str name="uniqueKey">1675</str>
        <lst name="comments">
                <lst name="my">
                        <int name="tf">4</int>
                </lst>
        </lst>
</lst>

<lst name="1781">
<str name="uniqueKey">1781</str>
        <lst name="comments">
                <lst name="my">
                        <int name="tf">2</int>
                </lst>
        </lst>
</lst>

http://localhost:8182/solr/dev/tvrh?q=*:*&tv=true&tv.fl=comments&tv.tf=true&fl=comments&rows=1000


Hi Erick,
I need the count of repeated words to build the word cloud.

Thanks
Rajesh




-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com]
Sent: Tuesday, May 3, 2016 6:19 AM
To: solr-user@lucene.apache.org; G, Rajesh <rg...@cebglobal.com>
Subject: Re: Facet ignoring repeated word

Hi,

TermsComponent does not respect the query parameter. However, you can feed a function query (e.g., termfreq) to StatsComponent.

Instead consider using TermVectors or MLT's interesting terms.


https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
https://cwiki.apache.org/confluence/display/solr/MoreLikeThis

Ahmet
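
The function-query route mentioned above might look like the request built below. This is a hedged sketch: it assumes a Solr version whose StatsComponent accepts a function through the {!func} local parameter, the helper name termfreq_stats_url is hypothetical, and the host, core, field, and question id are taken from the URLs quoted elsewhere in this thread.

```python
from urllib.parse import urlencode


def termfreq_stats_url(base, query, field, term):
    """Build a select URL asking StatsComponent for statistics over
    termfreq(field, term), computed only on documents matching the query."""
    params = {
        "q": query,
        "rows": 0,
        "stats": "true",
        # Function query fed to StatsComponent via local params (assumed supported).
        "stats.field": "{!func}termfreq(%s,'%s')" % (field, term),
        "wt": "json",
    }
    return base + "?" + urlencode(params)


url = termfreq_stats_url(
    "http://localhost:8182/solr/dev/select",
    "questionid:3956",
    "comments",
    "my",
)
print(url)
```

If the request is accepted, the sum statistic in the stats section should hold the total number of occurrences of the term in the matching documents. It needs one request (or one stats.field) per word, so it suits a known word list better than discovering every word for a cloud.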


On Monday, May 2, 2016 9:31 AM, "G, Rajesh" <rg...@cebglobal.com> wrote:
Hi Erick/ Ahmet,

Thanks for your suggestions. Can TermsComponent be restricted by a query? I need the word counts of the comments for one question id, not for all of them. When I include the query q=questionid=123, I still see the counts for all documents:

http://localhost:8182/solr/dev/terms?terms.fl=comments&terms=true&terms.limit=1000&q=questionid=123

StatsComponent does not support text fields:

Field type textcloud_en{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, class=solr.TextField}} is not currently supported

  <fieldType name="textcloud_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Thanks
Rajesh





-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Friday, April 29, 2016 9:16 PM
To: solr-user <so...@lucene.apache.org>; Ahmet Arslan <io...@yahoo.com>
Subject: Re: Facet ignoring repeated word

That's the way faceting is designed to work. It counts the _documents_ that a term appears in that satisfy your query; if a word appears multiple times in a doc, it'll only count it once.

For the general use-case it'd be unsettling for a user to see a facet count of 500, then click on it and discover that the number of docs in the corpus was really 345 or something.

Ahmet's hints might help, but I'd really ask if counting words multiple times really satisfies the use case.

Best,
Erick
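
The difference between the two kinds of count can be illustrated outside Solr. The sketch below uses the two survey comments quoted earlier in this thread; the regex tokenizer is only a crude stand-in for the field's analysis chain, and all names are illustrative.

```python
import re
from collections import Counter

comments = [
    "Projects, technology, features, performance",
    "Too many projects and technology, not enough people to run projects",
]


def tokenize(text):
    # Crude stand-in for the analyzer: lowercase, keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


doc_freq = Counter()   # what faceting reports: number of documents containing the term
term_freq = Counter()  # what a word cloud wants: total occurrences of the term

for comment in comments:
    tokens = tokenize(comment)
    term_freq.update(tokens)
    doc_freq.update(set(tokens))  # each term counted at most once per document

print("projects: documents =", doc_freq["projects"], "occurrences =", term_freq["projects"])
```

For projects the document count is 2 while the occurrence count is 3, which is exactly the gap between the facet output and the desired word-cloud weights.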

On Fri, Apr 29, 2016 at 7:10 AM, Ahmet Arslan <io...@yahoo.com.invalid> wrote:
> Hi,
>
> Depending on your requirements, StatsComponent, TermsComponent, or LukeRequestHandler can also be used.
>
>
> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> https://wiki.apache.org/solr/LukeRequestHandler
> https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
> Ahmet

RE: Facet ignoring repeated word

Posted by "G, Rajesh" <rg...@cebglobal.com>.
Hi,

Can you please help? If there is a solution, it will be easy; otherwise I have to create a Python script that processes the results from TermVectorComponent and groups them by word across documents to find the word count. The script will take the exported Solr result as input.

Thanks
Rajesh



CEB India Private Limited. Registration No: U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.

This e-mail and/or its attachments are intended only for the use of the addressee(s) and may contain confidential and legally privileged information belonging to CEB and/or its subsidiaries, including SHL. If you have received this e-mail in error, please notify the sender and immediately, destroy all copies of this email and its attachments. The publication, copying, in whole or in part, or use or dissemination in any other way of this e-mail and attachments by anyone other than the intended person(s) is prohibited.

-----Original Message-----
From: G, Rajesh [mailto:rg@cebglobal.com]
Sent: Thursday, May 5, 2016 4:29 PM
To: Ahmet Arslan <io...@yahoo.com>; solr-user@lucene.apache.org; erickerickson@gmail.com
Subject: RE: Facet ignoring repeated word

Hi,

TermVectorComponent works. I am able to find the repeating words within the same document...that facet was not able to. The problem I see is TermVectorComponent produces result by a document e.g. and I have to combine the counts i.e count of word my is=6 in the list of documents. Can you please suggest a solution to group count by word across documents?. Basically we want to build word cloud from Solr result

<lst name="1675">
<str name="uniqueKey">1675</str>
        <lst name="comments">
                <lst name="my">
                        <int name="tf">4</int>
                </lst>
        </lst>
</lst>

<lst name="1781">
<str name="uniqueKey">1675</str>
        <lst name="comments">
                <lst name="my">
                        <int name="tf">2</int>
                </lst>
        </lst>
</lst>

http://localhost:8182/solr/dev/tvrh?q=*:*&tv=true&tv.fl=comments&tv.tf=true&fl=comments&rows=1000


Hi Erick,
I need the count of repeated words to build word cloud

Thanks
Rajesh



-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com]
Sent: Tuesday, May 3, 2016 6:19 AM
To: solr-user@lucene.apache.org; G, Rajesh <rg...@cebglobal.com>
Subject: Re: Facet ignoring repeated word

Hi,

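TermsComponent does not respect the query parameter; it reports term statistics for the whole index. StatsComponent does respect it and, although it does not accept text fields directly, you can feed a function query (e.g., termfreq) to it.

Alternatively, consider using TermVectors or MLT's interesting terms.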


https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
https://cwiki.apache.org/confluence/display/solr/MoreLikeThis

Ahmet


On Monday, May 2, 2016 9:31 AM, "G, Rajesh" <rg...@cebglobal.com> wrote:
Hi Erick/ Ahmet,

Thanks for your suggestions. Can TermsComponent be restricted by a query? I need the word counts of the comments for one question id, not for all of them. When I include the query q=questionid=123, I still see the counts for all documents.

http://localhost:8182/solr/dev/terms?terms.fl=comments&terms=true&terms.limit=1000&q=questionid=123

StatsComponent does not support text fields. It reports:

Field type textcloud_en{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, class=solr.TextField}} is not currently supported

  <fieldType name="textcloud_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Thanks
Rajesh




-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Friday, April 29, 2016 9:16 PM
To: solr-user <so...@lucene.apache.org>; Ahmet Arslan <io...@yahoo.com>
Subject: Re: Facet ignoring repeated word

That's the way faceting is designed to work. It counts the _documents_ that satisfy your query and contain the term; if a word appears multiple times in a doc, it is still counted only once.

For the general use-case it'd be unsettling for a user to see a facet count of 500, then click on it and discover that the number of docs in the corpus was really 345 or something.

Ahmet's hints might help, but I'd really ask if counting words multiple times really satisfies the use case.
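
To make the distinction above concrete, here is a minimal sketch that contrasts the two kinds of count. The mini-corpus and the crude tokenizer are hypothetical stand-ins, not the index or the analyzer discussed in this thread.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; not taken from the index discussed in this thread.
docs = [
    "my work and my peers",
    "my talent",
    "the company",
]

def tokens(text):
    # Crude stand-in for an analyzer: lowercase, keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())

doc_freq = Counter()   # what a facet count reports: documents containing the term
term_freq = Counter()  # what a word cloud usually wants: total occurrences

for doc in docs:
    toks = tokens(doc)
    term_freq.update(toks)       # every occurrence counts
    doc_freq.update(set(toks))   # each document counts at most once per term

print(doc_freq["my"], term_freq["my"])  # 2 3
```

The facet-style count for "my" is 2 (two documents contain it), while the occurrence count is 3.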

Best,
Erick

On Fri, Apr 29, 2016 at 7:10 AM, Ahmet Arslan <io...@yahoo.com.invalid> wrote:
> Hi,
>
> Depending on your requirements, StatsComponent, TermsComponent, or LukeRequestHandler can also be used.
>
>
> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> https://wiki.apache.org/solr/LukeRequestHandler
> https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
> Ahmet
>
>
>
> On Friday, April 29, 2016 11:56 AM, "G, Rajesh" <rg...@cebglobal.com> wrote:
> Hi,
>
> I am trying to implement a word cloud<https://www.google.co.uk/imgres?imgurl=https%3A%2F%2Fwww.whitehouse.gov%2Fsites%2Fdefault%2Ffiles%2Fother%2Fsotu_wordle.png&imgrefurl=https%3A%2F%2Fwww.whitehouse.gov%2Fblog%2F2011%2F01%2F26%2Fstate-union-word-cloud-jobs-america-people-new&docid=eZ_HvQpd9FRBKM&tbnid=qyIc-elv6z-0iM%3A&w=895&h=406&bih=643&biw=1366&ved=0ahUKEwie_8XjurPMAhXLaRQKHWiFDFAQMwgyKAAwAA&iact=mrc&uact=8> using Solr. The problem is that a Solr facet query ignores repeated words within a document. For example:
>
> I have indexed the text :
> It seems that the harder I work, the more work I get for the same compensation and reward. The more work I take on gets absorbed into my "normal" workload and I'm not recognized for working harder than my peers, which makes me not want to work to my potential. I am very underwhelmed by the evaluation process and bonus structure. I don't believe the current structure rewards strong performers. I am confident that the company could not hire someone with my talent to replace me if I left, but I don't think the company realizes that.
>
> The indexed content contains the word "my" and its count is 3, but when I run the query http://localhost:8182/solr/dev/select?facet=true&facet.field=comments&rows=0&indent=on&q=questionid:3956&wt=json the count for "my" is 1, not 3. Can you please help?
>
> Also, is there a better way to implement a word cloud in Solr than using facets?
>
>     "facet_fields":{
>       "comments":[
>         "absorbed",1,
>         "am",1,
>         "believe",1,
>         "bonus",1,
>         "company",1,
>         "compensation",1,
>         "confident",1,
>         "could",1,
>         "current",1,
>         "don't",1,
>         "evaluation",1,
>         "get",1,
>         "gets",1,
>         "harder",1,
>         "hire",1,
>         "i",1,
>         "i'm",1,
>         "left",1,
>         "makes",1,
>         "me",1,
>         "more",1,
>         "my",1,
>         "normal",1,
>         "peers",1,
>         "performers",1,
>         "potential",1,
>         "process",1,
>         "realizes",1,
>         "recognized",1,
>         "replace",1,
>         "reward",1,
>         "rewards",1,
>         "same",1,
>         "seems",1,
>         "someone",1,
>         "strong",1,
>         "structure",1,
>         "take",1,
>         "talent",1,
>         "than",1,
>         "think",1,
>         "underwhelmed",1,
>         "very",1,
>         "want",1,
>         "which",1,
>         "work",1,
>         "working",1,
>         "workload",1]
>     }

RE: Facet ignoring repeated word

Posted by "G, Rajesh" <rg...@cebglobal.com>.
Hi,

TermVectorComponent works: it reports the counts of words repeated within a single document, which faceting could not do. The problem is that TermVectorComponent returns results per document (see the trimmed response below), so I have to combine the counts myself; for example, the word "my" should total 6 (4 + 2) across the two documents below. Can you please suggest a way to group the counts by word across documents? Basically, we want to build a word cloud from the Solr result.

<lst name="1675">
<str name="uniqueKey">1675</str>
        <lst name="comments">
                <lst name="my">
                        <int name="tf">4</int>
                </lst>
        </lst>
</lst>

<lst name="1781">
<str name="uniqueKey">1781</str>
        <lst name="comments">
                <lst name="my">
                        <int name="tf">2</int>
                </lst>
        </lst>
</lst>

http://localhost:8182/solr/dev/tvrh?q=*:*&tv=true&tv.fl=comments&tv.tf=true&fl=comments&rows=1000
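
One way to get the across-document totals asked for above is to sum the per-document tf values on the client. The following is only a sketch: it assumes the /tvrh response is fetched as XML in the shape shown in this message, and the function name, the wrapping "termVectors" element in the sample, and the extra "work" term are illustrative assumptions rather than real output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample trimmed to the shape shown above; the "work" entry is made up
# to show more than one term. A real response nests these elements deeper,
# which root.iter() below walks through.
SAMPLE = """
<lst name="termVectors">
  <lst name="1675">
    <str name="uniqueKey">1675</str>
    <lst name="comments">
      <lst name="my"><int name="tf">4</int></lst>
      <lst name="work"><int name="tf">1</int></lst>
    </lst>
  </lst>
  <lst name="1781">
    <str name="uniqueKey">1781</str>
    <lst name="comments">
      <lst name="my"><int name="tf">2</int></lst>
    </lst>
  </lst>
</lst>
""".strip()

def total_term_counts(xml_text, field):
    """Sum the per-document tf values for every term of `field`."""
    totals = Counter()
    root = ET.fromstring(xml_text)
    for field_lst in root.iter("lst"):
        if field_lst.get("name") != field:
            continue
        # Each child <lst> of the field element is one term.
        for term_lst in field_lst.findall("lst"):
            tf = term_lst.find("int[@name='tf']")
            if tf is not None:
                totals[term_lst.get("name")] += int(tf.text)
    return totals

counts = total_term_counts(SAMPLE, "comments")
print(counts.most_common())  # [('my', 6), ('work', 1)]
```

The resulting term-to-count mapping can be passed directly to a word cloud renderer. Note that rows=1000 in the request above caps how many documents contribute to the totals.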


Hi Erick,
I need the counts of repeated words to build the word cloud.

Thanks
Rajesh




RE: Facet ignoring repeated word

Posted by "G, Rajesh" <rg...@cebglobal.com>.
Hi,

Please ignore my previous email.



-----Original Message-----
From: G, Rajesh [mailto:rg@cebglobal.com]
Sent: Thursday, May 5, 2016 2:29 PM
To: Ahmet Arslan <io...@yahoo.com>; solr-user@lucene.apache.org
Subject: RE: Facet ignoring repeated word

Hi,

The TermVector component also does not respect the query parameter. The query below returns results for all question ids instead of only question id 3426.

http://localhost:8182/solr/dev/terms?terms.fl=comments&terms=true&terms.limit=1000&q=questionid=3426

Thanks
Rajesh




RE: Facet ignoring repeated word

Posted by "G, Rajesh" <rg...@cebglobal.com>.
Hi,

The TermVector component also does not respect the query parameter. The query below returns results for all question ids instead of only question id 3426.

http://localhost:8182/solr/dev/terms?terms.fl=comments&terms=true&terms.limit=1000&q=questionid=3426

Thanks
Rajesh



CEB India Private Limited. Registration No: U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.

This e-mail and/or its attachments are intended only for the use of the addressee(s) and may contain confidential and legally privileged information belonging to CEB and/or its subsidiaries, including SHL. If you have received this e-mail in error, please notify the sender and immediately, destroy all copies of this email and its attachments. The publication, copying, in whole or in part, or use or dissemination in any other way of this e-mail and attachments by anyone other than the intended person(s) is prohibited.

-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com]
Sent: Tuesday, May 3, 2016 6:19 AM
To: solr-user@lucene.apache.org; G, Rajesh <rg...@cebglobal.com>
Subject: Re: Facet ignoring repeated word

Hi,

StatsComponent does not respect the query parameter. However you can feed a function query (e.g., termfreq) to it.

Instead consider using TermVectors or MLT's interesting terms.


https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
https://cwiki.apache.org/confluence/display/solr/MoreLikeThis

Ahmet


On Monday, May 2, 2016 9:31 AM, "G, Rajesh" <rg...@cebglobal.com> wrote:
Hi Erick/ Ahmet,

Thanks for your suggestion. Can we have a query in TermsComponent like. I need the word count of comments for a question id not all. When I include the query q=questionid=123 I still see count of all

http://localhost:8182/solr/dev/terms?terms.fl=comments&terms=true&terms.limit=1000&q=questionid=123

StatsComponent is not supporting text fields

Field type textcloud_en{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, class=solr.TextField}} is not currently supported

  <fieldType name="textcloud_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Thanks
Rajesh



CEB India Private Limited. Registration No: U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.

This e-mail and/or its attachments are intended only for the use of the addressee(s) and may contain confidential and legally privileged information belonging to CEB and/or its subsidiaries, including CEB subsidiaries that offer SHL Talent Measurement products and services. If you have received this e-mail in error, please notify the sender and immediately, destroy all copies of this email and its attachments. The publication, copying, in whole or in part, or use or dissemination in any other way of this e-mail and attachments by anyone other than the intended person(s) is prohibited.


-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Friday, April 29, 2016 9:16 PM
To: solr-user <so...@lucene.apache.org>; Ahmet Arslan <io...@yahoo.com>
Subject: Re: Facet ignoring repeated word

That's the way faceting is designed to work. It counts the _documents_ that a term appears in that satisfy your query, if a word appears multiple times in a doc, it'll only count it once.

For the general use-case it'd be unsettling for a user to see a facet count of 500, then click on it and discover that the number of docs in the corpus was really 345 or something.

Ahmet's hints might help, but I'd really ask if counting words multiple times really satisfies the use case.

Best,
Erick

On Fri, Apr 29, 2016 at 7:10 AM, Ahmet Arslan <io...@yahoo.com.invalid> wrote:
> Hi,
>
> Depending on your requirements; StatsComponent, TermsComponent, LukeRequestHandler can also be used.
>
>
> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> https://wiki.apache.org/solr/LukeRequestHandler
> https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
> Ahmet
>
>
>
> On Friday, April 29, 2016 11:56 AM, "G, Rajesh" <rg...@cebglobal.com> wrote:
> Hi,
>
> I am trying to implement word cloud<https://www.google.co.uk/imgres?imgurl=https%3A%2F%2Fwww.whitehouse.gov%2Fsites%2Fdefault%2Ffiles%2Fother%2Fsotu_wordle.png&imgrefurl=https%3A%2F%2Fwww.whitehouse.gov%2Fblog%2F2011%2F01%2F26%2Fstate-union-word-cloud-jobs-america-people-new&docid=eZ_HvQpd9FRBKM&tbnid=qyIc-elv6z-0iM%3A&w=895&h=406&bih=643&biw=1366&ved=0ahUKEwie_8XjurPMAhXLaRQKHWiFDFAQMwgyKAAwAA&iact=mrc&uact=8>  using Solr.  The problem I have is Solr facet query ignores repeated words in a document eg.
>
> I have indexed the text :
> It seems that the harder I work, the more work I get for the same compensation and reward. The more work I take on gets absorbed into my "normal" workload and I'm not recognized for working harder than my peers, which makes me not want to work to my potential. I am very underwhelmed by the evaluation process and bonus structure. I don't believe the current structure rewards strong performers. I am confident that the company could not hire someone with my talent to replace me if I left, but I don't think the company realizes that.
>
> The indexed content has word my and the count the is 3 but when I run the query http://localhost:8182/solr/dev/select?facet=true&facet.field=comments&rows=0&indent=on&q=questionid:3956&wt=json the count of word my  is 1 and not 3. Can you please help?
>
> Also please suggest If there is a better way to implement word cloud in Solr other than using facet?
>
>     "facet_fields":{
>       "comments":[
>         "absorbed",1,
>         "am",1,
>         "believe",1,
>         "bonus",1,
>         "company",1,
>         "compensation",1,
>         "confident",1,
>         "could",1,
>         "current",1,
>         "don't",1,
>         "evaluation",1,
>         "get",1,
>         "gets",1,
>         "harder",1,
>         "hire",1,
>         "i",1,
>         "i'm",1,
>         "left",1,
>         "makes",1,
>         "me",1,
>         "more",1,
>         "my",1,
>         "normal",1,
>         "peers",1,
>         "performers",1,
>         "potential",1,
>         "process",1,
>         "realizes",1,
>         "recognized",1,
>         "replace",1,
>         "reward",1,
>         "rewards",1,
>         "same",1,
>         "seems",1,
>         "someone",1,
>         "strong",1,
>         "structure",1,
>         "take",1,
>         "talent",1,
>         "than",1,
>         "think",1,
>         "underwhelmed",1,
>         "very",1,
>         "want",1,
>         "which",1,
>         "work",1,
>         "working",1,
>         "workload",1]
>     }

Re: Facet ignoring repeated word

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,

TermsComponent does not respect the query parameter. However, you can feed a function query (e.g., termfreq) to StatsComponent.

Alternatively, consider using the TermVectorComponent or MoreLikeThis's interesting terms.


https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
https://cwiki.apache.org/confluence/display/solr/MoreLikeThis

Ahmet



RE: Facet ignoring repeated word

Posted by "G, Rajesh" <rg...@cebglobal.com>.
Hi Erick/ Ahmet,

Thanks for your suggestions. Can TermsComponent be restricted by a query? I need the word counts of the comments for one question ID, not for all of them. When I include the query q=questionid=123, I still see the counts for all comments:

http://localhost:8182/solr/dev/terms?terms.fl=comments&terms=true&terms.limit=1000&q=questionid=123

StatsComponent does not support text fields:

Field type textcloud_en{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, class=solr.TextField}} is not currently supported

  <fieldType name="textcloud_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Thanks
Rajesh





Re: Facet ignoring repeated word

Posted by Erick Erickson <er...@gmail.com>.
That's the way faceting is designed to work. It counts the _documents_
that satisfy your query and contain the term; if a word appears
multiple times in a doc, it'll only count it once.
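
To make the distinction concrete, here is a minimal sketch in plain Python (no Solr involved). The tokenizer and the two sample documents are made up for illustration; the point is only that a facet-style count adds one per document, while a word-cloud-style count adds one per occurrence.

```python
from collections import Counter
import re


def tokenize(text):
    # Hypothetical, simplified analyzer: lowercase, keep letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def document_frequency(docs):
    """Facet-style count: the number of documents each term appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # set() collapses repeats within one document
    return df


def total_term_frequency(docs):
    """Word-cloud-style count: total occurrences across all documents."""
    tf = Counter()
    for doc in docs:
        tf.update(tokenize(doc))
    return tf


docs = [
    "my peers, my potential, my talent",
    "the company could not replace me",
]
print(document_frequency(docs)["my"])    # 1
print(total_term_frequency(docs)["my"])  # 3
```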

For the general use-case it'd be unsettling for a user to see a facet
count of 500, then click on it and discover that the number of docs in
the corpus was really 345 or something.

Ahmet's hints might help, but I'd really ask if counting words
multiple times really satisfies the use case.

Best,
Erick

On Fri, Apr 29, 2016 at 7:10 AM, Ahmet Arslan <io...@yahoo.com.invalid> wrote:
> Hi,
>
> Depending on your requirements; StatsComponent, TermsComponent, LukeRequestHandler can also be used.
>
>
> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> https://wiki.apache.org/solr/LukeRequestHandler
> https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
> Ahmet

Re: Facet ignoring repeated word

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,

Depending on your requirements, StatsComponent, TermsComponent, or LukeRequestHandler can also be used.


https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
https://wiki.apache.org/solr/LukeRequestHandler
https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
Ahmet




Re: Facet ignoring repeated word

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
+1 to Toke's facet and stats combo!




RE: Facet ignoring repeated word

Posted by "G, Rajesh" <rg...@cebglobal.com>.
Thanks, Toke. The issue is that I cannot look up a specific word such as ddr in termfreq(%27name%27,%20%27ddr%27); I need the count of every word and their sum. I might have 1000+ comments, and each might contain different words.





Re: Facet ignoring repeated word

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
G, Rajesh <rg...@cebglobal.com> wrote:
> Thanks Toke. The issue I have is I cannot look for a specific word e.g. ddr
> in termfreq(%27name%27,%20%27ddr%27). I have to find count of all words
> and their sum

Is that really the case? As your field is a comment field, your word cloud could easily contain tens or hundreds of thousands of words. That is pretty hard to display. Normally a word cloud consists of a small number of words, just as seen in the example you link to. The point of using facet + stats is that faceting gives you a rough list of candidates and stats gives you the real counts.

If a usable word cloud consists of 50 words, you could use something like facet.limit=200 and feed those to your stats-request, then only use the top-50 from there. I know that it does not guarantee that the words are the correct ones, but you can experiment with the facet.limit until you get a proper speed/accuracy trade-off.
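
The "keep only the top 50" step could look roughly like the sketch below. It assumes the stats response has the shape posted in this thread (keys of the form termfreq('field', 'term'), each with a "sum"); the function name top_terms and the sample dictionary are illustrative only, and a real Solr response may return the sums as floats.

```python
import re

# Matches stats keys of the form: termfreq('field', 'term')
_KEY = re.compile(r"^termfreq\('[^']*',\s*'(.*)'\)$")


def top_terms(stats_response, n=50):
    """Return the n terms with the highest summed term frequency."""
    fields = stats_response["stats"]["stats_fields"]
    sums = {}
    for key, value in fields.items():
        match = _KEY.match(key)
        if match:
            sums[match.group(1)] = value["sum"]
    # Highest sum first; ties are broken alphabetically for a stable order.
    return sorted(sums.items(), key=lambda item: (-item[1], item[0]))[:n]


# Sample shaped like the techproducts result quoted in this thread.
sample = {
    "stats": {
        "stats_fields": {
            "termfreq('name', 'ddr')": {"sum": 6},
            "termfreq('name', '1GB')": {"sum": 3},
        }
    }
}
print(top_terms(sample, n=1))  # [('ddr', 6)]
```

The resulting (term, count) pairs can be handed straight to whatever renders the cloud, with the count driving the font size.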

- Toke Eskildsen

RE: Facet ignoring repeated word

Posted by "G, Rajesh" <rg...@cebglobal.com>.
Thanks, Toke. The issue is that I cannot look up a specific word such as ddr in termfreq(%27name%27,%20%27ddr%27); I need the count of every word and their sum.





Re: Facet ignoring repeated word

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Fri, 2016-04-29 at 08:55 +0000, G, Rajesh wrote:
> I am trying to implement word cloud<https://www.google.co.uk/imgres?imgurl=https%3A%2F%2Fwww.whitehouse.gov%2Fsites%2Fdefault%2Ffiles%2Fother%2Fsotu_wordle.png&imgrefurl=https%3A%2F%2Fwww.whitehouse.gov%2Fblog%2F2011%2F01%2F26%2Fstate-union-word-cloud-jobs-america-people-new&docid=eZ_HvQpd9FRBKM&tbnid=qyIc-elv6z-0iM%3A&w=895&h=406&bih=643&biw=1366&ved=0ahUKEwie_8XjurPMAhXLaRQKHWiFDFAQMwgyKAAwAA&iact=mrc&uact=8>  using Solr.  The problem I have is Solr facet query ignores repeated words in a document eg.

Use a combination of faceting and stats:

1) Resolve candidate words with faceting, just as you have already done.

2) Create a stats-request with the same q as you used for faceting, with
a termfreq-function for each term in your facet result.


Working example from the techproducts-demo that comes with Solr:

http://localhost:8983/solr/techproducts/select
?q=name%3Addr%0A
&fl=name&wt=json&indent=true
&stats=true
&stats.field={!sum=true%20func}termfreq(%27name%27,%20%27ddr%27)
&stats.field={!sum=true%20func}termfreq(%27name%27,%20%271GB%27)

where 'name' is the field ('comments' in your setup) and 'ddr' and '1GB'
are two terms ('absorbed', 'am', 'believe' etc. in your setup).
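
For more than a handful of terms, the stats request is easier to build in code than by hand. A minimal Python sketch, assuming the terms have already been read from the facet response; the helper name build_termfreq_stats_url is made up for illustration, and the backslash escaping of quotes inside a term (for words such as "don't") is an assumption about how Solr's function parser reads quoted strings, so verify it against your Solr version:

```python
from urllib.parse import urlencode


def build_termfreq_stats_url(select_url, q, field, terms):
    """Build a select URL with one termfreq() stats.field per term.

    select_url, q, field and terms are supplied by the caller; this
    function only assembles the query string and sends nothing.
    """
    params = [("q", q), ("wt", "json"), ("stats", "true")]
    for term in terms:
        # Assumption: a backslash escapes a quote inside a quoted
        # function argument. Check this for terms like "don't".
        escaped = term.replace("\\", "\\\\").replace("'", "\\'")
        params.append(
            ("stats.field",
             "{!sum=true func}termfreq('%s','%s')" % (field, escaped))
        )
    # urlencode percent-encodes the braces, quotes and spaces.
    return select_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_termfreq_stats_url(
        "http://localhost:8983/solr/techproducts/select",
        "name:ddr", "name", ["ddr", "1GB"]))
```

With a long facet list the URL can get large, so sending the same parameters as a POST body may be the safer choice.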


The result will be something like

"response": {
    "numFound": 3,
...
"stats": {
    "stats_fields": {
      "termfreq('name', 'ddr')": {
        "sum": 6
      },
      "termfreq('name', '1GB')": {
        "sum": 3
      }
    }
  }
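
To turn such a response into word-cloud weights, the sums can be pulled out of stats_fields by term. A minimal Python sketch, assuming the JSON has the shape shown above and that each key is the termfreq(...) expression exactly as it was sent; the function name termfreq_sums and the sample dictionary are illustrative, not part of Solr:

```python
import re

# Matches keys such as termfreq('name', 'ddr') and captures the term.
_TERMFREQ_KEY = re.compile(r"^termfreq\('[^']*',\s*'(?P<term>.*)'\)$")


def termfreq_sums(response):
    """Map each term to the summed termfreq reported in the stats section."""
    stats_fields = response.get("stats", {}).get("stats_fields", {})
    counts = {}
    for key, stats in stats_fields.items():
        match = _TERMFREQ_KEY.match(key)
        if match:
            # Solr may report the sum as a float; int() normalises it.
            counts[match.group("term")] = int(stats["sum"])
    return counts


if __name__ == "__main__":
    # Hand-written sample mirroring the result shown above.
    sample = {
        "response": {"numFound": 3},
        "stats": {
            "stats_fields": {
                "termfreq('name', 'ddr')": {"sum": 6},
                "termfreq('name', '1GB')": {"sum": 3},
            }
        },
    }
    print(termfreq_sums(sample))  # {'ddr': 6, '1GB': 3}
```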


- Toke Eskildsen, State and University Library, Denmark