Posted to solr-user@lucene.apache.org by Geoffrey Willis <gw...@yahoo.com.INVALID> on 2019/05/01 19:14:27 UTC

Term Freq Vector with SOLR cell?

I am using Solr in a web app to extract text from .pdf and .docx files. I was wondering whether I can access the TermFreq and TermPosition vectors via the HTTP interface exposed by Solr Cell. Posting and getting documents works fine, and I’ve enabled term vectors, positions, offsets and payloads in the managed schema:

<field name="doc_content" type="text_ws" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" termVectors="true"/>

And use a GET request similar to:

   http://localhost:8983/solr/myCore/tvrh?q=doc_content&tv=true&tv.tf=true&tv.df=true&tv.positions=true&tv.offsets=true&tv.payloads=true&tv.fl=includes

When I look in the browser network tab, I see that the query went in as expected with tv=true, tv.positions=true, etc., but there are no term positions or offsets in the results. I’ve done something similar using the Data Import Handler with Java, but I’m looking for a web solution. Before I roll my own term vector, I thought I’d see whether it’s available from Solr Cell.

Re: Term Freq Vector with SOLR cell?

Posted by Geoffrey Willis <gw...@yahoo.com.INVALID>.
Thanks for the response. I got the tvrh handler from a Google search, and doc_content was meant as a filter. The actual query I’m using is:
http://localhost:8983/solr/myCore/select?q=999&tf=true&tf=true&tv.positions=true&tv.offsets=true

A screen grab of the response headers:

[screen grab not preserved in this plain-text version]

So it appears that term vector and term positions are set, but not included in the response. My doc_content field was modified as shown earlier to enable storing these attributes when indexing. I get the doc_content data (text extracted by Tika), just not the TermFreq data I need, such as offsets and positions. Thanks for any help.
Geoff
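
For anyone comparing the two URLs in this thread, here is a small Python sketch that flags two things worth checking in a term vector request: which handler it targets and whether tv=true is present. The helper name check_tv_request is made up for illustration. It assumes the Term Vector Component is only wired into a /tvrh handler, as in Solr's sample configs; a given solrconfig.xml may attach it elsewhere, so treat the output as hints rather than a diagnosis.

```python
from urllib.parse import urlsplit, parse_qs

def check_tv_request(url):
    """Return a list of likely problems with a term vector request URL.

    Assumption: the Term Vector Component is registered only on a
    handler named /tvrh (as in Solr's sample configs).
    """
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    problems = []

    # Last path segment is the request handler, e.g. "select" or "tvrh".
    handler = parts.path.rsplit("/", 1)[-1]
    if handler != "tvrh":
        problems.append(
            f"handler is /{handler}; the term vector component may not be configured there"
        )

    # The component is only activated when tv=true is sent.
    if params.get("tv") != ["true"]:
        problems.append("tv=true is missing")

    return problems

url = ("http://localhost:8983/solr/myCore/select"
       "?q=999&tf=true&tf=true&tv.positions=true&tv.offsets=true")
for problem in check_tv_request(url):
    print(problem)
```

Run against the /select URL above, this reports both the handler and the missing tv=true, which would be consistent with positions and offsets not showing up in the response.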



> On May 1, 2019, at 4:52 PM, Erik Hatcher <er...@gmail.com> wrote:
> 
> q=doc_content?    Try q=id:"<some known id that you've indexed>"
> 
> Solr Cell and DIH are comparable (in that they are about getting content into Solr) but "unrelated" to TVRH.   TVRH is about inspecting indexed content, regardless of how it got in.
> 
> 	Erik
> 
> 


Re: Term Freq Vector with SOLR cell?

Posted by Erik Hatcher <er...@gmail.com>.
q=doc_content?    Try q=id:"<some known id that you've indexed>"

Solr Cell and DIH are comparable (in that they are about getting content into Solr) but "unrelated" to TVRH.   TVRH is about inspecting indexed content, regardless of how it got in.

	Erik
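
A minimal Python sketch of the kind of request suggested above, selecting one known document by id and asking for term vectors on the doc_content field from the original post. Assumptions, none of which are confirmed in this thread: a /tvrh handler is configured in solrconfig.xml, the unique key field is named id, and "doc-1" is a placeholder id. The function name build_tvrh_url is hypothetical.

```python
from urllib.parse import urlencode

def build_tvrh_url(base_url, core, doc_id, field):
    """Build a term vector request for a single document selected by id.

    Assumptions: a /tvrh handler exists and the unique key field is "id".
    """
    params = {
        "q": f'id:"{doc_id}"',   # query a known document rather than a field name
        "fl": "id",
        "tv": "true",            # switches the term vector component on
        "tv.fl": field,          # field whose term vectors should be returned
        "tv.tf": "true",
        "tv.df": "true",
        "tv.positions": "true",
        "tv.offsets": "true",
        "tv.payloads": "true",
        "wt": "json",
    }
    return f"{base_url}/{core}/tvrh?{urlencode(params)}"

# "doc-1" is a placeholder; substitute an id that is actually in the index.
url = build_tvrh_url("http://localhost:8983/solr", "myCore", "doc-1", "doc_content")
print(url)
```

Note that tv.fl here names doc_content; the tv.fl=includes in the original request points at a different field, so it would only return vectors if a field called includes exists in the schema with term vectors enabled.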

