Posted to solr-user@lucene.apache.org by ZiYuan <zi...@gmail.com> on 2017/06/17 22:04:24 UTC

Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Hi,

I am new to Solr and I need to implement a full-text search of some PDF
files. The indexing part works out of the box by using bin/post. I can see
search results in the admin UI given some queries, though without the
matched texts and the context.

Now I am reading this post
<http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
for the highlighting part. It targets an older version of Solr, from before
the managed schema was available. Before I fully understand what it is doing,
I have some questions:

1. He defined two fields:

<field name="content" type="text_general" indexed="false" stored="true"
multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false"
multiValued="true"/>

But why are two fields needed? Can I define a single field

<field name="content" type="text_general" indexed="true" stored="true"
multiValued="true"/>

to capture the full text?

2. How are the fields filled? I don't see relevant information in
TikaEntityProcessor's documentation
<https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
The current text extractor should already be Tika (I can see

"x_parsed_by":
["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]

in the returned JSON of some query). But even if I define the fields as he
describes, I cannot see them as keys in the JSON search results.

3. The _text_ field seems to be a concatenation of other fields. Does it
contain the full text? It does not seem to be accessible by default.

To be brief, using The Elements of Statistical Learning
<http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
as an example, how do I highlight the relevant text for the query "SVM"? And
if I rename the file to "The Elements of Statistical Learning -
Trevor Hastie.pdf" and post it, how do I highlight "Trevor Hastie" for the
query "id:Trevor Hastie"?

Thank you.

Best regards,
Ziyuan

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Yeah, Chris knows a thing or two about Tika.  :)

-----Original Message-----
From: ZiYuan [mailto:ziyuang@gmail.com] 
Sent: Tuesday, June 20, 2017 8:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

No intention of spamming but I also want to mention tika-python <https://github.com/chrismattmann/tika-python> in the toolchain.

Ziyuan


Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by ZiYuan <zi...@gmail.com>.

No intention of spamming but I also want to mention tika-python
<https://github.com/chrismattmann/tika-python> in the toolchain.

Ziyuan

On Tue, Jun 20, 2017 at 2:29 PM, ZiYuan <zi...@gmail.com> wrote:

> Dear Erick and Timothy,
>
> I also took a look at the Python clients (say, SolrClient and pysolr)
> because Python is my main programming language. I have an impression that
> 1. they send HTTP requests to the server according to the server APIs; 2.
> they are not official and thus possibly not up to date. Does SolrJ talk to
> the server via HTTP or some other more native ways? Is the main benefit of
> SolrJ over other clients the official shipment with Solr? Thank you.
>
> Best regards,
> Ziyuan
>
> On Jun 19, 2017 18:43, "ZiYuan" <zi...@gmail.com> wrote:
>
>> Dear Erick and Timothy,
>>
>> yes I will parse from the client for all the benefits. I am just trying
>> to figure out what is going on by indexing one or two PDF files first.
>> Thank you both.
>>
>> Best regards,
>> Ziyuan
>>
>> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson <er...@gmail.com>
>> wrote:
>>
>>> bq: Hope that there is no side effect of not mapping the PDF
>>>
>>> Well, yes it will have that side effect. You can cure that with a
>>> copyField directive from content to _text_.
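One way to apply the copyField suggestion without editing the schema by hand is the Schema API. The sketch below only builds the JSON commands (add-field and add-copy-field) and shows how they might be POSTed. The host, the collection name "pdfs", and the helper names are assumptions for illustration. If the "content" field already exists, the add-field command should be dropped.

```python
import json
import urllib.request

SOLR_URL = "http://localhost:8983/solr/pdfs"  # assumed host and collection name


def content_field_commands():
    """Build Schema API commands: a stored, indexed 'content' field copied to _text_."""
    return {
        "add-field": {
            "name": "content",
            "type": "text_general",
            "indexed": True,
            "stored": True,  # stored so the field can be returned and highlighted
        },
        # Keep the default catch-all field searchable, as suggested in the thread.
        "add-copy-field": {"source": "content", "dest": "_text_"},
    }


def post_schema_commands(commands, base_url=SOLR_URL):
    """POST the commands to <collection>/schema and return the parsed JSON reply."""
    request = urllib.request.Request(
        base_url + "/schema",
        data=json.dumps(commands).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    print(json.dumps(content_field_commands(), indent=2))
    # post_schema_commands(content_field_commands())  # needs a running Solr
```

Documents indexed before the copyField was added would need to be re-posted for their content to reach _text_.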
>>>
>>> But do really consider running this as a SolrJ program on the client.
>>> Tim knows in far more painful detail than I do what kinds of problems
>>> there are when parsing all the different formats so I'd _really_
>>> follow his advice.
>>>
>>> Tika pretty much has an impossible job. "Here, try to parse all these
>>> different formats, implemented by different vendors with different
>>> versions that more or less follow a spec which really isn't a spec in
>>> many cases just recommendations using packages that may or may not be
>>> actively maintained. And by the way, we'll try to handle that 1G
>>> document that someone sends us, but don't blame us if we hit an
>>> OOM.....". When Tika is run on the same box as Solr any problems in
>>> that entire chain can adversely affect your search.
>>>
>>> Not to mention that Tika has to do some heavy lifting, using CPU
>>> cycles that are unavailable for Solr.
>>>
>>> Extracting Request Handler is a fine way to get started, but for
>>> production seriously consider a separate client.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <zi...@gmail.com> wrote:
>>> > Hi Erick,
>>> >
>>> > Now it is clear. I have to update the request handler of
>>> > /update/extract/ from
>>> > "defaults":{"fmap.content":"_text_"}
>>> > to
>>> > "defaults":{"fmap.content":"content"}
>>> > to fill the field.
>>> >
>>> > Hope that there is no side effect of not mapping the PDF content to
>>> > _text_.
>>> > Thank you for the hint.
>>> >
>>> > Best regards,
>>> > Ziyuan
>>> >
>>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher <er...@gmail.com>
>>> > wrote:
>>> >
>>> >> Ziyuan -
>>> >>
>>> >> You may be interested in the example/files that ships with Solr too.
>>> >> It’s got schema and config and even UI for file indexing and searching.
>>> >> Check out README.txt under example/files in your Solr install.
>>> >>
>>> >>         Erik
>>> >>
>>> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan <zi...@gmail.com> wrote:
>>> >> >
>>> >> > Hi Erick,
>>> >> >
>>> >> > thanks very much for the explanations! Clarification for question 2:
>>> >> > more specifically, I cannot see the field content in the returned JSON
>>> >> > with the same definitions as in the post
>>> >> > <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>:
>>> >> >
>>> >> > <field name="content" type="text_general" indexed="false" stored="true"/>
>>> >> > <field name="text" type="text_general" multiValued="true" indexed="true"
>>> >> > stored="false"/>
>>> >> > <copyField source="content" dest="text"/>
>>> >> >
>>> >> > Is it so that Tika does not fill these two fields automatically and I
>>> >> > have to write some client code to fill them?
>>> >> >
>>> >> > Best regards,
>>> >> > Ziyuan
>>> >> >
>>> >> >
>>> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <erickerickson@gmail.com>
>>> >> > wrote:
>>> >> >
>>> >> >> 1> Yes, you can use your single definition. The author identifies the
>>> >> >> "text" field as a catch-all. Somewhere in the schema there'll be a
>>> >> >> copyField directive copying (perhaps) many different fields to the
>>> >> >> "text" field. That permits simple searches against a single field
>>> >> >> rather than, say, using edismax to search across multiple separate
>>> >> >> fields.
>>> >> >>
>>> >> >> 2> The link you referenced is for Data Import Handler, which is much
>>> >> >> different than just posting files to Solr. See
>>> >> >> ExtractingRequestHandler:
>>> >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>>> >> >> There are ways to map meta-data fields from the doc into specific
>>> >> >> fields matching your schema. Be a little careful here. There is no
>>> >> >> standard across different types of docs as to what meta-data field is
>>> >> >> included. PDF might have a "last_edited" field. Word might have a
>>> >> >> "last_modified" field where the two mean the same thing. Here's a link
>>> >> >> to a SolrJ program that'll dump all the fields:
>>> >> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
>>> >> >> hack out the DB bits.
>>> >> >>
>>> >> >> BTW, once you get more familiar with processing, I strongly recommend
>>> >> >> you do the document processing on the client; the reasons are outlined
>>> >> >> in that article.
>>> >> >>
>>> >> >> bq: even I define the fields as he said I cannot see them in the
>>> >> >> search results as keys in JSON
>>> >> >> Are the fields set as stored="true"? They must be to be returned in
>>> >> >> requests (skipping the docValues discussion here).
>>> >> >>
>>> >> >> 3> Yes, the text field is a concatenation of all the other ones.
>>> >> >> Because it has stored=false, you can only search it; you cannot
>>> >> >> highlight or view it. Fields you highlight must have stored=true BTW.
>>> >> >>
>>> >> >> Whether or not you can highlight "Trevor Hastie" depends on a lot of
>>> >> >> things, most particularly whether that text is ever actually in a
>>> >> >> field in your index. There's no guarantee that the name of the file
>>> >> >> is indexed in a searchable/highlightable way.
>>> >> >>
>>> >> >> And the query q=id:Trevor Hastie won't do what you think. It'll be
>>> >> >> parsed as
>>> >> >> id:Trevor _text_:Hastie
>>> >> >> _text_ is the default field; look for a "df" parameter in your request
>>> >> >> handler in solrconfig.xml (usually "/select" or "/query").
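The points above about stored fields, the default field, and multi-word queries can be sketched as request parameters. This is a minimal illustration and not a tested recipe: it assumes a stored field named "content" that holds the extracted text, and the helper names are made up for the example. The hl, hl.fl, hl.snippets, hl.fragsize, and df names are standard Solr query parameters. Quoting the phrase keeps both words in one clause, although whether id:"Trevor Hastie" matches anything still depends on how the id field is indexed.

```python
from urllib.parse import urlencode


def quote_phrase(text):
    """Wrap text in double quotes, escaping backslashes and embedded quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def highlight_params(query, field="content", snippets=3, fragsize=150):
    """Build /select parameters that request highlighted snippets from `field`."""
    return {
        "q": query,
        "df": field,            # default field for terms without a field prefix
        "hl": "true",           # turn highlighting on
        "hl.fl": field,         # field to highlight; it must be stored
        "hl.snippets": snippets,
        "hl.fragsize": fragsize,
        "wt": "json",
    }


if __name__ == "__main__":
    # Query string for the "SVM" example from the original question.
    print(urlencode(highlight_params("SVM")))
    # A phrase stays attached to its field instead of spilling into the default field.
    print("id:" + quote_phrase("Trevor Hastie"))
```

The resulting query string would be appended to the collection's /select URL; the snippets come back in a separate "highlighting" section of the response, keyed by document id.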

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by "Allison, Timothy B." <ta...@mitre.org>.

>http - however, the big advantage of doing your indexing on a different machine is that the heavy lifting that Tika does in extracting text from documents, finding metadata, etc. is not happening on the server. If the indexer crashes, it doesn’t affect Solr either.

+1 

for what can go wrong: http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf 

https://www.youtube.com/watch?v=vRPTPMwI53k&t=13s&index=43&list=PLbzoR-pLrL6pLDCyPxByWQwYTL-JrF5Rp

Really, we try our best on Tika, but sometimes bad things happen.  Let us know when they do, and we'll try to fix them.

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by Phil Scadden <P....@gns.cri.nz>.

http - however, the big advantage of doing your indexing on a different machine is that the heavy lifting that Tika does in extracting text from documents, finding metadata, etc. is not happening on the server. If the indexer crashes, it doesn’t affect Solr either.
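As a rough sketch of that client-side arrangement in Python: text extraction happens in a separate process (for example with Tika on the indexing machine), and only the finished documents are sent to Solr over HTTP. The collection name "pdfs", the host, the "content" field, and the helper names below are assumptions for illustration. Any metadata keys passed along would have to exist in the schema or be accepted by a schemaless configuration.

```python
import json
import urllib.request

SOLR_URL = "http://localhost:8983/solr/pdfs"  # assumed host and collection name


def build_document(doc_id, text, metadata=None):
    """Combine already-extracted text and optional metadata into one Solr document."""
    doc = {"id": doc_id, "content": text}
    for key, value in (metadata or {}).items():
        doc.setdefault(key, value)  # never overwrite id or content
    return doc


def index_documents(docs, base_url=SOLR_URL):
    """POST a list of documents as JSON to <collection>/update and commit."""
    request = urllib.request.Request(
        base_url + "/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # The text would come from the extraction step; a literal stands in here.
    doc = build_document("esl.pdf", "support vector machines ...", {"title": "ESL"})
    print(json.dumps([doc], indent=2))
    # index_documents([doc])  # needs a running Solr
```

If the extraction step crashes or runs out of memory on a bad file, only the indexing script is affected and Solr keeps serving queries.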

-----Original Message-----
From: ZiYuan [mailto:ziyuang@gmail.com]
Sent: Tuesday, 20 June 2017 11:29 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Dear Erick and Timothy,

I also took a look at the Python clients (say, SolrClient and pysolr) because Python is my main programming language. I have an impression that 1. they send HTTP requests to the server according to the server APIs; 2. they are not official and thus possibly not up to date. Does SolrJ talk to the server via HTTP or some other more native ways? Is the main benefit of SolrJ over other clients the official shipment with Solr? Thank you.

Best regards,
Ziyuan


Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by ZiYuan <zi...@gmail.com>.

Dear Erick and Timothy,

I also took a look at the Python clients (say, SolrClient and pysolr)
because Python is my main programming language. My impression is that
(1) they send HTTP requests to the server according to the server APIs, and
(2) they are not official and thus possibly not up to date. Does SolrJ talk
to the server via HTTP or in some other, more native way? Is the main
benefit of SolrJ over other clients that it ships officially with Solr?
Thank you.

Best regards,
Ziyuan
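
A note on point 1: the HTTP side can be seen without any client library.
Below is a minimal sketch, not a definitive recipe. It assumes a local Solr
on the default port, a core named "pdfs" (a hypothetical name) and a stored
field named "content". The hl, hl.fl, hl.snippets and hl.fragsize parameters
are standard Solr highlighting parameters; everything else is illustrative.

```python
from urllib.parse import urlencode

# Assumptions: local Solr on the default port and a core called "pdfs"
# (hypothetical name). The highlighted field must be stored="true".
SOLR_SELECT = "http://localhost:8983/solr/pdfs/select"


def highlight_url(query, field="content", snippets=3, fragsize=200):
    """Build a /select URL that asks Solr for highlighted snippets."""
    params = {
        "q": f"{field}:{query}",   # search the given field explicitly
        "hl": "true",              # switch highlighting on
        "hl.fl": field,            # field(s) to build snippets from
        "hl.snippets": snippets,   # maximum snippets per document
        "hl.fragsize": fragsize,   # approximate snippet size in characters
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    print(highlight_url("SVM"))
```

Fetching that URL (for example with urllib.request.urlopen) should return
JSON in which the snippets sit under a top-level "highlighting" key, keyed
by document id, next to the usual "response" section.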


Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by ZiYuan <zi...@gmail.com>.

Dear Erick and Timothy,

Yes, I will do the parsing on the client, for all the benefits you describe.
For now I am just trying to figure out what is going on by indexing one or
two PDF files first. Thank you both.

Best regards,
Ziyuan


Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by Erick Erickson <er...@gmail.com>.

bq: Hope that there is no side effect of not mapping the PDF

Well, yes, there is a side effect: the PDF content no longer ends up in
_text_, the default search field. You can cure that with a copyField
directive from content to _text_.

But do seriously consider running this as a SolrJ program on the client.
Tim knows in far more painful detail than I do what kinds of problems
arise when parsing all the different formats, so I'd _really_
follow his advice.

Tika pretty much has an impossible job. "Here, try to parse all these
different formats, implemented by different vendors with different
versions that more or less follow a spec (which in many cases isn't
really a spec, just recommendations), using packages that may or may not
be actively maintained. And by the way, we'll try to handle that 1G
document that someone sends us, but don't blame us if we hit an
OOM...". When Tika is run on the same box as Solr, any problem in
that entire chain can adversely affect your search.

Not to mention that Tika has to do some heavy lifting, using CPU
cycles that are then unavailable to Solr.

The ExtractingRequestHandler is a fine way to get started, but for
production seriously consider a separate client.

Best,
Erick
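
The copyField cure mentioned above can be sketched as a Schema API call,
assuming a managed schema. The add-copy-field command with "source" and
"dest" is part of the Schema API; the host, the core name and the helper
function name are illustrative assumptions, not something from this thread.

```python
import json


def copy_field_command(source="content", dest="_text_"):
    """Return the Schema API JSON body that adds one copyField rule."""
    # "dest" is given as a list here; one source can feed several targets.
    return json.dumps({"add-copy-field": {"source": source, "dest": [dest]}})


if __name__ == "__main__":
    # POST this body with Content-Type: application/json to
    # http://localhost:8983/solr/<core>/schema (host and core are assumptions).
    print(copy_field_command())
```

copyField rules are applied at index time, so documents indexed before the
rule was added need to be re-indexed before their content shows up in
_text_.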


Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by ZiYuan <zi...@gmail.com>.

Hi Erick,

Now it is clear. I have to change the defaults of the /update/extract
request handler from
"defaults":{"fmap.content":"_text_"}
to
"defaults":{"fmap.content":"content"}
to fill the content field.

I hope there is no side effect of no longer mapping the PDF content to _text_.
Thank you for the hint.

Best regards,
Ziyuan
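
For reference, a rough sketch of how such a change could be expressed as a
Config API request body. It assumes the Config API's update-requesthandler
command can be used for this handler and that the command replaces the
whole handler definition, which is why the existing defaults are passed in
and carried over. The function name and the sample defaults are
illustrative only.

```python
import json


def remap_content_command(current_defaults, target_field="content"):
    """Build a Config API body that points fmap.content at target_field.

    current_defaults is the handler's existing "defaults" map; it is
    copied so that unrelated defaults are not lost in the update.
    """
    defaults = dict(current_defaults)
    defaults["fmap.content"] = target_field
    return json.dumps({
        "update-requesthandler": {
            "name": "/update/extract",
            "class": "solr.extraction.ExtractingRequestHandler",
            "defaults": defaults,
        }
    })


if __name__ == "__main__":
    # POST the body to http://localhost:8983/solr/<core>/config
    # (host and core name are assumptions).
    print(remap_content_command({"fmap.content": "_text_"}))
```

With this mapping the extracted body lands in the content field, which has
to be stored="true" for highlighting to work.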

On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher <er...@gmail.com>
wrote:

> Ziyuan -
>
> You may be interested in the example/files that ships with Solr too. It’s
> got a schema, a config and even a UI for file indexing and searching. Check
> out README.txt under example/files in your Solr install.
>
>         Erik
>
> > On Jun 19, 2017, at 6:52 AM, ZiYuan <zi...@gmail.com> wrote:
> >
> > Hi Erick,
> >
> > Thanks very much for the explanations! A clarification for question 2:
> > more specifically, I cannot see the field content in the returned JSON
> > with the same definitions as in the post
> > <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>:
> >
> > <field name="content" type="text_general" indexed="false" stored="true"/>
> > <field name="text" type="text_general" multiValued="true" indexed="true"
> > stored="false"/>
> > <copyField source="content" dest="text"/>
> >
> > Is it the case that Tika does not fill these two fields automatically
> > and that I have to write some client code to fill them?
> >
> > Best regards,
> > Ziyuan
> >
> >
> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> >> 1> Yes, you can use your single definition. The author identifies the
> >> "text" field as a catch-all. Somewhere in the schema there'll be a
> >> copyField directive copying (perhaps) many different fields to the
> >> "text" field. That permits simple searches against a single field
> >> rather than, say, using edismax to search across multiple separate
> >> fields.
> >>
> >> 2> The link you referenced is for Data Import Handler, which is much
> >> different than just posting files to Solr. See
> >> ExtractingRequestHandler:
> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> >> There are ways to map meta-data fields from the doc into specific
> >> fields matching your schema. Be a little careful here. There is no
> >> standard across different types of docs as to what meta-data field is
> >> included. PDF might have a "last_edited" field. Word might have a
> >> "last_modified" field where the two mean the same thing. Here's a link
> >> to a SolrJ program that'll dump all the fields:
> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
> >> hack out the DB bits.
> >>
> >> BTW, once you get more familiar with processing, I strongly recommend
> >> you do the document processing on the client; the reasons are outlined
> >> in that article.
> >>
> >> bq: even I define the fields as he said I cannot see them in the
> >> search results as keys in JSON
> >> are the fields set as stored="true"? They must be to be returned in
> >> requests (skipping the docValues discussion here).
> >>
> >> 3> Yes, the text field is a concatenation of all the other ones.
> >> Because it has stored=false, you can only search it, you cannot
> >> highlight or view. Fields you highlight must have stored=true BTW.
> >>
> >> Whether or not you can highlight "Trevor Hastie" depends on a lot of
> >> things, most particularly whether that text is ever actually in a
> >> field in your index. There's no guarantee that the name
> >> of the file is indexed in a searchable/highlightable way.
> >>
> >> And the query q=id:Trevor Hastie won't do what you think. It'll be
> >> parsed as
> >> id:Trevor _text_:Hastie
> >> _text_ is the default field; look for a "df" parameter in your request
> >> handler in solrconfig.xml (usually "/select" or "/query").
> >>
> >> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <zi...@gmail.com> wrote:
> >>> Hi,
> >>>
> >>> I am new to Solr and I need to implement a full-text search of some PDF
> >>> files. The indexing part works out of the box by using bin/post. I can
> >> see
> >>> search results in the admin UI given some queries, though without the
> >>> matched texts and the context.
> >>>
> >>> Now I am reading this post
> >>> <http://www.codewrecks.com/blog/index.php/2013/05/27/
> >> hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
> >>> for the highlighting part. It is for an older version of Solr when
> >> managed
> >>> schema was not available. Before fully understand what it is doing I
> have
> >>> some questions:
> >>>
> >>> 1. He defined two fields:
> >>>
> >>> <field name="content" type="text_general" indexed="false" stored="true"
> >>> multiValued="false"/>
> >>> <field name="text" type="text_general" indexed="true" stored="false"
> >>> multiValued="true"/>
> >>>
> >>> But why are there two fields needed? Can I define a field
> >>>
> >>> <field name="content" type="text_general" indexed="true" stored="true"
> >>> multiValued="true"/>
> >>>
> >>> to capture the full text?
> >>>
> >>> 2. How are the fields filled? I don't see relevant information in
> >>> TikaEntityProcessor's documentation
> >>> <https://lucene.apache.org/solr/6_6_0/solr-
> dataimporthandler-extras/org/
> >> apache/solr/handler/dataimport/TikaEntityProcessor.html#
> >> fields.inherited.from.class.org.apache.solr.handler.
> >> dataimport.EntityProcessorBase>.
> >>> The current text extractor should already be Tika (I can see
> >>>
> >>> "x_parsed_by":
> >>> ["org.apache.tika.parser.DefaultParser","org.apache.
> >> tika.parser.pdf.PDFParser"]
> >>>
> >>> in the returned JSON of some query). But even I define the fields as he
> >>> said I cannot see them in the search results as keys in JSON.
> >>>
> >>> 3. The _text_ field seems a concatenation of other fields, does it
> >> contain
> >>> the full text? Though it does not seem to be accessible by default.
> >>>
> >>> To be brief, using The Elements of Statistical Learning
> >>> <http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/
> >> ESLII_print10.pdf>
> >>> as an example, how to highlight the relevant texts for the query "SVM"?
> >> And
> >>> if changing the file name into "The Elements of Statistical Learning -
> >>> Trevor Hastie.pdf" and post it, how to highlight "Trevor Hastie" for
> the
> >>> query "id:Trevor Hastie"?
> >>>
> >>> Thank you.
> >>>
> >>> Best regards,
> >>> Ziyuan
> >>
>
>

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by "Allison, Timothy B." <ta...@mitre.org>.

> There is no standard across different types of docs as to what meta-data field is
> included. PDF might have a "last_edited" field. Word might have a
> "last_modified" field where the two mean the same thing.

On Tika, we _try_ to normalize fields according to various standards, the most prominent being Dublin Core, so that "author" in one format and "creator" in another will both be mapped to "dc:creator".  That said:

1) There are plenty of areas where we could do a better job of normalizing.  Please let us know how to improve!
2) No matter how well we normalize, there are some metadata items that are specific to various file formats. I strongly recommend running Tika against a representative batch of documents and deciding which fields you need for your application.

Finally, if there's a chance you want metadata from embedded documents/attachments, check out the RecursiveParserWrapper.  Under legacy Tika, if you have a bunch of images in a zip file, you'd never get the lat/longs, and you'd never get "dc:creator" from an MSWord file sent as an attachment in an MSG file.

Finally, and I mean it this time, I heartily second Erick's point about SolrJ and the need to keep your file processing outside of Solr's JVM, VM and M!





Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by Erik Hatcher <er...@gmail.com>.

Ziyuan -

You may also be interested in example/files, which ships with Solr.  It has a schema, config, and even a UI for file indexing and searching.  Check out README.txt under example/files in your Solr install.

	Erik


Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by ZiYuan <zi...@gmail.com>.

Hi Erick,

Thanks very much for the explanations! To clarify question 2: more
specifically, I cannot see the field "content" in the returned JSON, even with
the same definitions as in the post
<http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>:

<field name="content" type="text_general" indexed="false" stored="true"/>
<field name="text" type="text_general" multiValued="true" indexed="true"
stored="false"/>
<copyField source="content" dest="text"/>

Does this mean that Tika does not fill these two fields automatically, and
that I have to write some client code to fill them?

Best regards,
Ziyuan



Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Posted by Erick Erickson <er...@gmail.com>.

1> Yes, you can use your single definition. The author identifies the
"text" field as a catch-all. Somewhere in the schema there'll be a
copyField directive copying (perhaps) many different fields to the
"text" field. That permits simple searches against a single field
rather than, say, using edismax to search across multiple separate
fields.
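
As a rough sketch of the single-field approach: the field name "content" comes
from the question, while the copyField into _text_ is an assumption about how
the catch-all field is populated in a given schema.

```xml
<!-- One field that is both searchable (indexed) and
     retrievable/highlightable (stored). -->
<field name="content" type="text_general" indexed="true" stored="true"
       multiValued="true"/>

<!-- Optional: also copy the text into the catch-all field used for
     default searches. Assumes a catch-all named _text_ already exists. -->
<copyField source="content" dest="_text_"/>
```

With a managed schema, the same definitions would normally be added through
the Schema API or the admin UI rather than by hand-editing the schema file.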

2> The link you referenced is for Data Import Handler, which is much
different than just posting files to Solr. See
ExtractingRequestHandler:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika.
There are ways to map meta-data fields from the doc into specific
fields matching your schema. Be a little careful here. There is no
standard across different types of docs as to what meta-data field is
included. PDF might have a "last_edited" field. Word might have a
"last_modified" field where the two mean the same thing. Here's a link
to a SolrJ program that'll dump all the fields:
https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
hack out the DB bits.
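
A minimal sketch of such a mapping, assuming an /update/extract handler in
solrconfig.xml and a stored field named "content" in the schema. lowernames,
fmap.content, fmap.<name> and uprefix are documented Solr Cell parameters; the
target field names, and the "last_edited" metadata name borrowed from the
example above, are placeholders.

```xml
<requestHandler name="/update/extract" startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Normalize metadata names to lowercase with underscores. -->
    <str name="lowernames">true</str>
    <!-- Send the extracted body text to the stored "content" field. -->
    <str name="fmap.content">content</str>
    <!-- Map a format-specific metadata name onto a common schema field
         (both names are hypothetical). -->
    <str name="fmap.last_edited">last_modified</str>
    <!-- Prefix metadata that matches no schema field; assumes an
         ignored_* dynamic field is defined. -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

If the handler in use maps fmap.content to a non-stored field such as _text_,
the body text is searchable but never comes back in results, which may explain
a "content" field that is defined in the schema yet missing from the JSON.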

BTW, once you get more familiar with processing, I strongly recommend
you do the document processing on the client, the reasons are outlined
in that article.

bq: even I define the fields as he said I cannot see them in the
search results as keys in JSON

Are the fields set as stored="true"? They must be to be returned in
requests (skipping the docValues discussion here).

3> Yes, the text field is a concatenation of all the other ones.
Because it has stored=false, you can only search it, you cannot
highlight or view. Fields you highlight must have stored=true BTW.
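
To illustrate, here is a hedged sketch of highlighting defaults for a search
handler. It assumes the body text lives in a stored field named "content";
hl, hl.fl, hl.snippets and hl.fragsize are standard highlighting parameters,
and the values shown are arbitrary. This is a fragment to merge into an
existing handler definition, not a complete one.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Turn highlighting on for requests to this handler. -->
    <str name="hl">on</str>
    <!-- Highlight only in the stored field that holds the body text. -->
    <str name="hl.fl">content</str>
    <!-- Return up to three fragments of roughly 150 characters each. -->
    <int name="hl.snippets">3</int>
    <int name="hl.fragsize">150</int>
  </lst>
</requestHandler>
```

The same parameters can be passed per request instead, for example
q=content:SVM&hl=on&hl.fl=content. For a document as long as a whole book,
note that the highlighter analyzes only the first hl.maxAnalyzedChars
characters of a field (51200 by default), so matches deep in the text may
need that limit raised.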

Whether or not you can highlight "Trevor Hastie" depends on a lot of
things, most particularly whether that text is ever actually in a
field in your index. There's no guarantee that the name
of the file is indexed in a searchable/highlightable way.

And the query q=id:Trevor Hastie won't do what you think. It'll be parsed as
id:Trevor _text_:Hastie
because _text_ is the default field. Look for a "df" parameter in your request
handler in solrconfig.xml (usually "/select" or "/query").
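
For orientation, the default field setting might look like the fragment
below. This is a sketch, not a copy of any particular configset; in some
stock configsets the same default is set in an initParams block rather than
inside the handler itself.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Terms with no explicit field prefix are searched in this field. -->
    <str name="df">_text_</str>
  </lst>
</requestHandler>
```

To keep both words on the same field, quote or group them, as in
id:"Trevor Hastie" or id:(Trevor Hastie). Whether either form matches still
depends on how the id field is analyzed: if it is an untokenized string type,
only the complete stored value will match.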
