Posted to solr-user@lucene.apache.org by Betsey Benagh <be...@stresearch.com> on 2016/08/25 15:39:48 UTC

Question about indexing PDFs

Following the instructions in the quick start guide, I imported a bunch of PDF documents into my Solr 6.0 instance.  As far as I can tell from the documentation, there should be a 'content' field indexing, well, the content, but I don't see it in the schema for that collection.  Is there something obvious I might have missed?

Thanks!

Re: Question about indexing PDFs

Posted by Betsey Benagh <be...@stresearch.com>.

Erick,

I’m not sure of anything.  I’m new to Solr and find the documentation
extremely confusing.  I’ve searched the web and found tutorials/advice,
but they generally refer to older versions of Solr, and refer to
methods/settings/whatever that no longer exist. That’s why I’m asking for
help here.

I looked at the list of fields in the schema browser, and ‘content’ is not
there.  If that is not enough to ‘assume’ that the content is not being
indexed, then please enlighten me as to what is.

I inserted the docs in batches by posting them, following the ‘Quick
Start’ tutorial.  It seemed like a safe assumption that the tutorial on
the Solr site would be correct and produce desirable results.

What I really want to do is index the XML versions of the documents which
have been run through another system, but I cannot for the life of me
figure out how to do that.  I’ve tried, but the documentation about XML
makes no sense to me.  I thought indexing the PDF versions would be easier
and more straightforward, but perhaps that is not the case.

Thanks,

betsey
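
For the XML route mentioned above, a minimal sketch of the input shape may help. Solr's standard XML update handler expects documents in Solr's own <add><doc><field name="...">...</field></doc></add> format rather than arbitrary XML, so output from another system would first need to be mapped into that shape. The field names "id" and "content" and the helper name solr_add_xml are assumptions for illustration, not taken from this thread, and the target schema would need matching fields.

```python
import xml.etree.ElementTree as ET


def solr_add_xml(docs):
    """Build a Solr XML update message from a list of {field_name: value} dicts.

    Field names are hypothetical; they must exist in (or be matched by a
    dynamic field of) the target collection's schema.
    """
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(solr_add_xml([{"id": "doc1", "content": "hello"}]))
```

The resulting string could then be sent to /solr/<collection>/update?commit=true with Content-Type: text/xml.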

On 8/25/16, 5:39 PM, "Erick Erickson" <er...@gmail.com> wrote:

>That is always a dangerous assumption. Are you sure
>you're searching on the proper field? Are you sure it's indexed? Are
>you sure it's....
>
>The schema browser I indicated above will give you some
>idea what's actually in the field. You can not only see the
>fields Solr (actually Lucene) sees in your index, but you can
>also see what some of the terms are.
>
>Adding &debug=query and looking at the parsed query
>will show you what fields are being searched against. The
>most common causes of what you're describing are:
>
>- not searching against the field you think you are. This
>  is very easy to do without knowing it.
>
>- not actually having indexed="true" set in your schema
>
>- not committing after inserting the doc
>
>Best,
>Erick
>
>On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh <
>betsey.benagh@stresearch.com> wrote:
>
>> It looks like the metadata of the PDFs was indexed, but not the content
>> (which is what I was interested in).  Searches on terms I know exist in
>> the content come up empty.
>>
>> On 8/25/16, 2:16 PM, "Betsey Benagh" <be...@stresearch.com> wrote:
>>
>> >Right, that’s where I looked.  No ‘content’.  Which is what confused me.
>> >
>> >
>> >On 8/25/16, 1:56 PM, "Erick Erickson" <er...@gmail.com> wrote:
>> >
>> >>when you say "I don't see it in the schema for that collection" are you
>> >>talking schema.xml? managed_schema? Or actual documents in the index?
>> >>Often these are defined by dynamic fields and the like in the schema files.
>> >>
>> >>Take a look at the admin UI>>schema browser>>drop down and you'll see all
>> >>the actual fields in your index...
>> >>
>> >>Best,
>> >>Erick
>> >>
>> >>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
>> >><betsey.benagh@stresearch.com> wrote:
>> >>
>> >>> Following the instructions in the quick start guide, I imported a bunch
>> >>> of PDF documents into my Solr 6.0 instance.  As far as I can tell from
>> >>> the documentation, there should be a 'content' field indexing, well, the
>> >>> content, but I don't see it in the schema for that collection.  Is there
>> >>> something obvious I might have missed?
>> >>>
>> >>> Thanks!
>> >>>
>> >>>
>> >
>>
>>

RE: Question about indexing PDFs

Posted by Srinivasa Meenavalli <Sm...@zensar.com>.

Hi Betsey,

I ran some of the Solr 5.5 examples that use Apache Tika with the Data Import Handler. The content (text) field was not stored by default.
I can see the PDF contents in the returned documents once stored="true" is enabled.

solr start -e dih

<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

/solr/tika/select?q=*%3A*&wt=json&indent=true

<dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
        <entity name="tika-test" processor="TikaEntityProcessor"
                url="${solr.install.dir}/example/exampledocs/solr-word.pdf" format="text">
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
        </entity>
    </document>
</dataConfig>
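
If the collection uses a managed schema, a field like the "text" definition above can also be added through the Schema API instead of editing the schema file by hand. The sketch below only builds the JSON body for an add-field command; the helper name and its defaults are illustrative assumptions, and the body would be POSTed to /solr/<collection>/schema with Content-Type: application/json.

```python
import json


def add_field_payload(name, field_type="text_general", stored=True):
    """Return the JSON body for a Schema API add-field command.

    Mirrors the attributes of the <field> definition shown above; the
    helper itself is hypothetical and does not contact Solr.
    """
    command = {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": stored,       # stored=True is what makes the text come back in results
            "multiValued": True,
        }
    }
    return json.dumps(command)


if __name__ == "__main__":
    print(add_field_payload("text"))
```

Note that changing stored on an existing field only affects documents indexed afterwards, so the PDFs would need to be re-imported.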

Regards
Srinivas Meenavalli


Re: Question about indexing PDFs

Posted by Erick Erickson <er...@gmail.com>.

That is always a dangerous assumption. Are you sure
you're searching on the proper field? Are you sure it's indexed? Are
you sure it's....

The schema browser I indicated above will give you some
idea what's actually in the field. You can not only see the
fields Solr (actually Lucene) sees in your index, but you can
also see what some of the terms are.

Adding &debug=query and looking at the parsed query
will show you what fields are being searched against. The
most common causes of what you're describing are:

- not searching against the field you think you are. This
  is very easy to do without knowing it.

- not actually having indexed="true" set in your schema

- not committing after inserting the doc
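
A minimal sketch of the debug=query check described above, in Python. The base URL, the collection name ("gettingstarted") and the helper names are assumptions for illustration. The sketch only builds the request URL and reads the parsedquery entry from an already-decoded JSON response; it does not contact Solr.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, collection, query):
    """Build a /select URL that asks Solr to include query debug output."""
    params = {"q": query, "wt": "json", "debug": "query"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def parsed_query(response):
    """Return the parsed query string from a decoded JSON response, or None.

    With debug=query, Solr adds a "debug" section whose "parsedquery"
    entry shows which field(s) the query was actually run against.
    """
    return response.get("debug", {}).get("parsedquery")


if __name__ == "__main__":
    # Hypothetical local instance and collection name.
    print(debug_query_url("http://localhost:8983/solr", "gettingstarted", "content:solr"))
```

If the parsed query names a field other than the one holding the PDF text (for example a default catch-all field), that would point at the first cause in the list above.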

Best,
Erick

On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh <
betsey.benagh@stresearch.com> wrote:

> It looks like the metadata of the PDFs was indexed, but not the content
> (which is what I was interested in).  Searches on terms I know exist in
> the content come up empty.
>
> On 8/25/16, 2:16 PM, "Betsey Benagh" <be...@stresearch.com> wrote:
>
> >Right, that’s where I looked.  No ‘content’.  Which is what confused me.
> >
> >
> >On 8/25/16, 1:56 PM, "Erick Erickson" <er...@gmail.com> wrote:
> >
> >>when you say "I don't see it in the schema for that collection" are you
> >>talking schema.xml? managed_schema? Or actual documents in the index?
> >>Often
> >>these are defined by dynamic fields and the like in the schema files.
> >>
> >>Take a look at the admin UI>>schema browser>>drop down and you'll see all
> >>the actual fields in your index...
> >>
> >>Best,
> >>Erick
> >>
> >>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
> >><betsey.benagh@stresearch.com> wrote:
> >>
> >>> Following the instructions in the quick start guide, I imported a bunch
> >>> of PDF documents into my Solr 6.0 instance.  As far as I can tell from the
> >>> documentation, there should be a 'content' field indexing, well, the
> >>> content, but I don't see it in the schema for that collection.  Is there
> >>> something obvious I might have missed?
> >>>
> >>> Thanks!
> >>>
> >>>
> >
>
>

Re: Question about indexing PDFs

Posted by Betsey Benagh <be...@stresearch.com>.

It looks like the metadata of the PDFs was indexed, but not the content
(which is what I was interested in).  Searches on terms I know exist in
the content come up empty.

On 8/25/16, 2:16 PM, "Betsey Benagh" <be...@stresearch.com> wrote:

>Right, that’s where I looked.  No ‘content’.  Which is what confused me.
>
>
>On 8/25/16, 1:56 PM, "Erick Erickson" <er...@gmail.com> wrote:
>
>>when you say "I don't see it in the schema for that collection" are you
>>talking schema.xml? managed_schema? Or actual documents in the index?
>>Often
>>these are defined by dynamic fields and the like in the schema files.
>>
>>Take a look at the admin UI>>schema browser>>drop down and you'll see all
>>the actual fields in your index...
>>
>>Best,
>>Erick
>>
>>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
>><betsey.benagh@stresearch.com> wrote:
>>
>>> Following the instructions in the quick start guide, I imported a bunch
>>> of PDF documents into my Solr 6.0 instance.  As far as I can tell from the
>>> documentation, there should be a 'content' field indexing, well, the
>>> content, but I don't see it in the schema for that collection.  Is there
>>> something obvious I might have missed?
>>>
>>> Thanks!
>>>
>>>
>

Re: Question about indexing PDFs

Posted by Betsey Benagh <be...@stresearch.com>.

Right, that’s where I looked.  No ‘content’.  Which is what confused me.


On 8/25/16, 1:56 PM, "Erick Erickson" <er...@gmail.com> wrote:

>when you say "I don't see it in the schema for that collection" are you
>talking schema.xml? managed_schema? Or actual documents in the index?
>Often
>these are defined by dynamic fields and the like in the schema files.
>
>Take a look at the admin UI>>schema browser>>drop down and you'll see all
>the actual fields in your index...
>
>Best,
>Erick
>
>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
><betsey.benagh@stresearch.com> wrote:
>
>> Following the instructions in the quick start guide, I imported a bunch
>> of PDF documents into my Solr 6.0 instance.  As far as I can tell from the
>> documentation, there should be a 'content' field indexing, well, the
>> content, but I don't see it in the schema for that collection.  Is there
>> something obvious I might have missed?
>>
>> Thanks!
>>
>>

Re: Question about indexing PDFs

Posted by Erick Erickson <er...@gmail.com>.

when you say "I don't see it in the schema for that collection" are you
talking schema.xml? managed_schema? Or actual documents in the index? Often
these are defined by dynamic fields and the like in the schema files.

Take a look at the admin UI>>schema browser>>drop down and you'll see all
the actual fields in your index...

Best,
Erick

On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh <betsey.benagh@stresearch.com> wrote:

> Following the instructions in the quick start guide, I imported a bunch of
> PDF documents into my Solr 6.0 instance.  As far as I can tell from the
> documentation, there should be a 'content' field indexing, well, the
> content, but I don't see it in the schema for that collection.  Is there
> something obvious I might have missed?
>
> Thanks!
>
>