Posted to solr-user@lucene.apache.org by Srijan <sh...@gmail.com> on 2020/02/17 15:24:54 UTC

Metadata info on Stored Fields

Hi,

I have a data model where the operational "Object" can have one or more
files attached. Indexing these objects in Solr means indexing all metadata
and the contents of the files. For file contents, what I have right now
is a single multi-valued field per locale.

Example:
<doc>
<metadata field 1>xxx
<metadata field 2>yyy
<file_content_en_US> portion of file 1
<file_content_en_US> remaining portion of file 1
<file_content_en_US> portion of file 2
<file_content_en_US> contents from file 2 again...
...
</doc>
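
As an illustration only, the same layout can be written as a JSON update payload. This is a minimal sketch: the field names `metadata_field_1` and `metadata_field_2` stand in for the placeholders above, and the `id` value is an assumption that does not appear in the original example.

```python
import json

# Hypothetical field names standing in for the "<metadata field N>"
# placeholders in the example above.
doc = {
    "id": "object-1",  # assumed unique key, not shown in the original example
    "metadata_field_1": "xxx",
    "metadata_field_2": "yyy",
    # A multi-valued field is sent as a JSON array. Nothing in the values
    # records which file each one came from.
    "file_content_en_US": [
        "portion of file 1",
        "remaining portion of file 1",
        "portion of file 2",
        "contents from file 2 again...",
    ],
}

# JSON updates are sent as a list of documents.
payload = json.dumps([doc], indent=2)
print(payload)
```

The array makes the problem visible: once the values are flattened into one field, the link between a value and its file is gone.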

Search is easy and everything has been working fine. We recently introduced
highlighting on these file content fields. Again, a straightforward use
case. The next requirement is where things get a little tricky. We want to
be able to return the name of the file (or, more generally, some other
metadata related to the file content field). If our data model had a 1:1
relation between our operational object and the file it contains, the file
name would simply be another field on the main doc. Unfortunately that's
not the case: each file content value could belong to any of the attached
files.

There are a couple of potential solutions I have been thinking of:
1. Use nested docs to preserve the logical grouping of file content and the
info about the file this content comes from. This could potentially work,
but I haven't done any testing yet (I know highlighting doesn't work on
nested docs, for example).

2. Encode the file name in the file content fields themselves. The file
name would be stripped out at index time but kept in the stored value. How
do I get the file name included in each snippet fragment - this again needs
exploring on my end.
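
A minimal client-side sketch of option 2, under several assumptions. The `[[file:...]]` marker format and all function names are hypothetical. It assumes the marker is kept out of the indexed terms by the analysis chain (for instance with a pattern-replace char filter), that highlighting uses the default `<em>` tags, and that a snippet is a verbatim substring of the stored value.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical marker format; any delimiter that cannot occur in real
# file content would do.
MARKER = re.compile(r"^\[\[file:(?P<name>[^\]]+)\]\]\s*")


def encode(file_name: str, chunk: str) -> str:
    """Prefix a content chunk with its file name before sending it to Solr."""
    return f"[[file:{file_name}]] {chunk}"


def decode(stored_value: str) -> Tuple[Optional[str], str]:
    """Split a stored value back into (file_name, content)."""
    match = MARKER.match(stored_value)
    if match is None:
        return None, stored_value
    return match.group("name"), stored_value[match.end():]


def file_for_snippet(snippet: str, stored_values: List[str],
                     pre: str = "<em>", post: str = "</em>") -> Optional[str]:
    """Find the file whose stored content contains the snippet text."""
    # Remove the highlight tags, then drop a leading marker in case the
    # fragment starts at the very beginning of the stored value.
    plain = snippet.replace(pre, "").replace(post, "")
    _, plain = decode(plain)
    for value in stored_values:
        name, content = decode(value)
        if plain in content:
            return name
    return None
```

The lookup returns the first file whose content contains the snippet, so identical passages in two files are ambiguous. It also requires fetching the stored values along with the highlights, which may be expensive for large files.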

Another approach I have been thinking about is extending StoredField to
also store additional metadata. So basically, when a stored field is
retrieved, or a fragment is returned, I would also have the additional
information associated with that stored field. Can someone tell me whether
this is a terrible idea that I should not pursue?

Is there something else I can try?

Thanks a lot,
Srijan

Re: Metadata info on Stored Fields

Posted by Edward Ribeiro <ed...@gmail.com>.
Sorry, my fault.

I overlooked this part of your message: "How do I get the file name
included in each snippet fragment - this again needs exploring on my end".
No, the solution I proposed doesn't address that. :(

Edward


Re: Metadata info on Stored Fields

Posted by Srijan <sh...@gmail.com>.
You know what, I think I left out a major detail in my earlier email. I
want to be able to return additional data from stored fields alongside the
snippets during highlighting. In this case, that is the name of the file
the snippet came from. I'm not sure your approach would address that.


Re: Metadata info on Stored Fields

Posted by Edward Ribeiro <ed...@gmail.com>.
Hi,

You could try creating two kinds of docs that form a parent-child
relationship without nesting, like this:

<doc>
<id>894</id>
<type>parent</type>

...
</doc>

<doc>
<id>3213</id>
<type>child</type>
<parent_id>894</parent_id>
<metadata field 1>xxx
<file_content_en_US> portion of file 1
<file_content_en_US> remaining portion of file 1
...
</doc>

Then you can add metadata to each child doc. The search can be run against
the child docs, and if you need to group results by parent you can use
either the join query parser (it has some limitations, though) or result
grouping on parent_id.
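
A rough sketch of the two flat document types and the request parameters for both options. The ids and field names follow the example above; `file_name`, its value, and the search term `portion` are hypothetical additions for illustration. Nothing is sent to a server here; the code only builds the parameters.

```python
from urllib.parse import urlencode

# Flat parent and child docs; ids are taken from the example above.
parent = {"id": "894", "type": "parent"}
child = {
    "id": "3213",
    "type": "child",
    "parent_id": "894",
    "file_name": "report.pdf",  # hypothetical per-file metadata
    "file_content_en_US": ["portion of file 1", "remaining portion of file 1"],
}

content_query = "file_content_en_US:portion"

# Option A: search the child docs and group the hits by their parent.
grouped = {
    "q": content_query,
    "fq": "type:child",
    "group": "true",
    "group.field": "parent_id",
}

# Option B: the join query parser returns the docs whose id matches the
# parent_id of any child that satisfies the inner query.
joined = {"q": "{!join from=parent_id to=id}" + content_query}

print(urlencode(grouped))
print(urlencode(joined))
```

With grouping, the hits are still child docs, so their own fields are available in the response. With the join, the result list contains parents only, so fields stored on the children are presumably not returned directly, which may be one of the limitations mentioned.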

Cheers,
Edward

