
Posted to dev@lucene.apache.org by Andrew Musselman <an...@gmail.com> on 2015/07/21 19:31:06 UTC

Parsing and indexing parts of the input file paths

Dear user and dev lists,

We are loading files from a directory and would like to index a portion of
each file path as a field as well as the text inside the file.

E.g., on HDFS we have this file path:

/user/andrew/1234/1234/file.pdf

And we would like the "1234" token parsed from the file path and indexed as
an additional field that can be searched on.

From my initial searches I can't see how to do this easily, so would I need
to write some custom code, or a plugin?

Thanks!

Re: Parsing and indexing parts of the input file paths

Posted by Andrew Musselman <an...@gmail.com>.

Which can only happen if I post it to a web service, and won't happen if I
do it through config?

On Tue, Jul 21, 2015 at 2:19 PM, Upayavira <uv...@odoko.co.uk> wrote:

> yes, unless it has been added consciously as a separate field.
>
> On Tue, Jul 21, 2015, at 09:40 PM, Andrew Musselman wrote:
> > Thanks, so by the time we would get to an Analyzer the file path is
> > forgotten?
> >
> > https://cwiki.apache.org/confluence/display/solr/Analyzers
> >
> > On Tue, Jul 21, 2015 at 1:27 PM, Upayavira <uv...@odoko.co.uk> wrote:
> >
> > > Solr generally does not interact with the file system in that way (with
> > > the exception of the DIH).
> > >
> > > It is the job of the code that pushes a file to Solr to process the
> > > filename and send that along with the request.
> > >
> > > See here for more info:
> > >
> > > https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> > >
> > > You could provide literal.filename=blah/blah
> > >
> > > Upayavira
> > >
> > >
> > > On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> > > > I'm not sure, it's a remote team but will get more info.  For now,
> > > > assuming
> > > > that a certain directory is specified, like "/user/andrew/", and a
> regex
> > > > is
> > > > applied to capture anything two directories below matching
> "*/*/*.pdf".
> > > >
> > > > Would there be a way to capture the wild-carded values and index
> them as
> > > > fields?
> > > >
> > > > On Tue, Jul 21, 2015 at 11:20 AM, Upayavira <uv...@odoko.co.uk> wrote:
> > > >
> > > > > Keeping to the user list (the right place for this question).
> > > > >
> > > > > More information is needed here - how are you getting these
> documents
> > > > > into Solr? Are you posting them to /update/extract? Or using DIH,
> or?
> > > > >
> > > > > Upayavira
> > > > >
> > > > > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > > > > Dear user and dev lists,
> > > > > >
> > > > > > We are loading files from a directory and would like to index a
> > > portion
> > > > > > of
> > > > > > each file path as a field as well as the text inside the file.
> > > > > >
> > > > > > E.g., on HDFS we have this file path:
> > > > > >
> > > > > > /user/andrew/1234/1234/file.pdf
> > > > > >
> > > > > > And we would like the "1234" token parsed from the file path and
> > > indexed
> > > > > > as
> > > > > > an additional field that can be searched on.
> > > > > >
> > > > > > From my initial searches I can't see how to do this easily, so
> would
> > > I
> > > > > > need
> > > > > > to write some custom code, or a plugin?
> > > > > >
> > > > > > Thanks!
> > > > >
> > >
>

Re: Parsing and indexing parts of the input file paths

Posted by Upayavira <uv...@odoko.co.uk>.

Yes, unless it has been deliberately added as a separate field.

On Tue, Jul 21, 2015, at 09:40 PM, Andrew Musselman wrote:
> Thanks, so by the time we would get to an Analyzer the file path is
> forgotten?
> 
> https://cwiki.apache.org/confluence/display/solr/Analyzers
> 
> On Tue, Jul 21, 2015 at 1:27 PM, Upayavira <uv...@odoko.co.uk> wrote:
> 
> > Solr generally does not interact with the file system in that way (with
> > the exception of the DIH).
> >
> > It is the job of the code that pushes a file to Solr to process the
> > filename and send that along with the request.
> >
> > See here for more info:
> >
> > https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> >
> > You could provide literal.filename=blah/blah
> >
> > Upayavira
> >
> >
> > On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> > > I'm not sure, it's a remote team but will get more info.  For now,
> > > assuming
> > > that a certain directory is specified, like "/user/andrew/", and a regex
> > > is
> > > applied to capture anything two directories below matching "*/*/*.pdf".
> > >
> > > Would there be a way to capture the wild-carded values and index them as
> > > fields?
> > >
> > > On Tue, Jul 21, 2015 at 11:20 AM, Upayavira <uv...@odoko.co.uk> wrote:
> > >
> > > > Keeping to the user list (the right place for this question).
> > > >
> > > > More information is needed here - how are you getting these documents
> > > > into Solr? Are you posting them to /update/extract? Or using DIH, or?
> > > >
> > > > Upayavira
> > > >
> > > > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > > > Dear user and dev lists,
> > > > >
> > > > > We are loading files from a directory and would like to index a
> > portion
> > > > > of
> > > > > each file path as a field as well as the text inside the file.
> > > > >
> > > > > E.g., on HDFS we have this file path:
> > > > >
> > > > > /user/andrew/1234/1234/file.pdf
> > > > >
> > > > > And we would like the "1234" token parsed from the file path and
> > indexed
> > > > > as
> > > > > an additional field that can be searched on.
> > > > >
> > > > > From my initial searches I can't see how to do this easily, so would
> > I
> > > > > need
> > > > > to write some custom code, or a plugin?
> > > > >
> > > > > Thanks!
> > > >
> >

Re: Parsing and indexing parts of the input file paths

Posted by Andrew Musselman <an...@gmail.com>.

Thanks, so by the time we would get to an Analyzer the file path is
forgotten?

https://cwiki.apache.org/confluence/display/solr/Analyzers

On Tue, Jul 21, 2015 at 1:27 PM, Upayavira <uv...@odoko.co.uk> wrote:

> Solr generally does not interact with the file system in that way (with
> the exception of the DIH).
>
> It is the job of the code that pushes a file to Solr to process the
> filename and send that along with the request.
>
> See here for more info:
>
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>
> You could provide literal.filename=blah/blah
>
> Upayavira
>
>
> On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> > I'm not sure, it's a remote team but will get more info.  For now,
> > assuming
> > that a certain directory is specified, like "/user/andrew/", and a regex
> > is
> > applied to capture anything two directories below matching "*/*/*.pdf".
> >
> > Would there be a way to capture the wild-carded values and index them as
> > fields?
> >
> > On Tue, Jul 21, 2015 at 11:20 AM, Upayavira <uv...@odoko.co.uk> wrote:
> >
> > > Keeping to the user list (the right place for this question).
> > >
> > > More information is needed here - how are you getting these documents
> > > into Solr? Are you posting them to /update/extract? Or using DIH, or?
> > >
> > > Upayavira
> > >
> > > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > > Dear user and dev lists,
> > > >
> > > > We are loading files from a directory and would like to index a
> portion
> > > > of
> > > > each file path as a field as well as the text inside the file.
> > > >
> > > > E.g., on HDFS we have this file path:
> > > >
> > > > /user/andrew/1234/1234/file.pdf
> > > >
> > > > And we would like the "1234" token parsed from the file path and
> indexed
> > > > as
> > > > an additional field that can be searched on.
> > > >
> > > > From my initial searches I can't see how to do this easily, so would
> I
> > > > need
> > > > to write some custom code, or a plugin?
> > > >
> > > > Thanks!
> > >
>

Re: Parsing and indexing parts of the input file paths

Posted by Upayavira <uv...@odoko.co.uk>.

Solr generally does not interact with the file system in that way (with
the exception of the DIH).

It is the job of the code that pushes a file to Solr to process the
filename and send that along with the request.

See here for more info:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

You could provide literal.filename=blah/blah

Upayavira
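
For illustration, the client-side step described above could look roughly like the following Python sketch. It rests on several assumptions: the Solr URL, the collection name `mycollection`, and the field name `path_token` are hypothetical, and the function only builds the request URL (the file itself would still have to be posted to it). The `literal.<fieldname>` parameters are the Solr Cell mechanism mentioned in the message.

```python
from urllib.parse import urlencode

# Assumptions: the host, port and collection name are placeholders.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/mycollection/update/extract"


def build_extract_url(file_path, token_index=2):
    """Build an /update/extract URL that carries part of the path as a literal field."""
    # "/user/andrew/1234/1234/file.pdf" -> ["user", "andrew", "1234", "1234", "file.pdf"]
    segments = [s for s in file_path.split("/") if s]
    params = {
        "literal.id": file_path,                      # full path as the document id
        "literal.path_token": segments[token_index],  # hypothetical field name
        "commit": "true",
    }
    return SOLR_EXTRACT_URL + "?" + urlencode(params)


print(build_extract_url("/user/andrew/1234/1234/file.pdf"))
```

The point of the sketch is that the path is parsed before the request is sent, so Solr receives the token as an ordinary field value.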


On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> I'm not sure, it's a remote team but will get more info.  For now,
> assuming
> that a certain directory is specified, like "/user/andrew/", and a regex
> is
> applied to capture anything two directories below matching "*/*/*.pdf".
> 
> Would there be a way to capture the wild-carded values and index them as
> fields?
> 
> On Tue, Jul 21, 2015 at 11:20 AM, Upayavira <uv...@odoko.co.uk> wrote:
> 
> > Keeping to the user list (the right place for this question).
> >
> > More information is needed here - how are you getting these documents
> > into Solr? Are you posting them to /update/extract? Or using DIH, or?
> >
> > Upayavira
> >
> > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > Dear user and dev lists,
> > >
> > > We are loading files from a directory and would like to index a portion
> > > of
> > > each file path as a field as well as the text inside the file.
> > >
> > > E.g., on HDFS we have this file path:
> > >
> > > /user/andrew/1234/1234/file.pdf
> > >
> > > And we would like the "1234" token parsed from the file path and indexed
> > > as
> > > an additional field that can be searched on.
> > >
> > > From my initial searches I can't see how to do this easily, so would I
> > > need
> > > to write some custom code, or a plugin?
> > >
> > > Thanks!
> >

Re: Parsing and indexing parts of the input file paths

Posted by Andrew Musselman <an...@gmail.com>.

I'm not sure; it's a remote team, but I will get more info.  For now, assume
that a certain directory is specified, like "/user/andrew/", and a pattern is
applied to capture anything two directories below matching "*/*/*.pdf".

Would there be a way to capture the wild-carded values and index them as
fields?
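
As a rough illustration of the capture being asked about, the Python sketch below turns the "*/*/*.pdf" idea into a regular expression with two capture groups. The base directory comes from the example in this thread; the pattern and the function name are assumptions made for the example, not part of any Solr API.

```python
import re

# Assumption: the base directory is known in advance, as in the example path.
BASE_DIR = "/user/andrew/"

# Two directory levels below the base directory, then a PDF file name.
PATH_PATTERN = re.compile(
    r"^" + re.escape(BASE_DIR) + r"([^/]+)/([^/]+)/[^/]+\.pdf$"
)


def capture_wildcards(path):
    """Return the two wildcard directory names, or None if the path does not match."""
    match = PATH_PATTERN.match(path)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(capture_wildcards("/user/andrew/1234/1234/file.pdf"))  # ('1234', '1234')
```

Each captured value could then be sent to Solr as its own field by whatever code pushes the file.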

On Tue, Jul 21, 2015 at 11:20 AM, Upayavira <uv...@odoko.co.uk> wrote:

> Keeping to the user list (the right place for this question).
>
> More information is needed here - how are you getting these documents
> into Solr? Are you posting them to /update/extract? Or using DIH, or?
>
> Upayavira
>
> On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > Dear user and dev lists,
> >
> > We are loading files from a directory and would like to index a portion
> > of
> > each file path as a field as well as the text inside the file.
> >
> > E.g., on HDFS we have this file path:
> >
> > /user/andrew/1234/1234/file.pdf
> >
> > And we would like the "1234" token parsed from the file path and indexed
> > as
> > an additional field that can be searched on.
> >
> > From my initial searches I can't see how to do this easily, so would I
> > need
> > to write some custom code, or a plugin?
> >
> > Thanks!
>

Re: Parsing and indexing parts of the input file paths

Posted by Upayavira <uv...@odoko.co.uk>.

Keeping to the user list (the right place for this question).

More information is needed here - how are you getting these documents
into Solr? Are you posting them to /update/extract? Or using DIH, or?

Upayavira

On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> Dear user and dev lists,
> 
> We are loading files from a directory and would like to index a portion
> of
> each file path as a field as well as the text inside the file.
> 
> E.g., on HDFS we have this file path:
> 
> /user/andrew/1234/1234/file.pdf
> 
> And we would like the "1234" token parsed from the file path and indexed
> as
> an additional field that can be searched on.
> 
> From my initial searches I can't see how to do this easily, so would I
> need
> to write some custom code, or a plugin?
> 
> Thanks!

Re: Parsing and indexing parts of the input file paths

Posted by Andrew Musselman <an...@gmail.com>.

Thanks; I don't know how the file path is getting into the id field.  Must
be some Tika default?

On Wed, Jul 22, 2015 at 9:52 AM, Erick Erickson <er...@gmail.com>
wrote:

> the id field is absolutely NOT the thing you need to try to parse.
> Assuming you're stuffing the file path into that field, use a
> copyField to copy the filepath into another text (not string)
> field and do your work there.
>
> As far as whether the filepath is in some other field, well, you have
> to put it there, either through Tika configurations or explicitly through
> your crawler.
>
> Best,
> Erick
>
> On Wed, Jul 22, 2015 at 9:47 AM, Andrew Musselman
> <an...@gmail.com> wrote:
> > Trying to figure out how to parse the file path, which when I run the
> > "cloud" instance becomes the "id" for each PDF document.
> >
> > Is that "id" field the thing to parse with PatternReplaceFilterFactory in
> > the config?  If not, is there a "file-path" field I can parse?
> >
> > On Wed, Jul 22, 2015 at 9:42 AM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> >> Don't understand your question. If you're talking two different
> >> fields, use copyField.
> >>
> >> On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
> >> <an...@gmail.com> wrote:
> >> > Fwding to user..
> >> >
> >> > ---------- Forwarded message ----------
> >> > From: Andrew Musselman <an...@gmail.com>
> >> > Date: Wed, Jul 22, 2015 at 8:54 AM
> >> > Subject: Re: Parsing and indexing parts of the input file paths
> >> > To: dev@lucene.apache.org
> >> >
> >> >
> >> > Thanks, and tell it to index the "id" field, which eventually contains
> >> the
> >> > file path?
> >> >
> >> > On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson <
> erickerickson@gmail.com
> >> >
> >> > wrote:
> >> >
> >> >> PatternReplaceFilterFactory would be just a configuration solution,
> >> >> construct a fieldType in schema.xml and you're done. It would require
> >> >> re-indexing of course.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
> >> >> <an...@gmail.com> wrote:
> >> >> > Erik, thanks; the prefix starting with "/user/andrew/" will be
> known,
> >> and
> >> >> > can be put into config, let's assume.  Would this be config-only or
> >> >> would it
> >> >> > require some code, and could you point to some classes I can start
> >> with
> >> >> if I
> >> >> > need to write code, and some up-to-date docs?
> >> >> >
> >> >> > Same for the update processor, is there an example I could read?
> >> >> >
> >> >> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher <
> >> erik.hatcher@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> If this is only for search, then an analysis chain could be
> crafted,
> >> >> >> likely with the pattern regex filter in the mix, to pull out
> pieces
> >> of
> >> >> the
> >> >> >> path.  How will you know the prefix of the file though?
> >> >> >>
> >> >> >> There’s also the ability to do this sort of thing in an update
> >> >> processor,
> >> >> >> most easily using the script update processor, using a bit of
> >> >> JavaScript to
> >> >> >> pull out the piece(s) you want to index (and even store at this
> >> point).
> >> >> >>
> >> >> >> —
> >> >> >> Erik Hatcher, Senior Solutions Architect
> >> >> >> http://www.lucidworks.com
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman <
> >> >> andrew.musselman@gmail.com>
> >> >> >> wrote:
> >> >> >>
> >> >> >> Dear user and dev lists,
> >> >> >>
> >> >> >> We are loading files from a directory and would like to index a
> >> portion
> >> >> of
> >> >> >> each file path as a field as well as the text inside the file.
> >> >> >>
> >> >> >> E.g., on HDFS we have this file path:
> >> >> >>
> >> >> >> /user/andrew/1234/1234/file.pdf
> >> >> >>
> >> >> >> And we would like the "1234" token parsed from the file path and
> >> indexed
> >> >> >> as an additional field that can be searched on.
> >> >> >>
> >> >> >> From my initial searches I can't see how to do this easily, so
> would
> >> I
> >> >> >> need to write some custom code, or a plugin?
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >> >>
> >>
>

Re: Parsing and indexing parts of the input file paths

Posted by Erick Erickson <er...@gmail.com>.

The id field is absolutely NOT the thing you need to try to parse.
Assuming you're stuffing the file path into that field, use a
copyField to copy the file path into another text (not string)
field and do your work there.

As for whether the file path is in some other field: you have
to put it there, either through Tika configuration or explicitly through
your crawler.

Best,
Erick

On Wed, Jul 22, 2015 at 9:47 AM, Andrew Musselman
<an...@gmail.com> wrote:
> Trying to figure out how to parse the file path, which when I run the
> "cloud" instance becomes the "id" for each PDF document.
>
> Is that "id" field the thing to parse with PatternReplaceFilterFactory in
> the config?  If not, is there a "file-path" field I can parse?
>
> On Wed, Jul 22, 2015 at 9:42 AM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> Don't understand your question. If you're talking two different
>> fields, use copyField.
>>
>> On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
>> <an...@gmail.com> wrote:
>> > Fwding to user..
>> >
>> > ---------- Forwarded message ----------
>> > From: Andrew Musselman <an...@gmail.com>
>> > Date: Wed, Jul 22, 2015 at 8:54 AM
>> > Subject: Re: Parsing and indexing parts of the input file paths
>> > To: dev@lucene.apache.org
>> >
>> >
>> > Thanks, and tell it to index the "id" field, which eventually contains
>> the
>> > file path?
>> >
>> > On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson <erickerickson@gmail.com
>> >
>> > wrote:
>> >
>> >> PatternReplaceFilterFactory would be just a configuration solution,
>> >> construct a fieldType in schema.xml and you're done. It would require
>> >> re-indexing of course.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
>> >> <an...@gmail.com> wrote:
>> >> > Erik, thanks; the prefix starting with "/user/andrew/" will be known,
>> and
>> >> > can be put into config, let's assume.  Would this be config-only or
>> >> would it
>> >> > require some code, and could you point to some classes I can start
>> with
>> >> if I
>> >> > need to write code, and some up-to-date docs?
>> >> >
>> >> > Same for the update processor, is there an example I could read?
>> >> >
>> >> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher <
>> erik.hatcher@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> If this is only for search, then an analysis chain could be crafted,
>> >> >> likely with the pattern regex filter in the mix, to pull out pieces
>> of
>> >> the
>> >> >> path.  How will you know the prefix of the file though?
>> >> >>
>> >> >> There’s also the ability to do this sort of thing in an update
>> >> processor,
>> >> >> most easily using the script update processor, using a bit of
>> >> JavaScript to
>> >> >> pull out the piece(s) you want to index (and even store at this
>> point).
>> >> >>
>> >> >> —
>> >> >> Erik Hatcher, Senior Solutions Architect
>> >> >> http://www.lucidworks.com
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman <
>> >> andrew.musselman@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >> Dear user and dev lists,
>> >> >>
>> >> >> We are loading files from a directory and would like to index a
>> portion
>> >> of
>> >> >> each file path as a field as well as the text inside the file.
>> >> >>
>> >> >> E.g., on HDFS we have this file path:
>> >> >>
>> >> >> /user/andrew/1234/1234/file.pdf
>> >> >>
>> >> >> And we would like the "1234" token parsed from the file path and
>> indexed
>> >> >> as an additional field that can be searched on.
>> >> >>
>> >> >> From my initial searches I can't see how to do this easily, so would
>> I
>> >> >> need to write some custom code, or a plugin?
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>>

Re: Parsing and indexing parts of the input file paths

Posted by Andrew Musselman <an...@gmail.com>.

Trying to figure out how to parse the file path, which when I run the
"cloud" instance becomes the "id" for each PDF document.

Is that "id" field the thing to parse with PatternReplaceFilterFactory in
the config?  If not, is there a "file-path" field I can parse?

On Wed, Jul 22, 2015 at 9:42 AM, Erick Erickson <er...@gmail.com>
wrote:

> Don't understand your question. If you're talking two different
> fields, use copyField.
>
> On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
> <an...@gmail.com> wrote:
> > Fwding to user..
> >
> > ---------- Forwarded message ----------
> > From: Andrew Musselman <an...@gmail.com>
> > Date: Wed, Jul 22, 2015 at 8:54 AM
> > Subject: Re: Parsing and indexing parts of the input file paths
> > To: dev@lucene.apache.org
> >
> >
> > Thanks, and tell it to index the "id" field, which eventually contains
> the
> > file path?
> >
> > On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> >> PatternReplaceFilterFactory would be just a configuration solution,
> >> construct a fieldType in schema.xml and you're done. It would require
> >> re-indexing of course.
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
> >> <an...@gmail.com> wrote:
> >> > Erik, thanks; the prefix starting with "/user/andrew/" will be known,
> and
> >> > can be put into config, let's assume.  Would this be config-only or
> >> would it
> >> > require some code, and could you point to some classes I can start
> with
> >> if I
> >> > need to write code, and some up-to-date docs?
> >> >
> >> > Same for the update processor, is there an example I could read?
> >> >
> >> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher <
> erik.hatcher@gmail.com>
> >> > wrote:
> >> >>
> >> >> If this is only for search, then an analysis chain could be crafted,
> >> >> likely with the pattern regex filter in the mix, to pull out pieces
> of
> >> the
> >> >> path.  How will you know the prefix of the file though?
> >> >>
> >> >> There’s also the ability to do this sort of thing in an update
> >> processor,
> >> >> most easily using the script update processor, using a bit of
> >> JavaScript to
> >> >> pull out the piece(s) you want to index (and even store at this
> point).
> >> >>
> >> >> —
> >> >> Erik Hatcher, Senior Solutions Architect
> >> >> http://www.lucidworks.com
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman <
> >> andrew.musselman@gmail.com>
> >> >> wrote:
> >> >>
> >> >> Dear user and dev lists,
> >> >>
> >> >> We are loading files from a directory and would like to index a
> portion
> >> of
> >> >> each file path as a field as well as the text inside the file.
> >> >>
> >> >> E.g., on HDFS we have this file path:
> >> >>
> >> >> /user/andrew/1234/1234/file.pdf
> >> >>
> >> >> And we would like the "1234" token parsed from the file path and
> indexed
> >> >> as an additional field that can be searched on.
> >> >>
> >> >> From my initial searches I can't see how to do this easily, so would
> I
> >> >> need to write some custom code, or a plugin?
> >> >>
> >> >> Thanks!
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
>

Re: Parsing and indexing parts of the input file paths

Posted by Erick Erickson <er...@gmail.com>.

I don't understand your question. If you're talking about two different
fields, use copyField.

On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
<an...@gmail.com> wrote:
> Fwding to user..
>
> ---------- Forwarded message ----------
> From: Andrew Musselman <an...@gmail.com>
> Date: Wed, Jul 22, 2015 at 8:54 AM
> Subject: Re: Parsing and indexing parts of the input file paths
> To: dev@lucene.apache.org
>
>
> Thanks, and tell it to index the "id" field, which eventually contains the
> file path?
>
> On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> PatternReplaceFilterFactory would be just a configuration solution,
>> construct a fieldType in schema.xml and you're done. It would require
>> re-indexing of course.
>>
>> Best,
>> Erick
>>
>> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
>> <an...@gmail.com> wrote:
>> > Erik, thanks; the prefix starting with "/user/andrew/" will be known, and
>> > can be put into config, let's assume.  Would this be config-only or
>> would it
>> > require some code, and could you point to some classes I can start with
>> if I
>> > need to write code, and some up-to-date docs?
>> >
>> > Same for the update processor, is there an example I could read?
>> >
>> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher <er...@gmail.com>
>> > wrote:
>> >>
>> >> If this is only for search, then an analysis chain could be crafted,
>> >> likely with the pattern regex filter in the mix, to pull out pieces of
>> the
>> >> path.  How will you know the prefix of the file though?
>> >>
>> >> There’s also the ability to do this sort of thing in an update
>> processor,
>> >> most easily using the script update processor, using a bit of
>> JavaScript to
>> >> pull out the piece(s) you want to index (and even store at this point).
>> >>
>> >> —
>> >> Erik Hatcher, Senior Solutions Architect
>> >> http://www.lucidworks.com
>> >>
>> >>
>> >>
>> >>
>> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman <
>> andrew.musselman@gmail.com>
>> >> wrote:
>> >>
>> >> Dear user and dev lists,
>> >>
>> >> We are loading files from a directory and would like to index a portion
>> of
>> >> each file path as a field as well as the text inside the file.
>> >>
>> >> E.g., on HDFS we have this file path:
>> >>
>> >> /user/andrew/1234/1234/file.pdf
>> >>
>> >> And we would like the "1234" token parsed from the file path and indexed
>> >> as an additional field that can be searched on.
>> >>
>> >> From my initial searches I can't see how to do this easily, so would I
>> >> need to write some custom code, or a plugin?
>> >>
>> >> Thanks!
>> >>
>> >>
>> >
>>
>>
>>

Fwd: Parsing and indexing parts of the input file paths

Posted by Andrew Musselman <an...@gmail.com>.

Fwding to user..

---------- Forwarded message ----------
From: Andrew Musselman <an...@gmail.com>
Date: Wed, Jul 22, 2015 at 8:54 AM
Subject: Re: Parsing and indexing parts of the input file paths
To: dev@lucene.apache.org


Thanks, and tell it to index the "id" field, which eventually contains the
file path?

On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson <er...@gmail.com>
wrote:

> PatternReplaceFilterFactory would be just a configuration solution,
> construct a fieldType in schema.xml and you're done. It would require
> re-indexing of course.
>
> Best,
> Erick
>
> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
> <an...@gmail.com> wrote:
> > Erik, thanks; the prefix starting with "/user/andrew/" will be known, and
> > can be put into config, let's assume.  Would this be config-only or
> would it
> > require some code, and could you point to some classes I can start with
> if I
> > need to write code, and some up-to-date docs?
> >
> > Same for the update processor, is there an example I could read?
> >
> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher <er...@gmail.com>
> > wrote:
> >>
> >> If this is only for search, then an analysis chain could be crafted,
> >> likely with the pattern regex filter in the mix, to pull out pieces of
> the
> >> path.  How will you know the prefix of the file though?
> >>
> >> There’s also the ability to do this sort of thing in an update
> processor,
> >> most easily using the script update processor, using a bit of
> JavaScript to
> >> pull out the piece(s) you want to index (and even store at this point).
> >>
> >> —
> >> Erik Hatcher, Senior Solutions Architect
> >> http://www.lucidworks.com
> >>
> >>
> >>
> >>
> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman <
> andrew.musselman@gmail.com>
> >> wrote:
> >>
> >> Dear user and dev lists,
> >>
> >> We are loading files from a directory and would like to index a portion
> of
> >> each file path as a field as well as the text inside the file.
> >>
> >> E.g., on HDFS we have this file path:
> >>
> >> /user/andrew/1234/1234/file.pdf
> >>
> >> And we would like the "1234" token parsed from the file path and indexed
> >> as an additional field that can be searched on.
> >>
> >> From my initial searches I can't see how to do this easily, so would I
> >> need to write some custom code, or a plugin?
> >>
> >> Thanks!
> >>
> >>
> >
>
>
>

Re: Parsing and indexing parts of the input file paths

Posted by Andrew Musselman <an...@gmail.com>.

Thanks, and tell it to index the "id" field, which eventually contains the
file path?

On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson <er...@gmail.com>
wrote:

> PatternReplaceFilterFactory would be just a configuration solution,
> construct a fieldType in schema.xml and you're done. It would require
> re-indexing of course.
>
> Best,
> Erick
>
> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
> <an...@gmail.com> wrote:
> > Erik, thanks; the prefix starting with "/user/andrew/" will be known, and
> > can be put into config, let's assume.  Would this be config-only or
> would it
> > require some code, and could you point to some classes I can start with
> if I
> > need to write code, and some up-to-date docs?
> >
> > Same for the update processor, is there an example I could read?
> >
> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher <er...@gmail.com>
> > wrote:
> >>
> >> If this is only for search, then an analysis chain could be crafted,
> >> likely with the pattern regex filter in the mix, to pull out pieces of
> the
> >> path.  How will you know the prefix of the file though?
> >>
> >> There’s also the ability to do this sort of thing in an update
> processor,
> >> most easily using the script update processor, using a bit of
> JavaScript to
> >> pull out the piece(s) you want to index (and even store at this point).
> >>
> >> —
> >> Erik Hatcher, Senior Solutions Architect
> >> http://www.lucidworks.com
> >>
> >>
> >>
> >>
> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman <
> andrew.musselman@gmail.com>
> >> wrote:
> >>
> >> Dear user and dev lists,
> >>
> >> We are loading files from a directory and would like to index a portion
> of
> >> each file path as a field as well as the text inside the file.
> >>
> >> E.g., on HDFS we have this file path:
> >>
> >> /user/andrew/1234/1234/file.pdf
> >>
> >> And we would like the "1234" token parsed from the file path and indexed
> >> as an additional field that can be searched on.
> >>
> >> From my initial searches I can't see how to do this easily, so would I
> >> need to write some custom code, or a plugin?
> >>
> >> Thanks!
> >>
> >>
> >
>
>
>

Re: Parsing and indexing parts of the input file paths

Posted by Erick Erickson <er...@gmail.com>.

PatternReplaceFilterFactory would be a configuration-only solution:
construct a fieldType in schema.xml and you're done. It would require
re-indexing, of course.
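
As a rough illustration of the kind of pattern such a fieldType might use
(an assumption, not something spelled out in this thread): if the whole path
is kept as a single token, a regex that matches the known prefix and captures
the next path segment can replace the token with just that segment. The
JavaScript below only demonstrates the regex on the example path. The prefix
"/user/andrew/" and the function name are illustrative, and in Solr the
pattern and replacement would be given as attributes on the filter in
schema.xml rather than run as code.

```javascript
// Hypothetical helper: mimics replacing the whole path token with the first
// segment that follows a known prefix (assumed here to be "/user/andrew/").
function replaceWithFirstSegment(path) {
  // ^/user/andrew/ anchors the known prefix, ([^/]+) captures the next
  // segment, and /.*$ consumes the rest so the whole token is replaced.
  var pattern = /^\/user\/andrew\/([^\/]+)\/.*$/;
  return path.replace(pattern, "$1");
}

console.log(replaceWithFirstSegment("/user/andrew/1234/1234/file.pdf"));
```

Paths that do not start with the assumed prefix are returned unchanged, so a
real analysis chain would need to decide how to treat them.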

Best,
Erick

On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
<an...@gmail.com> wrote:
> Erik, thanks; the prefix starting with "/user/andrew/" will be known, and
> can be put into config, let's assume.  Would this be config-only or would it
> require some code, and could you point to some classes I can start with if I
> need to write code, and some up-to-date docs?
>
> Same for the update processor, is there an example I could read?
>
> On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher <er...@gmail.com>
> wrote:
>>
>> If this is only for search, then an analysis chain could be crafted,
>> likely with the pattern regex filter in the mix, to pull out pieces of the
>> path.  How will you know the prefix of the file though?
>>
>> There’s also the ability to do this sort of thing in an update processor,
>> most easily using the script update processor, using a bit of JavaScript to
>> pull out the piece(s) you want to index (and even store at this point).
>>
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com
>>
>>
>>
>>
>> On Jul 21, 2015, at 1:31 PM, Andrew Musselman <an...@gmail.com>
>> wrote:
>>
>> Dear user and dev lists,
>>
>> We are loading files from a directory and would like to index a portion of
>> each file path as a field as well as the text inside the file.
>>
>> E.g., on HDFS we have this file path:
>>
>> /user/andrew/1234/1234/file.pdf
>>
>> And we would like the "1234" token parsed from the file path and indexed
>> as an additional field that can be searched on.
>>
>> From my initial searches I can't see how to do this easily, so would I
>> need to write some custom code, or a plugin?
>>
>> Thanks!
>>
>>
>

Re: Parsing and indexing parts of the input file paths

Posted by Andrew Musselman <an...@gmail.com>.

Erik, thanks; the prefix starting with "/user/andrew/" will be known, and
can be put into config, let's assume.  Would this be config-only or would
it require some code, and could you point to some classes I can start with
if I need to write code, and some up-to-date docs?

Same for the update processor, is there an example I could read?

On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher <er...@gmail.com>
wrote:

> If this is only for search, then an analysis chain could be crafted,
> likely with the pattern regex filter in the mix, to pull out pieces of the
> path.  How will you know the prefix of the file though?
>
> There’s also the ability to do this sort of thing in an update processor,
> most easily using the script update processor, using a bit of JavaScript to
> pull out the piece(s) you want to index (and even store at this point).
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com
>
>
>
>
> On Jul 21, 2015, at 1:31 PM, Andrew Musselman <an...@gmail.com>
> wrote:
>
> Dear user and dev lists,
>
> We are loading files from a directory and would like to index a portion of
> each file path as a field as well as the text inside the file.
>
> E.g., on HDFS we have this file path:
>
> /user/andrew/1234/1234/file.pdf
>
> And we would like the "1234" token parsed from the file path and indexed
> as an additional field that can be searched on.
>
> From my initial searches I can't see how to do this easily, so would I
> need to write some custom code, or a plugin?
>
> Thanks!
>
>
>

Re: Parsing and indexing parts of the input file paths

Posted by Erik Hatcher <er...@gmail.com>.

If this is only for search, then an analysis chain could be crafted, likely with the pattern regex filter in the mix, to pull out pieces of the path.  How will you know the prefix of the file though? 

There’s also the ability to do this sort of thing in an update processor, most easily using the script update processor, using a bit of JavaScript to pull out the piece(s) you want to index (and even store at this point).

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com
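
A minimal sketch of what the script update processor approach could look
like, under several assumptions that are not from this thread: the full file
path is assumed to arrive in the "id" field, "path_token_s" is a made-up
target field, and the prefix is hard-coded. The stub functions at the end are
included on the assumption that the script processor expects them to be
defined, as they are in Solr's sample update script.

```javascript
// Assumed known prefix, per the thread; would normally come from config.
var PREFIX = "/user/andrew/";

// Pure helper: returns the first path segment after the prefix, or null
// when the path does not start with the prefix or has no segment there.
function extractPathToken(path, prefix) {
  if (path.indexOf(prefix) !== 0) return null;
  var rest = path.substring(prefix.length);
  var slash = rest.indexOf("/");
  var token = slash === -1 ? rest : rest.substring(0, slash);
  return token.length > 0 ? token : null;
}

// Called for each document being added to the index.
function processAdd(cmd) {
  var doc = cmd.solrDoc;               // a SolrInputDocument
  var path = doc.getFieldValue("id");  // "id" holding the path is an assumption
  if (path != null) {
    var token = extractPathToken(String(path), PREFIX);
    // "path_token_s" is a hypothetical field name; adjust to the schema.
    if (token !== null) doc.setField("path_token_s", token);
  }
}

// No-op stubs for the other update events.
function processDelete(cmd) {}
function processMergeIndexes(cmd) {}
function processCommit(cmd) {}
function processRollback(cmd) {}
function finish() {}
```

Keeping the extraction in a separate pure function makes it possible to check
the path logic outside Solr before wiring the script into an update chain.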

> On Jul 21, 2015, at 1:31 PM, Andrew Musselman <an...@gmail.com> wrote:
> 
> Dear user and dev lists,
> 
> We are loading files from a directory and would like to index a portion of each file path as a field as well as the text inside the file.
> 
> E.g., on HDFS we have this file path:
> 
> /user/andrew/1234/1234/file.pdf
> 
> And we would like the "1234" token parsed from the file path and indexed as an additional field that can be searched on.
> 
> From my initial searches I can't see how to do this easily, so would I need to write some custom code, or a plugin?
> 
> Thanks!