Posted to solr-user@lucene.apache.org by Gytis Mikuciunas <gy...@gmail.com> on 2017/02/10 07:38:12 UTC

how to get modified field data if it doesn't exist in meta

Hi,

We have started using Solr to index our documents (vsd, vsdx, xls, xlsx,
doc, docx, pdf, txt).

We need a modified date for each file. MS Office files and PDFs carry this
value in their metadata.
The problem is txt files, which have no such value in their metadata.

Is there any way to get it from the OS level and force it into Solr when we
index?

p.s.

Windows 2012 server, single instance

typical command we use: java -Dauto -Dc=index_sandbox -Dport=80
-Dfiletypes=vsd,vsdx,xls,xlsx,doc,docx,pdf,txt -Dbasicauth=admin:xxxx -jar
example/exampledocs/post.jar "M:\DNS_dump"


Regards,

Gytis

Re: how to get modified field data if it doesn't exist in meta

Posted by Gytis Mikuciunas <gy...@gmail.com>.
As I understand it, TimestampUpdateProcessorFactory would insert the current
date (NOW). We don't want that.

Regards,
Gytis

On Feb 10, 2017 19:18, "Erick Erickson" <er...@gmail.com> wrote:

> Would TimestampUpdateProcessorFactory work?
>
> Best,
> Erick
>

Re: how to get modified field data if it doesn't exist in meta

Posted by Erick Erickson <er...@gmail.com>.
Would TimestampUpdateProcessorFactory work?

Best,
Erick

On Fri, Feb 10, 2017 at 4:59 AM, Alexandre Rafalovitch
<ar...@gmail.com> wrote:
> Custom update request processor that looks up a file from the name and gets
> the date should work.
>
> Regards,
>     Alex
>

Re: how to get modified field data if it doesn't exist in meta

Posted by Gytis Mikuciunas <gy...@gmail.com>.
Hi,

Could someone compile this into a jar file for me? I found something close to
what I need via Google:
http://stackoverflow.com/questions/20745935/set-last-modified-field-when-not-defined-in-document-in-solr

package modifiedG4;

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Factory that Solr instantiates for the update chain; it creates one
// processor per update request.
public class LastModifiedMergeProcessorFactory
    extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new LastModifiedMergeProcessor(next);
  }
}

// Fills in last_modified from file_date when the document metadata
// did not supply a date.
class LastModifiedMergeProcessor extends UpdateRequestProcessor {

  public LastModifiedMergeProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();

    Object metaDate = doc.getFieldValue("last_modified");
    Object fileDate = doc.getFieldValue("file_date");
    // Only fall back to file_date when no metadata date was extracted.
    if (metaDate == null && fileDate != null) {
      doc.addField("last_modified", fileDate);
    }

    // Pass the document on to the next processor in the chain.
    super.processAdd(cmd);
  }
}
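
A note on the class above: it only copies file_date into last_modified, so
something else still has to put a file_date field on the document, and nothing
in this thread does that yet. The fallback rule itself can be tried without a
Solr install. The sketch below is only an illustration: LastModifiedFallback
is a made-up name, and a plain Map stands in for SolrInputDocument.

```java
import java.util.Map;

/** Illustration only: a plain Map plays the role of SolrInputDocument. */
final class LastModifiedFallback {

  /** Copies file_date into last_modified only when no metadata date is present. */
  static void apply(Map<String, Object> doc) {
    Object metaDate = doc.get("last_modified");
    Object fileDate = doc.get("file_date");
    if (metaDate == null && fileDate != null) {
      doc.put("last_modified", fileDate);
    }
  }
}
```

With this rule a txt document that has only file_date ends up with
last_modified set, while a pdf that already has last_modified keeps its
metadata value.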

On Sun, Feb 12, 2017 at 8:45 PM, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> It would have to be a custom one. One you write. But I believe Tika
> would pass a file name as one of the parameters, so you just need to
> use standard Java API to look up the system date. That - of course -
> assumes that the files you index are on the same filesystem as Solr
> itself, so it could look it up.
>
> You can find more about the URPs at:
> https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors
> You can find the full list of the URPs at:
> http://www.solr-start.com/info/update-request-processors/
> If you are on the latest Solr 6.4, you would probably want to subclass
> SimpleUpdateProcessorFactory and follow the implementation example of
> TemplateUpdateProcessorFactory
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/solr/core/src/java/org/apache/solr/update/processor/TemplateUpdateProcessorFactory.java
>
> Alternatively, you could implement your URP in Javascript, but I am
> not sure that has an API to check file dates.
>
> Regards,
>    Alex.
> ----
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>

Re: how to get modified field data if it doesn't exist in meta

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
It would have to be a custom one, one you write yourself. I believe Tika
passes the file name as one of the parameters, so you just need the
standard Java API to look up the file's modification date on the
filesystem. That assumes, of course, that the files you index are on the
same filesystem as Solr itself, so that Solr can look them up.
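
A minimal sketch of that lookup, using only the standard library.
FileDateLookup and lastModifiedIso are hypothetical names, not Solr classes,
and how the path reaches your processor (a request parameter or a document
field) is an assumption you would need to check against your own setup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper, not part of Solr: reads a file's modification time from the OS. */
final class FileDateLookup {

  /**
   * Returns the file's last-modified time as an ISO-8601 UTC string,
   * truncated to whole seconds, or null if the path is invalid or the
   * file cannot be read.
   */
  static String lastModifiedIso(String path) {
    try {
      Instant modified = Files.getLastModifiedTime(Paths.get(path)).toInstant();
      return modified.truncatedTo(ChronoUnit.SECONDS).toString();
    } catch (IOException | RuntimeException e) {
      // Missing file, no permission, or a malformed path: report "no date".
      return null;
    }
  }
}
```

Inside a custom processor, the returned string could be added to the document
as the date field when the metadata did not provide one. The value is in UTC
and in the ISO-8601 form that Solr date fields accept.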

You can find more about the URPs at:
https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors
You can find the full list of the URPs at:
http://www.solr-start.com/info/update-request-processors/
If you are on the latest Solr 6.4, you would probably want to subclass
SimpleUpdateProcessorFactory and follow the implementation example of
TemplateUpdateProcessorFactory
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/solr/core/src/java/org/apache/solr/update/processor/TemplateUpdateProcessorFactory.java

Alternatively, you could implement your URP in JavaScript, but I am
not sure that it has an API to check file dates.

Regards,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 12 February 2017 at 13:28, Gytis Mikuciunas <gy...@gmail.com> wrote:
> Alexandre, could you provide some link or give more info about this
> processor?
> I'm novice in the solr world;)
>
>
> Regards,
> Gytis
>

Re: how to get modified field data if it doesn't exist in meta

Posted by Gytis Mikuciunas <gy...@gmail.com>.
Alexandre, could you provide a link or more information about this
processor?
I'm a novice in the Solr world ;)


Regards,
Gytis

On Feb 10, 2017 14:59, "Alexandre Rafalovitch" <ar...@gmail.com> wrote:

Custom update request processor that looks up a file from the name and gets
the date should work.

Regards,
    Alex


Re: how to get modified field data if it doesn't exist in meta

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Custom update request processor that looks up a file from the name and gets
the date should work.

Regards,
    Alex

On 10 Feb 2017 2:39 AM, "Gytis Mikuciunas" <gy...@gmail.com> wrote:

Hi,

We have started to use solr for our documents indexing (vsd, vsdx,
xls,xlsx, doc, docx, pdf, txt).

Modified date values is needed for each file. MS Office's files, pdfs have
this value.
Problem is with txt files as they don't have this value in their meta.

Is there any possibility to get it somehow from os level and force adding
it to solr when we do indexing.

p.s.

Windows 2012 server, single instance

typical command we use: java -Dauto -Dc=index_sandbox -Dport=80
-Dfiletypes=vsd,vsdx,xls,xlsx,doc,docx,pdf,txt -Dbasicauth=admin:xxxx -jar
example/exampledocs/post.jar "M:\DNS_dump"


Regards,

Gytis