Posted to solr-user@lucene.apache.org by Karan Saini <ma...@gmail.com> on 2018/01/29 10:18:23 UTC

Perform incremental import with PDF Files

Hi folks,

Please suggest a solution for importing and indexing PDF files
*incrementally*. My requirement is to pull the PDF files from a remote
network folder. New sets of PDF files arrive in this folder at regular
intervals (say, every 20 seconds), and the folder is emptied each time a
new set is copied into it. I do not want to lose the previously built index
of the old files when the next incremental import runs.

I am currently using Solr 6.6 for this research.

The data import handler config currently looks like this:

<!-- Remote access -->
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="K2FileEntity" processor="FileListEntityProcessor"
            dataSource="null"
            recursive="true"
            baseDir="\\CLDSINGH02\RemoteFileDepot"
            fileName=".*pdf" rootEntity="false">

      <field column="file" name="id"/>
      <field column="fileSize" name="size"/>
      <field column="fileLastModified" name="lastmodified"/>

      <entity name="pdf" processor="TikaEntityProcessor" onError="skip"
              url="${K2FileEntity.fileAbsolutePath}" format="text">

        <field column="title" name="title" meta="true"/>
        <field column="dc:format" name="format" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
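
For reference, an import against this config is triggered through the DIH
handler with the standard full-import command, along the lines of (host and
core name below are placeholders):

  http://localhost:8983/solr/mycore/dataimport?command=full-import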


Kind regards,
Karan Singh

Re: Perform incremental import with PDF Files

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
If you need to make a request to Solr that has a lot of custom
parameters and values, you can create an additional request handler
definition and put all those parameters in there, instead of
hardcoding them on the client side. See solrconfig.xml; there are lots
of examples there.
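
As a rough sketch (the handler name and config file name here are just
placeholders), such a definition in solrconfig.xml could look like:

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">tika-data-config.xml</str>
      <str name="clean">false</str>
      <str name="commit">true</str>
    </lst>
  </requestHandler>

With defaults like these, a plain command=full-import request picks up
clean=false without the client having to pass it explicitly.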

Regards,
   Alex.

On 29 January 2018 at 20:48, Karan Saini <ma...@gmail.com> wrote:
> Thanks, Emir :-). Setting the property *clean=false* worked for me.
>
> Is there a way I can selectively clean a particular index from the
> C#.NET code using the SolrNet API?
> Please suggest.
>
> Kind regards,
> Karan
>

Re: Perform incremental import with PDF Files

Posted by Emir Arnautović <em...@sematext.com>.
Hi Karan,
clean=false will not delete existing documents from the index, but if you reimport documents with the same ID they will be overwritten. If you see the same doc with an updated timestamp, it most likely means you did a full import of docs with the same file name (and therefore the same ID): since xtimestamp uses default="NOW", it gets a fresh value every time a document is reindexed, so it will change on every import even with clean=false.
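
For example (host and core name below are placeholders), you could run the import with clean=false and then check the total document count; with clean=false the count should never drop between imports, only grow or stay the same:

  http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false
  http://localhost:8983/solr/mycore/select?q=*:*&rows=0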

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 30 Jan 2018, at 08:34, Karan Saini <ma...@gmail.com> wrote:
> 
> Hi Emir,
> 
> There is one behavior I noticed while performing the incremental import. I
> added a new field to the managed-schema to test the incremental behavior
> of clean=false:
> 
>         *<field name="xtimestamp" type="date" indexed="true" stored="true"
> default="NOW" multiValued="false"/>*
> 
> Now xtimestamp gets a new value on every DIH import, even with
> clean=false. How can I tell whether clean=false is actually working?
> Please suggest.
> 
> Kind regards,
> Karan


Re: Perform incremental import with PDF Files

Posted by Karan Saini <ma...@gmail.com>.
Hi Emir,

There is one behavior I noticed while performing the incremental import. I
added a new field to the managed-schema to test the incremental behavior
of clean=false:

         *<field name="xtimestamp" type="date" indexed="true" stored="true"
default="NOW" multiValued="false"/>*

Now xtimestamp gets a new value on every DIH import, even with
clean=false. How can I tell whether clean=false is actually working?
Please suggest.

Kind regards,
Karan



On 29 January 2018 at 20:12, Emir Arnautović <em...@sematext.com>
wrote:

> Hi Karan,
> Glad it worked for you.
>
> I am not sure how to do it in the C# client, but adding the clean=false
> parameter to the URL should do the trick.
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>

Re: Perform incremental import with PDF Files

Posted by Emir Arnautović <em...@sematext.com>.
Hi Karan,
Glad it worked for you.

I am not sure how to do it in the C# client, but adding the clean=false parameter to the URL should do the trick.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 29 Jan 2018, at 14:48, Karan Saini <ma...@gmail.com> wrote:
> 
> Thanks, Emir :-). Setting the property *clean=false* worked for me.
> 
> Is there a way I can selectively clean a particular index from the
> C#.NET code using the SolrNet API?
> Please suggest.
> 
> Kind regards,
> Karan


Re: Perform incremental import with PDF Files

Posted by Karan Saini <ma...@gmail.com>.
Thanks, Emir :-). Setting the property *clean=false* worked for me.

Is there a way I can selectively clean a particular index from the
C#.NET code using the SolrNet API?
Please suggest.

Kind regards,
Karan


On 29 January 2018 at 16:49, Emir Arnautović <em...@sematext.com>
wrote:

> Hi Karan,
> Did you try running full import with clean=false?
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>

Re: Perform incremental import with PDF Files

Posted by Emir Arnautović <em...@sematext.com>.
Hi Karan,
Did you try running full import with clean=false?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


