Posted to user@manifoldcf.apache.org by Bisonti Mario <Ma...@vimar.com> on 2023/05/25 10:03:43 UTC

Long Job on Windows Share

Hi,
I would like to understand how recrawling works.

My job, which uses the "Windows shares" connection type, runs for nearly 18 hours.
My document count is roughly 1 million.

If I check the documents scanned by ManifoldCF I see, for example:
[inline image: image001.png]

It seems that it reprocesses each document every day, even if the document has not been modified.
So, is this the expected behavior, or did I choose the wrong kind of job to crawl the documents?

Thanks a lot
Mario



Re: Long Job on Windows Share

Posted by Bisonti Mario <Ma...@vimar.com>.
In manifoldcf.log I see many warnings like this one:
WARN 2023-06-05T21:36:51,630 (Worker thread '31') - JCIFS: Possibly transient exception detected on attempt 2 while getting share security: All pipe instances are busy.
jcifs.smb.SmbException: All pipe instances are busy.
        at jcifs.smb.SmbTransportImpl.checkStatus2(SmbTransportImpl.java:1441) ~[jcifs-ng-2.1.2.jar:?]
        at jcifs.smb.SmbTransportImpl.checkStatus(SmbTransportImpl.java:1552) ~[jcifs-ng-2.1.2.jar:?]
        at jcifs.smb.SmbTransportImpl.sendrecv(SmbTransportImpl.java:1007) ~[jcifs-ng-2.1.2.jar:?]
        at jcifs.smb.SmbTransportImpl.send(SmbTransportImpl.java:1523) ~[jcifs-ng-2.1.2.jar:?]
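
To get a feel for how often these warnings occur, the log can be tallied with a small script. This is only a minimal sketch: the regular expression is modelled on the single WARN line quoted above, the function name is hypothetical, and other ManifoldCF versions or log layouts may need a different pattern.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern modelled on the one WARN line quoted above (an assumption, not a
# documented log format).
WARN_RE = re.compile(
    r"WARN\s+(?P<ts>\S+)\s+\(Worker thread '(?P<thread>\d+)'\) - "
    r"JCIFS: Possibly transient exception detected on attempt "
    r"(?P<attempt>\d+) while (?P<op>.+?): (?P<msg>.*)$"
)


def tally_transient_warnings(lines: Iterable[str]) -> Counter:
    """Count JCIFS transient warnings per (operation, message) pair."""
    counts: Counter = Counter()
    for line in lines:
        match = WARN_RE.search(line.rstrip())
        if match:
            counts[(match["op"], match["msg"])] += 1
    return counts
```

Passing an open file handle for manifoldcf.log (wherever it lives on your installation) to `tally_transient_warnings` and printing `most_common(5)` on the result would show which operations hit the transient error most often.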


I don’t see any information there about whether documents are reindexed or not.

Do I have to look in a different file?

Thanks a lot, Mario

Re: Long Job on Windows Share

Posted by Bisonti Mario <Ma...@vimar.com>.
Thanks a lot Karl

In the “Simple History” in ManifoldCF I see the following for every document, every day, even if the document has not been modified:

26/05/23, 08:47:47         document ingest (SolrShare)     file://///...Avanzato%202014.pptx
26/05/23, 08:47:46         extract [TikaTrasform]          file://///...Avanzato%202014.pptx
26/05/23, 08:47:45         access                          file://///...Avanzato%202014.pptx
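
If the Simple History rows are copied out as text, a short script can list which documents were ingested on more than one day. This is a rough sketch under stated assumptions: the three-column layout and the day/month/year timestamp are inferred from the rows above, and both function names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import date, datetime
from typing import Iterable


def ingest_days(rows: Iterable[str]) -> dict[str, list[date]]:
    """Map each document identifier to the distinct days it was ingested."""
    days: dict[str, set[date]] = defaultdict(set)
    for row in rows:
        # Assumption: columns are separated by a tab or by 2+ spaces.
        parts = re.split(r"\s{2,}|\t", row.strip())
        if len(parts) != 3:
            continue
        stamp, activity, identifier = parts
        if not activity.startswith("document ingest"):
            continue
        try:
            when = datetime.strptime(stamp, "%d/%m/%y, %H:%M:%S")
        except ValueError:
            continue  # Skip rows whose timestamp has another format.
        days[identifier].add(when.date())
    return {doc: sorted(found) for doc, found in days.items()}


def reingested(rows: Iterable[str]) -> dict[str, list[date]]:
    """Keep only the documents that were ingested on more than one day."""
    return {doc: d for doc, d in ingest_days(rows).items() if len(d) > 1}
```

A document that shows up in `reingested` while its modified date has not changed would be a candidate for closer inspection.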


In Solr, I ran a query for the document and I see the following (extended results omitted):

{
  "responseHeader":{
    "status":0,
    "QTime":977,
    "params":{
      "q":"id:*Avanzato*202014*",
      "_":"1685082709862"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"file://///...Avanzato%202014.pptx",
        "last_modified":"2015-03-25T17:27:22Z",
        "resourcename":"...Avanzato 2014.pptx",
        "content_type":["application/vnd.openxmlformats-officedocument.presentationml.presentation"],
        "allow_token_document":["Active+Directory:S-1-5-21-……………..",
          "Active+Directory:S-1-..."],
        "deny_token_document":["Active+Directory:DEAD_AUTHORITY"],
        "allow_token_share":["Active+Directory:S-1-1-0"],
        "deny_token_share":["Active+Directory:DEAD_AUTHORITY"],
        "deny_token_parent":["__nosecurity__"],
        "allow_token_parent":["__nosecurity__"],
        "content":["ESER......
        "_version_":1766940934228934656}]
  }}


Is this what you meant when you mentioned the “activity log”?

I can see the document in Solr, so I assume it is being indexed.
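
One possible way to check whether the document is actually rewritten in Solr on each run is to record its `_version_` value on two different days and compare them; as far as I know, Solr assigns a new `_version_` whenever a document is updated. The following is a minimal sketch only: the select URL passed in is an assumption (host and core name depend on the installation), and both function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def version_query_url(select_url: str, id_pattern: str) -> str:
    """Build a select URL that returns only id, last_modified and _version_."""
    params = {
        "q": f"id:{id_pattern}",
        "fl": "id,last_modified,_version_",
        "wt": "json",
    }
    return f"{select_url}?{urlencode(params)}"


def fetch_versions(select_url: str, id_pattern: str) -> dict[str, int]:
    """Return {id: _version_} for matching documents (needs a reachable Solr)."""
    with urlopen(version_query_url(select_url, id_pattern)) as response:
        docs = json.load(response)["response"]["docs"]
    return {doc["id"]: doc["_version_"] for doc in docs}
```

Calling `fetch_versions` with the same id pattern as in the query above, once before and once after a job run, and comparing the two results would show whether the stored document changed.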

What else could I investigate?
Thanks a lot

Mario



From: Karl Wright <da...@gmail.com>
Sent: Friday, May 26, 2023 07:20
To: user@manifoldcf.apache.org
Subject: Re: Long Job on Windows Share

The jcifs connector does not include a lot of information in the version string for a file - basically, the length, and the modified date.  So I would not expect there to be a lot of actual work involved if there are no changes to a document.

The activity "access" does imply that the system believes that the document does need to be reindexed.  It clearly reads the document properly.  I would check to be sure it actually indexes the document.  I suspect that your job may be reading the file but determining it is not suitable for indexing and then repeating that every day.  You can see this by looking for the document in the activity log to see what ManifoldCF decided to do with it.

Karl





Re: Long Job on Windows Share

Posted by Karl Wright <da...@gmail.com>.
The jcifs connector does not include a lot of information in the version
string for a file - basically, the length, and the modified date.  So I
would not expect there to be a lot of actual work involved if there are no
changes to a document.
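
As an illustration of the idea only (this is not the connector's actual code), a version string built from length and modified date could be compared between crawls roughly like this. The function names and the in-memory `stored` dictionary are hypothetical stand-ins for whatever the crawler really persists.

```python
import os


def version_string(path: str) -> str:
    """Hypothetical version string: file length and modification time only."""
    st = os.stat(path)
    return f"{st.st_size}:{st.st_mtime_ns}"


def needs_reindex(path: str, stored: dict[str, str]) -> bool:
    """Return True, and record the new version, when the version changed."""
    current = version_string(path)
    if stored.get(path) == current:
        return False  # Same length and modified date: nothing to do.
    stored[path] = current
    return True
```

Under this scheme a file whose length and modified date are unchanged costs only a metadata lookup on a recrawl, which is why little work is expected for unmodified documents.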

The activity "access" does imply that the system believes that the document
does need to be reindexed.  It clearly reads the document properly.  I
would check to be sure it actually indexes the document.  I suspect that
your job may be reading the file but determining it is not suitable for
indexing and then repeating that every day.  You can see this by looking
for the document in the activity log to see what ManifoldCF decided to do
with it.

Karl


