Posted to user@manifoldcf.apache.org by Gustavo Beneitez <gu...@gmail.com> on 2018/08/22 16:35:17 UTC

Documents that didn't change are reindexed

Hi everyone,

I am currently creating a job that indexes part of a Liferay intranet's content.
Every time the job runs, the documents are fully reindexed in
Elastic, even though they didn't change.
I thought I had read somewhere that the crawler uses the "Last-Modified" HTTP
header, but also that it saves a hash into its database.
I looked in the user's manual for which one applies, but had no luck, so
could you please tell me which one is correct?

Thanks in advance!

Re: [External] Re: Documents that didn't change are reindexed

Posted by Gustavo Beneitez <gu...@gmail.com>.
Hi again, I managed to review the code and also capture the headers. One
header looks suspicious to me and I would like to exclude it, but I couldn't
find out how.
The code seems to look for "config properties", but the user interface does
not expose that. Do you know where it is configured?


protected static Set<String> findExcludedHeaders(Specification spec)
    throws ManifoldCFException
  {
    Set<String> rval = new HashSet<String>();
    int i = 0;
    while (i < spec.getChildCount())
    {
      SpecificationNode n = spec.getChild(i++);
      if (n.getType().equals(WebcrawlerConfig.NODE_EXCLUDEHEADER))
      {
        String value = n.getAttributeValue(WebcrawlerConfig.ATTR_VALUE);
        rval.add(value);
      }
    }
    return rval;
  }
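For reference, the exclusion logic above can be reproduced with plain collections. The sketch below is a simplified stand-in, not the real ManifoldCF API: the spec is modeled as a list of (type, value) pairs, and the downstream header filtering (which I assume lowercases names before comparing, as the version-string code in this thread does) is included for illustration:

```java
import java.util.*;

// Simplified stand-in for ManifoldCF's Specification tree: a list of
// (type, value) nodes. The real classes (Specification, SpecificationNode)
// live in org.apache.manifoldcf.core.interfaces and are not reproduced here.
public class ExcludedHeaders {
  static final String NODE_EXCLUDEHEADER = "excludeheader";

  // Mirrors findExcludedHeaders(): collect the value of every
  // "excludeheader" node found in the job specification.
  public static Set<String> findExcludedHeaders(List<String[]> spec) {
    Set<String> rval = new HashSet<>();
    for (String[] node : spec) {
      if (node[0].equals(NODE_EXCLUDEHEADER))
        rval.add(node[1]);
    }
    return rval;
  }

  // Drop excluded headers from a fetched header map, comparing
  // case-insensitively via a lowercased name.
  public static Map<String,String> filterHeaders(Map<String,String> headers,
                                                 Set<String> excluded) {
    Map<String,String> kept = new LinkedHashMap<>();
    for (Map.Entry<String,String> e : headers.entrySet()) {
      if (!excluded.contains(e.getKey().toLowerCase(Locale.ROOT)))
        kept.put(e.getKey(), e.getValue());
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String[]> spec = List.of(new String[]{NODE_EXCLUDEHEADER, "date"});
    Set<String> excluded = findExcludedHeaders(spec);
    Map<String,String> headers = new LinkedHashMap<>();
    headers.put("Date", "Thu, 23 Aug 2018 14:18:00 GMT");
    headers.put("Content-Type", "text/html");
    System.out.println(filterHeaders(headers, excluded).keySet());
    // prints [Content-Type]
  }
}
```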


Thanks again!


Re: [External] Re: Documents that didn't change are reindexed

Posted by Gustavo Beneitez <gu...@gmail.com>.
Hi,

thanks everyone.

@Karl, many thanks. I am going to write a little test and see what happens.

@Konrad, yes, you are right. I think Liferay is generating something that
might confuse the crawler. Let me write the test and find out what it is.

Thanks!


Re: [External] Re: Documents that didn't change are reindexed

Posted by "Holl, Konrad" <ko...@accenture.com>.
Did you check the "modified" header returned with the documents from Liferay? Some systems tend to always use "now", which could explain the behavior (this might even be a configuration option). You can see this in a browser's debug window when you reload the page a couple of times (Ctrl+F5 to force reloading).


-Konrad
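Konrad's suggestion can also be checked programmatically: fetch the page twice, then diff the two header maps to find the volatile headers. A minimal sketch of the comparison step (the HTTP fetching itself is omitted; the maps would come from two successive requests to the same URL):

```java
import java.util.*;

// Given the header maps from two successive fetches of the same URL,
// report the headers whose values differ. These are the candidates that
// defeat change detection, e.g. a Last-Modified stamped with "now" on
// every request.
public class HeaderDiff {
  public static Set<String> volatileHeaders(Map<String,String> first,
                                            Map<String,String> second) {
    Set<String> names = new TreeSet<>();
    names.addAll(first.keySet());
    names.addAll(second.keySet());
    Set<String> changed = new TreeSet<>();
    for (String name : names) {
      // Objects.equals also catches headers present in only one fetch.
      if (!Objects.equals(first.get(name), second.get(name)))
        changed.add(name);
    }
    return changed;
  }

  public static void main(String[] args) {
    Map<String,String> fetch1 = Map.of(
        "Last-Modified", "Thu, 23 Aug 2018 12:00:00 GMT",
        "Content-Type", "text/html");
    Map<String,String> fetch2 = Map.of(
        "Last-Modified", "Thu, 23 Aug 2018 12:00:05 GMT",  // "now" again
        "Content-Type", "text/html");
    System.out.println(volatileHeaders(fetch1, fetch2));
    // prints [Last-Modified]
  }
}
```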

________________________________

This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. Your privacy is important to us. Accenture uses your personal data only in compliance with data protection laws. For further information on how Accenture processes your personal data, please see our privacy statement at https://www.accenture.com/us-en/privacy-policy.
______________________________________________________________________________________

www.accenture.com

Re: Documents that didn't change are reindexed

Posted by Karl Wright <da...@gmail.com>.
I would suggest downloading the pages using curl a couple of times and
comparing content.
Headers also matter.  Here's the code:

>>>>>>
            // Calculate version from document data, which is presumed to be present.
            StringBuilder sb = new StringBuilder();

            // Acls
            packList(sb,acls,'+');
            if (acls.length > 0)
            {
              sb.append('+');
              pack(sb,defaultAuthorityDenyToken,'+');
            }
            else
              sb.append('-');

            // Now, do the metadata.
            Map<String,Set<String>> metaHash = new HashMap<String,Set<String>>();

            String[] fixedListStrings = new String[2];
            // They're all folded into the same part of the version string.
            int headerCount = 0;
            Iterator<String> headerIterator = fetchStatus.headerData.keySet().iterator();
            while (headerIterator.hasNext())
            {
              String headerName = headerIterator.next();
              String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
              if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
                headerCount += fetchStatus.headerData.get(headerName).size();
            }
            String[] fullMetadata = new String[headerCount];
            headerCount = 0;
            headerIterator = fetchStatus.headerData.keySet().iterator();
            while (headerIterator.hasNext())
            {
              String headerName = headerIterator.next();
              String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
              if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
              {
                Set<String> valueSet = metaHash.get(headerName);
                if (valueSet == null)
                {
                  valueSet = new HashSet<String>();
                  metaHash.put(headerName,valueSet);
                }
                List<String> headerValues = fetchStatus.headerData.get(headerName);
                for (String headerValue : headerValues)
                {
                  valueSet.add(headerValue);
                  fixedListStrings[0] = "header-"+headerName;
                  fixedListStrings[1] = headerValue;
                  StringBuilder newsb = new StringBuilder();
                  packFixedList(newsb,fixedListStrings,'=');
                  fullMetadata[headerCount++] = newsb.toString();
                }
              }
            }
            java.util.Arrays.sort(fullMetadata);

            packList(sb,fullMetadata,'+');
            // Done with the parseable part!  Add the checksum.
            sb.append(fetchStatus.checkSum);
            // Add the filter version
            sb.append("+");
            sb.append(filterVersion);

            String versionString = sb.toString();
<<<<<<

The "filter version" comes from your job specification and will change only
if you change the job specification, but everything else should be
self-explanatory.  Looks like all headers matter, so that could explain it.

Karl
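The effect described above can be demonstrated with a stripped-down analogue of the version string: sorted "header-Name=value" pairs plus a content checksum. The real connector packs these parts with escaping; the plain '+' join below is an assumption made only for illustration:

```java
import java.util.*;

// Stripped-down analogue of the Web Connector's version string: sorted
// "header-Name=value" pairs plus a content checksum, joined with '+'.
// It is enough to show why a header that changes on every fetch forces
// reindexing even when the page content (the checksum) is identical.
public class VersionString {
  public static String build(Map<String,String> headers,
                             Set<String> excludedLowercase,
                             String checksum) {
    List<String> parts = new ArrayList<>();
    for (Map.Entry<String,String> e : headers.entrySet()) {
      if (!excludedLowercase.contains(e.getKey().toLowerCase(Locale.ROOT)))
        parts.add("header-" + e.getKey() + "=" + e.getValue());
    }
    Collections.sort(parts);
    return String.join("+", parts) + "+" + checksum;
  }

  public static void main(String[] args) {
    String checksum = "abc123";  // identical page content on both fetches
    Map<String,String> fetch1 = Map.of("Date", "12:00:00", "Content-Type", "text/html");
    Map<String,String> fetch2 = Map.of("Date", "12:00:05", "Content-Type", "text/html");

    // With Date included, the version strings differ -> reindex every run.
    System.out.println(build(fetch1, Set.of(), checksum)
        .equals(build(fetch2, Set.of(), checksum)));  // false
    // With Date excluded, they match -> the document is skipped.
    System.out.println(build(fetch1, Set.of("date"), checksum)
        .equals(build(fetch2, Set.of("date"), checksum)));  // true
  }
}
```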



Re: Documents that didn't change are reindexed

Posted by Gustavo Beneitez <gu...@gmail.com>.
Thanks Karl,

I've run the job a couple of times against a small set of documents,
and what I see is that Elastic indexes every document each time, even
though the size of each document is always the same and I don't notice any
dynamic HTML content (such as the current time) that could cause the
checksum to differ.

Consulting the "Simple history" menu option shows that the Elastic output
connector is called:
"08-23-2018 06:27:19.274 Indexation (Elasticsearch 2.4.6)"
So I guess there is a misconfiguration somewhere...




Re: Documents that didn't change are reindexed

Posted by Karl Wright <da...@gmail.com>.
Hi Gustavo,

I take it from your question that you are using the Web Connector?

All connectors create a version string that is used to determine whether
content needs to be reindexed or not.  The Web Connector's version string
uses a checksum of the page contents; we found the "last modified" header
to be unreliable, if I recall correctly.

Thanks,
Karl
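The checksum-based scheme can be sketched as follows. SHA-256 is used here purely as an illustrative stand-in, not necessarily the checksum the connector actually computes:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of checksum-based change detection: remember the last
// version per document and reindex only when the freshly computed one
// differs. The real crawler persists the version string in its database.
public class ChangeDetector {
  private final Map<String,String> lastVersion = new HashMap<>();

  public static String checksum(String content) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-256")
          .digest(content.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) sb.append(String.format("%02x", b));
      return sb.toString();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  // Returns true when the document needs (re)indexing.
  public boolean needsIndexing(String docId, String content) {
    String version = checksum(content);
    boolean changed = !version.equals(lastVersion.get(docId));
    lastVersion.put(docId, version);
    return changed;
  }

  public static void main(String[] args) {
    ChangeDetector d = new ChangeDetector();
    System.out.println(d.needsIndexing("/page", "<html>v1</html>"));  // true: first sight
    System.out.println(d.needsIndexing("/page", "<html>v1</html>"));  // false: unchanged
    System.out.println(d.needsIndexing("/page", "<html>v2</html>"));  // true: content changed
  }
}
```

Note that anything volatile folded into the version (a per-request header, a server-generated timestamp in the page) makes the comparison fail on every crawl, which matches the behavior reported in this thread.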

