Posted to user@manifoldcf.apache.org by Gustavo Beneitez <gu...@gmail.com> on 2018/07/19 19:37:52 UTC

web crawler not sharing cookies

Hi everyone,

I looked for an answer before writing this email, with no luck. Sorry for
the inconvenience if this has already been answered.

I need to set a cookie at the beginning of the web crawl. The cookie
determines the language in which the content is served; several languages
are available, and if no cookie is found the site falls back to a
"default language".

I made a JSP that sets the cookie and contains several links (href), and
pointed ManifoldCF to this page as the repository seed. I expected the
crawling engine to fetch the links in the language indicated by the
cookie, but what I actually got is a lot of content in the default
language.

My guess is that cookies are not shared between spider threads, so cookies
do not persist from one link to the next. The cookie domain is correct,
and so is the cookie expiration.

I would really appreciate it if you could help me with this.

Thanks in advance!

Re: web crawler not sharing cookies

Posted by Karl Wright <da...@gmail.com>.
Here's the documentation from HttpClient on the various cookie policies.
You're probably going to need to read some of the RFCs to see which policy
you want.  I will wait for you to get back to me with a recommendation
before taking any action in the MCF codebase.  Thanks!

https://hc.apache.org/httpcomponents-client-ga/tutorial/html/statemgmt.html

Karl
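
As a rough illustration of why the choice of policy matters for the
z.com / a.y.z.com case in this thread, the sketch below contrasts two
domain rules: the RFC 6265 domain-match (section 5.1.3) and the extra
RFC 2965 check (section 3.3.2) that rejects a cookie when the part of the
host in front of the cookie domain itself contains a dot. The class and
method names are made up for this example, IP-address and public-suffix
handling are left out, and it does not claim to reproduce what HttpClient
or ManifoldCF actually do.

```java
import java.util.Locale;

// Illustrative only: simplified versions of two cookie domain rules.
final class CookieDomainRules {

    // Drops a single leading dot, so ".z.com" and "z.com" compare alike.
    private static String normalize(String domain) {
        String d = domain.toLowerCase(Locale.ROOT);
        return d.startsWith(".") ? d.substring(1) : d;
    }

    // RFC 6265 section 5.1.3 (simplified): the host equals the cookie
    // domain, or ends with "." followed by the cookie domain.
    static boolean rfc6265Matches(String host, String cookieDomain) {
        String h = host.toLowerCase(Locale.ROOT);
        String d = normalize(cookieDomain);
        return h.equals(d) || h.endsWith("." + d);
    }

    // RFC 2965 section 3.3.2 (simplified): in addition to matching, the
    // host prefix in front of the cookie domain must not contain a dot.
    static boolean rfc2965Accepts(String host, String cookieDomain) {
        if (!rfc6265Matches(host, cookieDomain)) {
            return false;
        }
        String h = host.toLowerCase(Locale.ROOT);
        String d = normalize(cookieDomain);
        if (h.equals(d)) {
            return true;
        }
        String prefix = h.substring(0, h.length() - d.length() - 1);
        return !prefix.contains(".");
    }

    public static void main(String[] args) {
        System.out.println(rfc6265Matches("a.y.z.com", "z.com")); // true
        System.out.println(rfc2965Accepts("a.y.z.com", "z.com")); // false
        System.out.println(rfc2965Accepts("y.z.com", "z.com"));   // true
    }
}
```

Under these assumptions a Domain=z.com cookie set from a.y.z.com passes
the RFC 6265 rule but fails the RFC 2965 one. That would be consistent
with the behavior reported in this thread, though it is not confirmed as
the cause.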


In reply to Karl Wright <da...@gmail.com>, Thu, Jul 26, 2018 at 3:19 AM.

Re: web crawler not sharing cookies

Posted by Karl Wright <da...@gmail.com>.
Ok, so the database for your site crawl contains both z.com and x.y.z.com
cookies?  And your site pages from domain a.y.z.com receive no cookies at
all when fetched?  Is that a correct description of the situation?

Please verify that the a.y.z.com pages are part of the protected part of
your "site".  The regular expression that describes site membership for the
login sequence you are trying to set up must include them or they will not
receive any cookies no matter what we do.
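
A minimal sketch of what such a site-membership pattern could look like,
assuming the site is z.com plus subdomains of any depth. The pattern, the
class name and the full-match semantics are assumptions made for this
example; how ManifoldCF applies the configured expression to a URL should
be checked against its documentation.

```java
import java.util.regex.Pattern;

// Illustrative only: a hypothetical "site" pattern for z.com and all
// of its subdomains, over http or https, with an optional port.
final class SiteMembership {

    static final Pattern SITE = Pattern.compile(
        "^https?://([a-z0-9-]+\\.)*z\\.com(:\\d+)?(/.*)?$",
        Pattern.CASE_INSENSITIVE);

    // True when the whole URL matches the site pattern.
    static boolean inSite(String url) {
        return SITE.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(inSite("http://a.y.z.com/page.jsp")); // true
        System.out.println(inSite("http://x.y.z.com:8080/"));    // true
        System.out.println(inSite("http://notz.com/"));          // false
    }
}
```

A pattern written only for x.y.z.com would leave a.y.z.com outside the
site, which is the situation the paragraph above asks to rule out.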

If this is set up correctly, then the only explanation is the HttpClient
cookie policy in effect for site fetches.  It does not look like we
override the cookie policy anywhere when setting up the client:

        PoolingHttpClientConnectionManager poolingConnManager =
          new PoolingHttpClientConnectionManager(
            RegistryBuilder.<ConnectionSocketFactory>create()
              .register("http", PlainConnectionSocketFactory.getSocketFactory())
              .register("https", myFactory)
              .build());
        poolingConnManager.setDefaultMaxPerRoute(1);
        poolingConnManager.setValidateAfterInactivity(2000);
        poolingConnManager.setDefaultSocketConfig(SocketConfig.custom()
          .setTcpNoDelay(true)
          .setSoTimeout(socketTimeoutMilliseconds)
          .build());
        connManager = poolingConnManager;
      }


HttpClient tends to default to "strict" when stuff is not specified.  I'll
see if I can find out what the behavior is.

Karl


In reply to Gustavo Beneitez <gu...@gmail.com>, Thu, Jul 26, 2018 at 2:29 AM.

Re: web crawler not sharing cookies

Posted by Gustavo Beneitez <gu...@gmail.com>.
Hi,

The database may contain Z.com and X.Y.Z.com if they are created
automatically through a JSP, but not the intermediate one, Y.Z.com.

If the crawler goes to A.Y.Z.com and Z.com is present in the database, it
still doesn't work (it should, since A.Y.Z.com is a subdomain of Z.com).

Only after making the changes by hand (replacing the domain with the
subdomain in the database) and restarting ManifoldCF does it begin to work.

There might be security constraints involved; I will analyze this further.

Regards.


In reply to Karl Wright <da...@gmail.com>, Thu, Jul 26, 2018 at 0:06.

Re: web crawler not sharing cookies

Posted by Karl Wright <da...@gmail.com>.
The web connector, though, does not filter any cookies.  It takes them all
-- whatever cookies HttpClient is storing at that point.  So you should see
all the cookies in the database table, regardless of their site affinity,
unless HttpClient is refusing to accept a cookie for security reasons.

It's also possible that HttpClient is selective about which cookies to
transmit on a page fetch.

Can you look in the database and tell me whether your cookie gets stored,
or not?  If not, then HttpClient's cookie acceptance policy is not lenient
enough.  If it is in the database, then it's the transmission policy that
is too strict.

Thanks,
Karl


In reply to Gustavo Beneitez <gu...@gmail.com>, Wed, Jul 25, 2018 at 4:36 PM.

Re: web crawler not sharing cookies

Posted by Gustavo Beneitez <gu...@gmail.com>.
I agree, but the fact is that if my "login sequence" defines a login
credential for domain "Z.com" and the crawler reaches "Y.Z.com" or
"X.Y.Z.com", none of the sub-sites receives that cookie. I need to write
the same cookie for every sub-domain, which solves the situation
(thankfully it is a language cookie and not a dynamic one).

Regards.

In reply to Karl Wright <da...@gmail.com>, Wed, Jul 25, 2018 at 19:17.

Re: web crawler not sharing cookies

Posted by Karl Wright <da...@gmail.com>.
You should not need to fill the database by hand.  Your login sequence
should include whatever redirection etc is used to set the cookies though.

Karl


In reply to Gustavo Beneitez <gu...@gmail.com>, Wed, Jul 25, 2018 at 1:06 PM.

Re: web crawler not sharing cookies

Posted by Gustavo Beneitez <gu...@gmail.com>.
Hi again,

Thanks Karl, I was able to do that after defining a "login sequence", but
only after also filling the database (cookiedata table) with certain values
because of "domain restrictions".
Before every web call, I suspect ManifoldCF only takes cookies for the
exact subdomain of the URL (e.g. x.y.z.com), so if you define your cookie
for "z.com" it won't be sent. I added every subdomain by hand and it
started to work.

Regards.
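
The suspicion above can be pictured with a small sketch that compares an
exact-host cookie lookup with one that also tries each parent domain. The
class, the map keyed by cookie domain and the "lang=es" value are all
hypothetical; this is not how ManifoldCF's cookiedata table is read, only
a model of the difference between the two lookup strategies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: exact-host versus parent-aware cookie lookup.
final class CookieLookup {

    // Returns the host followed by each parent domain that still
    // contains a dot, e.g. x.y.z.com, y.z.com, z.com.
    static List<String> candidates(String host) {
        List<String> out = new ArrayList<>();
        String current = host.toLowerCase(Locale.ROOT);
        while (current.contains(".")) {
            out.add(current);
            current = current.substring(current.indexOf('.') + 1);
        }
        return out;
    }

    // Finds a cookie only if it is stored under the host itself.
    static String exactLookup(Map<String, String> byDomain, String host) {
        return byDomain.get(host.toLowerCase(Locale.ROOT));
    }

    // Tries the host first, then each parent domain in turn.
    static String parentAwareLookup(Map<String, String> byDomain, String host) {
        for (String domain : candidates(host)) {
            String cookie = byDomain.get(domain);
            if (cookie != null) {
                return cookie;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> byDomain = new HashMap<>();
        byDomain.put("z.com", "lang=es"); // hypothetical language cookie
        System.out.println(candidates("x.y.z.com"));                  // [x.y.z.com, y.z.com, z.com]
        System.out.println(exactLookup(byDomain, "x.y.z.com"));       // null
        System.out.println(parentAwareLookup(byDomain, "x.y.z.com")); // lang=es
    }
}
```

With an exact-host lookup, the only way to get the cookie sent to every
subdomain is to store one copy per subdomain, which matches the manual
workaround described in this message.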


In reply to Gustavo Beneitez <gu...@gmail.com>, Fri, Jul 20, 2018 at 8:12.

Re: web crawler not sharing cookies

Posted by Gustavo Beneitez <gu...@gmail.com>.
Hi,

Thanks a lot. Let me check the documentation for an example of that.

Regards!

In reply to Karl Wright <da...@gmail.com>, Thu, Jul 19, 2018 at 21:54.

Re: web crawler not sharing cookies

Posted by Karl Wright <da...@gmail.com>.
You are correct that cookies are not shared among threads.  That is by
design.

The only way to set cookies for the WebConnector is to have there be a
"login sequence".  The login sequence sets cookies that are then used by
all subsequent fetches.

Thanks,
Karl
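
One way to picture the design described above, as a conceptual sketch
only: cookies captured once by a login step are kept read-only, and every
fetch starts from its own private copy, so a cookie picked up during one
fetch (such as one set by a seed page) never reaches another. The class
and cookie names are invented for this example and it is not ManifoldCF's
actual implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: shared login cookies, private per-fetch cookie jars.
final class PerFetchCookies {

    // Cookies captured by the (hypothetical) login step; never modified.
    private final Map<String, String> loginCookies;

    PerFetchCookies(Map<String, String> loginCookies) {
        this.loginCookies =
            Collections.unmodifiableMap(new HashMap<>(loginCookies));
    }

    // Each fetch gets a fresh, mutable copy of the login cookies.
    Map<String, String> newFetchJar() {
        return new HashMap<>(loginCookies);
    }

    public static void main(String[] args) {
        Map<String, String> login = new HashMap<>();
        login.put("lang", "es"); // hypothetical language cookie
        PerFetchCookies cookies = new PerFetchCookies(login);

        Map<String, String> fetchA = cookies.newFetchJar();
        fetchA.put("seen", "1"); // set while fetching one page

        Map<String, String> fetchB = cookies.newFetchJar();
        System.out.println(fetchB.get("lang"));         // es
        System.out.println(fetchB.containsKey("seen")); // false
    }
}
```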


In reply to Gustavo Beneitez <gu...@gmail.com>, Thu, Jul 19, 2018 at 3:38 PM.