Posted to user@manifoldcf.apache.org by Priya Arora <pr...@smartshore.nl> on 2019/06/20 06:04:00 UTC

Fwd: Manifold Crawler Crashes

Hi,

I am running multiple jobs (2-3) simultaneously on a ManifoldCF server.
The configuration is:

1) Crawler server - 16 GB RAM and an 8-core Intel(R) Xeon(R) CPU E5-2660
v3 @ 2.60GHz

2) Elasticsearch server - 48 GB RAM and a 1-core Intel(R) Xeon(R) CPU
E5-2660 v3 @ 2.60GHz

The jobs fetch data from some public and intranet sites and then ingest
it into Elasticsearch.

The maximum connections setting on both the repository connections and
the output connection is 48 (for all 3 jobs).

The problem I am facing is that when I run multiple jobs, ManifoldCF
crashes after some time, and there is nothing in the manifold.log file
that hints at an error.
Does the effective maximum number of connections increase (to 48+48+48)
when all three jobs run together?
If so, do I need to divide the 48 max connections among all three jobs?
What is the maximum number of connections we can use when running the
jobs individually and simultaneously?

What should the maximum allowed number of database handles be in the
properties.xml file and in the Postgres config file?

So the problem is to figure out the reason for the crawler crash.
Can you please help with that as soon as possible?

Thanks and regards
Priya
priya@smartshore.nl

Re: Manifold Crawler Crashes

Posted by Karl Wright <da...@gmail.com>.
If you are already on PostgreSQL, then the memory usage is likely due to
the Tika Extractor.  It is hard to predict how much memory Tika needs for
any given document; we try never to load documents into memory, but in some
situations Tika uses a ton of memory nonetheless.  The more worker threads
you have, therefore, the more memory you need to give ManifoldCF to be sure
it doesn't run out.

So I recommend the following:

(1) LIMIT the number of worker threads.  The default value may be too high
for your setup if you are using the Tika Extractor.  Performance will not
suffer in any way until the number of worker threads drops below the
number of CPUs that the system has.
(2) MODIFY your start-options.env.* file to specify more memory.  How much
more is something you will need to experiment with.  A sketch of both
changes follows.
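
As a minimal sketch of both changes (the property below is the stock
ManifoldCF worker-thread setting, whose default is 30, but the values 8,
2048m, and 4096m are placeholders you will need to experiment with):

In properties.xml:

  <!-- worker thread count; 8 here just matches the 8-core host -->
  <property name="org.apache.manifoldcf.crawler.threads" value="8"/>

In start-options.env.unix (one JVM option per line; heap sizes are
placeholders):

  -Xms2048m
  -Xmx4096m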

Thanks,
Karl

Re: Manifold Crawler Crashes

Posted by Priya Arora <pr...@smartshore.nl>.
> I would highly recommend moving to Postgresql if you have any really
> sizable crawl.
Yes, we are already using PostgreSQL 9.6.10 for it. Below are the settings
in the postgresql.conf file of our Postgres server.

max_connections = 100
shared_buffers = 128MB
#temp_buffers = 8MB
#max_prepared_transactions = 0
#max_files_per_process = 1000
#autovacuum = on
#deadlock_timeout = 1s
#max_locks_per_transaction = 64
#max_pred_locks_per_transaction = 64

Can you please check whether these parameters are sufficient to handle
multiple jobs ingesting a large volume of data (8 lakh, i.e. 800,000, or
more documents) into an index? If not, can you please let me know what
these parameters should be set to for an optimal run of the jobs?
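
For example, would raising them to something like the following be a
reasonable starting point? (These numbers are only our guesses, assuming
the Postgres host has around 16 GB of RAM; we would tune them by
experiment.)

max_connections = 200              # above the sum of MCF's database handles
shared_buffers = 4GB               # roughly a quarter of host RAM
temp_buffers = 32MB
maintenance_work_mem = 512MB
checkpoint_completion_target = 0.9
autovacuum = on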

> Alternatively you could just hand the manifoldCF process more memory.
> Your choice.
Can you please help me with this - how do we achieve it?

Also, do we have to reduce the maximum number of connections on both the
repository and output connections? Could that be contributing to the heavy
memory load (due to multiple jobs running all together) that causes the
Java heap out-of-memory error?




Re: Manifold Crawler Crashes

Posted by Karl Wright <da...@gmail.com>.
If you are running single-process on top of HSQLDB, all database tables
are kept in memory, so you need a lot of memory.

I would highly recommend moving to PostgreSQL if you have any really
sizable crawl.
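
For the single-process example, the switch to PostgreSQL is a
properties.xml change along these lines (a sketch only; the user and
password values are placeholders for your own):

  <property name="org.apache.manifoldcf.databaseimplementationclass"
            value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
  <!-- placeholder credentials; use your own Postgres superuser -->
  <property name="org.apache.manifoldcf.dbsuperusername" value="postgres"/>
  <property name="org.apache.manifoldcf.dbsuperuserpassword" value="*******"/>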

Alternatively, you could just hand the ManifoldCF process more memory.
Your choice.

However, if you cannot even use bash to get into the instance, something
far more serious is happening to your Docker setup.

Karl


Re: Manifold Crawler Crashes

Posted by Priya Arora <pr...@smartshore.nl>.
Hi Karl,
1) It's a single-process deployment.
2) We are not able to access it through bash (while the crash is
happening).
3) Server configuration:
Crawler server - 16 GB RAM and an 8-core Intel(R) Xeon(R) CPU E5-2660 v3
@ 2.60GHz
Elasticsearch server - 48 GB RAM and a 1-core Intel(R) Xeon(R) CPU E5-2660
v3 @ 2.60GHz
4) ManifoldCF configuration:
Repository max connections: 48
Output max connections: 48

This crash happens when we are running more than two parallel jobs with
almost the same configuration at the same time.
[image: image.png]

Also, we see the following in the log file; it seems to be the reason
for the crash.

agents process ran out of memory - shutting down
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3308)
        at java.util.BitSet.ensureCapacity(BitSet.java:337)
        at java.util.BitSet.expandTo(BitSet.java:352)
        at java.util.BitSet.set(BitSet.java:447)
        at
de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
        at
org.apache.tika.parser.html.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:155)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:47)
        at
org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:83)
        at
org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:141)
        at
org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:288)
        at
org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:284)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:47)
        at
org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:83)
        at
org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:141)
        at
org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:288)
        at
org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:284)
        at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at
org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
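
For reference, we are thinking of adding the standard HotSpot heap-dump
flags wherever the deployment sets its JVM options (the start-options.env.*
file in our case), so the next failure leaves evidence behind. The heap
size and dump path below are placeholders:

-Xmx4096m
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/mcf/heapdump.hprof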


Re: Manifold Crawler Crashes

Posted by Karl Wright <da...@gmail.com>.
Hi Priya,

Being unable to reach the web interface sounds like either a network issue
or a problem with the app server.

Can you describe the configuration you are running in?  Is this a
multiprocess deployment or a single-process deployment?

When your docker container dies, can you still reach it via the standard
in-container bash tools?  What is happening there?
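
For example, from the Docker host (the container name "mcf" below is a
placeholder for yours):

docker ps -a                        # did the container exit, or is it up?
docker logs --tail 200 mcf          # anything from the JVM before it died?
docker exec -it mcf /bin/bash       # can you still get a shell inside it?
docker inspect --format '{{.State.OOMKilled}}' mcf   # did the kernel kill it?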

Karl



Re: Manifold Crawler Crashes

Posted by Priya Arora <pr...@smartshore.nl>.
Hi Karl,

Crash here means that a "the site could not be reached" kind of HTML page
appears when accessing http://localhost:3000/mcf-crawler-ui/index.jsp.
Explanation: when running a certain job on the ManifoldCF server (2.13),
after some time in a successfully running state, the browser suddenly
shows "the site could not be reached" (that kind of error), and the page
does not reload until I restart the container through a docker command
(shown below). Once the container is restarted, MCF loads again.
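
For completeness, the restart is just the standard Docker command, with
our container name in place of the "mcf" placeholder:

docker restart mcf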

Thanks
Priya


Re: Manifold Crawler Crashes

Posted by Karl Wright <da...@gmail.com>.
Please describe what you mean by "crash".  What actually happens?

Karl
