Posted to user@manifoldcf.apache.org by Karl Wright <da...@gmail.com> on 2012/07/27 14:50:23 UTC

Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5

There was a bug fixed in the way hopcount was being computed.  See
CONNECTORS-464.

This means that fewer documents are left in the queue, but the number
of indexed documents should be the same.
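As a rough illustration of what a hop count limit does (this is a generic breadth-first sketch, not ManifoldCF's actual code), assume a document's hop count is the smallest number of links needed to reach it from any seed, and that a limit such as "Max Hop on Links" excludes anything further away. The `links` map and the function names here are hypothetical.

```python
from collections import deque


def hop_counts(links, seeds):
    """Minimum number of link hops from any seed to each reachable URL."""
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, ()):
            if target not in hops:  # in BFS the first visit is the shortest path
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


def within_limit(links, seeds, max_hops):
    """URLs whose minimum hop count does not exceed max_hops."""
    return {url for url, h in hop_counts(links, seeds).items() if h <= max_hops}


# Hypothetical four-page site: seed -> a -> b -> c
links = {"seed": ["a"], "a": ["b"], "b": ["c"]}
print(sorted(within_limit(links, ["seed"], 2)))  # prints ['a', 'b', 'seed']
```

In this toy model, a miscomputed hop count would change which documents sit in the queue without necessarily changing which ones fall inside the limit.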

Karl

On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi
<sh...@g.softbank.co.jp> wrote:
>
> Hi guys.
>
>
> I wonder if anyone else has seen the number of crawled documents differ
> between MCF0.4 and MCF0.5 when web crawling.
>
>
> I crawled some portal sites on our intranet using MCF0.4 and MCF0.5.
> MCF0.4 crawled over 12000 documents, while MCF0.5 crawled only about
> half as many.
> I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
> I hope changing the database does not affect the crawling results:
>
>
> MCF0.4:
>   - Crawled Counts: 12000 and over
>   - Solr3.5
>   - PostgreSQL 9.1.3
>   - Tomcat6
>   - Max Hop on Links: 15
>   - Max Hop on Redirects: 10
>   - Include only hosts matching seeds: Checked
>   - org.apache.manifoldcf.crawler.threads: 50
>   - org.apache.manifoldcf.database.maxhandles: 100
>
>
> MCF0.5:
>   - Crawled Counts: around 6000
>   - Solr3.5
>   - MySQL5.5
>   - Tomcat6
>   - Max Hop on Links: 15
>   - Max Hop on Redirects: 10
>   - Include only hosts matching seeds: Checked
>   - org.apache.manifoldcf.crawler.threads: 50
>   - org.apache.manifoldcf.database.maxhandles: 100
>
>
> Does anyone have any ideas?
>

Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5

Posted by Shigeki Kobayashi <sh...@g.softbank.co.jp>.
> (1) Make sure that the repository connections and job definitions are
> indeed identical between MySQL and PostgreSQL.

Yes, they are all the same.

> (2) See if you can locate an example document that was crawled with
> PostgreSQL but not crawled with MySQL.

I confirmed that the documents crawled with PostgreSQL but not with
MySQL actually exist.

> (3) If you create a second web connection and job under MySQL, and run
> the job to completion, does the document that was not included get
> skipped again?  Or does it seem random which documents are skipped on
> each run?

OK. I created two connections and jobs with exactly the same definitions
and ran both jobs to completion.
The two runs ended with different numbers of crawled documents (as shown
in the attached picture).

It seems the first run skipped some documents and the second run skipped
different ones, but all of the skipped documents can be located.  I have
no clue why those documents are skipped.


Regards,

Shigeki




-- 
~~~~~~~~~~~~~~~~~~~~~~~~
 SoftBank Mobile Corp.
 Information Systems Division
 System Services Business Unit
 Service Planning Department

 Shigeki Kobayashi
 shigeki.kobayashi3@g.softbank.co.jp
~~~~~~~~~~~~~~~~~~~~~~~~

Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5

Posted by Karl Wright <da...@gmail.com>.
There should be no differences between crawling using MySQL as the
database and PostgreSQL, on the same version of ManifoldCF.

We include an RSS crawling test which finds exactly the expected
number of documents on MySQL.  This is a 100,000 document crawl.
There are no back-end-specific logic differences in the web connector
that would be expected to yield different results based on the
back-end database.

If you believe you have found a difference between MySQL and
PostgreSQL, I suggest the following:

(1) Make sure that the repository connections and job definitions are
indeed identical between MySQL and PostgreSQL.
(2) See if you can locate an example document that was crawled with
PostgreSQL but not crawled with MySQL.
(3) If you create a second web connection and job under MySQL, and run
the job to completion, does the document that was not included get
skipped again?  Or does it seem random which documents are skipped on
each run?
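Steps (2) and (3) come down to set comparisons. A minimal sketch, assuming you can export the crawled URLs of each run to a plain text file with one URL per line (the export step, file layout, and function names are all hypothetical, not part of ManifoldCF):

```python
def load_urls(path):
    """Read one URL per line, ignoring blank lines and surrounding whitespace."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def compare_runs(reference, run_a, run_b):
    """Classify documents the reference crawl found but a later run missed."""
    missed_a = reference - run_a
    missed_b = reference - run_b
    return {
        "missed_by_both": missed_a & missed_b,    # skipped consistently
        "missed_only_by_a": missed_a - missed_b,  # skipped in run A only
        "missed_only_by_b": missed_b - missed_a,  # skipped in run B only
    }


# Made-up data: the PostgreSQL crawl as reference, two MySQL runs to compare.
postgresql_run = {"u1", "u2", "u3", "u4"}
mysql_run_1 = {"u1", "u2"}
mysql_run_2 = {"u1", "u3"}

result = compare_runs(postgresql_run, mysql_run_1, mysql_run_2)
for label, urls in result.items():
    print(label, sorted(urls))
```

If nearly everything lands in "missed_by_both", the same documents are being skipped each time; if the "missed_only_by" sets are large, the skipping varies from run to run.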

Thanks,
Karl




Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5

Posted by Shigeki Kobayashi <sh...@g.softbank.co.jp>.
Are there any differences in crawling logic between MySQL and
PostgreSQL?

I ran some web crawling tests using both MySQL and PostgreSQL.

MCF0.5 running on MySQL indexed around 6000 documents, while MCF0.5
running on PostgreSQL indexed over 12000 documents.

MCF0.6 running on MySQL indexed around 6000 documents. MCF0.4 running on
PostgreSQL indexed over 12000 documents.

Each number of indexed documents above is the result of a first crawl
after deleting the indexing history from the database.

It seems that changing the database affects crawling and indexing.


Regards,

Shigeki




-- 
~~~~~~~~~~~~~~~~~~~~~~~~
 SoftBank Mobile Corp.
 Information Systems Division
 System Services Business Unit
 Service Planning Department

 Shigeki Kobayashi
 shigeki.kobayashi3@g.softbank.co.jp
~~~~~~~~~~~~~~~~~~~~~~~~