Posted to user@nutch.apache.org by Dragan Menoski <dr...@x3mlabs.com> on 2013/02/14 18:10:48 UTC

Nutch 2.1 different batch id (null)

Hi,

I am trying to set up Nutch 2.1 and Solr 4.0 with a MySQL database, following
the instructions at this link: http://nlp.solutions.asia/?p=180.

I made some changes in conf/nutch-site.xml (set threads to 50).

When I start the crawl (path: ~/Desktop/apache-nutch-2.1/runtime/local,
command: bin/nutch crawl urls -depth 5 -topN 1) I see the message:
"Skipping http://www.domainname.com/category/viewvideo/111; different batch
id (null)" for a lot of pages.

My nutch-site.xml file is attached.

I use Debian 6.0.5 (x64) in a virtual machine on Windows 7 (x64).

I have many records in the database with headers = null, status = 1, text =
null, and the other fields are also null.

In conf/regex-urlfilter.txt I have:

# accept anything else
+^http://([a-z0-9]*\.)*www.domain01.com
+^http://([a-z0-9]*\.)*domain02.com
+^http://([a-z0-9]*\.)*www.domain03.com.mk

In /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt I have:

http://www.domain01.com
http://domain02.com
http://www.domain03.com.mk
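
To check whether the filter rules above let the seed URLs through, here is a
small Python sketch. It is not Nutch code: accepts() is a hypothetical helper,
and it assumes the filter tries the rules in order, lets the first matching
rule decide (+ accept, - reject), and rejects URLs that match no rule. Note
that the dots in the host names are unescaped, so each one matches any
character.

```python
import re

# Rules copied from the regex-urlfilter.txt excerpt above.
RULES = [
    r"+^http://([a-z0-9]*\.)*www.domain01.com",
    r"+^http://([a-z0-9]*\.)*domain02.com",
    r"+^http://([a-z0-9]*\.)*www.domain03.com.mk",
]


def accepts(url: str, rules=RULES) -> bool:
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is rejected


seeds = [
    "http://www.domain01.com",
    "http://domain02.com",
    "http://www.domain03.com.mk",
]
results = {url: accepts(url) for url in seeds}

# The unescaped "." also lets through a host that was probably not intended.
loose_match = accepts("http://wwwxdomain01.com")
```

Under these assumptions all three seeds pass the filter, so the rules alone
would not explain the skipped pages. Escaping the dots (\.) would make the
rules stricter.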



Best Regards,

Dragan Menoski

Re: Nutch 2.1 different batch id (null)

Posted by Lewis John Mcgibbney <le...@gmail.com>.

I've opened NUTCH-1567 to track and address this.
https://issues.apache.org/jira/browse/NUTCH-1567


On Tue, Apr 30, 2013 at 9:39 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi,
> One aspect of this problem makes it hard for me or anyone else to
> address.
> A number of variables may (depending on your task execution between
> crawls) change the probability of some MARK not being present.
> The core problem here, within the ParserJob at least, is that
> Mark.FETCH_MARK.checkMark(page) returns null.
> The explanation I was given for this is documented on the wiki
> (unfortunately the wiki is under maintenance just now).
> I do not think that the DEBUG logging currently in the 2.x branch HEAD is
> useful at all. It should display the batchId as opposed to the Mark. My
> justification for this is that the Mark is always null, so showing it
> is pointless. We would be better off showing the batchId, which would
> enable the user to refetch that batch in an attempt to ensure that a MARK
> is assigned to the page.
> Does this make sense?
> Lewis



-- 
*Lewis*

Re: Nutch 2.1 different batch id (null)

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi,
One aspect of this problem makes it hard for me or anyone else to address.
A number of variables may (depending on your task execution between crawls)
change the probability of some MARK not being present.
The core problem here, within the ParserJob at least, is that
Mark.FETCH_MARK.checkMark(page) returns null.
The explanation I was given for this is documented on the wiki
(unfortunately the wiki is under maintenance just now).
I do not think that the DEBUG logging currently in the 2.x branch HEAD is
useful at all. It should display the batchId as opposed to the Mark. My
justification for this is that the Mark is always null, so showing it
is pointless. We would be better off showing the batchId, which would enable
the user to refetch that batch in an attempt to ensure that a MARK is assigned
to the page.
Does this make sense?
Lewis
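
The check described in this message can be modelled outside Nutch. The
following is a minimal Python sketch, not the Nutch source: should_process(),
parse_step(), the dictionary-based page records, and the example URLs are
hypothetical stand-ins. It assumes the parse step skips any page whose fetch
mark is missing or differs from the requested batch id. The second log line is
an invented wording for the proposal above, reporting the batch id instead of
the mark.

```python
from typing import Optional


def should_process(mark: Optional[str], batch_id: str) -> bool:
    """Assumed rule: process a page only if its fetch mark equals the batch id."""
    if mark is None:  # the page was never marked by a fetch
        return False
    return mark == batch_id


def parse_step(pages: dict, batch_id: str) -> tuple:
    """Return (parsed_urls, log_lines) for a hypothetical parse run."""
    parsed, log = [], []
    for url, page in pages.items():
        # Stand-in for Mark.FETCH_MARK.checkMark(page).
        mark = page.get("fetch_mark")
        if should_process(mark, batch_id):
            parsed.append(url)
            continue
        shown = "null" if mark is None else mark
        # Message in the shape reported in this thread: it shows the mark.
        log.append(f"Skipping {url}; different batch id ({shown})")
        # Hypothetical alternative: show the batch id being processed.
        log.append(f"Skipping {url}; no fetch mark for batch id {batch_id}")
    return parsed, log


pages = {
    "http://example.com/a": {"fetch_mark": "batch-1"},
    "http://example.com/b": {},  # never fetched, so it has no mark
}
parsed, log = parse_step(pages, "batch-1")
```

In this model the page without a fetch mark produces the "(null)" message,
while the alternative line tells the user which batch to fetch again.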

On Sun, Apr 28, 2013 at 8:33 AM, cervenkovab <ce...@gmail.com> wrote:

> Hello,
> I have the same problem with *"Skipping some.relevant.page.com; different
> batch id (null)"* for a lot of pages. My configuration is almost the same
> as below (only the OS differs and the storage is HBase).
>
> I run the steps (inject), generate, fetch, and the skipping appears in the
> parse phase. But I want those pages to be parsed; the URLs are relevant
> for me.
> The problem is that I want to crawl a lot of websites. *When a lot of
> pages are skipped, I end up with very few collected pages and many empty
> pages, which is bad for me*. I also don't know why, for example, the page
> http://videos.arte.tv/de/videos/arte-reportage--7471210.html is fetched
> and parsed, while the page
> http://videos.arte.tv/de/videos/real-humans-echte-menschen-7-10-achtung-schockierende-bilder--7455402.html
> is skipped, as are most of the other pages of the domain arte.tv.
> It is the same domain name.
>
> *What causes this error? How can I resolve this problem?*
> Thanks for your help

-- 
*Lewis*

Re: Nutch 2.1 different batch id (null)

Posted by cervenkovab <ce...@gmail.com>.

Hello,
I have the same problem with *"Skipping some.relevant.page.com; different
batch id (null)"* for a lot of pages. My configuration is almost the same as
below (only the OS differs and the storage is HBase).

I run the steps (inject), generate, fetch, and the skipping appears in the
parse phase. But I want those pages to be parsed; the URLs are relevant for me.
The problem is that I want to crawl a lot of websites. *When a lot of
pages are skipped, I end up with very few collected pages and many empty
pages, which is bad for me*. I also don't know why, for example, the page
http://videos.arte.tv/de/videos/arte-reportage--7471210.html is fetched
and parsed, while the page
http://videos.arte.tv/de/videos/real-humans-echte-menschen-7-10-achtung-schockierende-bilder--7455402.html
is skipped, as are most of the other pages of the domain arte.tv.
It is the same domain name.

*What causes this error? How can I resolve this problem?*
Thanks for your help





--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-2-1-different-batch-id-null-tp4040592p4059636.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch 2.1 different batch id (null)

Posted by Lewis John Mcgibbney <le...@gmail.com>.

And you want to get to the bottom of the batchId = null?
You haven't actually asked a question here.

On Thursday, February 14, 2013, Dragan Menoski <dr...@x3mlabs.com>
wrote:
> Hi,
> I am trying to set up Nutch 2.1 and Solr 4.0 with a MySQL database,
> following the instructions at this link: http://nlp.solutions.asia/?p=180.
> I made some changes in conf/nutch-site.xml (set threads to 50).
> When I start the crawl (path: ~/Desktop/apache-nutch-2.1/runtime/local,
> command: bin/nutch crawl urls -depth 5 -topN 1) I see the message:
> "Skipping http://www.domainname.com/category/viewvideo/111; different batch
> id (null)" for a lot of pages.
> My nutch-site.xml file is attached.
> I use Debian 6.0.5 (x64) in a virtual machine on Windows 7 (x64).
> I have many records in the database with headers = null, status = 1, text =
> null, and the other fields are also null.
> In conf/regex-urlfilter.txt I have:
> # accept anything else
> +^http://([a-z0-9]*\.)*www.domain01.com
> +^http://([a-z0-9]*\.)*domain02.com
> +^http://([a-z0-9]*\.)*www.domain03.com.mk
> In /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt I have:
> http://www.domain01.com
> http://domain02.com
> http://www.domain03.com.mk
>
>
> Best Regards,
> Dragan Menoski

-- 
*Lewis*