Posted to user@nutch.apache.org by Shane Wood <sh...@cbm8bit.com> on 2014/03/25 03:50:42 UTC

crawl data

I have set up Nutch, Solr and MySQL as per this how-to:
http://nlp.solutions.asia/?p=362
I run Nutch using these commands:

./bin/nutch inject urls
./bin/nutch generate -topN 20
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb

./bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex

I have a /crawl folder, yet nothing appears in it while it's indexing.
Where does Nutch store the content etc. while it's indexing?

Is there an informative FAQ on what differences using MySQL makes to
your setup?

Cheers for any help
Shane.
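
The command sequence in this post can be sketched as one script. This is only a minimal sketch and not part of the original message: the function names run_step and crawl_cycle and the variables NUTCH_BIN, SOLR_URL and DRY_RUN are hypothetical, while the Nutch commands themselves are copied from the post. Chaining the steps with && stops at the first one that fails, which makes it easier to see whether every step is actually working.

```shell
# Hypothetical wrapper around the commands from the post.
# NUTCH_BIN, SOLR_URL and DRY_RUN are illustrative names, not Nutch settings.
NUTCH_BIN="${NUTCH_BIN:-./bin/nutch}"
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8983/solr/}"

# Print the step, then run it unless DRY_RUN=1.
run_step() {
    echo "+ $NUTCH_BIN $*"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        return 0
    fi
    "$NUTCH_BIN" "$@"
}

# One crawl cycle; '&&' stops the chain at the first step that fails.
crawl_cycle() {
    run_step inject urls &&
    run_step generate -topN 20 &&
    run_step fetch -all &&
    run_step parse -all &&
    run_step updatedb &&
    run_step solrindex "$SOLR_URL" -reindex
}
```

Running `DRY_RUN=1 crawl_cycle` only prints the six commands; running `crawl_cycle` from the Nutch runtime directory executes them in order.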

Re: MYSQL field meanings

Posted by Shane Wood <sh...@cbm8bit.com>.
Can I tell generate to build a fetch list based on the status field in
MySQL? I wish to index only status 1, meaning not yet fetched, and parse
only those until they're all done. This would be a great help.

Cheers
Shane.
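
For reading the status column mentioned above, a small lookup can help. This is a sketch under an assumption: the numeric values are taken from memory of the CrawlStatus constants in Nutch 2.x and should be checked against the source of the version in use. The function name status_name and the labels are hypothetical.

```shell
# Map a numeric `status` value to a readable label.
# ASSUMPTION: values as recalled from CrawlStatus in Nutch 2.x; verify locally.
status_name() {
    case "$1" in
        1)  echo "unfetched" ;;
        2)  echo "fetched" ;;
        3)  echo "gone" ;;
        4)  echo "redir_temp" ;;
        5)  echo "redir_perm" ;;
        34) echo "retry" ;;
        38) echo "notmodified" ;;
        *)  echo "unknown" ;;
    esac
}
```

For example, `status_name 1` gives the label for the "not yet fetched" rows described in this message.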


On 27/03/14 13:15, Shane Wood wrote:
> Could someone comment on what these fields do when using Nutch and
> MySQL?
> Or is there a web page where this information is already available?
> Thanks


Re: MYSQL field meanings

Posted by Shane Wood <sh...@cbm8bit.com>.
Thanks, will read up on that...

Cheers. :)



On 27/03/14 18:48, Vangelis karv wrote:
> http://nlp.solutions.asia/?p=232
> Hope this helps :)


RE: MYSQL field meanings

Posted by Vangelis karv <ka...@hotmail.com>.
http://nlp.solutions.asia/?p=232
Hope this helps :)

> Date: Thu, 27 Mar 2014 13:15:08 +1000
> From: shane@cbm8bit.com
> To: user@nutch.apache.org
> Subject: MYSQL field meanings
> 
> Could someone comment on what these fields do when using Nutch and MySQL?
> Or is there a web page where this information is already available?
> Thanks

Re: MYSQL field meanings

Posted by Talat Uyarer <ta...@uyarer.com>.
Yes,

The patch has been committed on the 2.x branch.
On 27 Mar 2014 at 12:31, "Shane Wood" <sh...@cbm8bit.com> wrote:

> I'm using Nutch 2.2 as per this install tutorial. Would this patch already
> have been added to the newer version?
> http://nlp.solutions.asia/?p=362
>
> Enjoy
> Shane.

Re: MYSQL field meanings

Posted by Shane Wood <sh...@cbm8bit.com>.
I'm using Nutch 2.2 as per this install tutorial. Would this patch
already have been added to the newer version?
http://nlp.solutions.asia/?p=362

Enjoy
Shane.

On 27/03/14 18:54, Talat Uyarer wrote:
> Hi Shane,
>
> Which version of Nutch do you use? If you use Nutch 2.2.1, this is a bug.
> You should take a look at https://issues.apache.org/jira/browse/NUTCH-1651
>
> Talat


Re: MYSQL field meanings

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi Shane,

Which version of Nutch do you use? If you use Nutch 2.2.1, this is a bug.
You should take a look at https://issues.apache.org/jira/browse/NUTCH-1651

Talat


2014-03-27 5:15 GMT+02:00 Shane Wood <sh...@cbm8bit.com>:

> Could someone comment on what these fields do when using Nutch and MySQL?
> Or is there a web page where this information is already available?
> Thanks


-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

MYSQL field meanings

Posted by Shane Wood <sh...@cbm8bit.com>.
Could someone comment on what these fields do when using Nutch and MySQL?
Or is there a web page where this information is already available?
Thanks



id
headers
text
status
markers
parseStatus
modifiedTime <---- this is always NULL. Any idea why?
prevModifiedTime <---- this is always NULL. Any idea why?
score
typ
batchId
baseUrl
content
title
reprUrl
fetchInterval
prevFetchTime
inlinks
prevSignature
outlinks
fetchTime
retriesSinceFetch
Ascending
protocolStatus
signature
metadata
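
As one illustration for the field list above: as far as I know, with Nutch 2.x the id column holds the page URL with its host labels reversed (for example com.example.www:http/index.html), so that pages of one domain sort together. Treat that format as an assumption and compare it with a few rows of your own table. The function name reverse_url is hypothetical, and the sketch handles only scheme://host[:port][/path].

```shell
# Sketch: build a reversed-host key such as "com.example.www:http/index.html".
# ASSUMPTION: this mirrors how Nutch 2.x forms the `id` value; verify locally.
# Handles scheme://host[:port][/path] only (no userinfo, no IPv6 literals).
reverse_url() {
    url="$1"
    proto="${url%%://*}"
    rest="${url#*://}"
    # Split host[:port] from the path (the path keeps its leading slash).
    case "$rest" in
        */*) hostport="${rest%%/*}"; path="/${rest#*/}" ;;
        *)   hostport="$rest"; path="" ;;
    esac
    # Split an optional port from the host.
    case "$hostport" in
        *:*) host="${hostport%%:*}"; port=":${hostport##*:}" ;;
        *)   host="$hostport"; port="" ;;
    esac
    # Reverse the dot-separated host labels: www.example.com -> com.example.www
    reversed=""
    old_ifs="$IFS"; IFS="."
    for label in $host; do
        reversed="${label}${reversed:+.}${reversed}"
    done
    IFS="$old_ifs"
    echo "${reversed}:${proto}${port}${path}"
}
```

Calling `reverse_url 'http://www.example.com/index.html'` produces a key that can be compared with the id values stored in the table.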



Re: crawl data

Posted by Shane Wood <sh...@cbm8bit.com>.
As generate does not pick up the URLs not yet fetched, no amount of indexing
now adds more to my index; I've hit some kind of wall.

Can I force Nutch to generate only URLs not yet fetched, and not the ones
already fetched?

Cheers
Shane.
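
One way to see whether the number of unfetched URLs really stays the same between cycles is to count rows per status directly in MySQL. This is a minimal sketch that only prints the query: the function name status_count_sql is hypothetical, and the default table name webpage is an assumption about a typical Nutch 2.x MySQL setup, so adjust it to your schema.

```shell
# Print a query that counts rows per `status` value.
# ASSUMPTION: the table is called "webpage"; pass another name as $1 if not.
status_count_sql() {
    table="${1:-webpage}"
    printf 'SELECT status, COUNT(*) AS pages FROM %s GROUP BY status ORDER BY status;\n' "$table"
}
```

The output can be piped into the mysql client, for example `status_count_sql | mysql -u USER -p DATABASE`, where USER and DATABASE are placeholders for your own values. Running it before and after a cycle shows whether any rows moved out of status 1.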


On 26/03/14 09:29, Shane Wood wrote:
> Yes, the only error/warning I get is:
>
> mapred.FileOutputCommitter - Output path is null in cleanup
>
> What does this mean? What would be the command line to index a single
> domain, say test.com?
>
> Why does generate give me the same fetch list every time? I thought
> Nutch would only re-index the same page once every 30 days.
> My setup fetches the same pages every time I index; this seems a waste
> of resources.
>
> Cheers
> Shane.
>


Re: crawl data

Posted by Shane Wood <sh...@cbm8bit.com>.
Yes, the only error/warning I get is:

mapred.FileOutputCommitter - Output path is null in cleanup

What does this mean? What would be the command line to index a single
domain, say test.com?

Why does generate give me the same fetch list every time? I thought Nutch would only re-index the same page once every 30 days.
My setup fetches the same pages every time I index; this seems a waste of resources.

Cheers
Shane.
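
The 30-day expectation above can be written out as arithmetic. This is a sketch under assumptions: that the fetchTime column holds the next due time in milliseconds since the epoch, that fetchInterval is in seconds, and that 30 days matches the default of db.fetch.interval.default in your nutch-default.xml. The names DEFAULT_INTERVAL_S, next_fetch_ms and is_due are hypothetical.

```shell
# 30 days expressed in seconds (30 * 24 * 60 * 60 = 2592000).
# ASSUMPTION: this matches db.fetch.interval.default in your Nutch version.
DEFAULT_INTERVAL_S=$((30 * 24 * 60 * 60))

# next_fetch_ms LAST_FETCH_MS [INTERVAL_S]
# Prints the time (ms since epoch) at which the page becomes due again.
next_fetch_ms() {
    echo $(( $1 + ${2:-$DEFAULT_INTERVAL_S} * 1000 ))
}

# is_due FETCH_TIME_MS NOW_MS
# Succeeds when the scheduled fetch time is at or before "now".
is_due() {
    [ "$1" -le "$2" ]
}
```

If a page that was just fetched still passes `is_due` with the values from its row, the fetch time was probably not advanced for that row, which would point at the update step rather than at generate.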


On 26/03/14 06:37, d_k wrote:
> Are you sure all the steps are working? Did you look at the logs?


Re: crawl data

Posted by d_k <ma...@gmail.com>.
Are you sure all the steps are working? Did you look at the logs?


On Tue, Mar 25, 2014 at 4:50 AM, Shane Wood <sh...@cbm8bit.com> wrote:

> I have set up Nutch, Solr and MySQL as per this how-to:
> http://nlp.solutions.asia/?p=362
> I run Nutch using these commands:
>
> ./bin/nutch inject urls
> ./bin/nutch generate -topN 20
> ./bin/nutch fetch -all
> ./bin/nutch parse -all
> ./bin/nutch updatedb
>
> ./bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
>
> I have a /crawl folder, yet nothing appears in it while it's indexing.
> Where does Nutch store the content etc. while it's indexing?
>
> Is there an informative FAQ on what differences using MySQL makes to your
> setup?
>
> Cheers for any help
> Shane.
>