
Posted to user@nutch.apache.org by Steve Cohen <ma...@gmail.com> on 2014/12/23 16:32:04 UTC

Questions about parse checker and indexing solr with nutch 1.9

Hello,


I am using Nutch 1.9 to crawl a file system with symlinks and trying to
index it into Solr. I applied NUTCH-1884-trunk-v1.patch
and NUTCH-1885-trunk-v1.patch, and it seems to parse correctly, but I am not
seeing the parsed documents in Solr.

If I use readseg on the last segment, the dump file is over 4 GB:

-bash-4.1$ ls -lh steve/dump
-rwxrwxrwx 1 nutch nutch 4.2G Dec 22 22:50 steve/dump

and it shows that it followed the symlinks and is parsing the actual
documents:

Recno:: 8
URL::
file:/RMS/sha256/00/07/3a/b9/1b/06/cc/07/00073ab91b06cc079c9f510ddcc8c406159dbb55c0baff9a864898b5cfa67449

CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Mon Dec 22 19:39:02 EST 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 0.0
Signature: c68700d32bd0f3940f535dd7530580af
Metadata:

ParseData::
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: nutch.content.digest=c68700d32bd0f3940f535dd7530580af
Content-Length=3730 Last-Modified=Sun, 26 Oct 2014 18:26:55 GMT
nutch.crawl.score=3.7866482E-7 _fst_=33 nutch.segment.name=20141222193423
Content-Type=text/plain _ftk_=1419295006963
Parse Metadata: Content-Encoding=windows-1252 Content-Type=text/plain;
charset=windows-1252

ParseText:: <not including parse text since it is sensitive information>

But all that shows up in Solr is the symlinks to the actual data. The Solr
index is 120 MB.

Also when I run parsechecker on the file that was parsed, it shows me this:

-bash-4.1$ /opt/nutch/runtime/local/bin/nutch parsechecker
file:/RMS/sha256/00/07/3a/b9/1b/06/cc/07/00073ab91b06cc079c9f510ddcc8c406159dbb55c0baff9a864898b5cfa67449
fetching:
file:///RMS/sha256/00/07/3a/b9/1b/06/cc/07/00073ab91b06cc079c9f510ddcc8c406159dbb55c0baff9a864898b5cfa67449
parsing:
file:///RMS/sha256/00/07/3a/b9/1b/06/cc/07/00073ab91b06cc079c9f510ddcc8c406159dbb55c0baff9a864898b5cfa67449
contentType: text/plain
signature: c68700d32bd0f3940f535dd7530580af
Failed to get parse from parse result
Available parses in parse result (by URL key):

file:/RMS/sha256/00/07/3a/b9/1b/06/cc/07/00073ab91b06cc079c9f510ddcc8c406159dbb55c0baff9a864898b5cfa67449
Parse result does not contain a parse for URL to be checked:

file:///RMS/sha256/00/07/3a/b9/1b/06/cc/07/00073ab91b06cc079c9f510ddcc8c406159dbb55c0baff9a864898b5cfa67449

Parsechecker seems to add two extra "/" after "file:" (file:/ becomes
file:///), and because of that it does not find the parse, even though the
actual parse command seems to work.
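
The mismatch in the output above can be reproduced outside Nutch. The
following is a minimal sketch, not Nutch code: it assumes the parse result is
a map keyed by the URL string, which is what the "Available parses in parse
result (by URL key)" message suggests. The helper collapse_file_slashes and
the path /RMS/example.txt are hypothetical, and the helper assumes local
paths only (no file://host/... URLs).

```python
import re

# Hypothetical helper, not part of Nutch: collapse any run of slashes
# after "file:" into a single slash so both spellings share one form.
# Assumes local paths only (no file://host/... URLs).
def collapse_file_slashes(url: str) -> str:
    return re.sub(r"^file:/+", "file:/", url)

# Parse results keyed by URL string, mirroring the parsechecker output.
parses = {"file:/RMS/example.txt": "parse of example.txt"}

# The URL parsechecker looks up after it has added the extra slashes.
checked = "file:///RMS/example.txt"

print(checked in parses)                         # False
print(collapse_file_slashes(checked) in parses)  # True
```

The lookup fails only because the two strings differ, even though both
spellings name the same file.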

Any ideas about getting the crawl to index?

Thanks,
Steve Cohen

Re: Questions about parse checker and indexing solr with nutch 1.9

Posted by Steve Cohen <ma...@gmail.com>.

To respond to myself, I got Nutch trunk to crawl a file system with
symbolic links and index it into Solr.

I had changed file.crawl.redirect_noncanonical to false in my
nutch-site.xml file for some reason, and when I put it back to true (the
default in nutch-default.xml) it started indexing the symlinked files.
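
For anyone checking their own setup, here is a small sketch that reads a
property out of a nutch-site.xml-style document. The XML string is a
hypothetical example of an override; only the property name and the value
"true" come from this thread. The function get_property is an illustrative
helper, not a Nutch API.

```python
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml content. Only the property name and the
# value "true" are taken from the thread.
NUTCH_SITE_XML = """\
<configuration>
  <property>
    <name>file.crawl.redirect_noncanonical</name>
    <value>true</value>
  </property>
</configuration>
"""

def get_property(xml_text, name):
    """Return the value of the named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

print(get_property(NUTCH_SITE_XML, "file.crawl.redirect_noncanonical"))  # true
```

If the lookup returns None for your nutch-site.xml, the property is not
overridden there and the value from nutch-default.xml applies.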

Thanks,
Steve Cohen


Re: Questions about parse checker and indexing solr with nutch 1.9

Posted by Steve Cohen <ma...@gmail.com>.

I installed the trunk and tried a test and I seem to be running into an
issue with the symlinked files. If I run parsechecker on one of the
symlinks to the files I get this:

-bash-4.1$ /usr/local/src/nutch/runtime/local/bin/nutch parsechecker
file:/RMS/solr/initial_crawl/10000/8280.txt
fetching: file:/RMS/solr/initial_crawl/10000/8280.txt
parsing: file:/RMS/solr/initial_crawl/10000/8280.txt
contentType: text/plain
signature: 41460a2beb4c79608fa043d41ff0618c
Failed to get parse from parse result
Available parses in parse result (by URL key):

file:/RMS/sha256/a6/ab/c8/7e/aa/09/43/8e/a6abc87eaa09438ef6aea129e7b014d26627239f5efb118f0217e45b8c022a1a
Parse result does not contain a parse for URL to be checked:
  file:/RMS/solr/initial_crawl/10000/8280.txt

Parsechecker sort of follows the symlink. It sees the actual file but
doesn't parse it.
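
The output above lists the parse under the resolved sha256 path while the
lookup uses the symlink name. A minimal sketch of that situation, assuming a
platform that supports symlinks; the file names are hypothetical stand-ins
and nothing here is Nutch code:

```python
import os
import tempfile

# Build a throwaway directory with one real file and one symlink to it.
# All names are hypothetical; requires a platform with symlink support.
base = os.path.realpath(tempfile.mkdtemp())
target = os.path.join(base, "a6abc87e")   # stands in for the sha256 file
link = os.path.join(base, "8280.txt")     # stands in for the symlink
with open(target, "w") as f:
    f.write("sample text\n")
os.symlink(target, link)

# A result keyed by the canonical (resolved) path, as in the output above.
parses = {os.path.realpath(link): "parse of the real file"}

print(link in parses)                    # False: the symlink name is not a key
print(os.path.realpath(link) in parses)  # True: the resolved name is
```

The content is reachable either way, but a lookup by the symlink name misses
because the stored key is the canonical path.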

If I run parsechecker on the actual file, it works:

-bash-4.1$ /usr/local/src/nutch/runtime/local/bin/nutch parsechecker
file:/RMS/sha256/a6/ab/c8/7e/aa/09/43/8e/a6abc87eaa09438ef6aea129e7b014d26627239f5efb118f0217e45b8c022a1a
fetching:
file:/RMS/sha256/a6/ab/c8/7e/aa/09/43/8e/a6abc87eaa09438ef6aea129e7b014d26627239f5efb118f0217e45b8c022a1a
parsing:
file:/RMS/sha256/a6/ab/c8/7e/aa/09/43/8e/a6abc87eaa09438ef6aea129e7b014d26627239f5efb118f0217e45b8c022a1a
contentType: text/plain
signature: 41460a2beb4c79608fa043d41ff0618c
---------
Url
---------------

file:/RMS/sha256/a6/ab/c8/7e/aa/09/43/8e/a6abc87eaa09438ef6aea129e7b014d26627239f5efb118f0217e45b8c022a1a
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: Content-Length=327 nutch.crawl.score=0.0
Last-Modified=Mon, 27 Oct 2014 06:04:49 GMT Content-Type=text/plain
Parse Metadata: Content-Encoding=windows-1252 Content-Type=text/plain;
charset=windows-1252

Any suggestions for the symlinks?


Re: Questions about parse checker and indexing solr with nutch 1.9

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Steve,

> https://issues.apache.org/jira/browse/NUTCH-1076
> Is this the reason indexing isn't working for me when I crawl a file system?

Possibly, but at first glance I would try the current trunk.
A lot of issues have been fixed regarding protocol-file,
in addition to the redirect issues: NUTCH-1879 and NUTCH-1880.

>> Parsechecker seems to be adding two "/" after the file: and because of
>> that it won't properly parse even though the actual parse command seems to
>> work.

That's exactly the point of NUTCH-1879 and NUTCH-1880.
Can you try to apply the patches or test the current trunk?

Thanks,
Sebastian


Re: Questions about parse checker and indexing solr with nutch 1.9

Posted by Steve Cohen <ma...@gmail.com>.

Hello,

I was looking through nutch issues and saw this one:

https://issues.apache.org/jira/browse/NUTCH-1076?jql=project%20%3D%20NUTCH%20AND%20fixVersion%20%3D%201.10%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20indexer%20ORDER%20BY%20priority%20DESC

Is this the reason indexing isn't working for me when I crawl a file system?

Thanks,
Steve Cohen


