Posted to user@nutch.apache.org by Bayu Widyasanyata <bw...@gmail.com> on 2012/12/25 01:16:50 UTC

Not all parsed docs is indexed & inconsistent parsed docs.

Hi All,

I'm new to Nutch and Solr, and I'm running the following platform:
- nutch 2.1
- solr 4.0
- jdk 1.7 on ubuntu 10.04

I'm also one of the "members" who followed the legendary Nutch with
MySQL tutorial at http://nlp.solutions.asia/?p=180 ;-)
I installed all of the above successfully, with some minor
corrections to the table structure (i.e. renaming the "typ" column to
"type" and changing its size to varchar(64)).

I created an index.html (with simple text inside) at the URL
http://localhost/sapi/ and put that URL into urls/seed.txt as the seed
URL to crawl.
For testing, index.html links to 5 documents in 2 formats (PDF and
ODT), with and without spaces in the filename:

1. http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
2. http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
3. http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
4. http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
5. http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt

* The %20 sequences in the links above are actually space characters.
I copied the URLs as my browser displays them, converted to URL-safe form.
** The same space-to-%20 conversion rule has also been applied in the
regex-normalize.xml file.

Here are some facts and doubts I have after playing around with Nutch and Solr:

1. All of those docs were parsed "successfully", since their status is "2".
2. I put "successfully" in quotes because some of the docs (#1 and #2
above) have no value in the "text" column of the webpage table in
MySQL. I think that means Nutch failed to parse those docs. CMIIW.
3. The number of docs (numDocs) reported in Solr Admin is always 2! :(
Only index.html and document #4,
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt,
are indexed by Solr, even when I repeat the crawl and reindex
process many times.

Below is the two-line bash script I use to crawl and index my page:

#!/bin/bash
./runtime/local/bin/nutch crawl urls -depth 3 -topN 5
./runtime/local/bin/nutch solrindex http://localhost:8080/solr/ -reindex

I'd appreciate any help.

TIA

--
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Please see below

On Sat, Jan 12, 2013 at 8:48 PM, Bayu Widyasanyata
<bw...@gmail.com>wrote:

>
> That's tomcat port for Solr.
> Should we activate the proxy setting?
>

Is it already activated in nutch-site.xml? No, I do not think it should be
activated unless you have a proxy running.

>
>
>
> But the strange thing is that the status of all fetched documents is 2.
>

This is fine, there is clearly no problem with fetching. It is a parsing
problem for sure.

>
> So why could the PDF parser not completely parse all of the PDF docs?
>

http.content.limit?

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

On Sun, Jan 13, 2013 at 12:02 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi,
>
> On Fri, Jan 11, 2013 at 3:12 PM, Bayu Widyasanyata
> <bw...@gmail.com>wrote:
>
> >
> > We can see that some of parse processes were not completed successfully.
> >
>
> Yes I see this. I also see that you have a http.proxy.port = 8080 but no
> proxy host and that the protocol-httpclient plugin is not activated.
>

That's the Tomcat port for Solr.
Should I activate the proxy setting?

> I also see some strange fetcher behaviour as it seems to fetch the server
> instance e.g. 2013-01-12 05:37:41,987 INFO  fetcher.FetcherJob - fetching
> http://localhost/, however I assume there is no document @ this location
> on the server...
>
>
There is an index.html at that URL.
Here is its content:

<html>
<head><title>Contoh link dokumen</title></head>
<body>
<h3>testing dokumen</h3>
<p>
Namun dalam realitas kita melihat banyak manusia modern justeru bersikap
sebaliknya. Dan ini tidak saja diperlihatkan oleh sembarang manusia. Bahkan
sebagian manusia yang mengaku muslim sekalipun menampilkan sikap terbalik.
Bila menyangkut urusan peluang keberhasilan di dunia ia menjadi sangat
serius. Ia kerahkan perhatian, waktu, tenaga dan uang tanpa keraguan. Namun
bila menyangkut urusan peluang keberhasilan di akhirat ia malah bersikap
setengah hati bahkan bermain-main dan bersenda-gurau. Ia sangat fokus akan
sukses dunia namun sangat tidak peduli sukses akhirat. Seolah sukses dunia
merupakan sesuatu yang hakiki sedangkan sukses akhirat hanyalah mimpi tanpa
bukti. Mengapa hal ini terjadi?
</p>
<ol>
<li><a href="sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf">ini
contoh dokumen tak pakai spasi</a></li>
<li><a href="sapi/spasi Akhirat Lebih Utama Daripada Dunia.pdf">contoh
pakai sepasi</a></li>
<li><a href="sapi/Akhirat Lebih Utama Daripada Dunia.pdf">contoh pakai
sepasi ke-2</a></li>
<li><a href="sapi/Akhirat Lebih Utama Daripada Dunia.odt">file odt pakai
spasi kosong</a></li>
<li><a href="sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt">file odt pakai
underscore</a></li>
</ol>
Ini dokumen tambahan <a href="sapi/Solr-install-v2.pdf">Instalasi Solr</a>
yang Bayu buat :-).
</body>
</html>

This index.html is successfully parsed and indexed.
I can see its records in the MySQL database.

Only this index.html and the single .odt file I mentioned before are parsed,
with their contents present in the database.
But the strange thing is that the status of all fetched documents is 2.
If I'm not mistaken, status 5 means the document was indexed successfully.
CMIIW.

> That being said, as we've established fetching does not seem to be the
> problem.
>
> Unless you wish to skip parsing for truncated documents then you will need
> to increase the http.content.limit to something over ~400K (the PDF in the
> log below is 395125 bytes). This will then remove the following log output
> (meaning that the document should be fully parsed)
> 2013-01-12 05:38:27,508 WARN  parse.ParserJob -
> http://localhost/sapi/Solr-install-v2.pdf skipped. Content of size 395125
> was truncated to 65536
> You may also wish to consider the parser.skip.truncated property in
> nutch-site.xml
>
>
OK. I can increase it.

> I don't suppose these PDF's are password protected or something like that?
>
>
Nope.
I just created the .odt files and saved them as PDF files.

> I would also explicitly map the content type
> application/vnd.oasis.opendocument.text to parse-tika in parse-plugins.xml.
>
> 2013-01-12 05:39:07,594 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> application/vnd.oasis.opendocument.text, but they are not mapped to it  in
> the parse-plugins.xml file
>

Yup. I will do it.

So why could the PDF parser not completely parse all of the PDF docs?

-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi,

On Fri, Jan 11, 2013 at 3:12 PM, Bayu Widyasanyata
<bw...@gmail.com>wrote:

>
> We can see that some of parse processes were not completed successfully.
>

Yes I see this. I also see that you have a http.proxy.port = 8080 but no
proxy host and that the protocol-httpclient plugin is not activated.
I also see some strange fetcher behaviour as it seems to fetch the server
instance e.g. 2013-01-12 05:37:41,987 INFO  fetcher.FetcherJob - fetching
http://localhost/, however I assume there is no document @ this location on
the server...

That being said, as we've established fetching does not seem to be the
problem.

Unless you wish to skip parsing for truncated documents then you will need
to increase the http.content.limit to something over ~400K (the PDF in the
log below is 395125 bytes). This will then remove the following log output
(meaning that the document should be fully parsed)
2013-01-12 05:38:27,508 WARN  parse.ParserJob -
http://localhost/sapi/Solr-install-v2.pdf skipped. Content of size 395125
was truncated to 65536
You may also wish to consider the parser.skip.truncated property in
nutch-site.xml
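
As a rough sketch of the two properties mentioned above, the following
nutch-site.xml fragment raises the HTTP content limit above the size of
the PDF in the log (395125 bytes). The value 1048576 is only an assumed
example, not a recommendation; pick whatever fits your largest documents.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Maximum number of bytes kept per document fetched over HTTP.
       The log shows a 395125-byte PDF being truncated to 65536 bytes.
       1048576 (1 MiB) is an assumed example value; a negative value
       disables truncation altogether. -->
  <property>
    <name>http.content.limit</name>
    <value>1048576</value>
  </property>

  <!-- When true, documents whose content was truncated are skipped by
       the parser instead of being parsed in their cut-off form. -->
  <property>
    <name>parser.skip.truncated</name>
    <value>true</value>
  </property>
</configuration>
```

With a limit larger than the document, the "was truncated" warning should
no longer apply to that file, so the setting of parser.skip.truncated only
matters for documents that still exceed the limit.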

I don't suppose these PDFs are password protected or something like that?

I would also explicitly map the content type
application/vnd.oasis.opendocument.text to parse-tika in parse-plugins.xml.

2013-01-12 05:39:07,594 INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.TikaParser] are enabled via the
plugin.includes system property, and all claim to support the content type
application/vnd.oasis.opendocument.text, but they are not mapped to it  in
the parse-plugins.xml file
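
A minimal sketch of such a mapping, assuming the stock parse-plugins.xml
layout where the parse-tika alias is already declared in the <aliases>
section. The fragment goes inside the existing <parse-plugins> root
element; the application/pdf entry is the one quoted elsewhere in this
thread.

```xml
<!-- Fragment for conf/parse-plugins.xml, to be placed inside the
     existing <parse-plugins> element. Assumes the parse-tika alias
     is already defined in the <aliases> section of the stock file. -->
<mimeType name="application/vnd.oasis.opendocument.text">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>
```

This should only silence the "not mapped" INFO message and make the
plugin choice explicit; it does not by itself explain a parse failure.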

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Hi Lewis,
Sorry for the late reply.

Please find the complete log here:
http://pastebin.com/EqeMtsb2

We can see that some of the parse processes did not complete successfully.

The following are the commands for the crawling and indexing steps.

*[Crawling step]*
bayu@thinkpato:/opt/searchengine/nutch$ ./bin/nutch crawl urls -depth 3 -topN 5

*[Indexing step]*
bayu@thinkpato:/opt/searchengine/nutch$ ./bin/nutch solrindex http://localhost:8080/solr -reindex
SolrIndexerJob: starting
Adding 1 documents
SolrIndexerJob: done.

Even though I repeat the crawl many times, the indexing step always adds
only 1 document.

Below is the parsechecker output for one file that parses successfully and
one that fails:

*[success]* -- but it's inconsistent, since Nutch FAILED to parse another
.odt file during the crawl. See the hadoop log.
bayu@thinkpato:/opt/searchengine/nutch$ ./bin/nutch parsechecker http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
---------
Url
---------------
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
---------
Metadata
---------
Page-Count :     1
dc:creator :     Bayu Widyasanyata
meta:character-count :     532
Paragraph-Count :     2
nbWord :     69
meta:paragraph-count :     2
Character Count :     532
Last-Save-Date :     2012-12-21T05:37:30
dcterms:modified :     2012-12-21T05:37:30
Object-Count :     0
meta:object-count :     0
Author :     Bayu Widyasanyata
nbObject :     0
creator :     Bayu Widyasanyata
xmpTPg:NPages :     1
meta:image-count :     0
Table-Count :     0
nbCharacter :     532
Word-Count :     69
meta:table-count :     0
meta:initial-author :     Bayu Widyasanyata
Last-Modified :     2012-12-21T05:37:30
Creation-Date :     2012-12-21T05:33:12
generator :     OpenOffice.org/3.2$Linux OpenOffice.org_project/320m12$Build-9483
meta:creation-date :     2012-12-21T05:33:12
meta:word-count :     69
Image-Count :     0
nbImg :     0
meta:author :     Bayu Widyasanyata
nbTab :     0
nbPage :     1
editing-cycles :     2
Content-Type :     application/vnd.oasis.opendocument.text
meta:save-date :     2012-12-21T05:37:30
meta:page-count :     1
Edit-Time :     PT00H04M18S
initial-creator :     Bayu Widyasanyata
nbPara :     2
modified :     2012-12-21T05:37:30
date :     2012-12-21T05:33:12
dcterms:created :     2012-12-21T05:33:12

*[failed]*
bayu@thinkpato:/opt/searchengine/nutch$ ./bin/nutch parsechecker http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
---------
Url
---------------
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
---------
Metadata
---------
xmp:CreatorTool :     Writer
meta:author :     Bayu Widyasanyata
xmpTPg:NPages :     1
dc:creator :     Bayu Widyasanyata
Content-Type :     application/pdf
created :     Sun Dec 23 19:23:22 WIT 2012
Author :     Bayu Widyasanyata
Creation-Date :     2012-12-23T12:23:22Z
date :     2012-12-23T12:23:22Z
producer :     OpenOffice.org 3.2
meta:creation-date :     2012-12-23T12:23:22Z
creator :     Bayu Widyasanyata
dcterms:created :     2012-12-23T12:23:22Z

Thanks.-

On Fri, Jan 11, 2013 at 11:09 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> I can't see any log output. Can you fetch and parse the pdfs with the
> parsechecker tool?



-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Lewis John Mcgibbney <le...@gmail.com>.

I can't see any log output. Can you fetch and parse the pdfs with the
parsechecker tool?

On Thursday, January 10, 2013, Bayu Widyasanyata <bw...@gmail.com>
wrote:
> For clarity, the log below is the about 4 of 5 my PDF docs that can't be
> parsed by nutch.
>
> On Fri, Jan 11, 2013 at 8:29 AM, Bayu Widyasanyata
> <bw...@gmail.com>wrote:
>
>> nutch parsing is still problem on pdf files.
>> Only 1 pdf can be parsed successfully.
>>
>> 2013-01-11 08:11:23,679 WARN  parse.ParseUtil - Unable to successfully
>> parse content
>> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf of
>> type application/pdf
>>
>> Even I had added on parse-plugins.xml explicitly:
>>
>>     <mimeType name="application/pdf">
>>       <plugin id="parse-tika" />
>>     </mimeType>
>>
>> What the missed things?

-- 
*Lewis*

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

To clarify, the log below is about the 4 of my 5 PDF docs that can't be
parsed by Nutch.

On Fri, Jan 11, 2013 at 8:29 AM, Bayu Widyasanyata
<bw...@gmail.com>wrote:

> nutch parsing is still problem on pdf files.
> Only 1 pdf can be parsed successfully.
>
> 2013-01-11 08:11:23,679 WARN  parse.ParseUtil - Unable to successfully
> parse content
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf of
> type application/pdf
>
> Even I had added on parse-plugins.xml explicitly:
>
>     <mimeType name="application/pdf">
>       <plugin id="parse-tika" />
>     </mimeType>
>
> What the missed things?




-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Nutch parsing is still a problem for PDF files.
Only 1 PDF can be parsed successfully.

2013-01-11 08:11:23,679 WARN  parse.ParseUtil - Unable to successfully
parse content
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf of
type application/pdf

This happens even though I explicitly added the following to
parse-plugins.xml:

    <mimeType name="application/pdf">
      <plugin id="parse-tika" />
    </mimeType>

What am I missing?

On Fri, Jan 11, 2013 at 7:55 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> No problem at all.
>
> Better safe than sorry.
>
> Lewis



-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Lewis John Mcgibbney <le...@gmail.com>.

No problem at all.

Better safe than sorry.

Lewis

On Thu, Jan 10, 2013 at 4:43 PM, Bayu Widyasanyata
<bw...@gmail.com>wrote:

> Yes, I forgot that things even I already put on my notes on previous
> installation.
> I'm quite new on nutch and also Java developments :)
>
> Thanks!



-- 
*Lewis*

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Yes, I forgot about that, even though I had already put it in my notes
from a previous installation.
I'm quite new to Nutch and also to Java development :)

Thanks!

On Fri, Jan 11, 2013 at 7:01 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi,
>
> java.io.IOException: java.lang.ClassNotFoundException:
> > com.mysql.jdbc.Driver
> >
>
> If you look at ivy.xml [0] you will see that the mysql-connector-java
> dependency is commented out. Please uncomment it, then build Nutch 2.x src
> again.
>
> This will download the dependency and make it available on your classpath.
>
> Thank you
>
> Lewis
>
> [0]
> http://svn.apache.org/viewvc/nutch/branches/2.x/ivy/ivy.xml?view=markup
>

-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi,

> java.io.IOException: java.lang.ClassNotFoundException:
> com.mysql.jdbc.Driver
>

If you look at ivy.xml [0] you will see that the mysql-connector-java
dependency is commented out. Please uncomment it, then build Nutch 2.x src
again.

This will download the dependency and make it available on your classpath.

Thank you

Lewis

[0] http://svn.apache.org/viewvc/nutch/branches/2.x/ivy/ivy.xml?view=markup
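
For illustration, the uncommented entry in ivy/ivy.xml might look roughly
like the fragment below. The rev value shown here is an assumption; keep
whatever revision the commented-out line in your own checkout specifies.
After editing, rebuild (for example with "ant runtime", as used elsewhere
in this thread) so that Ivy fetches the jar.

```xml
<!-- Fragment of ivy/ivy.xml, inside the existing <dependencies> element.
     MySQL JDBC driver, needed when MySQL is the storage backend.
     The rev value is an assumed example; use the revision already
     present in the commented-out line of your checkout. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
```

Because Ivy resolves the jar by its org/name/rev coordinates, the jar
does not need to be renamed or added to CLASSPATH by hand.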

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Hi,

The Ant build succeeded.
But now I've hit a classic issue that I can't solve :(

bayu@thinkpato:/opt/searchengine2/nutch$ ./bin/nutch crawl urls -depth 3 -topN 5
Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
Caused by: java.io.IOException: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
    at org.apache.gora.sql.store.SqlStore.getConnection(SqlStore.java:747)
    at org.apache.gora.sql.store.SqlStore.initialize(SqlStore.java:160)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
    ... 8 more
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:186)
    at org.apache.gora.sql.store.SqlStore.getConnection(SqlStore.java:735)
    ... 11 more

I have already set CLASSPATH, but I don't know what the correct filename of
the jar file should be.
Should it be named mysql.jar or mysql-connector-java.jar?
Which one will Nutch look for?

On Tue, Jan 8, 2013 at 2:47 PM, Bayu Widyasanyata
<bw...@gmail.com>wrote:

> Hi Lewis,
>
> Thanks for the link!




-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Hi Lewis,

Thanks for the link!

On Tue, Jan 8, 2013 at 6:11 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Bayu,
>
> On Sat, Jan 5, 2013 at 7:43 AM, Bayu Widyasanyata
> <bw...@gmail.com>wrote:
>
> >
> > Anyone can give me a hint?
> >
> > In parallel I changed to use nutch 1.6 binary and works well.
> > But curious to use the latest of nutch 2.1.
> >
>
> Please check out the latest 2.x branch here [0]. This uses Tika 1.2 and
> should fit your needs.
>
> Sorry for late response.
>
> Lewis
>
> [0] http://svn.apache.org/repos/asf/nutch/branches/2.x/
>



-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Bayu,

On Sat, Jan 5, 2013 at 7:43 AM, Bayu Widyasanyata
<bw...@gmail.com>wrote:

>
> Anyone can give me a hint?
>
> In parallel I changed to use nutch 1.6 binary and works well.
> But curious to use the latest of nutch 2.1.
>

Please check out the latest 2.x branch here [0]. This uses Tika 1.2 and
should fit your needs.

Sorry for the late response.

Lewis

[0] http://svn.apache.org/repos/asf/nutch/branches/2.x/

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Hi,

I still can't get Nutch 2.1 upgraded to Tika 1.2 :(
I copied the files as described in the NUTCH-1433 patch and ran "ant runtime",
but got too many errors!

========================================
.... part of:
    [javac] /home/bayu/Downloads/solr/apache-nutch-2.1/src/java/org/apache/nutch/util/PrefixStringMatcher.java:50: warning: [rawtypes] found raw type: Iterator
    [javac]     Iterator iter= prefixes.iterator();
    [javac]     ^
    [javac]   missing type arguments for generic class Iterator<E>
    [javac]   where E is a type-variable:
    [javac]     E extends Object declared in interface Iterator
    [javac] /home/bayu/Downloads/solr/apache-nutch-2.1/src/java/org/apache/nutch/util/SuffixStringMatcher.java:44: warning: [rawtypes] found raw type: Collection
    [javac]   public SuffixStringMatcher(Collection suffixes) {
    [javac]                              ^
    [javac]   missing type arguments for generic class Collection<E>
    [javac]   where E is a type-variable:
    [javac]     E extends Object declared in interface Collection
    [javac] /home/bayu/Downloads/solr/apache-nutch-2.1/src/java/org/apache/nutch/util/SuffixStringMatcher.java:46: warning: [rawtypes] found raw type: Iterator
    [javac]     Iterator iter= suffixes.iterator();
    [javac]     ^
    [javac]   missing type arguments for generic class Iterator<E>
    [javac]   where E is a type-variable:
    [javac]     E extends Object declared in interface Iterator
    [javac] /home/bayu/Downloads/solr/apache-nutch-2.1/src/java/org/apache/nutch/util/ToolUtil.java:48: warning: [unchecked] unchecked cast
    [javac]     Map<String,Object> jobs = (Map<String,Object>)results.get(Nutch.STAT_JOBS);
    [javac]                                                              ^
    [javac]   required: Map<String,Object>
    [javac]   found:    Object
    [javac] 100 errors
    [javac] 52 warnings

BUILD FAILED
/home/bayu/Downloads/solr/apache-nutch-2.1/build.xml:97: Compile failed; see the compiler error output for details.

Total time: 18 seconds
========================================

Can anyone give me a hint?

In parallel, I switched to the Nutch 1.6 binary and it works well.
But I'm still curious to use the latest Nutch 2.1.

Thanks in advance!

On Sun, Dec 30, 2012 at 1:46 PM, Bayu Widyasanyata
<bw...@gmail.com>wrote:

> Hi,
>
> Thank you for suggestions.
> And I was try to upgrade the Tika to 1.2 as mentioned on
> https://issues.apache.org/jira/browse/NUTCH-1433
>
> I will try your suggestions and/or upgrade tika.
>
> On Sun, Dec 30, 2012 at 6:07 AM, Dave Meikle <lo...@gmail.com> wrote:
> > Hi,
> >
> > Tika should parse those formats, so unless there is something peculiar
> > with all your files or setup, have you:
> >
> > - checked the size of the files to see if they are over the configured limits
> > - used the nutch parsechecker command to test individual files
> >
> > Cheers,
> > Dave
> >
> > On 25 Dec 2012, at 01:34, Bayu Widyasanyata <bw...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> ==Update==
> >>
> >> Checking hadoop.log, I found some interesting info: the parsing did
> >> not complete successfully.
> >>
> >> ...
> >> 2012-12-25 08:15:09,480 INFO  parse.ParserJob - Parsing
> >> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> >> 2012-12-25 08:15:09,480 INFO  parse.ParserFactory - The parsing
> >> plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> >> plugin.includes system property, and all claim to support the content
> >> type application/vnd.oasis.opendocument.text, but they are not mapped
> >> to it  in the parse-plugins.xml file
> >> 2012-12-25 08:15:09,517 WARN  parse.ParseUtil - Unable to successfully
> >> parse content
> >> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> >> of type application/vnd.oasis.opendocument.text
> >> 2012-12-25 08:15:09,520 INFO  parse.ParserJob - Parsing
> >> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> >> 2012-12-25 08:15:09,521 INFO  parse.ParserFactory - The parsing
> >> plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> >> plugin.includes system property, and all claim to support the content
> >> type application/pdf, but they are not mapped to it  in the
> >> parse-plugins.xml file
> >> 2012-12-25 08:15:09,545 WARN  parse.ParseUtil - Unable to successfully
> >> parse content
> >> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> >> of type application/pdf
> >> 2012-12-25 08:15:09,551 INFO  parse.ParserJob - Parsing
> >> http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
> >> 2012-12-25 08:15:09,560 WARN  parse.ParseUtil - Unable to successfully
> >> parse content
> >> http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
> >> of type application/vnd.oasis.opendocument.text
> >> 2012-12-25 08:15:09,563 INFO  parse.ParserJob - Parsing
> >> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> >> 2012-12-25 08:15:09,590 WARN  parse.ParseUtil - Unable to successfully
> >> parse content
> >> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> >> of type application/pdf
> >> 2012-12-25 08:15:09,597 INFO  parse.ParserJob - Parsing
> >> http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> >> 2012-12-25 08:15:09,652 WARN  parse.ParseUtil - Unable to successfully
> >> parse content
> >> http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> >> of type application/pdf
> >> ...
> >>
> >> I checked the parse-plugins.xml file and found no plugins handling
> >> type of application/pdf and application/vnd.oasis.opendocument.text.
> >> I knew that parse-tika handle PDF files but why those errors were still
> occurs?
> >>
> >> Any documents/links could explain in easy way to install and activate
> >> those supported plugins as mentioned at [1] on nutch parser?
> >>
> >> [1] http://tika.apache.org/1.2/formats.html#Portable_Document_Format
> >>
> >> Thanks,
> >>
> >> --
> >> wassalam,
> >> [bayu]
>
>
>
> --
> wassalam,
> [bayu]
>



-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Hi,

Thank you for the suggestions.
I was also going to try upgrading Tika to 1.2, as mentioned in
https://issues.apache.org/jira/browse/NUTCH-1433

I will try your suggestions and/or upgrade Tika.

On Sun, Dec 30, 2012 at 6:07 AM, Dave Meikle <lo...@gmail.com> wrote:
> Hi,
>
> Tika should parse those formats, so unless there is something peculiar
> with all your files or setup, have you tried the:
>
> - Size of the files to see if they are over configured limits
> - used the nutch parsechecker command to test individual files
>
> Cheers,
> Dave
>



-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

It looks like the parse process is working fine, even though the log says
"Unable to successfully parse content":

LOGS:
++++++++++++++++++++++++++
2013-01-16 08:13:44,887 INFO  parse.ParserJob - Parsing
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
2013-01-16 08:13:44,911 WARN  parse.ParseUtil - Unable to successfully
parse content
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf of
type application/pdf


parsechecker -dumpText output
++++++++++++++++++++++++++
bayu@thinkpato:/opt/searchengine/nutch2x$ ./bin/nutch parsechecker
-dumpText
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
---------
Url
---------------
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
---------
Metadata
---------
xmp:CreatorTool :     Writer
meta:author :     Bayu Widyasanyata
xmpTPg:NPages :     1
dc:creator :     Bayu Widyasanyata
Content-Type :     application/pdf
created :     Sun Dec 23 19:23:22 WIT 2012
Author :     Bayu Widyasanyata
Creation-Date :     2012-12-23T12:23:22Z
date :     2012-12-23T12:23:22Z
producer :     OpenOffice.org 3.2
meta:creation-date :     2012-12-23T12:23:22Z
creator :     Bayu Widyasanyata
dcterms:created :     2012-12-23T12:23:22Z
---------
ParseText
---------
Akhirat Lebih Utama Daripada Dunia Keberhasilan yang dikejar secara serius
oleh seorang muttaqin ialah keberhasilan di akhirat. Baginya keberhasilan
di dunia merupakan sesuatu yang bersifat supplementary (faktor pelengkap)
saja. Tetapi keberhasilan di akhirat adalah sesuatu yang tidak boleh
ditawar sedikitpun karena ia merupakan faktor utama. Ia tidak rela
mempertaruhkan keberhasilannya di akhirat demi keberhasilannya di dunia.
Namun sebaliknya, demi keberhasilannya di akhirat ia rela kehilangan
keberhasilannya di dunia. SpasiKosong.

====

The "text" value in my MySQL database is still empty for that file.

Thanks,

On Wed, Jan 16, 2013 at 7:41 AM, Bayu Widyasanyata
<bw...@gmail.com> wrote:

> On Tue, Jan 15, 2013 at 11:28 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> Did you check the http.accept property in nutch-site.xml
>
>
> I copied from nutch-default.xml, then add application/pdf:
>
> <property>
>   <name>http.accept</name>
>
> <value>text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8</value>
>   <description>Value of the "Accept" request header field.
>   </description>
> </property>
>
> Also has shown on hadoop.log:
> 2013-01-16 07:39:22,232 INFO  http.Http - http.accept =
> text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8
> --
> wassalam,
> [bayu]




-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

On Tue, Jan 15, 2013 at 11:28 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Did you check the http.accept property in nutch-site.xml


I copied the property from nutch-default.xml, then added application/pdf:

<property>
  <name>http.accept</name>
  <value>text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8</value>
  <description>Value of the "Accept" request header field.
  </description>
</property>

It also shows up in hadoop.log:
2013-01-16 07:39:22,232 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8
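
One detail about this value: in an HTTP Accept header, a q parameter applies only to the media range it is attached to, and a range without q defaults to 1. The Python sketch below is only an illustration (the function name accept_weights is made up, and it ignores Accept parameters other than q); it shows how the value above is weighted. Whether this has any effect on the empty "text" column is not established here.

```python
def accept_weights(header):
    """Map each media range in an Accept header value to its q weight."""
    weights = {}
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_range = parts[0]
        quality = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                quality = float(value)
        weights[media_range] = quality
    return weights


# The http.accept value from nutch-site.xml quoted above.
header = ("text/html,application/xhtml+xml,application/xml,"
          "application/pdf;q=0.9,*/*;q=0.8")
for media_range, quality in accept_weights(header).items():
    print(media_range, quality)
```

So with this value application/pdf is requested at 0.9, while text/html, application/xhtml+xml and application/xml carry the default weight of 1.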
-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Did you check the http.accept property in nutch-site.xml?

On Tuesday, January 15, 2013, Bayu Widyasanyata <bw...@gmail.com>
wrote:
> Hi Dave,
> Below are nutch parsechecker between nutch 1.6 and 2.x (checkout from
> [0]):
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> VERSION 2.x
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> bayu@thinkpato:/opt/searchengine/nutch2x$ ./bin/nutch parsechecker
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> ---------
> Url
> ---------------
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> ---------
> Metadata
> ---------
> xmp:CreatorTool :     Writer
> meta:author :     Bayu Widyasanyata
> xmpTPg:NPages :     1
> dc:creator :     Bayu Widyasanyata
> Content-Type :     application/pdf
> created :     Fri Dec 21 05:38:05 WIT 2012
> Author :     Bayu Widyasanyata
> Creation-Date :     2012-12-20T22:38:05Z
> date :     2012-12-20T22:38:05Z
> producer :     OpenOffice.org 3.2
> meta:creation-date :     2012-12-20T22:38:05Z
> creator :     Bayu Widyasanyata
> dcterms:created :     2012-12-20T22:38:05Z
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> VERSION 1.6
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> bayu@thinkpato:/opt/searchengine/nutch2x$ ../nutch/bin/nutch parsechecker
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> fetching:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> parsing:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> contentType: application/pdf
> signature: f992108356e0248635192bfe7c6d3efc
> ---------
> Url
> ---------------
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 0
> Content Metadata: ETag="187478-a091-4d15067c794e6" Date=Tue, 15 Jan 2013
> 15:00:47 GMT Content-Length=41105 Last-Modified=Thu, 20 Dec 2012 22:39:35
> GMT Content-Type=application/pdf Connection=close Accept-Ranges=bytes
> Server=Apache/2.2.14 (Ubuntu)
> Parse Metadata: xmpTPg:NPages=1 Creation-Date=2012-12-20T22:38:05Z
> meta:author=Bayu Widyasanyata meta:creation-date=2012-12-20T22:38:05Z
> created=Fri Dec 21 05:38:05 WIT 2012 dc:creator=Bayu Widyasanyata
> Author=Bayu Widyasanyata producer=OpenOffice.org 3.2
> dcterms:created=2012-12-20T22:38:05Z date=2012-12-20T22:38:05Z
> Content-Type=application/pdf xmp:CreatorTool=Writer creator=Bayu
> Widyasanyata
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> And below are the "indexchecker" results which available only on version
> 1.6:
>
> bayu@thinkpato:/opt/searchengine/nutch2x$ ../nutch/bin/nutch indexchecker
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> fetching:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> parsing:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> contentType: application/pdf
> content :    Akhirat Lebih Utama Daripada Dunia Keberhasilan yang dikejar
> secara serius oleh seorang muttaqin ial
> host :    localhost
> tstamp :    Tue Jan 15 22:05:50 WIT 2013
>
> ---
>
> Since version 2.x of nutch doesn't have "indexchecker" command, how
> nutch2.x know the content of a document (i.e. PDF files)?
> I'm not sure with this since my .odt file parsed successfully...
>
> Or might be something "mapping problem in Tika's pdf" parser with nutch?
>
> Anyway,
> Does this issue [1] has been solved?
> This issue is same with me...
>
> [0] http://svn.apache.org/repos/asf/nutch/branches/2.x/
> [1]
> http://lucene.472066.n3.nabble.com/Nutch-2-x-ParseUtil-failing-for-some-pdf-files-td4014084.html
>
> On Sun, Dec 30, 2012 at 6:07 AM, Dave Meikle <lo...@gmail.com> wrote:
>
>> Hi,
>>
>> Tika should parse those formats, so unless there is something peculiar
>> with all your files or setup, have you tried the:
>>
>> - Size of the files to see if they are over configured limits
>> - used the nutch parsechecker command to test individual files
>>
>> Cheers,
>> Dave
>>
> wassalam,
> [bayu]
>

-- 
*Lewis*

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Hi Dave,
Below is the nutch parsechecker output for Nutch 1.6 and 2.x (checked out from [0]):

++++++++++++++++++++++++++++++++++++++++++++++++++++++
VERSION 2.x
++++++++++++++++++++++++++++++++++++++++++++++++++++++
bayu@thinkpato:/opt/searchengine/nutch2x$ ./bin/nutch parsechecker
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
---------
Url
---------------
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
---------
Metadata
---------
xmp:CreatorTool :     Writer
meta:author :     Bayu Widyasanyata
xmpTPg:NPages :     1
dc:creator :     Bayu Widyasanyata
Content-Type :     application/pdf
created :     Fri Dec 21 05:38:05 WIT 2012
Author :     Bayu Widyasanyata
Creation-Date :     2012-12-20T22:38:05Z
date :     2012-12-20T22:38:05Z
producer :     OpenOffice.org 3.2
meta:creation-date :     2012-12-20T22:38:05Z
creator :     Bayu Widyasanyata
dcterms:created :     2012-12-20T22:38:05Z

++++++++++++++++++++++++++++++++++++++++++++++++++++++
VERSION 1.6
++++++++++++++++++++++++++++++++++++++++++++++++++++++
bayu@thinkpato:/opt/searchengine/nutch2x$ ../nutch/bin/nutch parsechecker
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
fetching:
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
parsing:
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
contentType: application/pdf
signature: f992108356e0248635192bfe7c6d3efc
---------
Url
---------------
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: ETag="187478-a091-4d15067c794e6" Date=Tue, 15 Jan 2013
15:00:47 GMT Content-Length=41105 Last-Modified=Thu, 20 Dec 2012 22:39:35
GMT Content-Type=application/pdf Connection=close Accept-Ranges=bytes
Server=Apache/2.2.14 (Ubuntu)
Parse Metadata: xmpTPg:NPages=1 Creation-Date=2012-12-20T22:38:05Z
meta:author=Bayu Widyasanyata meta:creation-date=2012-12-20T22:38:05Z
created=Fri Dec 21 05:38:05 WIT 2012 dc:creator=Bayu Widyasanyata
Author=Bayu Widyasanyata producer=OpenOffice.org 3.2
dcterms:created=2012-12-20T22:38:05Z date=2012-12-20T22:38:05Z
Content-Type=application/pdf xmp:CreatorTool=Writer creator=Bayu
Widyasanyata

++++++++++++++++++++++++++++++++++++++++++++++++++++++

And below are the results of "indexchecker", which is available only in
version 1.6:

bayu@thinkpato:/opt/searchengine/nutch2x$ ../nutch/bin/nutch indexchecker
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
fetching:
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
parsing:
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
contentType: application/pdf
content :    Akhirat Lebih Utama Daripada Dunia Keberhasilan yang dikejar
secara serius oleh seorang muttaqin ial
host :    localhost
tstamp :    Tue Jan 15 22:05:50 WIT 2013

---

Since Nutch 2.x doesn't have an "indexchecker" command, how can I check
what content Nutch 2.x extracted from a document (e.g. a PDF file)?
I'm not sure about this, since my .odt file parsed successfully...

Or might there be some kind of "mapping problem" between Tika's PDF parser
and Nutch?

Anyway,
has this issue [1] been solved?
It is the same issue I am seeing...

[0] http://svn.apache.org/repos/asf/nutch/branches/2.x/
[1]
http://lucene.472066.n3.nabble.com/Nutch-2-x-ParseUtil-failing-for-some-pdf-files-td4014084.html

On Sun, Dec 30, 2012 at 6:07 AM, Dave Meikle <lo...@gmail.com> wrote:

> Hi,
>
> Tika should parse those formats, so unless there is something peculiar
> with all your files or setup, have you tried the:
>
> - Size of the files to see if they are over configured limits
> - used the nutch parsechecker command to test individual files
>
> Cheers,
> Dave
>



-- 
wassalam,
[bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Dave Meikle <lo...@gmail.com>.

Hi,

Tika should parse those formats, so unless there is something peculiar
with all your files or setup, have you:

- checked the size of the files to see if they are over configured limits
- used the nutch parsechecker command to test individual files
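
As a rough illustration of the first check, the Python sketch below compares a file size against http.content.limit. The 65536-byte default is what I believe nutch-default.xml ships with for this release, so treat it as an assumption and check your own nutch-site.xml; the function name and the sample sizes are made up for illustration.

```python
# Assumed default for http.content.limit (bytes). A negative value is
# documented as "no truncation at all".
DEFAULT_HTTP_CONTENT_LIMIT = 65536


def is_truncated(content_length, limit=DEFAULT_HTTP_CONTENT_LIMIT):
    """Return True if content of this size would be cut off at the limit."""
    if limit < 0:
        return False
    return content_length > limit


# Hypothetical file sizes in bytes, not taken from a real crawl.
for size in (50000, 200000):
    print(size, "truncated" if is_truncated(size) else "fits")
```

A truncated PDF or ODT file can fail to parse even though the fetch itself succeeded, which is why the size is worth ruling out first.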

Cheers,
Dave

On 25 Dec 2012, at 01:34, Bayu Widyasanyata <bw...@gmail.com> wrote:

> Hi,
>
> ==Update==
>
> Checking hadoop.log found some interesting info that the parsing was
> not completed successfully.
>
> ...
> 2012-12-25 08:15:09,480 INFO  parse.ParserJob - Parsing
> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> 2012-12-25 08:15:09,480 INFO  parse.ParserFactory - The parsing
> plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content
> type application/vnd.oasis.opendocument.text, but they are not mapped
> to it  in the parse-plugins.xml file
> 2012-12-25 08:15:09,517 WARN  parse.ParseUtil - Unable to successfully
> parse content http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> of type application/vnd.oasis.opendocument.text
> 2012-12-25 08:15:09,520 INFO  parse.ParserJob - Parsing
> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> 2012-12-25 08:15:09,521 INFO  parse.ParserFactory - The parsing
> plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content
> type application/pdf, but they are not mapped to it  in the
> parse-plugins.xml file
> 2012-12-25 08:15:09,545 WARN  parse.ParseUtil - Unable to successfully
> parse content http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> of type application/pdf
> 2012-12-25 08:15:09,551 INFO  parse.ParserJob - Parsing
> http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
> 2012-12-25 08:15:09,560 WARN  parse.ParseUtil - Unable to successfully
> parse content http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
> of type application/vnd.oasis.opendocument.text
> 2012-12-25 08:15:09,563 INFO  parse.ParserJob - Parsing
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> 2012-12-25 08:15:09,590 WARN  parse.ParseUtil - Unable to successfully
> parse content http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> of type application/pdf
> 2012-12-25 08:15:09,597 INFO  parse.ParserJob - Parsing
> http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> 2012-12-25 08:15:09,652 WARN  parse.ParseUtil - Unable to successfully
> parse content http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> of type application/pdf
> ...
>
> I checked the parse-plugins.xml file and found no plugins handling
> type of application/pdf and application/vnd.oasis.opendocument.text.
> I knew that parse-tika handle PDF files but why those errors were still occurs?
>
> Any documents/links could explain in easy way to install and activate
> those supported plugins as mentioned at [1] on nutch parser?
>
> [1] http://tika.apache.org/1.2/formats.html#Portable_Document_Format
>
> Thanks,
>
> --
> wassalam,
> [bayu]

Re: Not all parsed docs is indexed & inconsistent parsed docs.

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Hi,

==Update==

Checking hadoop.log, I found some interesting info: the parsing was
not completed successfully.

...
2012-12-25 08:15:09,480 INFO  parse.ParserJob - Parsing
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
2012-12-25 08:15:09,480 INFO  parse.ParserFactory - The parsing
plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
plugin.includes system property, and all claim to support the content
type application/vnd.oasis.opendocument.text, but they are not mapped
to it  in the parse-plugins.xml file
2012-12-25 08:15:09,517 WARN  parse.ParseUtil - Unable to successfully
parse content http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
of type application/vnd.oasis.opendocument.text
2012-12-25 08:15:09,520 INFO  parse.ParserJob - Parsing
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
2012-12-25 08:15:09,521 INFO  parse.ParserFactory - The parsing
plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
plugin.includes system property, and all claim to support the content
type application/pdf, but they are not mapped to it  in the
parse-plugins.xml file
2012-12-25 08:15:09,545 WARN  parse.ParseUtil - Unable to successfully
parse content http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
of type application/pdf
2012-12-25 08:15:09,551 INFO  parse.ParserJob - Parsing
http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
2012-12-25 08:15:09,560 WARN  parse.ParseUtil - Unable to successfully
parse content http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
of type application/vnd.oasis.opendocument.text
2012-12-25 08:15:09,563 INFO  parse.ParserJob - Parsing
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
2012-12-25 08:15:09,590 WARN  parse.ParseUtil - Unable to successfully
parse content http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
of type application/pdf
2012-12-25 08:15:09,597 INFO  parse.ParserJob - Parsing
http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
2012-12-25 08:15:09,652 WARN  parse.ParseUtil - Unable to successfully
parse content http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
of type application/pdf
...
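
To collect the failures in one place, the WARN entries above can be filtered with a few lines of Python. This is only a sketch: it assumes each entry is a single line in hadoop.log (the wrapping above comes from the mail client), and the function name failed_parses is made up.

```python
import re

# Shape of the ParseUtil warning as it appears in the excerpt above.
FAILED = re.compile(
    r"WARN\s+parse\.ParseUtil - Unable to successfully parse content "
    r"(?P<url>\S+) of type (?P<type>\S+)"
)


def failed_parses(lines):
    """Return (url, content_type) pairs for every logged parse failure."""
    found = []
    for line in lines:
        match = FAILED.search(line)
        if match:
            found.append((match.group("url"), match.group("type")))
    return found


# Two entries from the excerpt, joined back into single lines.
sample = [
    "2012-12-25 08:15:09,551 INFO  parse.ParserJob - Parsing "
    "http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt",
    "2012-12-25 08:15:09,560 WARN  parse.ParseUtil - Unable to successfully "
    "parse content http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt "
    "of type application/vnd.oasis.opendocument.text",
]
for url, content_type in failed_parses(sample):
    print(content_type, url)
```

Passing an open file object for hadoop.log instead of the sample list works the same way, since the function only iterates over lines.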

I checked the parse-plugins.xml file and found no plugin mapped to the
types application/pdf and application/vnd.oasis.opendocument.text.
I know that parse-tika handles PDF files, so why do those errors still occur?
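
For reference, an explicit mapping in parse-plugins.xml is a mimeType element wrapping a plugin element. The Python sketch below just builds and prints such entries for the two types; the helper name mime_mapping is made up, and I am assuming "parse-tika" is the plugin id to map. It may only address the INFO message from ParserFactory; whether it changes the WARN from ParseUtil is not something I have confirmed.

```python
import xml.etree.ElementTree as ET


def mime_mapping(mime_type, plugin_id="parse-tika"):
    """Build a <mimeType> element that maps one content type to one plugin."""
    element = ET.Element("mimeType", {"name": mime_type})
    ET.SubElement(element, "plugin", {"id": plugin_id})
    return element


# The two content types named in the log messages above.
for mime in ("application/pdf", "application/vnd.oasis.opendocument.text"):
    print(ET.tostring(mime_mapping(mime), encoding="unicode"))
```

The printed elements would go inside the parse-plugins element of conf/parse-plugins.xml, next to the existing mimeType entries.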

Are there any documents/links that explain, in an easy way, how to install
and activate the parsers for the supported formats listed at [1] in Nutch?

[1] http://tika.apache.org/1.2/formats.html#Portable_Document_Format

Thanks,

On Tue, Dec 25, 2012 at 7:16 AM, Bayu Widyasanyata
<bw...@gmail.com> wrote:
> Hi All,
>
> I'm a new on nutch and solr, with following platforms:
> - nutch 2.1
> - solr 4.0
> - jdk 1.7 on ubuntu 10.04
>
> I'm also part of "member" of the legendary implementation nutch with
> MySQL at http://nlp.solutions.asia/?p=180 ;-)
> I have installed all of above successfully with some minors
> corrections on table structure (i.e. change "typ" column into "type"
> and also change its size to varchar(64)).
>
> I created an index.html (with simple text inside) at URL
> http://localhost/sapi/ and put it into urls/seed.txt as source URL
> crawled.
> For testing I created 5 inlinks which contains 5 documents with 2
> formats (pdf and odt) and filename format (filename with space and
> no-space) in index.html file:
>
> 1. http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> 2. http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> 3. http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> 4. http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> 5. http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
>
> *the chars %20 on links above is actually whitespace character. I only
> copied what my browser read/interpret and converted into safe URLs.
> **Converting the rules above (space char) has also applied on
> regex-normalize.xml file.
>
> Here are some facts and doubts I got after play around with nutch and solr:
>
> 1. All of those docs has parsed "successfully" since the status is "2".
> 2. Why I called it "successfully" is because some of docs (#1 and #2
> above) are not having the value on "text" column in webpage MySQL
> table. It means those docs are failed to parse by nutch. CMIIW.
> 3. The number of docs (numdocs) reported on Solr Admin is always 2
> docs! :( -- only indexing index.html and 4.
> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> successfully indexed by Solr. Even I do repeat the crawl and reindex
> process many times.
>
> Below are 2 lines commands in single bash script to crawl and index my page:
>
> #!/bin/bash
> ./runtime/local/bin/nutch crawl urls -depth 3 -topN 5
> ./runtime/local/bin/nutch solrindex http://localhost:8080/solr/ -reindex
>
> Appreciate for any help.
>
> TIA
>
> --
> wassalam,
> [bayu]

-- 
wassalam,
[bayu]