Posted to user@nutch.apache.org by Luis Armando Roca Fumero <lr...@uclv.edu.cu> on 2013/10/18 16:05:19 UTC

Nutch 1.7 and Solr 4.4.0 Integrate

Hello friends,
I have configured Nutch 1.7 and Solr 4.4.0 to work together, following the Nutch Tutorial.
When I run the command: ./bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5 > test.txt
everything works well, but at the end, when the Indexer starts, I get errors like this:

Indexer: starting at 2013-10-18 13:57:32
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication



What can I do? What is wrong? I have no idea. I also tried Nutch 2.2.1, and it doesn't work with Solr 4.4.0 either. I need a tutorial for integrating Nutch with Solr, in baby steps :)
Thanks in advance

Universidad Central "Marta Abreu" de Las Villas, on its 60th anniversary. Founded on November 30, 1952. Visit us at:  http://www.uclv.edu.cu
Take part in Universidad 2014, February 10 to 14, 2014. Havana, Cuba. http://www.congresouniversidad.cu/



Re: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Kadir Sert <ka...@gmail.com>.
You might be running into this issue:

https://issues.apache.org/jira/browse/NUTCH-1100


Re: RV: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Talat UYARER <ta...@agmlab.com>.
Hey Luis,

You are welcome! We look forward to your document ;)

Regards
Talat



RE: RV: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Luis Armando Roca Fumero <lr...@uclv.edu.cu>.
Hello Talat,
First of all, thank you, thank you, and a thousand times thank you. :)
I finally got Nutch 1.7 and Apache Solr 4.4.0 working together. This couldn't have happened without the help of the people on the Nutch mailing list, especially Talat. :)
The reason for the success is indeed that nutch-site.xml needs the urlfilter-validator plugin. :)
Yes, I know I'm using too many smileys; I'm very glad. This is a great step toward working with Nutch at the university where I work.
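
To double-check that a plugin is actually switched on, one could test its id against the plugin.includes property of nutch-site.xml. The sketch below assumes the property value is a regular expression that must match the whole plugin id; the sample configuration and the function name are hypothetical and are not the files from this thread.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml content, for illustration only. The plugin
# list is an assumption, not the configuration posted in this thread.
SAMPLE_NUTCH_SITE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
"""


def plugin_enabled(site_xml: str, plugin_id: str) -> bool:
    """Return True if plugin_id fully matches the plugin.includes regex."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == "plugin.includes":
            pattern = (prop.findtext("value") or "").strip()
            return re.fullmatch(pattern, plugin_id) is not None
    # Property not set in this file; the default configuration would
    # apply instead, which this sketch does not inspect.
    return False
```

With the sample above, plugin_enabled(SAMPLE_NUTCH_SITE, "urlfilter-validator") is true, while an id that is not listed, such as "urlfilter-domain", is reported as disabled.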


About writing a Nutch and Solr integration document: I'm new to this software, but if you have the time and patience to guide me, I accept the challenge!!!
Best regards,
Luis Armando




Re: RV: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Talat UYARER <ta...@agmlab.com>.
Hi Luis,

The cause of this issue is very hard to pin down. I have a possible solution, but I am
not sure :) When I looked at your nutch-site.xml, I saw that you have activated only a few
plugins. If you don't use urlfilter-validator, then when you parse websites the
parser generates invalid URLs such as http://# or ???###eee. When
I tried to reproduce your error, I got the same issue whenever one of my documents had an
invalid URL. So your issue is probably caused by invalid URLs. Can you add the
urlfilter-validator plugin to your nutch-site.xml, drop your db and Solr
collection, and start crawling again?
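
To illustrate the kind of URL such a filter is meant to reject, here is a minimal sketch. It is not the urlfilter-validator implementation; the accepted schemes and the rule "must have a host" are assumptions made for this example, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: only these schemes count as crawlable.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def looks_valid(url: str) -> bool:
    """Rough validity check: known scheme and a non-empty host."""
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(host)
```

Both examples from the message above, http://# and ???###eee, fail this check, while an ordinary address such as http://www.uclv.edu.cu passes.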

A few tips:

To empty your table in the HBase shell:
truncate "webpage"

To delete your Solr index:
localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
(this deletes your index)
localhost:8983/solr/update?commit=true (sometimes needed for your
request to take effect)

Your second question is really one for the Solr mailing list. I believe they can give
more information about Solr. But if you want to look at your index, you can
use this URL:
http://localhost:8983/solr/collection1/select?q=*%3A*&start=0&rows=30&wt=xml&indent=true
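
The delete, commit and select URLs above contain characters such as <, > and * that should be percent-encoded when sent from a script. A small sketch that builds the three URLs, assuming the host, port and core name used in this thread; the function names are hypothetical and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Host and port as used in the thread; adjust for your own setup.
SOLR_BASE = "http://localhost:8983/solr"


def delete_all_url(base: str = SOLR_BASE) -> str:
    """URL that asks Solr to delete every document (query *:*)."""
    body = "<delete><query>*:*</query></delete>"
    return base + "/update?" + urlencode({"stream.body": body})


def commit_url(base: str = SOLR_BASE) -> str:
    """URL that asks Solr to commit pending changes."""
    return base + "/update?" + urlencode({"commit": "true"})


def select_all_url(base: str = SOLR_BASE, core: str = "collection1",
                   rows: int = 30) -> str:
    """URL that lists the first `rows` documents of the given core."""
    params = {"q": "*:*", "start": 0, "rows": rows,
              "wt": "xml", "indent": "true"}
    return base + "/" + core + "/select?" + urlencode(params)
```

The resulting strings can be pasted into a browser or passed to any HTTP client. Note that the delete URL removes the whole index, so it is only appropriate on a test instance.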

Actually you are right, we need a Solr integration document. I have learned that well.
Can you write a document about Solr integration with Nutch? I can review
it and we will publish it on our wiki. What do you think?
I hope to hear good news from you :)
Have a nice day
Talat




RV: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Luis Armando Roca Fumero <lr...@uclv.edu.cu>.
Sorry I forgot the solr.log:  http://pastebin.com/XAL58zbL
I hope you can help me, thanks in advance
Luis Armando



RE: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Luis Armando Roca Fumero <lr...@uclv.edu.cu>.
Good morning friends,
Since I could not solve my problem with Nutch 1.7/2.2.1 and Solr 4.4.0, I am going to describe what I have done from the beginning.
1 - I downloaded Solr 4.4.0.
2 - I downloaded Nutch 1.7.
3 - I copied the file schema-solr4.xml to example/solr/collection1/conf and renamed it to schema.xml.
4 - When I started Solr 4.4.0, I got the following error: msg=SolrCore 'collection1' is not available due to init failure:
Unable to use updateLog: _version_ field must exist in schema, using indexed="true" stored="true" and multiValued="false" (_version_ does not exist), trace=org.apache.solr.common.SolrException: SolrCore 'collection1' is not available due to init failure: Unable to use updateLog: _version_ field must exist in schema, using indexed="true" stored="true" and multiValued="false" (_version_ does not exist)
5 - To resolve this error, I added the following line to schema.xml: <field name="_version_" indexed="true" type="long" stored="true"/>
6 - The Nutch configuration files can be found here:
   nutch-site.xml: http://pastebin.com/Dh3tTacL
   regex-urlfilter: http://pastebin.com/eRdxPB1b
   seed.txt: http://pastebin.com/unNgJdmU
7 - When I run the following command: ./bin/nutch solrdedup http://localhost:8983/solr/

I get this in hadoop.log:
2013-10-21 14:22:31,645 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2013-10-21 14:22:31
2013-10-21 14:22:31,647 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
2013-10-21 14:22:32,050 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-10-21 14:22:32,927 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-10-21 14:22:32,928 WARN  mapred.LocalJobRunner - job_local741622751_0001
java.lang.Exception: java.lang.NullPointerException
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.io.Text.encode(Text.java:388)
        at org.apache.hadoop.io.Text.set(Text.java:178)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
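
Regarding step 5 above, the Solr error asks for a _version_ field with indexed="true", stored="true" and multiValued="false". A minimal sketch for checking a schema.xml against that requirement; the sample schema is a hypothetical, heavily shortened one, and the function name is made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, heavily shortened schema.xml, for illustration only.
SAMPLE_SCHEMA = """<schema name="nutch" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="long" class="solr.TrieLongField"/>
  </types>
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
  </fields>
</schema>
"""


def version_field_problems(schema_xml: str) -> List[str]:
    """List explicit attribute values that conflict with the error text."""
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") != "_version_":
            continue
        problems = []
        # Only explicitly written attributes are checked; attributes that
        # are omitted are left alone in this sketch.
        if field.get("indexed", "true") != "true":
            problems.append("indexed must be true")
        if field.get("stored", "true") != "true":
            problems.append("stored must be true")
        if field.get("multiValued", "false") != "false":
            problems.append("multiValued must be false")
        return problems
    return ["_version_ field is missing"]
```

An empty list means no conflict was found; a schema without the field yields the single entry "_version_ field is missing".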



Talat, can you explain to me how to check the Solr index for committed documents? Sorry, I'm new to Solr and Nutch.
I don't know what I'm doing wrong. Is it necessary to change to Solr 3.x, or is Solr 4.4.0 fine? Can someone give me a step-by-step tutorial for integrating Solr and Nutch? I have followed the Nutch tutorial on the web:
http://wiki.apache.org/nutch/NutchTutorial , but I can't get the job done.

Any ideas are welcome
Thanks for your time, friends,
Luis Armando



Re: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Talat UYARER <ta...@agmlab.com>.
Hi Luis,

I am not sure what is causing that. Did you check your Solr index for
committed documents? Maybe the commit didn't happen. You don't need to run all the
Nutch jobs again; the other jobs work fine. You can run only the dedup job with:
bin/nutch solrdedup solr_url
After that, can you share your solr.log?
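
One way to check for committed documents, sketched under assumptions: request http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json and read numFound from the reply. The sample response below is made up for illustration, and the function name is hypothetical.

```python
import json

# Made-up response body in the shape Solr returns for wt=json.
SAMPLE_RESPONSE = (
    '{"responseHeader": {"status": 0, "QTime": 1},'
    ' "response": {"numFound": 0, "start": 0, "docs": []}}'
)


def committed_doc_count(response_text: str) -> int:
    """Return the number of documents matched by the select query."""
    reply = json.loads(response_text)
    return int(reply["response"]["numFound"])
```

A count of 0 for the match-all query would suggest that nothing was committed to the index, which is the situation described above.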

Talat

On 19-10-2013 04:43, Luis Armando Roca Fumero wrote:
> Thanks a lot Talat :), I truly appreciate your help, and that of the other people who gave me ideas
>
> I fixed the Solr schema. Following the Nutch Tutorial, I had changed the line <field name="content" type="text_general" stored="true" indexed="true"/> to <field name="content" type="text" stored="true" indexed="true"/>, but this was wrong.
> I fixed that and ran Nutch 1.7 again, but I am still getting problems :( You can see a new hadoop.log here:  http://pastebin.com/2qY0sUJh
> The errors are:
> Exception in thread "main" java.io.IOException: Job failed!
>          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>          at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>          at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>          at org.apache.nutch.crawl.Crawl.run(Crawl.java:160)
>          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>          at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>
> Any ideas are welcome!!!
> Thanks in advance,
> Luis Armando
>
> ________________________________________
> From: Talat UYARER [talat.uyarer@agmlab.com]
> Sent: Friday, October 18, 2013 03:39 p.m.
> To: user@nutch.apache.org
> Subject: Re: Nutch 1.7 and Solr 4.4.0 Integrate
>
> Ok Luis,
>
> I found your problem. :) You have a problem with your Solr schema. In your
> hadoop.log you can see this line:
>
> org.apache.solr.common.SolrException: {msg=SolrCore 'collection1' is not available due to init failure: Unknown fieldType 'text' specified on field content,trace=org.apache.solr.common.SolrException: SolrCore 'collection1' is not available due to init failure: Unknown fieldType 'text' specified on field content
>         at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:860)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:251)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
>         a
>
>
> As you can see, when Nutch tries to commit, Solr throws an exception. You should
> check your Solr schema. You may ask why solrdedup throws an
> exception: it is because IndexerJob didn't commit your documents to Solr, so when
> the dedup job ran it didn't find any documents to check for duplicates.
>
> Talat


RE: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Luis Armando Roca Fumero <lr...@uclv.edu.cu>.
Thanks a lot, Talat :) I truly appreciate your help, and that of the other people who gave me ideas.

I fixed the Solr schema. Following the Nutch Tutorial, I had changed the line <field name="content" type="text_general" stored="true" indexed="true"/> to <field name="content" type="text" stored="true" indexed="true"/>, but that was wrong.
I fixed that and ran Nutch 1.7 again, but I am still getting problems :( You can see the new hadoop.log here: http://pastebin.com/2qY0sUJh
The errors are:
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:160)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Any ideas are welcome!!!
Thanks in advance,
Luis Armando




Re: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Talat UYARER <ta...@agmlab.com>.
Ok Luis,

I found your problem. :) There is a problem with your Solr schema. In your
hadoop.log you can see this line:

    org.apache.solr.common.SolrException: {msg=SolrCore 'collection1' is not available due to init failure: Unknown fieldType 'text' specified on field content,trace=org.apache.solr.common.SolrException: SolrCore 'collection1' is not available due to init failure: Unknown fieldType 'text' specified on field content
        at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:860)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:251)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
        [...]

As you can see, Solr throws an exception when Nutch tries to commit, so you
should check your Solr schema. You may ask why solrdedup throws an
exception: it is because IndexerJob did not commit your documents to Solr,
so when dedup runs it does not find any documents to check for duplicates.

Talat
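
A quick way to spot this kind of schema problem before starting Solr is to list every field whose type has no matching fieldType definition. The following is only a minimal sketch in Python: the function name undefined_field_types and the tiny SAMPLE schema are made up for illustration, and a real schema.xml would be read from the Solr core's conf directory instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down schema that reproduces the situation in the log:
# the "content" field refers to a fieldType named "text" that is not defined.
SAMPLE = """<schema name="example" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="text_general" class="solr.TextField"/>
  </types>
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
    <field name="content" type="text" stored="true" indexed="true"/>
  </fields>
</schema>"""


def undefined_field_types(schema_xml):
    """Return (field name, type) pairs whose type is not declared in the schema."""
    root = ET.fromstring(schema_xml)
    # Collect declared type names; both spellings of the element are checked.
    defined = {ft.get("name") for ft in root.iter("fieldType")}
    defined |= {ft.get("name") for ft in root.iter("fieldtype")}
    problems = []
    for tag in ("field", "dynamicField"):
        for field in root.iter(tag):
            if field.get("type") not in defined:
                problems.append((field.get("name"), field.get("type")))
    return problems


print(undefined_field_types(SAMPLE))  # [('content', 'text')]
```

If the list is empty, the "Unknown fieldType" init failure should not come from a missing type definition, and the cause has to be looked for elsewhere in the Solr log.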

On 18-10-2013 at 23:21, Luis Armando Roca Fumero wrote:
> I am running Nutch as the root user.
> When I check, /crawl/segments/20131017194821/crawl_fetch doesn't exist.
> The segment is incomplete: there are only _temporary and crawl_generate.
> What can I do? Would it help to copy fresh binary files from the Nutch 1.7 release?
> Thanks in advance,
> Luis Armando
>
> ________________________________________
> From: Talat UYARER [talat.uyarer@agmlab.com]
> Sent: Friday, 18 October 2013 11:04 a.m.
> To: user@nutch.apache.org
> Subject: Re: Nutch 1.7 and Solr 4.4.0 Integrate
>
> Did you check your privileges? Can you check whether your path exists?
>
>     2013-10-18 13:19:49,020 ERROR security.UserGroupInformation - PriviledgedActionException as:root cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.7/crawl/segments/20131017194821/crawl_fetch
>
>
> On 18-10-2013 at 18:22, Luis Armando Roca Fumero wrote:
>> Oops, sorry Talat UYARER:
>> This is the link to the hadoop.log file: http://pastebin.com/F6qBQhSA
>>
>> ________________________________________
>> From: Talat UYARER [talat.uyarer@agmlab.com]
>> Sent: Friday, 18 October 2013 10:06 a.m.
>> To: user@nutch.apache.org
>> Subject: Re: Nutch 1.7 and Solr 4.4.0 Integrate
>>
>> Mailing lists don't accept attachments. Can you share it on pastebin or similar?
>>
>> On 18-10-2013 at 17:59, Luis Armando Roca Fumero wrote:
>>> Here is the hadoop.log file
>>> Thanks for your time,
>>> Luis Armando
>>> ________________________________________
>>> From: Talat UYARER [talat.uyarer@agmlab.com]
>>> Sent: Friday, 18 October 2013 09:51 a.m.
>>> To: user@nutch.apache.org
>>> Subject: Re: Nutch 1.7 and Solr 4.4.0 Integrate
>>>
>>> Hi Luis,
>>> Can you share your hadoop.log file? We need the verbose log output to
>>> understand the problem. But if I understand correctly, you don't have
>>> any problem with IndexerJob.
>>>
>>> Talat
>>>
>>> On 18-10-2013 at 17:36, Luis Armando Roca Fumero wrote:
>>>> Hello
>>>> I added the lines that Mourad suggested, but I am still getting the same errors:
>>>>
>>>> SOLRIndexWriter
>>>>             solr.server.url : URL of the SOLR instance (mandatory)
>>>>             solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>>             solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>>>             solr.auth : use authentication (default false)
>>>>             solr.auth.username : use authentication (default false)
>>>>             solr.auth : username for authentication
>>>>             solr.auth.password : password for authentication
>>>>
>>>>
>>>> Indexer: finished at 2013-10-18 14:39:23, elapsed: 00:00:04
>>>> SolrDeleteDuplicates: starting at 2013-10-18 14:39:23
>>>> SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>             at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>>             at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>>>>             at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>>>>             at org.apache.nutch.crawl.Crawl.run(Crawl.java:160)
>>>>             at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>             at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>>
>>>> Any other ideas?
>>>> Thanks for your time,
>>>> Luis Armando
>>>>
>>>> ________________________________________
>>>> From: Mouradk [mouradk78@gmail.com]
>>>> Sent: Friday, 18 October 2013 09:08 a.m.
>>>> To: user@nutch.apache.org
>>>> Subject: Re: Nutch 1.7 and Solr 4.4.0 Integrate
>>>>
>>>> Hi Luis,
>>>>
>>>> In your nutch-site.xml configuration file you need to add the Solr indexer plugin:
>>>>
>>>> <property>
>>>>       <name>plugin.includes</name>
>>>>       <value>protocol-http|parse-(html|tika)|index-(basic|anchor)|indexer-solr</value>
>>>> </property>
>>>>
>>>> Hope this helps,
>>>>
>>>> Mourad
>>>>
>>>>
>>>> On 18 Oct 2013, at 15:05, Luis Armando Roca Fumero <lr...@uclv.edu.cu> wrote:
>>>>
>>>>> Hello friends:
>>>>> I configured Nutch 1.7 and Solr 4.4.0 to work together, following the Nutch Tutorial.
>>>>> When I run the command: ./bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5 > test.txt
>>>>> everything works well, but at the end, when the Indexer starts, I get errors like this:
>>>>>
>>>>> Indexer: starting at 2013-10-18 13:57:32
>>>>> Indexer: deleting gone documents: false
>>>>> Indexer: URL filtering: false
>>>>> Indexer: URL normalizing: false
>>>>> Active IndexWriters :
>>>>> SOLRIndexWriter
>>>>>            solr.server.url : URL of the SOLR instance (mandatory)
>>>>>            solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>>>            solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>>>>            solr.auth : use authentication (default false)
>>>>>            solr.auth.username : use authentication (default false)
>>>>>            solr.auth : username for authentication
>>>>>            solr.auth.password : password for authentication
>>>>>
>>>>>
>>>>>
>>>>> What can I do? What is wrong? I have no idea. I also tried Nutch 2.2.1 and it doesn't work with Solr 4.4.0 either. I need a tutorial for integrating Nutch with Solr, in baby steps :)
>>>>> Thanks in advance
>>>>>


RE: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Luis Armando Roca Fumero <lr...@uclv.edu.cu>.
I am running Nutch as the root user.
When I check, /crawl/segments/20131017194821/crawl_fetch doesn't exist.
The segment is incomplete: there are only _temporary and crawl_generate.
What can I do? Would it help to copy fresh binary files from the Nutch 1.7 release?
Thanks in advance,
Luis Armando





Re: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Talat UYARER <ta...@agmlab.com>.
Did you check your privileges? Can you check whether your path exists?

    2013-10-18 13:19:49,020 ERROR security.UserGroupInformation - PriviledgedActionException as:root cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.7/crawl/segments/20131017194821/crawl_fetch
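
To see at a glance whether a segment is complete, one option is a small check like the sketch below. It assumes the usual Nutch 1.x segment layout (crawl_generate, crawl_fetch, content, crawl_parse, parse_data, parse_text); the helper name missing_segment_parts is made up for illustration, and the demo uses a throwaway temporary directory rather than the real /opt/apache-nutch-1.7 path.

```python
import os
import tempfile

# Subdirectories that a fetched and parsed Nutch 1.x segment normally contains
# (assumed layout; "content" may be absent if content storing is disabled).
EXPECTED_PARTS = ["crawl_generate", "crawl_fetch", "content",
                  "crawl_parse", "parse_data", "parse_text"]


def missing_segment_parts(segment_dir):
    """Return the expected subdirectories that are absent from segment_dir."""
    return [part for part in EXPECTED_PARTS
            if not os.path.isdir(os.path.join(segment_dir, part))]


# Demo: mimic a segment where the fetch step never finished, so only
# crawl_generate and a leftover _temporary directory exist.
with tempfile.TemporaryDirectory() as segment:
    os.mkdir(os.path.join(segment, "crawl_generate"))
    os.mkdir(os.path.join(segment, "_temporary"))
    demo_missing = missing_segment_parts(segment)

print(demo_missing)
```

A segment that reports crawl_fetch as missing was generated but never fetched successfully, which would match the "Input path does not exist" error above; it is a leftover from an interrupted run rather than a permissions problem.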




RE: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Luis Armando Roca Fumero <lr...@uclv.edu.cu>.
Hello friends:
 I'm crawling with Nutch, and I don't want to crawl images at all, nor URLs containing "?" or other strange characters. However, when I search Solr for *.gif, I still get results. This is a fragment of my Solr search response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">73</int>
    <lst name="params">
      <str name="q">*.gif</str>
    </lst>
  </lst>
  <result name="response" numFound="352" start="0" maxScore="1.0">
    <doc>
      <str name="content"/>
      <str name="segment">20131114152100</str>
      <float name="boost">1.0</float>
      <str name="digest">85cb9286b70bdee25b40433645b9ff72</str>
      <date name="tstamp">2013-11-14T16:18:22.029Z</date>
      <str name="id">http://calorm.qf.uclv.edu.cu/Images1/BigPracBar.gif</str>
      <str name="url">http://calorm.qf.uclv.edu.cu/Images1/BigPracBar.gif</str>
      <long name="_version_">1451712741146361856</long>
    </doc>
    <doc>
      <str name="content"/>
      <str name="segment">20131114152100</str>
      <float name="boost">1.0</float>
      <str name="digest">292408955f4aae8eec90e0ce55fbd739</str>
      <date name="tstamp">2013-11-14T16:39:27.359Z</date>
      <str name="id">http://calorm.qf.uclv.edu.cu/Images1/Bigenlbar.gif</str>
      <str name="url">http://calorm.qf.uclv.edu.cu/Images1/Bigenlbar.gif</str>
      <long name="_version_">1451712741161041920</long>
    </doc>

      </str>
      <str name="title">Forum UCLV • Preguntas Frecuentes</str>
      <str name="segment">20131114152100</str>
      <float name="boost">1.0</float>
      <str name="digest">a8c190fb3d22f71d47b67647bc814cba</str>
      <date name="tstamp">2013-11-14T16:41:55.548Z</date>
      <str name="id">http://forum.uclv.edu.cu/faq.php?sid=371ada5505649fe6c0155ef3d7bc261e</str>
      <str name="url">http://forum.uclv.edu.cu/faq.php?sid=371ada5505649fe6c0155ef3d7bc261e</str>
      <long name="_version_">1451712743156482048</long>
    </doc>
    <doc>



------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
nutch-site.xml:
<configuration>

    <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>
    </property>

    <property>
        <name>plugin.includes</name>
        <value>protocol-(http|ftp)|urlfilter-validator|parse-(html|tika)|index-(basic|anchor)|indexer-solr</value>
    </property>

</configuration>
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://([a-z0-9]*\.).uclv.edu.cu/
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I ran nutch with this command: bin/crawl urls/seed.txt Testcrawl/ http://solr1:8983/solr 2

What is wrong in my conf files????
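
One way to narrow this down is to replay the rules from regex-urlfilter.txt by hand. The sketch below is an approximation in Python, not Nutch's own code: it assumes the regex filter checks rules in file order, lets the first rule that matches anywhere in the URL decide, and rejects a URL that no rule matches. The names RULES, FIXED_ACCEPT and accepts are made up, and FIXED_ACCEPT is only a guess at what the last rule was meant to say.

```python
import re

# Rules copied from the regex-urlfilter.txt above as (accept?, pattern) pairs.
RULES = [
    (False, r"^(file|ftp|mailto):"),
    (False, r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
            r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
            r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    (False, r"[?*!@=]"),
    (False, r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    (True, r"^http://([a-z0-9]*\.).uclv.edu.cu/"),
]

# Hypothetical corrected accept rule: escaped dots, any number of subdomains.
FIXED_ACCEPT = (True, r"^http://([a-z0-9-]+\.)*uclv\.edu\.cu/")


def accepts(url, rules=RULES):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if re.search(pattern, url):
            return accept
    return False


gif_url = "http://calorm.qf.uclv.edu.cu/Images1/BigPracBar.gif"
query_url = "http://forum.uclv.edu.cu/faq.php?sid=371ada5505649fe6c0155ef3d7bc261e"
plain_url = "http://forum.uclv.edu.cu/"

print(accepts(gif_url))    # False: rejected by the image suffix rule
print(accepts(query_url))  # False: rejected by the [?*!@=] rule
print(accepts(plain_url))  # False: the stray "." in the accept rule never matches
print(accepts(plain_url, RULES[:-1] + [FIXED_ACCEPT]))  # True
```

Under these assumptions the rules would reject both URLs that show up in the Solr response, and the accept rule as written would reject even plain pages on the site. Since such URLs were indexed anyway, it seems likely that the file is not being applied at all. One thing worth checking: plugin.includes in the nutch-site.xml above lists urlfilter-validator but not urlfilter-regex, and as far as I know regex-urlfilter.txt is only read by the urlfilter-regex plugin. URLs already stored in the crawl db or in Solr from earlier runs are also not removed just by changing the filter.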



RE: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Luis Armando Roca Fumero <lr...@uclv.edu.cu>.
Oops, sorry Talat UYARER:
This is the link to the hadoop.log file: http://pastebin.com/F6qBQhSA

________________________________________
From: Talat UYARER [talat.uyarer@agmlab.com]
Sent: Friday, 18 October 2013 10:06 a.m.
To: user@nutch.apache.org
Subject: Re: Nutch 1.7 and Solr 4.4.0 Integrate

Mailing lists don't accept attachments. Can you share it on pastebin or similar?

On 18-10-2013 at 17:59, Luis Armando Roca Fumero wrote:
> Here is the hadoop.log file
> Thanks for your time,
> Luis Armando
> ________________________________________
> From: Talat UYARER [talat.uyarer@agmlab.com]
> Sent: Friday, 18 October 2013 09:51 a.m.
> To: user@nutch.apache.org
> Subject: Re: Nutch 1.7 and Solr 4.4.0 Integrate
>
> Hi Luis,
> Can you share your hadoop.log file? We need the verbose log output to
> understand the problem. But if I understand correctly, you don't have
> any problem with IndexerJob.
>
> Talat
>
> On 18-10-2013 at 17:36, Luis Armando Roca Fumero wrote:
>> Hello
>> I added the lines that Mourad suggested, but I am still getting the same errors:
>>
>> SOLRIndexWriter
>>           solr.server.url : URL of the SOLR instance (mandatory)
>>           solr.commit.size : buffer size when sending to SOLR (default 1000)
>>           solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>           solr.auth : use authentication (default false)
>>           solr.auth.username : use authentication (default false)
>>           solr.auth : username for authentication
>>           solr.auth.password : password for authentication
>>
>>
>> Indexer: finished at 2013-10-18 14:39:23, elapsed: 00:00:04
>> SolrDeleteDuplicates: starting at 2013-10-18 14:39:23
>> SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
>> Exception in thread "main" java.io.IOException: Job failed!
>>           at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>           at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>>           at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>>           at org.apache.nutch.crawl.Crawl.run(Crawl.java:160)
>>           at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>           at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>
>> Any other idea???
>> thanks for your time,
>> Luis Armando
>>
>> ________________________________________
>> De: Mouradk [mouradk78@gmail.com]
>> Enviado el: viernes, 18 de octubre de 2013 09:08 a.m.
>> Para: user@nutch.apache.org
>> Asunto: Re: Nutch 1.7 and Solr 4.4.0 Integrate
>>
>> Hi Luis,
>>
>> Under you nutch-site.xml configuration file you need to add the SOLR indexer plugin:
>>
>> <property>
>>     <name>plugin.includes</name>
>>     <value>protocol-http|parse-(html|tika)|index-(basic|anchor)|indexer-solr</value>
>> </property>
>>
>> Hope this help,
>>
>> Mourad
>>
>>
>> On 18 Oct 2013, at 15:05, Luis Armando Roca Fumero <lr...@uclv.edu.cu> wrote:
>>
>>> Hello friends:
>>> I had configurated nutch 1.7 and solr 4.4.0 to work together, by Nutch Tutorial paper
>>> When I run the command: ./bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5 > test.txt
>>> All works good, but finally when Indexer is starting I get errors like this:
>>>
>>> Indexer: starting at 2013-10-18 13:57:32
>>> Indexer: deleting gone documents: false
>>> Indexer: URL filtering: false
>>> Indexer: URL normalizing: false
>>> Active IndexWriters :
>>> SOLRIndexWriter
>>>          solr.server.url : URL of the SOLR instance (mandatory)
>>>          solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>          solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>>          solr.auth : use authentication (default false)
>>>          solr.auth.username : use authentication (default false)
>>>          solr.auth : username for authentication
>>>          solr.auth.password : password for authentication
>>>
>>>
>>>
>>> What Can I do, what is wrong?? I have not idea, I had tried with Nutch 2.2.1 and doesn't work with solr 4.4.0 either. I need a tutorial to integrate nutch with solr, like baby steps :)
>>> Thanks in advance
>>>
>>> La Universidad Central "Marta Abreu" de Las Villas en su 60 Aniversario. Fundada el 30 de noviembre de 1952. Visítenos en:  http://www.uclv.edu.cu
>>> Participe en Universidad 2014, del 10 al 14 de febrero de 2014. Habana. Cuba. http://www.congresouniversidad.cu/
>>>
>>>
>> La Universidad Central "Marta Abreu" de Las Villas en su 60 Aniversario. Fundada el 30 de noviembre de 1952. Visítenos en:  http://www.uclv.edu.cu
>> Participe en Universidad 2014, del 10 al 14 de febrero de 2014. Habana. Cuba. http://www.congresouniversidad.cu/
>>
>>
>>
>> La Universidad Central "Marta Abreu" de Las Villas en su 60 Aniversario. Fundada el 30 de noviembre de 1952. Visítenos en:  http://www.uclv.edu.cu
>> Participe en Universidad 2014, del 10 al 14 de febrero de 2014. Habana. Cuba. http://www.congresouniversidad.cu/
>>
>>
>
> La Universidad Central "Marta Abreu" de Las Villas en su 60 Aniversario. Fundada el 30 de noviembre de 1952. Visítenos en:  http://www.uclv.edu.cu
> Participe en Universidad 2014, del 10 al 14 de febrero de 2014. Habana. Cuba. http://www.congresouniversidad.cu/
>
>
>
> La Universidad Central "Marta Abreu" de Las Villas en su 60 Aniversario. Fundada el 30 de noviembre de 1952. Visítenos en:  http://www.uclv.edu.cu
> Participe en Universidad 2014, del 10 al 14 de febrero de 2014. Habana. Cuba. http://www.congresouniversidad.cu/
>
>


La Universidad Central "Marta Abreu" de Las Villas en su 60 Aniversario. Fundada el 30 de noviembre de 1952. Visítenos en:  http://www.uclv.edu.cu
Participe en Universidad 2014, del 10 al 14 de febrero de 2014. Habana. Cuba. http://www.congresouniversidad.cu/



La Universidad Central "Marta Abreu" de Las Villas en su 60 Aniversario. Fundada el 30 de noviembre de 1952. Visítenos en:  http://www.uclv.edu.cu
Participe en Universidad 2014, del 10 al 14 de febrero de 2014. Habana. Cuba. http://www.congresouniversidad.cu/



Re: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Talat UYARER <ta...@agmlab.com>.
Mailing lists don't accept attachments. Can you share the file on pastebin or a similar service?



RE: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Luis Armando Roca Fumero <lr...@uclv.edu.cu>.
Here is the hadoop.log file
Thanks for your time,
Luis Armando



Re: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Talat UYARER <ta...@agmlab.com>.
Hi Luis,
Can you share your hadoop.log file? We need the verbose log output to
understand the problem. But if I understand correctly, you don't have
any problem with the IndexerJob.

Talat
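
When the console only reports "Job failed!", the details typically end up in logs/hadoop.log. A minimal Python sketch for pulling the WARN/ERROR entries (with their stack-trace lines) out of such a log follows. It assumes a log4j line layout of the form "date time LEVEL logger - message"; the lines in `SAMPLE` are invented for illustration and are not taken from the log shared in this thread.

```python
import re

# Assumed layout: "2013-10-18 14:39:23,512 ERROR some.Logger - message"
LEVEL_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+)\s")


def problem_entries(lines, levels=("WARN", "ERROR", "FATAL")):
    """Return log entries at the given levels.

    Lines without a timestamp (e.g. stack-trace frames) are attached to the
    preceding entry if that entry was selected.
    """
    entries, current = [], None
    for raw in lines:
        line = raw.rstrip("\n")
        match = LEVEL_RE.match(line)
        if match:
            # A new timestamped entry starts here.
            current = [line] if match.group(1) in levels else None
            if current is not None:
                entries.append(current)
        elif current is not None and line.strip():
            current.append(line)
    return ["\n".join(entry) for entry in entries]


# Hypothetical sample lines, for illustration only.
SAMPLE = """\
2013-10-18 14:39:19,100 INFO  indexer.IndexingJob - Indexer: starting
2013-10-18 14:39:24,200 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: hypothetical failure
\tat example.Hypothetical.run(Hypothetical.java:1)
2013-10-18 14:39:24,300 INFO  example.Logger - done
""".splitlines()

if __name__ == "__main__":
    for entry in problem_entries(SAMPLE):
        print(entry)
        print("---")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `problem_entries(open("logs/hadoop.log"))`, adjusting the path to wherever the Nutch runtime writes its log.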



RE: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Luis Armando Roca Fumero <lr...@uclv.edu.cu>.
Hello
I added the lines that Mouradk suggested, but I am still getting the same errors:

SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication


Indexer: finished at 2013-10-18 14:39:23, elapsed: 00:00:04
SolrDeleteDuplicates: starting at 2013-10-18 14:39:23
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:160)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Any other ideas?
Thanks for your time,
Luis Armando




Re: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Mouradk <mo...@gmail.com>.
Hi Luis,

In your nutch-site.xml configuration file you need to add the Solr indexer plugin:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(html|tika)|index-(basic|anchor)|indexer-solr</value>
</property>

Hope this helps,

Mourad
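
The `plugin.includes` value is a regular expression over plugin ids, so a quick check of which plugins a given value enables can catch omissions. The Python sketch below assumes a plugin is enabled only when its whole id matches the expression (hence `re.fullmatch`); the list of candidate ids is illustrative and not a complete inventory of the Nutch plugins directory.

```python
import re

# Value suggested in the message above.
PLUGIN_INCLUDES = (
    r"protocol-http|parse-(html|tika)|index-(basic|anchor)|indexer-solr"
)

# Illustrative plugin ids to test against the expression (assumed sample).
CANDIDATE_IDS = [
    "protocol-http", "parse-html", "parse-tika",
    "index-basic", "index-anchor", "indexer-solr",
    "urlfilter-regex", "scoring-opic", "urlnormalizer-basic",
]


def enabled(ids, includes):
    """Return the ids whose full name matches the includes expression."""
    pattern = re.compile(includes)
    return [plugin_id for plugin_id in ids if pattern.fullmatch(plugin_id)]


if __name__ == "__main__":
    on = enabled(CANDIDATE_IDS, PLUGIN_INCLUDES)
    print("enabled: ", on)
    print("disabled:", [i for i in CANDIDATE_IDS if i not in on])
```

With the value above, ids such as `urlfilter-regex` come out as disabled, so if the crawl relies on regex-urlfilter.txt it may be worth keeping that id in the expression.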




RE: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Luis Armando Roca Fumero <lr...@uclv.edu.cu>.
Thanks a lot to Mouradk and Julien.

Sorry, the errors I was talking about are:


Indexer: starting at 2013-10-18 13:57:32
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication
----------------------------------------------------------------------------------------------------------------------------
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:160)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
---------------------------------------------------------------------------------------------------------------------------


I will try adding the lines that Mouradk suggested.
Thanks once again.






Re: Nutch 1.7 and Solr 4.4.0 Integrate

Posted by Julien Nioche <li...@gmail.com>.
These are not errors but log messages indicating which tasks have run, along
with some additional information.
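
To separate the informational indexer banner from a real failure in captured console output (for example the test.txt written by the redirect in the original command), something like the following Python sketch can help. It treats a line starting with "Exception in thread" and the "at ..." frames directly after it as failure output, and everything else as informational; that heuristic is an assumption, not something Nutch defines. The sample lines are taken from the output posted in this thread.

```python
def split_output(lines):
    """Split console output into (informational lines, failure lines)."""
    info, failure = [], []
    in_trace = False
    for raw in lines:
        stripped = raw.strip()
        starts_trace = stripped.startswith("Exception in thread")
        is_frame = in_trace and stripped.startswith("at ")
        if starts_trace or is_frame:
            failure.append(stripped)
            in_trace = True
        else:
            in_trace = False
            if stripped:
                info.append(stripped)
    return info, failure


# Lines copied from the console output quoted in this thread.
SAMPLE = [
    "Indexer: starting at 2013-10-18 13:57:32",
    "Active IndexWriters :",
    "SOLRIndexWriter",
    "        solr.server.url : URL of the SOLR instance (mandatory)",
    'Exception in thread "main" java.io.IOException: Job failed!',
    "        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)",
]

if __name__ == "__main__":
    info, failure = split_output(SAMPLE)
    print("informational lines:", len(info))
    print("failure lines:")
    for line in failure:
        print("  " + line)
```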




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble