Posted to user@nutch.apache.org by Yash Thenuan Thenuan <ri...@iiita.ac.in> on 2018/03/01 05:38:31 UTC

Re: Regarding Indexing to elasticsearch

Hi Sebastian, all of this is coming, but the problem is that the content is
not sent. Nothing is indexed to ES.
This is the output at debug level.

ElasticIndexWriter

elastic.cluster : elastic prefix cluster

elastic.host : hostname

elastic.port : port  (default 9200)

elastic.index : elastic index command

elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)

elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)


no modules loaded

loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]

loaded plugin [org.elasticsearch.join.ParentJoinPlugin]

loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]

loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]

loaded plugin [org.elasticsearch.transport.Netty4Plugin]

created thread pool: name [force_merge], size [1], queue size [unbounded]

created thread pool: name [fetch_shard_started], core [1], max [8], keep
alive [5m]

created thread pool: name [listener], size [2], queue size [unbounded]

created thread pool: name [index], size [4], queue size [200]

created thread pool: name [refresh], core [1], max [2], keep alive [5m]

created thread pool: name [generic], core [4], max [128], keep alive [30s]

created thread pool: name [warmer], core [1], max [2], keep alive [5m]

thread pool [search] will adjust queue by [50] when determining automatic
queue size

created thread pool: name [search], size [7], queue size [1k]

created thread pool: name [flush], core [1], max [2], keep alive [5m]

created thread pool: name [fetch_shard_store], core [1], max [8], keep
alive [5m]

created thread pool: name [management], core [1], max [5], keep alive [5m]

created thread pool: name [get], size [4], queue size [1k]

created thread pool: name [bulk], size [4], queue size [200]

created thread pool: name [snapshot], core [1], max [2], keep alive [5m]

node_sampler_interval[5s]

adding address [{#transport#-1}{nNtPR9OJShWSW-ayXRDILA}{localhost}{
127.0.0.1:9300}]

connected to node
[{tzfqJn0}{tzfqJn0sS5OPV4lKreU60w}{QCGd9doAQaGw4Q_lOqniLQ}{127.0.0.1}{
127.0.0.1:9300}]

IndexingJob: done
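
For reference, these elastic.* options are set in conf/nutch-site.xml, next to
the plugin.includes entry that enables the writer. A minimal sketch; every
value is a placeholder to adjust, and the plugin.includes pattern is only
illustrative (keep whatever plugins you already use, with indexer-elastic
among them):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata|more)|indexer-elastic</value>
  </property>
  <property>
    <name>elastic.cluster</name>
    <!-- must match the cluster.name of the target Elasticsearch cluster -->
    <value>elasticsearch</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>elastic.port</name>
    <!-- transport port; the log above connects to 127.0.0.1:9300 -->
    <value>9300</value>
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>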


On Wed, Feb 28, 2018 at 10:05 PM, Sebastian Nagel <
wastl.nagel@googlemail.com> wrote:

> I never tried ES with Nutch 2.3, but the setup should be similar to
> 1.x:
>
> - enable the plugin "indexer-elastic" in plugin.includes
>   (upgraded and renamed to "indexer-elastic2" in 2.4)
>
> - expects ES 1.4.1
>
> - available/required options are found in the log file (hadoop.log):
>    ElasticIndexWriter
>         elastic.cluster : elastic prefix cluster
>         elastic.host : hostname
>         elastic.port : port  (default 9300)
>         elastic.index : elastic index command
>         elastic.max.bulk.docs : elastic bulk index doc counts. (default
> 250)
>         elastic.max.bulk.size : elastic bulk index length. (default
> 2500500 ~2.5MB)
>
> Sebastian
>
> On 02/28/2018 01:26 PM, Yash Thenuan Thenuan wrote:
> > Yeah
> > I was also thinking that
> > Can somebody help me with nutch 2.3?
> >
> > On 28 Feb 2018 17:53, "Yossi Tamari" <yo...@pipl.com> wrote:
> >
> >> Sorry, I just realized that you're using Nutch 2.x and I'm answering for
> >> Nutch 1.x. I'm afraid I can't help you.
> >>
> >>> -----Original Message-----
> >>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
> >>> Sent: 28 February 2018 14:20
> >>> To: user@nutch.apache.org
> >>> Subject: RE: Regarding Indexing to elasticsearch
> >>>
> >>> IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]
> >>> This is the output of nutch index; I have already configured
> >>> nutch-site.xml.
> >>>
> >>> On 28 Feb 2018 17:41, "Yossi Tamari" <yo...@pipl.com> wrote:
> >>>
> >>>> I suggest you run "nutch index", take a look at the returned help
> >>>> message, and continue from there.
> >>>> Broadly, first of all you need to configure your elasticsearch
> >>>> environment in nutch-site.xml, and then you need to run nutch index
> >>>> with the location of your CrawlDB and either the segment you want to
> >>>> index or the directory that contains all the segments you want to
> >> index.
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
> >>>>> Sent: 28 February 2018 14:06
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: RE: Regarding Indexing to elasticsearch
> >>>>>
> >>>>> All I want is to index my parsed data to Elasticsearch.
> >>>>>
> >>>>>
> >>>>> On 28 Feb 2018 17:34, "Yossi Tamari" <yo...@pipl.com> wrote:
> >>>>>
> >>>>> Hi Yash,
> >>>>>
> >>>>> The nutch index command does not have a -all flag, so I'm not sure
> >>>>> what
> >>>> you're
> >>>>> trying to achieve here.
> >>>>>
> >>>>>         Yossi.
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
> >>>>>> Sent: 28 February 2018 13:55
> >>>>>> To: user@nutch.apache.org
> >>>>>> Subject: Regarding Indexing to elasticsearch
> >>>>>>
> >>>>>> Can somebody please tell me what happens when we hit the
> >>>>>> bin/nutch index -all command, because I can't figure out why
> >>>>>> the write function inside the elastic indexer is not getting
> >>>>>> executed.
> >>>>
> >>>>
> >>
> >>
> >
>
>
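
For reference, the invocations discussed in the thread above look roughly like
this; the crawl id, paths, and segment name are hypothetical:

  # Nutch 2.x, as used in this thread: index all batches from the storage backend
  bin/nutch index -all -crawlId webcrawl

  # Nutch 1.x, per Yossi's description: CrawlDB plus a segment (or -dir <segments>)
  bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/20180228120000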

Re: Regarding Indexing to elasticsearch

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

> Map input records=79
> Map output records=0

... and no IndexerJob:DocumentCount counter

The map function got 79 records as input
but did not write anything to the indexer.
There are several reasons why a document may be skipped,
e.g., nothing parsed, missing markers, or errors in indexing filters.

Have a look at the map method:

https://github.com/apache/nutch/blob/branch-2.3.1/src/java/org/apache/nutch/indexer/IndexingJob.java#L95

and start debugging it. Alternatively, check your table and
the log files of the previous steps. There must be a reason
why nothing is indexed.
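
For orientation, the map method follows roughly this shape in 2.x. This is a
simplified paraphrase, not verbatim source; see the linked IndexingJob.java
for the exact logic:

  // Paraphrase of IndexerMapper.map() in branch-2.3.1. Each early return
  // is one way a record counts as "Map input" without ever becoming a
  // "Map output" record.
  ParseStatus pstatus = page.getParseStatus();
  if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
      || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
    return;  // skipped: the page was never successfully parsed
  }
  Utf8 mark = Mark.UPDATEDB_MARK.checkMark(page);
  if (!batchId.equals(REINDEX) && !NutchJob.shouldProcess(mark, batchId)) {
    return;  // skipped: updatedb marker missing or from a different batch
  }
  NutchDocument doc = indexUtil.index(key, page);  // runs the indexing filters
  if (doc == null) {
    return;  // skipped: an indexing filter rejected the document
  }
  context.write(key, doc);  // only now does the index writer receive anything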

Best,
Sebastian




Re: Regarding Indexing to elasticsearch

Posted by Yash Thenuan Thenuan <ri...@iiita.ac.in>.
I got this after setting log4j.logger.org.apache.hadoop to INFO:

2018-03-02 17:29:40,157 INFO  indexer.IndexingJob - IndexingJob: starting
2018-03-02 17:29:40,775 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2018-03-02 17:29:40,853 INFO  Configuration.deprecation -
mapred.output.key.comparator.class is deprecated. Instead, use
mapreduce.job.output.key.comparator.class
2018-03-02 17:29:41,073 INFO  basic.BasicIndexingFilter - Maximum title
length for indexing set to: -1
2018-03-02 17:29:41,073 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 17:29:41,076 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2018-03-02 17:29:41,076 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 17:29:41,094 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 17:29:41,465 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 17:29:42,585 INFO  Configuration.deprecation - session.id is
deprecated. Instead, use dfs.metrics.session-id
2018-03-02 17:29:42,587 INFO  jvm.JvmMetrics - Initializing JVM Metrics
with processName=JobTracker, sessionId=
2018-03-02 17:29:43,277 INFO  mapreduce.JobSubmitter - number of splits:1
2018-03-02 17:29:43,501 INFO  mapreduce.JobSubmitter - Submitting tokens
for job: job_local1792747860_0001
2018-03-02 17:29:43,566 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2018-03-02 17:29:43,570 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts;  Ignoring.
2018-03-02 17:29:43,726 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2018-03-02 17:29:43,731 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts;  Ignoring.
2018-03-02 17:29:43,755 INFO  mapreduce.Job - The url to track the job:
http://localhost:8080/
2018-03-02 17:29:43,757 INFO  mapreduce.Job - Running job:
job_local1792747860_0001
2018-03-02 17:29:43,757 INFO  mapred.LocalJobRunner - OutputCommitter set
in config null
2018-03-02 17:29:43,767 INFO  mapred.LocalJobRunner - OutputCommitter is
org.apache.nutch.indexer.IndexerOutputFormat$2
2018-03-02 17:29:43,838 INFO  mapred.LocalJobRunner - Waiting for map tasks
2018-03-02 17:29:43,841 INFO  mapred.LocalJobRunner - Starting task:
attempt_local1792747860_0001_m_000000_0
2018-03-02 17:29:43,899 INFO  util.ProcfsBasedProcessTree -
ProcfsBasedProcessTree currently is supported only on Linux.
2018-03-02 17:29:43,899 INFO  mapred.Task -  Using
ResourceCalculatorProcessTree : null
2018-03-02 17:29:43,923 INFO  mapred.MapTask - Processing split:
org.apache.gora.mapreduce.GoraInputSplit@424b7f03
2018-03-02 17:29:44,051 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 17:29:44,767 INFO  mapreduce.Job - Job job_local1792747860_0001
running in uber mode : false
2018-03-02 17:29:44,769 INFO  mapreduce.Job -  map 0% reduce 0%
2018-03-02 17:29:50,926 INFO  basic.BasicIndexingFilter - Maximum title
length for indexing set to: -1
2018-03-02 17:29:50,926 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 17:29:50,926 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2018-03-02 17:29:50,926 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 17:29:50,926 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 17:29:50,927 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 17:29:51,153 INFO  mapred.LocalJobRunner -
2018-03-02 17:29:52,782 INFO  mapred.Task -
Task:attempt_local1792747860_0001_m_000000_0 is done. And is in the process
of committing
2018-03-02 17:29:52,825 INFO  mapred.LocalJobRunner - map
2018-03-02 17:29:52,825 INFO  mapred.Task - Task
'attempt_local1792747860_0001_m_000000_0' done.
2018-03-02 17:29:52,825 INFO  mapred.LocalJobRunner - Finishing task:
attempt_local1792747860_0001_m_000000_0
2018-03-02 17:29:52,825 INFO  mapred.LocalJobRunner - map task executor
complete.
2018-03-02 17:29:53,791 INFO  mapreduce.Job -  map 100% reduce 0%
2018-03-02 17:29:53,791 INFO  mapreduce.Job - Job job_local1792747860_0001
completed successfully
2018-03-02 17:29:53,849 INFO  mapreduce.Job - Counters: 15
File System Counters
FILE: Number of bytes read=610359
FILE: Number of bytes written=891634
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=79
Map output records=0
Input split bytes=995
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=103
Total committed heap usage (bytes)=225443840
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2018-03-02 17:29:53,866 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 17:29:53,866 INFO  indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port  (default 9200)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)


2018-03-02 17:29:53,925 INFO  indexer.IndexingJob - IndexingJob: done.



Re: Regarding Indexing to elasticsearch

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

It looks more like there is nothing to index.

Unfortunately, in 2.x no log messages are enabled
by default that indicate how many documents
are sent to the index back-ends.

The easiest way is to enable job counters in
conf/log4j.properties by adding the line:

 log4j.logger.org.apache.hadoop.mapreduce.Job=INFO

or by raising the level of the existing line

 log4j.logger.org.apache.hadoop=WARN

from WARN to INFO.

Make sure log4j.properties is correctly deployed
(if in doubt, run "ant runtime"). Then check the hadoop.log
again: there should be a DocumentCount counter with a
non-zero value.
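
Concretely, the resulting lines in conf/log4j.properties would be either:

 # add a logger for the MapReduce job counters ...
 log4j.logger.org.apache.hadoop.mapreduce.Job=INFO

 # ... or raise the existing org.apache.hadoop logger from WARN to INFO
 log4j.logger.org.apache.hadoop=INFO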

Best,
Sebastian




Re: Regarding Indexing to elasticsearch

Posted by Yash Thenuan Thenuan <ri...@iiita.ac.in>.
Following are the logs from hadoop.log

2018-03-02 11:18:45,220 INFO  indexer.IndexingJob - IndexingJob: starting
2018-03-02 11:18:45,791 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2018-03-02 11:18:46,138 INFO  basic.BasicIndexingFilter - Maximum title
length for indexing set to: -1
2018-03-02 11:18:46,138 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 11:18:46,140 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2018-03-02 11:18:46,140 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 11:18:46,157 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 11:18:46,535 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 11:18:48,663 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2018-03-02 11:18:48,666 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts;  Ignoring.
2018-03-02 11:18:48,792 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2018-03-02 11:18:48,798 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts;  Ignoring.
2018-03-02 11:18:49,093 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 11:18:54,737 INFO  basic.BasicIndexingFilter - Maximum title
length for indexing set to: -1
2018-03-02 11:18:54,737 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 11:18:54,737 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2018-03-02 11:18:54,737 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 11:18:54,737 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 11:18:54,738 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 11:18:56,883 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 11:18:56,884 INFO  indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port  (default 9200)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)


2018-03-02 11:18:56,939 INFO  indexer.IndexingJob - IndexingJob: done.


On Thu, Mar 1, 2018 at 10:11 PM, Sebastian Nagel <wastl.nagel@googlemail.com
> wrote:

> It's impossible to find the reason from console output.
> Please check the hadoop.log, it should contain more logs
> including those from ElasticIndexWriter.
>
> Sebastian
>
> On 03/01/2018 06:38 AM, Yash Thenuan Thenuan wrote:
> > Hi Sebastian All of this is coming but the problem is,The content is not
> > sent sent.Nothing is indexed to es.
> > This is the output on debug level.
> >
> > ElasticIndexWriter
> >
> > elastic.cluster : elastic prefix cluster
> >
> > elastic.host : hostname
> >
> > elastic.port : port  (default 9200)
> >
> > elastic.index : elastic index command
> >
> > elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
> >
> > elastic.max.bulk.size : elastic bulk index length. (default 2500500
> ~2.5MB)
> >
> >
> > no modules loaded
> >
> > loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
> >
> > loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
> >
> > loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
> >
> > loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
> >
> > loaded plugin [org.elasticsearch.transport.Netty4Plugin]
> >
> > created thread pool: name [force_merge], size [1], queue size [unbounded]
> >
> > created thread pool: name [fetch_shard_started], core [1], max [8], keep
> > alive [5m]
> >
> > created thread pool: name [listener], size [2], queue size [unbounded]
> >
> > created thread pool: name [index], size [4], queue size [200]
> >
> > created thread pool: name [refresh], core [1], max [2], keep alive [5m]
> >
> > created thread pool: name [generic], core [4], max [128], keep alive
> [30s]
> >
> > created thread pool: name [warmer], core [1], max [2], keep alive [5m]
> >
> > thread pool [search] will adjust queue by [50] when determining automatic
> > queue size
> >
> > created thread pool: name [search], size [7], queue size [1k]
> >
> > created thread pool: name [flush], core [1], max [2], keep alive [5m]
> >
> > created thread pool: name [fetch_shard_store], core [1], max [8], keep
> > alive [5m]
> >
> > created thread pool: name [management], core [1], max [5], keep alive
> [5m]
> >
> > created thread pool: name [get], size [4], queue size [1k]
> >
> > created thread pool: name [bulk], size [4], queue size [200]
> >
> > created thread pool: name [snapshot], core [1], max [2], keep alive [5m]
> >
> > node_sampler_interval[5s]
> >
> > adding address [{#transport#-1}{nNtPR9OJShWSW-ayXRDILA}{localhost}{
> > 127.0.0.1:9300}]
> >
> > connected to node
> > [{tzfqJn0}{tzfqJn0sS5OPV4lKreU60w}{QCGd9doAQaGw4Q_lOqniLQ}{127.0.0.1}{
> > 127.0.0.1:9300}]
> >
> > IndexingJob: done
> >
> >
> > On Wed, Feb 28, 2018 at 10:05 PM, Sebastian Nagel <
> > wastl.nagel@googlemail.com> wrote:
> >
> >> I never tried ES with Nutch 2.3, but the setup should be similar to
> >> that for 1.x:
> >>
> >> - enable the plugin "indexer-elastic" in plugin.includes
> >>   (upgrade and rename to "indexer-elastic2" in 2.4)
> >>
> >> - expects ES 1.4.1
> >>
> >> - available/required options are found in the log file (hadoop.log):
> >>    ElasticIndexWriter
> >>         elastic.cluster : elastic prefix cluster
> >>         elastic.host : hostname
> >>         elastic.port : port  (default 9300)
> >>         elastic.index : elastic index command
> >>         elastic.max.bulk.docs : elastic bulk index doc counts. (default
> >> 250)
> >>         elastic.max.bulk.size : elastic bulk index length. (default
> >> 2500500 ~2.5MB)
> >>
> >> Sebastian
> >>
> >> On 02/28/2018 01:26 PM, Yash Thenuan Thenuan wrote:
> >>> Yeah, I was also thinking that.
> >>> Can somebody help me with Nutch 2.3?
> >>>
> >>> On 28 Feb 2018 17:53, "Yossi Tamari" <yo...@pipl.com> wrote:
> >>>
> >>>> Sorry, I just realized that you're using Nutch 2.x and I'm answering
> >>>> for Nutch 1.x. I'm afraid I can't help you.
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
> >>>>> Sent: 28 February 2018 14:20
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: RE: Regarding Indexing to elasticsearch
> >>>>>
> >>>>> IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]
> >>>>> This is the output of nutch index. I have already configured
> >>>>> nutch-site.xml.
> >>>>>
> >>>>> On 28 Feb 2018 17:41, "Yossi Tamari" <yo...@pipl.com> wrote:
> >>>>>
> >>>>>> I suggest you run "nutch index", take a look at the returned help
> >>>>>> message, and continue from there.
> >>>>>> Broadly, first of all you need to configure your elasticsearch
> >>>>>> environment in nutch-site.xml, and then you need to run nutch index
> >>>>>> with the location of your CrawlDB and either the segment you want to
> >>>>>> index or the directory that contains all the segments you want to
> >>>>>> index.
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
> >>>>>>> Sent: 28 February 2018 14:06
> >>>>>>> To: user@nutch.apache.org
> >>>>>>> Subject: RE: Regarding Indexing to elasticsearch
> >>>>>>>
> >>>>>>> All I want is to index my parsed data to Elasticsearch.
> >>>>>>>
> >>>>>>>
> >>>>>>> On 28 Feb 2018 17:34, "Yossi Tamari" <yo...@pipl.com> wrote:
> >>>>>>>
> >>>>>>> Hi Yash,
> >>>>>>>
> >>>>>>> The nutch index command does not have a -all flag, so I'm not sure
> >>>>>>> what you're trying to achieve here.
> >>>>>>>
> >>>>>>>         Yossi.
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
> >>>>>>>> Sent: 28 February 2018 13:55
> >>>>>>>> To: user@nutch.apache.org
> >>>>>>>> Subject: Regarding Indexing to elasticsearch
> >>>>>>>>
> >>>>>>>> Can somebody please tell me what happens when we hit the
> >>>>>>>> bin/nutch index -all command?
> >>>>>>>> Because I can't figure out why the write function inside the
> >>>>>>>> elastic-indexer is not getting executed.
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>
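
Putting together the advice quoted above, the invocation differs between the two branches. A sketch with placeholder paths and crawl id (crawl/crawldb, crawl/segments, and mycrawl are assumptions, not values from this thread):

  # Nutch 1.x: pass the CrawlDB plus either individual segments
  # or -dir pointing at the directory that holds them
  bin/nutch index crawl/crawldb -dir crawl/segments

  # Nutch 2.x: the data lives in the storage backend, so pass a
  # batch id, -all, or -reindex, optionally scoped to a crawl id
  bin/nutch index -all -crawlId mycrawl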

Re: Regarding Indexing to elasticsearch

Posted by Sebastian Nagel <wa...@googlemail.com>.
It's impossible to find the reason from the console output.
Please check hadoop.log; it should contain more logs,
including those from ElasticIndexWriter.

Sebastian
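
In a default runtime layout the file sits under runtime/local/logs; one way to watch the index writer while the job runs (adjust the path if your layout differs):

  tail -f runtime/local/logs/hadoop.log | grep -i elastic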
