Posted to user@nutch.apache.org by Yash Thenuan Thenuan <ri...@iiita.ac.in> on 2018/03/01 05:38:31 UTC
Re: Regarding Indexing to elasticsearch
Hi Sebastian, all of this shows up, but the problem is that the content is
not sent. Nothing is indexed to ES.
This is the output on debug level.
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9200)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
no modules loaded
loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
loaded plugin [org.elasticsearch.transport.Netty4Plugin]
created thread pool: name [force_merge], size [1], queue size [unbounded]
created thread pool: name [fetch_shard_started], core [1], max [8], keep
alive [5m]
created thread pool: name [listener], size [2], queue size [unbounded]
created thread pool: name [index], size [4], queue size [200]
created thread pool: name [refresh], core [1], max [2], keep alive [5m]
created thread pool: name [generic], core [4], max [128], keep alive [30s]
created thread pool: name [warmer], core [1], max [2], keep alive [5m]
thread pool [search] will adjust queue by [50] when determining automatic
queue size
created thread pool: name [search], size [7], queue size [1k]
created thread pool: name [flush], core [1], max [2], keep alive [5m]
created thread pool: name [fetch_shard_store], core [1], max [8], keep
alive [5m]
created thread pool: name [management], core [1], max [5], keep alive [5m]
created thread pool: name [get], size [4], queue size [1k]
created thread pool: name [bulk], size [4], queue size [200]
created thread pool: name [snapshot], core [1], max [2], keep alive [5m]
node_sampler_interval[5s]
adding address [{#transport#-1}{nNtPR9OJShWSW-ayXRDILA}{localhost}{
127.0.0.1:9300}]
connected to node
[{tzfqJn0}{tzfqJn0sS5OPV4lKreU60w}{QCGd9doAQaGw4Q_lOqniLQ}{127.0.0.1}{
127.0.0.1:9300}]
IndexingJob: done
On Wed, Feb 28, 2018 at 10:05 PM, Sebastian Nagel <
wastl.nagel@googlemail.com> wrote:
> I never tried ES with Nutch 2.3 but it should be similar to setup as for
> 1.x:
>
> - enable the plugin "indexer-elastic" in plugin.includes
> (upgrade and rename to "indexer-elastic2" in 2.4)
>
> - expects ES 1.4.1
>
> - available/required options are found in the log file (hadoop.log):
> ElasticIndexWriter
> elastic.cluster : elastic prefix cluster
> elastic.host : hostname
> elastic.port : port (default 9300)
> elastic.index : elastic index command
> elastic.max.bulk.docs : elastic bulk index doc counts. (default
> 250)
> elastic.max.bulk.size : elastic bulk index length. (default
> 2500500 ~2.5MB)
>
> Sebastian
>
> On 02/28/2018 01:26 PM, Yash Thenuan Thenuan wrote:
> > Yeah
> > I was also thinking that
> > Can somebody help me with nutch 2.3?
> >
> > On 28 Feb 2018 17:53, "Yossi Tamari" <yo...@pipl.com> wrote:
> >
> >> Sorry, I just realized that you're using Nutch 2.x and I'm answering for
> >> Nutch 1.x. I'm afraid I can't help you.
> >>
> >>> -----Original Message-----
> >>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
> >>> Sent: 28 February 2018 14:20
> >>> To: user@nutch.apache.org
> >>> Subject: RE: Regarding Indexing to elasticsearch
> >>>
> >>> IndexingJob (<batchId> | -all |-reindex) [-crawlId <id>]
> >>> This is the output of nutch index. I have already configured
> >>> nutch-site.xml.
> >>>
> >>> On 28 Feb 2018 17:41, "Yossi Tamari" <yo...@pipl.com> wrote:
> >>>
> >>>> I suggest you run "nutch index", take a look at the returned help
> >>>> message, and continue from there.
> >>>> Broadly, first of all you need to configure your elasticsearch
> >>>> environment in nutch-site.xml, and then you need to run nutch index
> >>>> with the location of your CrawlDB and either the segment you want to
> >>>> index or the directory that contains all the segments you want to
> >> index.
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
> >>>>> Sent: 28 February 2018 14:06
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: RE: Regarding Indexing to elasticsearch
> >>>>>
> >>>>> All I want is to index my parsed data to elasticsearch.
> >>>>>
> >>>>>
> >>>>> On 28 Feb 2018 17:34, "Yossi Tamari" <yo...@pipl.com> wrote:
> >>>>>
> >>>>> Hi Yash,
> >>>>>
> >>>>> The nutch index command does not have a -all flag, so I'm not sure
> >>>>> what you're trying to achieve here.
> >>>>>
> >>>>> Yossi.
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
> >>>>>> Sent: 28 February 2018 13:55
> >>>>>> To: user@nutch.apache.org
> >>>>>> Subject: Regarding Indexing to elasticsearch
> >>>>>>
> >>>>>> Can somebody please tell me what happens when we run the
> >>>>>> bin/nutch index -all command?
> >>>>>> I can't figure out why the write function inside the
> >>>>>> elastic-indexer is not getting executed.
> >>>>
> >>>>
> >>
> >>
> >
>
>
Re: Regarding Indexing to elasticsearch
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
> Map input records=79
> Map output records=0
... and no IndexerJob:DocumentCount counter
The map function got 79 records as input,
but did not write anything to the indexer.
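To read these numbers without scrolling through the log by hand, a small script can pull the counter lines out of hadoop.log. This is only a sketch: the function names are made up for illustration, and it assumes the counters are printed one per line as `name=value`, as in the log quoted below.

```python
import re

# A counter line looks like "Map input records=79"; leading "> " quote
# markers from mail replies are tolerated. Timestamped log lines start
# with a digit and therefore do not match.
COUNTER_RE = re.compile(r"^\s*(?:>\s*)*([A-Za-z][^=]*)=(\d+)\s*$")


def parse_counters(log_text):
    """Return a dict of counter name -> integer value (hypothetical helper)."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.match(line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


def summarize(counters):
    """One-line verdict: how many records were read vs. sent to the writer."""
    seen = counters.get("Map input records", 0)
    sent = counters.get("DocumentCount", 0)  # absent when nothing was indexed
    return f"{seen} records read, {sent} documents sent to the index writer"


if __name__ == "__main__":
    with open("logs/hadoop.log") as handle:  # path is an assumption
        print(summarize(parse_counters(handle.read())))
```

If DocumentCount is missing or zero while "Map input records" is positive, the index writer was never called, and the problem lies before Elasticsearch.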
There are a couple of reasons why a document is skipped,
e.g., nothing parsed, missing markers, errors in indexing filters, ...
Have a look at the map method:
https://github.com/apache/nutch/blob/branch-2.3.1/src/java/org/apache/nutch/indexer/IndexingJob.java#L95
and start debugging it. Alternatively, check your table and
the log files of the previous steps. There must be a reason
why nothing is indexed.
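As a rough aid for that debugging, the skip conditions can be modelled in a few lines. This is a simplified model, not the Nutch code: the `Page` class and its field names are hypothetical, and the rule that a page still needs an updatedb mark when `-all` is used is an assumption about 2.3.1 that should be checked against the map method linked above.

```python
from dataclasses import dataclass
from typing import Optional

ALL = "-all"          # batch id argument meaning "every batch"
REINDEX = "-reindex"  # batch id argument that ignores marks


@dataclass
class Page:
    """Hypothetical stand-in for one row of the webpage table."""
    parsed_ok: bool                # parse status is success
    updatedb_mark: Optional[str]   # batch id written by updatedb, if any
    filters_ok: bool = True        # indexing filters returned a document


def skip_reason(page, batch_id):
    """Return why a page would be skipped, or None if it would be indexed."""
    if not page.parsed_ok:
        return "not parsed successfully"
    if batch_id != REINDEX:
        if page.updatedb_mark is None:
            return "no updatedb mark"
        if batch_id != ALL and page.updatedb_mark != batch_id:
            return "mark belongs to a different batch"
    if not page.filters_ok:
        return "indexing filter rejected or failed"
    return None


# Example: a parsed page that never went through updatedb.
print(skip_reason(Page(parsed_ok=True, updatedb_mark=None), ALL))
```

Under these assumptions, a crawl where updatedb was never run, or where parsing failed, would give exactly the picture above: 79 input records and no output.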
Best,
Sebastian
On 03/02/2018 01:03 PM, Yash Thenuan Thenuan wrote:
> I got this after setting log4j.logger.org.apache.hadoop to info
>
> 2018-03-02 17:29:40,157 INFO indexer.IndexingJob - IndexingJob: starting
> 2018-03-02 17:29:40,775 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2018-03-02 17:29:40,853 INFO Configuration.deprecation -
> mapred.output.key.comparator.class is deprecated. Instead, use
> mapreduce.job.output.key.comparator.class
> 2018-03-02 17:29:41,073 INFO basic.BasicIndexingFilter - Maximum title
> length for indexing set to: -1
> 2018-03-02 17:29:41,073 INFO indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2018-03-02 17:29:41,076 INFO anchor.AnchorIndexingFilter - Anchor
> deduplication is: off
> 2018-03-02 17:29:41,076 INFO indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2018-03-02 17:29:41,094 INFO indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.metadata.MetadataIndexer
> 2018-03-02 17:29:41,465 INFO indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.more.MoreIndexingFilter
> 2018-03-02 17:29:42,585 INFO Configuration.deprecation - session.id is
> deprecated. Instead, use dfs.metrics.session-id
> 2018-03-02 17:29:42,587 INFO jvm.JvmMetrics - Initializing JVM Metrics
> with processName=JobTracker, sessionId=
> 2018-03-02 17:29:43,277 INFO mapreduce.JobSubmitter - number of splits:1
> 2018-03-02 17:29:43,501 INFO mapreduce.JobSubmitter - Submitting tokens
> for job: job_local1792747860_0001
> 2018-03-02 17:29:43,566 WARN conf.Configuration -
> file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an
> attempt to override final parameter:
> mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2018-03-02 17:29:43,570 WARN conf.Configuration -
> file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an
> attempt to override final parameter:
> mapreduce.job.end-notification.max.attempts; Ignoring.
> 2018-03-02 17:29:43,726 WARN conf.Configuration -
> file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an
> attempt to override final parameter:
> mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2018-03-02 17:29:43,731 WARN conf.Configuration -
> file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an
> attempt to override final parameter:
> mapreduce.job.end-notification.max.attempts; Ignoring.
> 2018-03-02 17:29:43,755 INFO mapreduce.Job - The url to track the job:
> http://localhost:8080/
> 2018-03-02 17:29:43,757 INFO mapreduce.Job - Running job:
> job_local1792747860_0001
> 2018-03-02 17:29:43,757 INFO mapred.LocalJobRunner - OutputCommitter set
> in config null
> 2018-03-02 17:29:43,767 INFO mapred.LocalJobRunner - OutputCommitter is
> org.apache.nutch.indexer.IndexerOutputFormat$2
> 2018-03-02 17:29:43,838 INFO mapred.LocalJobRunner - Waiting for map tasks
> 2018-03-02 17:29:43,841 INFO mapred.LocalJobRunner - Starting task:
> attempt_local1792747860_0001_m_000000_0
> 2018-03-02 17:29:43,899 INFO util.ProcfsBasedProcessTree -
> ProcfsBasedProcessTree currently is supported only on Linux.
> 2018-03-02 17:29:43,899 INFO mapred.Task - Using
> ResourceCalculatorProcessTree : null
> 2018-03-02 17:29:43,923 INFO mapred.MapTask - Processing split:
> org.apache.gora.mapreduce.GoraInputSplit@424b7f03
> 2018-03-02 17:29:44,051 INFO indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2018-03-02 17:29:44,767 INFO mapreduce.Job - Job job_local1792747860_0001
> running in uber mode : false
> 2018-03-02 17:29:44,769 INFO mapreduce.Job - map 0% reduce 0%
> 2018-03-02 17:29:50,926 INFO basic.BasicIndexingFilter - Maximum title
> length for indexing set to: -1
> 2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2018-03-02 17:29:50,926 INFO anchor.AnchorIndexingFilter - Anchor
> deduplication is: off
> 2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.metadata.MetadataIndexer
> 2018-03-02 17:29:50,927 INFO indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.more.MoreIndexingFilter
> 2018-03-02 17:29:51,153 INFO mapred.LocalJobRunner -
> 2018-03-02 17:29:52,782 INFO mapred.Task -
> Task:attempt_local1792747860_0001_m_000000_0 is done. And is in the process
> of committing
> 2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map
> 2018-03-02 17:29:52,825 INFO mapred.Task - Task
> 'attempt_local1792747860_0001_m_000000_0' done.
> 2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - Finishing task:
> attempt_local1792747860_0001_m_000000_0
> 2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map task executor
> complete.
> 2018-03-02 17:29:53,791 INFO mapreduce.Job - map 100% reduce 0%
> 2018-03-02 17:29:53,791 INFO mapreduce.Job - Job job_local1792747860_0001
> completed successfully
> 2018-03-02 17:29:53,849 INFO mapreduce.Job - Counters: 15
> File System Counters
> FILE: Number of bytes read=610359
> FILE: Number of bytes written=891634
> FILE: Number of read operations=0
> FILE: Number of large read operations=0
> FILE: Number of write operations=0
> Map-Reduce Framework
> Map input records=79
> Map output records=0
> Input split bytes=995
> Spilled Records=0
> Failed Shuffles=0
> Merged Map outputs=0
> GC time elapsed (ms)=103
> Total committed heap usage (bytes)=225443840
> File Input Format Counters
> Bytes Read=0
> File Output Format Counters
> Bytes Written=0
> 2018-03-02 17:29:53,866 INFO indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2018-03-02 17:29:53,866 INFO indexer.IndexingJob - Active IndexWriters :
> ElasticIndexWriter
> elastic.cluster : elastic prefix cluster
> elastic.host : hostname
> elastic.port : port (default 9200)
> elastic.index : elastic index command
> elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
> elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>
>
> 2018-03-02 17:29:53,925 INFO indexer.IndexingJob - IndexingJob: done.
>
>
> On Fri, Mar 2, 2018 at 3:08 PM, Sebastian Nagel <wa...@googlemail.com>
> wrote:
>
>> Hi,
>>
>> It looks more like there is nothing to index.
>>
>> Unfortunately, in 2.x there are no log messages
>> on by default which indicate how many documents
>> are sent to the index back-ends.
>>
>> The easiest way is to enable Job counters in
>> conf/log4j.properties by adding the line:
>>
>> log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
>>
>> or by changing the level from WARN to INFO in the existing line
>>
>> log4j.logger.org.apache.hadoop=WARN
>>
>> Make sure the log4j.properties is correctly deployed
>> (in doubt, run "ant runtime"). Then check the hadoop.log
>> again: there should be a counter DocumentCount with non-zero
>> value.
>>
>> Best,
>> Sebastian
>>
>>
>> On 03/02/2018 06:50 AM, Yash Thenuan Thenuan wrote:
>>> Following are the logs from hadoop.log
>>>
>>> 2018-03-02 11:18:45,220 INFO indexer.IndexingJob - IndexingJob: starting
>>> 2018-03-02 11:18:45,791 WARN util.NativeCodeLoader - Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>> where
>>> applicable
>>> 2018-03-02 11:18:46,138 INFO basic.BasicIndexingFilter - Maximum title
>>> length for indexing set to: -1
>>> 2018-03-02 11:18:46,138 INFO indexer.IndexingFilters - Adding
>>> org.apache.nutch.indexer.basic.BasicIndexingFilter
>>> 2018-03-02 11:18:46,140 INFO anchor.AnchorIndexingFilter - Anchor
>>> deduplication is: off
>>> 2018-03-02 11:18:46,140 INFO indexer.IndexingFilters - Adding
>>> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
>>> 2018-03-02 11:18:46,157 INFO indexer.IndexingFilters - Adding
>>> org.apache.nutch.indexer.metadata.MetadataIndexer
>>> 2018-03-02 11:18:46,535 INFO indexer.IndexingFilters - Adding
>>> org.apache.nutch.indexer.more.MoreIndexingFilter
>>> 2018-03-02 11:18:48,663 WARN conf.Configuration -
>>> file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.
>> staging/job_local1100834069_0001/job.xml:an
>>> attempt to override final parameter:
>>> mapreduce.job.end-notification.max.retry.interval; Ignoring.
>>> 2018-03-02 11:18:48,666 WARN conf.Configuration -
>>> file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.
>> staging/job_local1100834069_0001/job.xml:an
>>> attempt to override final parameter:
>>> mapreduce.job.end-notification.max.attempts; Ignoring.
>>> 2018-03-02 11:18:48,792 WARN conf.Configuration -
>>> file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_
>> local1100834069_0001/job_local1100834069_0001.xml:an
>>> attempt to override final parameter:
>>> mapreduce.job.end-notification.max.retry.interval; Ignoring.
>>> 2018-03-02 11:18:48,798 WARN conf.Configuration -
>>> file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_
>> local1100834069_0001/job_local1100834069_0001.xml:an
>>> attempt to override final parameter:
>>> mapreduce.job.end-notification.max.attempts; Ignoring.
>>> 2018-03-02 11:18:49,093 INFO indexer.IndexWriters - Adding
>>> org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
>>> 2018-03-02 11:18:54,737 INFO basic.BasicIndexingFilter - Maximum title
>>> length for indexing set to: -1
>>> 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding
>>> org.apache.nutch.indexer.basic.BasicIndexingFilter
>>> 2018-03-02 11:18:54,737 INFO anchor.AnchorIndexingFilter - Anchor
>>> deduplication is: off
>>> 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding
>>> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
>>> 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding
>>> org.apache.nutch.indexer.metadata.MetadataIndexer
>>> 2018-03-02 11:18:54,738 INFO indexer.IndexingFilters - Adding
>>> org.apache.nutch.indexer.more.MoreIndexingFilter
>>> 2018-03-02 11:18:56,883 INFO indexer.IndexWriters - Adding
>>> org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
>>> 2018-03-02 11:18:56,884 INFO indexer.IndexingJob - Active IndexWriters :
>>> ElasticIndexWriter
>>> elastic.cluster : elastic prefix cluster
>>> elastic.host : hostname
>>> elastic.port : port (default 9200)
>>> elastic.index : elastic index command
>>> elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>>> elastic.max.bulk.size : elastic bulk index length. (default 2500500
>> ~2.5MB)
>>>
>>>
>>> 2018-03-02 11:18:56,939 INFO indexer.IndexingJob - IndexingJob: done.
>>>
>>>
>>> On Thu, Mar 1, 2018 at 10:11 PM, Sebastian Nagel <
>> wastl.nagel@googlemail.com
>>>> wrote:
>>>
>>>> It's impossible to find the reason from console output.
>>>> Please check the hadoop.log, it should contain more logs
>>>> including those from ElasticIndexWriter.
>>>>
>>>> Sebastian
>>>>
Re: Regarding Indexing to elasticsearch
Posted by Yash Thenuan Thenuan <ri...@iiita.ac.in>.
I got this after setting log4j.logger.org.apache.hadoop to INFO:
2018-03-02 17:29:40,157 INFO indexer.IndexingJob - IndexingJob: starting
2018-03-02 17:29:40,775 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2018-03-02 17:29:40,853 INFO Configuration.deprecation -
mapred.output.key.comparator.class is deprecated. Instead, use
mapreduce.job.output.key.comparator.class
2018-03-02 17:29:41,073 INFO basic.BasicIndexingFilter - Maximum title
length for indexing set to: -1
2018-03-02 17:29:41,073 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 17:29:41,076 INFO anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2018-03-02 17:29:41,076 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 17:29:41,094 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 17:29:41,465 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 17:29:42,585 INFO Configuration.deprecation - session.id is
deprecated. Instead, use dfs.metrics.session-id
2018-03-02 17:29:42,587 INFO jvm.JvmMetrics - Initializing JVM Metrics
with processName=JobTracker, sessionId=
2018-03-02 17:29:43,277 INFO mapreduce.JobSubmitter - number of splits:1
2018-03-02 17:29:43,501 INFO mapreduce.JobSubmitter - Submitting tokens
for job: job_local1792747860_0001
2018-03-02 17:29:43,566 WARN conf.Configuration -
file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval; Ignoring.
2018-03-02 17:29:43,570 WARN conf.Configuration -
file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts; Ignoring.
2018-03-02 17:29:43,726 WARN conf.Configuration -
file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval; Ignoring.
2018-03-02 17:29:43,731 WARN conf.Configuration -
file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts; Ignoring.
2018-03-02 17:29:43,755 INFO mapreduce.Job - The url to track the job:
http://localhost:8080/
2018-03-02 17:29:43,757 INFO mapreduce.Job - Running job:
job_local1792747860_0001
2018-03-02 17:29:43,757 INFO mapred.LocalJobRunner - OutputCommitter set
in config null
2018-03-02 17:29:43,767 INFO mapred.LocalJobRunner - OutputCommitter is
org.apache.nutch.indexer.IndexerOutputFormat$2
2018-03-02 17:29:43,838 INFO mapred.LocalJobRunner - Waiting for map tasks
2018-03-02 17:29:43,841 INFO mapred.LocalJobRunner - Starting task:
attempt_local1792747860_0001_m_000000_0
2018-03-02 17:29:43,899 INFO util.ProcfsBasedProcessTree -
ProcfsBasedProcessTree currently is supported only on Linux.
2018-03-02 17:29:43,899 INFO mapred.Task - Using
ResourceCalculatorProcessTree : null
2018-03-02 17:29:43,923 INFO mapred.MapTask - Processing split:
org.apache.gora.mapreduce.GoraInputSplit@424b7f03
2018-03-02 17:29:44,051 INFO indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 17:29:44,767 INFO mapreduce.Job - Job job_local1792747860_0001
running in uber mode : false
2018-03-02 17:29:44,769 INFO mapreduce.Job - map 0% reduce 0%
2018-03-02 17:29:50,926 INFO basic.BasicIndexingFilter - Maximum title
length for indexing set to: -1
2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 17:29:50,926 INFO anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 17:29:50,927 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 17:29:51,153 INFO mapred.LocalJobRunner -
2018-03-02 17:29:52,782 INFO mapred.Task -
Task:attempt_local1792747860_0001_m_000000_0 is done. And is in the process
of committing
2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map
2018-03-02 17:29:52,825 INFO mapred.Task - Task
'attempt_local1792747860_0001_m_000000_0' done.
2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - Finishing task:
attempt_local1792747860_0001_m_000000_0
2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map task executor
complete.
2018-03-02 17:29:53,791 INFO mapreduce.Job - map 100% reduce 0%
2018-03-02 17:29:53,791 INFO mapreduce.Job - Job job_local1792747860_0001
completed successfully
2018-03-02 17:29:53,849 INFO mapreduce.Job - Counters: 15
File System Counters
FILE: Number of bytes read=610359
FILE: Number of bytes written=891634
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=79
Map output records=0
Input split bytes=995
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=103
Total committed heap usage (bytes)=225443840
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2018-03-02 17:29:53,866 INFO indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 17:29:53,866 INFO indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9200)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
2018-03-02 17:29:53,925 INFO indexer.IndexingJob - IndexingJob: done.
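Whether the job counters appear at all depends on the effective level of the logger org.apache.hadoop.mapreduce.Job, which is the logger that prints the "Counters:" lines above. The sketch below checks a log4j.properties file for that. It assumes the log4j 1.x properties format, where a logger inherits the level of its nearest configured ancestor; the function names are made up for illustration.

```python
def parse_logger_levels(text):
    """Map logger name -> level from log4j 1.x style properties text.

    The root logger is stored under the empty name. Appender names after
    the first comma are ignored.
    """
    levels = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue  # blank line, comment, or not a key=value pair
        key, value = line.split("=", 1)
        key = key.strip()
        level = value.split(",")[0].strip().upper()
        if not level:
            continue  # no explicit level: the logger inherits
        if key.startswith("log4j.logger."):
            levels[key[len("log4j.logger."):]] = level
        elif key == "log4j.rootLogger":
            levels[""] = level
    return levels


def effective_level(levels, logger):
    """Walk up the dotted name until a configured level is found."""
    name = logger
    while name:
        if name in levels:
            return levels[name]
        name = name.rpartition(".")[0]
    return levels.get("")


def shows_job_counters(text):
    """True if org.apache.hadoop.mapreduce.Job would log at INFO or finer."""
    level = effective_level(parse_logger_levels(text),
                            "org.apache.hadoop.mapreduce.Job")
    return level in {"ALL", "TRACE", "DEBUG", "INFO"}


if __name__ == "__main__":
    with open("conf/log4j.properties") as handle:  # path is an assumption
        print(shows_job_counters(handle.read()))
```

If this reports False for the file under runtime/local/conf, the edited log4j.properties was probably not deployed.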
On Fri, Mar 2, 2018 at 3:08 PM, Sebastian Nagel <wa...@googlemail.com>
wrote:
> Hi,
>
> It looks more like there is nothing to index.
>
> Unfortunately, in 2.x there are no log messages
> on by default which indicate how many documents
> are sent to the index back-ends.
>
> The easiest way is to enable Job counters in
> conf/log4j.properties by adding the line:
>
> log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
>
> or by changing the level from WARN to INFO in the existing line
>
> log4j.logger.org.apache.hadoop=WARN
>
> Make sure the log4j.properties is correctly deployed
> (in doubt, run "ant runtime"). Then check the hadoop.log
> again: there should be a counter DocumentCount with non-zero
> value.
>
> Best,
> Sebastian
Re: Regarding Indexing to elasticsearch
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
It looks more like there is nothing to index.
Unfortunately, in 2.x no log messages are enabled
by default that indicate how many documents
are sent to the index back-ends.
The easiest way to see this is to enable the Job counters in
conf/log4j.properties by adding the line:
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
or by changing the level from WARN to INFO in the line
log4j.logger.org.apache.hadoop=WARN
Make sure that log4j.properties is correctly deployed
(if in doubt, run "ant runtime"). Then check hadoop.log
again: there should be a counter DocumentCount with a non-zero
value.
Best,
Sebastian
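The check described above can be sketched in Python. This is only a minimal sketch under assumptions: it assumes the Job counters are written to hadoop.log in Hadoop's usual Name=value form (so the counter would appear as DocumentCount=<n>), and the function name document_counts and the path logs/hadoop.log are illustrative, not part of Nutch.

```python
import re
from pathlib import Path

# Assumption: Hadoop prints job counters as "Name=value", so the
# indexing counter would appear as e.g. "DocumentCount=42".
COUNTER_RE = re.compile(r"\bDocumentCount=(\d+)")


def document_counts(log_text):
    """Return every DocumentCount value found in the log text, in order."""
    return [int(m.group(1)) for m in COUNTER_RE.finditer(log_text)]


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever your hadoop.log is written.
    log_file = Path("logs/hadoop.log")
    if log_file.exists():
        counts = document_counts(log_file.read_text(errors="replace"))
        if not counts:
            print("no DocumentCount counter found: either Job counters "
                  "are not logged or no document was indexed")
        elif counts[-1] == 0:
            print("DocumentCount is 0: nothing was sent to the index")
        else:
            print("documents sent to the index:", counts[-1])
```

If the counter is still missing with the Job logger at INFO, that most likely means no document reached the indexer, since Hadoop generally prints only counters that were actually incremented.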
On 03/02/2018 06:50 AM, Yash Thenuan Thenuan wrote:
> Following are the logs from hadoop.log
Re: Regarding Indexing to elasticsearch
Posted by Yash Thenuan Thenuan <ri...@iiita.ac.in>.
The following are the logs from hadoop.log:
2018-03-02 11:18:45,220 INFO indexer.IndexingJob - IndexingJob: starting
2018-03-02 11:18:45,791 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-02 11:18:46,138 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
2018-03-02 11:18:46,138 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 11:18:46,140 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2018-03-02 11:18:46,140 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 11:18:46,157 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 11:18:46,535 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 11:18:48,663 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2018-03-02 11:18:48,666 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2018-03-02 11:18:48,792 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2018-03-02 11:18:48,798 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2018-03-02 11:18:49,093 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 11:18:54,737 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 11:18:54,737 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 11:18:54,738 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 11:18:56,883 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 11:18:56,884 INFO indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9200)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
2018-03-02 11:18:56,939 INFO indexer.IndexingJob - IndexingJob: done.
On Thu, Mar 1, 2018 at 10:11 PM, Sebastian Nagel <wastl.nagel@googlemail.com
> wrote:
> It's impossible to find the reason from console output.
> Please check the hadoop.log, it should contain more logs
> including those from ElasticIndexWriter.
>
> Sebastian
Re: Regarding Indexing to elasticsearch
Posted by Sebastian Nagel <wa...@googlemail.com>.
It's impossible to find the reason from the console output alone.
Please check hadoop.log; it should contain more log messages,
including those from ElasticIndexWriter.
Sebastian
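To pull the relevant entries out of hadoop.log, a small Python filter can help. This is a sketch that assumes the line layout seen in the logs in this thread (timestamp, level, abbreviated logger name, " - ", message). The helper name lines_from is hypothetical, and the exact logger name under which ElasticIndexWriter logs is an assumption, so a substring match is used.

```python
import re

# Matches lines such as
# "2018-03-02 11:18:49,093 INFO indexer.IndexWriters - Adding ..."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+(\S+)\s+-\s*(.*)$"
)


def lines_from(log_text, logger_part):
    """Return (timestamp, level, logger, message) tuples for every log line
    whose logger name contains logger_part. Lines that do not start with a
    timestamp (for example continuation lines) are skipped."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and logger_part in m.group(3):
            hits.append(m.groups())
    return hits
```

For example, lines_from(text, "ElasticIndexWriter") would list whatever that class logged, and lines_from(text, "IndexingJob") shows when the job started and finished.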
On 03/01/2018 06:38 AM, Yash Thenuan Thenuan wrote:
> Hi Sebastian, all of this shows up, but the problem is that the content is not
> sent. Nothing is indexed to ES.
> This is the output on debug level.
>
> ElasticIndexWriter
>
> elastic.cluster : elastic prefix cluster
>
> elastic.host : hostname
>
> elastic.port : port (default 9200)
>
> elastic.index : elastic index command
>
> elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>
> elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>
>
> no modules loaded
>
> loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
>
> loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
>
> loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
>
> loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
>
> loaded plugin [org.elasticsearch.transport.Netty4Plugin]
>
> created thread pool: name [force_merge], size [1], queue size [unbounded]
>
> created thread pool: name [fetch_shard_started], core [1], max [8], keep alive [5m]
>
> created thread pool: name [listener], size [2], queue size [unbounded]
>
> created thread pool: name [index], size [4], queue size [200]
>
> created thread pool: name [refresh], core [1], max [2], keep alive [5m]
>
> created thread pool: name [generic], core [4], max [128], keep alive [30s]
>
> created thread pool: name [warmer], core [1], max [2], keep alive [5m]
>
> thread pool [search] will adjust queue by [50] when determining automatic queue size
>
> created thread pool: name [search], size [7], queue size [1k]
>
> created thread pool: name [flush], core [1], max [2], keep alive [5m]
>
> created thread pool: name [fetch_shard_store], core [1], max [8], keep alive [5m]
>
> created thread pool: name [management], core [1], max [5], keep alive [5m]
>
> created thread pool: name [get], size [4], queue size [1k]
>
> created thread pool: name [bulk], size [4], queue size [200]
>
> created thread pool: name [snapshot], core [1], max [2], keep alive [5m]
>
> node_sampler_interval[5s]
>
> adding address [{#transport#-1}{nNtPR9OJShWSW-ayXRDILA}{localhost}{127.0.0.1:9300}]
>
> connected to node [{tzfqJn0}{tzfqJn0sS5OPV4lKreU60w}{QCGd9doAQaGw4Q_lOqniLQ}{127.0.0.1}{127.0.0.1:9300}]
>
> IndexingJob: done
>
>
> On Wed, Feb 28, 2018 at 10:05 PM, Sebastian Nagel <
> wastl.nagel@googlemail.com> wrote:
>
>> I never tried ES with Nutch 2.3, but the setup should be similar to that for
>> 1.x:
>>
>> - enable the plugin "indexer-elastic" in plugin.includes
>> (upgrade and rename to "indexer-elastic2" in 2.4)
>>
>> - expects ES 1.4.1
>>
>> - available/required options are found in the log file (hadoop.log):
>> ElasticIndexWriter
>> elastic.cluster : elastic prefix cluster
>> elastic.host : hostname
>> elastic.port : port (default 9300)
>> elastic.index : elastic index command
>> elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>> elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>>
>> Sebastian
>>
>> On 02/28/2018 01:26 PM, Yash Thenuan Thenuan wrote:
>>> Yeah, I was thinking that too.
>>> Can somebody help me with Nutch 2.3?
>>>
>>> On 28 Feb 2018 17:53, "Yossi Tamari" <yo...@pipl.com> wrote:
>>>
>>>> Sorry, I just realized that you're using Nutch 2.x and I'm answering for
>>>> Nutch 1.x. I'm afraid I can't help you.
>>>>
>>>>> -----Original Message-----
>>>>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
>>>>> Sent: 28 February 2018 14:20
>>>>> To: user@nutch.apache.org
>>>>> Subject: RE: Regarding Indexing to elasticsearch
>>>>>
>>>>> IndexingJob (<batchId> | -all |-reindex) [-crawlId <id>]
>>>>> This is the output of nutch index. I have already configured nutch-site.xml.
>>>>>
>>>>> On 28 Feb 2018 17:41, "Yossi Tamari" <yo...@pipl.com> wrote:
>>>>>
>>>>>> I suggest you run "nutch index", take a look at the returned help
>>>>>> message, and continue from there.
>>>>>> Broadly, first of all you need to configure your elasticsearch
>>>>>> environment in nutch-site.xml, and then you need to run nutch index
>>>>>> with the location of your CrawlDB and either the segment you want to
>>>>>> index or the directory that contains all the segments you want to index.
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
>>>>>>> Sent: 28 February 2018 14:06
>>>>>>> To: user@nutch.apache.org
>>>>>>> Subject: RE: Regarding Indexing to elasticsearch
>>>>>>>
>>>>>>> All I want is to index my parsed data to elasticsearch.
>>>>>>>
>>>>>>>
>>>>>>> On 28 Feb 2018 17:34, "Yossi Tamari" <yo...@pipl.com> wrote:
>>>>>>>
>>>>>>> Hi Yash,
>>>>>>>
>>>>>>> The nutch index command does not have a -all flag, so I'm not sure
>>>>>>> what you're trying to achieve here.
>>>>>>>
>>>>>>> Yossi.
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014049@iiita.ac.in]
>>>>>>>> Sent: 28 February 2018 13:55
>>>>>>>> To: user@nutch.apache.org
>>>>>>>> Subject: Regarding Indexing to elasticsearch
>>>>>>>>
>>>>>>>> Can somebody please tell me what happens when we run the
>>>>>>>> bin/nutch index -all command?
>>>>>>>> I can't figure out why the write function inside the
>>>>>>>> elastic-indexer is not getting executed.
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>>
>