Posted to user@nutch.apache.org by Matt MacDonald <ma...@nearbyfyi.com> on 2012/09/02 16:16:58 UTC

Subset of fields in ElasticSearch compared to HBase using Nutch 2.0, ElasticSearch, HBase

Hi,

I'm using the most recent Nutch 2.x to crawl a single site, storing
the results in HBase and then indexing for search with ElasticSearch.
My crawl and indexing complete as expected. Looking in HBase I see
metadata that I would expect for a record. Fields like:

 f:typ
                     timestamp=1346408694547, value=text/html
 h:Cache-Control
                     timestamp=1346408694547, value=private
 h:Connection
                     timestamp=1346408694547, value=close
 h:Content-Length
                     timestamp=1346408694547, value=47166
 h:Content-Type
                     timestamp=1346408694547, value=text/html; charset=utf-8
 h:Date
                     timestamp=1346408694547, value=Fri, 31 Aug 2012 10:24:37 GMT
 h:Server
                     timestamp=1346408694547, value=Microsoft-IIS/6.0
 h:Set-Cookie
                     timestamp=1346408694547, value=ASP.NET_SessionId=vl222e555tn03ongipnv2j55; path=/; HttpOnly
 h:X-AspNet-Version
                     timestamp=1346408694547, value=2.0.50727
 h:X-Powered-By
                     timestamp=1346408694547, value=ASP.NET
 h:p3p
                     timestamp=1346408694547, value=CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
 il:http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027
                     timestamp=1346408808930, value=Printable Version
 il:http://www.ci.watertown.ma.us/Archive.aspx?AMID=40
                     timestamp=1346408662165, value=5.18.10 Board of Health May Minutes

But after indexing with bin/nutch elasticindex and looking at the same
record in ElasticSearch I'm only seeing a subset of the fields that I
see in HBase.

{
  id: "us.ma.watertown.ci.www:http/Archive.aspx?ADID=1027",
  site: "www.ci.watertown.ma.us",
  content: "Watertown, MA - ...",
  title: "Watertown, MA - Official Website",
  host: "www.ci.watertown.ma.us",
  digest: "b30833d3cd1180ddd8beb4f7d3bbaeee",
  boost: "0.0",
  tstamp: "2013-06-27T10:24:37.846Z",
  url: "http://www.ci.watertown.ma.us/Archive.aspx?ADID=1027",
  anchor: [
    "5.18.10 Board of Health May Minutes",
    "Printable Version"
  ]
}

I will need to be able to search/query against fields like
Content-Type so I'm wondering if I'm missing a configuration setting
to store those fields in the search index or what else might be going
on that is preventing the fields that I'm seeing in HBase from showing
up in ElasticSearch.

I'm very new to the Nutch codebase but I've looked in
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
and didn't notice anything that would prevent all the fields from
getting into ElasticSearch.

Thanks,
Matt

Re: Subset of fields in ElasticSearch compared to HBase using Nutch 2.0, ElasticSearch, HBase

Posted by Ferdy Galema <fe...@kalooga.com>.
Thanks for updating the list.

On Tue, Sep 4, 2012 at 2:52 PM, Matt MacDonald <ma...@nearbyfyi.com> wrote:

> Hi Ferdy,
>
> Thanks for the additional information. I found out what was missing in
> my configuration. I updated the nutch-site.xml plugin.includes section
> to use index-(basic|anchor|more|urlmeta) and I'm now seeing the fields
> that I was anticipating in the ElasticSearch index. These were the
> relevant URLs that I came across that supplied the information that I
> was looking for:
>
> * http://wiki.apache.org/nutch/IndexStructure
> * https://issues.apache.org/jira/browse/NUTCH-940
>
> Thanks again for such a prompt reply and the help.
>
> Thanks,
> Matt

Re: Subset of fields in ElasticSearch compared to HBase using Nutch 2.0, ElasticSearch, HBase

Posted by Matt MacDonald <ma...@nearbyfyi.com>.
Hi Ferdy,

Thanks for the additional information. I found out what was missing in
my configuration. I updated the nutch-site.xml plugin.includes section
to use index-(basic|anchor|more|urlmeta) and I'm now seeing the fields
that I was anticipating in the ElasticSearch index. These were the
relevant URLs that I came across that supplied the information that I
was looking for:

* http://wiki.apache.org/nutch/IndexStructure
* https://issues.apache.org/jira/browse/NUTCH-940

Thanks again for such a prompt reply and the help.

Thanks,
Matt
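
For anyone reading this later: plugin.includes is a regular expression
that is matched against plugin ids, so adding "more" and "urlmeta" to the
index-(...) group is what lets those indexing filters run. Below is a
minimal sketch of that matching, not Nutch's own code. The class and
method names are hypothetical, the "before" value is assumed for
illustration, and whole-id matching is an assumption about how the
pattern is applied.

```java
import java.util.regex.Pattern;

class PluginIncludesDemo {
    // Returns true when the whole plugin id matches the plugin.includes regex.
    // Assumption: the pattern must match the full id, not just a substring.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical earlier value, assumed for illustration only.
        String before = "index-(basic|anchor)";
        // The index-(...) group reported in this thread.
        String after = "index-(basic|anchor|more|urlmeta)";

        String[] ids = {"index-basic", "index-anchor", "index-more", "index-urlmeta"};
        for (String id : ids) {
            System.out.println(id
                + " before=" + isIncluded(before, id)
                + " after=" + isIncluded(after, id));
        }
    }
}
```

A real plugin.includes value also lists protocol, parse, urlfilter and
other plugins; only the index-(...) part is shown here.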

On Mon, Sep 3, 2012 at 9:11 AM, Ferdy Galema <fe...@kalooga.com> wrote:
> I'm not sure what the original purpose of documentMeta is, but seeing
> as there is already a clearly defined 'fields' container for all fields that
> should be indexed, I guess it is just a place for storing some extra data
> about the fields or document that should be indexed. The ElasticWriter uses
> it only for the type; the SolrWriter does not use it at all. It looks like
> Nutch trunk does not use it either.
>
> In short, for now I would just use the 'fields' and ignore documentMeta.

Re: Subset of fields in ElasticSearch compared to HBase using Nutch 2.0, ElasticSearch, HBase

Posted by Ferdy Galema <fe...@kalooga.com>.
I'm not sure what the original purpose of documentMeta is, but seeing
as there is already a clearly defined 'fields' container for all fields that
should be indexed, I guess it is just a place for storing some extra data
about the fields or document that should be indexed. The ElasticWriter uses
it only for the type; the SolrWriter does not use it at all. It looks like
Nutch trunk does not use it either.

In short, for now I would just use the 'fields' and ignore documentMeta.
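
To illustrate the point, here is a rough sketch of a writer that
serializes only the 'fields' container, so anything that was never copied
into 'fields' does not reach the index. This is not the actual
ElasticWriter or NutchDocument code: the class names are hypothetical,
documentMeta is simplified to a plain map, and the use of documentMeta
for the type is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldsOnlyWriterSketch {
    // Hypothetical stand-in for NutchDocument with its two containers.
    static class Doc {
        final Map<String, List<String>> fields = new LinkedHashMap<>();
        // Simplified: the real class uses a Metadata object, not a Map.
        final Map<String, String> documentMeta = new LinkedHashMap<>();

        // Appends a value to a (possibly multi-valued) indexable field.
        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    // Builds the map a writer would send to the index, reading only 'fields'.
    static Map<String, Object> toIndexSource(Doc doc) {
        Map<String, Object> source = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : doc.fields.entrySet()) {
            List<String> values = e.getValue();
            // Single values become scalars, repeated values stay a list.
            source.put(e.getKey(), values.size() == 1 ? values.get(0) : values);
        }
        return source; // documentMeta is deliberately not consulted
    }

    public static void main(String[] args) {
        Doc doc = new Doc();
        doc.add("title", "Watertown, MA - Official Website");
        doc.add("anchor", "5.18.10 Board of Health May Minutes");
        doc.add("anchor", "Printable Version");
        doc.documentMeta.put("Content-Type", "text/html; charset=utf-8");
        System.out.println(toIndexSource(doc));
    }
}
```

In this sketch a header such as Content-Type only shows up in the index
if something adds it to 'fields' first.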

On Mon, Sep 3, 2012 at 2:38 PM, Matt MacDonald <ma...@nearbyfyi.com> wrote:

> Hi Ferdy,
>
> It's likely that I'm confused about what to expect in the
> ElasticSearch index. Reviewing both ElasticWriter.java and
> NutchDocument.java I see that there are two properties that store data
> about the document:
>
> private Map<String, List<String>> fields;
> private Metadata documentMeta;
>
> Looking at Metadata.java it's likely that the fields that I was
> expecting to show up in ElasticSearch (HTTP Headers like Content-Type,
> Last-Modified, etc.) would be contained in the documentMeta property.
> Is there a reason that the write(NutchDocument) method in
> ElasticWriter shouldn't also store documentMeta in ElasticSearch?
>
> Thanks,
> Matt

Re: Subset of fields in ElasticSearch compared to HBase using Nutch 2.0, ElasticSearch, HBase

Posted by Matt MacDonald <ma...@nearbyfyi.com>.
Hi Ferdy,

It's likely that I'm confused about what to expect in the
ElasticSearch index. Reviewing both ElasticWriter.java and
NutchDocument.java I see that there are two properties that store data
about the document:

private Map<String, List<String>> fields;
private Metadata documentMeta;

Looking at Metadata.java it's likely that the fields that I was
expecting to show up in ElasticSearch (HTTP Headers like Content-Type,
Last-Modified, etc.) would be contained in the documentMeta property.
Is there a reason that the write(NutchDocument) method in
ElasticWriter shouldn't also store documentMeta in ElasticSearch?

Thanks,
Matt

On Sun, Sep 2, 2012 at 1:41 PM, Ferdy Galema <fe...@kalooga.com> wrote:
> Hi,
>
> Do some of the fields that are missing in the index have any special
> characters, such as a hyphen? I can imagine that those are not supported. (I
> have not tested this).
>
> Ferdy.

Re: Subset of fields in ElasticSearch compared to HBase using Nutch 2.0, ElasticSearch, HBase

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

Do some of the fields that are missing in the index have any special
characters, such as a hyphen? I can imagine that those are not supported. (I
have not tested this).

Ferdy.
