Posted to user@nutch.apache.org by dealmaker <vi...@gmail.com> on 2009/04/05 07:54:35 UTC

How to find out the encoding and format of the content stored in the index?

Hi,
  I am trying to find out the encoding and format of the content stored in
the index.  I modified the code in BasicIndexFilter.java to store the
content, but the index doesn't seem to store the encoding of that content,
which I need to know.  I also need to know whether it's html, pdf, rss, etc.
I have the following code, but it has to create a Content object, which
needs a content type that I don't have either.  For now I just hard-code
text/html, but I should not.  Please help.  Thanks.

          // Fetch the raw bytes stored for this hit.
          try {
            contentInOctets = bean.getContent(detail);
          } catch (IOException e) {
            if (LOG.isWarnEnabled())
              LOG.warn("GetContent Error", e);
          }

          InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
          // The content type is hard-coded here because I don't know the real one.
          Content content = new Content(sUrl, sUrl, contentInOctets, "text/html",
              new Metadata(), attConf);
          // Re-detect the encoding from the bytes (this is the step I want to avoid).
          detector.autoDetectClues(content, true);
          detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
          String encoding = detector.guessEncoding(content, defaultCharEncoding);

          input.setEncoding(encoding);
-- 
View this message in context: http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22890765.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: How to find out the encoding and format of the content stored in the index?

Posted by yanky young <ya...@gmail.com>.

Hi:

If you have a look at org.apache.nutch.parse.html.HtmlParser, you can see
that these properties are stored in the metadata:

Metadata.ORIGINAL_CHAR_ENCODING
Metadata.CHAR_ENCODING_FOR_CONVERSION
Response.CONTENT_TYPE

so I think you can just get these properties in your index plugin in the
following way:

encoding = parse.getData().getMeta(Metadata.ORIGINAL_CHAR_ENCODING);
convEncoding = parse.getData().getMeta(Metadata.CHAR_ENCODING_FOR_CONVERSION);
contentType = parse.getData().getMeta(Response.CONTENT_TYPE);

and then add them to the Lucene Document as fields.
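
A minimal sketch of that lookup, using a plain java.util.Map as a stand-in
for the Nutch metadata so it runs without Nutch on the classpath. The class
name, the helper names and the three key strings are assumptions for
illustration, not Nutch API; in a real index plugin the values would come
from parse.getData().getMeta(...). Preferring the conversion encoding over
the original one is also just one possible choice.

```java
import java.util.HashMap;
import java.util.Map;

public class StoredEncodingLookup {

    // Assumed string values of the Nutch constants named in this thread;
    // check them against Metadata and Response in your Nutch version.
    public static final String ORIGINAL_CHAR_ENCODING = "OriginalCharEncoding";
    public static final String CHAR_ENCODING_FOR_CONVERSION = "CharEncodingForConversion";
    public static final String CONTENT_TYPE = "Content-Type";

    /** Returns the first non-blank value found under the given keys, or the fallback. */
    public static String firstPresent(Map<String, String> meta, String fallback, String... keys) {
        for (String key : keys) {
            String value = meta.get(key);
            if (value != null && !value.trim().isEmpty()) {
                return value.trim();
            }
        }
        return fallback;
    }

    /** Reuses the encoding recorded at parse time instead of detecting it again. */
    public static String storedEncoding(Map<String, String> meta, String defaultEncoding) {
        return firstPresent(meta, defaultEncoding,
                CHAR_ENCODING_FOR_CONVERSION, ORIGINAL_CHAR_ENCODING);
    }

    public static void main(String[] args) {
        // Hypothetical metadata for one page (stand-in for the parse metadata).
        Map<String, String> meta = new HashMap<>();
        meta.put(ORIGINAL_CHAR_ENCODING, "windows-1252");
        meta.put(CONTENT_TYPE, "text/html");

        // No conversion encoding was recorded, so the original one is used.
        System.out.println(storedEncoding(meta, "UTF-8"));
        System.out.println(firstPresent(meta, "application/octet-stream", CONTENT_TYPE));
    }
}
```

The fallback argument covers documents whose parser did not record an
encoding at all, so the caller never has to handle null.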

good luck

yanky



Re: How to find out the encoding and format of the content stored in the index?

Posted by dealmaker <vi...@gmail.com>.

Thanks.  Is there a similar thing for the encoding?  For performance
reasons, I don't want it to detect the encoding again.



Re: How to find out the encoding and format of the content stored in the index?

Posted by yanky young <ya...@gmail.com>.

hi:

There is an index-more plugin that indexes some information about the
content type. You can have a look.
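
Once a content type string is available, telling html from pdf or rss is a
matter of splitting it. The sketch below is not taken from index-more; it
is a small JDK-only illustration, and the class name ContentTypeParts is
made up for this example. It separates the MIME type from an optional
charset parameter.

```java
import java.util.Locale;

public class ContentTypeParts {
    public final String mimeType; // e.g. "text/html", or null if absent
    public final String charset;  // null when there is no charset parameter

    private ContentTypeParts(String mimeType, String charset) {
        this.mimeType = mimeType;
        this.charset = charset;
    }

    /** Splits a value such as "text/html; charset=UTF-8" into its parts. */
    public static ContentTypeParts parse(String header) {
        if (header == null) {
            return new ContentTypeParts(null, null);
        }
        // Limit -1 keeps trailing empty strings, so ";" still yields parts[0].
        String[] parts = header.split(";", -1);
        String mime = parts[0].trim().toLowerCase(Locale.ROOT);
        String charset = null;
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                // Drop optional quotes around the value.
                String value = param.substring(eq + 1).trim().replace("\"", "");
                if (!value.isEmpty()) {
                    charset = value;
                }
            }
        }
        return new ContentTypeParts(mime.isEmpty() ? null : mime, charset);
    }

    public static void main(String[] args) {
        ContentTypeParts p = parse("text/html; charset=UTF-8");
        System.out.println(p.mimeType + " / " + p.charset);
    }
}
```

A charset given in the content type is only what the server claimed, so it
may differ from the encoding the parser actually used.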
