Posted to user@nutch.apache.org by dealmaker <vi...@gmail.com> on 2009/04/05 07:54:35 UTC
How to find out the encoding and format of the content stored in
the index?
Hi,
I am trying to find out the encoding and format of the content stored in
the index. I modified the code in BasicIndexFilter.java to store the
content, but the encoding does not seem to be stored with it, and I need
to know it. I also need to know whether the content is HTML, PDF, RSS, etc.
I have the following code, but it has to create a Content object, which
needs a content type that I also don't have. For now I just hard-code it
as text/html, which I should not do. Please help. Thanks.
// Fetch the raw bytes stored for this hit.
try {
    contentInOctets = bean.getContent(detail);
} catch (IOException e) {
    // Note: contentInOctets is not assigned if this call fails.
    if (LOG.isWarnEnabled()) {
        LOG.warn("GetContent Error", e);
    }
}
InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
// The content type is hard-coded because the real one is not known here.
Content content = new Content(sUrl, sUrl, contentInOctets, "text/html",
    new Metadata(), attConf);
// Collect clues, then guess the encoding.
detector.autoDetectClues(content, true);
detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
String encoding = detector.guessEncoding(content, defaultCharEncoding);
input.setEncoding(encoding);
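The snippet above calls sniffCharacterEncoding without showing it. As a rough illustration only, a helper of that kind could look for a charset declaration in a meta tag near the start of the page. This is a sketch, not the Nutch implementation: the class name CharsetSniffer, the 2000-byte limit and the regular expression are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class CharsetSniffer {
    // Only the head of the document is inspected; the limit is an assumption.
    private static final int CHUNK_SIZE = 2000;

    // Matches charset=... inside a <meta ...> tag, quoted or unquoted.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name, or null if none is found. */
    static String sniff(byte[] content) {
        if (content == null) {
            return null;
        }
        int length = Math.min(content.length, CHUNK_SIZE);
        // ISO-8859-1 maps every byte to exactly one char, so this decode cannot fail.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=Shift_JIS\"></head></html>")
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(page));
    }
}
```

A sniffed value like this is only one clue. It can be wrong or missing, which is why the code above also passes a default encoding to guessEncoding.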
--
View this message in context: http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22890765.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to find out the encoding and format of the content stored in
the index?
Posted by yanky young <ya...@gmail.com>.
Hi,
If you have a look at org.apache.nutch.parse.html.HtmlParser, you can see
that there are three properties stored in the metadata:
Metadata.ORIGINAL_CHAR_ENCODING
Metadata.CHAR_ENCODING_FOR_CONVERSION
Response.CONTENT_TYPE
So I think you can just get these properties in your index plugin in the
following way:
encoding = parse.getData().getMeta(Metadata.ORIGINAL_CHAR_ENCODING);
convEncoding = parse.getData().getMeta(Metadata.CHAR_ENCODING_FOR_CONVERSION);
contentType = parse.getData().getMeta(Response.CONTENT_TYPE);
and then add them to the Lucene Document with the addField method.
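To make the lookup-then-index idea concrete without depending on a particular Nutch version, here is a small stand-alone sketch that uses plain maps in place of the parse metadata, the content metadata and the document. Everything in it is illustrative: the class name, the field names ("encoding", "convEncoding", "contentType") and the key strings are assumptions, and the parse-metadata-first lookup order is also assumed. In a real plugin you would use the Nutch constants and the document API of your version instead.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-alone illustration; maps stand in for the Nutch metadata and the document.
final class EncodingFieldsSketch {
    // Assumed key strings; a real plugin should reference the Nutch constants.
    static final String ORIGINAL_CHAR_ENCODING = "OriginalCharEncoding";
    static final String CHAR_ENCODING_FOR_CONVERSION = "CharEncodingForConversion";
    static final String CONTENT_TYPE = "Content-Type";

    /** Looks in the parse metadata first, then falls back to the content metadata. */
    static String getMeta(Map<String, String> parseMeta,
                          Map<String, String> contentMeta, String name) {
        String value = parseMeta.get(name);
        return value != null ? value : contentMeta.get(name);
    }

    /** Copies whichever of the three values are present into document fields. */
    static Map<String, String> toFields(Map<String, String> parseMeta,
                                        Map<String, String> contentMeta) {
        Map<String, String> doc = new LinkedHashMap<>();
        putIfPresent(doc, "encoding",
            getMeta(parseMeta, contentMeta, ORIGINAL_CHAR_ENCODING));
        putIfPresent(doc, "convEncoding",
            getMeta(parseMeta, contentMeta, CHAR_ENCODING_FOR_CONVERSION));
        putIfPresent(doc, "contentType",
            getMeta(parseMeta, contentMeta, CONTENT_TYPE));
        return doc;
    }

    // Skips missing values so the document never gets a null field.
    private static void putIfPresent(Map<String, String> doc, String field, String value) {
        if (value != null) {
            doc.put(field, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        parseMeta.put(ORIGINAL_CHAR_ENCODING, "windows-1252");
        Map<String, String> contentMeta = new HashMap<>();
        contentMeta.put(CONTENT_TYPE, "text/html");
        System.out.println(toFields(parseMeta, contentMeta));
    }
}
```

Because the values are read from metadata recorded at parse time, nothing has to be detected again at index time, provided the parser actually set them for the document in question.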
good luck
yanky
Re: How to find out the encoding and format of the content stored
in the index?
Posted by dealmaker <vi...@gmail.com>.
Thanks. Is there something similar for the encoding? I don't want it to
re-detect the encoding, for performance reasons.
Re: How to find out the encoding and format of the content stored in
the index?
Posted by yanky young <ya...@gmail.com>.
Hi,
There is an index-more plugin that indexes some information about the
content type. You can have a look.