Posted to dev@nutch.apache.org by kiran chitturi <ch...@gmail.com> on 2012/10/10 21:41:52 UTC

Nutch 2.x architecture Supporting multivalues

Hi,

I am working on porting the parse-metatags plugin to the Nutch 2.x series.
I previously worked on patches to the same plugin for Nutch 1.5 so that
multivalued tags are saved in an array and then sent to Solr. It all worked
well in 1.5.

I have now ported the plugin to Nutch 2.x, but it works only for a single
value of a tag. It does not work when a tag has multiple values.

I had problems working with the Nutch architecture and API, since some
functions, such as the 'add' function in NutchDocument, do not accept
multiple values. It accepted an 'object' type as the second argument in
version 1.5 but accepts only a string type in the 2.x versions.

I have tried changing the metadata type to 'Map<utf8, List<ByteBuffer>>' in
WebPage and in all other functions that use it. It worked in places but
failed at some points, so I am not sure it is the best way to proceed.
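
To make the idea concrete, here is a minimal standalone sketch of a
multi-valued metadata map. It is only an illustration, not the real WebPage
class: the class name MultiValuedMetadata and its methods are hypothetical,
and plain String keys stand in for the utf8 key type.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a metadata field that keeps a list per key.
final class MultiValuedMetadata {
    // String keys are an assumption made to keep the sketch self-contained.
    private final Map<String, List<ByteBuffer>> metadata = new HashMap<>();

    // Append one more value under the key instead of overwriting it.
    void add(String key, String value) {
        List<ByteBuffer> values = metadata.get(key);
        if (values == null) {
            values = new ArrayList<>();
            metadata.put(key, values);
        }
        values.add(ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Decode every stored value; an unknown key yields an empty list.
    List<String> get(String key) {
        List<String> out = new ArrayList<>();
        List<ByteBuffer> values = metadata.get(key);
        if (values == null) {
            return out;
        }
        for (ByteBuffer b : values) {
            // duplicate() so that reading does not move the stored position.
            out.add(StandardCharsets.UTF_8.decode(b.duplicate()).toString());
        }
        return out;
    }
}
```

The catch with this approach is that every reader and writer of the field
has to change along with the type, which is where my attempt ran into
failures.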

Can someone point me to the best way to do this?

I want the value of a metadata key to accept multiple values, so we should
store it as an array type. NutchDocument.add should accept an array type as
the second parameter so that the index values can be passed as an array.

I am also interested in the opinion of the Nutch developers regarding
these changes.

Many Thanks,

-- 
Kiran Chitturi

Re: Nutch 2.x architecture Supporting multivalues

Posted by kiran chitturi <ch...@gmail.com>.
Sorry for the many posts. I was wrong about NutchDocument; it does indeed
save the value of a key as an ArrayList.

So the workaround I used to get everything working is:

*Parsing*
1) Build the multiple values into a single string using StringBuilder,
separating the individual values with a separator.
2) Save the string as a ByteBuffer and pass it as the parameter.

*Indexing*
1) Retrieve the value of the key from the metadata.
2) Split the string using the same separator that was used during parsing.
3) Pass each resulting string to NutchDocument.add.
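
The parsing and indexing steps above can be sketched roughly as follows.
This is a standalone illustration using only the JDK, not the patch itself:
the class name SeparatorCodec, its methods, and the tab separator are all
assumptions made for the example, and the call to NutchDocument.add is left
out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; names and separator are illustrative only.
final class SeparatorCodec {
    // Assumed separator. It must never occur inside a real tag value.
    static final String SEPARATOR = "\t";

    // Parsing side: join all values into one string and wrap it as bytes.
    static ByteBuffer join(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                sb.append(SEPARATOR);
            }
            sb.append(values.get(i));
        }
        return ByteBuffer.wrap(sb.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Indexing side: decode the bytes and split on the same separator.
    static List<String> split(ByteBuffer buffer) {
        // duplicate() so the caller's buffer position is left untouched.
        ByteBuffer copy = buffer.duplicate();
        byte[] bytes = new byte[copy.remaining()];
        copy.get(bytes);
        String joined = new String(bytes, StandardCharsets.UTF_8);
        if (joined.isEmpty()) {
            return new ArrayList<>();
        }
        // Pattern.quote treats the separator literally; -1 keeps empty values.
        return new ArrayList<>(
                Arrays.asList(joined.split(Pattern.quote(SEPARATOR), -1)));
    }
}
```

In the indexing filter, each string returned by split would then be passed
to NutchDocument.add in turn. Note that this sketch cannot tell an empty
list apart from a list holding one empty string.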

This is my current workaround. Last time, I received suggestions that a
separator might not be a good way to store multiple values.
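
One way to avoid a separator while still keeping a single ByteBuffer per
key would be to prefix each value with its length. The following is only a
sketch of that idea under my own assumptions (the class name
LengthPrefixedCodec and its layout are hypothetical, not anything that
exists in Nutch):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical layout: [count][len][bytes][len][bytes]... with 4-byte ints.
final class LengthPrefixedCodec {
    static ByteBuffer encode(List<String> values) {
        List<byte[]> encoded = new ArrayList<>();
        int size = 4; // room for the value count
        for (String v : values) {
            byte[] b = v.getBytes(StandardCharsets.UTF_8);
            encoded.add(b);
            size += 4 + b.length; // length prefix plus payload
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(encoded.size());
        for (byte[] b : encoded) {
            buf.putInt(b.length);
            buf.put(b);
        }
        buf.flip(); // switch from writing to reading
        return buf;
    }

    static List<String> decode(ByteBuffer buffer) {
        // duplicate() so the caller's buffer position is left untouched.
        ByteBuffer in = buffer.duplicate();
        int count = in.getInt();
        List<String> values = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            byte[] b = new byte[in.getInt()];
            in.get(b);
            values.add(new String(b, StandardCharsets.UTF_8));
        }
        return values;
    }
}
```

Because the values are never scanned for a delimiter, a value may contain
any character, including whatever would otherwise have been the separator.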

Please let me know if you have any suggestions.

Now, with my patch, the metatags are detected and sent to Solr for
indexing in the Nutch 2.x series.

Next week, I will work on porting other plugins to 2.x.

I saw that 'TikaParser.java' in the tika plugin contains a 'To Do for
multivalues' note. Maybe both are the same issue.

Thank you,
Kiran.



-- 
Kiran Chitturi

Re: Nutch 2.x architecture Supporting multivalues

Posted by kiran chitturi <ch...@gmail.com>.
One thing I thought of is that I could use a StringBuilder to append all
the multiple values, convert the result to a string, and save it as the
value in the ByteBuffer.

This way the metadata type need not be changed. Some kind of separator
could be used to distinguish the values, though I am not sure this is
ideal.

In the indexer, we can still split the values out of the combined string
and pass them to NutchDocument as an array, if only we can change that type.

Please let me know what you think of this. A separator might not be ideal.

Thank you,
Kiran







-- 
Kiran Chitturi
