Posted to user@nutch.apache.org by Jason S <ja...@gmail.com> on 2013/03/05 22:17:55 UTC

keep all pages from a domain in one slice

Hello,

I seem to remember seeing a discussion about this in the past but I can't seem to find it in the archives.

When using mergesegs -slice, is it possible to keep all the pages from a domain in the same slice?  I have just been messing around with this functionality (Nutch 1.6), and it seems like the records are simply split after the counter has reached the slice size specified, sometimes splitting the records from a single domain over multiple slices. 

How can I segregate a domain to a single slice?

Thanks in advance,

~Jason
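
To illustrate the behavior described above, here is a minimal Java sketch of purely count-based slicing. It is not Nutch's own merge code; the names CountSliceDemo and sliceByCount are made up for this example, and the only assumption is the one stated in the post, namely that a new slice starts once the record counter reaches the slice size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CountSliceDemo {

    // Cut the records into consecutive slices of at most sliceSize entries,
    // without looking at which domain a record belongs to.
    static List<List<String>> sliceByCount(List<String> urls, int sliceSize) {
        if (sliceSize <= 0) {
            throw new IllegalArgumentException("sliceSize must be positive");
        }
        List<List<String>> slices = new ArrayList<>();
        for (int i = 0; i < urls.size(); i += sliceSize) {
            int end = Math.min(i + sliceSize, urls.size());
            slices.add(new ArrayList<>(urls.subList(i, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://apache.org/a",
            "http://apache.org/b",
            "http://apache.org/c",
            "http://linux.org/a");
        // With a slice size of 2, the third apache.org page ends up in the
        // second slice together with the linux.org page.
        System.out.println(sliceByCount(urls, 2));
    }
}
```

With three apache.org pages and a slice size of 2, the domain is necessarily spread over two slices, which is the effect being asked about.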

Re: keep all pages from a domain in one slice

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Jason,

I saw no mention of mergesegs or any mention of Nutch within the first post
of this thread.
It probably would be a useful feature for many Nutch users. I agree.
It seems the thread did not get much feedback though which is a shame.
If you want to discuss more thoroughly then please do. There will surely be
some that will participate.
Thanks, have a great weekend.
Lewis

On Wed, Mar 6, 2013 at 1:34 AM, Stubblefield Jason <
mr.jason.stubblefield@gmail.com> wrote:

> Well Lewis, I quite frankly disagree.
>
> I am asking how I can have more control for the slice process in the nutch
> mergesegs operation.
>
> I think this could be a useful feature to many Nutch users.
>
> I can see that I won't get any more assistance here.
>
> Thanks,
>
> Jason


-- 
*Lewis*

Re: keep all pages from a domain in one slice

Posted by SUJIT PAL <su...@comcast.net>.
Hi Jason,

I did something along the lines of what you are after and submitted the patch as NUTCH-945; maybe you will find it useful.

https://issues.apache.org/jira/browse/NUTCH-945

The idea behind the patch is this: you set up a list of Solr servers in your configuration, then define and configure a partitioner that, given the document URL, returns an index into that list of Solr servers. In your case I think you will have to build a custom partitioner that uses the domain to decide the partition.

-sujit
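
A rough Java sketch of the domain-based partitioning idea described above. It does not use the classes from the NUTCH-945 patch; DomainPartitionSketch, domainOf, partition and the server URLs are all hypothetical names chosen for illustration. The "domain" is taken naively as the last two labels of the host name, which is an assumption that breaks for suffixes such as co.uk.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class DomainPartitionSketch {

    // Naive "domain": the last two labels of the host name.
    // Assumption for this sketch only; wrong for suffixes such as co.uk.
    static String domainOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        int n = labels.length;
        if (n <= 2) {
            return String.join(".", labels);
        }
        return labels[n - 2] + "." + labels[n - 1];
    }

    // Index into the server list; the same domain always yields the same index.
    static int partition(String url, int numServers) {
        return Math.floorMod(domainOf(url).hashCode(), numServers);
    }

    public static void main(String[] args) {
        // Hypothetical server addresses.
        List<String> servers = Arrays.asList(
            "http://solr1:8983/solr",
            "http://solr2:8983/solr");
        List<String> urls = Arrays.asList(
            "http://www.apache.org/foundation/",
            "http://nutch.apache.org/",
            "http://en.wikipedia.org/wiki/Web_crawler");
        for (String url : urls) {
            System.out.println(url + " -> " + servers.get(partition(url, servers.size())));
        }
    }
}
```

Hashing keeps every page of a domain on one server but gives no control over which server that is, and changing the number of servers moves domains around.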

On Mar 6, 2013, at 1:34 AM, Stubblefield Jason wrote:

> Well Lewis, I quite frankly disagree.  
> 
> I am asking how I can have more control for the slice process in the nutch mergesegs operation.
> 
> I think this could be a useful feature to many Nutch users.
> 
> I can see that I won't get any more assistance here.
> 
> Thanks,
> 
> Jason


Re: keep all pages from a domain in one slice

Posted by Stubblefield Jason <mr...@gmail.com>.
Well Lewis, I quite frankly disagree.  

I am asking how I can have more control over the slice process in the Nutch mergesegs operation.

I think this could be a useful feature to many Nutch users.

I can see that I won't get any more assistance here.

Thanks,

Jason



On Mar 6, 2013, at 6:18 AM, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Hi Jason,
> There is nothing I can see here which concerns Nutch.
> Try solr lists please.
> Thank you
> Lewis


Re: keep all pages from a domain in one slice

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Jason,
There is nothing I can see here which concerns Nutch.
Try solr lists please.
Thank you
Lewis

On Tuesday, March 5, 2013, Stubblefield Jason <
mr.jason.stubblefield@gmail.com> wrote:
> I have several Solr 3.6 instances that, for various reasons, I don't want
> to upgrade to 4.0 yet.  My index is too big to fit on one machine.  I want
> to be able to slice the crawl so that I can have one slice per Solr shard,
> but also use Solr's grouping feature.  From what I understand, Solr
> grouping doesn't work properly when pages from a domain are spread across
> Solr shards.
>
> Basically, I'm after something like this:
>
> slice1 (apache.org, linux.org) -> solr1
>
> slice2 (stackoverflow.com, wikipedia.org) -> solr2
>
> etc...
>
> I could upgrade to SolrCloud, or possibly use Elasticsearch, but it would
> be a fair amount of re-coding.  I was just curious if I could manage the
> sharding manually.
>
> Suggestions would certainly be appreciated; it seems I am faced with
> either a massive upgrade or breaking the grouping functionality.
>
> ~Jason

-- 
*Lewis*

Re: keep all pages from a domain in one slice

Posted by Stubblefield Jason <mr...@gmail.com>.
I have several Solr 3.6 instances that, for various reasons, I don't want to upgrade to 4.0 yet.  My index is too big to fit on one machine.  I want to be able to slice the crawl so that I can have one slice per Solr shard, but also use Solr's grouping feature.  From what I understand, Solr grouping doesn't work properly when pages from a domain are spread across Solr shards.

Basically, I'm after something like this:

slice1 (apache.org, linux.org) -> solr1

slice2 (stackoverflow.com, wikipedia.org) -> solr2

etc...

I could upgrade to SolrCloud, or possibly use Elasticsearch, but it would be a fair amount of re-coding.  I was just curious if I could manage the sharding manually.

Suggestions would certainly be appreciated; it seems I am faced with either a massive upgrade or breaking the grouping functionality.

~Jason
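
The slice-to-shard layout above could be expressed as an explicit lookup table. The following Java sketch is only an illustration of such manual sharding; ManualShardTable, assign and shardFor are hypothetical names, not part of Nutch or Solr. A host is matched against the table by trying the full host name first and then each parent domain.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ManualShardTable {

    private final Map<String, String> shardByDomain = new HashMap<>();

    // Assign one or more domains to a named shard.
    void assign(String shard, String... domains) {
        for (String domain : domains) {
            shardByDomain.put(domain.toLowerCase(Locale.ROOT), shard);
        }
    }

    // Returns the shard for the URL's host or one of its parent domains,
    // or null if no configured domain matches.
    String shardFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            String shard = shardByDomain.get(candidate);
            if (shard != null) {
                return shard;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null;
            }
            // Strip the left-most label and try the parent domain.
            candidate = candidate.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        ManualShardTable table = new ManualShardTable();
        table.assign("solr1", "apache.org", "linux.org");
        table.assign("solr2", "stackoverflow.com", "wikipedia.org");
        System.out.println(table.shardFor("http://nutch.apache.org/index.html"));        // prints solr1
        System.out.println(table.shardFor("http://en.wikipedia.org/wiki/Apache_Nutch")); // prints solr2
    }
}
```

An explicit table keeps control over which domain lands on which shard, at the cost of having to decide what to do with domains that are not listed.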

On Mar 5, 2013, at 11:02 PM, Markus Jelsma <ma...@openindex.io> wrote:

> Hi
> 
> You can't do this with -slice but you can merge segments and filter them. This would mean you'd have to merge the segments for each domain. But that's far too much work. Why do you want to do this? There may be better ways of achieving your goal.


Re: keep all pages from a domain in one slice

Posted by feng lu <am...@gmail.com>.
Hi

Maybe you can implement the SegmentMergeFilter interface to filter segments
during the segment merge.
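
A sketch of the kind of keep-or-drop decision such a filter would make, written as plain Java so that it runs on its own. It deliberately does not implement Nutch's SegmentMergeFilter: that interface's method signature is not reproduced here and should be checked in the Javadoc for your Nutch version. DomainKeepFilter and keep are hypothetical names.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class DomainKeepFilter {

    private final Set<String> keptDomains = new HashSet<>();

    DomainKeepFilter(String... domains) {
        for (String domain : domains) {
            keptDomains.add(domain.toLowerCase(Locale.ROOT));
        }
    }

    // true = keep the record in this merge output, false = drop it.
    boolean keep(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // unparsable URL: drop
        }
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : keptDomains) {
            // Match the domain itself or any of its subdomains.
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        DomainKeepFilter slice1 = new DomainKeepFilter("apache.org", "linux.org");
        System.out.println(slice1.keep("http://nutch.apache.org/"));      // prints true
        System.out.println(slice1.keep("http://stackoverflow.com/q/1"));  // prints false
    }
}
```

With one such filter per group of domains, the merge would have to be run once per group, which matches the remark that this approach means merging the segments separately for each domain.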


On Wed, Mar 6, 2013 at 6:02 AM, Markus Jelsma <ma...@openindex.io> wrote:

> Hi
>
> You can't do this with -slice but you can merge segments and filter them.
> This would mean you'd have to merge the segments for each domain. But
> that's far too much work. Why do you want to do this? There may be better
> ways of achieving your goal.



-- 
Don't Grow Old, Grow Up... :-)

RE: keep all pages from a domain in one slice

Posted by Markus Jelsma <ma...@openindex.io>.
Hi

You can't do this with -slice, but you can merge segments and filter them. This would mean you'd have to merge the segments for each domain, but that's far too much work. Why do you want to do this? There may be better ways of achieving your goal.

 
 
-----Original message-----
> From:Jason S <ja...@gmail.com>
> Sent: Tue 05-Mar-2013 22:18
> To: user@nutch.apache.org
> Subject: keep all pages from a domain in one slice
> 
> Hello,
> 
> I seem to remember seeing a discussion about this in the past but I can't seem to find it in the archives.
> 
> When using mergesegs -slice, is it possible to keep all the pages from a domain in the same slice?  I have just been messing around with this functionality (Nutch 1.6), and it seems like the records are simply split after the counter has reached the slice size specified, sometimes splitting the records from a single domain over multiple slices. 
> 
> How can I segregate a domain to a single slice?
> 
> Thanks in advance,
> 
> ~Jason