
Posted to solr-user@lucene.apache.org by Furkan KAMACI <fu...@gmail.com> on 2013/04/07 00:01:21 UTC

Pointing to HBase for Documents or Directly Saving Documents in HBase

Hi;

First of all, I should mention that I am new to Solr and am researching
it. What I am trying to do is crawl some websites with Nutch and then
index them with Solr (Nutch 2.1, Solr/SolrCloud 4.2).

I wonder about something. I have a cloud of machines that crawls websites
and stores the documents. Then I send those documents to SolrCloud. Solr
indexes the documents, generates the indexes, and saves them. I know
from Information Retrieval theory that it *may* not be efficient to store
indexes in a NoSQL database (they are something like linked lists, and if
you store them in that kind of database you *may* end up with a sparse
representation. By the way, there may be some solutions for this; if you
can explain them, you are welcome to.)

However, Solr stores some documents too (i.e. highlights), so some of my
documents will be duplicated. Considering that I will have many
documents, those duplicated documents may cause a problem for me. So is
there any way to avoid storing the documents in Solr and point to them in
HBase (where I save my crawled documents), or, instead of pointing to
them, to store them directly in HBase (is that efficient or not)?
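
To make the "pointing" idea concrete, here is a minimal sketch of the two-step lookup being asked about: the index holds only terms and document IDs, and the page text is fetched by ID from a separate store. Everything in it is an assumption for illustration. TinyIndex and search_and_fetch are hypothetical names, the index is an in-memory stand-in for Solr, and the store is a plain dict standing in for an HBase table keyed by row key. It does not use the real Solr or HBase client APIs.

```python
from dataclasses import dataclass, field


@dataclass
class TinyIndex:
    """Hypothetical stand-in for Solr: maps a term to document IDs, keeps no text."""
    postings: dict = field(default_factory=dict)

    def add(self, doc_id, text):
        # Index every lowercased, whitespace-separated term of the document.
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        # Return matching document IDs only, in a stable order.
        return sorted(self.postings.get(term.lower(), set()))


def search_and_fetch(index, store, term):
    """Ask the index for matching IDs, then fetch each body from the external store."""
    return [(doc_id, store[doc_id]) for doc_id in index.search(term)]


# Hypothetical stand-in for the HBase table: row key -> crawled page content.
store = {
    "page-1": "Solr indexes crawled pages",
    "page-2": "HBase keeps the crawled pages",
}

index = TinyIndex()
for doc_id, text in store.items():
    index.add(doc_id, text)

results = search_and_fetch(index, store, "hbase")
print(results)
```

The trade-off this makes visible is that every hit costs one extra lookup in the external store, which is the kind of overhead the rest of the thread weighs against simply storing the text in Solr.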

Re: Pointing to HBase for Documents or Directly Saving Documents in HBase

Posted by Furkan KAMACI <fu...@gmail.com>.

All in all, is there anything we can say before measuring the performance
of keeping the stored values of documents in HBase?
I mean something like:

* I will need to communicate with HBase, and this will add more latency
than Lucene
* I will lose some built-in functionality that integrates Lucene and Solr
* I will lose some good things such as Lucene's in-memory caching
* and so on

(These are not true; I just wrote them as examples.)

Any ideas?




Re: Pointing to HBase for Documents or Directly Saving Documents in HBase

Posted by adfel70 <ad...@gmail.com>.

Is there any rule of thumb regarding document size limits when storing
documents in Solr?








--
View this message in context: http://lucene.472066.n3.nabble.com/Pointing-to-Hbase-for-Docuements-or-Directly-Saving-Documents-at-Hbase-tp4054277p4056599.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Pointing to HBase for Documents or Directly Saving Documents in HBase

Posted by Otis Gospodnetic <ot...@gmail.com>.

Use Solr.  It's pretty clear you don't yet have any problems that
would make you think about alternatives.  Using Solr to store and not
just index will make your life simpler (and your app simpler and
likely faster).

Otis
--
Solr & ElasticSearch Support
http://sematext.com/






Re: Pointing to HBase for Documents or Directly Saving Documents in HBase

Posted by Furkan KAMACI <fu...@gmail.com>.

Thanks again for your answer. If I find any document about such comparisons,
I would like to read it.

By the way, is there any advantage to using Lucene instead of anything
else, such as this:

Lucene is natively supported by Solr, and if I use anything else I
may face compatibility or communication problems?



Re: Pointing to HBase for Documents or Directly Saving Documents in HBase

Posted by Otis Gospodnetic <ot...@gmail.com>.

People do use other data stores to retrieve data sometimes; e.g. Mongo
is popular for that.  Like I hinted in another email, I wouldn't
necessarily recommend this for common cases.  Don't do it unless you
really know you need it.  Otherwise, just store in Solr.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/






Re: Pointing to HBase for Documents or Directly Saving Documents in HBase

Posted by Furkan KAMACI <fu...@gmail.com>.

Hi Otis and Jack;

I have done some research on highlights and debugged the code. I see that
highlights are query dependent and not stored. Why does Solr use Lucene for
storing text, i.e. the content of a web page? Is there any comparison
of storing text in HBase or another database versus Lucene?

Also, I would like to know whether anybody on our Solr user list has used
anything other than Lucene to store the text of documents.
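
As a rough illustration of why highlights are query dependent, the toy sketch below builds a snippet around the first word that matches the query, so the same page yields a different snippet for each query. This is not how Solr's highlighter is implemented. The snippet function, its width parameter, and the sample page are all made up for the example.

```python
def snippet(text, query, width=3):
    """Return up to `width` words either side of the first match, with the match marked."""
    words = text.split()
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring trailing punctuation on the word.
        if word.lower().strip(".,") == query.lower():
            start = max(0, i - width)
            window = words[start:i] + ["<em>" + word + "</em>"] + words[i + 1:i + 1 + width]
            return " ".join(window)
    return ""  # no match, so there is nothing to highlight


page = "Nutch crawls the web and Solr indexes the pages it fetched."

print(snippet(page, "solr"))   # snippet centred on "Solr"
print(snippet(page, "nutch"))  # a different snippet from the same page
```

Because the output depends on the query as well as on the document, there is nothing to compute and store ahead of time. What must be available at query time is the original text itself, wherever it is kept.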

2013/4/11 Otis Gospodnetic <ot...@gmail.com>

> Source code is your best bet.  Wiki has info about how to use it, but
> not how highlighting is implemented.  But you don't need to understand
> the implementation details to understand that they are dynamic,
> computed specifically for each query for each matching document, so
> you cannot store them anywhere ahead of time.
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Thu, Apr 11, 2013 at 11:22 AM, Furkan KAMACI <fu...@gmail.com>
> wrote:
> > Hi Otis;
> >
> > It seems that I should read more about highlights. Is there any where
> that
> > explains in detail how highlights are generated at Solr?
> >
> > 2013/4/11 Otis Gospodnetic <ot...@gmail.com>
> >
> >> Hi,
> >>
> >> You can't store highlights ahead of time because they are query
> >> dependent.  You could store documents in HBase and use Solr just for
> >> indexing.  Is that what you want to do?  If so, a custom
> >> SearchComponent executed after QueryComponent could fetch data from
> >> external store like HBase.  I'm not sure if I'd recommend that.
> >>
> >> Otis
> >> --
> >> Solr & ElasticSearch Support
> >> http://sematext.com/
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Apr 11, 2013 at 10:01 AM, Furkan KAMACI <furkankamaci@gmail.com
> >
> >> wrote:
> >> > Actually I don't think to store documents at Solr. I want to store
> just
> >> > highlights (snippets) at Hbase and I want to retrieve them from Hbase
> >> when
> >> > needed.
> >> > What do you think about separating just highlights from Solr and
> storing
> >> > them into Hbase at Solrclod. By the way if you explain at which
> process
> >> and
> >> > how highlights are genareted at Solr you are welcome.
> >> >
> >> >
> >> > 2013/4/9 Otis Gospodnetic <ot...@gmail.com>
> >> >
> >> >> You may also be interested in looking at things like solrbase (on
> >> Github).
> >> >>
> >> >> Otis
> >> >> --
> >> >> Solr & ElasticSearch Support
> >> >> http://sematext.com/
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI <
> furkankamaci@gmail.com>
> >> >> wrote:
> >> >> > Hi;
> >> >> >
> >> >> > First of all should mention that I am new to Solr and making a
> >> research
> >> >> > about it. What I am trying to do that I will crawl some websites
> with
> >> >> Nutch
> >> >> > and then I will index them with Solr. (Nutch 2.1, Solr-SolrCloud
> 4.2 )
> >> >> >
> >> >> > I wonder about something. I have a cloud of machines that crawls
> >> websites
> >> >> > and stores that documents. Then I send that documents into
> SolrCloud.
> >> >> Solr
> >> >> > indexes that documents and generates indexes and save them. I know
> >> that
> >> >> > from Information Retrieval theory: it *may* not be efficient to
> store
> >> >> > indexes at a NoSQL database (they are something like linked lists
> and
> >> if
> >> >> > you store them in such kind of database you *may* have a sparse
> >> >> > representation -by the way there may be some solutions for it. If
> you
> >> >> > explain them you are welcome.)
> >> >> >
> >> >> > However Solr stores some documents too (i.e. highlights) So some
> of my
> >> >> > documents will be doubled somehow. If I consider that I will have
> many
> >> >> > documents, that dobuled documents may cause a problem for me. So is
> >> there
> >> >> > any way not storing that documents at Solr and pointing to them at
> >> >> > Hbase(where I save my crawled documents) or instead of pointing
> >> directly
> >> >> > storing them at Hbase (is it efficient or not)?
> >> >>
> >>
>

Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase

Posted by Otis Gospodnetic <ot...@gmail.com>.

Source code is your best bet.  The wiki has info about how to use
highlighting, but not how it is implemented.  But you don't need to
understand the implementation details to understand that highlights are
dynamic, computed specifically for each query for each matching document,
so you cannot store them anywhere ahead of time.
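
To make the point concrete, here is a minimal Python sketch of why a
highlight depends on the query. It is not Solr's highlighter; the function
name `highlight`, the `window` parameter and the `<em>` markers are
assumptions chosen for illustration. The same stored text yields a
different snippet for each query, so there is nothing to compute before the
query arrives.

```python
import re

def highlight(text, query, window=30, pre="<em>", post="</em>"):
    """Return a snippet around the first query-term match, or None if no term matches."""
    terms = [re.escape(t) for t in query.split()]
    if not terms:
        return None
    # Match any query term as a whole word, ignoring case.
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return None
    # Cut a window of context around the first match, then mark every term inside it.
    start = max(0, match.start() - window)
    end = min(len(text), match.end() + window)
    snippet = text[start:end]
    return pattern.sub(lambda m: pre + m.group(0) + post, snippet)

doc = "Solr builds on Lucene and stores fields in segments."
print(highlight(doc, "lucene"))
print(highlight(doc, "segments"))
```

The two calls read the same document text but return different snippets,
which is why only the text itself (not the highlight) can be stored ahead
of time.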

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Thu, Apr 11, 2013 at 11:22 AM, Furkan KAMACI <fu...@gmail.com> wrote:
> Hi Otis;
>
> It seems that I should read more about highlights. Is there any where that
> explains in detail how highlights are generated at Solr?
>
> 2013/4/11 Otis Gospodnetic <ot...@gmail.com>
>
>> Hi,
>>
>> You can't store highlights ahead of time because they are query
>> dependent.  You could store documents in HBase and use Solr just for
>> indexing.  Is that what you want to do?  If so, a custom
>> SearchComponent executed after QueryComponent could fetch data from
>> external store like HBase.  I'm not sure if I'd recommend that.
>>
>> Otis
>> --
>> Solr & ElasticSearch Support
>> http://sematext.com/
>>
>>
>>
>>
>>
>> On Thu, Apr 11, 2013 at 10:01 AM, Furkan KAMACI <fu...@gmail.com>
>> wrote:
>> > Actually I don't think to store documents at Solr. I want to store just
>> > highlights (snippets) at Hbase and I want to retrieve them from Hbase
>> when
>> > needed.
>> > What do you think about separating just highlights from Solr and storing
>> > them into Hbase at Solrclod. By the way if you explain at which process
>> and
>> > how highlights are genareted at Solr you are welcome.
>> >
>> >
>> > 2013/4/9 Otis Gospodnetic <ot...@gmail.com>
>> >
>> >> You may also be interested in looking at things like solrbase (on
>> Github).
>> >>
>> >> Otis
>> >> --
>> >> Solr & ElasticSearch Support
>> >> http://sematext.com/
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI <fu...@gmail.com>
>> >> wrote:
>> >> > Hi;
>> >> >
>> >> > First of all should mention that I am new to Solr and making a
>> research
>> >> > about it. What I am trying to do that I will crawl some websites with
>> >> Nutch
>> >> > and then I will index them with Solr. (Nutch 2.1, Solr-SolrCloud 4.2 )
>> >> >
>> >> > I wonder about something. I have a cloud of machines that crawls
>> websites
>> >> > and stores that documents. Then I send that documents into SolrCloud.
>> >> Solr
>> >> > indexes that documents and generates indexes and save them. I know
>> that
>> >> > from Information Retrieval theory: it *may* not be efficient to store
>> >> > indexes at a NoSQL database (they are something like linked lists and
>> if
>> >> > you store them in such kind of database you *may* have a sparse
>> >> > representation -by the way there may be some solutions for it. If you
>> >> > explain them you are welcome.)
>> >> >
>> >> > However Solr stores some documents too (i.e. highlights) So some of my
>> >> > documents will be doubled somehow. If I consider that I will have many
>> >> > documents, that dobuled documents may cause a problem for me. So is
>> there
>> >> > any way not storing that documents at Solr and pointing to them at
>> >> > Hbase(where I save my crawled documents) or instead of pointing
>> directly
>> >> > storing them at Hbase (is it efficient or not)?
>> >>
>>

Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase

Posted by Furkan KAMACI <fu...@gmail.com>.

Hi Otis;

It seems that I should read more about highlights. Is there anywhere that
explains in detail how highlights are generated in Solr?

2013/4/11 Otis Gospodnetic <ot...@gmail.com>

> Hi,
>
> You can't store highlights ahead of time because they are query
> dependent.  You could store documents in HBase and use Solr just for
> indexing.  Is that what you want to do?  If so, a custom
> SearchComponent executed after QueryComponent could fetch data from
> external store like HBase.  I'm not sure if I'd recommend that.
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Thu, Apr 11, 2013 at 10:01 AM, Furkan KAMACI <fu...@gmail.com>
> wrote:
> > Actually I don't think to store documents at Solr. I want to store just
> > highlights (snippets) at Hbase and I want to retrieve them from Hbase
> when
> > needed.
> > What do you think about separating just highlights from Solr and storing
> > them into Hbase at Solrclod. By the way if you explain at which process
> and
> > how highlights are genareted at Solr you are welcome.
> >
> >
> > 2013/4/9 Otis Gospodnetic <ot...@gmail.com>
> >
> >> You may also be interested in looking at things like solrbase (on
> Github).
> >>
> >> Otis
> >> --
> >> Solr & ElasticSearch Support
> >> http://sematext.com/
> >>
> >>
> >>
> >>
> >>
> >> On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI <fu...@gmail.com>
> >> wrote:
> >> > Hi;
> >> >
> >> > First of all should mention that I am new to Solr and making a
> research
> >> > about it. What I am trying to do that I will crawl some websites with
> >> Nutch
> >> > and then I will index them with Solr. (Nutch 2.1, Solr-SolrCloud 4.2 )
> >> >
> >> > I wonder about something. I have a cloud of machines that crawls
> websites
> >> > and stores that documents. Then I send that documents into SolrCloud.
> >> Solr
> >> > indexes that documents and generates indexes and save them. I know
> that
> >> > from Information Retrieval theory: it *may* not be efficient to store
> >> > indexes at a NoSQL database (they are something like linked lists and
> if
> >> > you store them in such kind of database you *may* have a sparse
> >> > representation -by the way there may be some solutions for it. If you
> >> > explain them you are welcome.)
> >> >
> >> > However Solr stores some documents too (i.e. highlights) So some of my
> >> > documents will be doubled somehow. If I consider that I will have many
> >> > documents, that dobuled documents may cause a problem for me. So is
> there
> >> > any way not storing that documents at Solr and pointing to them at
> >> > Hbase(where I save my crawled documents) or instead of pointing
> directly
> >> > storing them at Hbase (is it efficient or not)?
> >>
>

Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi,

You can't store highlights ahead of time because they are query
dependent.  You could store documents in HBase and use Solr just for
indexing.  Is that what you want to do?  If so, a custom
SearchComponent executed after QueryComponent could fetch data from
an external store like HBase.  I'm not sure I'd recommend that.
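
The two-phase flow described above can be sketched in plain Python. This is
only an illustration of the idea, not Solr code: `TinyIndex`,
`fetch_from_store` and the dictionary standing in for an HBase table are
all hypothetical names. The first phase returns matching document ids from
an index that holds no document text; the second phase looks those ids up
in the external store.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: term -> set of document ids. No document text is kept."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def query(self, term):
        # .get avoids creating an empty posting list for unknown terms.
        return sorted(self.postings.get(term.lower(), set()))


def fetch_from_store(doc_ids, store):
    """Post-query step: fetch each hit's content from the external store."""
    return [{"id": doc_id, "content": store.get(doc_id)} for doc_id in doc_ids]


# Stand-in for an HBase table keyed by document id (hypothetical data).
external_store = {
    "doc1": "Solr is built on top of Lucene",
    "doc2": "HBase stores the crawled pages",
}

index = TinyIndex()
for doc_id, text in external_store.items():
    index.add(doc_id, text)

hits = index.query("lucene")                       # phase 1: ids only
results = fetch_from_store(hits, external_store)   # phase 2: external lookup
print(results)
```

The cost to keep in mind is the second phase: every result page adds one
round trip per hit (or one batched call) to the external store.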

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Thu, Apr 11, 2013 at 10:01 AM, Furkan KAMACI <fu...@gmail.com> wrote:
> Actually I don't think to store documents at Solr. I want to store just
> highlights (snippets) at Hbase and I want to retrieve them from Hbase when
> needed.
> What do you think about separating just highlights from Solr and storing
> them into Hbase at Solrclod. By the way if you explain at which process and
> how highlights are genareted at Solr you are welcome.
>
>
> 2013/4/9 Otis Gospodnetic <ot...@gmail.com>
>
>> You may also be interested in looking at things like solrbase (on Github).
>>
>> Otis
>> --
>> Solr & ElasticSearch Support
>> http://sematext.com/
>>
>>
>>
>>
>>
>> On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI <fu...@gmail.com>
>> wrote:
>> > Hi;
>> >
>> > First of all should mention that I am new to Solr and making a research
>> > about it. What I am trying to do that I will crawl some websites with
>> Nutch
>> > and then I will index them with Solr. (Nutch 2.1, Solr-SolrCloud 4.2 )
>> >
>> > I wonder about something. I have a cloud of machines that crawls websites
>> > and stores that documents. Then I send that documents into SolrCloud.
>> Solr
>> > indexes that documents and generates indexes and save them. I know that
>> > from Information Retrieval theory: it *may* not be efficient to store
>> > indexes at a NoSQL database (they are something like linked lists and if
>> > you store them in such kind of database you *may* have a sparse
>> > representation -by the way there may be some solutions for it. If you
>> > explain them you are welcome.)
>> >
>> > However Solr stores some documents too (i.e. highlights) So some of my
>> > documents will be doubled somehow. If I consider that I will have many
>> > documents, that dobuled documents may cause a problem for me. So is there
>> > any way not storing that documents at Solr and pointing to them at
>> > Hbase(where I save my crawled documents) or instead of pointing directly
>> > storing them at Hbase (is it efficient or not)?
>>

Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase

Posted by Furkan KAMACI <fu...@gmail.com>.

Actually, I am not planning to store documents in Solr. I want to store just
the highlights (snippets) in HBase and retrieve them from HBase when
needed.
What do you think about separating just the highlights from Solr and storing
them in HBase with SolrCloud? By the way, I would welcome an explanation of
how, and at which stage, highlights are generated in Solr.


2013/4/9 Otis Gospodnetic <ot...@gmail.com>

> You may also be interested in looking at things like solrbase (on Github).
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI <fu...@gmail.com>
> wrote:
> > Hi;
> >
> > First of all should mention that I am new to Solr and making a research
> > about it. What I am trying to do that I will crawl some websites with
> Nutch
> > and then I will index them with Solr. (Nutch 2.1, Solr-SolrCloud 4.2 )
> >
> > I wonder about something. I have a cloud of machines that crawls websites
> > and stores that documents. Then I send that documents into SolrCloud.
> Solr
> > indexes that documents and generates indexes and save them. I know that
> > from Information Retrieval theory: it *may* not be efficient to store
> > indexes at a NoSQL database (they are something like linked lists and if
> > you store them in such kind of database you *may* have a sparse
> > representation -by the way there may be some solutions for it. If you
> > explain them you are welcome.)
> >
> > However Solr stores some documents too (i.e. highlights) So some of my
> > documents will be doubled somehow. If I consider that I will have many
> > documents, that dobuled documents may cause a problem for me. So is there
> > any way not storing that documents at Solr and pointing to them at
> > Hbase(where I save my crawled documents) or instead of pointing directly
> > storing them at Hbase (is it efficient or not)?
>

Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase

Posted by Otis Gospodnetic <ot...@gmail.com>.

You may also be interested in looking at things like solrbase (on Github).

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI <fu...@gmail.com> wrote:
> Hi;
>
> First of all should mention that I am new to Solr and making a research
> about it. What I am trying to do that I will crawl some websites with Nutch
> and then I will index them with Solr. (Nutch 2.1, Solr-SolrCloud 4.2 )
>
> I wonder about something. I have a cloud of machines that crawls websites
> and stores that documents. Then I send that documents into SolrCloud. Solr
> indexes that documents and generates indexes and save them. I know that
> from Information Retrieval theory: it *may* not be efficient to store
> indexes at a NoSQL database (they are something like linked lists and if
> you store them in such kind of database you *may* have a sparse
> representation -by the way there may be some solutions for it. If you
> explain them you are welcome.)
>
> However Solr stores some documents too (i.e. highlights) So some of my
> documents will be doubled somehow. If I consider that I will have many
> documents, that dobuled documents may cause a problem for me. So is there
> any way not storing that documents at Solr and pointing to them at
> Hbase(where I save my crawled documents) or instead of pointing directly
> storing them at Hbase (is it efficient or not)?

Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase

Posted by Jack Krupansky <ja...@basetechnology.com>.

Solr would not be storing the original source form of the documents in any 
case. Whether you use Tika or SolrCell, only the text stream of the content 
and the metadata would ever get indexed or stored in Solr.

Solr completely decouples "indexing" and "storing" of data values. If you 
don't want to "store" the text stream in Solr, then don't.

If you want to "store" the original blob of the source documents in some 
other data store, that's your choice. You can store the original URL or a 
document ID or URL for some alternate document store. That's your choice to 
make. Solr in no way forces you one way or the other. And whether that URL 
or document ID refers to HBase or a web site, doesn't matter to Solr either.
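
As a rough illustration of that decoupling, the Python sketch below models
per-field "indexed" and "stored" flags. It is a toy model, not Solr's
implementation: `SCHEMA`, `FieldedIndex` and the `hbase://pages/doc1`
pointer are hypothetical. The content field is searchable but never kept,
while the url field is kept but not searchable, so a hit returns only the
id and a pointer to wherever the original lives.

```python
from collections import defaultdict

# Per-field flags, loosely modelled on the indexed/stored idea (names are illustrative).
SCHEMA = {
    "id":      {"indexed": True,  "stored": True},
    "url":     {"indexed": False, "stored": True},   # pointer to the external copy
    "content": {"indexed": True,  "stored": False},  # searchable, but text is not kept
}

class FieldedIndex:
    def __init__(self, schema):
        self.schema = schema
        self.postings = defaultdict(set)  # (field, term) -> document ids
        self.stored = {}                  # document id -> {field: value}

    def add(self, doc):
        doc_id = doc["id"]
        self.stored[doc_id] = {}
        for field, value in doc.items():
            flags = self.schema[field]
            if flags["indexed"]:
                for term in str(value).lower().split():
                    self.postings[(field, term)].add(doc_id)
            if flags["stored"]:
                self.stored[doc_id][field] = value

    def search(self, field, term):
        ids = sorted(self.postings.get((field, term.lower()), set()))
        return [self.stored[i] for i in ids]

idx = FieldedIndex(SCHEMA)
idx.add({
    "id": "doc1",
    "url": "hbase://pages/doc1",          # hypothetical pointer format
    "content": "Crawled page about Solr",
})
print(idx.search("content", "solr"))
```

The search matches on the content terms, yet the returned record carries no
content, only the stored id and url.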

Whether or not you could more efficiently store the original document bytes
in Lucene/Solr DocValues vs. HBase is a separate matter. I don't know one
way or the other whether DocValues would help, or whether a Solr
BinaryField might be suitable for storing the original bytes of a document
(without indexing them).

In other words, maybe you could just use two separate Solr servers, one for 
text index and metadata store, and the other for raw store of the original 
document bytes.

-- Jack Krupansky

-----Original Message----- 
From: Furkan KAMACI
Sent: Saturday, April 06, 2013 6:01 PM
To: solr-user@lucene.apache.org
Subject: Pointing to Hbase for Docuements or Directly Saving Documents at 
Hbase

Hi;

First of all should mention that I am new to Solr and making a research
about it. What I am trying to do that I will crawl some websites with Nutch
and then I will index them with Solr. (Nutch 2.1, Solr-SolrCloud 4.2 )

I wonder about something. I have a cloud of machines that crawls websites
and stores that documents. Then I send that documents into SolrCloud. Solr
indexes that documents and generates indexes and save them. I know that
from Information Retrieval theory: it *may* not be efficient to store
indexes at a NoSQL database (they are something like linked lists and if
you store them in such kind of database you *may* have a sparse
representation -by the way there may be some solutions for it. If you
explain them you are welcome.)

However Solr stores some documents too (i.e. highlights) So some of my
documents will be doubled somehow. If I consider that I will have many
documents, that dobuled documents may cause a problem for me. So is there
any way not storing that documents at Solr and pointing to them at
Hbase(where I save my crawled documents) or instead of pointing directly
storing them at Hbase (is it efficient or not)?