Posted to solr-user@lucene.apache.org by rashmi maheshwari <ma...@gmail.com> on 2014/01/28 17:36:19 UTC

Solr & Nutch

Hi,

Question 1: If Solr can already parse HTML and documents like Word, Excel,
and PDF, why do we need Nutch to parse HTML files? What is the difference?

Question 2: When do we use multiple cores in Solr? Is there a practical
business case that needs multiple cores?

Question 3: When do we go for the cloud? What does implementing SolrCloud
mean?


-- 
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org

Re: Solr & Nutch

Posted by Jorge Luis Betancourt Gonzalez <jl...@uci.cu>.
Q1: Nutch doesn’t only handle parsing HTML files; it also uses Hadoop to achieve large-scale crawling across multiple nodes. It fetches the content of the HTML pages and, yes, it also parses that content.

Q2: In our case we crawl some websites and store the content in one “main” Solr core. We also have a web app with the typical “search box”, and we use a separate core to store the queries made by our users.

Q3: I’m not currently using SolrCloud, so I’m going to let this one pass to a more experienced fellow.
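The “separate core for user queries” idea above can be sketched as follows. This is a minimal, hypothetical example of building a request for Solr’s JSON update format to log one search into a “queries” core; the core name, field names (using Solr’s default dynamic-field suffixes), and host are assumptions, not anything from this thread.

```python
import json
from datetime import datetime, timezone

def build_query_log_update(core_url, query, hits):
    """Build the URL and JSON body that would log one user query
    as a document in a separate Solr core (JSON update format)."""
    doc = {
        "query_s": query,       # the raw query string the user typed
        "hits_l": hits,         # number of results returned
        "timestamp_dt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    url = core_url.rstrip("/") + "/update?commit=true"
    body = json.dumps([doc])    # Solr's update handler accepts a JSON array of docs
    return url, body

# Hypothetical core named "queries" alongside the "main" content core:
url, body = build_query_log_update("http://localhost:8983/solr/queries",
                                   "nutch crawling", 42)
print(url)
print(body)
```

Actually sending the request (e.g. with an HTTP POST of `body` to `url`) is left out, since it needs a running Solr with that core.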


________________________________________________________________________________________________
III International Winter School at UCI, February 17-28, 2014. See www.uci.cu

Re: Solr & Nutch

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
> 1. Nutch follows the links within HTML web pages to crawl the full graph of a web of pages.

In addition, I think Nutch has a PageRank-like scoring function, as opposed to
Lucene/Solr, whose scoring is based on the vector space model.
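For readers unfamiliar with the distinction: link-based scoring ranks pages by the structure of the link graph rather than by term statistics. Below is a minimal PageRank power-iteration sketch on a toy graph; it illustrates the general idea only and is not Nutch's actual implementation (Nutch has used OPIC-style and link-analysis scoring plugins).

```python
def pagerank(links, damping=0.85, iterations=50):
    """Minimal PageRank by power iteration.
    `links` maps page -> list of pages it links to."""
    pages = set(links) | {t for outs in links.values() for t in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page gets the teleport share, plus shares from its in-links.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            if not outs:
                continue
            share = damping * rank[page] / len(outs)
            for target in outs:
                new[target] += share
        rank = new
    return rank

# Toy link graph: c is linked to by both a and b, so it ranks highest.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```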

koji
-- 
http://soleami.com/blog/mahout-and-machine-learning-training-course-is-here.html

Re: Solr & Nutch

Posted by rashmi maheshwari <ma...@gmail.com>.
Thanks Markus and Alexei.



Re: Solr & Nutch

Posted by Alexei Martchenko <al...@martchenko.com.br>.
Well, not even Google parses those. I'm not sure about Nutch, but some
crawlers (jsoup, I believe) have an option to try to extract full URLs from
plain text, so you can capture URLs in the form of someClickFunction('
http://www.someurl.com/whatever'), or even URLs in the middle of a
paragraph. Sometimes it works beautifully; sometimes it misleads you into
parsing URLs that were shortened with an ellipsis in the middle.
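The kind of plain-text URL extraction described above can be sketched with a rough regex; this is an illustrative stand-in, not what jsoup or Nutch actually does, and it inherits the same caveat that it will happily capture truncated or ellipsis-shortened URLs too.

```python
import re

# Rough pattern: an absolute http(s) URL running until whitespace,
# a quote, or an HTML/JS delimiter. Deliberately permissive.
URL_RE = re.compile(r"""https?://[^\s'"<>)]+""")

def extract_urls(text):
    """Pull URL-looking strings out of arbitrary text, including
    JavaScript handlers like someClickFunction('http://...')."""
    return URL_RE.findall(text)

html = """<a href="#" onclick="someClickFunction('http://www.someurl.com/whatever')">go</a>
See http://example.com/page for details."""
print(extract_urls(html))
```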



alexei martchenko
Facebook <http://www.facebook.com/alexeiramone> |
Linkedin <http://br.linkedin.com/in/alexeimartchenko> |
Steam <http://steamcommunity.com/id/alexeiramone/> |
4sq <https://pt.foursquare.com/alexeiramone> | Skype: alexeiramone |
Github <https://github.com/alexeiramone> | (11) 9 7613.0966



Re: Solr & Nutch

Posted by rashmi maheshwari <ma...@gmail.com>.
Thanks All for quick response.

Today I crawled a webpage using Nutch. The page has many links, but every
anchor tag has href="#", and JavaScript on each anchor's onClick event
opens the new page.

So the crawler didn't crawl any of the links that open via an onClick
event and have a # href value.

How can these links be crawled using Nutch?





Re: Solr & Nutch

Posted by Alexei Martchenko <al...@martchenko.com.br>.
1) Plus, those files are sometimes binaries with metadata; specific parsers
are needed to understand them. HTML is plain text.

2) Yes, different data schemas. Sometimes I replicate the same core and run
A/B tests with different weights, filters, etc. Some people like to create
CoreA and CoreB with the same schema, hammer CoreA with updates, commits,
and optimizes while CoreB stays available for searches, and then swap again.
This keeps searches fast.



Re: Solr & Nutch

Posted by Jack Krupansky <ja...@basetechnology.com>.
1. Nutch follows the links within HTML web pages to crawl the full graph of 
a web of pages.

2. Think of a core as an SQL table - each table/core has a different type of 
data.

3. SolrCloud is all about scaling and availability - multiple shards for 
larger collections and multiple replicas for both scaling of query response 
and availability if nodes go down.

-- Jack Krupansky
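On point 3, sharding means each document is routed to exactly one shard, usually by hashing its unique id. Solr's compositeId router uses MurmurHash3 over hash ranges; the sketch below is a deliberately simplified stand-in (md5 modulo shard count) just to illustrate that the assignment is deterministic and roughly uniform.

```python
import hashlib

def pick_shard(doc_id, num_shards):
    """Simplified illustration of hash-based document routing.
    Not Solr's actual algorithm: Solr maps a MurmurHash3 of the id
    onto per-shard hash ranges; here we just take md5 mod num_shards."""
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % num_shards

# The same id always lands on the same shard, so updates overwrite
# the right copy and queries can be fanned out across all shards.
assignment = {d: pick_shard(d, 2) for d in ["doc1", "doc2", "doc3", "doc4"]}
print(assignment)
```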
