Posted to solr-user@lucene.apache.org by Lance Norskog <go...@gmail.com> on 2012/08/30 04:37:37 UTC

Re: Document Processing

I've seen the JSoup HTML parser library used for this. It worked
really well. The Boilerpipe library may be what you want. Its
Schwerpunkt (*) is to separate boilerplate from wanted text in an HTML
page. I don't know how fine-grained its control is.

* German for "main focus"; roughly its raison d'être (reason for being). There is no single English word for this concept.
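
A minimal sketch of the two approaches, for illustration only (the HTML snippet, the div#content selector, and the choice of Boilerpipe's ArticleExtractor are assumptions, not recommendations from this thread):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class HtmlTextExtraction {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><nav>home | about</nav>"
                + "<div id=\"content\"><p>The article text we actually want.</p></div>"
                + "<footer>copyright 2012</footer></body></html>";

        // JSoup: fine-grained, selector-based extraction of a known region.
        Document doc = Jsoup.parse(html);
        String jsoupText = doc.select("div#content").text();

        // Boilerpipe: heuristic separation of boilerplate from main text
        // (on a page this small the heuristics have little to work with).
        String boilerpipeText = ArticleExtractor.INSTANCE.getText(html);

        System.out.println("JSoup:      " + jsoupText);
        System.out.println("Boilerpipe: " + boilerpipeText);
    }
}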

On Tue, Dec 6, 2011 at 1:39 PM, Tommaso Teofili
<to...@gmail.com> wrote:
> Hello Michael,
>
> I can help you with using the UIMA UpdateRequestProcessor [1]; the current
> implementation executes UIMA pipelines in memory, but since I was planning
> to add support for higher scalability (with UIMA-AS [2]), that may help you
> as well.
>
> Tommaso
>
> [1] :
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java
> [2] : http://uima.apache.org/doc-uimaas-what.html
>
> 2011/12/5 Michael Kelleher <mj...@gmail.com>
>
>> Hello Erik,
>>
>> I will take a look at both:
>>
>> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
>>
>> and
>>
>> org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
>>
>>
>> and figure out what I need to extend to handle processing in the way I am
>> looking for.  I am assuming that "component" configuration is handled in a
>> standard way such that I can configure my new UpdateProcessor in the same
>> way I would configure any other UpdateProcessor "component"?
>>
>> Thanks for the suggestion.
>>
>>
>> One more question: given that I am probably going to convert the HTML to
>> XML so I can use XPath expressions to "extract" my content, do you think
>> that this kind of processing will overload Solr?  This Solr instance will
>> be used solely for indexing, and will only ever have a single ManifoldCF
>> crawling job feeding it documents at one time.
>>
>> --mike
>>
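
As a rough sketch of the kind of extension discussed above (the class name, the sourceField parameter, and the extracted_text field are invented for the example; they are not from this thread):

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Hypothetical processor that derives an extra field from a raw HTML field.
public class HtmlExtractUpdateProcessorFactory extends UpdateRequestProcessorFactory {

    private String sourceField = "html";

    // Configuration arrives through the standard init(NamedList) mechanism,
    // populated from the <processor> element in solrconfig.xml.
    public void init(NamedList args) {
        Object field = args.get("sourceField");
        if (field != null) {
            sourceField = field.toString();
        }
    }

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object html = doc.getFieldValue(sourceField);
                if (html != null) {
                    // Parse the HTML here (JSoup, Boilerpipe, XPath, ...) and
                    // add whatever fields the extraction produces.
                    doc.addField("extracted_text", html.toString());
                }
                super.processAdd(cmd); // hand the document to the next processor
            }
        };
    }
}

The factory would then be registered in an updateRequestProcessorChain in solrconfig.xml, which is the same standard mechanism used to configure any other update processor "component".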



-- 
Lance Norskog
goksron@gmail.com

Re: Document Processing

Posted by Tanguy Moal <ta...@gmail.com>.
If your interest is in the real textual content of a web page, you could
try JReadability (https://github.com/ifesdjeen/jReadability, Apache 2.0
license), which wraps JSoup (as Lance suggested) and applies a set of
predefined rules to strip cruft (nav, headers, footers, ...) from the
content.

If you would rather map portions of a web page to dedicated Solr fields,
using JSoup on its own could be a win. Read this:
https://norrisshelton.wordpress.com/2011/01/27/jsoup-java-html-parser/
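
A small sketch of that second approach (the URL, the selectors, and the field names are invented for the example and would depend entirely on the pages and schema in question):

import org.apache.solr.common.SolrInputDocument;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageToFields {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page with JSoup.
        Document page = Jsoup.connect("http://example.com/some-page.html").get();

        // Map distinct portions of the page to dedicated Solr fields
        // using CSS selectors (these selectors are site-specific examples).
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/some-page.html");
        doc.addField("title", page.select("h1").text());
        doc.addField("body", page.select("div.article-body").text());
        doc.addField("byline", page.select("span.author").text());

        // The document would then be sent to Solr, e.g. via SolrJ or an
        // XML/JSON post to the update handler.
        System.out.println(doc);
    }
}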

Hope this helps,

--
Tanguy


Re: Document Processing

Posted by Lance Norskog <go...@gmail.com>.
There is another way to do this: crawl the mobile site! 

The Fennec browser from Mozilla runs on Android. I often use it to get page clutter off my screen.
