Posted to user@nutch.apache.org by Cheng Li <ch...@usc.edu> on 2011/07/20 08:42:57 UTC

extract data from html, help

Hi,

I want to extract the price data (here the price is $1110) from
http://www.kbb.com/volkswagen/jetta/1991-volkswagen-jetta/gl-sedan-2d/?vehicleid=11638&intent=buy-used&pricetype=private-party&condition=good.

But in the website's source code, I cannot find any information about the
price of $1110. How should I extract the price data from this page?

  Thanks,

-- 
Cheng Li

Re: extract data from html, help

Posted by Julien Nioche <li...@gmail.com>.
Simply implement an HtmlParseFilter, which will receive a DOM representation
from the Tika or HTML parser. Look in the existing plugins for examples, or
search the mailing list.
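
For readers following along: in Nutch the filter itself would be a Java plugin, so the Python sketch below is only an illustration of the kind of DOM walk such a filter might perform. It uses the standard library's xml.dom.minidom, which follows the same W3C DOM style of API. SNIPPET is a simplified, hypothetical fragment assembled from the markup quoted in this thread, and the helper names text_of and find_values are made up for this example.

```python
from xml.dom import minidom

# Hypothetical, simplified fragment modelled on the markup quoted in this thread.
SNIPPET = (
    '<ul>'
    '<li class="good-value selected">'
    '<span class="value"><span class="icon"></span>$1,110</span>'
    '</li>'
    '</ul>'
)


def text_of(node):
    """Concatenate all text nodes below a DOM node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def find_values(root):
    """Return the text of every <span> whose class attribute is exactly 'value'."""
    return [
        text_of(span)
        for span in root.getElementsByTagName("span")
        if span.getAttribute("class") == "value"
    ]


doc = minidom.parseString(SNIPPET)
prices = find_values(doc)
print(prices)
```

A real page is rarely well-formed XML, so in practice the DOM would come from an HTML-tolerant parser, as it does inside Nutch.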

On 20 July 2011 08:53, Cheng Li <ch...@usc.edu> wrote:

> Thank you.
>
> What do you mean by XPath? Could you explain a little more?
>
> Actually, I was considering using Tika to handle the extraction part. Any
> suggestions for that?
>
> Thanks,
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: extract data from html, help

Posted by Cheng Li <ch...@usc.edu>.
Thank you.

What do you mean by XPath? Could you explain a little more?

Actually, I was considering using Tika to handle the extraction part. Any
suggestions for that?

Thanks,

On Wed, Jul 20, 2011 at 12:37 AM, Hannes Carl Meyer <
hannescarl@googlemail.com> wrote:

> As far as I can see, the price is in the source code.
> You could use XPath, for example, to extract that information via
>
> //li[@class='good-value selected']/span[@class='value']
>
> BR
>
> Hannes
>



-- 
Cheng Li

Re: extract data from html, help

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
As far as I can see, the price is in the source code.
You could use XPath, for example, to extract that information via

//li[@class='good-value selected']/span[@class='value']

BR

Hannes
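
To try the expression above without any crawler, here is a minimal sketch using Python's standard xml.etree.ElementTree, which supports a limited XPath subset. Two assumptions are worth flagging: SNIPPET is a simplified, hypothetical fragment (the real page is larger and probably not well-formed XML), and ElementTree expects a relative path, so the leading // is written as .// here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment modelled on the markup quoted in this thread.
SNIPPET = (
    '<ul>'
    '<li class="good-value selected">'
    '<span class="value"><span class="icon"></span>$1,110</span>'
    '</li>'
    '</ul>'
)

# Same expression as suggested above, made relative for ElementTree.
XPATH = ".//li[@class='good-value selected']/span[@class='value']"

root = ET.fromstring(SNIPPET)
matches = root.findall(XPATH)

# itertext() also picks up text that follows the empty icon <span>.
values = ["".join(match.itertext()).strip() for match in matches]
print(values)
```

Note that the predicate compares the whole class attribute, so it would stop matching if the site reordered or added class names.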

On Wed, Jul 20, 2011 at 9:13 AM, Gora Mohanty <go...@mimirtech.com> wrote:

> On Wed, Jul 20, 2011 at 12:12 PM, Cheng Li <ch...@usc.edu> wrote:
> > Hi,
> >
> > I want to extract the price data (here the price is $1110) from
> > http://www.kbb.com/volkswagen/jetta/1991-volkswagen-jetta/gl-sedan-2d/?vehicleid=11638&intent=buy-used&pricetype=private-party&condition=good.
> >
> > But in the website's source code, I cannot find any information about the
> > price of $1110. How should I extract the price data from this page?
>
> Haven't tried crawling the site with Nutch, but the price is in the source
> code. Do a "View Source" in your browser, and search for 1,110 (there
> is a comma in there). I see:
> <span class="value"><span class="icon"></span>$1,110</span>
>
> Regards,
> Gora
>

Re: extract data from html, help

Posted by Cheng Li <ch...@usc.edu>.
OK, I got it. Thanks.

I think I might use Tika to do the extraction. The format is HTML, so I
need to use some tokens and regular expressions to deal with it. Any
suggestions for that?

Thanks,
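
For illustration only, a regular-expression approach to the span quoted in this thread might look like the sketch below. The pattern and the function name extract_price are assumptions made for this example, and a regex tied this closely to the markup will break whenever the page layout changes, which is why a DOM or XPath lookup is usually the sturdier choice.

```python
import re

# The span as quoted in this thread.
HTML = '<span class="value"><span class="icon"></span>$1,110</span>'

# Assumed pattern: a "value" span, anything up to a dollar sign,
# then digits with optional thousands separators.
PRICE_RE = re.compile(r'<span class="value">.*?\$([\d,]+)</span>', re.DOTALL)


def extract_price(html):
    """Return the first price found as an int, or None if nothing matches."""
    match = PRICE_RE.search(html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


price = extract_price(HTML)
print(price)
```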

On Wed, Jul 20, 2011 at 12:13 AM, Gora Mohanty <go...@mimirtech.com> wrote:

> On Wed, Jul 20, 2011 at 12:12 PM, Cheng Li <ch...@usc.edu> wrote:
> > Hi,
> >
> > I want to extract the price data (here the price is $1110) from
> > http://www.kbb.com/volkswagen/jetta/1991-volkswagen-jetta/gl-sedan-2d/?vehicleid=11638&intent=buy-used&pricetype=private-party&condition=good.
> >
> > But in the website's source code, I cannot find any information about the
> > price of $1110. How should I extract the price data from this page?
>
> Haven't tried crawling the site with Nutch, but the price is in the source
> code. Do a "View Source" in your browser, and search for 1,110 (there
> is a comma in there). I see:
> <span class="value"><span class="icon"></span>$1,110</span>
>
> Regards,
> Gora
>



-- 
Cheng Li

Re: extract data from html, help

Posted by Gora Mohanty <go...@mimirtech.com>.
On Wed, Jul 20, 2011 at 12:12 PM, Cheng Li <ch...@usc.edu> wrote:
> Hi,
>
> I want to extract the price data (here the price is $1110) from
> http://www.kbb.com/volkswagen/jetta/1991-volkswagen-jetta/gl-sedan-2d/?vehicleid=11638&intent=buy-used&pricetype=private-party&condition=good.
>
> But in the website's source code, I cannot find any information about the
> price of $1110. How should I extract the price data from this page?

Haven't tried crawling the site with Nutch, but the price is in the source
code. Do a "View Source" in your browser, and search for 1,110 (there
is a comma in there). I see:
<span class="value"><span class="icon"></span>$1,110</span>

Regards,
Gora
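
The comma is the whole story here, as a small sketch can show: a literal search for 1110 misses the value, while a search that ignores thousands separators finds it. SOURCE is just the span quoted above, and contains_amount is a hypothetical helper; stripping every comma from the page is a crude normalisation that is only meant to make the point.

```python
# The span as quoted above.
SOURCE = '<span class="value"><span class="icon"></span>$1,110</span>'


def contains_amount(source, amount):
    """True if `amount` appears in `source` once all commas are ignored."""
    return amount.replace(",", "") in source.replace(",", "")


# Literal search: the page has "1,110", so "1110" is not a substring.
found_raw = "1110" in SOURCE

# Separator-insensitive search finds the same value.
found_normalised = contains_amount(SOURCE, "1110")

print(found_raw, found_normalised)
```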