Posted to user@nutch.apache.org by Henry Noerdlinger <hn...@infonow.com> on 2010/08/24 00:24:47 UTC

find segment for an url

I want to loop through URLs which have been crawled / indexed.

I have a (known) subset of URLs that I want to get the (raw) content for.

If I know the segment, I can do something like this:
      String segName = "20100817162607";
      String url = "http://adomain.com/awebappOfInterest/someContent.do";

      HitDetails detail = new HitDetails(segName, url);
      Configuration conf = NutchConfiguration.create();

      NutchBean bean = new NutchBean(conf);

      byte[] contentBytes = bean.getContent(detail);
      for (byte b : contentBytes)
      {
         System.out.print((char)b);
      }

My question is: given a known URL, how can I find which segment it is in? Is there something in the API that takes a URL and returns the name of the segment it is found in?

regards,
-henry
hnoerdlinger@infonow.com

InfoNow Corporation  |  This communication, including attachments, is for the exclusive use of addressee and may contain proprietary, confidential or privileged information.

RE: find segment for an url

Posted by Henry Noerdlinger <hn...@infonow.com>.
Thank you,

That is what I had already begun to look into. At some point in my process, a configuration will need to be set that defines a search identifying a specific set of URLs to get content from.
I was planning on creating some code to sift through readlinkdb results.

My solution is now like this:

      //this is a pattern which identifies a set of urls
      String term = "partnerdetails";
      String field = "url";
      Configuration conf = NutchConfiguration.create();
      NutchBean bean = new NutchBean(conf);
      Query query = new Query(conf);
      query.addRequiredTerm(term,field);
      Hits hits = bean.search(query);
      for (int i = 0; i < hits.getLength(); i++)
      {
         Hit hit = hits.getHit(i);
         HitDetails detail = bean.getDetails(hit);
         byte[] contentBytes = bean.getContent(detail);
         StringBuilder content = new StringBuilder();
         for (byte b : contentBytes)
         {
            content.append((char)b);
         }
      }
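
A side note on the byte loop above: casting each byte to a char only round-trips plain ASCII. Any multi-byte character in the page is split into several garbage chars. The sketch below is self-contained and does not use Nutch at all. The class name ContentDecoding and both method names are made up for illustration, and UTF-8 is an assumption; the real charset of a fetched page may differ and would have to be taken from the page or its headers.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Nutch.
class ContentDecoding {

    // Same approach as the loop above: widen each byte to a char.
    // Bytes >= 0x80 are negative in Java, so they sign-extend to
    // chars in the 0xFF80..0xFFFF range instead of the intended text.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Decode the whole array in one step with an explicit charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" contains one non-ASCII character (e with acute accent),
        // which UTF-8 encodes as two bytes.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String decoded = decode(raw, StandardCharsets.UTF_8);
        String casted = castEachByte(raw);

        System.out.println("decoded matches original: " + decoded.equals("caf\u00e9"));
        System.out.println("casted matches original:  " + casted.equals("caf\u00e9"));
    }
}
```

In the loop above, the equivalent change would be to build the string from contentBytes in one call with a charset, assuming the charset of the stored content is known.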


Re: find segment for an url

Posted by CatOs Mandros <ca...@gmail.com>.
If I were you I would use Luke ( http://code.google.com/p/luke/ ) to
examine what data you have in your indexes, if you're using Lucene
indexes :)


RE: find segment for an url

Posted by Henry Noerdlinger <hn...@infonow.com>.
Thank you for the response.

I ran a simple test where I constructed a QueryParams object with a field / value of "url" and "http://blahblah.com/",
then set it on a Query object and passed that to my beloved NutchBean to search, like this:
      String urlVal = "http://domain.com/webapp/content.do";
      QueryParams qp = new QueryParams();
      qp.put("url", urlVal);
      Configuration conf = NutchConfiguration.create();
      NutchBean bean = new NutchBean(conf);
      Query query = new Query(conf);
      query.setParams(qp);
      Hits hits = bean.search(query);

I didn't get anything back.


Is there someone who can give me a quick example of how this could be done?




Re: find segment for an url

Posted by CatOs Mandros <ca...@gmail.com>.
Hi Henry,

If I'm not mistaken, the correct way to handle this is to query your
index. It should have the information about which segment the URL is
located in. Then you should only have to run your code on the segment
returned to get the content.
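
The two steps described above (ask the index for the segment, then read from that segment only) can be sketched with a toy in-memory model. This is not the Nutch API: the class SegmentLookupSketch, its methods, and the two maps are hypothetical stand-ins for the index and the segment storage, used only to show the shape of the lookup.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model only; none of these names exist in Nutch.
class SegmentLookupSketch {

    // Stand-in for the search index: URL -> name of the segment holding it.
    private final Map<String, String> urlToSegment = new HashMap<>();

    // Stand-in for segment storage: segment name -> (URL -> raw content).
    private final Map<String, Map<String, byte[]>> segments = new HashMap<>();

    // Record a fetched page in a segment and index its URL.
    void store(String segment, String url, byte[] content) {
        segments.computeIfAbsent(segment, s -> new HashMap<>()).put(url, content);
        urlToSegment.put(url, segment);
    }

    // Step 1: ask the index which segment holds the URL.
    Optional<String> findSegment(String url) {
        return Optional.ofNullable(urlToSegment.get(url));
    }

    // Step 2: read the content from that one segment.
    Optional<byte[]> getContent(String url) {
        return findSegment(url).map(segment -> segments.get(segment).get(url));
    }

    public static void main(String[] args) {
        SegmentLookupSketch sketch = new SegmentLookupSketch();
        String url = "http://adomain.com/awebappOfInterest/someContent.do";
        sketch.store("20100817162607", url, "<html/>".getBytes(StandardCharsets.UTF_8));

        System.out.println("segment: " + sketch.findSegment(url).orElse("(not indexed)"));
        System.out.println("bytes:   " + sketch.getContent(url).map(b -> b.length).orElse(0));
    }
}
```

With a real index, step 1 would be a search whose hit details carry the segment name, which is then enough to fetch the content without scanning every segment.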

