Posted to user@manifoldcf.apache.org by M Kelleher <mj...@gmail.com> on 2011/12/02 17:18:44 UTC

Export crawled URLs

Is it possible to export / download the list of URLs visited during a crawl job?

Sent from my iPad

Re: Export crawled URLs

Posted by Karl Wright <da...@gmail.com>.
All of it is welcome, although obviously we will not be able to commit
it without making further changes.

The way patches are submitted is by attaching them to JIRA tickets.
If you don't already have an account with Apache's JIRA
(https://issues.apache.org/jira), then just create one.  I'd suggest
that you create two ManifoldCF issues, one about MySQL support, and
one about localization, and attach your changes.  Be sure to click the
"grant license to ASF" button when you are attaching the files so we
can legally use them in the product.

Thanks!
Karl

On Mon, Dec 5, 2011 at 9:50 PM, Hitoshi Ozawa
<Oz...@ogis-ri.co.jp> wrote:
> I had to modify a few other files besides DBInterfaceMySQL to make it work
> because of the differences in SQL functionality. I'm also modifying the
> jsp files to make them I18N-ready and adding Japanese messages as well. I
> may also add paging and search features.
> Is it possible to obtain svn privileges so I can upload the files? I'll
> probably be making some modifications to better support MySQL.
>
> Regards,
> H.Ozawa
>

Re: Export crawled URLs

Posted by Hitoshi Ozawa <Oz...@ogis-ri.co.jp>.
I had to modify a few other files besides DBInterfaceMySQL to make it work
because of the differences in SQL functionality. I'm also modifying the
jsp files to make them I18N-ready and adding Japanese messages as well. I
may also add paging and search features.
Is it possible to obtain svn privileges so I can upload the files? I'll
probably be making some modifications to better support MySQL.

Regards,
H.Ozawa

(2011/12/05 19:22), Karl Wright wrote:
> If you've updated the DBInterfaceMySQL driver, any chance you would be
> willing to contribute it back to the project?
>
> Karl



Re: Export crawled URLs

Posted by Karl Wright <da...@gmail.com>.
If you've updated the DBInterfaceMySQL driver, any chance you would be
willing to contribute it back to the project?

Karl


On Sun, Dec 4, 2011 at 11:13 PM, Hitoshi Ozawa
<Oz...@ogis-ri.co.jp> wrote:
> "The interpretation of this field will differ from connector to connector".
> From the above description, it seems the content of entityid depends on
> which connector is being used to crawl the web pages.
> You're right about the second point on the entityid column datatype. In
> MySQL, which I'm using with ManifoldCF, the datatype of entityid is
> LONGTEXT. I was just using it figuratively, even though I just found out
> that I can actually execute the SQL statement. :-)
>
> Cheers,
> H.Ozawa

Re: Export crawled URLs

Posted by Hitoshi Ozawa <Oz...@ogis-ri.co.jp>.
"The interpretation of this field will differ from connector to connector".
From the above description, it seems the content of entityid depends on
which connector is being used to crawl the web pages.
You're right about the second point on the entityid column datatype. In
MySQL, which I'm using with ManifoldCF, the datatype of entityid is
LONGTEXT. I was just using it figuratively, even though I just found out
that I can actually execute the SQL statement. :-)

Cheers,
H.Ozawa

(2011/12/05 10:29), Karl Wright wrote:
> Well, the history comes from the repohistory table, yes - but you may
> not be able to construct a query with entityid=jobs.id, first of all
> because that is incorrect (what the entity field contains is dependent
>   on the activity type), and secondly because that column is
> potentially long and only some kinds of queries can be done against
> it.  Specifically it cannot be built into an index on PostgreSQL.
>
> Karl



Re: Export crawled URLs

Posted by Karl Wright <da...@gmail.com>.
Well, the history comes from the repohistory table, yes - but you may
not be able to construct a query with entityid=jobs.id, first of all
because that is incorrect (what the entity field contains is dependent
on the activity type), and secondly because that column is
potentially long and only some kinds of queries can be done against
it.  Specifically it cannot be built into an index on PostgreSQL.

Karl
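
The point above suggests the workable pattern: filter repohistory on the
activity type rather than trying to join entityid against jobs.id. Here
is a minimal, hypothetical sketch using an in-memory SQLite stand-in; the
column names (entityid, activitytype, starttime) are assumptions modeled
on this discussion, not the actual ManifoldCF schema, so check your own
database before adapting it.

```python
import sqlite3

# Stand-in table: the real repohistory schema may differ from this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE repohistory (
        entityid     TEXT,     -- meaning is connector-specific; NOT a job id
        activitytype TEXT,     -- e.g. 'fetch' for the web connector (assumed)
        starttime    INTEGER   -- epoch seconds (assumed)
    )
    """
)
conn.executemany(
    "INSERT INTO repohistory VALUES (?, ?, ?)",
    [
        ("http://example.com/a", "fetch", 1322900000),
        ("http://example.com/b", "fetch", 1322900100),
        ("file:///tmp/doc.txt", "read document", 1322900200),
    ],
)

# Filter on a short column (activitytype) instead of entityid, since
# entityid is potentially long and cannot be indexed on PostgreSQL.
urls = [
    row[0]
    for row in conn.execute(
        "SELECT entityid FROM repohistory"
        " WHERE activitytype = ? ORDER BY starttime",
        ("fetch",),
    )
]
print(urls)
```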

On Sun, Dec 4, 2011 at 7:50 PM, Hitoshi Ozawa
<Oz...@ogis-ri.co.jp> wrote:
> Is "history" just entries in the "repohistory" table with entityid =
> jobs.id?
>
> H.Ozawa

Re: Export crawled URLs

Posted by Hitoshi Ozawa <Oz...@ogis-ri.co.jp>.
Is "history" just entries in the "repohistory" table with entityid =
jobs.id?

H.Ozawa

(2011/12/03 1:43), Karl Wright wrote:
> The best place to get this from is the simple history.  A command-line
> utility to dump this information to a text file should be possible
> with the currently available interface primitives.  If that is how you
> want to go, you will need to run ManifoldCF in multiprocess mode.
> Alternatively you might want to request the info from the API, but
> that's problematic because nobody has implemented report support in
> the API as of now.
>
> A final alternative is to get this from the log.  There is an [INFO]
> level line from the web connector for every fetch, I seem to recall,
> and you might be able to use that.
>
> Thanks,
> Karl



Re: Export crawled URLs

Posted by Karl Wright <da...@gmail.com>.
The best place to get this from is the simple history.  A command-line
utility to dump this information to a text file should be possible
with the currently available interface primitives.  If that is how you
want to go, you will need to run ManifoldCF in multiprocess mode.
Alternatively you might want to request the info from the API, but
that's problematic because nobody has implemented report support in
the API as of now.

A final alternative is to get this from the log.  There is an [INFO]
level line from the web connector for every fetch, I seem to recall,
and you might be able to use that.

Thanks,
Karl
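
If you go the log route described above, a small script can pull the URLs
out of the INFO-level fetch lines. This is only a sketch: the log line
format below is an assumption for illustration, so check the actual
manifoldcf.log output of the web connector and adjust the match
accordingly.

```python
import re

def extract_fetched_urls(lines):
    """Collect URLs from INFO-level fetch lines (line format is assumed)."""
    urls = []
    for line in lines:
        # Assumed shape: an INFO line mentioning "Fetch" plus the URL.
        if line.startswith("INFO") and "Fetch" in line:
            urls.extend(re.findall(r"https?://[^'\s]+", line))
    return urls

# Hypothetical log excerpt, not copied from a real ManifoldCF log.
sample = [
    "INFO 2011-12-02 11:18:44 WEB: Fetching 'http://example.com/index.html'",
    "DEBUG 2011-12-02 11:18:45 WEB: parsing response",
    "INFO 2011-12-02 11:18:46 WEB: Fetching 'http://example.com/about.html'",
]

print(extract_fetched_urls(sample))
```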


On Fri, Dec 2, 2011 at 11:18 AM, M Kelleher <mj...@gmail.com> wrote:
> Is it possible to export / download the list of URLs visited during a crawl job?