Posted to user@nutch.apache.org by mina <ta...@gmail.com> on 2012/02/01 14:12:40 UTC

Bad Request in nutch when i use parsechecker?

parsechecker

--
View this message in context: http://lucene.472066.n3.nabble.com/Bad-Request-in-nutch-when-i-use-parsechecker-tp3706524p3706524.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Bad Request in nutch when i use parsechecker?

Posted by mina <ta...@gmail.com>.
Thanks for your answer. I looked at this page, but it only offers a patch:
patch-with-utf8-encoding.diff
<https://issues.apache.org/jira/secure/attachment/12502393/patch-with-utf8-encoding.diff>
How do I apply this patch to Nutch? Do I need the Nutch source? Please help me.




Re: Bad Request in nutch when i use parsechecker?

Posted by Markus Jelsma <ma...@openindex.io>.
Nutch cannot do this right now. However, there's a patch that does the 
encoding.

https://issues.apache.org/jira/browse/NUTCH-1098
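
Until a patch such as the one above is applied, one possible workaround is to percent-encode the URL outside Nutch and pass the result to parsechecker. The following is only a sketch under that assumption: `encode_url` and `build_parsechecker_command` are hypothetical helper names, not part of Nutch, and only the path and query are encoded (a non-ASCII host name would need separate handling).

```python
import subprocess
import sys
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode non-ASCII characters in the path and query as UTF-8."""
    parts = urlsplit(url)
    # "%" stays in the safe set so already-encoded URLs are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


def build_parsechecker_command(url, nutch="bin/nutch"):
    """Return the argument list for parsechecker with an encoded URL."""
    return [nutch, "parsechecker", "-dumpText", encode_url(url)]


if __name__ == "__main__":
    # Hypothetical usage: python encode_and_check.py <url> [<url> ...]
    for raw_url in sys.argv[1:]:
        subprocess.run(build_parsechecker_command(raw_url), check=False)
```

Building the command as a list rather than a single string avoids shell quoting problems with the non-ASCII characters in the raw URL.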



-- 
Markus Jelsma - CTO - Openindex

Re: Bad Request in nutch when i use parsechecker?

Posted by mina <ta...@gmail.com>.
How can I force Nutch to encode this URL? I want to give Nutch the URL as-is
and have Nutch encode it, so that Nutch handles this task itself. I want Nutch to:
 1. get the URL
then
 2. encode it
Which command encodes a URL?




Re: Bad Request in nutch when i use parsechecker?

Posted by Markus Jelsma <ma...@openindex.io>.
bin/nutch parsechecker 
http://www.irna.ir/News/30786427/%D8%B3%D9%88%D8%A1-%D8%A7%D8%B3%D8%AA%D9%81%D8%A7%D8%AF%D9%87-%D8%A7%D8%B2-%D9%86%D8%A7%D9%85-%D9%83%D9%85%DB%8C%D8%AA%D9%87-%D8%A7%D9%85%D8%AF%D8%A7%D8%AF-%D8%A8%D8%B1%D8%A7%DB%8C-%D8%AC%D9%85%D8%B9-%D8%A2%D9%88%D8%B1%DB%8C-%D8%B1%D8%A7%DB%8C-%D8%AF%D8%B1-%D9%85%D9%86%D8%A7%D8%B7%D9%82-%D9%85%D8%AD%D8%B1%D9%88%D9%85/%D8%B3%D9%8A%D8%A7%D8%B3%D9%8A/

encoding, encoding, encoding
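
To illustrate what "encoding" means in the command above: each non-ASCII character in the path is written as its UTF-8 bytes, with every byte escaped as %XX. A minimal sketch of that rule for a single path segment follows; `percent_encode_segment` is a hypothetical name used for illustration only, not a Nutch function.

```python
# Characters that may appear unescaped in a URL (RFC 3986 "unreserved" set).
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode_segment(segment):
    """Percent-encode one raw (not yet encoded) path segment as UTF-8."""
    out = []
    for ch in segment:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Escape every UTF-8 byte of the character as %XX (uppercase hex).
            out.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    # The last path segment of the URL in this thread, written with escapes.
    last_segment = "\u0633\u064a\u0627\u0633\u064a"
    print(percent_encode_segment(last_segment))
    # prints %D8%B3%D9%8A%D8%A7%D8%B3%D9%8A
```

The printed value matches the final segment of the encoded URL in the command above. Note that this sketch also escapes "%" itself, so it should only be applied to a segment that has not been encoded yet.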




-- 
Markus Jelsma - CTO - Openindex

Re: Bad Request in nutch when i use parsechecker?

Posted by mina <ta...@gmail.com>.
Hi, I use this command:

bin/nutch parsechecker -dumpText
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/ 
 
and I see this log:

fetching:
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/
parsing:
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/
contentType: text/html
---------
Url
---------------
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Bad Request
Outlinks: 0
Content Metadata: Date=Wed, 01 Feb 2012 10:04:04 GMT Content-Length=324
Connection=close Content-Type=text/html; charset=us-ascii
Server=Microsoft-HTTPAPI/2.0
Parse Metadata: CharEncodingForConversion=us-ascii
OriginalCharEncoding=us-ascii
---------
ParseText
---------
Bad Request Bad Request - Invalid URL HTTP Error 400. The request URL is
invalid.



I get a Bad Request. Why? How do I fix this error?
