Posted to dev@ponymail.apache.org by Gav <gm...@apache.org> on 2016/07/13 02:53:20 UTC

Error importing mbox files

Hi,

When running :-

python3 import-mbox.py --source https://mail-archives.apache.org/mod_mbox/ --mod-mbox --project httpd

All lists seem to be slurped OK, but the output ends with:-

...
Parsed httpd-announce/201511.mbox: 9 records from 310ede37d0739990f3e1338778ae6f2e5b31117916caa2f1469651a8
2015 elements left to slurp
Slurping httpd-announce/201510.mbox
Found attachment: Notice_to_Appear_00000681680.zip
Found attachment: Court_Notification_00000155647.zip
Found attachment: 00174586.zip
Date seems totally wrong, setting to _now_ instead.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "import-mbox.py", line 263, in run
    json, contents = foo.compute_updates(list_override, private, message)
  File "/var/www/incubator-ponymail/tools/archiver.py", line 274, in compute_updates
    mdatestring = time.strftime("%Y/%m/%d %H:%M:%S", time.gmtime(email.utils.mktime_tz(mdate)))
  File "/usr/lib/python3.5/email/_parseaddr.py", line 185, in mktime_tz
    if data[9] is None:
IndexError: tuple index out of range

All done! 54 records inserted/updated after 67 seconds. 0 records were bad and ignored


Any ideas?

FYI I set the VM TZ to UTC with no effect.

Gav...

...

Re: Error importing mbox files

Posted by Gav <gm...@apache.org>.
Next error :(

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/email/_encoded_words.py", line 109, in decode_b
    return base64.b64decode(padded_encoded, validate=True), defects
  File "/usr/lib/python3.5/base64.py", line 87, in b64decode
    raise binascii.Error('Non-base64 digit found')
binascii.Error: Non-base64 digit found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "import-mbox.py", line 263, in run
    json, contents = foo.compute_updates(list_override, private, message)
  File "/var/www/incubator-ponymail/tools/archiver.py", line 276, in compute_updates
    body = self.msgbody(msg)
  File "/var/www/incubator-ponymail/tools/archiver.py", line 211, in msgbody
    body = msg.get_payload(decode=True)
  File "/usr/lib/python3.5/email/message.py", line 287, in get_payload
    value, defects = decode_b(b''.join(bpayload.splitlines()))
  File "/usr/lib/python3.5/email/_encoded_words.py", line 124, in decode_b
    raise AssertionError("unexpected binascii.Error")
AssertionError: unexpected binascii.Error

All done! 67411 records inserted/updated after 4439 seconds. 183 records were bad and ignored


Gav...
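
The second traceback comes from msg.get_payload(decode=True) raising on a base64 body that contains invalid characters (under Python 3.5, going by the paths shown). One possible way to keep an import running is to catch the error and fall back to the undecoded payload. This is only a sketch, not the project's fix: safe_payload is a hypothetical helper name, and falling back to the raw text is an assumption about what is acceptable for such messages.

```python
import binascii
import email


def safe_payload(msg):
    """Return the message body as bytes, tolerating broken base64.

    Hypothetical helper: if decoding fails, the undecoded payload is
    returned instead of letting the exception kill the worker thread.
    """
    try:
        body = msg.get_payload(decode=True)
    except (AssertionError, binascii.Error):
        # Python 3.5 surfaces bad base64 as AssertionError (see traceback).
        body = None
    if body is None:
        raw = msg.get_payload(decode=False)
        # Multipart messages return a list here; treat that as "no body".
        if isinstance(raw, str):
            body = raw.encode("utf-8", errors="replace")
        else:
            body = b""
    return body


# Usage with a well-formed base64 body.
msg = email.message_from_string(
    "Content-Transfer-Encoding: base64\n\naGVsbG8=\n"
)
body = safe_payload(msg)
```

Newer Python versions may record a defect on the message instead of raising, in which case the except branch is simply never taken.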


On Wed, Jul 13, 2016 at 8:23 PM, Gav <gm...@apache.org> wrote:

> On Wed, Jul 13, 2016 at 7:36 PM, Daniel Gruno <hu...@apache.org> wrote:
>
>> I've pushed a change to master which appends a 10th element to the
>> 9-tuple, making the date formatter work again. Please try using archiver.py
>> from master and report back if that fixes your issue :)
>
> thanks, running it now, will report back.
>
> Gav...

Re: Error importing mbox files

Posted by Gav <gm...@apache.org>.
On Wed, Jul 13, 2016 at 7:36 PM, Daniel Gruno <hu...@apache.org> wrote:

>
> On 2016-07-13 04:53 (+0200), Gav <gm...@apache.org> wrote:
> > Hi,
> >
> > When running :-
> >
> > python3 import-mbox.py --source
> https://mail-archives.apache.org/mod_mbox/
> > --mod-mbox --project httpd
> >
> > All lists seem to be slurped OK, but the output ends with:-
> >
> > ...
> > Parsed httpd-announce/201511.mbox: 9 records from
> > 310ede37d0739990f3e1338778ae6f2e5b31117916caa2f1469651a8
> > 2015 elements left to slurp
> > Slurping httpd-announce/201510.mbox
> > Found attachment: Notice_to_Appear_00000681680.zip
> > Found attachment: Court_Notification_00000155647.zip
> > Found attachment: 00174586.zip
> > Date seems totally wrong, setting to _now_ instead.
> > Exception in thread Thread-1:
> > Traceback (most recent call last):
> >   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
> >     self.run()
> >   File "import-mbox.py", line 263, in run
> >     json, contents = foo.compute_updates(list_override, private, message)
> >   File "/var/www/incubator-ponymail/tools/archiver.py", line 274, in
> > compute_updates
> >     mdatestring = time.strftime("%Y/%m/%d %H:%M:%S",
> > time.gmtime(email.utils.mktime_tz(mdate)))
> >   File "/usr/lib/python3.5/email/_parseaddr.py", line 185, in mktime_tz
> >     if data[9] is None:
> > IndexError: tuple index out of range
>
> This appears to be something that happens when emails violate the RFC
> (missing Date header). Pony Mail then tries to fake it, but ends up with a
> 9-tuple instead of a 10-tuple, which mktime_tz requires.
>
> I've pushed a change to master which appends a 10th element to the
> 9-tuple, making the date formatter work again. Please try using archiver.py
> from master and report back if that fixes your issue :)
>

thanks, running it now, will report back.

Gav...


>
> Ideally, emails should follow protocol, but these specific ones (which
> appear to be scam/malware) certainly don't.
>
> >
> > All done! 54 records inserted/updated after 67 seconds. 0 records were
> bad
> > and ignored
> >
> >
> > Any ideas?
> >
> > FYI I set the VM TZ to UTC with no effect.
> >
> > Gav...
> >
> > ...
> >
>

Re: Error importing mbox files

Posted by Daniel Gruno <hu...@apache.org>.
On 2016-07-13 04:53 (+0200), Gav <gm...@apache.org> wrote: 
> Hi,
> 
> When running :-
> 
> python3 import-mbox.py --source https://mail-archives.apache.org/mod_mbox/
> --mod-mbox --project httpd
> 
> All lists seem to be slurped OK, but the output ends with:-
> 
> ...
> Parsed httpd-announce/201511.mbox: 9 records from
> 310ede37d0739990f3e1338778ae6f2e5b31117916caa2f1469651a8
> 2015 elements left to slurp
> Slurping httpd-announce/201510.mbox
> Found attachment: Notice_to_Appear_00000681680.zip
> Found attachment: Court_Notification_00000155647.zip
> Found attachment: 00174586.zip
> Date seems totally wrong, setting to _now_ instead.
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
>     self.run()
>   File "import-mbox.py", line 263, in run
>     json, contents = foo.compute_updates(list_override, private, message)
>   File "/var/www/incubator-ponymail/tools/archiver.py", line 274, in
> compute_updates
>     mdatestring = time.strftime("%Y/%m/%d %H:%M:%S",
> time.gmtime(email.utils.mktime_tz(mdate)))
>   File "/usr/lib/python3.5/email/_parseaddr.py", line 185, in mktime_tz
>     if data[9] is None:
> IndexError: tuple index out of range

This appears to be something that happens when emails violate the RFC (missing Date header). Pony Mail then tries to fake it, but ends up with a 9-tuple instead of a 10-tuple, which mktime_tz requires.

I've pushed a change to master which appends a 10th element to the 9-tuple, making the date formatter work again. Please try using archiver.py from master and report back if that fixes your issue :)

Ideally, emails should follow protocol, but these specific ones (which appear to be scam/malware) certainly don't.
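
For illustration, the 9-tuple versus 10-tuple mismatch described above can be reproduced and padded as in the sketch below. This is not the actual change pushed to master: pad_date_tuple and safe_mktime_tz are hypothetical helper names, and using 0 (UTC) as the appended offset is an assumption.

```python
import email.utils
import time


def pad_date_tuple(mdate):
    """Return a 10-tuple suitable for email.utils.mktime_tz().

    Hypothetical helper: a 9-element time tuple (for example the result
    of time.gmtime()) gets a 10th element, the UTC offset in seconds,
    which is assumed to be 0 here.
    """
    mdate = tuple(mdate)
    if len(mdate) == 9:
        mdate += (0,)
    return mdate


def safe_mktime_tz(mdate):
    """Convert a 9- or 10-element date tuple to epoch seconds."""
    return email.utils.mktime_tz(pad_date_tuple(mdate))


# A fallback date built with time.gmtime() has only 9 elements,
# whereas email.utils.parsedate_tz() returns 10.
fallback = time.gmtime(0)
epoch = safe_mktime_tz(fallback)
mdatestring = time.strftime("%Y/%m/%d %H:%M:%S", time.gmtime(epoch))
```

Tuples that already have 10 elements, such as those returned by email.utils.parsedate_tz(), pass through unchanged, so the helper can wrap both the normal and the fallback path.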

> 
> All done! 54 records inserted/updated after 67 seconds. 0 records were bad
> and ignored
> 
> 
> Any ideas?
> 
> FYI I set the VM TZ to UTC with no effect.
> 
> Gav...
> 
> ...
>