Encoding issues in email messages

I have a small Python script that pulls emails from a POP mailbox and dumps them to files (one file per email).

A PHP script then reads the files and displays them.

I have a problem with ISO-8859-1 (Latin-1) encoded text.

Here's an example of the text I got: =?iso-8859-1?Q?G=EDsli_Karlsson?= and Sj=E1um hva=F0 =F3li er kl=E1r J

This is the code I use to pull the emails:

import poplib, random, rfc822, StringIO

pop = poplib.POP3(server)

mail_list = pop.list()[1]

for m in mail_list:
    mno, size = m.split()
    lines = pop.retr(mno)[1]

    file = StringIO.StringIO("\r\n".join(lines))
    msg = rfc822.Message(file)

    body = file.readlines()

    f = open(str(random.randint(1,100)) + ".email", "w")
    f.write(msg["From"] + "\n")
    f.write(msg["Subject"] + "\n")
    f.write(msg["Date"] + "\n")

    for b in body:
        f.write(b)

    f.close()

      

I have probably tried every encoding/decoding combination in both Python and PHP.



5 answers


You can use the Python email library (Python 2.5+) to avoid these problems:

import email
import poplib
import random
from cStringIO import StringIO
from email.generator import Generator

pop = poplib.POP3(server)

mail_count = len(pop.list()[1])

# POP3 message numbers start at 1, not 0
for message_num in xrange(1, mail_count + 1):
    message = "\r\n".join(pop.retr(message_num)[1])
    message = email.message_from_string(message)

    out_file = StringIO()
    message_gen = Generator(out_file, mangle_from_=False, maxheaderlen=60)
    message_gen.flatten(message)
    message_text = out_file.getvalue()

    filename = "%s.email" % random.randint(1,100)
    email_file = open(filename, "w")
    email_file.write(message_text)
    email_file.close()

      



This code gets all the messages from your server, turns them into Python message objects, and then flattens them back into strings for writing to a file. By using the email package from the Python standard library, MIME encoding and decoding issues should be handled for you.

DISCLAIMER: I have not tested this code, but it should work fine.
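For readers on Python 3, the same parse-and-flatten round trip can be sketched without a mail server. This is only a minimal sketch: `RAW_MESSAGE` and `flatten_message` are hypothetical names, and the sample text stands in for what `"\r\n".join(pop.retr(n)[1])` would return. Flattening re-serializes the message, so the encoded word in the From header should come back out still encoded.

```python
import email
from io import StringIO
from email.generator import Generator

# Hypothetical raw message, standing in for the text fetched over POP3
RAW_MESSAGE = (
    "From: =?iso-8859-1?Q?G=EDsli_Karlsson?= <g@example.com>\r\n"
    "Subject: Test\r\n"
    "Date: Mon, 01 Jan 2001 00:00:00 +0000\r\n"
    "\r\n"
    "Hello\r\n"
)


def flatten_message(raw_text):
    """Parse raw message text and serialize it back to a string."""
    message = email.message_from_string(raw_text)
    out_file = StringIO()
    generator = Generator(out_file, mangle_from_=False, maxheaderlen=60)
    generator.flatten(message)
    return out_file.getvalue()


flattened = flatten_message(RAW_MESSAGE)
print(flattened)
```

If the goal is readable text in the output file, the headers would still need a separate decoding step after this.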



That is the MIME encoding of headers, RFC 2047. Here's how to decode it in Python:

import email.Header
import sys

header_and_encoding = email.Header.decode_header(sys.stdin.readline())
for part in header_and_encoding:
    if part[1] is None:
        print part[0],
    else:
        upart = (part[0]).decode(part[1])
        print upart.encode('latin-1'),
print

      



More detailed explanations (in French) at http://www.bortzmeyer.org/decoder-en-tetes-courrier.html
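In Python 3 the same decoding could look like the sketch below. `decode_mime_header` is a hypothetical helper name; `decode_header` and `make_header` come from the standard `email.header` module. The sample header is the one from the question.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw_header):
    """Return the header as text, with RFC 2047 encoded words decoded."""
    return str(make_header(decode_header(raw_header)))


name_only = decode_mime_header("=?iso-8859-1?Q?G=EDsli_Karlsson?=")
with_address = decode_mime_header(
    "=?iso-8859-1?Q?G=EDsli_Karlsson?= <gisli@example.com>"
)
print(name_only)
print(with_address)
```

The result is a normal `str`, so it can be written out in whatever encoding the PHP side expects (UTF-8, for example).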



There is probably a better way to do this, but this is what I ended up with. Thanks for the help, guys.

import poplib, quopri
import random
import sys, StringIO
import email
from email.Generator import Generator

user = "email@example.com"
password = "password"
server = "mail.example.com"

# connects
try:
    pop = poplib.POP3(server)
except:
    print "Error connecting to server"
    sys.exit(-1)

# user auth
try:
    print pop.user(user)
    print pop.pass_(password)
except:
    print "Authentication error"
    sys.exit(-2)

# gets the mail list
mail_list = pop.list()[1]

for m in mail_list:
    mno, size = m.split()
    message = "\r\n".join(pop.retr(mno)[1])
    message = email.message_from_string(message)

    # uses the email flatten
    out_file = StringIO.StringIO()
    message_gen = Generator(out_file, mangle_from_=False, maxheaderlen=60)
    message_gen.flatten(message)
    message_text = out_file.getvalue()

    # fixes mime encoding issues (for display within html)
    clean_text = quopri.decodestring(message_text)

    msg = email.message_from_string(clean_text)

    # keeps the last non-multipart part (in MIME multipart, HTML is usually the last one)
    for part in msg.walk():
        if not part.is_multipart():
            body = part.get_payload(decode=True)

    filename = "%s.email" % random.randint(1,100)

    email_file = open(filename, "w")

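    # msg.get() falls back to "" so a missing header does not raise a TypeError
    email_file.write(msg.get("From", "") + "\n")
    email_file.write(msg.get("Return-Path", "") + "\n")
    email_file.write(msg.get("Subject", "") + "\n")
    email_file.write(msg.get("Date", "") + "\n")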
    email_file.write(body)

    email_file.close()

pop.quit()
sys.exit()
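A possible Python 3 variant of the body-extraction step is sketched below. Instead of running quopri over the whole message, it lets `get_payload(decode=True)` undo the transfer encoding of each part and then decodes the bytes with the charset that part declares. `last_body_text` and the demo message are hypothetical, and falling back to ASCII when no charset is declared is an assumption.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def last_body_text(msg):
    """Return the text of the last non-multipart part of a message."""
    body = None
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts carry no payload of their own
        # decode=True undoes quoted-printable or base64 for this part
        payload = part.get_payload(decode=True)
        # Assumed fallback when the part declares no charset
        charset = part.get_content_charset() or "ascii"
        body = payload.decode(charset, errors="replace")
    return body


# Hypothetical message with a plain-text and an HTML alternative
demo = MIMEMultipart("alternative")
demo.attach(MIMEText("Sjáum hvað", "plain", "iso-8859-1"))
demo.attach(MIMEText("<p>Sjáum hvað</p>", "html", "iso-8859-1"))

parsed = email.message_from_string(demo.as_string())
html_body = last_body_text(parsed)
print(html_body)
```

Decoding part by part avoids touching headers or base64 parts that merely happen to contain "=" characters.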

      



Until very recently, plain Latin-N or UTF-N characters were not allowed in headers, so they are encoded using a method first described in RFC 1522, which has since been superseded (by RFC 2047). Accented characters are encoded either in quoted-printable or in Base64, which is indicated by ?Q? (or ?B? for Base64). You will have to decode them. Oh, and a space is encoded as "_". See Wikipedia.
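To make the ?Q? / ?B? distinction concrete, here is a rough Python 3 sketch that decodes a single encoded word by hand. `ENCODED_WORD` and `decode_encoded_word` are hypothetical names, and the regular expression is a simplification of the RFC 2047 grammar; in practice the standard `email.header` module is the safer choice.

```python
import base64
import quopri
import re

# Simplified pattern for one encoded word: =?charset?encoding?text?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")


def decode_encoded_word(word):
    """Decode a single encoded word such as =?iso-8859-1?Q?G=EDsli?=."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        return word  # not an encoded word, return it unchanged
    charset, encoding, text = match.groups()
    raw_text = text.encode("ascii")
    if encoding.upper() == "B":
        raw = base64.b64decode(raw_text)
    else:
        # header=True makes quopri treat "_" as an encoded space
        raw = quopri.decodestring(raw_text, header=True)
    return raw.decode(charset)


q_result = decode_encoded_word("=?iso-8859-1?Q?G=EDsli_Karlsson?=")
b_result = decode_encoded_word("=?utf-8?B?SGVsbG8=?=")
print(q_result)
print(b_result)
```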



That is MIME content, and it is how the email actually looks, not a bug somewhere. You have to use a MIME decoding library (or decode it manually yourself) on the PHP side (which, if I understood correctly, is the one acting as the email renderer).

In Python you would use mimetools. In PHP, I'm not sure. The Zend Framework seems to have a MIME parser somewhere, and there are probably zillions of snippets floating around.

http://en.wikipedia.org/wiki/MIME#Encoded-Word
