
Extract the HTML from an email in mbox format

Posted on 2017-08-07 in Tips and tricks

Here is a small script that extracts the HTML from emails stored in the mbox format. For the script to work, the HTML code must be in the body of the message (recommended) or in its first attachment.

By default, the script converts every .mbox file in the current folder. You can also pass the path to another folder as the first argument. The HTML is saved in the same folder as the original file, in a file with the same name and the .html extension.

#!/usr/bin/env python3

import sys
from glob import glob
from mailbox import mboxMessage
from os.path import join, splitext


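# Folder to scan: the current one unless a path is given as first argument.
folder = '.'
if len(sys.argv) > 1:
    folder = sys.argv[1]

for mbox_path in glob(join(folder, '*.mbox')):
    # Each file is expected to contain a single message.
    with open(mbox_path, 'r') as mail_file:
        message = mboxMessage(mail_file.read())

    content = message.get_payload()
    if isinstance(content, str):
        # Single-part message: the body itself is the HTML.
        html = message.get_payload(decode=True).decode('utf-8')
    else:
        # Multipart message: the HTML is taken from the first part.
        html = content[0].get_payload(decode=True).decode('utf-8')
        # The output is written as UTF-8, so fix the charset declared in the HTML.
        html = html.replace('charset=iso-8859-1', 'charset=utf-8')

    # Save next to the original file, with the .html extension.
    file_name, _ = splitext(mbox_path)
    with open(f'{file_name}.html', 'w', encoding='utf-8') as html_file:
        html_file.write(html)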
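
The script assumes that the HTML sits in the body or in the first part of the message, and that it is encoded as UTF-8. If your mails do not match that layout, a possible variation is to walk over every part and pick the first one declared as text/html, decoding it with the charset the part announces. This is only a sketch: find_html is a hypothetical helper that is not part of the original script, and the fallback to UTF-8 when no charset is declared is an assumption.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def find_html(message):
    """Return the first text/html part of message as a str, or None."""
    # walk() yields the message itself, then every nested part in order.
    for part in message.walk():
        if part.get_content_type() != 'text/html':
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Use the charset declared by the part; UTF-8 is an assumed fallback.
        charset = part.get_content_charset() or 'utf-8'
        return payload.decode(charset, errors='replace')
    return None


# Build a small multipart message to try the helper on.
sample = MIMEMultipart('alternative')
sample.attach(MIMEText('plain version', 'plain', 'utf-8'))
sample.attach(MIMEText('<p>café</p>', 'html', 'iso-8859-1'))

print(find_html(sample))
```

Since mboxMessage is a subclass of email.message.Message, the same helper can be called on the message object built in the script above.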

You can also find it on GitHub: https://github.com/Jenselme/dot-files-shell/blob/master/bin/extract-html-email.py