Cleaning email chain for text analysis python

I'm trying to parse out the individual messages within an email chain. Ultimately I'd like a list or dictionary holding the From and To of each message, along with the message body, so that I can run some analysis on it. I've tried lowercasing everything and then splitting the string:

    text = text.lower()
    text = text.translate(string.punctuation)
    text_list = text.split('+')
    text_list = [x for x in text_list if len(x) != 0]

Is there a better way to do this?

asked Aug 3, 2018 at 15:39

What is the purpose of text.split('+')? Your email does not have any + signs. Commented Aug 3, 2018 at 15:42

After I use text.translate(string.punctuation) it seems to turn my \n's into +'s. Commented Aug 3, 2018 at 15:44

3 Answers

You can use re to split the messages (an explanation of this regexp is on an external site). The result is a list of dicts with the keys 'from', 'to', 'subject' and 'message':

    text = """From: 'Mark Twain' <[email protected]>
    To: 'Edgar Allen Poe' <[email protected]>
    Subject: RE:Hello!

    Ed,

    I just read the Tell Tale Heart. You\'ve got problems man.

    Sincerely,
    Marky Mark

    From: 'Edgar Allen Poe' <[email protected]>
    To: 'Mark Twain' <[email protected]>
    Subject: RE: Hello!

    Mark,

    The world is crushing my soul, and so are you.

    Regards,
    Edgar"""

    import re
    from pprint import pprint

    groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)

    emails = []
    for g in groups:
        d = {}
        d['from'] = g[0].strip()
        d['to'] = g[1].strip()
        d['subject'] = g[2].strip()
        d['message'] = g[3].strip()
        emails.append(d)

    pprint(emails)

Prints:

    [{'from': "'Mark Twain' <[email protected]>",
      'message': 'Ed,\n'
                 '\n'
                 "I just read the Tell Tale Heart. You've got problems man.\n"
                 '\n'
                 'Sincerely,\n'
                 'Marky Mark',
      'subject': 'RE:Hello!',
      'to': "'Edgar Allen Poe' <[email protected]>"},
     {'from': "'Edgar Allen Poe' <[email protected]>",
      'message': 'Mark,\n'
                 '\n'
                 'The world is crushing my soul, and so are you.\n'
                 '\n'
                 'Regards,\n'
                 'Edgar',
      'subject': 'RE: Hello!',
      'to': "'Mark Twain' <[email protected]>"}]

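For anyone who would rather not follow the external link, here is a rough sketch of the same pattern written with re.VERBOSE so that each piece can carry a comment. The names EMAIL_RE and split_chain and the sample addresses are made up for illustration, and the sketch assumes every message starts with From:, To: and Subject: headers in that order.

```python
import re

# The same pattern as in the answer, spread over several lines.
# re.VERBOSE ignores whitespace in the pattern and allows comments.
EMAIL_RE = re.compile(r"""
    ^From:(.*?)        # group 1: sender, lazily up to the next To:
    To:(.*?)           # group 2: recipient, lazily up to Subject:
    Subject:(.*?)$     # group 3: subject, up to the end of that line
    (.*?)              # group 4: body, as short as possible ...
    (?=^From:|\Z)      # ... until the next From: line or the end of the text
""", flags=re.DOTALL | re.MULTILINE | re.VERBOSE)


def split_chain(text):
    """Return one dict per message found in an email chain."""
    keys = ("from", "to", "subject", "message")
    return [
        {key: value.strip() for key, value in zip(keys, groups)}
        for groups in EMAIL_RE.findall(text)
    ]


# Hypothetical two-message chain, only for demonstration.
sample = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Hi\n"
    "\n"
    "First body\n"
    "\n"
    "From: b@example.com\n"
    "To: a@example.com\n"
    "Subject: RE: Hi\n"
    "\n"
    "Second body\n"
)
emails = split_chain(sample)
print(emails[1]["subject"])   # RE: Hi
```

Because the body group is lazy and is followed by a lookahead, each match stops just before the next From: line without consuming it, so the next match can start there.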
answered Aug 3, 2018 at 15:48 Andrej Kesely

That's not how str.translate works. Your text.translate(string.punctuation) uses the punctuation string itself as the translation table, so it maps '\n', which is codepoint 10, to the char at index 10 of string.punctuation, which is '+'. The usual way to use str.translate is to first create a translation table with str.maketrans, which lets you specify the chars to map from, the corresponding chars to map to, and (optionally) chars to delete. If you only want to use it for deletion, you can create the table with dict.fromkeys, e.g.

    table = dict.fromkeys([ord(c) for c in string.punctuation])

which makes a dict associating the codepoint of each char in string.punctuation with None.
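To make both points concrete, here is a small check using made-up sample strings. It first reproduces the newline-to-plus effect by handing str.translate a plain string, and then applies a deletion table built with dict.fromkeys.

```python
import string

# With a plain string as the table, translate() looks up table[ord(ch)].
# ord('\n') is 10, and index 10 of string.punctuation is '+'.
broken = "line one\nline two".translate(string.punctuation)
print(repr(broken))            # 'line one+line two'

# A deletion table: each punctuation code point maps to None,
# which translate() treats as "drop this character".
table = dict.fromkeys(ord(c) for c in string.punctuation)
cleaned = "You've got problems, man!".translate(table)
print(cleaned)                 # Youve got problems man
```

Characters whose codepoint has no entry in the table are left unchanged. string.punctuation has only 32 chars, so with the plain string as the table anything from the space character (codepoint 32) upward is untouched, which is why only control chars such as '\n' were affected.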

Here's a repaired version of your code that uses str.translate to perform the case conversion and the punctuation deletion in a single step.

    import string

    # Map upper case to lower case & remove punctuation
    table = str.maketrans(string.ascii_uppercase, string.ascii_lowercase, string.punctuation)
    text = text.translate(table)
    text_list = text.split('\n')
    for row in text_list:
        print(repr(row))

output

    'from mark twain marktwaingmailcom'
    'to edgar allen poe eapgmailcom'
    'subject rehello'
    ''
    'ed'
    ''
    'i just read the tell tale heart youve got problems man'
    ''
    'sincerely'
    'marky mark'
    ''
    'from edgar allen poe eapgmailcom'
    'to mark twain marktwaingmailcom'
    'subject re hello'
    ''
    'mark'
    ''
    'the world is crushing my soul and so are you'
    ''
    'regards'
    'edgar'

However, simply deleting all the punctuation is a bit messy, since it joins some words that you may not want joined. Instead, we can translate each punctuation char to a space, and then split on whitespace:

    # Map all punctuation to space
    table = dict.fromkeys([ord(c) for c in string.punctuation], ' ')
    text = text.translate(table).lower()
    text_list = text.split()
    print(text_list)

output

    ['from', 'mark', 'twain', 'mark', 'twain', 'gmail', 'com', 'to', 'edgar',
     'allen', 'poe', 'eap', 'gmail', 'com', 'subject', 're', 'hello', 'ed', 'i',
     'just', 'read', 'the', 'tell', 'tale', 'heart', 'you', 've', 'got',
     'problems', 'man', 'sincerely', 'marky', 'mark', 'from', 'edgar', 'allen',
     'poe', 'eap', 'gmail', 'com', 'to', 'mark', 'twain', 'mark', 'twain',
     'gmail', 'com', 'subject', 're', 'hello', 'mark', 'the', 'world', 'is',
     'crushing', 'my', 'soul', 'and', 'so', 'are', 'you', 'regards', 'edgar']
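If the chain has already been split into one dict per message, the same punctuation-to-space table can be applied to each body separately, so the tokens stay attached to their From and To. This is only a sketch: the tokenize helper, the PUNCT_TO_SPACE name and the shortened sample dicts are hypothetical, and the input is assumed to be a list of dicts with a 'message' key.

```python
import string

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = dict.fromkeys((ord(c) for c in string.punctuation), ' ')


def tokenize(body):
    """Lowercase a message body and split it into words."""
    return body.translate(PUNCT_TO_SPACE).lower().split()


# Hypothetical input: one dict per message, with shortened bodies.
emails = [
    {'from': 'Mark', 'to': 'Edgar', 'message': "Ed,\n\nYou've got problems man."},
    {'from': 'Edgar', 'to': 'Mark', 'message': 'Mark,\n\nThe world is crushing my soul.'},
]
for email in emails:
    email['tokens'] = tokenize(email['message'])

print(emails[0]['tokens'])
# ['ed', 'you', 've', 'got', 'problems', 'man']
```

Note that the apostrophe is treated like any other punctuation here, so "You've" becomes two tokens; whether that is acceptable depends on the analysis you have in mind.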