Parsing PlainText Emails from HTML Content (ASP.NET)

Question

In short, we basically have a system where HTML content for email is generated. It's not perfect, but it works.

From this we should be able to get a plain text alternative for the email. I was thinking about going straight in and writing a RegEx to strip the tags (<*>) from the message, but then I realized that wouldn't be good enough, because we need to keep formatting information (paragraphs, line breaks, images, etc.).

NOTE: I am fine with sending mail and setting up alternate views, etc.; this is only about getting plain text from HTML.

So I'm thinking about some ideas. Will post one answer to see what you guys think but thought I would open it to the floor. :)

If you need more clarification then please shout.

Many thanks,

Rob

html email parsing asp.net plaintext

Rob cooper 10 nov. '08 at 9:55

3 answers

My idea

Create a page based on HTML content and navigate the tree. You can then select text from the controls and process the various controls as needed (for example, use ALT text for images, "_____" for HR, etc.).
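The tree-walking idea can be sketched outside ASP.NET. The following is a minimal, language-neutral illustration in Python rather than the poster's actual code: it walks the HTML with the standard library parser, emits ALT text for images and "_____" for HR, and turns block tags into paragraph breaks. The class name, the function name, and the set of block tags are assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class PlainTextRenderer(HTMLParser):
    """Walks the HTML and collects a plain-text rendering (illustrative only)."""

    # Assumed set of tags that should end with a paragraph break.
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "br":
            self.parts.append("\n")
        elif tag == "hr":
            self.parts.append("\n\n_____\n\n")
        elif tag == "img" and attrs.get("alt"):
            # Use the ALT text in place of the image.
            self.parts.append("[" + attrs["alt"] + "]")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Collapse runs of whitespace, as a browser would.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    renderer = PlainTextRenderer()
    renderer.feed(html)
    renderer.close()
    text = "".join(renderer.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()
```

Calling html_to_text on the generated email body would give the text for the plain-text alternative; tags not listed above simply contribute their text content.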

Rob cooper 10 nov. '08 at 9:56

You can make sure the HTML mail is in XHTML format, so you can parse it easily with standard XML tools and then create your own DOM serializer that outputs plain text. It would be a lot of work to cover generic XHTML, but for the limited set you plan on using in email, it might work.
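As a rough sketch of that approach, assuming the mail body is well-formed XHTML, a standard XML parser plus a small recursive serializer is enough for a limited tag set. The example below uses Python's xml.etree.ElementTree for illustration only; the function names and the handled tags are assumptions, and named entities such as &nbsp; would need extra handling because a plain XML parser does not know them.

```python
import re
import xml.etree.ElementTree as ET

# Assumed set of tags that should end with a paragraph break.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li"}


def _walk(node, out):
    tag = node.tag.rsplit("}", 1)[-1]  # drop a "{namespace}" prefix if present
    if tag == "br":
        out.append("\n")
    elif tag == "hr":
        out.append("\n\n_____\n\n")
    elif tag == "img" and node.get("alt"):
        out.append("[" + node.get("alt") + "]")
    if node.text:
        out.append(re.sub(r"\s+", " ", node.text))
    for child in node:
        _walk(child, out)
        if child.tail:
            # Text that follows the child element inside this node.
            out.append(re.sub(r"\s+", " ", child.tail))
    if tag in BLOCK_TAGS:
        out.append("\n\n")


def xhtml_to_text(xhtml):
    out = []
    _walk(ET.fromstring(xhtml), out)
    text = "".join(out)
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line
    return text.strip()
```

ET.fromstring raises a ParseError on markup that is not well-formed, which is why this route depends on the generator really producing XHTML.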

Alternatively, if you don't mind using another program, you can simply use the -dump switch in the lynx web browser.
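A minimal sketch of shelling out to lynx, shown in Python for illustration: the HTML is written to a temporary file and passed to lynx with the -dump switch mentioned above. The function name is hypothetical, and the code assumes lynx is on the PATH, returning None when it is not.

```python
import os
import shutil
import subprocess
import tempfile


def lynx_dump(html):
    """Return lynx's text rendering of html, or None if lynx is not installed."""
    if shutil.which("lynx") is None:
        return None
    # lynx decides how to render from the file extension, so use ".html".
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(html)
        path = handle.name
    try:
        result = subprocess.run(
            ["lynx", "-dump", path],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout
    finally:
        os.remove(path)
```

The trade-off is an external dependency on the mail server in exchange for not having to write any conversion code.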

bobince 10 nov. '08 at 10:49

Rob cooper · Accepted Answer · 2008-11-10T13:24:18+0000

My solution

Okay, so that's it! I came up with a solution to my problem and it works like a charm!

Now, here are some of the goals I would like to outline:

All email content must remain in ASPX pages (currently HTML content).
I didn't want the client code to do anything other than SendMail("PageX.aspx").
I didn't want to write too much code.
I wanted to keep the code semantically correct as far as possible (no REALLY crazy hacks!).

Process

So this is what I ended up with:

Go to the master page for the email messages. Create an ASP.NET MultiView control. This control has two views: HTML and PlainText.
In each view, I added content placeholders for the actual content.
Then I took all the existing ASPX markup (such as the header and footer) and put it in the HTML view. All of it, DocType included. Visual Studio complains a little about this. Ignore it.
Then, of course, I added new content to the PlainText view that reproduces the HTML view as closely as plain text allows.
Then I added code to the master page's Page_Load that checks the QueryString "type" parameter, which can be "html" or "text". It falls back to "text" otherwise. Depending on the value, it switches the view.
Then I went to the content pages, added new placeholders for the PlainText equivalents, and added text as needed.
To make my life easier, I overloaded my SendMail method so that it fetches the response of the required page, passing "type=html" and "type=text", and creates an AlternateView as needed.
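The flow in these steps can be sketched in a language-neutral way. The Python below is only an illustration, not the ASP.NET code: select_view mimics the Page_Load check on the "type" parameter, render_page is a hypothetical stand-in for requesting the page with "type=html" or "type=text", and build_mail combines both responses into one message with a plain-text body and an HTML alternative, which is the role AlternateView plays in .NET.

```python
from email.message import EmailMessage
from urllib.parse import parse_qs


def select_view(query_string):
    """Pick the view as the master page does: "html" if asked, otherwise "text"."""
    requested = parse_qs(query_string).get("type", ["text"])[0].lower()
    return "html" if requested == "html" else "text"


def render_page(page, query_string):
    """Hypothetical stand-in for requesting the page with a "type" parameter."""
    if select_view(query_string) == "html":
        return "<html><body><h1>" + page + "</h1><p>Hello!</p></body></html>"
    return page + "\n\nHello!"


def build_mail(page, sender, recipient, subject):
    """Build one message carrying both renderings of the same page."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    # The plain-text body goes first; the HTML part is added as an alternative.
    message.set_content(render_page(page, "type=text"))
    message.add_alternative(render_page(page, "type=html"), subtype="html")
    return message
```

The calling code stays as small as the goals require: it names the page, and the view selection and the two renderings are handled behind that single call.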

In summary

So, in short:

Views separate the actual "views" of the content (HTML and text).
The master page automatically switches the view based on the QueryString.
Content pages are responsible for how their views appear.

Mission completed!

If anything is unclear, please shout. I would like to write this up in more detail as a blog post at some point.
