Remove special and invalid characters in a string
I am creating a product feed for a third-party company. The data I'm working with contains all kinds of invalid and special characters, double spacing, etc. They have also requested that the data be HTML-encoded, with special characters written as entities.
An example of some data to be transferred = "Buy Kitchen
Aid Artisan ™ Stand Mixer 4.8L "
try
{
    var removeDoubleSpace = Regex.Replace(stringInput, @"\s+", " ");
    var encodedString = HttpUtility.HtmlEncode(removeDoubleSpace).Trim();
    var encodedAndLineBreaksRemoved = encodedString.Replace(Environment.NewLine, "");
    // After HtmlEncode, a double quote appears as &quot;
    var finalStringOutput = Regex.Replace(encodedAndLineBreaksRemoved, @"(™)|(’)|(&quot;)|(–)", "");
    return finalStringOutput;
}
catch (Exception)
{
    return stringInput;
}
I was trying to come up with one method that could be called to do all of the above in a cleaner way, rather than using multiple Regex expressions. Or maybe there is a single regex that covers everything?
Use a whitelist rather than a blacklist: it is easier to know which characters are acceptable than to anticipate every character that might be unacceptable. A whitelist is a list of valid characters. Build your whitelist and remove anything that is not on it. In your case, a potential whitelist might include all ASCII characters.
Below is a whitelist that captures all alphanumeric characters and punctuation marks.
using System;
using System.Text;
using System.Text.RegularExpressions;

public class Program
{
    private static string input = @"Buy Kitchen
Aid Artisan™ Stand Mixer 4.8L ";

    public static void Main()
    {
        var match = Regex.Match(input, @"[a-zA-Z0-9\p{P}]+");
        StringBuilder builder = new StringBuilder();
        while (match.Success)
        {
            // add a space between matches
            builder.Append(match + " ");
            match = match.NextMatch();
        }
        Console.WriteLine(builder.ToString());
    }
}
Output
Buy Kitchen Aid Artisan Stand Mixer 4.8L
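The same whitelist can also be applied in a single call by negating the character class and replacing everything outside it. This is a minimal sketch under that assumption, not the answer's exact method; the names WhitelistCleaner and KeepAllowed are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class WhitelistCleaner
{
    // Matches runs of anything that is NOT an ASCII letter, an ASCII digit,
    // or a Unicode punctuation character.
    private static readonly Regex NotAllowed = new Regex(@"[^a-zA-Z0-9\p{P}]+");

    // Replaces each run of disallowed characters with one space, then trims the ends.
    public static string KeepAllowed(string input)
    {
        return NotAllowed.Replace(input, " ").Trim();
    }
}
```

Note that \p{P} covers all Unicode punctuation, so characters such as ’ and – would survive this filter. If the feed has to be plain ASCII, the class would need to be narrowed.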
Here is a slightly improved version of the code:
var removeDoubleSpace = Regex.Replace(stringInput, @"\s+", " ");
var encodedString = System.Web.HttpUtility.HtmlEncode(removeDoubleSpace)
    .Trim()
    .Replace("™", string.Empty)
    .Replace("’", string.Empty)
    .Replace("&quot;", string.Empty)
    .Replace("–", string.Empty);
You don't need var encodedAndLineBreaksRemoved = encodedString.Replace(Environment.NewLine, ""); because the \s+ regex has already replaced the newlines with a single space (\s matches any whitespace character, including space, tab, form feed, and so on; it is roughly equivalent to [ \f\n\r\t\v]).
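To illustrate the point about \s+, here is a small sketch showing that line breaks are folded into the same single space as other whitespace. The names WhitespaceDemo and Collapse are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Collapses every run of whitespace (spaces, tabs, CR, LF) into one space.
    public static string Collapse(string input)
    {
        return Regex.Replace(input, @"\s+", " ");
    }
}
```

For example, WhitespaceDemo.Collapse("Buy  Kitchen\r\nAid") yields "Buy Kitchen Aid", so a later Replace(Environment.NewLine, "") has nothing left to remove.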
Also, there is no need for a second regex unless you plan to remove a specific character range or class (for example, all characters in the \p{S} shorthand class), so I just chain a few string.Replace calls directly onto the trimmed and encoded string.
Output:
Buy Kitchen Aid Artisan Stand Mixer 4.8L
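For the single method the question asks about, one possible shape is sketched below. It makes several assumptions that are not in the thread: the unwanted characters are removed before encoding rather than after, System.Net.WebUtility.HtmlEncode is swapped in for HttpUtility.HtmlEncode to avoid the System.Web dependency, and the names FeedText and Clean are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class FeedText
{
    // Assumed blacklist: trademark sign, right single quote, double quote, en dash.
    private static readonly Regex Unwanted = new Regex("[\u2122\u2019\"\u2013]");
    private static readonly Regex WhitespaceRun = new Regex(@"\s+");

    public static string Clean(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        // 1. Drop unwanted characters first, so no double space is left behind.
        string stripped = Unwanted.Replace(input, string.Empty);
        // 2. Collapse whitespace runs (including line breaks) and trim the ends.
        string collapsed = WhitespaceRun.Replace(stripped, " ").Trim();
        // 3. HTML-encode last, so characters such as & and < become entities.
        return WebUtility.HtmlEncode(collapsed);
    }
}
```

Stripping before encoding means the patterns only have to match the raw characters, not whatever entity form the encoder happens to produce for them.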