Remove special and invalid characters in a string

I am working on creating a product feed for a third-party company. The data I'm working with contains all kinds of invalid and special characters, double spacing, etc. They also requested that the data be HTML-encoded, with special characters written as entities.

An example of the data to be transferred: "Buy Kitchen

Aid Artisan ™ Stand Mixer 4.8L "

        try
        {
            var removeDoubleSpace = Regex.Replace(stringInput, @"\s+", " ");
            var encodedString = HttpUtility.HtmlEncode(removeDoubleSpace).Trim();
            var encodedAndLineBreaksRemoved = encodedString.Replace(Environment.NewLine, "");
            // Inside a verbatim string a literal double quote must be written as ""
            var finalStringOutput = Regex.Replace(encodedAndLineBreaksRemoved, @"[™’""–]", "");

            return finalStringOutput;
        }
        catch (Exception)
        {
            return stringInput;
        }

      

I was trying to come up with a single method that does all of the above in a cleaner way, rather than chaining multiple Regex expressions. Or maybe there is a single regex that covers everything?



3 answers


Use a whitelist, not a blacklist: it is easier to enumerate the characters that are acceptable than to anticipate every character that might be unacceptable. A whitelist is a list of valid characters. Build your whitelist and remove anything that is not on it. In your case, a potential whitelist might include all ASCII characters.

Below is a whitelist that matches ASCII letters and digits plus punctuation marks. Note that \p{P} covers all Unicode punctuation, so typographic characters such as ’ and – would pass through as well.

using System;
using System.Text;
using System.Text.RegularExpressions;

public class Program
{       
    private static string input = @"Buy Kitchen

Aid Artisan™ Stand Mixer 4.8L ";

    public static void Main()
    {
        var match = Regex
            .Match(input, @"[a-zA-Z0-9\p{P}]+");

        StringBuilder builder = new StringBuilder();
        while(match.Success)
        {
            // add a space between matches
            builder.Append(match + " ");
            match = match.NextMatch();
        }
        Console.WriteLine(builder.ToString());
    }
}

      



Output

Buy Kitchen Aid Artisan Stand Mixer 4.8L
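
The same whitelist can probably be applied in a single call by negating the character class and replacing everything that falls outside it. This is only a sketch: the class name WhitelistCleaner and the method name Clean are made up for illustration, and it assumes that every run of disallowed characters (whitespace included) should become one space.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitelistCleaner
{
    // Matches runs of anything that is NOT an ASCII letter, an ASCII digit,
    // or a Unicode punctuation character (the same whitelist, negated).
    private static readonly Regex NotAllowed = new Regex(@"[^a-zA-Z0-9\p{P}]+");

    // Hypothetical helper: replaces each disallowed run with a single space
    // and trims the ends.
    public static string Clean(string input)
    {
        return NotAllowed.Replace(input, " ").Trim();
    }

    public static void Main()
    {
        // Prints: Buy Kitchen Aid Artisan Stand Mixer 4.8L
        Console.WriteLine(Clean("Buy Kitchen\n\nAid Artisan™ Stand Mixer 4.8L "));
    }
}
```

Unlike the loop above, this version leaves no trailing space, because the result is trimmed.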

      



Here is a slightly improved version of your code:

var removeDoubleSpace = Regex.Replace(stringInput, @"\s+", " ");
var encodedString = System.Web.HttpUtility.HtmlEncode(removeDoubleSpace).Trim()
    .Replace("™", string.Empty).Replace("’", string.Empty)
    .Replace("\"", string.Empty).Replace("–", string.Empty);

      

You don't need var encodedAndLineBreaksRemoved = encodedString.Replace(Environment.NewLine, ""); because the newlines have already been replaced by the \s+ regex (\s matches any whitespace character, including spaces, tabs, and line breaks; in .NET it is equivalent to [\f\n\r\t\v\x85\p{Z}]).
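
A quick way to check that \s+ already takes care of line breaks is a small console test. The class and method names here (WhitespaceDemo, Collapse) are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Collapses every run of whitespace (spaces, tabs, CR, LF, ...) into one space.
    public static string Collapse(string input)
    {
        return Regex.Replace(input, @"\s+", " ");
    }

    public static void Main()
    {
        var collapsed = Collapse("Buy Kitchen\r\n\r\nAid\tArtisan");
        Console.WriteLine(collapsed);                 // Buy Kitchen Aid Artisan
        Console.WriteLine(collapsed.Contains("\n"));  // False
    }
}
```

Since no line break survives the first replacement, a later Replace(Environment.NewLine, "") has nothing left to remove.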



Also, there is no need for a second regex unless you plan to remove a whole character range or class (for example, every character in the \p{S} Unicode category), so I simply chain a few string.Replace calls onto the trimmed, encoded string.

Output:

Buy Kitchen Aid Artisan Stand Mixer 4.8L
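
If you do want the class-based variant, something along these lines might work as a single method. Treat it as a sketch under a few assumptions: FeedEncoder and Clean are made-up names, System.Net.WebUtility.HtmlEncode is swapped in for HttpUtility.HtmlEncode so the example needs no System.Web reference, and the unwanted characters are stripped before encoding so that a plain " is removed rather than turned into &quot; first.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class FeedEncoder
{
    // Hypothetical helper: strip, collapse whitespace, then HTML-encode.
    public static string Clean(string input)
    {
        // \p{S} removes every Unicode symbol (™ among them); the three extra
        // characters are punctuation, so they have to be listed explicitly.
        var stripped = Regex.Replace(input, @"[\p{S}’""–]", "");

        // Collapse whitespace after stripping, so a removed symbol that sat
        // between two spaces does not leave a double space behind.
        var collapsed = Regex.Replace(stripped, @"\s+", " ").Trim();

        return WebUtility.HtmlEncode(collapsed);
    }

    public static void Main()
    {
        // Prints: Buy Kitchen Aid Artisan Stand Mixer 4.8L
        Console.WriteLine(Clean("Buy Kitchen\n\nAid Artisan ™ Stand Mixer 4.8L "));
        // Prints: Salt &amp; Pepper
        Console.WriteLine(Clean("Salt & Pepper"));
    }
}
```

Keep in mind that \p{S} is broad: it also covers math and currency symbols such as +, <, = and $, so it may strip more than you want from product data.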

      



You don't need regex; LINQ will do as well:

using System;
using System.Linq;

var str = "Buy Kitchen Aid Artisan™ Stand Mixer 4.8L";
// Keep every character that is not in a Unicode symbol category (™ is one)
var newStr = new string(str.Where(c => !Char.IsSymbol(c)).ToArray());

Console.WriteLine(newStr); // Buy Kitchen Aid Artisan Stand Mixer 4.8L
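
The filter above does not touch whitespace, so the line breaks and double spaces from the question would remain. One possible way to handle both without regex is a plain loop over the characters (a loop rather than LINQ, because collapsing whitespace needs a little state). SinglePassCleaner and Clean are hypothetical names used only for this sketch.

```csharp
using System;
using System.Text;

public static class SinglePassCleaner
{
    // Hypothetical helper: drops symbols and collapses whitespace in one pass.
    public static string Clean(string input)
    {
        var sb = new StringBuilder(input.Length);
        bool pendingSpace = false;

        foreach (char c in input)
        {
            if (char.IsSymbol(c))
            {
                continue; // drop symbols such as ™
            }
            if (char.IsWhiteSpace(c))
            {
                // Remember that a separator is due, but never emit a leading one.
                pendingSpace = sb.Length > 0;
                continue;
            }
            if (pendingSpace)
            {
                sb.Append(' ');
                pendingSpace = false;
            }
            sb.Append(c);
        }

        // A pending space at the end is never written, so no Trim is needed.
        return sb.ToString();
    }

    public static void Main()
    {
        // Prints: Buy Kitchen Aid Artisan Stand Mixer 4.8L
        Console.WriteLine(Clean("Buy Kitchen\n\nAid Artisan ™ Stand Mixer 4.8L "));
    }
}
```

As with the LINQ version, char.IsSymbol leaves punctuation such as ’ and – in place; those would still need their own check.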

      
