How to write VBA code to remove and replace UTF-8 characters

I have this code, but I still cannot replace non-English characters, such as Vietnamese or Thai, in my data with a simple "placeholder".

Sub NonLatin()
Dim cell As Range
    For Each cell In Range("A1", Cells(Rows.Count, "A").End(xlUp))
        s = cell.Value
            For i = 1 To Len(s)
                If Mid(s, i, 1) Like "[!A-Za-z0-9@#$%^&* * ]" Then cell.Value = "placeholder"
            Next
    Next
End Sub

      

Appreciate your help



2 answers


For more information on using regular expressions in VBA code, see this question.


Then use regular expressions in a function like this to process the strings. Here I am assuming you want to replace each invalid character with the placeholder, not the entire string. If you want to replace the whole string instead, you don't need to check individual characters: you can use the quantifiers + or * on the character class in the regex pattern and test the whole string at once.

Function LatinString(ByVal str As String) As String
    ' Requires a reference to "Microsoft VBScript Regular Expressions 5.5" (Tools > References)
    ' Set up the regular expressions object
    Dim regEx As New RegExp
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        ' This is the pattern of ALLOWED characters.
        ' Note that special characters should be escaped using a backslash, e.g. \$ not $
        .Pattern = "[A-Za-z0-9]"
    End With

    ' Loop through characters in string. Replace disallowed characters with "?"
    Dim i As Long
    For i = 1 To Len(str)
        If Not regEx.Test(Mid(str, i, 1)) Then
            str = Left(str, i - 1) & "?" & Mid(str, i + 1)
        End If
    Next i
    ' Return output
    LatinString = str
End Function

      



You can then use this in your code:

Dim cell As Range
For Each cell In Range("A1", Cells(Rows.Count, "A").End(xlUp))
    cell.Value = LatinString(cell.Value)
Next
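
If the goal is the whole-string variant mentioned earlier (replace the entire cell value as soon as it contains one disallowed character), a minimal sketch could look like the following. The names IsLatinOnly and ReplaceNonLatinCells are hypothetical, the allowed set (letters, digits, space) is an assumption you should adapt, and the loop assumes column A holds no error values.

```vba
' Sketch only: IsLatinOnly and ReplaceNonLatinCells are illustrative names.
Function IsLatinOnly(ByVal s As String) As Boolean
    ' Late binding, so no reference to the regex library is required
    With CreateObject("VBScript.RegExp")
        ' ^ and $ anchor the match to the whole string;
        ' * lets the character class repeat (an empty string also passes)
        .Pattern = "^[A-Za-z0-9 ]*$"
        IsLatinOnly = .Test(s)
    End With
End Function

Sub ReplaceNonLatinCells()
    Dim cell As Range
    For Each cell In Range("A1", Cells(Rows.Count, "A").End(xlUp))
        ' Replace the whole value if any character falls outside the allowed set
        If Not IsLatinOnly(CStr(cell.Value)) Then cell.Value = "placeholder"
    Next cell
End Sub
```

Creating the RegExp object once outside the loop would be faster on large ranges; it is created per call here only to keep the sketch short.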

      


For a byte-level method that converts a Unicode string to a UTF-8 string without using regular expressions, check out this article.
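
As a related regex-free option (not the method from the linked article), you can compare each character's code point with AscW. This is a rough sketch; the function name ReplaceNonAscii and the cutoff at code 127 are assumptions.

```vba
' Sketch only: ReplaceNonAscii is an illustrative name.
Function ReplaceNonAscii(ByVal s As String, _
                         Optional ByVal placeholder As String = "?") As String
    Dim i As Long
    Dim code As Long
    Dim result As String

    For i = 1 To Len(s)
        ' AscW returns a signed 16-bit Integer; masking maps it to 0..65535
        code = AscW(Mid$(s, i, 1)) And &HFFFF&
        If code > 127 Then
            result = result & placeholder
        Else
            result = result & Mid$(s, i, 1)
        End If
    Next i

    ReplaceNonAscii = result
End Function
```

Characters outside the Basic Multilingual Plane are stored as two UTF-16 code units, so each of them yields two placeholders with this approach.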



You can replace any characters that are not in, e.g., ASCII (the first 128 characters) with a placeholder using the code below:



Option Explicit

Sub Test()

    Dim oCell As Range

    With CreateObject("VBScript.RegExp")
        .Global = True
        ' Keeps characters up to U+00F7; use \u007F as the upper bound for strict ASCII
        .Pattern = "[^\u0000-\u00F7]"
        For Each oCell In [A1:C4]
            oCell.Value = .Replace(oCell.Value, "*")
        Next
    End With

End Sub

      
