How to write VBA code to remove and replace UTF-8 characters
I have this code, but I still cannot replace non-English characters (such as Vietnamese or Thai) in my data with a simple "placeholder".
Sub NonLatin()
    Dim cell As Range
    For Each cell In Range("A1", Cells(Rows.Count, "A").End(xlUp))
        s = cell.Value
        For i = 1 To Len(s)
            If Mid(s, i, 1) Like "[!A-Za-z0-9@#$%^&* * ]" Then cell.Value = "placeholder"
        Next
    Next
End Sub
Appreciate your help
For more information on using regular expressions in VBA code, see this question.
Then use regular expressions in a function like this to process the strings. Here I am assuming you want to replace every invalid character with the placeholder, not the entire string. If you want to treat the string as a whole, you don't need to check individual characters: you can use the quantifiers + or * on the character class in the regex pattern and test the whole string at once.
Function LatinString(ByVal str As String) As String
    ' ByVal so that the caller's variable is not modified in place
    ' After including a reference to "Microsoft VBScript Regular Expressions 5.5"
    ' Set up the regular expressions object
    Dim regEx As New RegExp
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        ' This is the pattern of ALLOWED characters.
        ' Note that special characters should be escaped using a backslash e.g. \$ not $
        .Pattern = "[A-Za-z0-9]"
    End With
    ' Loop through characters in string. Replace disallowed characters with "?"
    Dim i As Long
    For i = 1 To Len(str)
        If Not regEx.Test(Mid(str, i, 1)) Then
            str = Left(str, i - 1) & "?" & Mid(str, i + 1)
        End If
    Next i
    ' Return output
    LatinString = str
End Function
You can then call it from your code:
Dim cell As Range
For Each cell In Range("A1", Cells(Rows.Count, "A").End(xlUp))
    cell.Value = LatinString(cell.Value)
Next
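The whole-string idea mentioned earlier in this answer can be sketched without any per-character loop. The two helpers below are hypothetical (the names LatinStringOnePass and IsAllLatin are not from the original answer). They reuse the same allowed set, [A-Za-z0-9], so spaces and punctuation also count as disallowed, and they use late binding so no library reference is required. Treat this as a sketch under those assumptions rather than a drop-in replacement.

```vba
' Hypothetical helper: replaces every disallowed character in a single
' call to RegExp.Replace instead of looping over the string.
Function LatinStringOnePass(ByVal inputText As String, _
                            Optional ByVal placeholder As String = "?") As String
    ' Late binding: no reference to the VBScript RegExp library is needed
    With CreateObject("VBScript.RegExp")
        .Global = True                 ' replace all matches, not just the first
        .Pattern = "[^A-Za-z0-9]"      ' negated class = any character NOT allowed
        ' Note: "$" has a special meaning in the replacement text,
        ' so avoid it in the placeholder
        LatinStringOnePass = .Replace(inputText, placeholder)
    End With
End Function

' Hypothetical helper: True when the whole string consists only of
' allowed characters (an empty string also passes because of "*").
Function IsAllLatin(ByVal inputText As String) As Boolean
    With CreateObject("VBScript.RegExp")
        .MultiLine = False             ' ^ and $ anchor the whole string
        .Pattern = "^[A-Za-z0-9]*$"
        IsAllLatin = .Test(inputText)
    End With
End Function
```

To overwrite a whole cell, as in the question, something like If Not IsAllLatin(cell.Value) Then cell.Value = "placeholder" inside the same For Each loop should do it.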
For a byte-level method that converts a Unicode string to a UTF-8 string without using regular expressions, check out this article.
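If the goal is only to replace non-ASCII characters without regular expressions, a simpler character-code check may be enough. The sketch below is not the method from the linked article and performs no UTF-8 conversion; it just inspects each UTF-16 code unit with AscW. The function name AsciiOnly and the default placeholder are assumptions made for illustration.

```vba
' Hypothetical helper: replaces every character outside 7-bit ASCII
' (code points above 127) with a placeholder, without using RegExp.
Function AsciiOnly(ByVal s As String, _
                   Optional ByVal placeholder As String = "?") As String
    Dim i As Long
    Dim ch As String
    Dim result As String

    For i = 1 To Len(s)
        ch = Mid$(s, i, 1)
        ' AscW returns a signed Integer, so code units from &H8000 upward
        ' come back negative; mask to the range 0..65535 before comparing
        If (AscW(ch) And &HFFFF&) > 127 Then
            result = result & placeholder
        Else
            result = result & ch
        End If
    Next i

    AsciiOnly = result
End Function
```

Characters outside the Basic Multilingual Plane are stored as two code units in a VBA string, so each of them yields two placeholders with this approach.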
You can replace any characters that are not in, e.g., ASCII (the first 128 characters) with a placeholder using the code below:
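Option Explicit

Sub Test()
    Dim oCell As Range
    ' Late binding, so no library reference is required
    With CreateObject("VBScript.RegExp")
        .Global = True
        ' Match anything outside ASCII (\u0000 to \u007F, the first 128 characters)
        .Pattern = "[^\u0000-\u007F]"
        For Each oCell In [A1:C4]
            oCell.Value = .Replace(oCell.Value, "*")
        Next
    End With
End Sub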