How can I remove unknown characters using T-SQL?

I am trying to write a script that removes the Chinese characters from a string, along with any text that sits between the Chinese characters. See the example below. Thanks.

select LTRIM(SUBSTRING('Tower 6A 第6A座', PATINDEX('%[a-zA-Z0-9]%', 'Tower 6A 第6A座'), LEN('Tower 6A 第6A座')))
select LTRIM(REPLACE(SUBSTRING('Tower 6A 第6A座', CHARINDEX(N'樓', 'Tower 6A 第6A座') + 1, LEN('Tower 6A 第6A座')), ' ', ''))

      

Example input:

Tower 6A 第6A座
Tower 3 第3座

      

Bad result:

Tower 6A ?6A?
Tower6A?6A?
Tower 3 ?3?
Tower3?3?

      

The result I want to achieve:

Tower 6A
Tower 6A
Tower 3
Tower 3

      

+3




5 answers


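Try this. Note that T-SQL REPLACE matches literal text only, so the pattern-like argument in the inner call has no effect; the outer call just strips the '?' placeholders: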



   SELECT Replace(Replace('Tower 6A 第6A座','[^a-zA-Z0-9]+', ''),'?','')
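Because REPLACE only handles literal text, a pattern has to go through PATINDEX or LIKE instead. The following is a minimal sketch, not the answer author's method: it assumes the goal is to keep everything before the first character outside the printable ASCII range (space through tilde), and it uses a binary collation so that the range is compared by code point. The table variable @t and the column alias cleaned are hypothetical names used only for illustration.

```sql
-- Hypothetical sample data, including one row with no Chinese characters
DECLARE @t TABLE (s NVARCHAR(100));
INSERT @t VALUES (N'Tower 6A 第6A座'), (N'Tower 3 第3座'), (N'Tower 9');

SELECT s,
       -- Position of the first character outside ' ' .. '~';
       -- 0 (no match) is mapped to "keep the whole string"
       cleaned = RTRIM(LEFT(s, COALESCE(
                     NULLIF(PATINDEX(N'%[^ -~]%' COLLATE Latin1_General_BIN2, s), 0) - 1,
                     LEN(s))))
FROM @t;
```

On the sample rows this should yield Tower 6A, Tower 3 and Tower 9. It would not help if wanted text followed the Chinese characters, since everything after the first match is discarded.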

      

+3




This looks odd, but it does the job. I would not count on great performance, though:

;WITH string AS (
    SELECT N'Tower 6A 第6A座' s
),
split AS (
    select LEFT(s, 1) s_item,
       STUFF(s, 1, 1, N'') s
    from string
    union all
    select LEFT(s, 1),
       STUFF(s, 1, 1, N'')
    from split
    where s > ''
)
,
remove_non_ascii AS ( 
    SELECT s_item, UNICODE(s_item) s_unicode
    FROM split WHERE UNICODE(s_item)<256
)
SELECT STUFF((SELECT s_item FROM remove_non_ascii
FOR XML PATH, TYPE).value('.[1]', 'NVARCHAR(MAX)'), 1,0, '');

      

What it does:



  • Splits the string into individual characters

  • Drops every character whose UNICODE code point is 256 or higher (you can adjust the condition)

  • Concatenates the remaining characters back into a string

It uses a recursive query, so for strings longer than 100 characters you need to raise the recursion limit (100 levels by default) by adding OPTION (MAXRECURSION n), where n is the new limit.
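As a rough illustration of the recursion limit, here is a sketch of the same character-by-character split run against a longer input. The variable @s, the 195-character test string and the explicit CASTs (added so the anchor and recursive parts have matching column types) are assumptions made for this example rather than part of the original answer.

```sql
-- Hypothetical input: 15 copies of the 13-character sample = 195 characters,
-- which needs more than the default 100 recursion levels.
DECLARE @s NVARCHAR(400) = REPLICATE(N'Tower 6A 第6A座', 15);

WITH split AS (
    -- Anchor: peel off the first character
    SELECT CAST(LEFT(@s, 1) AS NVARCHAR(1))            AS s_item,
           CAST(STUFF(@s, 1, 1, N'') AS NVARCHAR(400)) AS s
    UNION ALL
    -- Recursive step: keep peeling until nothing is left
    SELECT CAST(LEFT(s, 1) AS NVARCHAR(1)),
           CAST(STUFF(s, 1, 1, N'') AS NVARCHAR(400))
    FROM split
    WHERE s > N''
)
SELECT COUNT(*) AS kept_characters
FROM split
WHERE UNICODE(s_item) < 256
OPTION (MAXRECURSION 0);   -- 0 lifts the limit; values from 1 to 32767 set a cap
```

Without the OPTION clause, this query should stop with a "maximum recursion 100 has been exhausted" error. Using 0 removes the safety net entirely, so a specific cap may be the safer choice when the maximum string length is known.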

+2




Try this:

;With cte(Data)
AS
(
SELECT 'Tower 6A 第6A座' UNION ALL
SELECT 'Tower 3 第3座'
)
SELECT
CASE WHEN CHARINDEX('Tower', DATA)= 0 THEN 'Tower ' + DATA ELSE DATA END AS DATA
FROM
(
SELECT
Split.a.value('.', 'VARCHAR(1000)') AS Data
FROM
(
SELECT  CAST('<S>' + REPLACE(REPLACE(Data, '?', '</S><S>'),'','</S><S>') + '</S>' AS XML) AS Data
FROM cte
) AS 
    A CROSS APPLY Data.nodes('/S') AS Split(a)
) DT
Where
DT.Data <> ''

      

Result

DATA
----------
Tower 6A 
Tower 6A
Tower 3 
Tower 3

      

0




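Using delimitedSplit8K (a community string-splitting function that is not built in and must already exist in your database), you can do this: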

-- Sample data
DECLARE @table TABLE (someid int identity, string nvarchar(100));
INSERT @table VALUES (N'Tower 6A 第6A座'), (N'Tower 3 第3座');

-- Solution
WITH base AS
(
  SELECT someid, ItemNumber, itemClean = replace(item,'?','')
  FROM @table t
  CROSS APPLY dbo.delimitedSplit8K(t.string, ' ')
)
SELECT someid, newstring = b1.itemClean + ' ' +  b2.itemClean
FROM base b1
CROSS APPLY 
(
  SELECT itemClean 
  FROM base b2 WHERE ItemNumber = 1 AND b1.someid = b2.someid
) b2
WHERE b1.ItemNumber > 1;

      

Results:

someid      newstring
----------- ---------
1           6A Tower
1           6A Tower
2           3 Tower
2           3 Tower

      

0




One approach to this problem is to use the ability of SQL Server 2016 to execute R scripts. For this method to work, R Services must be installed on your SQL Server, and the sp_execute_external_script stored procedure must be enabled. Enable external scripts as follows:

EXECUTE sp_configure;
GO
-- Enable execution of external scripts by setting the option to 1:
EXECUTE sp_configure 'external scripts enabled', 1;
GO

RECONFIGURE;
GO

      

Also make sure the SQL Server Launchpad service is running. You may need to restart SQL Server as well.
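To confirm that the setting has taken effect, a quick check against the sys.configurations catalog view can help. This is only a sketch of one way to verify it:

```sql
-- value_in_use should read 1 once the option is active
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'external scripts enabled';
```

If value is 1 but value_in_use is still 0, the change has been stored but is not yet in effect, which is when the restart mentioned above is likely to be needed.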

If we convert our input containing Chinese characters from Unicode (NVARCHAR) to VARCHAR, SQL Server replaces every character that the target code page cannot represent, here the Chinese characters, with a question mark "?". We can then use the pattern-matching capabilities of the R language to remove all text between the question marks (formerly our Chinese characters), as in the following example:

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ExampleData]') AND type in (N'U'))
BEGIN

    CREATE TABLE [dbo].[ExampleData](
        [RawData] [nvarchar](100) NULL
    ) ON [PRIMARY]

    INSERT INTO ExampleData
    (
        [RawData]
    )
    VALUES ( N'Tower 6A 第6A座');

    INSERT INTO ExampleData
    (
        [RawData]
    )
    VALUES ( N'Tower 3 第3座');

END

DECLARE @inputQuery NVARCHAR(MAX) = N'SELECT CAST([RawData] AS VARCHAR(100)) as a FROM ExampleData'

DECLARE @RScript NVARCHAR(MAX) = N'
pattern <-"\\?.*\\?";
inData$a <- sub(pattern, "\\1", inData$a, perl = T );
outData <- inData;';


EXEC sp_execute_external_script @language = N'R'
, @script = @RScript
, @input_data_1 = @inputQuery
, @input_data_1_name = N'inData'
, @output_data_1_name=N'outData'
with result sets ( (a varchar(300)));
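The conversion step that this approach relies on can be seen in isolation with plain T-SQL. The sketch below assumes a Latin1 collation (code page 1252) so that the outcome is predictable; under a Chinese collation the characters would survive the conversion. The variable names and the CHARINDEX-based trimming are illustrative assumptions, not part of the R solution.

```sql
DECLARE @raw NVARCHAR(100) = N'Tower 6A 第6A座';

-- Explicit collation so the target code page is known for this sketch
DECLARE @converted VARCHAR(100) =
    CAST(@raw COLLATE Latin1_General_CI_AS AS VARCHAR(100));

SELECT @converted AS converted,   -- Chinese characters arrive as '?'
       -- Keep everything before the first '?'; keep the whole string if there is none
       RTRIM(LEFT(@converted, COALESCE(
           NULLIF(CHARINDEX('?', @converted), 0) - 1,
           LEN(@converted)))) AS trimmed;
```

One caveat with any '?'-based cleanup, including the R pattern above: a genuine question mark in the source data is indistinguishable from a converted character.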

      

0

