Unicode in JavaScript - need to display invalid characters to the user

I am looking for a solution to the following problem, but I have limited experience with Unicode.

Basically, the user can enter text in a textbox. On submit, I want to display a list of the characters that weren't GSM, i.e. everything that does not have a char code of 0-127.

However, this breaks down once emoji are involved: if I split the string into a char array, some emoji are broken apart, and the user is shown the wrong reason why validation failed.

I.e. "😀".length is 2, so it is split into 2 characters, and when I tell the user why validation failed, they get the wrong reason.

Any ideas on how I can solve this would be greatly appreciated.

EDIT: Can't use ES6 and need an array of invalid characters

+3




3 answers


Suppose you are using a regular expression like this to find characters that aren't in the valid range:

/[^\0-\x7f]/

      

you can change it to match UTF-16 surrogate pairs as a single unit:

/[\ud800-\udbff][\udc00-\udfff]|[^\0-\x7f]/
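Since the question asks for an array of invalid characters and rules out ES6, here is a minimal ES5 sketch of how the pattern above could be used for that. The names `NON_ASCII_GLOBAL` and `findInvalidCharacters` are hypothetical and not part of the original answer; the only change to the pattern is the added `g` flag.

```javascript
// Hypothetical helper for illustration; ES5 only.
// The surrogate-pair alternative comes first, so an emoji is matched whole
// before the fallback for any other single code unit outside 0-127.
var NON_ASCII_GLOBAL = /[\ud800-\udbff][\udc00-\udfff]|[^\0-\x7f]/g;

function findInvalidCharacters(text) {
  // With the g flag, match() returns every match, or null when there is none.
  return text.match(NON_ASCII_GLOBAL) || [];
}

// 'a', U+1F600 (grinning face), 'b', U+00E9 (e with acute), 'c'
var invalid = findInvalidCharacters('a\ud83d\ude00b\u00e9c');
console.log(invalid.length); // 2
```

Each array entry is a full code point, so the emoji can be shown to the user intact rather than as two broken halves.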

      

In modern browsers, you can also just use the u flag to work with Unicode code points directly:



/[^\0-\x7f]/u

      

This will still only match code points, not grapheme clusters (important for combining characters, modern combined emoji, skin tones, and general correctness across all languages). Those are more difficult to deal with. When (if?) browser support arrives, they will become less painful; until then, a dedicated package is your best bet.
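To make the code point versus grapheme cluster distinction concrete, here is a small sketch using the same surrogate-pair pattern with a `g` flag added. The variable names are made up for this illustration. It shows an emoji with a skin tone modifier, which is usually drawn as one glyph, being reported as two separate matches.

```javascript
// Same code point pattern as in this answer, with the g flag added.
var CODEPOINT = /[\ud800-\udbff][\udc00-\udfff]|[^\0-\x7f]/g;

// Thumbs up (U+1F44D) followed by a skin tone modifier (U+1F3FD).
// Typically rendered as a single glyph, but it consists of two code points.
var thumbsUp = '\ud83d\udc4d\ud83c\udffd';

var parts = thumbsUp.match(CODEPOINT);
console.log(thumbsUp.length); // 4 UTF-16 code units
console.log(parts.length);    // 2 code points, although the user sees one character
```

A validation message built from `parts` would therefore list the hand and the skin tone separately, which is the residual problem a grapheme-aware package is meant to solve.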

var NON_GSM_CODEPOINT = /[\ud800-\udbff][\udc00-\udfff]|[^\0-\x7f]/;
var input = document.getElementById('input');

input.addEventListener('input', function () {
  var match = this.value.match(NON_GSM_CODEPOINT);
  this.setCustomValidity(match ? 'Invalid character: "' + match[0] + '"' : '');
  this.form.reportValidity();
});
      

<form>
  <textarea id="input"></textarea>
</form>
      



+3




You can use the spread operator (...) to split the string into an array of characters, and then charCodeAt to get the value:



let str = `😀abc😀def😀ghi`;
let chars = [...str];

console.log(`All Chars: ${chars}`);

console.log('Bad Chars:',
  chars.filter(v=>v.charCodeAt(0)>127)
);
      



+1




Interesting! This is just trial and error, but it looks like converting the string to an array of character strings with Array.from will allow you to index the characters correctly:

Array.from('😀').length
1

Array.from('😀abc').length
4

Array.from('😀abc')[0]
"😀"

      

0
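Building on the Array.from approach above, here is one possible sketch for producing the list of invalid characters together with a readable reason for each. The function name `describeInvalidCharacters` and the shape of the returned objects are assumptions made for illustration. Note that Array.from and codePointAt are ES2015 features, so this only fits the question's "no ES6" constraint if polyfills are available.

```javascript
// Hypothetical helper; relies on ES2015 (Array.from, codePointAt).
function describeInvalidCharacters(text) {
  return Array.from(text) // splits by code point, so emoji stay whole
    .filter(function (ch) { return ch.codePointAt(0) > 127; })
    .map(function (ch) {
      var hex = ch.codePointAt(0).toString(16).toUpperCase();
      // Pad to at least four hex digits, as in the usual U+XXXX notation.
      while (hex.length < 4) hex = '0' + hex;
      return { character: ch, codePoint: 'U+' + hex };
    });
}

// U+1F600 (grinning face), 'abc', U+00E9 (e with acute)
var report = describeInvalidCharacters('\ud83d\ude00abc\u00e9');
console.log(report.length);       // 2
console.log(report[0].codePoint); // U+1F600
console.log(report[1].codePoint); // U+00E9
```

Like the other code point based approaches on this page, this still treats a multi-code-point emoji sequence as several separate entries.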

