Order of Calls to the String.replace Replacement Function with RTL Text

When calling String.replace with a replacement function, we can retrieve the offsets of the matched substrings.

var a = [];
"hello world".replace(/l/g, function (m, i) { a.push(i); });
// a = [2, 3, 9]

      

In the above example, we get a list of offsets for the matched characters l.

Can I count on implementations always calling the replacement function in ascending order of offset, even when the string contains right-to-left text?

That is: can I be sure that the result above will always be [2, 3, 9], and not [3, 9, 2] or any other permutation of these offsets?

This is a follow-up to this question, which Tomalak answered:

Absolutely yes. Matches are processed from left to right in the original string, because left to right is how regex engines work in a string.

However, regarding the case of RTL languages, he also said:

Good question [...] RTL text definitely affects JavaScript regex behavior.

I tested with the following RTL snippet in Chrome:

var a = [];
"بلوچی مکرانی".replace(/ی/g, function (m, i) { a.push(i); });
// a = [4, 11]

      

I don't speak this language, but looking at the string as displayed I see the character ی as the leftmost character of the string and as the leftmost character after the space. However, since the text is written from right to left, these positions are actually the last character before the space and the last character of the string, which corresponds to the offsets [4, 11].

So this seems to work in Chrome. Question: can I trust that the result will be the same across all compliant JavaScript implementations?



1 answer


I searched ECMA-262, 5.1 Edition (June 2011), for the keywords "format control", "right to left" and "RTL". They are not mentioned, except where the specification says that format-control characters are allowed in string literals and regular expression literals.

From section 7.1:

It is useful to allow format-control characters in source text to facilitate editing and display. All format control characters may be used within comments, and within string literals and regular expression literals.



From Annex E:

7.1: Unicode format control characters are no longer stripped from ECMAScript source text before processing. In Edition 5, if such a character appears in a StringLiteral or RegularExpressionLiteral the character will be incorporated into the literal where in Edition 3 the character would not be incorporated into the literal.

From that, I conclude that JavaScript does not treat right-to-left characters any differently. It only knows about the UTF-16 code units stored in a string, and it operates on them in logical (storage) order, not in visual (display) order.
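This conclusion can be spot-checked in any engine with a small sketch like the one below. The helpers `matchOffsets` and `isAscending` are hypothetical names introduced here, not part of the language or of the original question. The last sample embeds U+200F (RIGHT-TO-LEFT MARK), a format-control character, to show that it is counted as an ordinary code unit.

```javascript
// Hypothetical helper: collect the offset passed to each callback invocation.
function matchOffsets(text, pattern) {
  const offsets = [];
  text.replace(pattern, (match, offset) => {
    offsets.push(offset);
    return match; // return the match itself so the text is left unchanged
  });
  return offsets;
}

// Hypothetical helper: true if every offset is larger than the previous one.
function isAscending(nums) {
  return nums.every((n, i) => i === 0 || nums[i - 1] < n);
}

const ltr = matchOffsets("hello world", /l/g);
const rtl = matchOffsets("بلوچی مکرانی", /ی/g);
// "a", RLM, "b", RLM, "a": the two marks occupy indexes 1 and 3.
const marked = matchOffsets("a\u200Fb\u200Fa", /a/g);

console.log(ltr, isAscending(ltr));       // [2, 3, 9] true
console.log(rtl, isAscending(rtl));       // [4, 11] true
console.log(marked, isAscending(marked)); // [0, 4] true
```

A passing run in one engine is evidence, not proof. The guarantee itself comes from the specification, which defines a global replace as repeated matching that moves forward through the string's code units.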
