What are the definitions of valid and invalid pp tokens?

I want to make extensive use of the ## operator and enum magic to handle a large number of similar access operations, error handling and data flow.

If applying the preprocessor operators ## and # results in an invalid pp-token, the behavior is undefined in C.

The order in which the preprocessor evaluates these operators is in general not defined (*) in C90 (see The token pasting operator). According to various sources, including the MISRA Committee and the referenced page, the order of multiple ## / # operators can determine whether undefined behavior occurs. But I find the examples in these sources hard to understand, and I cannot pin down a general rule.

So my questions are:

  • What are the rules for valid pp tokens?

  • Are there differences between the various C and C++ standards?

  • My current problem: Is the following legal under both possible orders of the two ## operators? (**)

    #define test(A) test_## A ## _THING
    int test(0001) = 2;
    
          

Comments:

(*) I don't say "is undefined" because, IMHO, this has nothing to do with undefined behavior yet, but rather with unspecified behavior. Applying more than one ## or # operator does not in general make the program erroneous. There obviously is an order, we just can't predict which one, so the order is unspecified.

(**) This is not the actual use of the numbering, but the pattern is equivalent.



2 answers


What are the rules for valid pp tokens?

They are spelled out in the respective standards: C11 section 6.4 and C++11 section 2.4. In both cases, they correspond to the production preprocessing-token. Apart from pp-number, they shouldn't be too surprising. The remaining possibilities are identifiers (including keywords), punctuators (in C++, preprocessing-op-or-punc), string and character literals, and any single non-whitespace character that doesn't match any other production.

With a few exceptions, any sequence of characters can be decomposed into a sequence of valid preprocessing tokens. (One exception is unmatched quotes and apostrophes: a single quote or apostrophe is not a valid preprocessing token, so text containing an unterminated string or character literal cannot be tokenized.)

In the context of the ## operator, however, the result of the concatenation must be a single preprocessing token. Thus, an invalid concatenation is one whose result is a sequence of characters that comprises multiple preprocessing tokens.
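
As a minimal sketch of this rule (the PASTE macro, the struct and the function names are made up for illustration and do not come from the question), each paste below yields exactly one preprocessing token and is therefore valid, while the commented-out line would have to produce a token that does not exist:

```c
/* Hypothetical helper, not part of the question: pastes two tokens. */
#define PASTE(a, b) a ## b

/* '<<' ## '=' gives the single punctuator '<<='. */
int shift_left_by_two(int x)
{
    x PASTE(<<, =) 2;
    return x;
}

/* '-' ## '>' gives the single punctuator '->'. */
struct point { int x; };

int get_x(struct point *p)
{
    return p PASTE(-, >) x;
}

/* Invalid: '-' ## '7' would have to be the single token '-7',
 * but '-7' is always two tokens.
 * int bad = PASTE(-, 7);
 */
```

Whether a compiler diagnoses the invalid case is up to the implementation, which is why it is left commented out here.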

Are there any differences between C and C++?

Yes, there are some minor differences:

  • C++ has user-defined string and character literals, and allows "raw" string literals. These literals are tokenized differently in C, so they might be multiple preprocessing tokens or (in the case of raw string literals) even invalid preprocessing tokens.

  • C++ includes the tokens ::, .* and ->*, each of which would be tokenized as two punctuator tokens in C. Also, in C++ some things that look like keywords (for example new, delete) are part of preprocessing-op-or-punc (although these are valid preprocessing tokens in both languages).

  • C allows hexadecimal floating-point literals, whose exponent is introduced by p (so that, for example, 1.1p-3 is a single pp-number); before C++17 these are not valid preprocessing tokens in C++.

  • C++ allows apostrophes to be used as separators in integer literals (1'000'000'000). In C, this will likely result in unmatched apostrophes.

  • There are minor differences in the handling of universal character names (e.g. \u0234).

  • In C++, <:: is tokenized as < followed by ::, unless it is followed by : or >. (<::: and <::> are tokenized normally, using the longest-match rule.) C has no exception to the longest-match rule; <:: is always tokenized by longest match, so the first token is always <:.

Is it legal to concatenate test_, 0001 and _THING, even though the concatenation order is unspecified?

Yes, it is legal in both languages.

test_ ## 0001 => test_0001             (identifier)
test_0001 ## _THING => test_0001_THING (identifier)

0001 ## _THING => 0001_THING           (pp-number)
test_ ## 0001_THING => test_0001_THING (identifier)
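
The result can also be checked with a compiler. This is only a sketch: THING_NAME is a hypothetical stand-in for the question's test macro, and STR, STR2 and the two functions are helpers invented here to inspect the pasted name.

```c
#include <string.h>

/* Same pattern as the question's macro, under the hypothetical name THING_NAME. */
#define THING_NAME(A) test_ ## A ## _THING

/* Two-level stringification so the argument is expanded before '#' is applied. */
#define STR2(x) #x
#define STR(x) STR2(x)

/* Declares a variable named test_0001_THING. */
int THING_NAME(0001) = 2;

/* Returns 1 if the pasted name is spelled test_0001_THING. */
int name_is_as_expected(void)
{
    return strcmp(STR(THING_NAME(0001)), "test_0001_THING") == 0;
}

/* Reads the variable through its literal name. */
int read_thing(void)
{
    return test_0001_THING;
}
```

Both intermediate results (test_0001 and 0001_THING) are single tokens, so the outcome does not depend on which ## the compiler evaluates first.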

      

What are examples of invalid token concatenation?

Suppose we have

#define concat3(a, b, c) a ## b ## c

      



Now, the following are invalid regardless of concatenation order:

concat3(., ., .)

      

.. is not a token, even though ... is. But the concatenation must proceed in some order, and .. would be a necessary intermediate value; since that is not a single token, the concatenation is invalid.

concat3(27,e,-7)

      

Here, -7 is two tokens, so it cannot be concatenated.

And here's a case where the order of the concatenation matters:

concat3(27e, -, 7)

      

If this is concatenated left to right, it results in 27e- ## 7, which is a concatenation of two pp-numbers. But - cannot be concatenated with 7, because (as above) -7 is not a single token, so the right-to-left order is invalid.
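
For contrast, here is a sketch of a split of the same number that is valid in either order, because the sign stays attached to an exponent symbol inside a pp-number. The function name small_value is made up for illustration.

```c
#define concat3(a, b, c) a ## b ## c

/* Left to right:  2 ## 7e- gives 27e-, then 27e- ## 7 gives 27e-7.
 * Right to left:  7e- ## 7 gives 7e-7, then 2 ## 7e-7 gives 27e-7.
 * Every intermediate result is a single pp-number. */
double small_value(void)
{
    return concat3(2, 7e-, 7);
}
```

The argument 7e- is itself a single pp-number, which is what makes this split independent of the concatenation order.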

What is a pp-number?

In general terms, pp-number is a superset of the tokens that can be converted to (single) numeric literals; a pp-number may also turn out to be invalid. The definition is deliberately broad, partly to allow (some) token concatenations and partly to insulate the preprocessor from periodic changes in numeric formats. The exact definition can be found in the respective standards, but informally, a token is a pp-number if:

  • It starts with a decimal digit, or with a period (.) followed by a decimal digit.
  • The rest of the token consists of letters, digits and periods, possibly including sign characters (+, -) if preceded by an exponent symbol. The exponent symbol can be E or e in both languages, and also P or p in C since C99.
  • In C++, a pp-number can also include (but not start with) an apostrophe followed by a letter or digit.
  • Note: above, "letter" includes the underscore. Universal character names can also be used (except following an apostrophe in C++).

Once preprocessing is complete, all pp-numbers are converted to numeric literals, if possible. If the conversion is not possible (because the token does not match the syntax of any numeric literal), the program is invalid.
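
The breadth of the definition can be observed by stringifying pp-numbers, since a stringified token never has to be converted to a numeric literal. This is a sketch only: SPELLING, odd_numbers and spelling_matches are hypothetical names chosen for this example.

```c
#include <string.h>

/* Hypothetical helper: turns its argument's spelling into a string literal. */
#define SPELLING(x) #x

/* Each argument is a single pp-number, although none is a valid numeric literal. */
static const char *const odd_numbers[] = {
    SPELLING(0001_THING),  /* digits followed by letters and underscores */
    SPELLING(1.2.3),       /* more than one period */
    SPELLING(.5e+x),       /* period, digit, then a sign after an exponent symbol */
};

/* Returns 1 if entry i has the given spelling. */
int spelling_matches(int i, const char *expected)
{
    return strcmp(odd_numbers[i], expected) == 0;
}
```

Using any of these spellings directly as a number, rather than inside SPELLING, would make the program invalid at the later conversion step.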



#define test(A) test_## A ## _THING
int test(0001) = 2;

      

It is legal under both LTR and RTL evaluation, since both test_0001 and 0001_THING are valid preprocessing tokens. The former is an identifier and the latter is a pp-number; pp-numbers are not checked for a correct suffix until a later phase of compilation. Consider, for example, 0001u, an unsigned octal literal.
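
A small sketch of the suffix case, assuming a hypothetical PASTE helper and function name: the digits and the suffix are pasted into one pp-number, which is only interpreted as an unsigned octal literal after preprocessing.

```c
/* Hypothetical helper, not part of the answer: pastes two tokens. */
#define PASTE(a, b) a ## b

/* 017 ## u gives the pp-number 017u, which is later read as an
 * unsigned octal literal (decimal 15). */
unsigned int fifteen(void)
{
    return PASTE(017, u);
}
```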

A few examples to show that the order of evaluation matters:

#define paste2(a,b) a##b
#define paste(a,b) paste2(a,b)
#if defined(LTR)
#define paste3(a,b,c) paste(paste(a,b),c)
#elif defined(RTL)
#define paste3(a,b,c) paste(a,paste(b,c))
#else
#define paste3(a,b,c)  a##b##c
#endif
double a = paste3(1,.,e3), b = paste3(1e,+,3);  // OK LTR, invalid RTL

#define stringify2(x) #x
#define stringify(x) stringify2(x)
#define stringify_paste3(a,b,c) stringify(paste3(a,b,c))
char s[] = stringify_paste3(%:,%,:);            // invalid LTR, OK RTL

      



If your compiler uses a consistent order of evaluation (either LTR or RTL) and reports an error when an invalid pp-token is generated, then exactly one of these lines will produce an error. Naturally, a lax compiler could well allow both, while a strict compiler might allow neither.

The second example is rather contrived; because of the way the grammar is constructed, it is very difficult to find a pp-token that is valid when built RTL but not when built LTR.

There are no significant differences between C and C++ in this regard; the two standards use identical language (down to the section headings) to describe the macro replacement process. The only way the language could influence the process is through the set of valid preprocessing tokens: C++ (especially recently) has more forms of valid preprocessing tokens, such as user-defined string literals.
