Why doesn't the Python grammar spec include docstrings and comments?

I am consulting the official Python grammar specification for Python 3.6.

I cannot find any syntax for comments (they are prefixed with #) or docstrings (they appear within '''). A quick look at the lexical analysis page did not help either: docstrings are defined there as longstrings, but they do not appear in the grammar specification. A type named STRING appears further on, but no reference to its definition is given.

With that in mind, I'm wondering how the CPython compiler knows about comments and docstrings. How is this feat achieved?

Initially I assumed that comments and docstrings are removed in a first pass by the CPython compiler, but then the question arises: how can help() display the corresponding docstrings?

+3




2 answers


Section 1

What happens to the comments?

Comments (anything following a #) are ignored during tokenization/lexical analysis, so there is no need to write grammar rules to parse them. They do not provide semantic information to the interpreter/compiler; they only serve to improve the readability of your program for the reader, and are therefore discarded.

Here's the lex specification for the ANSI C programming language: http://www.quut.com/c/ANSI-C-grammar-l-1998.html. I'd like to draw your attention to the way comments are handled there:

"/*"            { comment(); }
"//"[^\n]*      { /* consume //-comment */ }


Now let's take a look at the rule for int:

"int"           { count(); return(INT); }


Here's the lex function that handles int and other tokens:

void count(void)
{
    int i;

    for (i = 0; yytext[i] != '\0'; i++)
        if (yytext[i] == '\n')
            column = 0;
        else if (yytext[i] == '\t')
            column += 8 - (column % 8);
        else
            column++;

    ECHO;
}


You can see here that it ends with an ECHO statement, which means it is a valid token that needs to be parsed.

Now, here's the lex function handling comments:

void comment(void)
{
    char c, prev = 0;

    while ((c = input()) != 0)      /* (EOF maps to 0) */
    {
        if (c == '/' && prev == '*')
            return;
        prev = c;
    }
    error("unterminated comment");
}


There is no ECHO here, so nothing is returned.

This is a typical example, and Python does the same thing.
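
You can verify this from Python itself. Here is a minimal sketch of my own (not from the CPython sources) comparing the parse trees of the same statement with and without a comment:

>>> import ast
>>> with_comment = ast.dump(ast.parse("x = 1  # never reaches the parser"))
>>> without_comment = ast.dump(ast.parse("x = 1"))
>>> with_comment == without_comment
True

The two trees are identical: the comment was discarded during tokenization, before the parser ever ran.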


Section 2

What's going on with docstrings?

Note: this section of my answer is intended to complement @MartijnPieters' answer, not to duplicate any of the information he provides in his post. Now, with that said, ...

Initially I assumed that comments and docstrings are removed in a first pass by the CPython compiler [...]

Docstrings (string literals that are not assigned to any variable name: anything inside '...', "...", '''...''' or """...""") are indeed processed. They are parsed as plain string literals (the STRING+ token), as Martijn Pieters mentions in his answer. As far as the current docs are concerned, they only mention in passing that docstrings are assigned to the function/class/module __doc__ attribute. How this is done is not actually stated anywhere.

What actually happens is that they are tokenized and parsed as string literals, and the resulting parse tree contains them. Bytecode is generated from that parse tree, with the docstring placed in its rightful spot in the __doc__ attribute (the docstring is not part of the function's bytecode proper, as shown below). I won't go into details, since the answer linked above describes the same thing in very nice detail.

Of course, they can also be dropped entirely. If you run python -OO (the -OO flag means "optimize intensely", as opposed to -O which means "optimize mildly"), the resulting bytecode is stored in .pyo files, which exclude docstrings.

An illustration can be seen below:



Create a file test.py with the following code:

def foo():
    """ docstring """
    pass


We will now compile this code with the normal flags set.

>>> code = compile(open('test.py').read(), '', 'single')
>>> import dis
>>> dis.dis(code)
  1           0 LOAD_CONST               0 (<code object foo at 0x102b20ed0, file "", line 1>)
              2 LOAD_CONST               1 ('foo')
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (foo)
              8 LOAD_CONST               2 (None)
             10 RETURN_VALUE


As you can see, there is no mention of our docstring in the bytecode. However, it is there. To get the docstring, you can do ...

>>> code.co_consts[0].co_consts
(' docstring ', None)


So, as you can see, the docstring persists, just not as part of the main bytecode. Now let's recompile this code, but with an optimization level of 2 (the equivalent of the -OO switch):

>>> code = compile(open('test.py').read(), '', 'single', optimize=2)
>>> dis.dis(code)
  1           0 LOAD_CONST               0 (<code object foo at 0x102a95810, file "", line 1>)
              2 LOAD_CONST               1 ('foo')
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (foo)
              8 LOAD_CONST               2 (None)
             10 RETURN_VALUE
      

No difference there. However, ...

>>> code.co_consts[0].co_consts
(None,)


The docstrings are now gone.

The -O and -OO flags only remove things (bytecode optimization is done by default): -O strips assert statements and if __debug__: suites from the generated bytecode, while -OO discards docstrings in addition. The resulting compile time decreases slightly. Execution speed remains the same unless you have a large number of assert statements and if __debug__: suites; for typical code there is no impact on performance.
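
Here is a minimal sketch of the -O behaviour, using the optimize parameter of compile() (the same parameter used above; optimize=1 corresponds to -O). The name condition is never evaluated, it just has to parse:

>>> src = "assert condition, 'assertion message'"
>>> code0 = compile(src, '<demo>', 'exec', optimize=0)
>>> code1 = compile(src, '<demo>', 'exec', optimize=1)  # equivalent of -O
>>> 'assertion message' in code0.co_consts
True
>>> 'assertion message' in code1.co_consts
False

With optimize=1 the assert statement, message and all, never makes it into the bytecode.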

Also remember that docstrings are only kept when they are the first statement in a function/class/module definition. Any additional bare string literals are simply dropped at compile time. If you change test.py to the following:

def foo():
    """ docstring """

    """test"""
    pass


And then repeat the same process with optimize=0, this is what is stored in co_consts after compilation:

>>> code.co_consts[0].co_consts
(' docstring ', None)


The """test""" value was discarded. You may be interested to know that this removal is done as part of the basic optimizations performed on the bytecode.
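
This also clears up the original question about help(): it does not need the grammar to know anything about docstrings, it simply reads the __doc__ attribute that the compiler populated. A quick check:

>>> def foo():
...     """ docstring """
...     pass
...
>>> foo.__doc__
' docstring '

help(foo) reads and formats this very attribute; no comments or special grammar are involved.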


Section 3

Additional reading

(You may find these links as interesting as I did.)

+6




A docstring is not a separate grammar entity. It is just a normal simple_stmt (following that rule down to atom and STRING+)*. If it is the first statement in the body of a function, class, or module, then the compiler uses it as the docstring.

This is documented in the reference documentation as footnotes to the class and def compound statements:

[3] A string literal appearing as the first statement in the function body is transformed into the function's __doc__ attribute and therefore the function's docstring.

[4] A string literal appearing as the first statement in the class body is transformed into the namespace's __doc__ item and therefore the class's docstring.

There is currently no reference documentation that states the same for modules; I consider that a documentation bug.

Comments are removed by the tokenizer and never need to be parsed as grammar. Their whole point is to carry no meaning at the grammar level. See the Comments section of the Lexical Analysis documentation:

A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line. A comment signifies the end of the logical line unless the implicit line joining rules are invoked. Comments are ignored by the syntax; they are not tokens .

The bold emphasis is mine. So the tokenizer skips comments entirely:

/* Skip comment */
if (c == '#') {
    while (c != EOF && c != '\n') {
        c = tok_nextc(tok);
    }
}
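
(Side note: the pure-Python tokenize module, which exists so source-processing tools can round-trip source code, does report comments as COMMENT tokens; it is the C tokenizer shown above, used by the interpreter itself, that skips them. A small sketch:)

>>> import io, tokenize
>>> src = "x = 1  # hi\n"
>>> [t.string for t in tokenize.generate_tokens(io.StringIO(src).readline)
...  if t.type == tokenize.COMMENT]
['# hi']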


Note that Python source code goes through 3 steps:

  1. Tokenizing
  2. Parsing
  3. Compilation

The grammar applies only to the parsing stage: comments are dropped by the tokenizer, and docstrings are only special to the compiler.

To illustrate that the parser does not treat docstrings as anything other than a string literal expression, you can access any Python parse result as an abstract syntax tree, via the ast module. This produces Python objects that directly reflect the parse tree the Python parser produces, and from which the Python bytecode is then compiled:

>>> import ast
>>> function = 'def foo():\n    "docstring"\n'
>>> parse_tree = ast.parse(function)
>>> ast.dump(parse_tree)
"Module(body=[FunctionDef(name='foo', args=arguments(args=[], vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[]), body=[Expr(value=Str(s='docstring'))], decorator_list=[], returns=None)])"
>>> parse_tree.body[0]
<_ast.FunctionDef object at 0x107b96ba8>
>>> parse_tree.body[0].body[0]
<_ast.Expr object at 0x107b16a20>
>>> parse_tree.body[0].body[0].value
<_ast.Str object at 0x107bb3ef0>
>>> parse_tree.body[0].body[0].value.s
'docstring'


So you have a FunctionDef object whose body contains, as its first element, an expression that is a Str with the value 'docstring'. It is the compiler that then generates the code object, storing that docstring in a separate attribute.

We can compile the AST into bytecode with the compile() function; again, this uses the actual code paths the Python interpreter uses. We'll use the dis module to decompile the bytecode for us:



>>> codeobj = compile(parse_tree, '', 'exec')
>>> import dis
>>> dis.dis(codeobj)
  1           0 LOAD_CONST               0 (<code object foo at 0x107ac9d20, file "", line 1>)
              2 LOAD_CONST               1 ('foo')
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (foo)
              8 LOAD_CONST               2 (None)
             10 RETURN_VALUE


So the compiled code produced the top-level statements for a module. The MAKE_FUNCTION opcode uses a stored code object (part of the top-level code object's constants) to build the function. So we look at that nested code object, at index 0:

>>> dis.dis(codeobj.co_consts[0])
  1           0 LOAD_CONST               1 (None)
              2 RETURN_VALUE


There is no docstring to be seen here; the function does nothing but return None. The docstring is instead stored as a constant:

>>> codeobj.co_consts[0].co_consts
('docstring', None)


When the MAKE_FUNCTION opcode is executed, it is that first constant, provided it is a string, that is turned into the __doc__ attribute of the function object.

Once compiled, we can execute the code object with the exec() function in a given namespace, which adds a function object with a docstring:

>>> namespace = {}
>>> exec(codeobj, namespace)
>>> namespace['foo']
<function foo at 0x107c23e18>
>>> namespace['foo'].__doc__
'docstring'


So it is the compiler's job to determine when something is a docstring. This is done in C code, in the compiler_isdocstring() function:

static int
compiler_isdocstring(stmt_ty s)
{
    if (s->kind != Expr_kind)
        return 0;
    if (s->v.Expr.value->kind == Str_kind)
        return 1;
    if (s->v.Expr.value->kind == Constant_kind)
        return PyUnicode_CheckExact(s->v.Expr.value->v.Constant.value);
    return 0;
}


This is called from the locations where a docstring makes sense: for modules and classes in compiler_body(), and for functions in compiler_function().
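
The effect of this check is easy to see from the Python side; a minimal sketch with a throwaway demo function:

>>> def baz():          # demo function, any name will do
...     42
...     "not the first statement, so not a docstring"
...
>>> baz.__doc__ is None
True

Because the first statement in the body is not a string literal, compiler_isdocstring() returns 0 and nothing is stored in __doc__.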


TL;DR: Comments are not part of the grammar because the parser never even sees them; they are skipped by the tokenizer. Docstrings are not part of the grammar because, to the parser, they are just string literals. It is the compilation step (which consumes the parser's output tree) that interprets those string expressions as docstrings.


* The full grammar rule path: simple_stmt → small_stmt → expr_stmt → testlist_star_expr → star_expr → expr → xor_expr → and_expr → shift_expr → arith_expr → term → factor → power → atom_expr → atom → STRING+

+7



