What are the basic steps to compile?

What are the basic steps for building a C program? By compiling, I mean (probably wrongly) getting a binary from plain text containing C code using gcc.

I would like to understand some key points of the process:

  • At the end of the day, I need to convert my C code to a language that my processor understands. So who takes care of finding out my CPU's instructions? The operating system?

  • Does gcc convert C code to assembly?

  • I know (actually guess) that for each type of processor I need an assembler that will interpret (?) the assembly code and translate it into my specific processor's instructions. Where does this assembler come from (who ships it)? Does it come with the OS?

  • Why exactly can't I see 0s and 1s if I open the binary with a text editor?

+3




4 answers


At the end of the day, I need to convert my C code to a language that my processor understands. So who takes care of finding out my CPU's instructions? The operating system?

You are not very clear here. If you're asking which tools know your specific processor's instructions, those are the assembler, the disassembler, the debugger, and maybe a few others. They can generate machine code or convert it back into a disassembly.

If you are asking who cares about which instructions are used, it is the processor that must execute them, since each instruction set encodes even a common operation such as "add two integers" in a completely different way.

Does gcc convert C code to assembly?

Yes, GCC converts C (or a program in any other supported language) to assembly. There are many steps, and at least two additional internal representations are used along the way. The details are explained in the GCC internals documentation. At the end, the compiler "back end" generates assembly from simple "templates" produced by the earlier compiler passes. You can ask GCC to dump this assembly with the -S flag. Unless you specifically ask for that, the next step (assembling) is executed automatically and you only see the final executable.



I know (actually guess) that for each type of processor I need an assembler that will interpret (?) the assembly code and translate it into my specific processor's instructions. Where does this assembler come from (who ships it)? Does it come with the OS?

First, note that assembly languages are different for each CPU, since they are meant to map 1:1 to the processor's machine language. The assembler then translates the assembly code into machine code. Who ships it? Anyone who builds it. With the GNU toolchain, it is part of the binutils package, which is installed by default on most Linux distributions. It is not the only assembler available. Also note that although the GNU "suite" (GCC / binutils / gdb) supports many architectures, you need to use the appropriate port for your architecture. For example, the default assembler on your PC cannot assemble ARM machine code.

Why exactly can't I see 0s and 1s if I open the binary with a text editor?

Because a text editor has to show a textual representation of those 0s and 1s. Assuming each character in the file is 8 bits long, it interprets every group of eight bits as a single character instead of showing the individual bits. If you know that the ASCII letter "A" is encoded as 65, you can convert that back to binary yourself: 01000001. It is slightly easier to convert a hexadecimal representation back to binary; you can use the hexdump tool (or similar) to get one.

+2




A lot happens :)

Here are some of the key steps (BTW, this is how I think about compiling; the steps below only loosely resemble the phases defined in the standard).



  • The preprocessor runs on the source file.

    The preprocessor does all sorts of things for us, including:

    • It performs trigraph replacement (trigraphs are special three-character sequences that stand for characters that some early keyboards did not have).
    • It performs macro substitution (i.e. #define) by simple text replacement.
    • It grabs any header files and copies their entire contents to the place where the #include line was.

    On Linux, the program that does this is cpp, and using gcc, you can stop after this step with the -E flag.

  • Once the preprocessor has run, we have a file containing everything the parser needs to check our syntax and emit assembly. On Linux, the program that does this is most likely cc1, and using gcc, you can stop after this step with the -S flag.

  • The assembly is converted to object code, most likely by the program as (gas, the GNU Assembler), and using gcc, you can stop after this step with -c.

  • Finally, one or more object files, along with libraries, are converted into an executable by the linker. On Linux the linker is commonly ld, and running gcc without any special flags goes all the way through this step.

+8




Since you specifically mentioned, "By the end of the day, I need to convert my C code to a language that my processor needs to understand," I'll explain a little about how compilers work.

Typical compilers do several things.

First, they do something called lexing. This step takes individual characters and combines them into "tokens" that the next step understands. It distinguishes between keywords (for example, "for" and "if" in C), operators (for example, "+"), constants (for example, integer and string literals), and other things. Exactly what it distinguishes depends on the language itself.

The next step is the parser, which takes the stream of tokens generated by the lexer and (usually) converts it into something called an "abstract syntax tree", or AST. The AST represents the computation performed by the program as a data structure that the compiler can walk and transform. Usually the AST is language independent, and compilers like GCC can parse different languages into a common AST format that the next step (the code generator) understands.

Finally, the code generator goes through the AST and outputs code that represents the semantics of the AST, that is, the code that actually performs the computations that the AST represents.

In the case of GCC and possibly other compilers, the compiler does not actually generate machine code. Instead, it outputs the assembly code, which it passes to the assembler. The assembler goes through a similar process of lexing, parsing, and code generation to generate machine code. After all, an assembler is simply a compiler that compiles assembly code.

In the case of C (and many other languages), the assembler is usually not the last step. The assembler creates so-called object files, which contain unresolved references to functions in other object files or libraries (for example, printf in the standard C library, or functions from other C files in your project). These object files are fed to something called a "linker", whose job is to combine all the object files into one binary file and resolve all the unresolved references in them.

Finally, after all these steps, you have a complete executable binary.

Note that this is how GCC and many other compilers work, but none of it is required. Any program that takes a stream of C code and outputs an equivalent stream of other code (assembly, machine code, even JavaScript) is a compiler.

In addition, the stages are not always completely separate. Instead of lexing the whole file, then parsing the whole result, then generating code for the whole AST, the compiler can do a little lexing, start parsing once it has some tokens, and return to lexing when the parser needs more. If the parser decides it knows enough, it can even generate code before asking the lexer for further tokens.

+6




" By the end of the day, I need to convert my C code into a language my processor needs to understand. So who cares about finding out my processor instructions?"

CPU.

But note that on a modern computer, the appearance of a single processor is just an illusion.

This is a good enough conceptual model for simple C programming.


" Is gcc converting any C language to assembler?"

If you ask for it. The -S option generates an assembly listing. For PCs, you can choose between AT&T syntax, which is ugly as sin and overflowing with percent signs, and normal Intel syntax. Unfortunately AT&T (selected via -masm=att, IIRC) is the default, but you can use -masm=intel to get normal assembly.

Even if you do not ask for an assembly listing, gcc still goes through assembly: the compiler proper writes assembly to a temporary file (or to a pipe, with -pipe), and the gcc driver then runs the assembler on it to produce object code.

Using assembly language as an intermediate form adds a step, but it keeps the compiler and the assembler as separate programs.


" I know (actually guess) that for each type of processor I need an assembler that will interpret (?) The assembly code and translate into my specific processor instructions. Where is that assembler (who sends it)? Does it come with the OS?"

You don't need to go looking for such an assembler: the GNU assembler, as, is installed alongside gcc. Unix-like OSes generally include gcc and as, while Windows does not include developer tools. However, the Microsoft dev tools are a free download, now (as of the last week or so) including the full Visual Studio environment. Microsoft's assembler is ml.exe, known as MASM, the Macro Assembler (as if there were no other macro assemblers).


" Why exactly can't I see 0s and 1s if I open a binary file with a text editor?"

It depends on the text editor, although I don't know of one that displays the individual 0s and 1s; text editors are designed to interpret bytes as text.

You could write a text editor that does this yourself, if you like.

Fair warning: it has no practical use that I can think of.


Finally, regarding the question in the title:

"What are the basic steps to compile?"

In practice, there are two main steps: compiling and linking. The compilation step is further subdivided into preprocessing and core language compilation, i.e.

    compilation -> linking

… or really

    (preprocessing and core language compilation) -> linking

During preprocessing, the source files are stitched together via #include directives. This creates a complete translation unit of source code. Core language compilation translates this into an object code file, which contains machine code with some unresolved references.

Then, finally, the linking step combines the object code files (including object code from libraries) to create a single complete executable file.

+1

