8

I am currently learning how compilation and linking work in C++. I think I roughly understand how the compiler works: for a file to compile, you don't need function implementations, only declarations. It is the linker's job to link the function declaration to its implementation.

But now I have this question: if I have a .cpp file that uses 1000 different functions, and each of those functions has its own separate .cpp and .h file, how does the linker know which of the .cpp files to scan to find a specific function? Does the linker know where each function is located, or does it scan the whole project every time, for every function?

tripleee
artas2357

4 Answers

13

The linker is given the list of files to use explicitly, on its command line. These can be object files (.obj / .o), which contain compiled code, or libraries (.lib / .a), which bundle several object files into a single file.

Part of the job of the linker is to establish a list of the available functions and assign them an address in memory. This is enough to generate the calls where needed.

Note that the linker does not see the source code itself.

11

The compiler compiles from a .cpp file to an object file (.o) with the binary code.

The linker combines all of the object files together into a single binary.

So, the linker doesn't need to know which cpp to look at, because the linker doesn't look at cpp files. Instead, the linker looks at all of the .o files, figures out where all the functions are, and combines them together.

D.W.
9

"It is the linker's job to link the function declaration to its implementation"

This is not true. This is the compiler's job. Essentially, the declaration (I'm assuming you mean the prototype, as in return_type function_name (arguments...)) is the part of the language that lets you tell the compiler what the function looks like before the compiler finds its implementation (the code body between { and }).

In fact, a lot of functions are declared in the same place they are implemented. However, it is also common to declare function prototypes without an implementation in headers or in abstract classes (as pure virtual functions). This basically tells the compiler "hey, this is what the function looks like, but you will find the implementation later or somewhere else".
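To make the declaration/implementation split concrete, here is a minimal sketch. The names add and add_three are just illustrative. The caller is compiled when only the prototype of add has been seen; the body appears later in the same file, but it could equally sit in a different .cpp file.

```cpp
// Declaration (prototype): enough for the compiler to type-check the calls below.
int add(int a, int b);

// This caller is compiled before the compiler has seen the body of add.
int add_three(int a, int b, int c) {
    return add(add(a, b), c);
}

// Definition: here it follows later in the same file, but in a real
// project it could just as well live in another .cpp file.
int add(int a, int b) {
    return a + b;
}
```

Calling add_three(1, 2, 3) from main returns 6. If the definition of add were deleted, each call would still compile, and the missing body would only be reported when the program is linked.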

By the time you run the linker it is no longer processing things like declarations. Instead, linkers deal only with compiled code and some of that code consists of functions.

The final binary code does not actually contain any function definitions. Instead, code is simply a long list of machine instructions. A function is simply an address to somewhere in the code. For example a function like add() may be located at the 12566th byte of the binary code (or 0x00003116 in hex which is a more common way to look at addresses). The code that calls that function will simply be the instruction call 0x00003116. At this stage there is no information that that location contains a function.
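The "a function is just an address" idea can be seen from within C++ itself through function pointers. This is only a sketch: call_at and address_of are made-up helper names, and the numeric address you get will differ between builds and often between runs.

```cpp
#include <cstdint>

int add(int a, int b) { return a + b; }

// A function pointer holds the address where a function's machine code starts.
using BinaryOp = int (*)(int, int);

// Calls whatever code lives at the given address.
int call_at(BinaryOp target, int a, int b) {
    return target(a, b);
}

// Exposes the address as a plain number. The value is not fixed:
// it depends on the compiler, the linker and where the program is loaded.
std::uintptr_t address_of(BinaryOp fn) {
    return reinterpret_cast<std::uintptr_t>(fn);
}
```

call_at(&add, 2, 3) returns 5, exactly like calling add(2, 3) directly, because both end up jumping to the same address.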

However, linkers work with object (or library) files that contain metadata, not just pure binary code. The actual format of these files depends on a lot of things: the language you use, the operating system you are on, the type of file, etc. (for example, .o or .obj object files, .a and .so library files on Linux, .lib and .dll files on Windows). However, what these files must contain is a list of what are sometimes called "symbols", together with the addresses those symbols point to. These symbols are basically the function's name as seen by the linker; in C++ the name is usually mangled so that it also encodes the parameter types, which tells the linker things like how many arguments the function accepts. This list of symbols is usually called the symbol table.

Actually, during compilation the compiler keeps a data structure called the symbol table in RAM in order to remember what it has compiled. Once the code is compiled, the compiler will format this symbol table appropriately and insert it in the object or library file.

When the linker sees that some code is calling add(), it scans the list of files it is working on (or looks up the database/array in which it stored all the scanned files) and checks their symbol tables to find add(). Once found, it replaces the caller's symbol with the address of the add() function (e.g., 0x00003116). There is another step, called the fixup, which recalculates the address in RAM depending on how the executable or library file is loaded into RAM, but I'll leave the details of that process as further research. It's enough to know that the linker's job is to load all the files you pass to it, remember all their symbol tables, and then replace symbols in code with actual addresses in memory.
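The lookup-and-replace step can be sketched as a toy model. Everything here (SymbolTable, CallSite, merge, resolve) is a hypothetical simplification for illustration; real object formats and linkers are far more involved.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical model: a symbol table maps a symbol name to an address.
using SymbolTable = std::map<std::string, std::uint32_t>;

// A call site that still names its target instead of holding an address.
struct CallSite {
    std::string target;    // e.g. "add"
    std::uint32_t address; // filled in by resolve()
};

// Remember the symbol tables of all input files in one combined table.
SymbolTable merge(const std::vector<SymbolTable>& files) {
    SymbolTable all;
    for (const auto& file : files)
        for (const auto& entry : file)
            if (!all.insert(entry).second)
                throw std::runtime_error("duplicate symbol: " + entry.first);
    return all;
}

// Replace each symbolic name with the address found in the combined table.
void resolve(std::vector<CallSite>& calls, const SymbolTable& symbols) {
    for (auto& call : calls) {
        auto it = symbols.find(call.target);
        if (it == symbols.end())
            throw std::runtime_error("undefined symbol: " + call.target);
        call.address = it->second;
    }
}
```

With one table containing add at 0x3116 and a call site whose target is "add", resolve fills in 0x3116 as the address. The two exceptions roughly correspond to the familiar "multiple definition" and "undefined reference" linker errors.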

tripleee
slebetman
3

For C and C++, the linker's job is the same, ignoring template classes/functions.

The .o / .obj files are split into Sections, each of which has a Symbol Table.

The Symbol Table has the offset into this Section of each exported Symbol (function, global variable), and a set of Relocations. These are the function calls and references to global variables that still need an address. The linker will determine which Object file contains _main, and add it to the .exe it is building. It will then scan the Relocations for the Section, and determine that _add is being used. It then finds the .obj file that provides _add, and adds it to the .exe. It then patches the Section that has _main at the offset given in the Relocation Table, and writes the Absolute or Relative address of _add there. It then locates all the Symbols that the Section containing _add requires, and processes them in the same way.
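The walk described above (start at _main, follow references, add whatever provides them) can be modelled as a small worklist. ObjectFile and objects_needed are hypothetical names, and the model ignores Sections and addresses entirely; it only tracks which files get pulled in.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of an object file: what it defines and what it references.
struct ObjectFile {
    std::string name;
    std::vector<std::string> defines;
    std::vector<std::string> references; // taken from its relocation entries
};

// Returns the names of the object files reached from the entry symbol.
std::set<std::string> objects_needed(const std::vector<ObjectFile>& objects,
                                     const std::string& entry) {
    // Index: which object file provides which symbol.
    std::map<std::string, const ObjectFile*> provider;
    for (const auto& obj : objects)
        for (const auto& sym : obj.defines)
            provider[sym] = &obj;

    std::set<std::string> included;
    std::vector<std::string> pending{entry};
    while (!pending.empty()) {
        std::string sym = pending.back();
        pending.pop_back();
        auto it = provider.find(sym);
        if (it == provider.end()) continue; // a real linker would report an undefined symbol
        const ObjectFile& obj = *it->second;
        if (!included.insert(obj.name).second) continue; // already added
        for (const auto& ref : obj.references)
            pending.push_back(ref); // process what this file needs in turn
    }
    return included;
}
```

Given foo.obj (defines _main, references _add), bar.obj (defines _add) and a third file that nothing references, starting from _main yields only foo.obj and bar.obj.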

A .a / .lib file is a collection of Object files, optionally with an Index that lists all the Symbols the Library contains. For MS Visual Studio, this Index is processed once, and the Object files are added to the .exe as needed. UNIX linkers traditionally scan each Library once, in command-line order, pulling in only what is needed at that point, so if a Library needs another Library, the needed Library must be listed after the Library that requires it.
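Why command-line order matters for a single left-to-right scan can be shown with a toy model. Input and link_in_order are hypothetical, and the sketch treats a Library as all-or-nothing instead of picking individual Object files out of it, which is a simplification.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical model: an input defines some symbols and needs others.
struct Input {
    std::string name;
    bool is_library;
    std::set<std::string> defines;
    std::set<std::string> needs;
};

// Processes inputs strictly left to right, visiting each one once.
// Returns the symbols that are still undefined at the end.
std::set<std::string> link_in_order(const std::vector<Input>& inputs) {
    std::set<std::string> defined, undefined;
    for (const auto& in : inputs) {
        if (in.is_library) {
            // A library is used only if it satisfies something already pending.
            bool useful = false;
            for (const auto& sym : in.defines)
                if (undefined.count(sym)) useful = true;
            if (!useful) continue; // skipped, and never revisited
        }
        for (const auto& sym : in.defines) {
            defined.insert(sym);
            undefined.erase(sym);
        }
        for (const auto& sym : in.needs)
            if (!defined.count(sym)) undefined.insert(sym);
    }
    return undefined;
}
```

Suppose main.o needs f, libA defines f but needs g, and libB defines g. In the order main.o, libA, libB everything resolves. In the order main.o, libB, libA, libB is skipped because nothing needs g yet, and g is left undefined once libA has been added.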

A worked example may be helpful. Assume that the linker is looking for _main, and it finds it in foo.obj, at offset 100 in the .text Section (the Section that holds code). This Section is added to the .text Section of the .exe, and the caller is updated to have the address of _main. A C or C++ program does not start at _main: there is a function that initialises the C run time and any global variables, and then calls _main. In MSVC command-line programs this is _mainCRTStartup, but it will have other names for other compilers.

The linker then finds _add in bar.obj in Section .text, and adds it to the .exe .text Section. Assume that the foo.obj .text Section was 500 bytes long; this means that the bar.obj .text Section now starts 500 bytes in. _add is 200 bytes into the bar.obj .text Section, so _add is 700 bytes into the .exe .text Section. Thus 700 is written into the .exe .text Section at location 100, which is where _add is called. The process repeats until all Relocations have had their Sections added to the .exe.
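The arithmetic in this example can be checked with a short sketch that concatenates Sections and recomputes each Symbol's offset. Section and lay_out are hypothetical names; the sizes and offsets you pass in are whatever the Object files report.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of one object file's section.
struct Section {
    std::string object;                           // e.g. "foo.obj"
    std::uint32_t size;                           // section length in bytes
    std::map<std::string, std::uint32_t> symbols; // symbol -> offset within this section
};

// Concatenates the sections in order and returns each symbol's
// offset within the combined output section.
std::map<std::string, std::uint32_t> lay_out(const std::vector<Section>& sections) {
    std::map<std::string, std::uint32_t> final_offsets;
    std::uint32_t base = 0; // where the current section starts in the output
    for (const auto& sec : sections) {
        for (const auto& sym : sec.symbols)
            final_offsets[sym.first] = base + sym.second;
        base += sec.size; // the next section starts where this one ends
    }
    return final_offsets;
}
```

With the foo.obj Section at 500 bytes (containing _main at offset 100) followed by the bar.obj Section (containing _add at offset 200), lay_out places _add at 500 + 200 = 700, matching the number written into the call site above.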

CSM