5.8 Data and Code Alignment in an Object File

5.8.4 Section Alignment and Library Modules

Section alignment can have a large impact on the size of your executable files if you use a lot of short library routines. Suppose, for example, that you’ve specified an alignment size of 16 bytes for the sections associated with the object files appearing in a library. Each library function that the linker processes will be placed on a 16-byte boundary. If the functions are small (fewer than 16 bytes in length), the space between the functions will be unused when the linker creates the final executable. This is another form of internal fragmentation.
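To make the arithmetic concrete, here is a minimal Python sketch that places a series of routines on 16-byte boundaries and totals the padding between them. The function names (padding_needed, layout) and the routine sizes are hypothetical and chosen only for illustration; a real linker works on sections rather than on a simple list of sizes.

```python
def padding_needed(offset, alignment):
    """Bytes to add so that offset becomes a multiple of alignment."""
    return (-offset) % alignment


def layout(sizes, alignment=16):
    """Place each routine at the next aligned offset.

    Returns (start_offsets, total_padding). Padding after the
    final routine is not counted.
    """
    offset = 0
    starts = []
    total_padding = 0
    for size in sizes:
        pad = padding_needed(offset, alignment)
        total_padding += pad
        offset += pad
        starts.append(offset)
        offset += size
    return starts, total_padding


# Hypothetical library of five tiny routines (sizes in bytes).
sizes = [6, 3, 10, 5, 12]
starts, wasted = layout(sizes)
print(starts)              # [0, 16, 32, 48, 64]
print(sum(sizes), wasted)  # 36 40
```

With these made-up sizes, the 36 bytes of code are accompanied by 40 bytes of padding, so the unused space exceeds the code itself.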

To understand why you would want to align the code (or data) in a section on a given boundary, just remember how cache lines work (see Write Great Code, Volume 1). By aligning the start of a function on a cache line, you may be able to slightly increase the execution speed of that function as it may generate fewer cache misses during execution. For this reason, many programmers like to align all their functions at the start of a cache line.

Although the size of a cache line varies from CPU to CPU, a typical cache line is 16 to 64 bytes long, so many compilers, assemblers, and linkers will attempt to align code and data to one of these boundaries. On the 80x86 processor, there are some other benefits to 16-byte alignment, so many 80x86-based tools default to a 16-byte section alignment for object files.
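As a rough illustration of why the starting address matters, the following Python sketch counts how many cache lines a block of code spans. The function name cache_lines_touched, the 64-byte line size, and the 40-byte function size are assumptions made for the example; actual cache behavior depends on the CPU and on which instructions really execute.

```python
def cache_lines_touched(start, size, line_size=64):
    """Number of cache lines covered by size bytes beginning at start."""
    first_line = start // line_size
    last_line = (start + size - 1) // line_size
    return last_line - first_line + 1


# A hypothetical 40-byte function with 64-byte cache lines:
print(cache_lines_touched(0x1000, 40))  # 1 (starts on a line boundary)
print(cache_lines_touched(0x1030, 40))  # 2 (starts 48 bytes into a line)
```

The same 40 bytes fit in one line when aligned but straddle two lines when they begin late in a line, which is the case that alignment is meant to avoid.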

Consider, for example, the following short HLA (High-Level Assembler) program, processed by Microsoft tools, that calls two relatively small library routines:

program t;
#include( "bits.hhf" )

begin t;

    bits.cnt( 5 );
    bits.reverse32( 10 );

end t;

Here is the source code to the bits.cnt library module:

unit bitsUnit;

#includeonce( "bits.hhf" );

// Counts the number of "1" bits in a dword value.
// This function returns the dword count value in EAX.

procedure bits.cnt( BitsToCnt:dword ); @nodisplay;

const
    EveryOtherBit       := $5555_5555;
    EveryAlternatePair  := $3333_3333;
    EvenNibbles         := $0f0f_0f0f;

begin cnt;

    push( edx );
    mov( BitsToCnt, eax );
    mov( eax, edx );

    // Compute sum of each pair of bits
    // in EAX. The algorithm treats
    // each pair of bits in EAX as a two
    // bit number and calculates the
    // number of bits as follows (description
    // is for bits zero and one, but it generalizes
    // to each pair):
    //
    //  EDX =   BIT1  BIT0
    //  EAX =      0  BIT1
    //
    //  EDX-EAX =   00 if both bits were zero.
    //              01 if Bit0=1 and Bit1=0.
    //              01 if Bit0=0 and Bit1=1.
    //              10 if Bit0=1 and Bit1=1.
    //
    // Note that the result is left in EDX.

    shr( 1, eax );
    and( EveryOtherBit, eax );
    sub( eax, edx );

    // Now sum up the groups of two bits to
    // produce sums of four bits. EAX keeps the
    // LO pair of each nibble, EDX the HO pair
    // (shifted down two bits); adding them
    // leaves a count of 0..4 in each nibble.

    mov( edx, eax );
    shr( 2, edx );
    and( EveryAlternatePair, eax );
    and( EveryAlternatePair, edx );
    add( edx, eax );

    // Add each nibble to its neighbor so that
    // every byte holds a count of 0..8.

    mov( eax, edx );
    shr( 4, eax );
    add( edx, eax );
    and( EvenNibbles, eax );

    // The multiplication sums the four bytes into
    // bits 24..31, giving the number of set bits
    // in EAX's original value. The SHR instruction
    // moves this value into bits 0..7 and zeroes
    // out the HO bits of EAX.

    intmul( $0101_0101, eax );
    shr( 24, eax );

    pop( edx );

end cnt;

end bitsUnit;
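For readers who find the register-level steps hard to follow, here is a Python sketch that mirrors the same sequence of shifts, masks, and additions (as shown in the disassembly of bits.cnt later in this section). The name bit_count and the explicit 32-bit masking are assumptions of this sketch, needed because Python integers are not limited to 32 bits; it is not part of the HLA standard library.

```python
MASK32 = 0xFFFF_FFFF  # keep intermediate values within 32 bits


def bit_count(value):
    """Count the 1 bits in a 32-bit value, step for step like bits.cnt."""
    edx = value & MASK32
    eax = (edx >> 1) & 0x5555_5555        # shr / and EveryOtherBit
    edx = (edx - eax) & MASK32            # each bit pair now holds 0..2
    eax = edx & 0x3333_3333               # LO pair of every nibble
    edx = (edx >> 2) & 0x3333_3333        # HO pair of every nibble
    eax = eax + edx                       # each nibble now holds 0..4
    eax = (eax + (eax >> 4)) & 0x0F0F_0F0F  # each byte now holds 0..8
    eax = (eax * 0x0101_0101) & MASK32    # HO byte = sum of all four bytes
    return eax >> 24


print(bit_count(5))            # 2
print(bit_count(0xFFFF_FFFF))  # 32
```

The call bit_count(5) corresponds to the bits.cnt( 5 ) call in the sample program: the value 5 is %101 in binary, so the count is 2.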

Here is the source code for the bits.reverse32 library function. Note that this source file also includes the bits.reverse16 and bits.reverse8 functions (to conserve space, the bodies of these functions do not appear below).

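Although the operation of these functions is not pertinent to our discussion, note that these functions reverse the order of the bits in their operand, swapping the values in the HO and LO bit positions (for bits.reverse32, bit 0 trades places with bit 31, bit 1 with bit 30, and so on).

Because these three functions appear in a single source file, any program that includes one of these functions will automatically include all three. The assembler produces a single object module from the source file, and the linker typically pulls an object module into the executable as a whole rather than extracting individual functions from it.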

unit bitsUnit;

#include( "bits.hhf" );

procedure bits.reverse32( BitsToReverse:dword ); @nodisplay; @noframe;
begin reverse32;

    push( ebx );
    mov( [esp+8], eax );

    // Swap the bytes in the numbers:

    bswap( eax );

    // Swap the nibbles in the numbers:

    mov( $f0f0_f0f0, ebx );
    and( eax, ebx );
    and( $0f0f_0f0f, eax );
    shr( 4, ebx );
    shl( 4, eax );
    or( ebx, eax );

    // Swap each pair of two bits in the numbers:

    mov( eax, ebx );
    shr( 2, eax );
    shl( 2, ebx );
    and( $3333_3333, eax );
    and( $cccc_cccc, ebx );
    or( ebx, eax );

    // Swap every other bit in the number:

    lea( ebx, [eax + eax] );
    shr( 1, eax );
    and( $5555_5555, eax );
    and( $aaaa_aaaa, ebx );
    or( ebx, eax );

    pop( ebx );
    ret( 4 );

end reverse32;

procedure bits.reverse16( BitsToReverse:word );
    @nodisplay; @noframe;
begin reverse16;

    // Uninteresting code that is very similar to
    // that appearing in reverse32 has been snipped...

end reverse16;

procedure bits.reverse8( BitsToReverse:byte );
    @nodisplay; @noframe;
begin reverse8;

    // Uninteresting code snipped...

end reverse8;

end bitsUnit;
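The four swap stages of bits.reverse32 can be checked with a short Python sketch. The name reverse32 and the use of int.to_bytes/int.from_bytes to imitate the bswap instruction are choices made for this illustration only; the masks are the ones that appear in the HLA source.

```python
def reverse32(value):
    """Reverse the bit order of a 32-bit value using four swap stages."""
    eax = value & 0xFFFF_FFFF
    # Stage 1 (bswap): reverse the order of the four bytes.
    eax = int.from_bytes(eax.to_bytes(4, "little"), "big")
    # Stage 2: swap the two nibbles within each byte.
    eax = ((eax & 0xF0F0_F0F0) >> 4) | ((eax & 0x0F0F_0F0F) << 4)
    # Stage 3: swap the two bit pairs within each nibble.
    eax = ((eax >> 2) & 0x3333_3333) | ((eax << 2) & 0xCCCC_CCCC)
    # Stage 4: swap the two bits within each pair.
    eax = ((eax >> 1) & 0x5555_5555) | ((eax << 1) & 0xAAAA_AAAA)
    return eax


print(f"{reverse32(10):08X}")  # 50000000
```

The argument 10 matches the bits.reverse32( 10 ) call in the sample program: bits 1 and 3 of the input become bits 30 and 28 of the result.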

The Microsoft dumpbin.exe tool allows you to examine the various fields of an OBJ or EXE file. Running dumpbin with the /headers command-line option on the bitcnt.obj and reverse.obj files (produced for the HLA standard library) tells us that each of their sections is aligned to a 16-byte boundary. Therefore, when the linker combines the bitcnt.obj and reverse.obj data with the sample program given earlier, it will align the bits.cnt function in the bitcnt.obj file on a 16-byte boundary, and it will align the group of three functions in the reverse.obj file on a 16-byte boundary. Note that it will not align each function in the file on a 16-byte boundary; that task is the responsibility of the tool that created the object file, if such alignment is desired. By using the dumpbin.exe program with the /disasm command-line option on the executable file, you can see that the linker has honored these alignment requests (note that an address that is aligned on a 16-byte boundary will have a 0 in the LO hexadecimal digit):

Address    opcodes             Assembly Instructions
---------  ------------------  -----------------------------
04001000:  E9 EB 00 00 00      jmp     040010F0
04001005:  E9 57 01 00 00      jmp     04001161
0400100A:  E8 F1 00 00 00      call    04001100

; Here's where the main program starts.

0400100F:  6A 00               push    0
04001011:  8B EC               mov     ebp,esp
04001013:  55                  push    ebp
04001014:  6A 05               push    5
04001016:  E8 65 01 00 00      call    04001180
0400101B:  6A 0A               push    0Ah
0400101D:  E8 0E 00 00 00      call    04001030
04001022:  6A 00               push    0
04001024:  FF 15 00 20 00 04   call    dword ptr ds:[04002000h]

; The following INT3 instructions are used as padding in order
; to align the bits.reverse32 function (which immediately follows)
; to a 16-byte boundary:

0400102A:  CC                  int     3
0400102B:  CC                  int     3
0400102C:  CC                  int     3
0400102D:  CC                  int     3
0400102E:  CC                  int     3
0400102F:  CC                  int     3

; Here's where bits.reverse32 starts. Note that this address
; is rounded up to a 16-byte boundary.

04001030:  53                  push    ebx
04001031:  8B 44 24 08         mov     eax,dword ptr [esp+8]
04001035:  0F C8               bswap   eax
04001037:  BB F0 F0 F0 F0      mov     ebx,0F0F0F0F0h
0400103C:  23 D8               and     ebx,eax
0400103E:  25 0F 0F 0F 0F      and     eax,0F0F0F0Fh
04001043:  C1 EB 04            shr     ebx,4
04001046:  C1 E0 04            shl     eax,4
04001049:  0B C3               or      eax,ebx
0400104B:  8B D8               mov     ebx,eax
0400104D:  C1 E8 02            shr     eax,2
04001050:  C1 E3 02            shl     ebx,2
04001053:  25 33 33 33 33      and     eax,33333333h
04001058:  81 E3 CC CC CC CC   and     ebx,0CCCCCCCCh
0400105E:  0B C3               or      eax,ebx
04001060:  8D 1C 00            lea     ebx,[eax+eax]
04001063:  D1 E8               shr     eax,1
04001065:  25 55 55 55 55      and     eax,55555555h
0400106A:  81 E3 AA AA AA AA   and     ebx,0AAAAAAAAh
04001070:  0B C3               or      eax,ebx
04001072:  5B                  pop     ebx
04001073:  C2 04 00            ret     4

; Here's where bits.reverse16 begins. As this function appeared
; in the same file as bits.reverse32, and no alignment option
; was specified in the source file, HLA and the linker won't
; bother aligning this to any particular boundary. Instead, the
; code immediately follows the bits.reverse32 function
; in memory.

04001076:  53                  push    ebx
04001077:  50                  push    eax
04001078:  8B 44 24 0C         mov     eax,dword ptr [esp+0Ch]
    .      ; uninteresting code for bits.reverse16 and
    .      ; bits.reverse8 was snipped

; end of bits.reverse8 code

040010E6:  88 04 24            mov     byte ptr [esp],al
040010E9:  58                  pop     eax
040010EA:  C2 04 00            ret     4

; More padding bytes to align the following function (used by
; HLA exception handling) to a 16-byte boundary:

040010ED:  CC                  int     3
040010EE:  CC                  int     3
040010EF:  CC                  int     3

; Default exception return function (automatically generated
; by HLA):

040010F0:  B8 01 00 00 00      mov     eax,1
040010F5:  C3                  ret

; More padding bytes to align the internal HLA BuildExcepts
; function to a 16-byte boundary:

040010F6:  CC                  int     3
040010F7:  CC                  int     3
040010F8:  CC                  int     3
040010F9:  CC                  int     3
040010FA:  CC                  int     3
040010FB:  CC                  int     3
040010FC:  CC                  int     3
040010FD:  CC                  int     3
040010FE:  CC                  int     3
040010FF:  CC                  int     3

; HLA BuildExcepts code (automatically generated by the
; compiler):

04001100:  58                  pop     eax
04001101:  68 05 10 00 04      push    4001005h
04001106:  55                  push    ebp
    .      ; Remainder of BuildExcepts code goes here
    .      ; along with some other code and data
    .

; Padding bytes to ensure that bits.cnt is aligned
; on a 16-byte boundary:

0400117D:  CC                  int     3
0400117E:  CC                  int     3
0400117F:  CC                  int     3

; Here's the low-level machine code for the bits.cnt function:

04001180:  55                  push    ebp
04001181:  8B EC               mov     ebp,esp
04001183:  83 E4 FC            and     esp,0FFFFFFFCh
04001186:  52                  push    edx
04001187:  8B 45 08            mov     eax,dword ptr [ebp+8]
0400118A:  8B D0               mov     edx,eax
0400118C:  D1 E8               shr     eax,1
0400118E:  25 55 55 55 55      and     eax,55555555h
04001193:  2B D0               sub     edx,eax
04001195:  8B C2               mov     eax,edx
04001197:  C1 EA 02            shr     edx,2
0400119A:  25 33 33 33 33      and     eax,33333333h
0400119F:  81 E2 33 33 33 33   and     edx,33333333h
040011A5:  03 C2               add     eax,edx
040011A7:  8B D0               mov     edx,eax
040011A9:  C1 E8 04            shr     eax,4
040011AC:  03 C2               add     eax,edx
040011AE:  25 0F 0F 0F 0F      and     eax,0F0F0F0Fh
040011B3:  69 C0 01 01 01 01   imul    eax,eax,1010101h
040011B9:  C1 E8 18            shr     eax,18h
040011BC:  5A                  pop     edx
040011BD:  8B E5               mov     esp,ebp
040011BF:  5D                  pop     ebp
040011C0:  C2 04 00            ret     4
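The "0 in the LO hexadecimal digit" test is the same as masking off the low four bits of the address. The Python sketch below applies that test to the entry points shown in the listing; the helper name is_aligned is hypothetical, and the test assumes the alignment is a power of two.

```python
def is_aligned(address, alignment=16):
    """True if address is a multiple of alignment (a power of two)."""
    return (address & (alignment - 1)) == 0


# Entry points copied from the disassembly listing.
entry_points = {
    "bits.reverse32": 0x04001030,
    "bits.reverse16": 0x04001076,
    "default exception return": 0x040010F0,
    "BuildExcepts": 0x04001100,
    "bits.cnt": 0x04001180,
}

for name, address in entry_points.items():
    print(f"{name:26} {address:08X} aligned: {is_aligned(address)}")
```

Only bits.reverse16 fails the test, which agrees with the listing: it shares an object module with bits.reverse32 and simply follows that function in memory.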

The exact operation of this program really isn’t important (after all, it doesn’t actually do anything useful). What is important to note is how the linker inserts extra bytes ($cc, the int 3 instruction) before a group of one or more functions appearing in a source file to ensure that they are aligned on the specified boundary.

In this particular example, the bits.cnt function is 67 bytes long (it occupies addresses 04001180 through 040011C2 in the listing), and the linker inserted only 3 bytes in order to align it to a 16-byte boundary. This percentage of waste, the number of padding bytes compared to the size of the function, is quite low. However, if you have a large number of small functions, the wasted space can become significant (as with the default exception handler in this example, whose two instructions fill only 6 bytes of a 16-byte slot). When creating your own library modules, you will need to weigh the inefficiencies of extra space for padding against the small performance gains you’ll obtain by using aligned code.
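The comparison can be worked out directly from the addresses in the listing. The following Python sketch is only an illustration; the helper name waste_percent is hypothetical, and the byte counts are derived from the first and last addresses of each function as they appear in the disassembly.

```python
def waste_percent(padding_bytes, code_bytes):
    """Padding expressed as a percentage of the code it accompanies."""
    return 100.0 * padding_bytes / code_bytes


# bits.cnt: entry at 04001180, last byte of "ret 4" at 040011C2,
# preceded by 3 int 3 bytes (0400117D..0400117F).
cnt_size = 0x040011C2 - 0x04001180 + 1
print(cnt_size, round(waste_percent(3, cnt_size), 1))           # 67 4.5

# Default exception return: code at 040010F0..040010F5,
# followed by 10 int 3 bytes (040010F6..040010FF).
handler_size = 0x040010F5 - 0x040010F0 + 1
print(handler_size, round(waste_percent(10, handler_size), 1))  # 6 166.7
```

For bits.cnt the padding is under 5 percent of the code, while the padding that follows the 6-byte exception routine is larger than the routine itself.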

Object code dump utilities (like dumpbin.exe) are quite useful for analyzing object code and executable files in order to determine attributes such as section size and alignment. Linux (and most Unix-like systems) provides the comparable objdump utility. I’ll discuss using these tools in the next chapter, as they are great tools for analyzing compiler output.
