Assembly Language Style Guidelines
- Style Guidelines for Assembly Language Programmers
- 1.0 - Introduction
- 1.1 - ADDHEX.ASM
- 1.2 - Graphics Example
- 1.3 - S.COM Example
- 1.4 - Intended Audience
- 1.5 - Readability Metrics
- 1.6 - How to Achieve Readability
- 1.7 - How This Document is Organized
- 1.8 - Guidelines, Rules, Enforced Rules, and Exceptions
- 1.9 - Source Language Concerns
- 2.0 - Program Organization
- 2.1 - Library Functions
- 2.2 - Common Object Modules
- 2.3 - Local Modules
- 2.4 - Program Make Files
- 3.0 - Module Organization
- 3.1 - Module Attributes
- 3.1.1 - Module Cohesion
- 3.1.2 - Module Coupling
- 3.1.3 - Physical Organization of Modules
- 3.1.4 - Module Interface
- 4.0 - Program Unit Organization
- 4.1 - Routine Cohesion
- 4.1.1 - Routine Coupling
- 4.1.2 - Routine Size
- 4.2 - Placement of the Main Procedure and Data
- 5.0 - Statement Organization
- 6.0 - Comments
- 6.1 - What is a Bad Comment?
- 6.2 - What is a Good Comment?
- 6.3 - Endline vs. Standalone Comments
- 6.4 - Unfinished Code
- 6.5 - Cross References in Code to Other Documents
- 7.0 - Names, Instructions, Operators, and Operands
- 7.1 - Names
- 7.1.1 - Naming Conventions
- 7.1.2 - Alphabetic Case Considerations
- 7.1.3 - Abbreviations
- 7.1.4 - The Position of Components Within an Identifier
- 7.1.5 - Names to Avoid
- 7.2 - Instructions, Directives, and Pseudo-Opcodes
- 7.2.1 - Choosing the Best Instruction Sequence
- 7.2.2 - Control Structures
- 7.2.3 - Instruction Synonyms
- 8.0 - Data Types
- 8.1 - Defining New Data Types with TYPEDEF
- 8.2 - Creating Array Types
- 8.3 - Declaring Structures in Assembly Language
- 8.4 - Data Types and the UCR Standard Library
Style Guidelines for Assembly Language Programmers
1.0 Introduction
Most people consider assembly language programs difficult to read. While
there are a multitude of reasons why people feel this way, the primary
reason is that assembly language does not make it easy for programmers to
write readable programs. This doesn't mean it's impossible to write
readable programs, only that it takes an extra effort on the part of an
assembly language programmer to produce readable code.
To demonstrate some common problems with assembly language programs,
consider the following programs or program segments. These are actual
programs written in assembly language taken from the Internet. Each
example demonstrates a separate problem. (By the way, the choice of these
examples is not intended to embarrass the original authors. These programs
are typical of assembly language source code found on the Internet.)
1.1 ADDHEX.ASM
%TITLE "Sums TWO hex values"
IDEAL
DOSSEG
MODEL small
STACK 256
DATASEG
exitCode db 0
prompt1 db 'Enter value 1: ', 0
prompt2 db 'Enter value 2: ', 0
string db 20 DUP (?)
CODESEG
EXTRN StrLength:proc
EXTRN StrWrite:proc, StrRead:proc, NewLine:proc
EXTRN AscToBin:proc, BinToAscHex:proc
Start:
mov ax,@data
mov ds,ax
mov es,ax
mov di, offset prompt1
call GetValue
push ax
mov di, offset prompt2
call GetValue
pop bx
add ax,bx
mov cx,4
mov di, offset string
call BinToAscHex
call StrWrite
Exit:
mov ah,04Ch
mov al,[exitCode]
int 21h
PROC GetValue
call StrWrite
mov di, offset string
mov cl,4
call StrRead
call NewLine
call StrLength
mov bx,cx
mov [word bx + di], 'h'
call AscToBin
ret
ENDP GetValue
END Start
Well, the biggest problem with this program should be fairly obvious - it
has absolutely no comments other than the title of the program. Another
problem is the fact that strings that prompt the user appear in one part
of the program and the calls that print those strings appear in another.
While this is typical assembly language programming, it still makes the
program harder to read. Another, relatively minor, problem is that it
uses TASM's "less-than" IDEAL syntax[1].
This program also uses the MASM/TASM "simplified" segment
directives. How typically Microsoft to name a feature that adds
complexity to a product "simplified." It turns out that
programs that use the standard segmentation directives will be easier to
read[2].
Before moving on, it is worthwhile to point out two good features about
this program (with respect to readability). First, the programmer chose a
reasonable set of names for the procedures and variables this program uses
(I'll assume the author of this code segment is also the author of the
library routines it calls). Another positive aspect to this program is
that the mnemonic and operand fields are nicely aligned.
Okay, after complaining about how hard this code is to read, how about a
more readable version? The following program is, arguably, more readable
than the version above. Arguably, because this version uses the UCR
Standard Library v2.0 and it assumes that the reader is familiar with
features of that particular library.
;**************************************************
;
; AddHex-
;
; This simple program reads two integer values from
; the user, computes their sum, and prints the
; result to the display.
;
; This example uses the "UCR Standard Library for
; 80x86 Assembly Language Programmers v2.0"
;
; Randall Hyde
; 12/13/96

                title   AddHex
                .xlist
                include ucrlib.a
                includelib ucrlib.lib
                .list

cseg            segment para public 'code'
                assume  cs:cseg

; GetInt-
;
; This function reads an integer value from the keyboard and
; returns that value in the AX register.
;
; This routine traps illegal values (either too large or
; incorrect digits) and makes the user re-enter the value.

GetInt          textequ <call GetInt_p>
GetInt_p        proc
                push    dx              ;DX holds error code.

GetIntLoop:     mov     dx, false       ;Assume no error.
                try                     ;Trap any errors.

                FlushGetc               ;Force input from a new line.
                geti                    ;Read the integer.

                except  $Conversion     ;Trap if bad characters.
                print   "Illegal numeric conversion, please re-enter", nl
                mov     dx, true

                except  $Overflow       ;Trap if # too large.
                print   "Value out of range, please re-enter.",nl
                mov     dx, true

                endtry
                cmp     dx, true
                je      GetIntLoop
                pop     dx
                ret
GetInt_p        endp

Main            proc
                InitExcept

                print   'Enter value 1: '
                GetInt
                mov     bx, ax

                print   'Enter value 2: '
                GetInt
                print   cr, lf, 'The sum of the two values is '
                add     ax, bx
                puti
                putcr

Quit:           CleanUpEx
                ExitPgm                 ;DOS macro to quit program.
Main            endp

cseg            ends

sseg            segment para stack 'stack'
stk             db      256 dup (?)
sseg            ends

zzzzzzseg       segment para public 'zzzzzz'
LastBytes       db      16 dup (?)
zzzzzzseg       ends
                end     Main
It is well worth pointing out that this code does quite a bit more than
the original AddHex program. In particular, it validates the user's
input; something the original program did not do. If one were to exactly
simulate the original program, the program could be simplified to the
following:
print nl, 'Enter value 1: '
Geti
mov bx, ax
print nl, 'Enter value 2: '
Geti
add ax, bx
putcr
puti
putcr
In this example, the two sample solutions improved the readability of the
program by adding comments, formatting the program a little bit better,
and by using the high-level features of the UCR Standard Library to
simplify the coding and keep output string literals with the statements
that print them.
1.2 Graphics Example
The following program segment comes from a much larger program named
"MODEX.ASM" on the net. It deals with setting up the color
graphics display.
;===================================
;SET_POINT (Xpos%, Ypos%, ColorNum%)
;===================================
;
; Plots a single Pixel on the active display page
;
; ENTRY: Xpos = X position to plot pixel at
; Ypos = Y position to plot pixel at
; ColorNum = Color to plot pixel with
;
; EXIT: No meaningful values returned
;
SP_STACK STRUC
DW ?,? ; BP, DI
DD ? ; Caller
SETP_Color DB ?,? ; Color of Point to Plot
SETP_Ypos DW ? ; Y pos of Point to Plot
SETP_Xpos DW ? ; X pos of Point to Plot
SP_STACK ENDS
PUBLIC SET_POINT
SET_POINT PROC FAR
PUSHx BP, DI ; Preserve Registers
MOV BP, SP ; Set up Stack Frame
LES DI, d CURRENT_PAGE ; Point to Active VGA Page
MOV AX, [BP].SETP_Ypos ; Get Line # of Pixel
MUL SCREEN_WIDTH ; Get Offset to Start of Line
MOV BX, [BP].SETP_Xpos ; Get Xpos
MOV CX, BX ; Copy to extract Plane # from
SHR BX, 2 ; X offset (Bytes) = Xpos/4
ADD BX, AX ; Offset = Width*Ypos + Xpos/4
MOV AX, MAP_MASK_PLANE1 ; Map Mask & Plane Select Register
AND CL, PLANE_BITS ; Get Plane Bits
SHL AH, CL ; Get Plane Select Value
OUT_16 SC_Index, AX ; Select Plane
MOV AL,[BP].SETP_Color ; Get Pixel Color
MOV ES:[DI+BX], AL ; Draw Pixel
POPx DI, BP ; Restore Saved Registers
RET 6 ; Exit and Clean up Stack
SET_POINT ENDP
Unlike the previous example, this one has lots of comments. Indeed, the
comments are not bad. However, this particular routine suffers from its
own set of problems. First, most of the instructions, register names, and
identifiers appear in upper case. Upper case characters are much harder
to read than lower case letters. Considering the extra work involved in
entering upper case letters into the computer, it's a real shame to see
this type of mistake in a program[3]. Another
big problem with this particular code segment is that the author didn't
align the label field, the mnemonic field, and the operand field very well
(it's not horrible, but it's bad enough to affect the readability of the
program).
Here is an improved version of the program:
;===================================
;
;SetPoint (Xpos%, Ypos%, ColorNum%)
;
;
; Plots a single Pixel on the active display page
;
; ENTRY: Xpos     = X position to plot pixel at
;        Ypos     = Y position to plot pixel at
;        ColorNum = Color to plot pixel with
;
;        ES:DI    = Screen base address (??? I added this without really
;                   knowing what is going on here [RLH]).
;
; EXIT:  No meaningful values returned
;

dp              textequ <dword ptr>

Color           textequ <[bp+6]>
YPos            textequ <[bp+8]>
XPos            textequ <[bp+10]>

                public  SetPoint
SetPoint        proc    far
                push    bp
                mov     bp, sp
                push    di

                les     di, dp CurrentPage      ;Point at active VGA Page

                mov     ax, YPos                ;Get line # of Pixel
                mul     ScreenWidth             ;Get offset to start of line
                mov     bx, XPos                ;Get offset into line
                mov     cx, bx                  ;Save for plane computations
                shr     bx, 2                   ;X offset (bytes)= XPos/4
                add     bx, ax                  ;Offset=Width*YPos + XPos/4

                mov     ax, MapMaskPlane1       ;Map mask & plane select reg
                and     cl, PlaneBits           ;Get plane bits
                shl     ah, cl                  ;Get plane select value
                out_16  SCIndex, ax             ;Select plane

                mov     al, Color               ;Get pixel color
                mov     es:[di+bx], al          ;Draw pixel

                pop     di
                pop     bp
                ret     6
SetPoint        endp
Most of the changes here were purely mechanical: reducing the number of
upper case letters in the program, spacing the program out better,
adjusting some comments, etc. Nevertheless, these small, subtle, changes
have a big impact on how easy the code is to read (at least, to an
experienced assembly language programmer).
1.3 S.COM Example
The following code sequence came from a program labelled
"S.COM" that was also found in an archive on the internet.
;Get all file names matching filespec and set up tables
GetFileRecords:
mov dx, OFFSET DTA ;Set up DTA
mov ah, 1Ah
int 21h
mov dx, FILESPEC ;Get first file name
mov cl, 37h
mov ah, 4Eh
int 21h
jnc FileFound ;No files. Try a different filespec.
mov si, OFFSET NoFilesMsg
call Error
jmp NewFilespec
FileFound:
mov di, OFFSET fileRecords ;DI -> storage for file names
mov bx, OFFSET files ;BX -> array of files
sub bx, 2
StoreFileName:
add bx, 2 ;For all files that will fit,
cmp bx, (OFFSET files) + NFILES*2
jb @@L1
sub bx, 2
mov [last], bx
mov si, OFFSET tooManyMsg
jmp DoError
@@L1:
mov [bx], di ;Store pointer to status/filename in files[]
mov al, [DTA_ATTRIB] ;Store status byte
and al, 3Fh ;Top bit is used to indicate file is marked
stosb
mov si, OFFSET DTA_NAME ;Copy file name from DTA to filename storage
call CopyString
inc di
mov si, OFFSET DTA_TIME ;Copy time, date and size
mov cx, 4
rep movsw
mov ah, 4Fh ;Next filename
int 21h
jnc StoreFileName
mov [last], bx ;Save pointer to last file entry
mov al, [keepSorted] ;If returning from EXEC, need to resort files?
or al, al
jz DisplayFiles
jmp Sort0
The primary problem with this program is the formatting. The label
fields overlap the mnemonic fields (in almost every instance), the operand
fields of the various instructions are not aligned, there are very few
blank lines to organize the code, the programmer makes excessive use of
"local" label names, and, although not prevalent, there are a
few items that are all uppercase (remember, upper case characters are
harder to read). This program also makes considerable use of "magic
numbers," especially with respect to function codes passed on to DOS.
Another subtle problem with this program is the way it organizes control
flow. At a couple of points in the code it checks to see if an error
condition exists (file not found and too many files processed). If no
error exists, the code above branches around some error handling code that
the author places in the middle of the routine. Unfortunately, this
interrupts the flow of the program. Most readers will want to see a
straight-line version of the program's typical operation without having to
worry about details concerning error conditions. Unfortunately, the
organization of this code is such that the reader must skip over
seldom-executed code in order to follow what is happening with the
common case[4].
Here is a slightly improved version of the above program:
;Get all file names matching filespec and set up tables

GetFileRecords: mov     dx, offset DTA          ;Set up DTA
                DOS     SetDTA

; Get the first file that matches the specified filename (that may
; contain wildcard characters). If no such file exists, then
; we've got an error.

                mov     dx, FileSpec
                mov     cl, 37h
                DOS     FindFirstFile
                jc      FileNotFound

; As long as there are more files matching our file spec (that contains
; wildcard characters), get the file information and place it in the
; "files" array. Each time through the "StoreFileName" loop we've got
; a new file name via a call to DOS' FindNextFile function (FindFirstFile
; for the first iteration). Store the info concerning the file away and
; move on to the next file.

                mov     di, offset fileRecords  ;DI -> storage for file names
                mov     bx, offset files        ;BX -> array of files
                sub     bx, 2                   ;Special case for 1st iteration

StoreFileName:  add     bx, 2
                cmp     bx, (offset files) + NFILES*2
                jae     TooManyFiles

; Store away the pointer to the status/filename in files[] array.
; Note that the H.O. bit of the status byte indicates that the file
; is marked.

                mov     [bx], di                ;Store pointer in files[]
                mov     al, [DTAattrib]         ;Store status byte
                and     al, 3Fh                 ;Clear file is marked bit
                stosb

; Copy the filename from the DTA storage area to the space we've set aside.

                mov     si, offset DTAname
                call    CopyString
                inc     di                      ;Skip zero byte (???).

                mov     si, offset DTAtime      ;Copy time, date and size
                mov     cx, 4
                rep movsw

; Move on to the next file and try again.

                DOS     FindNextFile
                jnc     StoreFileName

; After processing the last file entry, do some clean up.
; (1) Save pointer to last file entry.
; (2) If returning from EXEC, we may need to resort and display the files.

                mov     [last], bx
                mov     al, [keepSorted]
                or      al, al
                jz      DisplayFiles
                jmp     Sort0

; Jump down here if there were no files to process.

FileNotFound:   mov     si, offset NoFilesMsg
                call    Error
                jmp     NewFilespec

; Jump down here if there were too many files to process.

TooManyFiles:   sub     bx, 2
                mov     [last], bx
                mov     si, offset tooManyMsg
                jmp     DoError
This improved version dispenses with the local labels and formats the code
better by aligning all the statement fields and inserting blank lines into
the code. It also eliminates many of the uppercase characters appearing
in the previous version. Another improvement is that this code moves the
error handling code out of the main stream of this code segment, allowing
the reader to follow the typical execution in a more linear fashion.
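The improved listing invokes DOS through a "DOS" macro with symbolic
function names (SetDTA, FindFirstFile, FindNextFile), but the excerpt does
not show how these are defined. The following is a minimal sketch of what
such definitions might look like in MASM. The macro body, its parameter
name (FuncCode), and the use of equates are assumptions made for
illustration, not the original author's code; the function numbers are the
ones that appear as magic numbers in the original S.COM listing.

```asm
; Hypothetical definitions for the "DOS" macro used in the improved
; listing. The equates give names to the INT 21h function codes that
; appeared as magic numbers in the original S.COM code.

SetDTA          equ     1Ah             ;Set disk transfer area address
FindFirstFile   equ     4Eh             ;Find first matching file
FindNextFile    equ     4Fh             ;Find next matching file

; DOS-
;
; Loads AH with the given function code and invokes DOS.
; Any other registers the function needs (e.g., DX or CX)
; must already be set up by the caller.

DOS             macro   FuncCode
                mov     ah, FuncCode
                int     21h
                endm
```

With definitions along these lines, a statement such as "DOS SetDTA"
would expand to the same "mov ah, 1Ah" / "int 21h" pair found in the
original code, while letting the reader see the intent at a glance.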
1.4 Intended Audience
Of course, an assembly language program is going to be nearly unreadable
to someone who doesn't know assembly language. This is true for almost
any programming language. In the examples above, it's doubtful that the
"improved" versions are really any more readable than the
original version if you don't know 80x86 assembly language. Perhaps the
improved versions are more aesthetic in a generic sense, but if you don't
know 80x86 assembly language it's doubtful you'd make any more sense of
the second version than the first. Other than burying a tutorial on 80x86
assembly language in a program's comments, there is no way to address this
problem[5].
In view of the above, it makes sense to define an "intended
audience" for our assembly language programs. Such a person should:
- Be a reasonably competent 80x86 assembly language programmer.
- Be reasonably familiar with the problem the assembly language program
is attempting to solve.
- Fluently read English[6].
- Have a good grasp of high level language concepts.
- Possess appropriate knowledge for someone working in the field of
Computer Science (e.g., understands standard algorithms and data
structures, understands basic machine architecture, and understands basic
discrete mathematics).
1.5 Readability Metrics
One has to ask "What is it that makes one program more readable than
another?" In other words, how do we measure the
"readability" of a program? The usual metric, "I know a
well-written program when I see one" is inappropriate; for most
people, this translates to "If your programs look like my better
programs then they are readable, otherwise they are not." Obviously,
such a metric is of little value since it changes with every person.
To develop a metric for measuring the readability of an assembly language
program, the first thing we must ask is "Why is readability
important?" This question has a simple (though somewhat flippant)
answer: Readability is important because programs are read (furthermore,
a line of code is typically read ten times more often than it is
written). To expand on this, consider the fact that most programs are
read and maintained by other programmers (Steve McConnell claims that up
to ten generations of maintenance programmers work on a typical real-world
program before it is rewritten; furthermore, they spend up to 60%
of their effort on that code simply figuring out how it works). The more
readable your programs are, the less time these other people will have to
spend figuring out what your program does. Instead, they can concentrate
on adding features or correcting defects in the code.
For the purposes of this document, we will define a "readable"
program as one that has the following trait:
- A "readable" program is one that a competent programmer (one
who is familiar with the problem the program is attempting to solve) can
pick up, without ever having seen the program before, and fully comprehend
the entire program in a minimal amount of time.
That's a tall order! This definition doesn't sound very difficult to
achieve, but few non-trivial programs ever really achieve this status.
This definition suggests that an appropriate programmer (i.e., one who is
familiar with the problem the program is trying to solve) can pick up a
program, read it at their normal reading pace (just once), and fully
comprehend the program. Anything less is not a "readable"
program.
Of course, in practice, this definition is unusable since very few
programs reach this goal. Part of the problem is that programs tend to be
quite long and few human beings are capable of managing a large number of
details in their head at one time. Furthermore, no matter how
well-written a program may be, "a competent programmer" does not
suggest that the programmer's IQ is so high they can read a statement and
fully comprehend its meaning without expending much thought. Therefore,
we must define readability, not as a boolean entity, but as a scale.
Although truly unreadable programs exist, there are many
"readable" programs that are less readable than other programs.
Therefore, perhaps the following definition is more realistic:
- A readable program is one that consists of one or more modules. A
competent programmer should be able to pick a given module in that program
and achieve an 80% comprehension level by expending no more than an
average of one minute for each statement in the program.
An 80% comprehension level means that the programmer can correct bugs in
the program and add new features to the program without making mistakes
due to a misunderstanding of the code at hand.
1.6 How to Achieve Readability
The "I'll know one when I see one" metric for readable
programs provides a big hint concerning how one should write programs
that are readable. As pointed out earlier, the "I'll know it when I
see it" metric suggests that an individual will consider a program to
be readable if it is very similar to (good) programs that this particular
person has written. This suggests an important trait that readable
programs must possess: consistency. If all programmers were to write
programs using a consistent style, they'd find programs written by others
to be similar to their own, and, therefore, easier to read. This single
goal is the primary purpose of this paper - to suggest a consistent
standard that everyone will follow.
Of course, consistency by itself is not good enough. Consistently bad
programs are not particularly easy to read. Therefore, one must carefully
consider the guidelines to use when defining an all-encompassing standard.
The purpose of this paper is to create such a standard. However, don't
get the impression that the material appearing in this document appears
simply because it sounded good at the time or because of some personal
preferences. The material in this paper comes from several software
engineering texts on the subject (including Elements of Programming Style,
Code Complete, and Writing Solid Code), nearly 20 years of personal
assembly language programming experience, and a set of generic programming
guidelines developed for Information Management Associates, Inc.
This document assumes consistent usage by its readers. Therefore, it
concentrates on a lot of mechanical and psychological issues that affect
the readability of a program. For example, uppercase letters are harder
to read than lower case letters (this is a well-known result from
psychology research). It takes longer for a human being to recognize
uppercase characters; therefore, an average human being will take more
time to read text written all in upper case. Hence, this document
suggests that one should avoid the use of uppercase sequences in a
program. Many of the other issues appearing in this document are in a
similar vein; they suggest minor changes to the way you might write your
programs that make it easier for someone to recognize some pattern in your
code, thus aiding in comprehension.
1.7 How This Document is Organized
This document follows a top-down discussion of readability. It starts
with the concept of a program. Then it discusses modules. From there it
works its way down to procedures. Then it talks about individual
statements. Beyond that, it talks about components that make up
statements (e.g., instructions, names, and operators). Finally, this
paper concludes by discussing some orthogonal issues.
Section Two discusses programs in general. It primarily discusses
documentation that must accompany a program and the organization of source
files. It also discusses, briefly, configuration management and source
code control issues. Keep in mind that figuring out how to build a
program (make, assemble, link, test, debug, etc.) is important. If your
reader fully understands the "heapsort" algorithm you are using,
but cannot build an executable module to run, they still do not fully
understand your program.
Section Three discusses how to organize modules in your program in a
logical fashion. This makes it easier for others to locate sections of
code and organizes related sections of code together so someone can easily
find important code and ignore unimportant or unrelated code while
attempting to understand what your program does.
Section Four discusses the use of procedures within a program. This is a
continuation of the theme in Section Three, although at a lower, more
detailed, level.
Section Five discusses the program at the level of the statement. This
(large) section provides the meat of this proposal. Most of the rules
this paper presents appear in this section.
Section Six discusses those items that make up a statement (labels,
names, instructions, operands, operators, etc.). This is another large
section that presents a large number of rules one should follow when
writing readable programs. This section discusses naming conventions,
appropriateness of operators, and so on.
Section Seven discusses data types and other related topics.
Section Eight covers miscellaneous topics that the previous sections did
not cover.
1.8 Guidelines, Rules, Enforced Rules, and Exceptions
Not all rules are equally important. For example, a rule that you check
the spelling of all the words in your comments is probably less important
than a rule that the comments all be in English[7]. Therefore, this paper
uses three designations to keep things straight: Guidelines, Rules, and
Enforced Rules.
A Guideline is a suggestion. It is a rule you should follow unless you
can verbally defend why you should break the rule. As long as there is a
good, defensible reason, you should feel no apprehension about violating a
Guideline. Guidelines exist in order to encourage consistency in areas
where there are no good reasons for choosing one methodology over
another. You shouldn't violate a Guideline just because you don't like it
-- doing so will make your programs inconsistent with respect to other
programs that do follow the Guideline (and, therefore, harder to read) --
however, you shouldn't lose any sleep because you violated a Guideline.
Rules are much stronger than Guidelines. You should never break a rule
unless there is some external reason for doing so (e.g., making a call to
a library routine forces you to use a bad naming convention). Whenever
you feel you must violate a rule, you should verify that it is reasonable
to do so in a peer review with at least two peers. Furthermore, you
should explain in the program's comments why it was necessary to violate
the rule. Rules are just that -- rules to be followed. However, there
are certain situations where it may be necessary to violate the rule in
order to satisfy external requirements or even make the program more
readable.
Enforced Rules are the toughest of the lot. You should never violate an
enforced rule. If there is ever a true need to do this, then you should
consider demoting the Enforced Rule to a simple Rule rather than treating
the violation as a reasonable alternative.
An Exception is exactly that, a known example where one would commonly
violate a Guideline, Rule, or (very rarely) Enforced Rule. Although
exceptions are rare, the old adage "Every rule has its
exceptions..." certainly applies to this document. The Exceptions
point out some of the common violations one might expect.
Of course, the categorization of Guidelines, Rules, Enforced Rules, and
Exceptions herein is one man's opinion. At some organizations, this
categorization may require reworking depending on the needs of that
organization.
1.9 Source Language Concerns
This document will assume that the entire program is written in 80x86
assembly language. Although this organization is rare in commercial
applications, this assumption will, in no way, invalidate these
guidelines. Other guidelines exist for various high level languages
(including a set written by this paper's author). You should adopt a
reasonable set of guidelines for the other languages you use and apply
these guidelines to the 80x86 assembly language modules in the program.
2.0 Program Organization
A source program generally consists of one or more source, object, and
library files. As a project gets larger and the number of files
increases, it becomes difficult to keep track of the files in a project.
This is especially true if a number of different projects share a common
set of source modules. This section will address these concerns.
2.1 Library Functions
A library, by its very nature, suggests stability. Ignoring the
possibility of software defects, one would rarely expect the number or
function of routines in a library to vary from project to project. A good
example is the "UCR Standard Library for 80x86 Assembly Language
Programmers." One would expect "printf" to behave
identically in two different programs that use the Standard Library.
Contrast this against two programs, each of which implements its own
version of printf. One could not reasonably assume both programs have
identical implementations[8]. This leads to the
following rule:
- Rule: Library functions are those routines intended for common reuse in
many different assembly language programs. All assembly language
(callable) libraries on a system should exist as ".lib" files and should
appear in the "/lib" or "/asmlib" subdirectory.
- Guideline: "/asmlib" is probably a better choice if you're using
multiple languages since those other languages may need to put files in
a "/lib" directory.
- Exception: It's probably reasonable to leave the UCR Standard Library's
"stdlib.lib" file in the "/stdlib/lib" directory since most people
expect it there.
The rule above ensures that the library files are all in one location so
they are easy to find, modify, and review. By putting all your library
modules into a single directory, you avoid configuration management
problems such as having outdated versions of a library linking with one
program and up-to-date versions linking with other programs.
2.2 Common Object Modules
This document defines a library as a collection of object modules that
have wide application in many different programs. The UCR Standard
Library is a typical example of a library. Some object modules are not so
general purpose, but still find application in two or more different
programs. Two major configuration management problems exist in this
situation: (1) making sure the ".obj" file is up-to-date when
linking it with a program; (2) knowing which modules use the module so
one can verify that changes to the module won't break existing code.
The following rules take care of case one:
- Rule: If two different programs share an object module, then the
associated source, object, and makefiles for that module should appear
in a subdirectory that is specific to that module (i.e., no other files in
the subdirectory). The subdirectory name should be the same as the module
name. If possible, you should create a set of link/alias/shortcuts to
this subdirectory and place these links in the main directory of each of
the projects that utilize the module. If links are not possible, you
should place the module's subdirectory in the "/common" subdirectory.
- Enforced Rule: Every subdirectory containing one or more modules should
have a make file that will automatically generate the appropriate,
up-to-date, ".obj" files. An individual, a batch file, or another make
file should be able to automatically generate new object modules (if
necessary) by simply executing the make program.
- Guideline: Use Microsoft's nmake program. At the very least, use
nmake-acceptable syntax in your makefiles.
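As an illustration of the Enforced Rule above, here is a minimal sketch of
a make file for a shared module's subdirectory, written in nmake syntax.
The module name "strutil" and its include file "strutil.a" are
hypothetical, and the sketch assumes the MASM 6.x driver (ml) is on the
path; adjust the names and the assembler command to suit your own tools.

```makefile
# Makefile for a shared module. "strutil" is a hypothetical module name,
# and strutil.a a hypothetical include file that the source depends on.
# Assumes the MASM 6.x driver (ml) is available on the path.

all: strutil.obj

# Reassemble only when the source or its include file is newer than
# the object file. The /c option tells ml to assemble without linking.
strutil.obj: strutil.asm strutil.a
    ml /c strutil.asm
```

Running nmake in the module's subdirectory (by hand, from a batch file, or
from a project-level make file) then brings strutil.obj up to date before
any program links with it.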
The other problem, noting which projects use a given module, is much more
difficult. The obvious solution, commenting the source code associated
with the module to tell the reader which programs use the module, is
impractical. Maintaining these comments is too error-prone and the
comments will quickly get out of phase and be worse than useless -- they
would be incorrect. A better solution is to create a dummy file using the
module's name with a ".elw" (elsewhere) suffix and placing this
file in the main subdirectory of each program that links the module. Now,
using one of the venerable "whereis" programs, you can easily
locate all projects that use the module.
- Guideline: If a project uses a module that is not local to the project's
subdirectory, create a dummy file (using "TOUCH" or a comparable program)
that uses the module's main name with a ".elw" suffix. This will allow
someone to easily search for all the projects that use a common object
module by using a "whereis" program.
2.3 Local Modules
Local modules are those that a single program/project uses. Typically,
the source and object code for each module appears in the same directory
as the other files associated with the project. This is a reasonable
arrangement until the number of files increases to the point that it is
difficult to find a file in a directory listing. At that point, most
programmers begin reorganizing their directory by creating subdirectories
to hold many of these source modules. However, the placement, name, and
contents of these new subdirectories can have a big impact on the overall
readability of the program. This section will address these issues.
The first issue to consider is the contents of these new subdirectories.
Since programmers rummaging through this project in the future will need
to easily locate source files in a project, it is important that you
organize these new subdirectories so that it is easy to find the source
files you are moving into them. The best organization is to put each
source module (or a small group of strongly related modules) into its own
subdirectory. The subdirectory should bear the name of the source module
minus its suffix (or the main module if there is more than one present in
the subdirectory). If you place two or more source files in the same
directory, ensure this set of source files forms a cohesive set (meaning
the source files contain code that solves a single problem). A discussion
of cohesiveness appears later in this document.
- Rule:
- If a project directory contains too many files, try
to move some of the modules to subdirectories within the project
directory; give the subdirectory the same name as the source file without
the suffix. This will cut the number of files nearly in half. If this
reduction is insufficient, try categorizing the source modules (e.g.,
FileIO, Graphics, Rendering, and Sound) and move these modules to a
subdirectory bearing the name of the category.
- Enforced Rule:
- Each new subdirectory you create should have
its own make file that will automatically assemble all source modules
within that subdirectory, as appropriate.
- Enforced Rule:
- Any new subdirectories you create for these
source modules should appear within the directory containing the project.
The only exceptions are those modules that you share, or anticipate
sharing, with other projects. See "Common Object Modules"
(section 2.2) for more details.
Stand-alone assembly language programs generally contain a
"main" procedure - the first program unit that executes when the
operating system loads the program into memory. For any programmer new to
a project, this procedure is the anchor where one first begins reading
the code and the point to which the reader will continually refer.
Therefore, the reader should be able to easily locate this source file.
The following rule helps ensure this is the case:
- Rule:
- The source module containing the main program should
have the same name as the executable (obviously the suffix will be
different). For example, if the "Simulate 886" program's
executable name is "Sim886.exe" then you should find the main
program in the "Sim886.asm" source file.
Finding the source file that contains the main program is one thing.
Finding the main program itself can be almost as hard. Assembly language
lets you give the main program any name you want. However, to make the
main procedure easy to find (both in the source code and at the O/S
level), you should actually name this program "main". See "Module Organization"
(section 3.0) for more details about the placement of the main program.
- Rule:
- The name of the main procedure in an assembly
language program should be "main".
2.4 Program Make Files
Every project, even if it contains only a single source module, should
have an associated make file. If someone wants to assemble your program,
they should not have to worry about what program (e.g., MASM) to use, what
command line options to use, what library modules to use, etc. They
should be able to type "nmake"[9] and
wind up with an executable program. Even if assembling the program
consists of nothing more than typing the name of the assembler and the
source file, you should still have a make file. Someone else may not
realize that's all that is necessary.
- Enforced Rule:
- The main project directory should contain a
make file that will automatically generate an executable (or other
expected object module) in response to a simple make/nmake command.
- Rule:
- If your project uses object modules that are not in
the same subdirectory as the main program's module, you should test the
".obj" files for those modules and execute the corresponding
make files in their directories if the object code is out of date. You
can assume that library files are up to date.
- Guideline:
- Avoid using fancy "make" features.
Most programmers only learn the basics about make and will not be able to
understand what your make file is doing if you fully exploit the make
language. Especially avoid the use of default rules since this can create
havoc if someone arbitrarily adds or removes files from the directory
containing the make file.
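As a rough illustration of the rules above, a project make file might look something like the following sketch. The project name (Sim886) comes from the earlier example; the FileIO subdirectory, the file names, and the ml and link command lines are assumptions made for illustration, so adjust them to suit your own tools. Every dependency is spelled out explicitly and the file uses no default rules.

```makefile
# Hypothetical nmake make file for the Sim886 project.
# Assumes MASM 6.x (ml) and a 16-bit linker (link) are on the path, and
# that the FileIO module lives in its own subdirectory with its own
# make file.

Sim886.exe: Sim886.obj FileIO\FileIO.obj
        link Sim886.obj FileIO\FileIO.obj, Sim886.exe;

Sim886.obj: Sim886.asm
        ml /c Sim886.asm

# If the module's object code is out of date, run the make file that
# lives in the module's own subdirectory.
FileIO\FileIO.obj: FileIO\FileIO.asm
        cd FileIO
        nmake
        cd ..
```

With a file like this in the main project directory, typing "nmake" should be all that is needed to rebuild the executable.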
3.0 Module Organization
A module is a collection of objects that are logically related. Those
objects may include constants, data types, variables, and program units
(e.g., functions, procedures, etc.). Note that objects in a module need
not be physically related. For example, it is quite possible to construct
a module using several different source files. Likewise, it is quite
possible to have several different modules in the same source file.
However, the best modules are physically related as well as logically
related; that is, all the objects associated with a module exist in a
single source file (or directory if the source file would be too large)
and nothing else is present.
Modules contain several different objects including constants, types,
variables, and program units (routines). Modules share many
attributes with routines (program units); this is not surprising since
routines are the major component of a typical module. However, modules
have some additional attributes of their own. The following sections
describe the attributes of a well-written module.
- Note:
- Unit and package are both synonyms for the term
module.
3.1 Module Attributes
A module is a generic term that describes a set of program-related
objects (program units as well as data and type objects) that are somehow
coupled. Good modules share many of the same attributes as good program
units as well as the ability to hide certain details from code outside the
module.
3.1.1 Module Cohesion
Modules exhibit the following different kinds of cohesion (listed from
good to bad):
- Functional or logical cohesion exists if the module accomplishes
exactly one (simple) task.
- Sequential or pipelined cohesion exists when a module does several
sequential operations that must be performed in a certain order with the
data from one operation being fed to the next in a "filter-like"
fashion.
- Global or communicational cohesion exists when a module performs a set
of operations that make use of a common set of data, but are otherwise
unrelated.
- Temporal cohesion exists when a module performs a set of operations
that need to be done at the same time (though not necessarily in the same
order). A typical initialization module is an example of such code.
- Procedural cohesion exists when a module performs a sequence of
operations in a specific order, but the only thing that binds them
together is the order in which they must be done. Unlike sequential
cohesion, the operations do not share data.
- State cohesion occurs when several different (unrelated) operations
appear in the same module and a state variable (e.g., a parameter) selects
the operation to execute. Typically such modules contain a case (switch)
or if..elseif..elseif... statement.
- No cohesion exists if the operations in a module have no apparent
relationship with one another.
The first three forms of cohesion above are generally acceptable in a
program. The fourth (temporal) is probably okay, but you should rarely
use it. The last three forms should almost never appear in a program.
For some reasonable examples of module cohesion, you should consult "Code
Complete".
- Guideline:
- Design good modules! Good modules exhibit strong
cohesion. That is, a module should offer a (small) group of services that
are logically related. For example, a "printer" module might provide all
the services one would expect from a printer. The individual routines
within the module would provide the individual services.
3.1.2 Module Coupling
Coupling refers to the way that two modules communicate with one
another. There are several criteria that define the level of coupling
between two modules:
- Cardinality- the number of objects communicated between two modules.
The fewer objects the better (i.e., fewer parameters).
- Intimacy- how "private" is the communication? Parameter lists are the
most private form; private data fields in a class or object are next
level; public data fields in a class or object are next, global variables
are even less intimate, and passing data in a file or database is the
least intimate connection. Well-written modules exhibit a high degree of
intimacy.
- Visibility- this is somewhat related to intimacy above. This refers
to how visible the data is to the entire system that you pass between two
modules. For example, passing data in a parameter list is direct and very
visible (you always see the data the caller is passing in the call to the
routine); passing data in global variables makes the transfer less
visible (you could have set up the global variable long before the call to
the routine). Another example is passing simple (scalar) variables rather
than loading up a bunch of values into a structure/record and passing that
structure/record to the callee.
- Flexibility- This refers to how easy it is to make the connection
between two routines that may not have been originally intended to call
one another. For example, suppose you pass a structure containing three
fields into a function. If you want to call that function but you only
have three data objects, not the structure, you would have to create a
dummy structure, copy the three values into the fields of that structure,
and then call the function. On the other hand, had you simply passed the
three values as separate parameters, you could still pass in structures
(by specifying each field) as well as call the function with separate
values. The module containing this latter function is more flexible.
A module is loosely coupled if its functions exhibit low cardinality,
high intimacy, high visibility, and high flexibility. Often, these
features are in conflict with one another (e.g., increasing the
flexibility by breaking out the fields from a structure [a good thing]
will also increase the cardinality [a bad thing]). It is the traditional
goal of any engineer to choose the appropriate compromises for each
individual circumstance; therefore, you will need to carefully balance
each of the four attributes above.
A module that uses loose coupling generally contains fewer errors per
KLOC (thousands of lines of code). Furthermore, modules that exhibit
loose coupling are easier to reuse (both in the current and future
projects). For more information on coupling, see the appropriate chapter
in "Code Complete".
- Guideline:
- Design good modules! Good modules exhibit loose
coupling. That is, there are only a few, well-defined (visible)
interfaces between the module and the outside world. Most data is
private, accessible only through accessor functions (see information
hiding below). Furthermore, the interface should be flexible.
- Guideline:
- Design good modules! Good modules exhibit
information hiding. Code outside the module should only have access to
the module through a small set of public routines. All data should be
private to that module. A module should implement an abstract data type.
All access to the module should go through a well-defined set of
operations.
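A minimal sketch of information hiding in MASM 6.x follows. The module name and the names LineCount, GetLineCount, and ResetLineCount are hypothetical; the point is only that the data never appears in a public directive, so code outside the module can reach it solely through the two accessor routines.

```asm
; LineCnt.asm - hypothetical module that hides its data.
; Only the two accessor routines are visible to other modules.

                .model  small
                public  GetLineCount, ResetLineCount

                .data
LineCount       word    0                       ;Private: never made public

                .code

; GetLineCount - returns the current line count in AX.

GetLineCount    proc
                mov     ax, LineCount
                ret
GetLineCount    endp

; ResetLineCount - sets the line count back to zero.

ResetLineCount  proc
                mov     LineCount, 0
                ret
ResetLineCount  endp
                end
```

If the representation of the count later changes (say, to a 32-bit value), only this one source file has to change.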
3.1.3 Physical Organization of Modules
Many languages provide direct support for modules (e.g., packages in Ada,
modules in Modula-2, and units in Delphi/Pascal). Some languages provide
only indirect support for modules (e.g., a source file in C/C++). Others,
like BASIC, don't really support modules, so you would have to simulate
them by physically grouping objects together and exercising some
discipline. Assembly language falls into the middle ground. The primary
mechanism for hiding names from other modules is to implement a module as
an individual source file and publish only those names that are part of
the module's interface to the outside world.
- Rule:
- Each module should completely reside in a single
source file. If size considerations prevent this, then all the source
files for a given module should reside in a subdirectory specifically
designated for that module.
Some people have the crazy idea that modularization means putting each
function in a separate source file. Such physical modularization
generally impairs the readability of a program more than it helps. Strive
instead for logical modularization, that is, defining a module by its
actions rather than by source code syntax (e.g., separating out
functions).
This document does not address the decomposition of a problem into its
modular components. Presumably, you can already handle that part of the
task. There are a wide variety of texts on this subject if you feel weak
in this area.
3.1.4 Module Interface
In any language system that supports modules, there are two primary
components of a module: the interface component that publicizes the module's
visible names and the implementation component that contains the actual
code, data, and private objects. MASM (and most assemblers) uses a scheme
that is very similar to the one C/C++ uses. There are directives that let
you import and export names. Like C/C++, you could place these directives
directly in the related source modules. However, such code is difficult
to maintain (since you need to change the directives in every file
whenever you modify a public name). The solution, as adopted in the
C/C++ programming languages, is to use header files. Header files contain
all the public definitions and exports (as well as common data type
definitions and constant definitions). The header file provides the
interface to the other modules that want to use the code present in the
implementation module.
The MASM 6.x externdef directive is perfect for creating interface
files. When you use externdef within a source module that defines a
symbol, externdef behaves like the public directive, exporting the name
to other modules. When you use externdef within a source module that
refers to an external name, externdef behaves like the extern (or extrn)
directive. This lets you place an externdef directive in a single file
and include this file into both the modules that import and export the
public names.
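For instance, an interface file for the "printer" module mentioned earlier might look roughly like this. The file name Printer.a and the routine names PrintChar and PrintStr are invented for illustration, and the near type assumes a small memory model.

```asm
; Printer.a - hypothetical interface file for a "Printer" module.

                externdef PrintChar:near, PrintStr:near

; Printer.asm, which defines PrintChar and PrintStr, includes this file;
; there the directive above behaves like "public".
;
; A client module includes the very same file; there the directive
; behaves like "extern", so the client can simply write:
;
;               include Printer.a
;               call    PrintStr
```

Because there is only one copy of the declaration, renaming a public routine means editing the header once rather than touching every client.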
If you are using an assembler that does not support externdef, you should
probably consider switching to MASM 6.x. If switching to a better
assembler (that supports externdef) is not feasible, the last thing you
want to do is have to maintain the interface information in several
separate files. Instead, use the assembler's ifdef conditional assembly
directives to assemble a set of public statements in the header file if a
symbol with the module's name is defined prior to including the header
file. It should assemble a set of extrn statements otherwise. Although
you still have to maintain the public and external information in two
places (in the ifdef true and false sections), they are in the same file
and located near one another.
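A sketch of that fallback scheme appears below. As before, Printer.a, PrintChar, and PrintStr are hypothetical names; the only convention assumed is that Printer.asm defines a symbol named Printer before including the header, and client modules do not.

```asm
; Printer.a - hypothetical header for an assembler that lacks externdef.

                ifdef   Printer

                public  PrintChar, PrintStr     ;Implementation module: export

                else

                extrn   PrintChar:near, PrintStr:near   ;Client: import

                endif

; Printer.asm would begin with:
;
; Printer       =       0
;               include Printer.a
;
; Client modules just include Printer.a without defining "Printer".
```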
- Rule:
- Keep all module interface directives (public, extrn,
extern, and externdef) in a single header file for a given module. Place
any other common data type definitions and constant definitions in this
header file as well.
- Guideline:
- There should only be a single header file
associated with any one module (even if the module has multiple source
files associated with it). If, for some reason, you feel it is necessary
to have multiple header files associated with a module, you should create
a single file that includes all of the other interface files. That way a
program that wants to use all the header files need only include the
single file.
When designing header files, make sure you can include a file more than
once without ill effects (e.g., duplicate symbol errors). The traditional
way to do this is to put an IFNDEF statement like the following around all
the statements in a header file:
; Module: MyHeader.a

                ifndef  MyHeader_A
MyHeader_A      =       0
                 .
                 .      ;Statements in this header file.
                 .
                endif
The first time a source file includes "MyHeader.a" the symbol
"MyHeader_A" is undefined. Therefore, the assembler will
process all the statements in the header file. In successive include
operations (during the same assembly) the symbol "MyHeader_A" is
already defined, so the assembler ignores the body of the include file.
Why would you ever include a file twice? Easy. Some header files may
include other header files. By including the file
"YourHeader.a" a module might also be including
"MyHeader.a" (assuming "YourHeader.a" contains the
appropriate include directive). Your main program, which includes
"YourHeader.a", might also need "MyHeader.a", so it
explicitly includes this file, not realizing "YourHeader.a" has
already processed "MyHeader.a", thereby causing symbol
redefinitions.
- Rule:
- Always put an appropriate IFNDEF statement around all
the definitions in a header file to allow multiple inclusion of the header
file without ill effect.
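The double-inclusion scenario described above might look like the following sketch. The symbol YourHeader_A and the contents of both files are assumptions that follow the same pattern as the "MyHeader.a" example.

```asm
; YourHeader.a (hypothetical) - depends on definitions in MyHeader.a.

                ifndef  YourHeader_A
YourHeader_A    =       0

                include MyHeader.a

                ;Statements in this header file would go here.

                endif


; Main program (hypothetical)

                include YourHeader.a            ;Also processes MyHeader.a.
                include MyHeader.a              ;Harmless: MyHeader_A is already defined.
```

Without the IFNDEF guard in "MyHeader.a", the second include would redefine every symbol in that file.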
- Guideline:
- Use the ".a" suffix for assembly
language header/interface files.
- Rule:
- Include files for library functions on a system
should exist as ".a" files and should appear in the
"/include" or "/asminc" subdirectory.
- Guideline:
- "/asminc" is probably a better choice
if you're using multiple languages since those other languages may need to
put files in a "/include" directory.
- Exception:
- It's probably reasonable to leave the UCR
Standard Library's "stdlib.a" file in the
"/stdlib/include" directory since most people expect it there.
4.0 Program Unit Organization
A program unit is any procedure, function, coroutine, iterator,
subroutine, subprogram, routine, or other term that describes a section
of code that abstracts a set of common operations on the computer. This
text will simply use the term procedure or routine to describe these
concepts.
Routines are closely related to modules, since they tend to be the major
component of a module (along with data, constants, and types). Hence,
many of the attributes that apply to a module also apply to routines. The
following paragraphs, at the expense of being redundant, repeat the
earlier definitions so you don't have to flip back to the previous
sections.
4.1 Routine Cohesion
Routines exhibit the following kinds of cohesion (listed from good to
bad):
- Functional or logical cohesion exists if the routine accomplishes
exactly one (simple) task.
- Sequential or pipelined cohesion exists when a routine does several
sequential operations that must be performed in a certain order with the
data from one operation being fed to the next in a "filter-like"
fashion.
- Global or communicational cohesion exists when a routine performs a
set of operations that make use of a common set of data, but are otherwise
unrelated.
- Temporal cohesion exists when a routine performs a set of operations
that need to be done at the same time (though not necessarily in the same
order). A typical initialization routine is an example of such code.
- Procedural cohesion exists when a routine performs a sequence of
operations in a specific order, but the only thing that binds them
together is the order in which they must be done. Unlike sequential
cohesion, the operations do not share data.
- State cohesion occurs when several different (unrelated) operations
appear in the same routine and a state variable (e.g., a parameter) selects
the operation to execute. Typically such routines contain a case (switch)
or if..elseif..elseif... statement.
- No cohesion exists if the operations in a routine have no apparent
relationship with one another.
The first three forms of cohesion above are generally acceptable in a
program. The fourth (temporal) is probably okay, but you should rarely
use it. The last three forms should almost never appear in a program.
For some reasonable examples of routine cohesion, you should consult "Code
Complete".
- Guideline:
- All routines should exhibit good
cohesiveness. Functional cohesiveness is best, followed by sequential and
global cohesiveness. Temporal cohesiveness is okay on occasion. You
should avoid the other forms.
4.1.1 Routine Coupling
Coupling refers to the way that two routines communicate with one
another. There are several criteria that define the level of coupling
between two routines:
- Cardinality- the number of objects communicated between two routines.
The fewer objects the better (i.e., fewer parameters).
- Intimacy- how "private" is the communication? Parameter lists are the
most private form; private data fields in a class or object are next
level; public data fields in a class or object are next, global variables
are even less intimate, and passing data in a file or database is the
least intimate connection. Well-written routines exhibit a high degree of
intimacy.
- Visibility- this is somewhat related to intimacy above. This refers
to how visible the data is to the entire system that you pass between two
routines. For example, passing data in a parameter list is direct and
very visible (you always see the data the caller is passing in the call to
the routine); passing data in global variables makes the transfer less
visible (you could have set up the global variable long before the call to
the routine). Another example is passing simple (scalar) variables rather
than loading up a bunch of values into a structure/record and passing that
structure/record to the callee.
- Flexibility- This refers to how easy it is to make the connection
between two routines that may not have been originally intended to call
one another. For example, suppose you pass a structure containing three
fields into a function. If you want to call that function but you only
have three data objects, not the structure, you would have to create a
dummy structure, copy the three values into the fields of that structure,
and then call the routine. On the other hand, had you simply passed the
three values as separate parameters, you could still pass in structures
(by specifying each field) as well as call the routine with separate
values.
A function is loosely coupled if it exhibits low cardinality, high
intimacy, high visibility, and high flexibility. Often, these features
are in conflict with one another (e.g., increasing the flexibility by
breaking out the fields from a structure [a good thing] will also
increase the cardinality [a bad thing]). It is the traditional goal of
any engineer to choose the appropriate compromises for each individual
circumstance; therefore, you will need to carefully balance each of the
four attributes above.
A program that uses loose coupling generally contains fewer errors per
KLOC (thousands of lines of code). Furthermore, routines that exhibit
loose coupling are easier to reuse (both in the current and future
projects). For more information on coupling, see the appropriate chapter
in "Code Complete".
- Guideline:
- Coupling between routines in source code should
be loose.
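The flexibility trade-off can be sketched with MASM 6.x's proto and invoke directives. Everything named here (Point3, SumCoords, Origin, Caller) is hypothetical, and the Pascal calling convention is simply one convenient choice. Because SumCoords takes three separate values, a caller can pass the fields of a structure or three unrelated registers with equal ease.

```asm
; Hypothetical module illustrating separate parameters vs. a structure.

                .model  small, pascal

Point3          struct
xCoord          word    ?
yCoord          word    ?
zCoord          word    ?
Point3          ends

SumCoords       proto   xVal:word, yVal:word, zVal:word

                .data
Origin          Point3  <0, 0, 0>

                .code

; SumCoords - returns xVal + yVal + zVal in AX.

SumCoords       proc    xVal:word, yVal:word, zVal:word
                mov     ax, xVal
                add     ax, yVal
                add     ax, zVal
                ret
SumCoords       endp

Caller          proc

                ;The values happen to live in a structure:

                invoke  SumCoords, Origin.xCoord, Origin.yCoord, Origin.zCoord

                ;The values are three unrelated data objects:

                invoke  SumCoords, bx, cx, dx
                ret
Caller          endp
                end
```

The price is higher cardinality: three parameters instead of one pointer to a structure.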
4.1.2 Routine Size
Sometime in the 1960's, someone decided that programmers could only look
at one page in a listing at a time, therefore routines should be a maximum
of one page long (66 lines, at the time). In the 1970's, when interactive
computing became popular, this was adjusted to 24 lines -- the size of a
terminal screen. In fact, there is very little empirical evidence to
suggest that small routine size is a good attribute. Indeed, several
studies on code containing artificial constraints on routine size indicate
just the opposite -- shorter routines often contain more bugs per KLOC[10].
A routine that exhibits functional cohesiveness is the right size, almost
regardless of the number of lines of code it contains. You shouldn't
artificially break up a routine into two or more subroutines (e.g.,
sub_partI and sub_partII) just because you feel a routine is getting to be
too long. First, verify that your routine exhibits strong cohesion and
loose coupling. If this is the case, the routine is not too long. Do
keep in mind, however, that a long routine is probably a good indication
that it is performing several actions and, therefore, does not exhibit
strong cohesion.
Of course, you can take this too far. Most studies on the subject
indicate that routines in excess of 150-200 lines of code tend to contain
more bugs and are more costly to fix than shorter routines. Note, by the
way, that you do not count blank lines or lines containing only comments
when counting the lines of code in a program.
Also note that most studies involving routine size deal with HLLs. A
comparable assembly language routine will contain more lines of code than
the corresponding HLL routine. Therefore, you can expect your routines in
assembly language to be a little longer.
- Guideline:
- Do not let artificial constraints affect the
size of your routines. If a routine exceeds about 200-250 lines of code,
make sure the routine exhibits functional or sequential cohesion. Also
look to see if there aren't some generic subsequences in your code that
you can turn into stand alone routines.
- Rule:
- Never shorten a routine by dividing it into n parts
that you would always call in the same fixed sequence.
4.2 Placement of the Main Procedure and Data
As noted earlier, you should name the main procedure main and place it
in the source file bearing the same name as the executable file. If this
module is rather long, it can still be difficult to locate the main
program. A good solution is to always place the main procedure at the
same point in the source file. By convention (meaning everyone expects it
this way), most programmers make their main program the first or last
procedure in an assembly language program. Either position is fine.
Putting the main program anywhere else makes it hard to find.
- Rule:
- Always make the main procedure the first or last
procedure in a source file.
MASM, because it is a multiphase assembler, does not require that you
define a symbol before you use it. This is necessary because many
instructions (like JMP) need to refer to symbols found later in the
program. In a similar manner, MASM doesn't really care where you define
your data - before or after its use[11].
However, most programmers "grew up" with high level languages
that require the definition of a symbol before its first use. As a
result, they expect to be able to find a variable declaration by looking
backwards in the source file. Since everyone expects this, it is a good
tradition to continue in an assembly language program.
- Rule:
- You should declare all variables, constants, and
macros prior to their use in an assembly language program.
- Rule:
- You should define all static variables (those you
declare in a segment) at the beginning of the source module.
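Putting these placement rules together, a small source file might be laid out as in the following sketch. The program (Hello.asm), its data, and the PrintGreeting routine are invented for illustration; the layout, not the code, is the point.

```asm
; Hello.asm - hypothetical program; the executable would be Hello.exe.

                .model  small
                .stack  100h

; Static data comes first, ahead of any code that uses it.

                .data
Greeting        byte    "Hello", 0dh, 0ah, "$"

                .code

PrintGreeting   proc
                mov     dx, offset Greeting     ;DS:DX -> "$"-terminated string
                mov     ah, 9                   ;DOS "print string" function
                int     21h
                ret
PrintGreeting   endp

; The main procedure is the last procedure in the file.

main            proc
                mov     ax, @data               ;Point DS at the data segment
                mov     ds, ax
                call    PrintGreeting
                mov     ax, 4C00h               ;Terminate with return code 0
                int     21h
main            endp
                end     main
```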
5.0 Statement Organization
In an assembly language program, the author must work extra hard to make
a program readable. By following a large number of rules, you can produce
a program that is readable. However, by breaking a single rule, no matter
how many other rules you've followed, you can render a program
unreadable. Nowhere is this more true than in how you organize the
statements within your program. Consider the following example taken from
"The Art of Assembly Language Programming":
______________________________________________________
        mov     ax, 0
mov bx, ax
                    add   ax, dx
    mov cx, ax
______________________________________________________
                mov     ax, 0
                mov     bx, ax
                add     ax, dx
                mov     cx, ax
______________________________________________________
While this is an extreme example, do note that it only takes a few
mistakes to have a large impact on the readability of a program. Consider
(a short section from) an example presented earlier:
GetFileRecords:
        mov dx, OFFSET DTA ;Set up DTA
        mov ah, 1Ah
        int 21h
        mov dx, FILESPEC ;Get first file name
        mov cl, 37h
        mov ah, 4Eh
        int 21h
        jnc FileFound ;No files. Try a different filespec.
        mov si, OFFSET NoFilesMsg
        call Error
        jmp NewFilespec
FileFound:
        mov di, OFFSET fileRecords ;DI -> storage for file names
        mov bx, OFFSET files ;BX -> array of files
        sub bx, 2
Improved version:
GetFileRecords: mov     dx, offset DTA          ;Set up DTA
                DOS     SetDTA

                mov     dx, FileSpec
                mov     cl, 37h
                DOS     FindFirstFile
                jc      FileNotFound

                mov     di, offset fileRecords  ;DI -> storage for file names
                mov     bx, offset files        ;BX -> array of files
                sub     bx, 2                   ;Special case for 1st iteration
An assembly language statement consists of four possible fields: a label
field, a mnemonic field, an operand field, and a comment field. The
mnemonic and comment fields are always optional. The label field is
generally optional although certain instructions (mnemonics) do not allow
labels while others require labels. The operand field's presence is tied
to the mnemonic field. For most instructions the actual mnemonic
determines whether an operand field must be present.
MASM is a free-form assembler insofar as it does not require these fields
to appear in any particular column[12].
However, the freedom to arrange these columns in any manner is one of the
primary contributors to hard-to-read assembly language programs. Although
MASM lets you enter your programs in free-form, there is absolutely no
reason you cannot adopt a fixed field format, always starting each field
in the same column. Doing so generally helps make an assembly language
program much easier to read. Here are the rules you should use:
- Rule:
- If an identifier is present in the label field,
always start that identifier in column one of the source line.
- Rule:
- All mnemonics should start in the same column.
Generally, this should be column 17 (the second tab stop) or some other
convenient position.
- Rule:
- All operands should start in the same column.
Generally, this should be column 25 (the third tab stop) or some other
convenient position.
- Exception:
- If a mnemonic (typically a macro) is longer than
seven characters and requires an operand, you have no choice but to start
the operand field beyond column 25 (this is an exception assuming you've
chosen columns 17 and 25 for your mnemonic and operand fields,
respectively).
- Guideline:
- Try to always start the comment fields on
adjacent source lines in the same column (note that it is impractical to
always start the comment field in the same column throughout a program).
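The following sketch shows the fixed-field layout with a column ruler above it. The CopyBytes loop is a made-up example that assumes DS:SI points at the source, DS:DI at the destination, and CX holds a nonzero byte count. The label starts in column 1, the mnemonics in column 17, the operands in column 25, and the comments on adjacent lines share a column.

```asm
;        1         2         3         4         5
;2345678901234567890123456789012345678901234567890

CopyBytes:      mov     al, [si]                ;Fetch the next source byte.
                mov     [di], al                ;Store it in the destination.
                inc     si
                inc     di
                dec     cx                      ;One fewer byte to copy.
                jnz     CopyBytes
```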
Most people learn a high level language prior to learning assembly
language. They have been firmly taught that readable (HLL) programs have
their control structures properly indented to show the structure of the
program. Indentation works great when you have a block structured
language. Assembly language, however, is the original unstructured
language and indentation rules for a structured programming language
simply do not apply. While it is important to be able to denote that a
certain sequence of instructions is special (e.g., belongs to the
"then" portion of an if..then..else..endif statement),
indentation is not the proper way to achieve this in an assembly language
program.
If you need to set off a sequence of statements from surrounding code,
the best thing you can do is use blank lines in your source code. For a
small amount of detachment, to separate one computation from another for
example, a single blank line is sufficient. To really show that one
section of code is special, use two, three, or even four blank lines to
separate one block of statements from the surrounding code. To separate
two totally unrelated sections of code, you might use several blank lines
and a row of dashes or asterisks to separate the statements. E.g.,
                mov     dx, FileSpec
                mov     cl, 37h
                DOS     FindFirstFile
                jc      FileNotFound

; *********************************************

                mov     di, offset fileRecords  ;DI -> storage for file names
                mov     bx, offset files        ;BX -> array of files
                sub     bx, 2                   ;Special case for 1st iteration
- Guideline:
- Use blank lines to separate special blocks of
code from the surrounding code. Use an aesthetic looking row of asterisks
or dashes if you need a stronger separation between two blocks of code (do
not overdo this, however).
If two sequences of assembly language statements correspond to roughly
two HLL statements, it's generally a good idea to put a blank line between
the two sequences. This helps clarify the two segments of code in the
reader's mind. Of course, it is easy to get carried away and insert too
much white space in a program, so use some common sense here.
- Guideline:
- If two sequences of code in assembly language
correspond to two adjacent statements in a HLL, then use a blank line to
separate those two assembly sequences (assuming the sequences are real
short).
A common problem in any language (not just assembly language) is a line
containing a comment that is adjacent to one or two lines containing
code. Such a program is very difficult to read because it is hard to
determine where the code ends and the comment begins (or vice-versa).
This is especially true when the comments contain sample code. It is
often quite difficult to determine if what you're looking at is code or
comments; hence the following enforced rule:
- Enforced Rule:
- Always put at least one blank line between
code and comments (assuming, of course, the comment is sitting on a line
by itself; that is, it is not an endline comment[13]).
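An enforced rule like this one lends itself to an automated check. The sketch below is one possible approach, not a tool this standard prescribes; the function names are hypothetical, and it assumes that any line whose first non-blank character is a semicolon is a standalone comment.

```python
def is_standalone_comment(line):
    """True if the line holds only a comment (first non-blank char is ';')."""
    return line.lstrip().startswith(";")


def is_code(line):
    """True if the line is neither blank nor a standalone comment."""
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith(";")


def find_crowded_comments(source):
    """Return 1-based line numbers of standalone comments that touch code."""
    lines = source.splitlines()
    crowded = []
    for index, line in enumerate(lines):
        if not is_standalone_comment(line):
            continue
        before = lines[index - 1] if index > 0 else ""
        after = lines[index + 1] if index + 1 < len(lines) else ""
        if is_code(before) or is_code(after):
            crowded.append(index + 1)
    return crowded
```

Endline comments never trigger the check because their lines begin with code rather than a semicolon.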
6.0 Comments
Comments in an assembly language program generally come in two forms:
endline comments and standalone comments[14]. As their names suggest, endline
comments always occur at the end of a source statement and standalone
comments sit on a line by themselves[15].
These two types of comments have distinct purposes; this section will
explore their use and describe the attributes of a well-commented
program.
6.1 What is a Bad Comment?
It is amazing how many programmers claim their code is well-commented.
Were you to count characters between (or after) the comment delimiters,
they might have a point. Consider, however, the following comment:
mov ax, 0 ;Set AX to zero.
Quite frankly, this comment is worse than no comment at all. It doesn't
tell the reader anything the instruction itself doesn't tell and it
requires the reader to take some of his or her precious time to figure out
that the comment is worthless. If someone cannot tell that this
instruction is setting AX to zero, they have no business reading an
assembly language program. This brings up the first guideline of this
section:
- Guideline:
- Choose an intended audience for your source code
and write the comments to that audience. For assembly language source
code, you can usually assume the target audience consists of those who know a
reasonable amount of assembly language.
Don't explain the actions of an assembly language instruction in your
code unless that instruction is doing something that isn't obvious (and
most of the time you should consider changing the code sequence if it
isn't obvious what is going on). Instead, explain how that instruction is
helping to solve the problem at hand. The following is a much better
comment for the instruction above:
mov ax, 0 ;AX is the resulting sum. Initialize it.
Note that the comment does not say "Initialize it to zero."
Although there would be nothing intrinsically wrong with saying this, the
phrase "Initialize it" remains true no matter what value you
assign to AX. This makes maintaining the code (and comment) much easier
since you don't have to change the comment whenever you change the
constant associated with the instruction.
- Guideline:
- Write your comments in such a way that minor
changes to the instruction do not require that you change the
corresponding comment.
Note: Although a trivial comment is bad (indeed, worse than no comment at
all), the worst comment a program can have is one that is wrong. Consider
the following statement:
mov ax, 1 ;Set AX to zero.
It is amazing how long a typical person will look at this code trying to
figure out how on earth the program sets AX to zero when it's obvious it
does not do this. People will always believe comments over code. If
there is some ambiguity between the comments and the code, they will
assume that the code is tricky and that the comments are correct. Only
after exhausting all possible options is the average person likely to
concede that the comment must be incorrect.
- Enforced Rule:
- Never allow incorrect comments in your
program.
This is another reason not to put trivial comments like "Set AX to
zero" in your code. As you modify the program, these are the
comments most likely to become incorrect as you change the code and fail
to keep the comments in sync. However, even some non-trivial comments can
become incorrect via changes to the code. Therefore, always follow this
rule:
- Enforced Rule:
- Always update all comments affected by a
code change immediately after making the code change.
Undoubtedly you've heard the phrase "make sure you comment your code
as though someone else wrote it for you; otherwise in six months you'll
wish you had." This statement encompasses two concepts. First,
don't ever think that your understanding of the current code will last.
While working on a given section of a program you're probably investing
considerable thought and study to figure out what's going on. Six months
down the road, however, you will have forgotten much of what you figured
out and the comments can go a long way to getting you back up to speed
quickly. The second point this statement makes is the implication that others
read and write code too. You will have to read someone else's code; they
will have to read yours. If you write your comments the way you would
expect others to write theirs for you, chances are pretty good that your
comments will work for them as well.
- Rule:
- Never use racist, sexist, obscene, or other
exceptionally politically incorrect language in your comments.
Undoubtedly such language in your comments will come back to embarrass you
in the future. Furthermore, it's doubtful that such language would help
someone better understand the program.
It's much easier to give examples of bad comments than it is to discuss
good comments. The following list describes some of the worst possible
comments you can put in a program (from worst up to barely tolerable):
- The absolute worst comment you can put into a program is an incorrect
comment. Consider the following assembly statement:
mov ax, 10; { Set AX to 11 }
- It is amazing how many programmers will automatically assume the
comment is correct and try to figure out how this code manages to set the
AX register to the value 11 when the code so obviously sets it to 10.
- The second worst comment you can place in a program is a comment that
explains what a statement is doing. The typical example is something like
"mov ax, 10; { Set AX to 10 }". Unlike the previous example, this
comment is correct. But it is still worse than no comment at all because
it is redundant and forces the reader to spend additional time reading the
code (reading time is directly proportional to reading difficulty). This
also makes it harder to maintain since slight changes to the code (e.g.,
"mov ax, 9") require modifications to the comment that would
not otherwise be required.
- The third worst comment in a program is an irrelevant one. Telling a
joke, for example, may seem cute, but it does little to improve the
readability of a program; indeed, it offers a distraction that breaks
concentration.
- The fourth worst comment is no comment at all.
- The fifth worst comment is a comment that is obsolete or out of date
(though not incorrect). For example, comments at the beginning of the
file may describe the current version of a module and who last worked on
it. If the last programmer to modify the file did not update the
comments, the comments are now out of date.
6.2 What is a Good Comment?
Steve McConnell provides a long list of suggestions for high-quality
code. These suggestions include:
- Use commenting styles that don't break down or discourage
modification. Essentially, he's saying pick a commenting style that isn't
so much work people refuse to use it. He gives an example of a block of
comments surrounded by asterisks as being hard to maintain. This is a
poor example since modern text editors will automatically "outline" the
comments for you. Nevertheless, the basic idea is sound.
- Comment as you go along. If you put commenting off until the last
moment, then it seems like another task in the software development
process always comes along and management is likely to discourage the
completion of the commenting task in hopes of meeting new deadlines.
- Avoid self-indulgent comments. Also, you should avoid sexist,
profane, or other insulting remarks in your comments. Always remember,
someone else will eventually read your code.
- Avoid putting comments on the same physical line as the statement they
describe. Such comments are very hard to maintain since there is very
little room. McConnell suggests that endline comments are okay for
variable declarations. For some this might be true but many variable
declarations may require considerable explanation that simply won't fit at
the end of a line. One exception to this rule is "maintenance notes."
Comments that refer to a defect tracking entry in the defect database are
okay (note that the CodeWright text editor provides a much better solution
for this -- buttons that can bring up an external file). Of course,
endline comments are marginally more useful in assembly language than in
the HLLs that McConnell addresses, but the basic idea is sound.
- Write comments that describe blocks of statements rather than
individual statements. Comments covering single statements tend to
discuss the mechanics of that statement rather than discussing what the
program is doing.
- Focus paragraph comments on the why rather than the how. Comments should
explain what the program is doing and why the programmer chose to do it
that way rather than explain what each individual statement is doing.
- Use comments to prepare the reader for what is to follow. Someone
reading the comments should be able to have a good idea of what the
following code does without actually looking at the code. Note that this
rule also suggests that comments should always precede the code to which
they apply.
- Make every comment count. If the reader wastes time reading a comment
of little value, the program is harder to read; period.
- Document surprises and tricky code. Of course, the best solution is
not to have any tricky code. In practice, you can't always achieve this
goal. When you do need to resort to some tricky code, make sure you
fully document what you've done.
- Avoid abbreviations. While there may be an argument for abbreviating
identifiers that appear in a program, no way does this apply to
comments.
- Keep comments close to the code they describe. The prologue to a
program unit should give its name, describe the parameters, and provide a
short description of the program. It should not go into details about the
operation of the module itself. Internal comments should do that.
- Comments should explain the parameters to a function, assertions about
these parameters, and whether they are input, output, or in/out parameters.
- Comments should describe a routine's limitations, assumptions, and any
side effects.
- Rule:
- All comments will be
high-quality comments that describe the actions of the surrounding code in
a concise manner.
6.3 Endline vs. Standalone Comments
- Guideline:
- Whenever a comment appears on a line by itself,
always put the semicolon in column one. You may indent the text if this
is appropriate or aesthetic.
- Guideline:
- Adjacent lines of comments should not have any
interspersed blank lines. A blank comment line should, at least, contain
a semicolon in column one.
The guideline above suggests that your code should look like this:
; This is a comment with a blank line between it and the next comment.
;
; This is another line with a comment on it.
Rather than like this:
; This is a comment with a blank line between it and the next comment.

; This is another line with a comment on it.
The semicolon appearing between the two comments suggests continuity
that is not present when you remove the semicolon. If two blocks of
comments are truly separate and whitespace between them is appropriate,
you should consider separating them by a large number of blank lines to
completely eliminate any possible association between the two.
Standalone comments are great for describing the actions of the code that
immediately follows. So what are endline comments useful for? Endline
comments can explain how a sequence of instructions is implementing the
algorithm described in a previous set of standalone comments. Consider
the following code:
; Compute the transpose of a matrix using the algorithm:
;
;       for i := 0 to 3 do
;           for j := 0 to 3 do
;               swap( a[i][j], b[j][i] );

                forlp   i, 0, 3
                forlp   j, 0, 3

                mov     bx, i           ;Compute address of a[i][j] using
                shl     bx, 2           ; row major ordering (i*4 + j)*2.
                add     bx, j
                add     bx, bx
                lea     bx, a[bx]
                push    bx              ;Push address of a[i][j] onto stack.

                mov     bx, j           ;Compute address of b[j][i] using
                shl     bx, 2           ; row major ordering (j*4 + i)*2.
                add     bx, i
                add     bx, bx
                lea     bx, b[bx]
                push    bx              ;Push address of b[j][i] onto stack.

                call    swap            ;Swap a[i][j] with b[j][i].

                next
                next
Note that the block comments before this sequence explain, in high level
terms, what the code is doing. The endline comments explain how the
statement sequence implements the general algorithm. Note, however, that
the endline comments do not explain what each statement is doing (at least
at the machine level). Rather than claiming "add bx, bx" is
multiplying the quantity in BX by two, this code assumes the reader can
figure that out for themselves (any reasonable assembly programmer would
know this). Once again, keep in mind your audience and write your
comments for them.
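For readers who want to convince themselves that the endline comments match the instructions, the address arithmetic can be checked with a short Python sketch. This is only an illustration: both function names are hypothetical, and the sketch assumes a 4x4 array of two-byte elements, as the (i*4 + j)*2 formula in the comments implies.

```python
def element_offset(row, column, columns_per_row=4, element_size=2):
    """Byte offset of element [row][column] in a row-major array."""
    return (row * columns_per_row + column) * element_size


def offset_via_instructions(i, j):
    """Mirror the mov/shl/add/add sequence from the example above."""
    bx = i                      # mov bx, i
    bx = (bx << 2) & 0xFFFF     # shl bx, 2   -> i*4
    bx = (bx + j) & 0xFFFF      # add bx, j   -> i*4 + j
    bx = (bx + bx) & 0xFFFF     # add bx, bx  -> (i*4 + j)*2
    return bx
```

Both functions agree for every index pair the loops visit, which is exactly what the endline comments claim.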
6.4 Unfinished Code
Often it is the case that a programmer will write a section of code that
(partially) accomplishes some task but needs further work to complete a
feature set, make it more robust, or remove some known defect in the
code. It is common for such programmers to place comments into the code
like "This needs more work," "Kludge ahead," etc.
The problem with these comments is that they are often forgotten. It
isn't until the code fails in the field that the section of code
associated with these comments is found and their problems corrected.
Ideally, one should never have to put such code into a program. Of
course, ideally, programs never have any defects in them, either. Since
such code inevitably finds its way into a program, it's best to have a
policy in place to deal with it, hence this section.
Unfinished code comes in five general categories: non-functional code,
partially functioning code, suspect code, code in need of enhancement, and
code documentation. Non-functional code might be a stub or driver that
needs to be replaced in the future with actual code or some code that has
severe enough defects that it is useless except for some small special
cases. This code is really bad; fortunately, its severity prevents you
from ignoring it. It is unlikely anyone would miss such a poorly
constructed piece of code in early testing prior to release.
Partially functioning code is, perhaps, the biggest problem. This code
works well enough to pass some simple tests yet contains serious defects
that should be corrected. Moreover, these defects are known. Software
often contains a large number of unknown defects; it's a shame to let some
(prior) known defects ship with the product simply because a programmer
forgot about a defect or couldn't find the defect later.
Suspect code is exactly that: code that is suspicious. The programmer
may not be aware of a quantifiable problem but may suspect that a problem
exists. Such code will need a later review in order to verify whether it
is correct.
The fourth category, code in need of enhancement, is the least serious.
For example, to expedite a release, a programmer might choose to use a
simple algorithm rather than a complex, faster algorithm. S/he could make
a comment in the code like "This linear search should be replaced by
a hash table lookup in a future version of the software." Although
it might not be absolutely necessary to correct such a problem, it would
be nice to know about such problems so they can be dealt with in the
future.
The fifth category, documentation, refers to changes made to software
that will affect the corresponding documentation (user guide, design
document, etc.). The documentation department can search for these
defects to bring existing documentation in line with the current code.
This standard defines a mechanism for dealing with these five classes of
problems. Any occurrence of unfinished code will be preceded by a comment
that takes one of the following forms (where "_" denotes a
single space):
;_#defect#severe_;
;_#defect#functional_;
;_#defect#suspect_;
;_#defect#enhancement_;
;_#defect#documentation_;
It is important to use all lower case and verify the correct spelling so
it is easy to find these comments using a text editor search or a tool
like grep. Obviously, a separate comment explaining the situation must
follow these comments in the source code.
Examples:
; #defect#suspect ;
; #defect#enhancement ;
; #defect#documentation ;
Notice the use of comment delimiters (the semicolon) on both sides even
though assembly language doesn't require them.
- Enforced Rule:
- If a module
contains some defects that cannot be immediately removed because of time
or other constraints, the programmer will insert a standardized comment
before the code so that it is easy to locate such problems in the future.
The five standardized comments are ";_#defect#severe_;",
";_#defect#functional_;", ";_#defect#suspect_;",
";_#defect#enhancement_;", and
";_#defect#documentation_;" where "_" denotes a
single space. The spelling and spacing should be exact so it is easy to
search for these strings in the source tree.
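Because the spelling and spacing are exact, a few lines of Python (or an equivalent grep pattern) can locate every tagged defect. The sketch below is one possible approach rather than a required tool; the function names are hypothetical. It also reports lines that mention "#defect#" without matching the exact form, since a misspelled tag is precisely what makes these comments easy to lose.

```python
import re

# Exact form required by the standard: "; #defect#<category> ;"
DEFECT_TAG = re.compile(
    r"^; #defect#(severe|functional|suspect|enhancement|documentation) ;$"
)


def find_defects(source):
    """Return (line number, category) for every well-formed defect tag."""
    found = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = DEFECT_TAG.match(line.rstrip())
        if match:
            found.append((number, match.group(1)))
    return found


def find_malformed_defects(source):
    """Return line numbers that mention #defect# but break the exact form."""
    malformed = []
    for number, line in enumerate(source.splitlines(), start=1):
        if "#defect#" in line.lower() and not DEFECT_TAG.match(line.rstrip()):
            malformed.append(number)
    return malformed
```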
6.5 Cross References in Code to Other Documents
In many instances a section of code might be intrinsically tied to some
other document. For example, you might refer the reader to the user
document or the design document within your comments in a program. This
document proposes a standard way to do this so that it is relatively easy
to locate cross references appearing in source code. The technique is
similar to that for defect reporting, except the comments take the form:
; text #link#location text ;
"Text" is optional and represents arbitrary text (although it
is really intended for embedding HTML commands to provide hyperlinks to
the specified document). "Location" describes the document and
section where the associated information can be found.
Examples:
; #link#User's Guide Section 3.1 ;
; #link#Program Design Document, Page 5 ;
; #link#Funcs.pas module, "xyz" function ;
; <A HREF="DesignDoc.html#xyzfunc"> #link#xyzfunc </a> ;
- Guideline:
- If a module contains
some cross references to other documents, there should be a comment that
takes the form "; text #link#location text ;" that provides the
reference to that other document. In this comment "text"
represents some optional text (typically reserved for HTML tags) and
"location" is some descriptive text that describes the document
(and a position in that document) related to the current section of code
in the program.
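Cross references in this form can be collected with a small script. The following Python sketch is only an illustration under stated assumptions: the function name is hypothetical, and because the form does not say where "location" ends and the trailing "text" begins, the sketch treats everything after "#link#" (with any HTML tags removed) as the location.

```python
import re

# "; text #link#location text ;" with optional text on either side.
LINK_COMMENT = re.compile(r"^;\s.*?#link#(.*\S)\s;$")
HTML_TAG = re.compile(r"<[^>]*>")


def find_links(source):
    """Return (line number, location) for every #link# comment."""
    links = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = LINK_COMMENT.match(line.rstrip())
        if match:
            # Assumption: strip HTML tags, keep the rest as the location.
            location = HTML_TAG.sub("", match.group(1)).strip()
            links.append((number, location))
    return links
```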
7.0 Names, Instructions, Operators, and Operands
Although program features like good comments, proper spacing of
statements, and good modularization can help yield programs that are more
readable, ultimately a programmer must read the instructions in a
program to understand what it does. Therefore, do not underestimate the
importance of making your statements as readable as possible. This
section deals with this issue.
7.1 Names
According to studies done at IBM, the use of high-quality identifiers in
a program contributes more to the readability of that program than any
other single factor, including high-quality comments. The quality of
your identifiers can make or break your program; programs with
high-quality identifiers can be very easy to read, while programs with
poor-quality identifiers will be very difficult to read. There are very few
"tricks" to developing high-quality names; most of the rules are nothing
more than plain old-fashioned common sense. Unfortunately, programmers
(especially C/C++ programmers) have developed many arcane naming
conventions that ignore common sense. The biggest obstacle most
programmers have to learning how to create good names is an unwillingness
to abandon existing conventions. Yet their only defense when quizzed on
why they adhere to (existing) bad conventions seems to be "because that's
the way I've always done it and that's the way everybody else does it."
The aforementioned researchers at IBM developed several programs with
the following set of attributes:
- Bad comments, bad names
- Bad comments, good names
- Good comments, bad names
- Good comments, good names
As should be obvious, the programs that had bad comments and names were
the hardest to read; likewise, those programs with good comments and
names were the easiest to read. The surprising results concerned the
other two cases. Most people assume good comments are more important than
good names in a program. Not only did IBM find this to be false, they
found it to be really false.
As it turns out, good names are even more important than good comments in
a program. This is not to say that comments are unimportant (they are
extremely important); however, it is worth pointing out that if you spend
the time to write good comments and then choose poor names for your
program's identifiers, you've damaged the readability of your program
despite the work you've put into your comments. Quickly read over the
following code:
mov ax, SignedValue
cwd
add ax, -1
rcl dx, 1
mov AbsoluteValue, dx
Question: What does this code compute and store in the AbsoluteValue
variable?
- The sign extension of SignedValue.
- The negation of SignedValue.
- The absolute value of SignedValue.
- A boolean value indicating that the result is positive or negative.
- Signum(SignedValue) (-1, 0, +1 if neg, zero, pos).
- Ceil(SignedValue)
- Floor(SignedValue)
The obvious answer is the absolute value of SignedValue. This is also
incorrect. The correct answer is signum:
                mov     ax, SignedValue ;Get value to check.
                cwd                     ;DX = FFFF if neg, 0000 otherwise.
                add     ax, 0ffffh      ;Carry=0 if ax is zero, one otherwise.
                rcl     dx, 1           ;DX = FFFF if AX is neg, 0 if ax=0,
                mov     Signum, dx      ; 1 if ax>0.
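If the carry trick is hard to follow, it may help to simulate the sequence. The Python sketch below models the 16-bit registers and the carry flag by hand; it is an illustration only, the function name is hypothetical, and it models just the parts of the flag behavior this sequence depends on.

```python
def signum16(signed_value):
    """Simulate the cwd/add/rcl sequence on 16-bit registers."""
    ax = signed_value & 0xFFFF              # mov ax, SignedValue
    dx = 0xFFFF if ax & 0x8000 else 0x0000  # cwd: sign-extend AX into DX
    total = ax + 0xFFFF                     # add ax, 0ffffh
    carry = total >> 16                     #   carry is 0 only when AX was 0
    ax = total & 0xFFFF
    dx = ((dx << 1) | carry) & 0xFFFF       # rcl dx, 1: carry enters bit 0
    # Reinterpret DX as a signed 16-bit result (mov Signum, dx).
    return dx - 0x10000 if dx & 0x8000 else dx
```

Running it on a negative, a zero, and a positive input gives -1, 0, and +1 respectively, as the comments on the assembly sequence state.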
Granted, this is a tricky piece of code[16].
Nonetheless, even without the comments you can probably figure out what
the code sequence does even if you can't figure out how it does it:
mov ax, SignedValue
cwd
add ax, 0ffffh
rcl dx, 1
mov Signum, dx
Based on the names alone you can probably figure out that this code
computes the signum function. This is the "understanding 80% of the
code" referred to earlier. Note that you don't need misleading names
to make this code unfathomable. Consider the following code that doesn't
trick you by using misleading names:
mov ax, x
cwd
add ax, 0ffffh
rcl dx, 1
mov y, dx
This is a very simple example. Now imagine a large program that has many
names. As the number of names increases in a program, it becomes harder to
keep track of them all. If the names themselves do not provide a good
clue to the meaning of the name, understanding the program becomes very
difficult.
- Enforced Rule:
- All identifiers appearing in an assembly
language program must be descriptive names whose meaning and use are
clear.
Since labels (i.e., identifiers) are the target of jump and call
instructions, a typical assembly language program will have a large number
of identifiers. Therefore, it is tempting to begin using names like
"label1, label2, label3, ..." Avoid this temptation! There is
always a reason you are jumping to some spot in your code. Try to
describe that reason and use that description for your label name.
- Rule:
- Never use names like "Lbl0, Lbl1, Lbl2,
..." in your program.
7.1.1 Naming Conventions
Naming conventions represent one area in Computer Science where there are
far too many divergent views (program layout is the other principal
area). The primary purpose of an object's name in a programming language
is to describe the use and/or contents of that object. A secondary
consideration may be to describe the type of the object. Programmers use
different mechanisms to handle these objectives. Unfortunately, there are
far too many "conventions" in place; it would be asking too much to expect
any one programmer to follow several different standards. Therefore, this
standard will apply across all languages as much as possible.
The vast majority of programmers know only one language - English. Some
programmers know English as a second language and may not be familiar with
a common non-English phrase that is not in their own language (e.g.,
rendezvous). Since English is the common language of most programmers,
all identifiers should use easily recognizable English words and
phrases.
- Rule:
- All identifiers that
represent words or phrases must be English words or phrases.
7.1.2 Alphabetic Case Considerations
A case-neutral identifier will work properly whether you compile it with
a compiler that has case sensitive identifiers or case insensitive
identifiers. In practice, this means that all uses of the identifiers
must be spelled exactly the same way (including case) and that no other
identifier exists whose only difference is the case of the letters in the
identifier. For example, if you declare an identifier "ProfitsThisYear"
in Pascal (a case-insensitive language), you could legally refer to this
variable as "profitsThisYear" and "PROFITSTHISYEAR". However, this is
not a case-neutral usage since a case sensitive language would treat these
three identifiers as different names. Conversely, in case-sensitive
languages like C/C++, it is possible to create two different identifiers
with names like "PROFITS" and "profits" in the program. This is not
case-neutral since attempting to use these two identifiers in a case
insensitive language (like Pascal) would produce an error since the
case-insensitive language would think they were the same name.
- Enforced Rule:
- All identifiers
must be "case-neutral."
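Case neutrality is easy to check mechanically once you have a list of the identifiers a program uses. The Python sketch below shows one way to do it; the function name is hypothetical, and extracting the identifiers from the source file is assumed to happen elsewhere.

```python
from collections import defaultdict


def find_case_conflicts(identifiers):
    """Group identifiers that differ only by alphabetic case.

    Returns a dict mapping the lower-cased name to the sorted list of
    distinct spellings, for every name with more than one spelling.
    """
    spellings = defaultdict(set)
    for name in identifiers:
        spellings[name.lower()].add(name)
    return {
        key: sorted(names)
        for key, names in spellings.items()
        if len(names) > 1
    }
```

An empty result means every identifier is spelled the same way everywhere and no two identifiers differ only in case, which are the two conditions described above.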
Different programmers (especially in different languages) use alphabetic
case to denote different objects. For example, a common C/C++ coding
convention is to use all upper case to denote a constant, macro, or type
definition and to use all lower case to denote variable names or reserved
words. Prolog programmers use an initial upper case letter to denote
a variable. Other comparable coding conventions exist. Unfortunately,
there are so many different conventions that make use of alphabetic case,
they are nearly worthless, hence the following rule:
- Rule:
- You should never use
alphabetic case to denote the type, classification, or any other
program-related attribute of an identifier (unless the language's syntax
specifically requires this).
There are going to be some obvious exceptions to the above rule; this
document will cover those exceptions a little later. Alphabetic case does
have one very useful purpose in identifiers - it is useful for separating
words in a multi-word identifier; more on that subject in a moment.
To produce readable identifiers often requires a multi-word phrase.
Natural languages typically use spaces to separate words; we cannot,
however, use this technique in identifiers.
Unfortunatelywritingmultiwordidentifiersmakesthemalmostimpossibletoreadifyoudonotdosomethingtodistinguishtheindividualwords
(Unfortunately writing multiword identifiers makes them almost impossible
to read if you do not do something to distinguish the individual words).
There are a couple of good conventions in place to solve this problem.
This standard's convention is to capitalize the first alphabetic character
of each word in the middle of an identifier.
- Rule:
- Capitalize the first letter
of interior words in all multi-word identifiers.
Note that the rule above does not specify whether the first letter of an
identifier is upper or lower case. Subject to the other rules governing
case, you can elect to use upper or lower case for the first symbol,
although you should be consistent throughout your program.
Lower case characters are easier to read than upper case. Identifiers
written completely in upper case take almost twice as long to recognize
and, therefore, impair the readability of a program. Yes, all upper case
does make an identifier stand out. Such emphasis is rarely necessary in
real programs. Yes, common C/C++ coding conventions dictate the use of
all upper case identifiers. Forget them. They not only make your
programs harder to read, they also violate the first rule above.
- Rule:
- Avoid using all upper case
characters in an identifier.
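The two case rules above (capitalize interior words, avoid all upper case) can be expressed as a pair of small helpers. This Python sketch is illustrative only: both function names are hypothetical, and treating a single-letter name as not "all upper case" is an assumption of the sketch, not something this standard states.

```python
def join_words(words, capitalize_first=False):
    """Build an identifier, capitalizing the first letter of interior words.

    The case of the first word is left to the caller, since the rule
    above does not specify it.
    """
    if not words:
        return ""
    first = words[0].lower()
    if capitalize_first:
        first = first.capitalize()
    return first + "".join(word.lower().capitalize() for word in words[1:])


def is_all_upper(name):
    """True if the name has two or more letters and all are upper case."""
    letters = [ch for ch in name if ch.isalpha()]
    return len(letters) > 1 and all(ch.isupper() for ch in letters)
```

For instance, join_words(["profits", "this", "year"]) yields "profitsThisYear", and is_all_upper("PROFITS") flags a name that violates the second rule.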
7.1.3 Abbreviations
The primary purpose of an identifier is to describe the use of, or value
associated with, that identifier. The best way to create an identifier
for an object is to describe that object in English and then create a
variable name from that description. Variable names should be meaningful,
concise, and non-ambiguous to an average programmer fluent in the English
language. Avoid short names. Some research has shown that programs using
identifiers whose average length is 10-20 characters are generally easier
to debug than programs with substantially shorter or longer identifiers.
Avoid abbreviations as much as possible. What may seem like a perfectly
reasonable abbreviation to you may totally confound someone else.
Consider the following variable names that have actually appeared in
commercial software:
NoEmployees, NoAccounts, pend
The "NoEmployees" and "NoAccounts" variables seem to be boolean variables
indicating the presence or absence of employees and accounts. In fact,
this particular programmer was using the (perfectly reasonable in the real
world) abbreviation of "number" to indicate the number of employees and
the number of accounts. The "pend" name referred to a procedure's end
rather than any pending operation.
Programmers often use abbreviations in two situations: they're poor
typists and they want to reduce the typing effort, or a good descriptive
name for an object is simply too long. The former case is an unacceptable
reason for using abbreviations. The second case, especially if care is
taken, may warrant the occasional use of an abbreviation.
- Guideline:
- Avoid all identifier
abbreviations in your programs. When necessary, use standardized
abbreviations or ask someone to review your abbreviations. Whenever you
use abbreviations in your programs, create a "data dictionary" in the
comments near the names' definition that provides a full name and
description for your abbreviation.
The variable names you create should be pronounceable. "NumFiles" is a
much better identifier than "NmFls". The first can be spoken, the second
you must generally spell out. Avoid homonyms and long names that are
identical except for a few syllables. If you choose good names for your
identifiers, you should be able to read a program listing over the
telephone to a peer without overly confusing that person.
- Rule:
- All identifiers should be
pronounceable (in English) without having to spell out more than one
letter.
7.1.4 The Position of Components Within an Identifier
When scanning through a listing, most programmers only read the first few
characters of an identifier. It is important, therefore, to place the
most important information (that defines and makes this identifier unique)
in the first few characters of the identifier. So, you should avoid
creating several identifiers that all begin with the same phrase or
sequence of characters since this will force the programmer to mentally
process additional characters in the identifier while reading the
listing. Since this slows the reader down, it makes the program harder to
read.
- Guideline:
- Try to make most
identifiers unique in the first few character positions of the
identifier. This makes the program easier to read.
- Corollary:
- Never use a numeric
suffix to differentiate two names.
Many C/C++ programmers, especially Microsoft Windows programmers, have
adopted a formal naming convention known as "Hungarian Notation." To
quote Steve McConnell from Code Complete: "The term 'Hungarian' refers
both to the fact that names that follow the convention look like words in
a foreign language and to the fact that the creator of the convention,
Charles Simonyi, is originally from Hungary." One of the first rules
given concerning identifiers stated that all identifiers are to be English
names. Do we really want to create "artificially foreign" identifiers?
Hungarian notation actually violates another rule as well: names using the
Hungarian notation generally have very common prefixes, thus making them
harder to read.
Hungarian notation does have a few minor advantages, but the
disadvantages far outweigh the advantages. The following list from Code
Complete and other sources describes what's wrong with Hungarian
notation:
- Hungarian notation generally defines objects in terms of basic machine
types rather than in terms of abstract data types.
- Hungarian notation combines meaning with representation. One of the
primary purposes of high level language is to abstract representation
away. For example, if you declare a variable to be of type integer, you
shouldn't have to change the variable's name just because you changed its
type to real.
- Hungarian notation encourages lazy, uninformative variable names.
Indeed, it is common to find variable names in Windows programs that
contain only type prefix characters, without a descriptive name
attached.
- Hungarian notation prefixes the descriptive name with some type
information, thus making it harder for the programmer to find the
descriptive portion of the name.
- Guideline:
- Avoid using Hungarian
notation and any other formal naming convention that attaches low-level
type information to the identifier.
Although attaching machine type information to an identifier is generally
a bad idea, a well thought-out name can successfully associate some
high-level type information with the identifier, especially if the name
implies the type or the type information appears as a suffix. For
example, names like "PencilCount" and "BytesAvailable" suggest integer
values. Likewise, names like "IsReady" and "Busy" indicate boolean
values. "KeyCode" and "MiddleInitial" suggest character variables. A
name like "StopWatchTime" probably indicates a real value. Likewise,
"CustomerName" is probably a string variable. Unfortunately, it isn't
always possible to choose a great name that describes both the content and
type of an object; this is particularly true when the object is an
instance (or definition of) some abstract data type. In such instances,
some additional text can improve the identifier. Hungarian notation is a
raw attempt at this that, unfortunately, fails for a variety of reasons.
A better solution is to use a suffix phrase to denote the type or class
of an identifier. A common UNIX/C convention, for example, is to apply a
"_t" suffix to denote a type name (e.g., size_t, key_t, etc.). This
convention succeeds over Hungarian notation for several reasons including
(1) the "type phrase" is a suffix and doesn't interfere with reading the
name, (2) this particular convention specifies the class of the object
(const, var, type, function, etc.) rather than a low level type, and (3)
it certainly makes sense to change the identifier if its classification
changes.
- Guideline:
- If you want to
differentiate identifiers that are constants, type definitions, and
variable names, use the suffixes "_c", "_t", and "_v", respectively.
- Rule:
- The classification suffix
should not be the only component that differentiates two identifiers.
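A short sketch of the suffix convention follows. The identifiers are hypothetical; note that the three names differ in more than their suffixes, as the rule above requires:

```asm
MaxEmployees_c  =       100             ;Constant: table capacity.
EmployeeID_t    typedef word            ;Type: employee ID numbers.
CurrentID_v     EmployeeID_t ?          ;Variable: ID being processed.
```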
Can we apply this suffix idea to variables and avoid the pitfalls?
Sometimes. Consider a high level data type "button" corresponding to a
button on a Visual BASIC or Delphi form. A variable name like
"CancelButton" makes perfect sense. Likewise, labels appearing on a form
could use names like "ETWWLabel" and "EditPageLabel". Note that these
suffixes still suffer from the fact that a change in type will require
that you change the variable's name. However, changes in high level types
are far less common than changes in low-level types, so this shouldn't
present a big problem.
7.1.5 Names to Avoid
Avoid using symbols in an identifier that are easily mistaken for other
symbols. This includes the sets {"1" (one), "I" (upper case "I"), and
"l" (lower case "L")}, {"0" (zero) and "O" (upper case "O")}, {"2" (two)
and "Z" (upper case "Z")}, {"5" (five) and "S" (upper case "S")}, and {"6"
(six) and "G" (upper case "G")}.
- Guideline:
- Avoid using symbols in
identifiers that are easily mistaken for other symbols (see the list
above).
Avoid misleading abbreviations and names. For example, FALSE shouldn't
be an identifier that stands for "Failed As a Legitimate Software
Engineer." Likewise, you shouldn't compute the amount of free memory
available to a program and stuff it into the variable "Profits".
- Rule:
- Avoid misleading
abbreviations and names.
You should avoid names with similar meanings. For example, if you have
two variables "InputLine" and "InputLn" that you use for two separate
purposes, you will undoubtedly confuse the two when writing or reading the
code. If you can swap the names of the two objects and the program still
makes sense, you should rename those identifiers. Note that the names do
not have to be similar, only their meanings. "InputLine" and "LineBuffer"
are obviously different but you can still easily confuse them in a
program.
- Rule:
- Do not use names with
similar meanings for different objects in your programs.
In a similar vein, you should avoid using two or more variables that have
different meanings but similar names. For example, if you are writing a
teacher's grading program you probably wouldn't want to use the name
"NumStudents" to indicate the number of students in the class along with
the variable "StudentNum" to hold an individual student's ID number.
"NumStudents" and "StudentNum" are too similar.
- Rule:
- Do not use similar names
that have different meanings.
Avoid names that sound similar when read aloud, especially out of
context. This would include names like "hard" and "heart", "Knew" and
"new", etc. Recall the discussion in the section above on
abbreviations: you should be able to discuss your program listing over the
telephone with a peer. Names that sound alike make such discussions
difficult.
- Guideline:
- Avoid homonyms in
identifiers.
Avoid misspelled words in names and avoid names that are commonly
misspelled. Most programmers are notoriously bad spellers (look at some
of the comments in our own code!). Spelling words correctly is hard
enough; remembering how to spell an identifier incorrectly is even more
difficult. Likewise, if a word is often spelled incorrectly, requiring a
programmer to spell it correctly on each use is probably asking too much.
- Guideline:
- Avoid misspelled words
and names that are often misspelled in identifiers.
If you redefine the name of some library routine in your code, another
programmer will surely confuse your name with the library's version. This is
especially true when dealing with standard library routines and APIs.
- Enforced Rule:
- Do not reuse
existing standard library routine names in your program unless you are
specifically replacing that routine with one that has similar semantics
(i.e., don't reuse the name for a different purpose).
7.2 Instructions, Directives, and Pseudo-Opcodes
Your choice of assembly language sequences, the instructions themselves,
and your choice of directives and pseudo-opcodes can have a big impact on
the readability of your programs. The following subsections discuss these
problems.
7.2.1 Choosing the Best Instruction Sequence
Like any language, you can solve a given problem using a wide variety of
solutions involving different instruction sequences. As a continuing
example, consider (again) the following code sequence:
                mov     ax, SignedValue ;Get value to check.
                cwd                     ;DX = FFFF if neg, 0000 otherwise.
                add     ax, 0ffffh      ;Carry=0 if ax is zero.
                rcl     dx, 1           ;DX = FFFF if AX is neg, 0 if AX=0,
                mov     Signum, dx      ; 1 if AX>0.
Now consider the following code sequence that also computes the signum
function:
                mov     ax, SignedValue ;Get value to check.
                cmp     ax, 0           ;Check the sign.
                je      GotSignum       ;We're done if it's zero.
                mov     ax, 1           ;Assume it was positive.
                jns     GotSignum       ;Branch if it was positive.
                neg     ax              ;Else return -1 for negative values.
GotSignum:      mov     Signum, ax
Yes, the second version is longer and slower. However, an average person
can read the instruction sequence and figure out what it's doing; hence
the second version is much easier to read than the first. Which sequence
is best? Unless speed or space is an extremely critical factor and you
can show that this routine is in the critical execution path, then the
second version is obviously better. There is a time and a place for
tricky assembly code; however, it's rare that you would need to pull
tricks like this throughout your code.
So how does one choose appropriate instruction sequences when there are
many possible ways to accomplish the same task? The best way is to ensure
that you have a choice. Although there are many different ways to
accomplish an operation, few people bother to consider any instruction
sequence other than the first one that comes to mind. Unfortunately, the
"best" instruction sequence is rarely the first instruction
sequence that comes to most people's minds[17]. In order to make a choice, you have to have
a choice to make. That means you should create at least two different
code sequences for a given operation if there is ever a question
concerning the readability of your code. Once you have at least two
versions, you can choose between them based on your needs at hand. While
it is impractical to "write your program twice" so that you'll
have a choice for every sequence of instructions in the program, you
should apply this technique to particularly bothersome code sequences.
- Guideline:
- For particularly difficult to understand
sections of code, try solving the problem several different ways. Then
choose the most easily understood solution for actual incorporation into
your program.
One problem with the above suggestion is that you're often too close to
your own work to make decisions like "this code isn't too hard to
understand, I don't have to worry about it." It is often a good idea
to have someone else review your code and point out those sections they
find hard to understand[18].
- Guideline:
- Take advantage of reviews to determine those
sections of code in your program that may need to be rewritten to make
them easier to understand.
7.2.2 Control Structures
Ralph Griswold[19] once said (roughly) the
following about C, Pascal, and Icon: "C makes it easy to write hard
to read programs[20], Pascal makes it hard to
write hard to read programs, and Icon makes it easy to write easy to read
programs." Assembly language can be summed up like this:
"Assembly language makes it hard to write easy to read programs and
easy to write hard to read programs." It takes considerable
discipline to write readable assembly language programs; but it can be
done. Sadly, most assembly code you find today is extremely poorly
written. Indeed, that state of affairs is the whole reason for this
document. Once you get past issues like comments and naming conventions,
issues like program control flow and data structure design have among the
largest impacts on program readability. Since most assembly languages
lack structured control flow constructs, this is one area where
undisciplined programmers can really show how poorly they can write their
code. One need look no farther than the public domain code on the
Internet, or at Microsoft's sample code for that matter[21], to see abundant examples of poorly written
assembly language code.
Fortunately, with a little discipline it is possible to write readable
assembly language programs. How you design your control structures can
have a big impact on the readability of your programs. The best way to do
this can be summed up in two words: avoid spaghetti.
Spaghetti code is the name given to a program that has a large number of
intertwined branches and branch targets within a code sequence. Consider
the following example:
                jmp     L0
L1:             mov     ax, 0
                jmp     L2
L3:             mov     ax, 1
                jmp     L2
L4:             mov     ax, -1
                jmp     L2
L0:             mov     ax, x
                cmp     ax, 0
                je      L1
                jns     L3
                jmp     L4
L2:             mov     y, ax
This code sequence, by the way, is our good friend the Signum function.
It takes a few moments to figure this out because as you manually trace
through the code you find yourself spending more time following jumps
around than you do looking at code that computes useful results. Now this
is a rather extreme example, but it is also fairly short. A longer code
sequence could become just as obfuscated with even fewer branches all over
the place.
Spaghetti code is given this name because it resembles a bowl of
spaghetti. That is, if we consider a control path in the program a
spaghetti noodle, spaghetti code contains lots of intertwined branches
into and out of different sections of the program. Needless to say, most
spaghetti programs are difficult to understand, generally contain lots of
bugs, and are often inefficient (don't forget that branches are among the
slowest executing instructions on most modern processors).
So how do we resolve this? Easy: by physically adopting structured
programming techniques in assembly language code. Of course, 80x86
assembly language doesn't provide if..then..else..endif, while..endwhile,
repeat..until, and other such statements[22],
but we can certainly simulate them. Consider the following high level
language code sequence:
        if(expression) then
                << statements to execute if expression is true >>
        else
                << statements to execute if expression is false >>
        endif
Almost any high level language programmer can figure out what this type of
statement will do. Assembly language programmers should leverage this
knowledge by attempting to organize their code so it takes this same
form. Specifically, the assembly language version should look something
like the following:
        << Assembly code to compute value of expression >>
                JNxx    ElsePart        ;JNxx tests the opposite of condition xx.
        << Assembly code corresponding to the then portion >>
                jmp     AroundElsePart
ElsePart:
        << Assembly code corresponding to the else portion >>
AroundElsePart:
For a concrete example, consider the following:
        if ( x=y ) then
                write( 'x = y' );
        else
                write( 'x <> y' );
        endif;

; Corresponding Assembly Code:

                mov     ax, x
                cmp     ax, y
                jne     ElsePart
                print   "x=y",nl
                jmp     IfDone
ElsePart:       print   "x<>y",nl
IfDone:
While this may seem like the obvious way to organize an
if..then..else..endif statement, it is surprising how many people would
naturally assume they've got to place the else part somewhere else in the
program as follows:
                mov     ax, x
                cmp     ax, y
                jne     ElsePart
                print   "x=y",nl
IfDone:
                 .
                 .
                 .
ElsePart:       print   "x<>y",nl
                jmp     IfDone
This code organization makes the program more difficult to follow. Most
programmers have a HLL background and despite a current assignment, they
still work mostly in HLLs. Assembly language programs will be more
readable if they mimic the HLL control constructs[23].
For similar reasons, you should attempt to organize your assembly code
that simulates while loops, repeat..until loops, for loops, etc., so that
the code resembles the HLL code (for example, a while loop should
physically test the condition at the beginning of the loop with a jump at
the bottom of the loop).
- Rule:
- Attempt to design your programs using HLL control
structures. The organization of the assembly code that you write should
physically resemble the organization of some corresponding HLL program.
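As an illustration of this rule, here is one way a while loop might be laid out so that the test sits at the top and the jump back sits at the bottom. This is only a sketch: Index, Limit, and Sum are hypothetical signed word variables, and the HLL version appears in the comments:

```asm
; while (Index < Limit) do
;       Sum := Sum + Index;
;       Index := Index + 1;
; endwhile;

WhileLoop:      mov     ax, Index
                cmp     ax, Limit
                jnl     WhileDone       ;Exit when NOT (Index < Limit).
                add     Sum, ax         ;Loop body.
                inc     Index
                jmp     WhileLoop       ;Jump at the bottom of the loop.
WhileDone:
```

A reader can match each line of the pseudocode to a short run of instructions without tracing any branch other than the loop test and the jump back to it.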
Assembly language offers you the flexibility to design arbitrary control
structures. This flexibility is one of the reasons good assembly language
programmers can write better code than that produced by a compiler (that
can only work with high level control structures). However, keep in mind
that a fast program doesn't have to contain the tightest possible code in
every sequence. Execution speed is nearly irrelevant in most parts of the
program. Sacrificing readability for speed isn't a big win in most of the
program.
- Guideline:
- Avoid control structures that don't easily map
to well-known high level language control structures in your assembly
language programs. Deviant control structures should only appear in small
sections of code when efficiency demands their use.
7.2.3 Instruction Synonyms
MASM defines several synonyms for common instructions. This is
especially true for the conditional jump and "set on condition
code" instructions. For example, JA and JNBE are synonyms for one
another. Logically, one could use either instruction in the same
context. However, the choice of synonym can have an impact on the
readability of a code sequence. To see why, consider the following:
        if( x <= y ) then
                << true statements>>
        else
                << false statements>>
        endif

; Assembly code:

                mov     ax, x
                cmp     ax, y
                ja      ElsePart
                << true code >>
                jmp     IfDone
ElsePart:       << false code >>
IfDone:
When someone reads this program, the "JA" statement skips over
the true portion. Unfortunately, the "JA" instruction gives the
illusion we're checking to see if something is greater than something
else; in actuality, we're testing to see if some condition is less than
or equal, not greater than. As such, this code sequence hides some of the
original intent of the high level algorithm. One solution is to swap the
false and true portions of the code:
                mov     ax, x
                cmp     ax, y
                jbe     ThenPart
                << false code >>
                jmp     IfDone
ThenPart:       << true code >>
IfDone:
This code sequence uses the conditional jump that matches the high level
algorithm's test (less than or equal). However, this code is now
organized in a non-standard fashion (it's an if..else..then..endif
statement). This hurts the readability more than using the proper jump
helped it. Now consider the following solution:
                mov     ax, x
                cmp     ax, y
                jnbe    ElsePart
                << true code >>
                jmp     IfDone
ElsePart:       << false code >>
IfDone:
This code is organized in the traditional if..then..else..endif fashion.
Instead of using JA to skip over the then portion, it uses JNBE to do so.
This helps indicate, in a more readable fashion, that the code falls
through on below or equal and branches if it is not below or equal. Since
the instruction (JNBE) is easier to relate to the original test (<=)
than JA, this makes this section of code a little more readable.
- Rule:
- When skipping over some code because some condition
has failed (i.e., you fall into the code because the condition is
successful), always use a conditional jump of the form "JNxx"
to skip over the code section. For example, to fall through to a section
of code if one value is less than another, use the JNL or JNB instruction
to skip over the code. Of course, if you are testing a negative condition
(e.g., testing for inequality) then use an instruction of the form Jx to
skip over the code.
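Both cases of the rule can be sketched as follows. The labels and the variables x, y, Smallest, and Mismatches are hypothetical; x and y are assumed to be signed word variables:

```asm
; if (x < y) then Smallest := x endif             (positive condition)

                mov     ax, x
                cmp     ax, y
                jnl     NotLess         ;Skip when NOT (x < y).
                mov     Smallest, ax
NotLess:

; if (x <> y) then Mismatches := Mismatches + 1 endif   (negative condition)

                mov     ax, x
                cmp     ax, y
                je      AreEqual        ;Skip when x = y.
                inc     Mismatches
AreEqual:
```

In the first fragment the mnemonic JNL reads as "not less," which is exactly the failure of the HLL test. In the second the HLL test is itself a negation, so the plain JE form is the natural way to express its failure.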
8.0 Data Types
Prior to the arrival of MASM, most assemblers provided very little
capability for declaring and allocating complex data types. Generally, you
could allocate bytes, words, and other primitive machine structures. You
could also set aside a block of bytes. As high level languages improved
their ability to declare and use abstract data types, assembly language
fell farther and farther behind. Then MASM came along and changed all
that[24]. Unfortunately, many long time
assembly language programmers haven't bothered to learn the new MASM
syntax for things like arrays, structures, and other data types.
Likewise, many new assembly language programmers don't bother learning and
using these data typing facilities because they're already overwhelmed by
assembly language and want to minimize the number of things they've got to
learn. This is really a shame because MASM data typing is one of the
biggest improvements to assembly language since using mnemonics rather
than binary opcodes for machine level programming.
Note that MASM is a "high-level" assembler. It does things
assemblers for other chips won't do, like checking the types of operands
and reporting errors if there are mismatches. Some people who are used
to assemblers on other machines find this annoying. However, it's a great
idea in assembly language for the same reason it's a great idea in HLLs[25]. These features have one other beneficial
side-effect: they help others understand what you're trying to do in your
programs. It should come as no surprise, then, that this style guide will
encourage the use of these features in your assembly language programs.
8.1 Defining New Data Types with TYPEDEF
MASM provides a small number of primitive data types. For typical
applications, bytes, sbytes, words, swords, dwords, sdwords, and various
floating point formats are the most commonly used scalar data types
available. You may construct more abstract data types by using these
built-in types. For example, if you want a character, you'd normally
declare a byte variable. If you wanted a 16-bit integer, you'd typically
use the sword (or word) declaration. Of course, when you encounter a
variable declaration like "answer byte ?" it's a little
difficult to figure out what the real type is. Do we have a character, a
boolean, a small integer, or something else here? Ultimately it doesn't
matter to the machine; a byte is a byte is a byte. Its interpretation
as a character, boolean, or integer value is defined by the machine
instructions that operate on it, not by the way you define it.
Nevertheless, this distinction is important to someone who is reading your
program (perhaps they are verifying that you've supplied the correct
instruction sequence for a given data object). MASM's typedef directive
can help make this distinction clear.
In its simplest form, the typedef directive behaves like a textequ. It
lets you replace one string in your program with another. For example,
you can create the following definitions with MASM:
char            typedef byte
integer         typedef sword
boolean         typedef byte
float           typedef real4
IntPtr          typedef far ptr integer
Once you have declared these names, you can define char, integer,
boolean, and float variables as follows:
MyChar          char    ?
I               integer ?
Ptr2I           IntPtr  I
IsPresent       boolean ?
ProfitsThisYear float   ?
- Rule:
- Use the existing MASM data types as type building
blocks. For most data types you create in your program, you should
declare explicit type names using the typedef directive. There is really
no excuse for using the built-in primitive types[26].
8.2 Creating Array Types
MASM provides an interesting facility for reserving blocks of storage -
the DUP operator. This operator is unusual (among assembly languages)
because its definition is recursive. The basic definition is (using
HyGram notation):
DupOperator = expression ws* 'DUP' ws* '(' ws* operand ws* ')' %%
Note that "expression" expands to a valid numeric value (or
numeric expression), "ws*" means "zero or more whitespace
characters" and "operand" expands to anything that is legal
in the operand field of a MASM word/dw, byte/db, etc., directive[27]. One would typically use this operator to
reserve a block of memory locations as follows:
ArrayName integer 16 dup (?) ;Declare array of 16 words.
This declaration would set aside 16 contiguous words in memory.
The interesting thing about the DUP operator is that any legal operand
field for a directive like byte or word may appear inside the parentheses,
including additional DUP expressions. The DUP operator simply says
"duplicate this object the specified number of times." For
example, "16 dup (1,2)" says "give me 16 copies of the
value pair one and two." If this operand appeared in the operand field of
a byte directive, it would reserve 32 bytes, containing the alternating
values one and two.
So what happens if we apply this technique recursively? Well, "4
dup ( 3 dup (0))" when read recursively says "give me four
copies of whatever is inside the (outermost) parentheses." This turns out
to be the expression "3 dup (0)" that says "give me three
zeros." Since the original operand says to give four copies of three
copies of a zero, the end result is that this expression produces 12
zeros. Now consider the following two declarations:
Array1 integer 4 dup ( 3 dup (0))
Array2 integer 12 dup (0)
Both definitions set aside 12 integers in memory (initializing each to
zero). To the assembler these are nearly identical; to the 80x86 they are
absolutely identical. To the reader, however, they are obviously
different. Were you to declare two identical one-dimensional arrays of
integers, using two different declarations makes your program inconsistent
and, therefore, harder to read.
However, we can exploit this difference to declare multidimensional
arrays. The first example above suggests that we have four copies of an
array containing three integers each. This corresponds to the popular
row-major array access function. The second example above suggests that
we have a single dimensional array containing 12 integers.
- Guideline:
- Take advantage of the recursive nature of the
DUP operator to declare multidimensional arrays in your programs.
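The following sketch shows a two-dimensional declaration together with a row-major access to one element. The names Matrix, RowIndex, ColIndex, NumRows_c, and NumCols_c are hypothetical, the "integer" type is the typedef from section 8.1, and the code assumes both indices are non-negative and within bounds:

```asm
integer         typedef sword           ;As in section 8.1.

NumRows_c       =       4
NumCols_c       =       3

; Four rows of three integers each, stored in row-major order.

Matrix          integer NumRows_c dup (NumCols_c dup (0))
RowIndex        integer ?
ColIndex        integer ?
                 .
                 .
                 .
; AX := Matrix[RowIndex][ColIndex]
; Byte offset = (RowIndex*NumCols_c + ColIndex) * 2

                mov     ax, RowIndex
                mov     bx, NumCols_c
                mul     bx              ;DX:AX = RowIndex * NumCols_c.
                add     ax, ColIndex
                shl     ax, 1           ;Scale by the element size (2 bytes).
                mov     bx, ax
                mov     ax, Matrix[bx]
```

The nested DUP in the declaration mirrors the index computation: the outer count is the number of rows and the inner count is the row length used as the multiplier.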
8.3 Declaring Structures in Assembly Language
MASM provides an excellent facility for declaring and using structures,
unions, and records[28]; for some reason, many
assembly language programmers ignore them and manually compute offsets to
fields within structures in their code. Not only does this produce hard
to read code, the result is nearly unmaintainable as well.
- Rule:
- When a structure data type is appropriate in an
assembly language program, declare the corresponding structure in the
program and use it. Do not compute the offsets to fields in the structure
manually, use the standard structure "dot-notation" to access
fields of the structure.
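To make the contrast concrete, the sketch below declares a small structure and reads one field both ways. The structure, its fields, and the variable Corner are hypothetical; "integer" is the typedef from section 8.1:

```asm
integer         typedef sword           ;As in section 8.1.

Point           struct
XCoord          integer ?
YCoord          integer ?
Point           ends

Corner          Point   {}
                 .
                 .
                 .
; Hard to maintain: the "2" silently breaks if the fields move.

                mov     ax, word ptr Corner+2

; Preferred: the assembler supplies the offset of the YCoord field.

                mov     ax, Corner.YCoord
```

If someone later inserts a field ahead of YCoord, the dot-notation version still assembles to the correct address, while the manual offset quietly reads the wrong field.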
One problem with using structures occurs when you access structure fields
indirectly (i.e., through a pointer). Indirect access always occurs
through a register (for near pointers) or a segment/register pair (for far
pointers). Once you load a pointer value into a register or register
pair, the program doesn't readily indicate what pointer you are using.
This is especially true if you use the indirect access several times in a
section of code without reloading the register(s). One solution is to use
a textequ to create a special symbol that expands as appropriate.
Consider the following code:
s               struct
a               integer ?
b               integer ?
s               ends
                 .
                 .
                 .
r               s       {}
ptr2r           dword   r
                 .
                 .
                 .
                les     di, ptr2r
                mov     ax, es:[di].s.a         ;No indication this is ptr2r!
                 .
                 .
                 .
                mov     es:[di].s.b, bx         ;Really no indication!
Now consider the following:
s               struct
a               integer ?
b               integer ?
s               ends
sPtr            typedef far ptr s
                 .
                 .
                 .
q               s       {}
r               sPtr    q
r@              textequ <es:[di].s>
                 .
                 .
                 .
                les     di, r
                mov     ax, r@.a                ;Now it's clear this is using r
                 .
                 .
                 .
                mov     r@.b, bx                ;Ditto.
Note that the "@" symbol is a legal identifier character to
MASM, hence "r@" is just another symbol. As a general rule you
should avoid using symbols like "@" in identifiers, but it
serves a good purpose here - it indicates we've got an indirect pointer.
Of course, you must always make sure to load the pointer into ES:DI when
using the textequ above. If you use several different segment/register
pairs to access the data that "r" points at, this trick may not
make the code any more readable since you will need several text equates
that all mean the same thing.
8.4 Data Types and the UCR Standard Library
The UCR Standard Library for 80x86 Assembly Language Programmers (version
2.0 and later) provides a set of macros that let you declare arrays and
pointers using a C-like syntax. The following example demonstrates this
capability:
                var
                  integer i, j, array[10], array2[10][3], *ptr2Int
                  char    *FirstName, LastName[32]
                endvar
These declarations emit the following assembly code:
i               integer ?
j               integer ?
array           integer 10 dup (?)
array2          integer 10 dup ( 3 dup (?))
ptr2Int         dword   ?
FirstName       dword   ?
LastName        char    32 dup (?)
For those comfortable with C/C++ (and other HLLs) the UCR Standard
Library declarations should look very familiar. For that reason, their
use is a good idea when writing assembly code that uses the UCR Standard
Library.
[1] Someone who uses TASM all the time may think this
is fine, but consider those individuals who don't. They're not familiar
with TASM's funny syntax so they may find several statements in this
program to be confusing.
[2] Simplified segment directives do make it easier
to write assembly language programs that interface with HLLs. However,
they only complicate matters in stand-alone assembly language programs.
[3] A lot of old-time programmers believe that
assembly instructions should appear in upper case. A lot of this has to
do with the fact that old IBM mainframes and certain personal computers
like the original Apple II only supported upper case characters.
[4] Note, by the way, that I am not suggesting that
this error checking/handling code should be absent from the program. I am
only suggesting that it not interrupt the normal flow of the program while
reading the code.
[5] Doing so (inserting an 80x86 tutorial into your
comments) would wind up making the program less readable to those who
already know assembly language since, at the very least, they'd have to
skip over this material; at the worst they'd have to read it (wasting
their time).
[6] Or whatever other natural language is in use at
the site(s) where you develop, maintain, and use the software.
[7] You may substitute the local language in your
area if it is not English.
[8] In fact, just the opposite is true. One should
get concerned if both implementations are identical. This would suggest
poor planning on the part of the program's author(s) since the same
routine must now be maintained in two different programs.
[9] Or whatever make program you normally use.
[10] This happens because shorter functions
invariably have stronger coupling, leading to integration errors.
[11] Technically, this is incorrect. In some very
special cases MASM will generate better machine code if you define your
variables before you use them in a program.
[12] Older assemblers on other machines have
required the labels to begin in column one, the mnemonic to appear in a
specific column, the operand to appear in a specific column, etc. These
were examples of fixed-format source line translators.
[13] See the next section concerning comments for
more information.
[14] This document will simply use the term
comments when referring to standalone comments.
[15] Since the label, mnemonic, and operand fields
are all optional, it is legal to have a comment on a line by itself.
[16] It could be worse; you should see what the
"superoptimizer" outputs for the signum function. It's even
shorter and harder to understand than this code.
[17] This is true regardless of what metric you use
to determine the "best" code sequence.
[18] Of course, if the program is a class
assignment, you may want to check your instructor's cheating policy before
showing your work to your classmates!
[19] The designer of the SNOBOL4 and Icon
programming languages.
[20] Note that this does not imply that it is hard
to write easy-to-read C programs, only that if one is sloppy, one can
easily write something that is nearly impossible to understand.
[21] Okay, this is a cheap shot. In fact, most of
the assembly code on this planet is poorly written.
[22] Actually, MASM 6.x does, but we'll ignore that
fact here.
[23] Sometimes, for performance reasons, the code
sequence above is justified since straight-line code executes faster than
code with jumps. If the program rarely executes the ELSE portion of an if
statement, always having to jump over it could be a waste of time. But if
you're optimizing for speed, you will often need to sacrifice readability.
[24] Okay, MASM wasn't the first, but such
techniques were not popularized until MASM appeared.
[25] Of course, MASM gives you the ability to
override this behavior when necessary. Therefore, the complaints from
"old-hand" assembly language programmers that this is insane are
groundless.
[26] Okay, using some assembler that doesn't support
typedef would probably be a good excuse!
[27] For brevity, the productions for these objects
do not appear here.
[28] MASM records are equivalent to bit fields in
C/C++. They are not equivalent to records in Pascal.
Assembly Language Style Guidelines - 13 FEB 1997