
VAX/VMS Internals and Data Structures


LAWRENCE J. KENAH
SIMON F. BATE


Digital Press



Copyright 1984 by Digital Equipment Corporation.

All rights reserved. Reproduction of this book, in whole or in part, is prohibited. For information, write Digital Press, Educational Services, Digital Equipment Corporation, Bedford, Massachusetts.

The painting reproduced on the front cover is "From Red to Violet" (1970, oil on canvas) by Hannes Beckmann, courtesy of the DeCordova Museum Collection: Gift of Mr. Michael F. Lynch.

DEC, DECnet, UNIBUS, VAX, and VMS are trademarks of Digital Equipment Corporation.

Designed by David Ford.

Automatically typeset utilizing a VAX-11/780 by York Graphic Services, Incorporated.

Printed in U.S.A. by Halliday Lithograph.

Order number EY-00014-DP.

Library of Congress Cataloging in Publication Data

Kenah, Lawrence J., 1946-
    VAX/VMS internals and data structures.

    Includes index.
    1. VAX/VMS (Computer operating system)  2. VAX-11 (Computer)--Programming.  3. Data structures (Computer science)  I. Bate, Simon.  II. Title.  III. Title: V.A.X./V.M.S. internals and data structures.
QA76.6.K454 1984    001.64'2    83-26187

ISBN 0-932376-52-5



Preface 



This book explains how the VAX/VMS executive works. It describes the data structures maintained and manipulated by the VMS operating system, discusses the mechanisms that transfer control between user processes and the VMS operating system (and among the components of the operating system itself), and describes some of the features of the VAX hardware as they are used by the VMS operating system. It also describes the VMS executive, including all the major components of the executive, as well as system initialization and the operation of all system services. It does not include a general discussion of the I/O subsystem, because that subject is already described in the VAX/VMS Guide to Writing a Device Driver (Digital Equipment Corporation, 1982). However, the details of some VAX/VMS device drivers, as well as the operations of I/O-related system services, are included in this book.

This book is intended for system programmers and other users of the VAX/VMS operating system who wish to understand the internal workings of the executive. The detailed description of data structures should help system managers make better informed decisions when they configure systems for space- or time-critical applications. It will also help application designers to appreciate the effects (in speed or in memory consumption) of different design and implementation decisions. This book assumes that the reader is familiar with the VAX architecture and the VMS operating system, particularly with its use of system services and its techniques of memory management.

In explaining the operation of a subsystem of the executive, this book emphasizes the data structures manipulated by that component, rather than detailed flow diagrams of major routines.

This book differs from the reference manuals that make up the VAX/VMS documentation set in that it describes internal operations and data structures. While it is unlikely that any component described in this book will be drastically changed with any major release of the VAX/VMS operating system, there is no guarantee that a particular data structure or subroutine described here will remain the same from release to release. With each new version of the operating system, privileged application programs that rely on details contained in this book must be tested before they are used for production work with a standard load of users.

This book is divided into nine parts, each of which describes a different aspect of the operating system.

• Part 1 presents an overview of the VAX/VMS operating system and reviews those concepts that are crucial to understanding the workings of that system.

• Part 2 describes the mechanisms used to pass control between user programs and the operating system and within the VMS system itself.

• Part 3 describes scheduling and timer support, concluding with a discussion of the internals of the VAX/VMS lock manager.

• Part 4 discusses memory management.

• Part 5 describes the I/O subsystem.

• Part 6 describes the creation and deletion of a process and the activation and termination of an image in the context of a process.

• Part 7 deals with system initialization and also includes a discussion on the VAX-11/782.

• Part 8 discusses miscellaneous topics that are not conveniently classified in any conventional catalog of operating systems:

— The implementation of logical names
— The functions of miscellaneous system services
— The use of listing and map files
— The conventions used in naming symbols

• Part 9 provides information on VMS data structures.

Most of the operations of the VMS executive can be easily understood once the contents of the various data structures are known. Although selected structures are described throughout the book, Appendix B describes (or provides pointers to) all the structures used by the operating system. The structures related to device drivers and the file system are not described. The data structures related to device drivers are described in the VAX/VMS Guide to Writing a Device Driver. Data structures specific to the file system have yet to be documented.

Several documents in the VAX/VMS document set supply important background information for the topics discussed in this book. The following provide an especially valuable foundation: VAX/VMS System Services Reference Manual, the VAX-11 software installation guides, and the chapter in the VAX-11 Run-Time Library Reference Manual that describes condition handling.

The concepts underlying the operating system are discussed in the VAX/VMS Summary Description and Glossary, and the VAX Software Handbook. The following documents are also helpful references: the VAX/VMS Guide to Writing a Device Driver, the VAX-11 Architecture Reference Manual, and the VAX Hardware Handbook.

An excellent description of the VAX architecture, as well as a discussion of some of the design decisions made for its first implementation, the VAX-11/780, can be found in Computer Programming and Architecture: The VAX-11 by Henry M. Levy and Richard H. Eckhouse, Jr. (Digital Press, 1980). This book also contains a bibliography of some of the literature dealing with operating system design.

The reader should be aware of several conventions used throughout this book. In all diagrams of memory, the lowest virtual address appears at the top of the page and addresses increase toward the bottom of the page. This convention means that the direction of stack growth is toward the top of the page. In diagrams that display more detail, such as bytes within longwords, addresses also increase from right to left. That is, the lowest addressed byte (or bit) in a longword is on the righthand side of a figure and the most significant byte (or bit) is on the lefthand side.

The words "system" or "VMS system" are used to describe the entire software package that is a part of a VAX-11 system, including privileged processes, utilities, and other support software as well as the executive itself.

The word "executive" refers to those parts of the VMS operating system that reside in system virtual address space. The executive includes the contents of the file SYS.EXE, device drivers, and other code and data structures loaded at initialization time, including RMS and the system message file.

When either "process control block" or "PCB" is used without a modifier, it refers to the software structure used by the scheduler. The data structure that contains copies of the general registers (that the hardware locates through the PR$_PCBB register) is always called the "hardware PCB."

When referring to access modes, the term "inner access modes" means those access modes with more privilege. The term "outer access modes" means those access modes with less privilege. Thus, the innermost access mode is kernel and the outermost access mode is user.

The term "SYSBOOT parameter" is used to describe any of the adjustable parameters that are used by the secondary bootstrap program SYSBOOT to configure the system. The adjustable parameters include both the dynamic parameters that can be changed on the running system and the static parameters that require a reboot in order for their values to change. These parameters are referred to by their parameter names rather than by the global locations where their values are stored. Appendix A relates the SYSBOOT parameter names to their corresponding global locations.

The terms "byte index," "word index," "longword index," and so on, refer to a method of access that uses the VAX-11 context indexing addressing capability. That is, the index value will be multiplied by one, two, four, or eight (depending on whether a byte, word, longword, or quadword is being referenced) as part of operand evaluation in order to calculate the effective address of the operand.

In general, the component called INIT refers to a module of that name in the executive and not the volume initialization utility. When that utility program is being referenced, it will be clearly specified.






Three conventions are observed for lists.

• In lists such as this one, where there is no order or hierarchy, list elements are indicated by leading bullets ( • ). Sublists without hierarchy are indicated by dashes ( — ).

• Lists that indicate an ordered set of operations are numbered. Sublists that indicate an ordered set of operations are lettered.

• Numbered lists with the numbers enclosed in circles indicate a correspondence between individual list elements and numbered items in a figure.



ACKNOWLEDGMENTS

Our first thanks must go to Joe Carchidi, for suggesting that this book be written, and to Dick Hustvedt, for his help and enlightening conversations.

We would like to thank John Lucas for putting together the initial versions of Chapters 7, 10, 11, and 30 and Vik Muiznieks for writing the initial versions of Chapters 5, 18, and 19.

Appreciation goes to all those who reviewed the drafts for both editions of the book (VAX/VMS Version 2.2 and 3.3). We would particularly like to thank Kathy Morse for reviewing the first edition in its entirety and Wayne Cardoza for reviewing the entire second edition. Our special thanks go to Ruth Goldenberg for reviewing both editions in their entirety, and for her many corrections, comments, and suggestions.

We owe a lot of thanks to our editing staff, especially to Jonathan Ostrowsky for his labors in preparing the first edition, and Betty Steinfeld for her help and suggestions. Many thanks go to Jonathan Parsons for reviewing and editing the second edition, and for all his help, patience, and suggestions.

We would like to thank the Graphic Services department at Spitbrook, particularly Pat Walker for her help in paging and production of the first edition, and Paul King for his help in transforming innumerable slides and rough sketches into figures. Thanks go to Kathy Greenleaf and Jackie Markow for converting the files to our generic markup language.

Thanks go to Larry Bohn, Sue Gault, Bill Heffner, Kathleen Jensen, and Judy Jurgens for their support and interest in this project.

Finally, we would like to thank all those who originally designed and implemented the VAX/VMS operating system, and all those who have contributed to later releases.

Lawrence J. Kenah
Simon F. Bate

August 1983






Contents

PART I/Introduction

1 System Overview 3
1.1 Process, Job, and Image 3
1.1.1 Process 3
1.1.2 Image 5
1.1.3 Job 6
1.2 Functionality Provided by VAX/VMS 6
1.2.1 Operating System Kernel 6
1.2.3 User Interface 9
1.2.4 Interface among Kernel Subsystems 11
1.3 Hardware Implementation of the Operating System Kernel 13
1.3.1 VAX Architecture Features Exploited by VMS 13
1.3.2 VAX-11 Instruction Set 14
1.3.3 Implementation of VMS Kernel Routines 15
1.3.4 Memory Management and Access Modes 19
1.3.5 Exceptions, Interrupts, and REI 20
1.3.6 Process Structure 21
1.4 Other System Concepts 22
1.4.1 Resource Control 22
1.4.2 Other System Primitives 23
1.5 Layout of Virtual Address Space 24
1.5.1 System Virtual Address Space 24
1.5.2 The Control Region (P1 Space) 26
1.5.3 The Program Region (P0 Space) 26

2 Synchronization Techniques 30
2.1 Elevated IPL 30
2.1.1 Use of IPL$_SYNCH 31
2.1.2 Other IPL Levels Used for Synchronization 32
2.1.3 IPL$_QUEUEAST 33
2.1.4 IPL 2 34
2.2 Serialized Access 35
2.2.1 Fork Processing 35
2.2.2 I/O Postprocessing 36
2.3 Mutual Exclusion Semaphores (Mutexes) 36
2.3.1 Locking a Mutex for Read Access 37
2.3.2 Locking a Mutex for Write Access 38
2.3.3 Mutex Wait State 39
2.3.4 Unlocking a Mutex 39
2.3.5 Resource Wait State 40
2.4 VAX/VMS Lock Management System Services 40

3 Dynamic Memory Allocation 42
3.1 Allocation Strategy and Implementation 42
3.1.1 Allocation of Dynamic Memory 43
3.1.2 Example of Allocation of Dynamic Memory 44
3.1.3 Deallocation of Dynamic Memory 45
3.1.4 Example of Deallocation of Dynamic Memory 45
3.1.5 Synchronization 47
3.1.6 Granularity of Allocation 49
3.2 Preallocated Request Packets 50
3.2.1 Allocation from One of the Lookaside Lists 50
3.2.2 Deallocation to the Lookaside List 51
3.3 Use of Dynamic Memory 53
3.3.1 Process Allocation Region 53
3.3.2 Paged Dynamic Memory 53
3.3.3 Nonpaged Dynamic Memory 56

PART II/Control Mechanisms

4 Condition Handling 61
4.1 Overview of the Condition Handling Facility 61
4.1.1 Goals of the VAX-11 Condition Handling Facility 61
4.1.2 Features of the VAX-11 Condition Handling Facility 62
4.2 Generation of Exceptions 63
4.2.1 Exceptions That Originate in the Hardware 63
4.2.2 Exceptions Detected by Software 74
4.3 Uniform Exception Dispatching 75
4.3.1 Establishing a Condition Handler 77
4.3.2 The Search for a Condition Handler 78
4.3.3 Multiply Active Signals 81
4.4 Condition Handler Action 83
4.4.1 Continue or Resignal 84
4.4.2 Unwinding the Call Stack 84
4.4.3 Example of Unwinding the Call Stack 85
4.4.4 Potential Infinite Loop 88
4.4.5 Unwinding Multiply Active Signals 88
4.4.6 Correct Use of Default Depth in SYS$UNWIND 89
4.4.7 Unwinding ASTs 92
4.5 Default (VMS-Supplied) Condition Handlers 95
4.5.1 Traceback Handler Established by Image Startup 95
4.5.2 Catch-All Condition Handler 95
4.5.3 Handlers Used by Other Access Modes 96

5 Hardware Interrupts 98
5.1 Hardware Interrupt Dispatching 98
5.1.1 Interrupt Dispatching 99
5.1.2 System Control Block 100
5.2 VAX/VMS Interrupt Service Routines 104
5.2.1 Restrictions Imposed on Interrupt Service Routines 104
5.2.2 Servicing UNIBUS Interrupts 105
5.2.3 MASSBUS Interrupt Service Routines 109
5.2.4 DR32 Interrupt Service Routine 112
5.2.5 MA780 Interrupt Dispatching 112
5.2.6 MA780 Interrupts on the VAX-11/782 114
5.3 Connect-to-Interrupt Mechanism 115

6 Software Interrupts 117
6.1 The Software Interrupt 117
6.1.1 Hardware Mechanism of Software Interrupts 117
6.1.2 Software Interrupt Service Routines 119
6.2 Software Interrupt Levels in VAX/VMS 119
6.2.1 Mount Verification Cancellation 120
6.2.2 Fork Processing 121
6.2.3 Software Timer 123
6.2.4 I/O Postprocessing 123
6.2.5 Rescheduling Interrupt 124
6.2.6 AST Delivery Interrupt 125

7 AST Delivery 126
7.1 Hardware Assistance to AST Delivery 126
7.1.1 REI Instruction 126
7.1.2 ASTLVL Processor Register (PR$_ASTLVL) 127
7.2 Queuing an AST to a Process 127
7.2.1 AST Control Block 127
7.2.2 Access Mode and AST Queuing 130
7.2.3 Special Kernel Mode ASTs 130
7.2.4 Piggyback Special Kernel Mode ASTs 130
7.2.5 Computation of a New Value for ASTLVL 132
7.3 Delivering an AST to a Process 133
7.3.1 AST Delivery Interrupt 133
7.3.2 Argument List 135
7.3.3 AST Exit Path 136
7.4 Special Kernel Mode ASTs 137
7.4.1 I/O Postprocessing in Process Context 137
7.4.2 Process Suspension 138
7.4.3 Process Deletion 138
7.4.4 $GETJPI System Service 139
7.4.5 Power Recovery ASTs 140
7.4.6 Other System Use of ASTs 140
7.5 Attention and Out-of-Band ASTs 140
7.5.1 Set Attention Mechanism 140
7.5.2 Delivery of Attention ASTs 141
7.5.3 Flushing an Attention AST List 142
7.5.4 Examples in VAX/VMS 142
7.5.5 Out-of-Band ASTs 143

8 Error Handling 147
8.1 Error Logging 147
8.1.1 Overview of the Error Logging Subsystem 147
8.1.2 Device Driver Errors 147
8.1.3 Other Error Log Messages 148
8.1.4 Operation of the Error Logger Routines 148
8.1.5 Cursory Overview of the ERRFMT Process 149
8.1.6 Error Log Mailbox 150
8.2 System Crashes (BUGCHECKS) 150
8.2.1 Bugcheck Mechanism 150
8.2.2 Operation of Bugcheck Routine 151
8.2.3 System Dump File 154
8.3 Machine Check Mechanism 156
8.3.1 VAX-11/730 Machine Check 157
8.3.2 VAX-11/750 Machine Check 157
8.3.3 VAX-11/780 Machine Check 159
8.3.4 Machine Check Recovery Blocks 160

9 System Service Dispatching 162
9.1 System Service Vectors 162
9.2 Change Mode Instructions 165
9.2.1 The CHMK and CHME Instructions 165
9.2.2 The CHMS and CHMU Instructions 165
9.3 Change Mode Dispatching in VMS 166
9.3.1 Operation of the Change Mode Dispatcher 167
9.3.2 Change-Mode-to-Kernel Dispatcher 171
9.3.3 Change-Mode-to-Executive Dispatcher 171
9.3.4 RMS Dispatching 171
9.3.5 Return Path for System Services 172
9.3.6 Return Path for RMS Services 173
9.4 User-Written System Service Dispatching 174
9.4.1 Per-Process User-Written Dispatcher 174
9.4.2 Privileged Shareable Images 175
9.4.3 System-Wide User-Written Dispatcher 178
9.5 Related System Services 178
9.5.1 Set System Service Failure Exceptions System Service 179
9.5.2 Change Mode System Services 179
9.5.3 System Service Filtering 179

PART III/Scheduling and Timer Support

10 Scheduling 183
10.1 Process States 183
10.1.1 Process Control Block 183
10.1.2 Software Priority 184
10.1.3 State Queues 191
10.2 System Events 197
10.2.1 Process State Changes 198
10.2.2 Wait States and AST Delivery 198
10.2.3 Event Reporting 200
10.2.4 System Events and Associated Priority Boosts 201
10.3 Rescheduling Interrupt 202
10.3.1 Hardware Context 203
10.3.2 Removal of Current Process from Execution 204
10.3.3 Selection of Next Process for Execution 205
10.3.4 Summary Longword and Computable State Queues 206
10.3.5 Hardware Assistance in Context Switching 207

11 Timer Support 212
11.1 Timekeeping in VAX/VMS 212
11.1.1 Hardware Clocks 212
11.1.2 Software Time 215
11.1.3 Set Time System Service 215
11.2 Hardware Clock Interrupt Service Routine 217
11.2.1 System Time Updating 217
11.2.2 Timer Queue Testing 217
11.3 Software Timer Interrupt Service Routine 218
11.3.1 Quantum Expiration 218
11.3.2 Timer Queue and Timer Queue Elements 218
11.3.3 Timer Request Servicing 220
11.3.4 Scheduled Wakeup 220
11.3.5 Periodic System Procedures 221
11.4 Timer System Services 222
11.4.1 $SETIMR Requests 222
11.4.2 Scheduled Wakeup Operations 223

12 Process Control and Communication 225
12.1 Event Flag Services 225
12.1.1 Local Event Flags 225
12.1.2 Common Event Flags 226
12.1.3 Event Flag Wait States 228
12.1.4 Setting and Clearing Event Flags 229
12.2 Affecting the Computability of Another Process 231
12.2.1 Common Event Flags 231
12.2.2 Process Control Services 231
12.2.3 Miscellaneous Process Attribute Changes 234
12.3 Interprocess Communication 235
12.3.1 Event Flags 238
12.3.2 VAX/VMS Lock Management System Services 238
12.3.3 Mailboxes 238
12.3.4 Logical Names 239
12.3.5 Global Sections 239
12.3.6 Interprocessor Communication with the MA780 239

13 VAX/VMS Lock Manager 244
13.1 Lock Manager Data Structures 244
13.1.1 Lock Blocks 245
13.1.2 Resource Blocks 246
13.1.3 Accessing the Lock and Resource Blocks 247
13.1.4 Relationships in the Lock Database 250
13.2 Queuing and Dequeuing Locks 250
13.2.1 The $ENQ System Service 250
13.2.2 Lock Conversions 254
13.2.3 The $DEQ System Service 255
13.3 Handling Deadlocks 255
13.3.1 Initiating a Deadlock Search 256
13.3.2 Deadlock Detection 256
13.3.3 Victim Selection 262

PART IV/Memory Management

14 Memory Management Data Structures 267
14.1 Process Data Structures (Process Header) 267
14.1.1 Process Page Tables 269
14.1.2 Working Set List 273
14.1.3 Process Section Table 276
14.1.4 Process Header Page Arrays 279
14.2 PFN Database 279
14.2.1 PTE Array 279
14.2.2 BAK Array 280
14.2.3 STATE Array 282
14.2.4 TYPE Array 283
14.2.5 Forward and Backward Links 284
14.2.6 REFCNT Array 284
14.2.7 SHRCNT Array 285
14.2.8 WSLX Array 286
14.2.9 SWPVBN Array 286
14.3 Data Structures for Global Pages 286
14.3.1 Global Section Descriptor 286
14.3.2 The System Header and Global Section Table Entries 287
14.3.3 Global Page Table Entries 288
14.3.4 Global Page Table and System Page Table 289
14.3.5 Process PTEs for Global Pages 292
14.4 Swapping Data Structures 292
14.4.1 Balance Slots 292
14.4.2 Balance Slot Arrays 293
14.4.3 Comment on Equal Size Balance Slots 294
14.5 Data Structures That Describe the Page and Swap Files 295
14.5.1 Structure of Page and Swap Files 295
14.5.2 The SHELL Process 297
14.5.3 Structure of Swap Files 297
14.5.4 Alternate Page and Swap Files 299
14.6 Swapper and Modified Page Writer Page Table Arrays 299
14.6.1 Direct I/O and Scatter/Gather 299
14.6.2 Swapper I/O 300
14.6.3 Modified Page Writer PTE Array 300
14.6.4 Nonreentrancy of Swapper and Modified Page Writer 301
14.7 Data Structures Used with Shared Memory 302
14.7.1 Shared Memory Control Structures 302
14.7.2 Global Sections in Shared Memory 304
14.7.3 Mailboxes in Shared Memory 307
14.7.4 Common Event Flag Clusters in Shared Memory 307

15 Paging Dynamics 308
15.1 Overview of Pager Operation 308
15.1.1 Hardware Action 308
15.1.2 Initial Pager Action 309
15.2 Page Faults for Process Private Pages 310
15.2.1 Page Located in an Image File 311
15.2.2 Demand Zero Pages 317
15.2.3 Global Copy-on-Reference and Page-File Pages 317
15.2.4 Page Located in the Page File 319
15.3 Page Faults for Global Pages 319
15.3.1 Page Fault for Global Read-Only Page 319
15.3.2 Global Read/Write Pages 322
15.3.3 Global Copy-on-Reference Pages 323
15.3.4 Global Page-File Backing Store Pages 324
15.4 Working Set Replacement 326
15.4.1 Scan of Working Set List 326
15.4.2 Reusing Working Set List Entries 326
15.4.3 Using an Available Entry in the Working Set List 327
15.4.4 Skipping Working Set List Entries 328
15.5 Input and Output That Support Paging 328
15.5.1 Page Reads and Clustering 329
15.5.2 Modified Page Writing 333
15.5.3 Update Section System Service 338
15.6 Paging and Scheduling 339
15.6.1 Page Fault Wait State 339
15.6.2 Free Page Wait State 339
15.6.3 Collided Page Wait State 340

16 Memory Management System Services 341
16.1 Dispatch Method for Memory Management System Services 341
16.2 Virtual Address Creation and Deletion 342
16.2.1 Address Space Creation 342
16.2.2 Address Space Deletion 344
16.2.3 Controlled Allocation of Virtual Memory 346
16.3 Private and Global Sections 346
16.3.1 Create and Map Section System Service 346
16.3.2 Map Global Section System Service 349
16.3.3 Delete Global Section System Service 349
16.3.4 Update Section System Service 350
16.4 Related System Services 351
16.4.1 Working Set Size Adjustment 351
16.4.2 Locking and Unlocking Pages 357
16.4.3 Process Swap Mode 359
16.4.4 Altering Page Protection 359

17 Swapping 360
17.1 Swapping Overview 360
17.1.1 Swapper Responsibilities 360
17.1.2 Swapper Implementation 361
17.1.3 Comparison of Paging and Swapping 362
17.2 Swap Scheduling 362
17.2.1 Selection of Inswap Candidate 362
17.2.2 Selection of Shrink or Outswap Candidates 366
17.2.3 System Events That Trigger Swapper Activity 369
17.3 Swapper's Use of Memory Management Data Structures 370
17.3.1 Process Header 370
17.3.2 Swapper I/O Data Structures 372
17.4 Outswap Operation 373
17.4.1 Selection of Outswap Candidate 374
17.4.2 Outswap of the Process Body 374
17.4.3 Outswap of Process Header 379
17.5 Inswap Operation 381
17.5.1 Selection of an Inswap Candidate 382
17.5.2 Inswap of the Process Header 382
17.5.3 Rebuilding the Process Body 383

PART V/Input/Output

18 I/O System Services 393
18.1 Assigning and Deassigning Channels 393
18.1.1 Channel Assignment 393
18.1.2 Channel Deassignment 395
18.2 Device Allocation and Deallocation 396
18.2.1 Device Allocation 396
18.2.2 Device Deallocation 397
18.3 $QIO System Service 398
18.3.1 Device-Independent Preprocessing 398
18.3.2 FDT Routines 399
18.3.3 I/O Postprocessing 400
18.4 I/O Cancellation 402
18.5 Mailbox Creation and Deletion 402
18.5.1 Mailbox Creation 403
18.5.2 Mailbox Creation in Shared Memory 405
18.5.3 Mailbox Deletion 407
18.6 Broadcast System Service 408
18.7 Informational Services 411
18.7.1 Device-Independent Information 411
18.7.2 Device-Dependent Information 412

19 VAX/VMS Device Drivers 414
19.1 Disk Drivers 414
19.1.1 ECC Error Recovery 414
19.1.2 Offset Recovery 416
19.1.3 Dynamic Bad Block Handling 416
19.1.4 Multiple-Block Noncontiguous Virtual I/O 417
19.2 Magnetic Tape Drivers 419
19.3 Class and Port Drivers 420
19.3.1 Implementation of SCA on VAX/VMS 420
19.3.2 I/O Processing 422
19.4 Terminal Driver 422
19.4.1 Full Duplex Operation 426
19.4.2 Channels and Terminal Controllers 428
19.4.3 Type-Ahead Buffer 428
19.5 Pseudo Device Drivers 428
19.5.1 Null Device Driver 429
19.5.2 Network Device Driver 429
19.5.3 Remote Terminals 430
19.5.4 Mailbox Driver 430
19.6 Console Interface 435
19.6.1 VAX-11/730 Console Interface 435
19.6.2 VAX-11/750 Console Interface 436
19.6.3 VAX-11/780 Console Interface 436
19.6.4 Data Transfer Between the VAX-11 CPU and Console Devices 437
19.6.5 Console Interrupt Dispatching 437

PART VI/Process Creation and Deletion

20 Process Creation 443
20.1 Create Process System Service 443
20.1.1 Control Flow of Create Process 444
20.1.2 Establishing Quotas for the New Process 450
20.1.3 The PCB Vector 452
20.1.4 Fabrication of Process IDs 452
20.2 The Shell Process 454
20.2.1 Moving SHELL Into Process Context 454
20.2.2 Configuration of the Process Header 455
20.3 Process Creation in the Context of the New Process 458
20.3.1 Operation of PROCSTRT 458
20.3.2 Catch-All Condition Handler 462

21 Image Activation and Termination 463
21.1 Image Initiation 463
21.1.1 Image Activation 464
21.1.2 The Address Relocation Fixup System Service 476
21.1.3 Image Startup 480
21.2 Image Exit 482
21.2.1 Control Flow of the Exit System Service 483
21.2.2 Example of Termination Handler List Processing 484
21.3 Image and Process Rundown 485
21.3.1 Control Flow of Rundown 485
21.4 Process Privileges 488
21.4.1 Process Privilege Masks 488
21.4.2 Set Privilege System Service 490

22 Process Deletion 492
22.1 Process Deletion in Context of Caller 492
22.1.1 Delete Process System Service 492
22.2 Process Deletion in Context of Process Being Deleted 493
22.2.1 Special Kernel AST for Process Deletion 493
22.2.2 Deletion of a Process That Owns Subprocesses 496
22.2.3 Example of Process Deletion with Subprocesses 497

23 Interactive and Batch Jobs 499
23.1 The Job Controller and Unsolicited Input 499
23.1.1 Unsolicited Terminal Input 499
23.1.2 The SUBMIT Command 502
23.1.3 Unsolicited Card Reader Input 502
23.2 The LOGINOUT Image 503
23.2.1 Interactive Jobs 503
23.2.2 LOGINOUT Operation for Batch Jobs 505
23.2.3 The Logout Operation 506
23.3 Command Language Interpreters and Image Execution 508
23.3.1 CLI Initialization 509
23.3.2 Command Processing Loop 509
23.3.3 Image Initiation by DCL 511
23.3.4 Image Termination 513
23.3.5 Abnormal Image Termination 514
23.4 The LOGOUT Operation 516

PART VII/System Initialization

24 Bootstrap Procedures 521
24.1 Processor-Specific Initialization 521
24.1.1 VAX-11/730 Initial Bootstrap Operation 521
24.1.2 VAX-11/750 Initial Bootstrap Operation 524
24.1.3 VAX-11/780 Initial Bootstrap Operation 528
24.2 Primary Bootstrap Program 530
24.2.1 Motivation for Two Bootstrap Programs 534
24.2.2 Operation of VMB 535
24.2.3 Bootstrap Driver and I/O Subroutines 542
24.2.4 File Operations 542
24.3 Secondary Bootstrap Program (SYSBOOT) 542
24.3.1 Detailed Operation of SYSBOOT 543

25 Operating System Initialization 548
25.1 Initial Execution of the Executive (INIT) 548
25.1.1 Turning on Memory Management 548
25.1.2 Initialization of the Executive 550
25.1.3 I/O Adapter Initialization 557
25.1.4 CPU-Dependent Routines 558
25.2 Initialization in Process Context 559
25.2.1 SYSINIT Process 561
25.2.2 The STARTUP Process 564
25.3 The System Generation Utility (SYSGEN) 565
25.3.1 Contents of Parameter Block 566
25.3.2 Use of Parameter Files by SYSBOOT 566
25.3.3 Use of Parameter Files by SYSGEN 570

26 Size of System Virtual Address Space 572
26.1 Size of Process Header 572
26.1.1 Process Page Tables 573
26.1.2 Working Set List and Process Section Table 573
26.1.3 Process Header Page Arrays 575
26.2 System Virtual Address Space 576
26.2.1 System Virtual Address Space and SYSBOOT Parameters 576
26.2.2 System Page Table and the PFN Database 585
26.2.3 Approximation Used by SYSBOOT 586
26.2.4 Renormalization of SPTREQ 587
26.3 Physical Memory Requirements of the Executive 587
26.3.1 Physical Memory Used by the Executive 587
26.3.2 System Processes 589
26.4 Sizes of Pieces of P1 Space 590

27 Powerfail Recovery 596
27.1 Powerfail Sequence 596
27.2 Power Recovery 597
27.2.1 Initial Step in Power Recovery 598
27.2.2 Operation of the Restart Routine 601
27.2.3 Device Notification 603
27.2.4 Process Notification 604
27.3 Multiple Power Failures 605
27.3.1 Nested Power Fail Interrupts 605
27.3.2 Prevention of Nested Restarts 606
27.3.3 Device Driver Action 606
27.4 Power Failure on the UNIBUS 607
27.4.1 UNIBUS Power Failure on the VAX-11/730 and VAX-11/750 607
27.4.2 UNIBUS Power Failure on the VAX-11/780 607

28 The VAX-11/782 Multiprocessing System 609
28.1 How the VMS System Supports Multiprocessing 610
28.1.1 Hooks in the Executive 611
28.1.2 Hardware Support for Multiprocessing 612
28.2 System Initialization on the VAX-11/782 613
28.2.1 System Initialization on the Primary Processor 613
28.2.2 System Initialization on the Attached Processor 613
28.2.3 Turning Multiprocessing On 614
28.2.4 Turning Multiprocessing Off 615
28.3 Scheduling and Interrupts on the VAX-11/782 616
28.3.1 Scheduling Processes on the VAX-11/782 617
28.3.2 Preventing Scheduling on the Attached Processor 618
28.3.3 Executing Jobs on the Attached Processor 618
28.3.4 Detecting Access Mode Transitions 620
28.3.5 Interrupt Communication 621

PART VIII/Miscellaneous Topics

29 Logical Names 625
29.1 Logical Name Tables 625
29.1.1 Logical Name Data Structures 625
29.1.2 Logical Name Block 627
29.1.3 Searching for a Logical Name 628
29.1.4 Hashing the Logical Names 628
29.1.5 Changes to Speed Logical Name Translation 629
29.2 Logical Name System Services 629
29.2.1 Privilege and Protection Checks 630
29.2.2 Logical Name Table Mutexes 630
29.2.3 Logical Name Creation 630
29.2.4 Logical Name Deletion 631
29.2.5 Logical Name Translation 631

30 Miscellaneous System Services 632
30.1 Communication with System Processes 632
30.1.1 Accounting Manager (Job Controller) 632
30.1.2 Symbiont Manager (Job Controller) 633
30.1.3 Operator Communications 634
30.1.4 Error Logger 634
30.2 System Message File Services 635
30.2.1 Get Message System Service 635
30.2.2 Put Message System Service 637
30.2.3 Procedure EXE$EXCMSG 638
30.3 Process Information ($GETJPI) 639
30.3.1 Operation of the $GETJPI System Service 639
30.3.2 $GETJPI Special Kernel Mode ASTs 641
30.3.3 Wildcard Support in $GETJPI 641
30.4 System Information ($GETSYI) 642
30.5 Formatting Support 642
30.5.1 Time Conversion Services 643
30.5.2 Formatted ASCII Output 643

31 Use of Listing and Map Files 645
31.1 Hints in Reading the Executive Listings 645
31.1.1 Structure of a MACRO Listing File 645
31.1.2 The VAX-11 Instruction Set and Addressing Modes 649
31.1.3 Use of the REI Instruction 653
31.1.4 Register Conventions 654
31.1.5 Elimination of Seldom-Used Code 655
31.1.6 Dynamically Locking Code or Data into Memory 656
31.2 Use of Map Files 658
31.2.1 The Executive Map SYS.MAP 658
31.2.2 RMS.MAP, DCL.MAP, and MP.MAP 659
31.2.3 Device Driver Map Files 660
31.2.4 CPU-Dependent Routines 660
31.2.5 Other Map Files 661
31.3 The System Dump Analyzer (SDA) 661
31.3.1 Global Locations 661
31.3.2 Layout of System Virtual Address Space 662
31.3.3 Layout of P1 Space 662
31.4 Interpreting MDL Files 662
31.4.1 Sample Structure Definitions 662
31.4.2 Commonly Used MDL Commands 663
31.4.3 Bit Field Definitions—The V Directive 670

32 Naming Conventions 671
32.1 Public Symbol Patterns 671
32.2 Object Data Types 676
32.3 Facility Prefix Table 677

APPENDIXES

A Executive Data Areas 683
A.1 Statically Allocated Executive Data 683
A.2 Dynamically Allocated Executive Data 725

B Data Structure Definitions 733
B.1 Executive Data Structures 736
B.2 Constants 764
B.3 Data Structures Used by the I/O System 771
B.4 Data Structures Used by Files-11 773
B.5 Miscellaneous Data Structures and Constants 774



PART I/Introduction 



1 System Overview



For the fashion of Minas Tirith was such that it was built on 
seven levels, each delved into a hill, and about each was set a 
wall, and in each wall was a gate. 
— J.R.R. Tolkien, The Return of the King 

This chapter introduces the basic concepts that are used to describe the VAX/VMS operating system. Special attention is paid to the features of the VAX architecture that are either exploited by the operating system or exist solely to support an operating system. In addition, some of the design goals that guided the implementation of the VMS operating system are discussed.

1.1 PROCESS, JOB, AND IMAGE

The fundamental unit in the VAX/VMS operating system, the entity that is selected for execution by the scheduler, is the process. If a process creates subprocesses, the collection of the creator process, all the subprocesses created by it, and all subprocesses created by its descendants, is called a job. The programs that a process executes in order to accomplish meaningful work are called images.

1.1.1 Process

A process is fully described by hardware and software context and a virtual address space description. This information is stored in several data structures located in different places in the process address space. The data structures that contain the various pieces of process context are pictured in Figure 1-1.

1.1.1.1 Hardware Context. The hardware context consists of copies of the general purpose registers, the four per-process stack pointers, the program counter (PC), the processor status longword (PSL), and the process-specific processor registers, including the memory management registers and the AST level register. The hardware context resides in a data structure called the hardware process control block that is used primarily when a process is removed from or selected for execution.

Another part of process context that is related to hardware is the existence of four per-process stacks, one for each of the four access modes. When any code executes in the context of a process, the code uses the stack associated with the code's current access mode.



[Figure 1-1 (Data Structures That Describe Process Context) shows that (1) hardware context is stored in the hardware PCB, (2) software context is spread around in the PCB, PHD, JIB, and P1 space, and (3) the virtual address space description is stored in the P0 and P1 page tables. The software process control block (PCB) holds the process name, scheduling information, process ID, and pointers to other structures. The job information block (JIB), pointed to by all other processes (if any) in the same job, holds pooled quotas, the master process ID, and a count of processes in the job. The process header (PHD) holds the working set list, process section table, accounting information, and the P0 and P1 page tables. The hardware process control block, within the PHD, holds the general registers, PC, PSL, per-process stack pointers, memory management registers, and ASTLVL. The per-process stacks, RMS data, and image data reside in P1 space (the control region), just below system space at address 80000000 (hex).]

1.1.1.2 Software Context. Software context consists of all the data required by various parts of the operating system to make scheduling and other decisions about a process. This data includes the process software priority, its current scheduling state, process privileges, quotas and limits, and miscellaneous information such as process name and process identification.

The information about a process that must be in memory at all times is stored in a data structure called the software process control block (PCB). This data includes the software priority of the process, its unique process identification (PID), and the particular scheduling state that the process is in at a given point in time. Some process quotas and limits are stored in the software PCB. The quotas and limits shared among all processes in the same job are stored in a shared data structure called the job information block.

The information about a process that does not have to be permanently resident (swappable process context) is contained in a data structure called the process header. This information is only needed when the process is resident and consists mainly of information used by memory management when page faults occur. The data in the process header is also used by the swapper when the process is removed from memory (outswapped) or brought back into memory (inswapped). The hardware PCB, which contains the hardware context of a process, is a part of the process header. Some information in the process header is available to suitably privileged code whenever the process is resident (is in the balance set), and some information is only accessible from that process's context.

Other process-specific information is stored in the P1 portion of the process virtual address space (the control region). This data includes exception dispatching information, RMS data tables, and information about the image that is currently executing. Information that is stored in P1 space is only accessible when the process is executing (is the current process) because P1 space is process specific.

1.1.1.3 Virtual Address Space Description. The virtual address space of a process is described by the process P0 and P1 page tables, stored in the high address end of the process header. The process virtual address space is altered when an image is initially activated, during image execution through selected system services, and when an image terminates. The process page tables reside in system virtual address space and are in turn described by entries in the system page table. Unlike the other portions of the process header, the process page tables are themselves pageable, and they are faulted into the process working set only when they are needed.



1.1.2 Image 

The programs that execute in the context of a process are called images. 
Images usually reside in files that are produced by the VAX/VMS linker. 






When the user initiates image execution (as part of process creation or through a DCL or MCR command in an interactive or batch job), a component of the executive called the image activator sets up the process page tables to point to the appropriate sections of the image file. The VMS operating system uses the same paging mechanism that implements its virtual memory support to read image pages into memory as they are needed.



1.1.3 Job

The collection of subprocesses that have a common root process is called a job. The concept of a job exists solely for the purpose of sharing resources. Some quotas and limits, so-called pooled quotas, are shared among all processes in the same job. The current values of these quotas are contained in a data structure called a job information block (Figure 1-1) that is shared by all processes in the same job.



1.2 FUNCTIONALITY PROVIDED BY THE VAX/VMS SYSTEM 

The VAX/VMS operating system provides services at many levels so that user 
applications may execute easily and effectively. The layered structure of the 
VAX/VMS operating system is pictured in Figure 1-2. In general, components 
in a given layer can make use of the facilities in all inner layers. 



1.2.1 Operating System Kernel

The main topic of this book is the operating system kernel: the I/O subsystem, memory management, the scheduler, and the VAX/VMS system services that support and complement these components. The discussion of these three components and other miscellaneous parts of the operating system kernel focuses on the data structures that are manipulated by a given component. By discussing what each major data structure represents, and how that structure is altered by different sequences of events in the system, we will describe the detailed operations of each major piece of the executive.

1.2.1.1 I/O Subsystem. The I/O subsystem consists of device drivers and their associated data structures, device-independent routines within the executive, and several system services, the most important of which is the $QIO request, the eventual I/O request that is issued by all outer layers of the system. The I/O subsystem is described in great detail from the point of view of adding a device driver to a VMS operating system in the VAX/VMS Guide to Writing a Device Driver. Chapters 18 and 19 of this book describe features of the I/O subsystem that are not described in that manual.
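As a concrete illustration of how outer layers funnel through $QIO, the C sketch below assigns a channel to the current output device and issues one synchronous write. C is our choice here (the book itself presents no code); the service names and argument order follow the standard starlet declarations, but error handling is reduced to a bare status check.

    #include <descrip.h>   /* $DESCRIPTOR string descriptors */
    #include <iodef.h>     /* I/O function codes such as IO$_WRITEVBLK */
    #include <starlet.h>   /* sys$assign, sys$qiow */

    int write_to_terminal(void)
    {
        $DESCRIPTOR(term, "SYS$OUTPUT"); /* device name descriptor */
        unsigned short chan;             /* channel number returned */
        unsigned short iosb[4];          /* I/O status block */
        char msg[] = "hello from $QIO\r\n";
        int status;

        status = sys$assign(&term, &chan, 0, 0);  /* path to the device */
        if (!(status & 1)) return status;

        /* Synchronous request: event flag 0, no AST routine.
           P1 and P2 are the buffer address and length. */
        status = sys$qiow(0, chan, IO$_WRITEVBLK, iosb, 0, 0,
                          msg, sizeof msg - 1, 0, 0, 0, 0);
        if (!(status & 1)) return status;
        return iosb[0];                  /* final status from the driver */
    }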



[Figure 1-2 (Layered Design of the VAX/VMS Operating System) pictures the kernel surrounded by successive layers: privileged images (images installed with privilege, other privileged images, images linked with the system symbol table, the file system, and informational utilities); the language-specific Run-Time Library (FORTRAN, PASCAL, PL/I) and the general Run-Time Library (math library, string manipulation, screen formatting); program development tools (text editors, linker, MACRO assembler, system message compiler); layered products (language compilers, DATATRIEVE, forms utilities); and assorted utilities (SORT, file manipulation, HELP, DIRECTORY).]

1.2.1.2 Memory Management. The main components of the memory management subsystem are the page fault handler, which implements the virtual memory support of the VAX/VMS operating system, and the swapper, which allows the system to more fully utilize the amount of physical memory that is available. The data structures used and manipulated by the pager and swapper include the PFN database and the page tables of each process. The PFN database describes each page of physical memory that is available for paging and swapping. Virtual address space descriptions of each currently resident process are contained in their respective page tables.

System services are available to allow a user (or the system on behalf of the user) to create or delete specific portions of virtual address space or map a file into a specified virtual address range.
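For example, a program can grow its own program region with the Expand Region ($EXPREG) system service. The sketch below (C, with error handling reduced to returning the status) asks for ten more pages at the end of P0 space and receives back the range of addresses actually created.

    #include <ssdef.h>     /* SS$_ status codes */
    #include <starlet.h>   /* sys$expreg */

    int grow_p0_region(void)
    {
        unsigned int retadr[2];   /* start and end of the new pages */
        int status;

        /* 10 new pages in region 0 (P0 space), caller's access mode. */
        status = sys$expreg(10, retadr, 0, 0);
        if (!(status & 1))
            return status;        /* e.g., SS$_EXQUOTA or SS$_VASFULL */

        /* retadr[0]..retadr[1] now bound the newly created space. */
        return SS$_NORMAL;
    }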

1.2.1.3 Scheduling and Process Control. The third major component of the kernel is the scheduler, which selects processes for execution and removes from execution processes that can no longer execute. The scheduler also handles clock servicing and includes timer-related system services. System services are available to allow a process (or programmer) to create or delete other processes. Other services provide one process the ability to control the execution of another.
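As one example of those services, the sketch below uses $CREPRC to create a subprocess running a named image. The argument list follows the standard declaration, with most optional arguments defaulted to zero; the image and process names are hypothetical placeholders.

    #include <descrip.h>   /* $DESCRIPTOR */
    #include <starlet.h>   /* sys$creprc */

    int start_worker(void)
    {
        /* Both names below are hypothetical, for illustration only. */
        $DESCRIPTOR(image,  "SYS$SYSTEM:WORKER.EXE");
        $DESCRIPTOR(prcnam, "WORKER_1");
        unsigned int pid;

        /* pidadr, image, input, output, error, privileges, quotas,
           process name, base priority 4, UIC, mailbox unit, flags */
        return sys$creprc(&pid, &image, 0, 0, 0, 0, 0,
                          &prcnam, 4, 0, 0, 0);
    }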

1.2.1.4 Miscellaneous Services. One area of the operating system kernel that is not 
pictured in Figure 1-2 involves the many miscellaneous services that are 
available in the operating system kernel. Some of these services, for such 
tasks as logical name creation or string formatting, are available to the user in 
the form of system services. Others of these miscellaneous services, such as 
pool manipulation routines and synchronization techniques, are only used by 
the kernel and privileged utilities. 



1.2.2 Data Management

The VAX/VMS operating system provides data management facilities at two levels. The record structure that exists within a file is interpreted by the VAX-11 Record Management Services (RMS), which exists in a layer just outside the kernel. RMS exists as a series of procedures located in system space, so it is in some ways just like the rest of the operating system kernel. Most of the procedures in RMS execute in executive access mode, providing a thin wall of protection between RMS and the kernel itself.

The placement of files on mass storage volumes is controlled by one of the disk or tape ACPs (Ancillary Control Process). ACPs are implemented as separate processes because many of their operations must be serialized to avoid synchronous access conflicts. These processes interact with the kernel both through the system service interface and by using some of the utility routines that are not accessible to the general user.
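To show where the RMS layer sits relative to user code, here is a minimal C sketch that reads the first record of a sequential file through the FAB and RAB control blocks. The file name is hypothetical, and every error path simply returns the RMS status to the caller.

    #include <rms.h>       /* FAB, RAB, and the cc$rms_* initializers */

    int read_first_record(char *buf, unsigned short bufsz)
    {
        struct FAB fab = cc$rms_fab;  /* file access block */
        struct RAB rab = cc$rms_rab;  /* record access block */
        int status;

        fab.fab$l_fna = "WORK:[DATA]EXAMPLE.DAT";           /* hypothetical */
        fab.fab$b_fns = sizeof "WORK:[DATA]EXAMPLE.DAT" - 1;

        rab.rab$l_fab = &fab;         /* connect record stream to the file */
        rab.rab$l_ubf = buf;          /* user buffer for $GET */
        rab.rab$w_usz = bufsz;

        status = sys$open(&fab);      /* RMS runs in executive mode */
        if (!(status & 1)) return status;
        status = sys$connect(&rab);
        if (status & 1)
            status = sys$get(&rab);   /* record length returned in rab$w_rsz */
        sys$close(&fab);
        return status;
    }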



1.2.3 User Interface

The interface that is presented to the user (as distinct from the application programmer who is using system services and Run-Time Library procedures) is one of the command language interpreters (CLI). Some of the services performed by a CLI call RMS or the system services directly. Others result in the execution of an external image. These images are generally no different from user-written applications because their only interface to the executive is through the system services and RMS calls.

1.2.3.1 Images Installed with Privilege. Some of the informational utilities and disk and tape volume manipulation utilities require that selected portions of protected data structures be read or written in a controlled fashion. Images that require privilege to perform their function can be installed (made known to the operating system) by the system manager so that they can perform their function in an ordinarily nonprivileged process environment. Images that fit this description are MAIL, MONITOR, VMOUNT (the volume mount utility), SET, and SHOW. Table 1-1 lists all those images that are installed with privilege in a typical VMS system.

1.2.3.2 Other Privileged Images. Other images that perform privileged functions are not installed with privilege because their functions are less controlled and could destroy the system if executed by naive or malicious users. These images can only be executed by privileged users. Examples of these images include SYSGEN (for loading device drivers), INSTALL (which makes images privileged or shareable), and the images invoked by a CLI to manipulate print or batch queues. Images that require privilege to execute but are not installed with privilege in a typical VAX/VMS system are also listed in Table 1-1.

1.2.3.3 Images That Link with SYS$SYSTEM:SYS.STB. Table 1-1 also lists those components that are linked with the system symbol table (SYS$SYSTEM:SYS.STB). These images access known locations in the system image (SYS.EXE) through global symbols and must be relinked each time the system itself is relinked. User applications or special components such as device drivers that include SYS.STB when they are linked must be relinked whenever a new version of the symbol table is released, usually at each major release of the VAX/VMS operating system.



Table 1-1: System Processes and Privileged Images

System Processes

  Image Name       Linked with SYS.STB   Description
  F11AACP.EXE      Yes                   Files-11 Structure Level 1 ACP
  F11BACP.EXE      Yes                   Files-11 Structure Level 2 ACP
  MTAAACP.EXE      Yes                   Magnetic Tape ACP
  REMACP.EXE       Yes                   Remote Terminal ACP
  NETACP.EXE       Yes                   Network ACP
  ERRFMT.EXE       Yes                   Error Log Buffer Format Process
  INPSMB.EXE       Yes                   Card Reader Input Symbiont
  JOBCTL.EXE       Yes                   Job Controller/Symbiont Manager
  OPCOM.EXE        Yes                   Operator Communication Facility
  PRTSMB.EXE       Yes                   Print Symbiont

Images Installed with Privilege (in a typical VMS system)

  Image Name       Linked with SYS.STB   Description
  DISMOUNT.EXE     Yes                   Volume Dismount Utility
  INIT.EXE         Yes                   Volume Initialization Utility
  LOGINOUT.EXE     Yes                   Login/Logout Image
  MAIL.EXE         No                    Mail Utility
  MONITOR.EXE      Yes                   System Statistics Utility
  PHONE.EXE        No                    Phone Utility
  REQUEST.EXE      Yes                   Operator Request Facility
  SET.EXE          Yes                   SET Command Processor
  SETP0.EXE        Yes                   SET Command Processor
  SHOW.EXE         Yes                   SHOW Command Processor
  SUBMIT.EXE       No                    Batch and Print Job Submission Facility
  VMOUNT.EXE       Yes                   Volume Mount Utility

Images That Require Privilege That Are Typically Not Installed

  Image Name       Linked with SYS.STB   Description
  AUTHORIZE.EXE    Yes                   Authorize Utility
  INSTALL.EXE      Yes                   Known Image Installation Utility
  NCP.EXE          Yes                   Network Control Program
  OPCCRASH.EXE     Yes                   System Shutdown Facility
  QUEMAN.EXE       No                    Queue Manipulation Command Processor
  REPLY.EXE        No                    Message Broadcasting Facility
  RMSSHARE.EXE     Yes                   File Sharing Utility
  RUNDET.EXE       No                    RUN Process Command Processor
  SDA.EXE          Yes                   System Dump Analyzer
  SYSGEN.EXE       Yes                   System Generation and Configuration Utility

Images Whose Operations Are Protected by System UIC or Volume Ownership

  Image Name       Linked with SYS.STB   Description
  BAD.EXE          No                    Bad Block Locator
  BACKUP.EXE       No                    Backup Utility
  DSC1.EXE         No                    Disk Save and Compress Utility for Structure Level 1
  DSC2.EXE         No                    Disk Save and Compress Utility for Structure Level 2
  DISKQUOTA.EXE    Yes                   Disk Quota Utility
  VERIFY.EXE       No                    File Structure Verification Utility

Miscellaneous Images Linked with SYS$SYSTEM:SYS.STB

  Image Name       Linked with SYS.STB   Description
  DCL.EXE          Yes                   DCL Command Interpreter
  MCR.EXE          Yes                   MCR Command Interpreter
  MP.EXE           Yes                   Multiprocessing Loadable Code
  RMS.EXE          Yes                   Record Management Services Image
1.2.4 Interface among Kernel Subsystems 

The coupling among the three major subsystems pictured in Figure 1-2 is somewhat misleading because there is actually little interaction between the three components. In addition, each of the three components has its own section of executive data structures that it is responsible for. When one of the other pieces of the system wishes to access such data structures, it does so through some controlled interface. Figure 1-3 shows the small amount of interaction that occurs between the three major subsystems in the operating system kernel.

[Figure 1-3 (Interaction between Components of VMS Kernel) shows the I/O subsystem asking memory management to lock and unlock physical pages for direct I/O, and memory management and the scheduler exchanging the events page fault wait, page fault read complete, free page wait, physical page available, inswap complete, and outswap complete.]

1.2.4.1 I/O Subsystem Requests. The I/O subsystem makes a request to memory management to lock down specified pages for a direct I/O request. The pager or swapper is notified directly when the I/O request that just completed was initiated by either one of them.

I/O requests can result in the requesting process being placed in a wait state, until the request completes. This change of state requires that the scheduler be notified. In addition, I/O completion can also cause a process to change its scheduling state. Again, the scheduler would be called.

1.2.4.2 Memory Management Requests. Both the pager and swapper require input and output operations in order to fulfill their functions. Neither calls $QIO directly because many of the protection checks that $QIO makes are unnecessary and would slow down page I/O and swap I/O. Instead, the pager and swapper use special entry points into the I/O subsystem, and these points allow prebuilt I/O requests to be queued directly to a driver.

If a process incurs a page fault that results in a read from disk, or if a process requires physical memory and none is available, the process is put into one of the memory management wait states by the scheduler. When the page read completes or physical memory becomes available, the process is made computable again.

1.2.4.3 Scheduler Requests. The scheduler interacts very little with the rest of the system. It serves a more passive role when cooperation with memory management or the I/O subsystem is required. One exception to this passive role is that the scheduler awakens the swapper when a process that is not currently memory resident becomes computable.


1.3 HARDWARE IMPLEMENTATION OF THE OPERATING SYSTEM KERNEL

The method of implementing the many services provided by the VAX/VMS 
operating system illustrates the close connection between the hardware de- 
sign and the operating system. Many of the general features of the VAX archi- 
tecture are used to advantage by the VAX/VMS operating system. Other fea- 
tures of the architecture exist entirely to support an operating system. 



1.3.1 VAX Architecture Features Exploited by VMS

Several features of the VAX architecture that are available to all users are 
used for specific purposes by the operating system. 

• The general purpose calling mechanism is the primary path into the oper- 
ating system from all outer layers of the system. Because all system serv- 
ices are procedures, they are available to all native mode languages. 

• The memory management protection scheme is used to protect code 
and data used by more privileged access modes from modification by less 
privileged modes. Read-only portions of the executive are protected in the 
same manner. 

• There is implicit protection built into special instructions that may only 
be executed from kernel mode. Because only the executive (and suitably 
privileged process-based code) executes in kernel mode, such instructions 
as MTPR, LDPCTX, and HALT are protected from execution by non- 
privileged users. 

• The operating system uses interrupt priority level (IPL) for several pur- 
poses. At its most elementary level, IPL is elevated so that certain inter- 
rupts are blocked. For example, clock interrupts must be blocked while the 
system time (stored in a quadword) is checked because this checking takes 
more than one instruction. Clock interrupts are blocked to prevent the 
system time from being updated while it is being checked. 

• IPL is also used as a synchronization tool. For example, any routine that 
accesses a system-wide data structure must raise IPL to 7 (called 
IPL$_SYNCH). The assignment of various hardware and software inter- 
rupts to specific IPL values establishes an order of importance to the hard- 
ware and software interrupt services that the VMS operating system per- 
forms. 

• Several other features of the VAX architecture are used by specific compo- 
nents of the operating system and are described in later chapters. They 
include the following: 

—The change mode instructions (CHME and CHMK), which are used to
decrease access mode (to greater privilege) (see Figure 1-4). Note that
most exceptions and all interrupts result in changing mode to kernel (a
brief introduction to exceptions and interrupts is presented in Section
1.3.5).
—The inclusion of many protection checks and pending interrupt checks
in the single instruction that is the common interrupt exit path, REI.
—Software interrupts.
—Hardware context and the single instructions (SVPCTX and LDPCTX)
that save and restore it.
—The use of ASTs to obtain and pass information.

[Figure 1-4: Methods for Altering Access Mode. Access mode fields in the PSL
are not directly accessible to the programmer or to the operating system. A
process can reach a MORE privileged access mode through the CHMx
instructions; in addition, most exceptions (except CHME, CHMS, and CHMU) and
all interrupts cause an access mode change to kernel. The only way to reach a
LESS privileged access mode is through the REI instruction. The boundaries
between the access modes are nearly identical to the layer boundaries
pictured in Figure 1-2:
• Nearly all of the system services execute in kernel mode.
• RMS and some system services execute in executive mode.
• Command Language Interpreters normally execute in supervisor mode.
• Utilities, application programs, Run-Time Library procedures, and so on
normally execute in user mode. Privileged utilities sometimes execute in
kernel or executive mode.]

1.3.2 VAX-11 Instruction Set 

While the VAX-11 instruction set, data types, and addressing modes were 
designed to be somewhat compatible with the PDP-11, several features that 



were missing in the PDP-11 were added to the VAX architecture. True con- 
text indexing allows array elements to be addressed by element number, with 
the hardware accounting for the size (byte, word, longword, or quadword) of 
each element. Short literal addressing was added in recognition of the fact 
that the majority of literals that appear in a program are small numbers. 
Variable length bit fields and character data types were added to serve the 
needs of several classes of users, including operating system designers. 

The instruction set includes many instructions that are useful to any de- 
signer and occur often in the VMS executive. The queue instructions allow 
the construction of doubly linked lists as a common dynamic data structure. 
Character string instructions are useful when dealing with any data structure 
that can be treated as an array of bytes. Bit field instructions allow efficient 
operations on flags and masks. 
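
For instance, the following fragment is a minimal sketch (it is not taken
from the executive; the listhead and the register contents are hypothetical)
of how the INSQUE and REMQUE instructions maintain such a list:

        QHEADER:
                .LONG   QHEADER         ; forward link; an empty queue
                .LONG   QHEADER         ;  points to its own listhead

                INSQUE  (R2),QHEADER    ; insert the block addressed by
                                        ;  R2 at the head of the queue
                INSQUE  (R3),@QHEADER+4 ; insert the block addressed by
                                        ;  R3 at the tail of the queue
                REMQUE  @QHEADER,R0     ; remove the block at the head
                                        ;  of the queue; its address is
                                        ;  returned in R0

Each entry begins with its own pair of links, and each insertion or removal
is performed as a single noninterruptible instruction.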

One of the most important features of the VAX architecture is the calling 
standard. Any procedure that adheres to this standard can be called from any 
native language, an advantage for any large application that wishes to make 
use of the features of a wide range of languages. The VMS operating system 
adheres to this standard in its interfaces to the outside world through the 
system service interface, RMS entry points, and the Run-Time Library proce- 
dures. All system services and RMS routines are written as procedures that 
can be accessed by issuing a CALLx to absolute location SYS$service in the 
process P1 virtual address space. Run-Time Library procedures are included
in a user's image instead of being located in system space. 
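
As a minimal sketch of this interface (the choice of service, the event flag
number, and the data cell are arbitrary), a VAX MACRO program calls the Read
Event Flags system service exactly as it would call any other procedure:

        STATE:  .LONG   0               ; longword to receive the flags

                PUSHAL  STATE           ; address of the longword that
                                        ;  receives the event flag cluster
                PUSHL   #1              ; event flag number, by value
                CALLS   #2,G^SYS$READEF ; call the service as a procedure;
                                        ;  status is returned in R0

Higher level languages generate equivalent CALLx sequences from ordinary
function and subroutine calls.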



1.3.3 Implementation of VMS Kernel Routines 

In Section 1.2.1, the VMS kernel was divided into three functional pieces plus 
the system service interface to the rest of the world. Alternatively, the oper- 
ating system kernel can be partitioned according to the method used to gain 
access to each part. Three classes of routines within the kernel are proce- 
dure-based code, exception service routines, and interrupt service routines. 
Other system-wide functions, the swapping and modified page writing per- 
formed by the swapper, are implemented as a separate process that resides in 
system space. Figure 1-5 shows the various entry paths into the operating 
system kernel. 

1.3.3.1 Process Context and System State. The first section of this chapter discussed 
the pieces of the system that are used to describe a process. Process context 
includes a complete address space description, quotas, privileges, scheduling 
data, and so on. Any portion of the system that executes in the context of a 
process can count on all of these process attributes being available. 

There is a portion of the kernel, however, that operates outside the context 
of a specific process. The largest class of routines that fall into this category is 
that of interrupt service routines, invoked in response to external events
with no regard for the currently executing process. Portions of the
initialization sequence also fall into this category. In any case, there are
no process features such as a kernel stack or a page fault handler available
when these routines are executing.

[Figure 1-5: Paths into Components of VMS Kernel. The entry paths shown are:
External Device Hardware Interrupts (IPL=20...23); Translation-not-Valid
Fault (Page Fault), an exception rather than an interrupt; Device Driver Fork
Processing (IPL=8...11); I/O Postprocessing Software Interrupt (IPL=4); AST
Delivery Software Interrupt (IPL=2); Rescheduling Software Interrupt (IPL=3);
Software Timer Interrupt (IPL=7); Hardware Clock Interrupt (IPL=24).]

Because of the lack of a process, this system state or interrupt state can be 
characterized by the following limited context. 

• All stack operations take place on the system-wide interrupt stack. 

• The primary description of system or interrupt state is contained in the 
processor status longword (PSL). The PSL will indicate that the interrupt 
stack is being used, that the current access mode is kernel mode, and that 
the IPL is higher than IPL 2. 

• The system control block, the data structure that controls the dispatching 
of interrupts and exceptions, can be thought of as the secondary structure 
that describes system state. 




• Code that executes in this so-called system context can only refer to sys- 
tem virtual addresses. In particular, there is no P1 space available, so the
system-wide interrupt stack must be located in system space. 

• No page faults are allowed. The page fault handler generates a fatal bug- 
check if a page fault occurs and the IPL is above IPL 2. 

• No exceptions are allowed. Exceptions, like page faults, are associated 
with a process. The exception dispatcher generates a fatal bugcheck if an 
exception occurs above IPL 2 or while the processor is executing on the 
interrupt stack. 

• ASTs, asynchronous events that allow a process to receive notification
when external events have occurred, are not allowed. (The AST delivery
interrupt is delivered when IPL drops below IPL 2, an indication that the
processor is leaving the interrupt state.)

• No system services are allowed in the system state. (In fact, most system
services can only be called from process context at IPL 0; only the memory
management system services can be called at IPL 2. Process deletion requires
that these system services be callable at IPL 2; doing so requires a
great deal of care and is not recommended.)

1.3.3.2 Process-Based Routines. Procedure-based code (RMS services and the system 
services) and exception service routines usually execute in the context of the 
current process (on the kernel stack when in kernel mode). 

The system services are implemented as procedures and are available to all 
native mode languages. In addition, the fact that they are procedures means 
that there is a call frame on the stack. Thus, errors detected by a utility 
subroutine used by a system service can return an error simply by putting the 
error status into R0 and issuing a RET instruction. All superfluous informa- 
tion is cleaned off the stack by the RET instruction. The system service dis- 
patchers, actually the dispatchers for the CHMK and CHME exceptions, are 
exception service routines. 
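
For example, the status-return convention just described reduces, in such a
utility subroutine, to a fragment like the following sketch (the status value
chosen here is arbitrary):

                MOVZWL  #SS$_ACCVIO,R0  ; place the failure status in R0
                RET                     ; the RET removes the call frame
                                        ;  and everything above it

A success return is the same sequence with a success status, typically
SS$_NORMAL, in R0.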

System services must be called from process context. They are not availa- 
ble from interrupt service routines or other code (such as portions of the 
initialization sequence) that executes outside the context of a process. One 
reason for requiring process context is that the various services assume that 
there is a process whose privileges can be checked and whose quotas can be 
charged as part of the normal operation of the service. Some system services 
reference locations in PI space, a portion of address space only available 
while executing in process context. System services also make assumptions 
about IPL and synchronization that would be violated if they were called 
from other than process-based code executing at IPL 0. 

The pager (the page fault exception handler) is an exception service routine 
that is invoked in response to a translation-not-valid fault. The pager thus 
satisfies page faults in the context of the process that incurred the fault. Be- 




cause page faults are associated with a process, the system cannot tolerate 
page faults that occur in interrupt service routines or other routines that 
execute outside the context of a process. The actual restriction imposed by 
the pager is even more stringent. Page faults are not allowed above IPL 2. This 
restriction applies to process-based code executing at elevated IPL as well as 
to interrupt service code. 

1.3.3.3 Interrupt Service Routines. By their asynchronous nature, interrupts execute 
without the support of process context (on the system-wide interrupt stack). 

• I/O requests are initiated through the $QIO system service, which can be 
issued directly by the user or by some intermediary, such as RMS, on the 
user's behalf. Once an I/O request has been placed into a device queue, it 
remains there until the driver is triggered, usually by an interrupt gener- 
ated in the external device. 

Two classes of software interrupt service routines exist solely to support 
the I/O subsystem. The fork level interrupts allow device drivers to lower 
IPL in a controlled fashion. Final processing of I/O requests is also done in 
a software interrupt service routine. 

• The timer functions in the operating system include support in both the 
hardware clock interrupt service routine and a software interrupt service 
routine that actually services individual timer requests. 

• Another software interrupt performs the rescheduling function, where one 
process is removed from execution and another selected and placed into 
execution. 

1.3.3.4 Special Processes — Swapper and Null. The swapper and the null process are 
different from any other processes that exist in a VAX/VMS system. The 
differences lie not in their operations, which are completely normal, but in 
their limited context. 

The limited context of either of these processes is due, in part, to the fact 
that these two processes exist as part of the system image SYS.EXE. They do
not have to be created with the Create Process system service. Specifically, 
their PCBs and process headers are assembled (in module PDAT) and linked 
into the system image. Other characteristics of these two processes are listed 
here. 

• Their process headers are static. There is no working set list and no process 
section table. Neither process supports page faults. All code executed by 
either process must be locked into memory in some way. In fact, the code 
of both of these processes is part of the nonpaged executive. 

• Both processes execute entirely in kernel mode, thereby eliminating the 
need for stacks for the other three access modes. 




• Neither process has a P1 space. The kernel stack for either process is
located in system space.

• The null process does not have a P0 space either. The swapper uses an
array allocated from nonpaged pool as its P0 page table for a special portion
of process creation, the part that takes place in the context of the swapper
process.

Despite their limited contexts, both of these processes behave in a normal 
fashion in every other way. The swapper and the null process are selected for 
execution by the scheduler just like any other process in the system. The 
swapper spends its idle time in the hibernate state until some component in 
the system recognizes a need for one of the swapper functions, at which time 
it is awakened. The null process is always computable, but set to the lowest 
priority in the system (priority 0). All CPU time not used by any other proc- 
ess in the system will be used by the null process. 

1.3.3.5 Special Subroutines. There are several utility subroutines within the operat- 
ing system related to scheduling and resource allocation that are called from 
both process-based code such as system services and from software interrupt 
service routines. These subroutines are constrained to execute with the lim- 
ited context of interrupt or system state. 

1.3.4 Memory Management and Access Modes 

The address translation mechanism is described in the VAX Hardware Hand- 
book. Two side effects of this operation are of special interest to the VMS 
operating system. When a page is not valid, a translation-not-valid exception 
is generated that transfers control to an exception service routine that can 
take whatever steps are required to make the page valid. This exception 
transfers control from a hardware mechanism, address translation, to a soft- 
ware exception service routine, the page fault handler, and allows the operat- 
ing system to gain control on address translation failures in order to imple- 
ment its dynamic mapping of pages while a program is executing. 

Before the address translation mechanism checks the valid bit, a protection 
check is made to determine whether the requested access will be granted. 
The check uses the current access mode in the PSL (PSL<25:24>), the
protection code that is defined for each virtual page, and the type of access
(read, modify, or write) to make its decision. This protection check allows the
operating system to make read-only portions of the executive inaccessible to 
anyone (all access modes) for writing, preventing corruption of operating sys- 
tem code. In addition, privileged data structures can be protected from even 
read access by nonprivileged users, preserving the integrity of the operating 
system. 




1.3.5 Exceptions, Interrupts, and REI 

Before mentioning other features of the exception and interrupt mechanisms 
used by the VMS operating system, it would be helpful to compare and con- 
trast these two mechanisms. 

1.3.5.1 Comparison of Exceptions and Interrupts. The following list summarizes 
some of the characteristics of exceptions and interrupts. 

• Interrupts occur asynchronously to the currently executing instruction 
stream. They are actually serviced between individual instructions or at 
well-defined points within the execution of a given instruction. Excep- 
tions occur synchronously as a direct effect of the execution of the current 
instruction. 

• Both mechanisms pass control to service routines whose addresses are 
stored in the system control block. These routines perform exception- 
specific or interrupt-specific processing. 

• Exceptions are generally a part of the currently executing process. Their 
servicing is an extension of the instruction stream that is currently execut- 
ing on behalf of that process. Interrupts are system-wide events that can- 
not rely on support of a process in their service routines. 

• Because interrupts are not related to the currently executing process, the
system-wide interrupt stack is usually used to store the PC and PSL of the
process that was interrupted. Exceptions, which are usually caused by the
executing process, are usually serviced on the per-process kernel stack.
Which stack to use is actually determined by control bits in the system
control block entries for each exception or interrupt.

• Interrupts cause a PC/PSL pair to be pushed onto the stack. Exceptions 
often cause exception-specific parameters to be stored along with a PC/PSL 
pair. 

• Interrupts cause the IPL to change. Exceptions usually do not have an IPL 
change associated with them. (Machine checks and kernel-stack-not-valid 
exceptions elevate IPL to 31.) 

• A corollary of the previous point is that interrupts can be blocked by elevat-
ing IPL to a value at or above the IPL associated with the interrupt that is 
to be blocked. Exceptions, on the other hand, cannot be blocked. However, 
some exceptions can be disabled (by clearing associated bits in the PSW). 

• When an interrupt or exception occurs, a new PSL is formed that summa- 
rizes the new IPL, the current access mode (almost always kernel), the 
stack being used (interrupt or other), and so on. One difference between 
exceptions and interrupts, a difference that reflects the fact that interrupts 
are not related to the interrupted instruction stream, is that the previous 
access mode field in the new PSL is set to kernel for interrupts, while the 
previous mode field for exceptions reflects the access mode in which the 
exception occurred. 




1.3.5.2 Other Uses of Exceptions and Interrupts. In addition to the translation-not- 
valid fault used by memory management software, the operating system also 
uses the change-mode-to-kernel and change-mode-to-executive exceptions as 
entry paths to the executive. System services that must execute in a more 
privileged access mode use either the CHMK or CHME instruction to gain 
access mode rights (see Figure 1-4). The system handles most other excep- 
tions by passing them through a common exception dispatcher described in 
Chapter 4. 

Hardware interrupts temporarily suspend code that is executing so that an 
interrupt-specific routine can service the interrupt. Interrupts have an IPL 
associated with them. The internal processor priority level (IPL) is raised 
when the interrupt is recognized. High level interrupt service routines thus 
prevent the recognition of lower level interrupts. Lower level interrupt serv- 
ice routines can be interrupted by subsequent higher level interrupts. Kernel 
mode routines can also block interrupts at certain levels by specifically rais- 
ing the IPL. 

The VAX architecture also defines a series of software interrupt levels that 
can be used for a variety of purposes. The VMS operating system uses them 
for scheduling, I/O completion routines, and for synchronizing access to cer- 
tain classes of data structures. 



1.3.5.3 The REI Instruction. The REI instruction is the common exit path for inter- 
rupts and exceptions. Many protection and privilege checks are incorporated 
into this instruction. Because most fields in the processor status longword 
are not accessible to the programmer, the REI instruction provides the only 
means for changing access mode to a less privileged mode (see Figure 1-4). It 
is also the only way to reach compatibility mode. 

Although the IPL field of the PSL is accessible through the PR$_IPL proces- 
sor register, execution of an REI is a common way that IPL is lowered during 
normal execution. Because a change in IPL can alter the deliverability of 
pending interrupts, many hardware and software interrupts are delivered 
after an REI instruction is executed. 



1.3.6 Process Structure 

The VAX architecture also defines a data structure called a hardware process 
control block that contains copies of all a process's general registers when the 
process is not active. When a process is selected for execution, the contents of 
this block are copied into the actual registers inside the processor with a 
single instruction, LDPCTX. The corresponding instruction that saves the 
contents of the general registers when the process is removed from execution 
is SVPCTX. 
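
The heart of a context switch is therefore quite short. The following
fragment is a sketch of the general mechanism only, not the executive's
actual rescheduling code; it assumes that R0 already holds the physical
address of the new process's hardware PCB:

                SVPCTX                  ; save the general registers of
                                        ;  the current process in its
                                        ;  hardware PCB
                MTPR    R0,S^#PR$_PCBB  ; point the processor at the new
                                        ;  process's hardware PCB
                LDPCTX                  ; load the new context; the saved
                                        ;  PC and PSL are pushed onto the
                                        ;  new kernel stack
                REI                     ; resume the new process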




1.4 OTHER SYSTEM CONCEPTS 

This chapter began by discussing the most important concepts in the VMS 
operating system, process and image. There are several other fundamental 
ideas that should be mentioned before beginning a detailed description of 
VMS internals. Some of these ideas are briefly described here. 



1.4.1 Resource Control 

The VAX/VMS operating system protects itself and other processes in the 
system from careless or malicious users with hardware and software protec- 
tion mechanisms, software privileges, and software quotas and limits. 

1.4.1.1 Hardware Protection. The memory management protection mechanism that 
is related to access mode is used to prevent unauthorized users from modify- 
ing (or even reading) privileged data structures. Access mode protection is 
also used to protect system and user code, and other read-only data struc- 
tures, from being modified by programming errors. 

A more subtle but perhaps more important aspect of protection provided by 
the memory management architecture is that the process address space of 
one process (P0 space and P1 space) is not accessible to code running in the
context of another process. When such accessibility is desired to share com- 
mon routines or data, the operating system provides a controlled access 
through global sections. System virtual address space is available to all proc- 
esses (although page-by-page protection may deny read or write access to 
specific system virtual pages for certain access modes). 

1.4.1.2 Process Privileges. Many operations that are performed by system services 
could destroy operating system code or data or corrupt existing files if per- 
formed carelessly. Other services allow a process to adversely affect other
processes in the system. The VMS operating system requires that
processes wishing to execute these potentially damaging operations be suita- 
bly privileged. Process privileges are assigned when a process is created, ei- 
ther by the creator or through the user's record in the authorization file. 

These privileges are described in the VAX/VMS System Management and 
Operations Guide and in the VAX/VMS System Services Reference Manual. 
The privileges themselves are specific bits in a quadword that is stored in the 
beginning of the process control block. (The locations and manipulations of 
the several process privilege masks that the operating system maintains are 
discussed in Chapter 21.) When a VMS service that requires privilege is 
called, the service checks to see whether the associated bit in the process 
privilege mask is set. 
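
In outline, such a check is no more than a bit test. The following fragment
is a hypothetical sketch (it assumes that R4 contains the software PCB
address and uses the symbolic offsets and bit numbers defined by the $PCBDEF
and $PRVDEF macros):

                BBC     #PRV$V_WORLD,PCB$Q_PRIV(R4),NOPRIV
                                        ; branch if the WORLD privilege
                                        ;  bit is clear
                ...                     ; privileged operation proceeds

        NOPRIV: MOVZWL  #SS$_NOPRIV,R0  ; otherwise return the familiar
                RET                     ;  "no privilege" error status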




1.4.1.3 Quotas and Limits. The VMS operating system also controls allocation of its 
system-wide resources, such as nonpaged dynamic memory and page file 
space, through the use of quotas and limits. These process attributes are also 
assigned when the process is created. By restricting such items as the number 
of concurrent I/O requests or pending ASTs, the executive exercises control 
over the resource drain that a single process can exert on system resources 
such as nonpaged dynamic memory. In general, a process cannot perform 
certain operations (such as queue an AST) unless it has sufficient quota 
(nonzero PCB$W_ASTCNT in this case). The locations and values of the 
various quotas and limits used by the operating system are described in 
Chapter 20. 

1.4.1.4 User Identification Code (UIC). The VMS operating system uses user identifi- 
cation code (UIC) for two different protection purposes. If a process wishes to 
perform some control operation (Suspend, Wake, Delete, and so on) on an- 
other process, it requires WORLD privilege in order to affect any process in 
the system. A process with GROUP privilege can affect only other processes 
with the same group number. A process with neither WORLD nor GROUP 
privilege can affect only other processes that are part of the same job. (A 
process with neither GROUP nor WORLD privilege cannot affect any other 
process in the system, even if it has the same UIC, unless the target process is 
in the same job as the process in question.) 

The UIC is also the parameter that determines whether a user can read 
from or write to a given file. The owner of a file can determine how much 
access to his files he grants to himself, to other processes in the same group, 
and to other processes in the system. 

The same UIC protection that exists for files is also used for other data 
structures in the system. Both logical names and global sections exist in two 
varieties, group names and sections or system names and sections. The group 
variety is only available to other processes in the same group. Common event 
flags, flags that can be shared among several processes, are restricted to proc- 
esses in the same group. 



1.4.2 Other System Primitives 

Several other simple tools used by the VMS operating system are mentioned 
freely throughout this book and are described in Chapters 2, 3, and 29. 

1.4.2.1 Synchronization. Any multiprogramming system must take measures to pre- 
vent simultaneous access to system data structures. The executive uses two 
simple synchronization techniques. By elevating IPL, a subset of interrupts 
can be blocked, allowing unrestricted access to system-wide data structures. 




The most common synchronization IPL used by the operating system is IPL 
7, called IPL$_SYNCH. 

For some data structures, elevated IPL is either an unnecessary tool or a 
potential system degradation. For example, processes executing at or above 
IPL 3 cannot be rescheduled (removed from execution). Once a process gains 
control of a data structure protected by elevated IPL, it will not allow another 
process to execute until it gives up its ownership. In addition, page faults are 
not allowed above IPL 2 and so any data structure that exists in pageable 
address space cannot be synchronized with elevated IPL. 

The VMS executive requires a second synchronization tool to allow syn- 
chronized access to pageable data structures. This tool must also allow a 
process to be removed from execution while it maintains ownership of the 
structure in question. The synchronization tool that fulfills these require- 
ments is called a mutual exclusion semaphore (or mutex). Synchronization, 
including the use of mutexes, is discussed in Chapter 2. 

1.4.2.2 Dynamic Memory Allocation. The system maintains three dynamic memory 
areas from which blocks of memory can be allocated and deallocated. 
Nonpaged pool contains those system-wide structures that might be manipu- 
lated by (hardware or software) interrupt service routines or process-based 
code executing above IPL 2. Paged pool contains system-wide structures that 
do not have to be kept memory resident. The process allocation region, a 
portion of the process P1 space, is used for pageable data structures that will
not be shared among several processes. Dynamic memory allocation and 
deallocation are discussed in detail in Chapter 3. 

1.4.2.3 Logical Names. The system uses logical names for many purposes, including 
a transparent way of implementing a device-independent I/O system. The use 
of logical names as a programming tool is discussed in the VAX/VMS System 
Services Reference Manual. The internal operations of the logical name sys- 
tem services, as well as the internal organization of the logical name tables, 
are described in Chapter 29. 



1.5 LAYOUT OF VIRTUAL ADDRESS SPACE 

This section shows the approximate contents of the three different parts of 
virtual address space. 

1.5.1 System Virtual Address Space 

The layout of system virtual address space is pictured in Figure 1-6. Details 
such as the no-access pages at either end of the interrupt stack are omitted to 
avoid cluttering the diagram. Table 26-2 gives a more complete description of
system space, including these guard pages, system pages allocated by disk
drivers, and other details.

[Figure 1-6: Layout of System Virtual Address Space, beginning at the
low-address end of system space (80000000): System Service Vectors; Linked
Driver Code and Data Structures; Nonpaged Executive Data; Nonpaged Executive
Code; Pageable Executive Routines; XDELTA (usually unmapped), INIT; System
Virtual Pages Mapped to I/O Addresses; RMS Image (RMS.EXE); System Message
File (SYSMSG.EXE); Pool of Unmapped System Pages; Restart Parameter Block;
PFN Database; Paged Dynamic Memory; Nonpaged Dynamic Memory; Interrupt Stack;
System Control Block; Balance Slots; System Header; System Page Table; Global
Page Table. Figure annotations mark the static portion (SYS.EXE) and the
pieces that are dynamically mapped at initialization time.]

This figure was produced from two lists provided by the system dump
analyzer (the system page table and the contents of all global data areas in
system space) and from the system map SYS$SYSTEM:SYS.MAP. The relations
between the variable size pieces of system space and their associated SYSBOOT
parameters are given in Chapter 26.



1.5.2 The Control Region (P1 Space)

Figure 1-7 shows the layout of P1 space. This figure was produced mainly
from information contained in module SHELL, which contains a prototype of
a P1 page table that is used whenever a process is created. An SDA listing of
process page tables was used to determine the order and size of the portions
of P1 space not defined in SHELL.

Some of the pieces of P1 space are created dynamically when the
process is created. These include a P1 map of process header pages, a
command language interpreter if one is being used, and a symbol table
for that CLI.

The two pieces of P1 space at the lowest virtual addresses (the user stack
and the image I/O segment) are created dynamically each time an image
executes and are deleted as part of image rundown. Chapter 26 contains a
description of the sizes of the different pieces of P1 space. Table 26-4
gives a complete description of P1 space, including details such as memory
management page protection and the name of the system component that maps a
given portion.



1.5.3 The Program Region (P0 Space)

Figure 1-8 shows a typical layout of P0 space for both a native mode image
(produced by the VAX-11 Linker) and a compatibility mode image (produced
by the RSX-11M task builder). This figure is much more conceptual than the
previous two illustrations because P0 space does not contain pieces of the
executive as P1 space and system space do.

By default, the first page of P0 space (0 to 1FF) is not mapped (protection
set to No Access). This no-access page allows easy detection of two common
programming errors, using zero or a small number as the address of a data
location or using such a small number as the destination of a control
transfer. (A link-time request or a system service call can alter the
protection of virtual page zero. Note also that page zero is accessible to
compatibility mode images.)



[Figure 1-7: Layout of P1 Space, which spans 40000000 through 7FFFFFFF. In
order of increasing address: User Stack; Image I/O Segment (these two form
the image-specific portion of P1 space, deleted at image exit by
MMG$IMGRESET); Per-Process Message Section(s); CLI Symbol Table; CLI Image;
P1 Window to Process Header; Channel Control Block Table; Process I/O
Segment; Per-Process Common Area for Users; Per-Process Common Area Reserved
to DIGITAL; Compatibility Mode Data Page; VMS User Mode Data Page; Image
Activator Context Page; Process Allocation Region; Generic CLI Data Pages;
Image Activator Scratch Pages; Debugger Context; Vectors for Messages and
User-Written System Services; Image Header Buffer; Kernel Stack; Executive
Stack; Supervisor Stack; System Service Vectors; P1 Pointer Page; Debugger
Symbol Table (not mapped if debugger not present). The remaining regions form
the dynamic permanent and static permanent portions of P1 space.
CTL$GL_CTLBASVA locates the border between the image-specific and
process-permanent pieces of P1 space; MMG$GL_CTLBASVA locates the initial
low-address end of P1 space for each process as it is created.]



[Figure 1-8: P0 Space Allocation, for a native mode image and for a
compatibility mode image. Native mode image, from address 0 through 3FFFFFFF:
page 0 is not mapped; then come the executable image, VMSRTL, LBRSHR, and
other shareable images (this part of P0 space is defined by the linker and
mapped by the image activator); then, not defined at link time, the Debugger
(LIB$DEBUG, if requested at link, run, or execution time) or Traceback
(LIB$TRACE, if not overridden at link time and needed); if either of these
pieces is required, it is mapped, but both cannot be mapped at the same time,
and the order of the images in this portion, undefined at link time, is
determined by IMGACT at image activation time; the remainder of P0 space, up
to the P0LR pages, is not mapped. Compatibility mode image: the compatibility
mode image itself, which is defined by the RSX-11 task builder and mapped by
the AME, ends at 177777 (octal) = FFFF (hex); above it lie an unmapped
region, the RSX-11M AME (RSX.EXE, BACKTRANS.EXE), which is a native mode
image, and a further unmapped region up to the P0LR pages at 3FFFFFFF. The
AME is mapped by the image activator when it detects that it is activating a
compatibility mode image.]

The main image is placed into P0 space, starting at address 200 (hex). Any
shareable libraries that are position independent and shared (for example,
VMSRTL) are placed at the end of the main image. The order in which these
libraries are placed into the image is determined at image activation time.

If the debugger or the traceback facility is required, these images are added
at execution time (even if /DEBUG was selected at link time) by procedure
SYS$IMGSTA. This mapping is described in detail in Chapter 21.






2 Synchronization Techniques 



And now I see with eye serene 
The very pulse of the machine. 
—William Wordsworth, She Was a Phantom of Delight 

One of the most important issues in the design of an operating system is 
synchronization. Especially in a system that is interrupt driven, certain se- 
quences of instructions must be allowed to execute without interruption. 
The VMS operating system uses special IPL values to block certain interrupts 
during the execution of critical code paths. 

Any operating system must also take precautions to insure that shared data 
structures are not being simultaneously modified by several routines or being 
read by one routine while another routine is modifying the structure. The 
VMS executive uses a combination of software techniques and features of the 
VAX hardware to synchronize access to shared data structures. The following 
techniques are described in this chapter: 

• Elevated IPL 

• Serialized access 

• Mutual exclusion semaphores, called mutexes 

• VAX/VMS lock management system services 



2.1 ELEVATED IPL 

The primary purpose of raising IPL is to block interrupts at the selected IPL 
value and all lower values of IPL. For example, by raising IPL to 23, all device 
interrupts are blocked; but the clock, which interrupts at IPL 24, can still 
cause interrupts. The operating system also uses selected IPL values for per- 
forming certain actions or for accessing certain structures. 

The IPL, stored in PSL<20:16>, is altered by writing the desired IPL value 
to the privileged register PR$_IPL with the MTPR instruction. This change 
in IPL is usually accomplished in the operating system with one of two 
macros, SETIPL or DSBINT, whose macro definitions are as follows: 

        .MACRO  SETIPL  IPL=#31
        MTPR    IPL,S^#PR$_IPL
        .ENDM   SETIPL

        .MACRO  DSBINT  IPL=#31,DST=-(SP)
        MFPR    S^#PR$_IPL,DST
        MTPR    IPL,S^#PR$_IPL
        .ENDM   DSBINT




The SETIPL macro changes IPL to the specified value. If no argument is pres- 
ent, IPL is elevated to 31. The DSBINT macro first saves the current IPL 
before elevating IPL to the specified value. If no alternate destination is speci- 
fied, the old IPL is saved on the stack. The default IPL value is 31. 

The DSBINT macro is usually used when a later sequence of code must 
restore the IPL to the saved value (with the ENBINT macro). This macro is 
especially useful when the caller's IPL level is unknown. The SETIPL macro 
is used when the IPL will later be explicitly lowered with another SETIPL or 
simply as a result of executing an REI instruction. That is, the value of the 
saved IPL is not important to the routine that is using the SETIPL macro. 

The ENBINT macro is the counterpart of the DSBINT macro. It restores 
the IPL to the value found in the designated source argument. 

        .MACRO  ENBINT  SRC=(SP)+
        MTPR    SRC,S^#PR$_IPL
        .ENDM   ENBINT

Occasionally it is necessary to save an IPL value (to be restored later by the 
ENBINT macro) without changing the current IPL. 

        .MACRO  SAVIPL  DST=-(SP)
        MFPR    S^#PR$_IPL,DST
        .ENDM   SAVIPL

The successful use of IPL as a synchronization tool requires that IPL be raised 
(not lowered) to the appropriate synchronization level. Lowering IPL defeats 
any attempt at synchronization and runs the risk of a reserved operand fault 
when an REI instruction is later executed. (An REI instruction that attempts 
to elevate IPL causes a reserved operand fault.) 
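
A typical critical section pairs the two macros around the instructions that
touch the shared structure. In sketch form (the queue and the register
contents here are hypothetical):

        DSBINT  #IPL$_SYNCH     ; save the caller's IPL on the stack
                                ;  and raise IPL to 7
        INSQUE  (R2),QUEUE      ; manipulate a system-wide structure
        ENBINT                  ; restore the saved IPL from the stack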



2.1.1 Use of IPL$_SYNCH

IPL 7 (IPL$_SYNCH) is used as the interrupt level for the software timer
routines, those routines that service timer queue entries and handle quantum 
expiration. IPL 7 is also used as the level to which IPL must be raised for any 
routine to access a system-wide data structure. By raising IPL to 7, all other 
routines that might access the same system-wide data structure are blocked 
from execution until IPL is lowered. 

While the processor is executing at IPL 7, certain system-wide events such 
as scheduling and I/O postprocessing are blocked. However, other, more im- 
portant operations, such as hardware interrupt servicing and device driver 
fork processing, can continue. Thus, the amount of time that the operating 
system spends at IPL 7 does not affect more important activities such as 
servicing I/O requests. The fact that I/O processing, including fork process- 
ing, is more important than other system operations (such as satisfying a page 
fault) reflects one of the underlying philosophies of the executive, to keep 
external devices as busy as possible. 




2.1.2 Other IPL Levels Used for Synchronization 

Table 2-1 lists several IPL levels that are used for synchronization purposes 
by the system. Some of these levels are used to control access to shared data 
structures. Other levels are used to prevent certain events, such as a clock 
interrupt or process deletion, from occurring while a block of instructions is 
executed. 

2.1.2.1 IPL 31. Routines in the operating system will raise IPL to 31 to block all 
interrupts for a short period of time (usually less than ten instructions once 
the system is initialized). 

• Device drivers use IPL 31 just before they call IOC$WFIxxCH to prevent a 
powerfail interrupt from occurring. 

• The entire bootstrap sequence operates at IPL 31 in order to put the system
into a known state before allowing interrupts to occur.

• Because the error logger routines can be called from anywhere in the exec-
utive, including fault service routines that execute at IPL 31 (such as ma-
chine check handlers), allocation of an error log buffer can only execute at
IPL 31. A corollary of this requirement is that the ERRFMT process must
execute at IPL 31 when it is altering data structures that describe the state
of the error log buffer. (As Chapter 8 describes, the copy is done at two IPL
levels. The error log buffer status flags and message counts are modified at
IPL 31. Then IPL is lowered to zero; the contents of the error log buffer are
copied to the ERRFMT process P0 space, and the messages are formatted
and written to the error log file.)

Table 2-1: Common IPL Values Used by the Executive for Synchronization

Name              IPL (decimal)  Meaning
IPL$_POWER        31             Disable all interrupts
IPL$_HWCLK        24             Block clock and device interrupts
UCB$B_DIPL (1)    20-23          Block interrupts from specific devices
UCB$B_FIPL (1)    8-11           Device driver fork levels
IPL$_SYNCH        7              Synchronize access to any system-wide
                                  data structures
IPL$_QUEUEAST     6              Device driver fork IPL that allows drivers
                                  to elevate IPL to 7
IPL$_ASTDEL       2              Block delivery of ASTs (prevent process
                                  deletion)

(1) These symbols are offsets into a device unit control block.

2.1.2.2 IPL 24. When IPL is raised to 24, the level at which the hardware
clock interrupts, clock interrupts are blocked. The software timer interrupt
service routine uses this IPL level when it is comparing two quadword system
time values. An IPL value of 24 prevents the system time from being updated
while it is being compared with some other time value. (This precaution is
required because the VAX architecture does not contain a CMPQ, compare
quadword, instruction.)
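
The protected comparison looks something like the following sketch (the due
time cell and the labels are hypothetical):

        DSBINT  #IPL$_HWCLK     ; raise IPL to 24, saving the old IPL
        CMPL    DUETIME+4,G^EXE$GQ_SYSTIME+4
                                ; compare the high-order longwords
        BLSSU   10$             ; due time has already passed
        BGTRU   20$             ; due time is still in the future
        CMPL    DUETIME,G^EXE$GQ_SYSTIME
                                ; high-order halves are equal; compare
                                ;  the low-order longwords
        BLEQU   10$             ; due time has passed (or is now)
20$:    ENBINT                  ; not yet due; restore the saved IPL
        BRB     NOTYET          ;  (NOTYET is a hypothetical label)
10$:    ENBINT                  ; due; restore the saved IPL and
        BRB     EXPIRED         ;  service the request (EXPIRED is
                                ;  also hypothetical)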

2.1.2.3 Device IPL. Device drivers will raise IPL to the level at which the associated 
device will interrupt in order to prevent other devices from generating inter- 
rupts while device registers are being read or written. This step usually pre- 
cedes the further elevation of IPL to 31 just described. 

2.1.2.4 Fork IPL. Fork IPL (a value specific to each device type) is used by the execu- 
tive to synchronize access to each unit control block. These blocks are 
accessed by device drivers and by procedure-based code, such as the comple- 
tion path of the $QIO system service and the Cancel I/O system service. 

Device drivers also use their associated fork IPL as a synchronization level 
when accessing data structures that control shared resources, such as multi- 
unit controllers or datapath registers or map registers. In order for this syn- 
chronization to work properly, all devices sharing a given resource must use 
the same fork IPL. 

The use of fork IPL to synchronize access to unit control blocks works the 
same way that elevating IPL to 7 does. That is, one piece of code elevates IPL 
to the specified fork IPL (found at offset UCB$B_FIPL) and blocks all other 
potential accesses to the UCB. Fork processing, the technique whereby de- 
vice drivers lower IPL below device interrupt level in a manner consistent 
with the interrupt nesting scheme, also uses the serialization technique de- 
scribed in Section 2.2. 



2.1.3 IPL$_QUEUEAST 

Perhaps the example that best illustrates the synchronization rules followed 
by the operating system is the use of IPL 6 (IPL$_QUEUEAST) by device 
drivers. There are instances where device drivers find it necessary to interact 
with the scheduler. For example, the terminal driver may notify a requesting 
process about unsolicited input or a CTRL/Y through an AST (see Chapter 7). 
The mailbox driver also can notify requesting processes about reads or writes 
to a mailbox. 

The enqueuing of an AST must occur at IPL$_ SYNCH to synchronize ac- 
cess to the scheduler's database. As already pointed out, IPL must be elevated 
(not lowered) to 7 to achieve this synchronization. The fork level at IPL 6 
allows device drivers that execute at IPL 8 through IPL 11 to make these 
scheduling requests. Specifically, the driver calls a routine called 
COM$DELATTNAST that creates an IPL 6 fork request. That is, a fork block 
is placed into the IPL 6 fork queue and an IPL 6 software interrupt requested 




(software interrupts are described in Chapter 6). When that interrupt occurs, 
the fork block is used as an AST control block and passed to SCH$QAST, 
which will elevate IPL to 7, in keeping with the rule that IPL must be raised 
to IPL$_ SYNCH to preserve proper interrupt nesting. 

An obvious question in response to the above description is why the IPL 7 
fork interrupt cannot be used to achieve the same result. The answer is that if 
the IPL 7 software interrupt were not being used for another purpose, that 
would be a perfectly acceptable solution. However, the software timer service 
routine is entered as a result of the IPL 7 software interrupt. So this synchro- 
nization technique uses the first free IPL below 7, the IPL 6 software inter- 
rupt called IPL$_QUEUEAST. 

IPL 6 is used in a second instance by device drivers that interact with the 
scheduler. As described in the next chapter, nonpaged pool cannot be deallo- 
cated from code executing in response to an interrupt above IPL 7, because 
nonpaged pool is a system-wide resource whose availability must be reported 
to the scheduler. Routine COM$DRVDEALMEM creates an IPL 6 fork proc- 
ess that allows the deallocation to take place in response to an IPL 6 software 
interrupt, allowing the scheduler to properly synchronize its database ac- 
cesses. The actual pool manipulation takes place at IPL 11 to synchronize
with the allocation routine. 



2.1.4 IPL 2

IPL 2 is the level at which the software interrupt associated with AST deliv- 
ery occurs. When system service procedures raise IPL to 2, they are blocking 
the delivery of all ASTs, but particularly the special kernel AST that causes 
process deletion. In other words, if a process is executing at IPL 2 (or above), 
that process cannot be deleted. 

This technique is used in several places to prevent process deletion be- 
tween the time that some system resource (such as system dynamic memory) 
is allocated and the time that ownership of that resource is recorded (such as 
the insertion of a data structure into a list). For example, the $QIO system 
service executes at IPL 2 from the time that an I/O request packet is allocated 
from nonpaged dynamic memory until that packet is queued to a unit control 
block or placed into the I/O postprocessing queue. 

The memory management subsystem uses IPL 2 in order to inhibit the 
special kernel mode AST that is queued on I/O completion. This inhibition is 
necessary at times when the memory management subsystem has some 
knowledge of the process's working set and yet the execution of the I/O com- 
pletion AST could cause a modification to the working set, thereby invalidat- 
ing that knowledge.

IPL 2 also has significance for an entirely different reason: it is the highest 
IPL level at which page faults are permitted. If a page fault occurs at IPL above 




2, a fatal bugcheck (BUG$_PGFIPLHI) is issued. If there is any possibility 
that a page fault can occur, because either the code that is executing or the 
data that it references is pageable, then that code cannot execute above IPL 2. 
The converse of this constraint is that any code that executes above IPL 2, 
and all data referenced by such code, must be locked into memory in some 
way. Chapter 31 shows some of the techniques that the VMS executive uses 
to dynamically lock code or data into memory so that IPL can be elevated 
above IPL 2. 



2.2 SERIALIZED ACCESS 

The software interrupt capability described in Chapter 6 provides no method 
for counting the number of requested software interrupts. The VMS operating 
system uses a combination of software interrupts and doubly linked lists to 
cause several requests for the same data structure or procedure to be serial- 
ized. The most important example of this serialization in the operating sys- 
tem is the use of fork processes by device drivers. The I/O postprocessing 
software interrupt is a second example of serialized access. 



2.2.1 Fork Processing 

Fork processing is the technique that allows device drivers to lower IPL in a 
manner consistent with the interrupt nesting scheme defined by the VAX 
architecture. When a device driver receives control in response to a device 
interrupt, it performs whatever steps are necessary to service the interrupt at 
device IPL. For example, any device registers whose contents would be de- 
stroyed by another interrupt must be read before the driver dismisses the 
device interrupt. 

Usually, there is some processing that can be deferred. For DMA devices, 
an interrupt signifies either completion of the operation or an error. The code 
that distinguishes these two cases and performs error processing is usually 
lengthy, and to execute at device IPL for extended periods of time would slow 
down the system. For non-DMA devices that do not interrupt at too rapid a 
rate, interrupt processing can be deferred in favor of other, more important 
device servicing. 

In either case, the driver signals that it wishes to delay further processing 
until the IPL in the system drops below a predetermined value, the fork IPL 
associated with this driver. This signaling is accomplished by calling a rou- 
tine in the executive that saves the address of the next instruction in the 
driver in a data structure called a fork block (see Figure 6-2). The fork block is 
then inserted at the end of the fork queue for that IPL value. A software 
interrupt at the appropriate IPL is requested. 
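
In a driver, this entire transaction is typically hidden behind a single
macro, sketched below. R5 is assumed to contain the address of the unit
control block, which begins with a fork block:

        ; Executing at device IPL, after capturing any volatile
        ;  device registers:
        IOFORK                  ; expands to JSB G^EXE$IOFORK, which
                                ;  saves R3, R4, and the return PC in
                                ;  the fork block, inserts the block at
                                ;  the tail of the fork queue for the
                                ;  driver's fork IPL, and requests a
                                ;  software interrupt at that IPL
        ; Execution resumes at the instruction following the macro
        ;  later, at fork IPL, when the fork dispatcher removes the
        ;  fork block from its queue.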




2.2.2 I/O Postprocessing 

Upon completion of an I/O request, there is a series of cleanup steps that 
must be performed. The event flag associated with the request must be set. A 
special kernel AST that will perform final cleanup in the context of the proc- 
ess that initially issued the $QIO call must be queued to the process. This 
cleanup must be completed for one I/O request before another is handled. In 
other words, I/O postprocessing must be serialized. 

This serialization is accomplished by performing the postprocessing opera- 
tion as a software interrupt service routine (at IPL 4). When a request is recog- 
nized as being complete, the I/O request packet is placed at the tail of the I/O 
postprocessing queue (at global listhead IOC$GL_PSBL), and a software in- 
terrupt at IPL 4 is requested. 

When the device driver recognizes that an I/O request has completed (ei- 
ther successfully or unsuccessfully), it calls routine IOC$REQCOM, which 
makes the IPL 4 software interrupt request at fork IPL (IPL 8 to IPL 1 1 ), so the 
postprocessing interrupt is deferred until the IPL drops below 4. 

Some I/O requests do not require driver action. When the Queue I/O Re- 
quest ($QIO) system service or device-specific FDT routines detect that the 
request can be completed without driver intervention, or if they detect an 
error, they call one of the routines EXE$FINISHIO or EXE$FINISHIOC. 
These two routines execute at IPL 2 and so the requested software interrupt 
is taken immediately. ACPs also place I/O request packets directly into the 
postprocessing queue and request the IPL 4 software interrupt. 
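
The enqueue-and-request sequence itself is brief. In sketch form (R5 is
assumed to contain the address of the completed I/O request packet):

        INSQUE  (R5),@G^IOC$GL_PSBL     ; insert the packet at the tail
                                        ;  of the postprocessing queue
        SOFTINT #IPL$_IOPOST            ; request the IPL 4 software
                                        ;  interrupt; the macro expands
                                        ;  to an MTPR to PR$_SIRR, the
                                        ;  software interrupt request
                                        ;  register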

2.3 MUTUAL EXCLUSION SEMAPHORES (MUTEXES) 

The synchronization techniques described so far all execute at elevated IPL, 
thus blocking certain operations, such as a rescheduling request, from taking 
place. There are some shared data structures that must be protected from 
multiple access where elevated IPL is an unacceptable technique for synchro- 
nization, because the processor would have to remain at an elevated IPL for 
an unspecified length of time. For example, two processes cannot allocate 
paged pool at the same time. In addition, when a system is low on paged pool 
or when the pool is highly fragmented, a search for an unused block that is 
the correct size can be very time consuming. 

A second situation where elevated IPL is not acceptable as a synchroniza- 
tion tool occurs when the data structure that is being protected is paged. The 
memory management subsystem does not allow page faults to occur when 
IPL is above 2. Thus, any pageable data structure cannot be protected by 
elevating IPL to 7. For these two reasons, another mechanism is required for 
controlling access to shared data structures. 

The VMS operating system uses mutexes, mutual exclusion semaphores, 
for this purpose. Mutexes are essentially flags that indicate whether a given 
data structure is being examined or modified by one of a group of cooperating 



processes. The implementation allows either multiple readers or one writer
of a data structure. Table 2-2 lists those data structures in the system that
are protected by mutexes.

Table 2-2: List of Data Structures Protected by Mutexes

                                    Global Address      Value in
Data Structure                      of Mutex (1)        Version 3.0
System Logical Name Table           LOG$AL_MUTEX        80002750
Group Logical Name Table                                80002754
I/O Database (2)                    IOC$GL_MUTEX        800028C0
Common Event Block List             EXE$GL_CEBMTX       800028C4
Paged Dynamic Memory                EXE$GL_PGDYNMTX     800028C8
Global Section Descriptor List      EXE$GL_GSDMTX       800028CC
Shared Memory Global Section        EXE$GL_SHMGSMTX     800028D0
 Descriptor Table
Shared Memory Mailbox               EXE$GL_SHMMBMTX     800028D4
 Descriptor Table
Enqueue/Dequeue Tables              EXE$GL_ENQMTX       800028D8
 (Not Currently Used)
Known File Entry Table              EXE$GL_KFIMTX       800028DC
Line Printer Unit Control           UCB$L_LP_MUTEX      (3)
 Block (3)

(1) When a process is placed into an MWAIT state waiting for a mutex, the
address of the mutex is placed into the PCB$L_EFWM field of the PCB. The
symbolic contents of PCB$L_EFWM will probably remain the same from release to
release. The numeric contents are almost certain to change with each major
release of the operating system.

(2) This mutex is used by the Assign Channel and Allocate Device system
services when searching through the linked list of device data blocks for a
device with a given name. It is also used by the Mount Utility and the file
system ACPs to lock the file system data structures.

(3) The mutex associated with each line printer unit does not have a fixed
address like the other mutexes. Its value depends on where the UCB for that
unit is allocated.

The mutex itself consists of a single longword that contains the number of
owners of the mutex (MTX$W_OWNCNT) in the low-order word and status
flags (MTX$W_STS) in the high-order word (see Figure 2-1). The owner count
begins at -1 so that a mutex with a zero in the low-order word has one
owner. The only flag currently implemented indicates whether a write opera-
tion is either in progress or pending for this mutex (MTX$V_WRT).
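To make the layout concrete, here is a minimal C sketch of a mutex. The
type and helper names are invented for illustration; they are not the actual
VMS definitions.

    #include <stdint.h>

    /* A mutex is one longword: owner count in the low-order word,
       status flags in the high-order word. */
    typedef struct {
        int16_t  owncnt;   /* MTX$W_OWNCNT: starts at -1, so 0 means one owner */
        uint16_t sts;      /* MTX$W_STS: bit 0 of this word is MTX$V_WRT */
    } mutex_t;

    #define MTX_V_WRT 0x0001   /* write in progress or pending */

    /* Number of current owners implied by the stored count. */
    static int mutex_owners(const mutex_t *m)
    {
        return m->owncnt + 1;
    }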

 31                        17 16 15                            0
+----------------------------+--+------------------------------+
|           Status           |W |       Ownership Count        |
+----------------------------+--+------------------------------+
                              |
                              +-- W = Write-in-Progress or
                                  Write-Pending Flag

Figure 2-1
Format of Mutual Exclusion Semaphore (Mutex)


2.3.1 Locking a Mutex for Read Access

When a process wishes to gain read access to a data structure that is protected
by a mutex, it passes the address of that mutex to a routine called
SCH$LOCKR. If there is no write operation either in progress or pending, the
owner count of this mutex (MTX$W_OWNCNT) is incremented, the count 
of mutexes owned by this process (stored at offset PCB$W_MTXCNT in the 
software PCB) is also incremented, and control is passed back to the caller, 
unless this is the only mutex owned by this process (mutex count equals 
one). 

If the mutex count for this process (PCB$W_MTXCNT) is one, indicating 
that the process owns no other mutexes, the current and base priorities are 
stored in the PCB at offsets PCB$B_PRISAV and PCB$B_PRIBSAV. In addi- 
tion, if the process is not a real-time process (priority is less than 16), the 
software priority (both current priority and base priority) of the process is 
elevated to 16 to insure that the mutex will be owned for as little time as 
possible. Notice that the check on the number of owned mutexes prevents a 
process that gains ownership of two or more mutexes from receiving a perma- 
nent priority elevation into the real-time range. 

Routine SCH$LOCKR always returns successfully in the sense that, if the 
mutex is currently unavailable, the process is placed into a mutex wait state 
(MWAIT) until the mutex is available for the process. When the process even- 
tually gains ownership of the mutex, control will then be passed to the proc- 
ess. IPL is set to IPL$_ASTDEL (IPL 2) to prevent process deletion while the 
mutex is owned by this process. This preventative step must be taken be- 
cause the Delete Process system service has no internal checks on whether 
the process being deleted owns any mutexes. If the deletion succeeded, the 
locked data structure would be lost to the system. 
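The following C sketch paraphrases this read-lock flow. The actual routine
is VAX MACRO inside the executive; all names here are invented, and the
MWAIT mechanics are reduced to a helper that returns when the mutex may
be retried.

    #include <stdint.h>

    typedef struct { int16_t owncnt; uint16_t sts; } mutex_t;
    typedef struct {
        uint16_t mtxcnt;           /* PCB$W_MTXCNT: mutexes owned */
        uint8_t  pri, base_pri;    /* current and base software priorities */
        uint8_t  prisav, pribsav;  /* PCB$B_PRISAV and PCB$B_PRIBSAV */
    } pcb_t;

    #define MTX_V_WRT  0x0001
    #define IPL_ASTDEL 2

    extern void wait_for_mutex(pcb_t *, mutex_t *); /* MWAIT until released */
    extern void set_ipl(int);

    void sch_lockr(mutex_t *m, pcb_t *pcb)
    {
        while (m->sts & MTX_V_WRT)     /* write in progress or pending? */
            wait_for_mutex(pcb, m);    /* wait, then retry from the top */
        m->owncnt++;                   /* one more reader */
        if (++pcb->mtxcnt == 1) {      /* first mutex owned by this process */
            pcb->prisav  = pcb->pri;   /* save priorities, then boost a */
            pcb->pribsav = pcb->base_pri;
            if (pcb->pri < 16)         /* non-real-time process to 16 */
                pcb->pri = pcb->base_pri = 16;
        }
        set_ipl(IPL_ASTDEL);           /* IPL 2 blocks deletion while owned */
    }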



2.3.2 Locking a Mutex for Write Access 

A process wishing to gain write access to a protected data structure passes 
the address of the appropriate mutex to a routine called SCH$LOCKW. This 
routine returns control to the caller with the mutex locked for write access 
if the mutex is currently unowned. In addition, both mutex counts 
(MTX$W_OWNCNT and PCB$W_MTXCNT) are incremented, the process 
software priority is possibly altered, and IPL is set to 2. An alternate entry 
point, SCH$LOCKNOWAIT, returns control to the caller with R0<0>
cleared (indicating failure) if the requested mutex is already owned. For the
regular entry point (SCH$LOCKW), if this mutex is owned, the process is 
placed into the mutex wait state (MWAIT). However, the write pending bit is 
set so that future requests for read access will also be denied. In a sense, this 
scheme is placing requests for write access ahead of requests for read access. 
However, all that this check is really doing is preventing a continuous stream 
of read accesses keeping the mutex count (MTX$W_OWNCNT) nonzero. 
When the mutex count goes to -1 (no owners), it is declared available, and 
the highest priority process waiting for the mutex is the one that will get first 
access to the mutex, independent of whether it is requesting a read access or 
a write access. 



2.3.3 Mutex Wait State 

When a process is placed into a mutex wait state, its stack is set up so that 
the saved PC is the entry point of either the read-lock routine or the write- 
lock routine. (In the latter case, the PC points to a branch to SCH$LOCKW.) 
The PSL is adjusted so that the saved IPL is 2. The address of the mutex that 
is being requested is placed into the software PCB at offset PCB$L_EFWM. 
(Because the process is not waiting on an event flag, this field is available for 
other purposes.) Table 2-2 and part of Table 10-2 list the contents of the 
PCB$L_EFWM field for each MWAIT state. 



2.3.4 Unlocking a Mutex 

A process relinquishes ownership of a mutex by passing the address of the 
mutex to be released to a routine called SCH$UNLOCK. This routine decre- 
ments the number of mutexes owned by this process recorded in its PCB. If 
this process does not own any more mutexes (PCB$W_MTXCNT contains 
zero), the saved base and current priorities (in fields PCB$B_PRIBSAV and 
PCB$B_PRISAV) are established as the process's new base and current priori- 
ties. If there are computable (COM) processes with higher priorities than this 
process's new current priority, a rescheduling interrupt is requested. 

SCH$UNLOCK also decrements the number of owners of this mutex 
(MTX$W_OWNCNT). If the owner count of this mutex does not go to -1,
there are other outstanding owners of this mutex, so control is simply passed 
back to the caller. 

If the count does become -1, this value indicates that this mutex is cur- 
rently unowned. If the write-in-progress bit is clear, this indicates that there 
are no processes waiting on this mutex, and control is passed back to the 
caller. (A waiting writer would set this bit. A potential reader is only blocked 
if there is a current or pending writer.) If there are other processes waiting for 
this mutex, they are all made computable by scanning the MWAIT queue for
all processes whose PCB$L_EFWM field matches the address of the unlocked
mutex. 

If the priority of any of the processes removed from the mutex wait state is 
greater than the priority of the current process, a rescheduling pass will occur 
that will select the highest priority process for execution. As noted above, 
there is no difference between processes waiting for read access and processes 
waiting for write access. The criterion that determines who will get first 
chance at ownership of the mutex is software priority. 
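A companion C sketch of the unlock flow, under the same invented names
as the read-lock sketch; the rescheduling and MWAIT-scan details are
reduced to comments.

    #include <stdint.h>

    typedef struct { int16_t owncnt; uint16_t sts; } mutex_t;
    typedef struct {
        uint16_t mtxcnt;
        uint8_t  pri, base_pri, prisav, pribsav;
    } pcb_t;

    #define MTX_V_WRT 0x0001

    void sch_unlock(mutex_t *m, pcb_t *pcb)
    {
        if (--pcb->mtxcnt == 0) {            /* process owns no more mutexes: */
            pcb->base_pri = pcb->pribsav;    /* restore the saved base and */
            pcb->pri      = pcb->prisav;     /* current priorities */
            /* request a rescheduling interrupt if a higher-priority
               computable (COM) process now exists */
        }
        if (--m->owncnt == -1                /* mutex now unowned ... */
            && (m->sts & MTX_V_WRT)) {       /* ... and a writer was waiting */
            m->sts &= ~MTX_V_WRT;
            /* scan the MWAIT queue and make computable every process
               whose PCB$L_EFWM matches the address of this mutex */
        }
    }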



2.3.5 Resource Wait State 

The routines that place a process into a resource wait state and make re-
sources available share some code with the mutex locking and unlocking
routines and will be briefly described here. Details of resources that only one
process can access at a time can be found in Chapter 10.

When a process requires a resource that is unavailable, it is placed into a 
resource wait state, which shares the same scheduling state number and wait 
queue header with the mutex wait state. The resource number is stored in 
the PCB (at offset PCB$L_EFWM) instead of the mutex address (see Table 
10-2). In addition, a bit corresponding to this resource is set in a resource wait 
mask (found at global location SCH$GL_RESMASK). The saved PC and PSL 
are determined by the caller of routine SCH$RWAIT. SCH$RWAIT saves the 
process's context, inserts the PCB into the MWAIT queue, and causes a new 
process to be selected for execution. 

When a resource becomes available, the appropriate bit in the resource wait 
mask is cleared. If the bit was previously set, there are other processes wait- 
ing on this resource. The same routine that frees processes waiting on a 
mutex is entered at this point. Offset PCB$L_EFWM now contains a resource 
number instead of a mutex address, but this difference is a conceptual differ- 
ence that is invisible to the code that is actually executing. 

The MWAIT state queue is scanned for all processes whose PCB$L_EFWM 
field matches the number of the recently freed resource. All such processes 
are made computable. If the new priority of any of these processes is larger 
than the priority of the currently executing process, a rescheduling interrupt 
is requested. In any event, all processes waiting for the now available re- 
source will compete for that resource based on software priority. 



2.4 VAX/VMS LOCK MANAGEMENT SYSTEM SERVICES 

So far, the methods of synchronization described in this chapter have re- 
quired elevated IPL or execution in kernel access mode, or both. Though both 
are powerful and effective in synchronizing access to system data structures,
there are other system applications in which elevated IPL or kernel mode
access are not really necessary or desirable (for example, RMS). 

The VAX/VMS lock management system services (or the lock manager) 
provide synchronization tools that can be invoked from all access modes. 
The use of the VAX/VMS lock management system services is described fully 
in the VAX/VMS System Services Reference Manual; the internals of the 
lock manager are described in Chapter 13 of this book. 






3 Dynamic Memory Allocation 



In this bright little package, now isn't it odd? You've a dime's 
worth of something known only to God! 
— Edgar A. Guest, The Package of Seeds 

Some of the data structures described in this book are created when the sys- 
tem is initialized; many others are created when they are needed and de- 
stroyed when their useful life is finished. In order to store the data structures, 
virtual memory needs to be allocated and deallocated in an orderly fashion. In 
addition, different data structures have differing memory requirements; the 
VAX/VMS operating system maintains three separate areas for dynamic allo- 
cation of storage. 

• The process allocation region holds data structures that are required only 
by a single process. 

• Paged dynamic memory contains data structures that are used by several 
processes but are not required to be permanently memory resident. 

• The nonpaged pool contains data structures and code that are used by the 
portions of the VMS operating system that are not procedure based, such as 
interrupt service routines and device drivers. These portions of the operat- 
ing system can use only system virtual address space and usually execute 
at elevated IPL, requiring nonpaged pool space rather than paged pool 
space. 

The nonpaged pool also contains data structures and code that are 
shared by several processes and must not be paged. This requirement is 
usually dictated by the constraint that page faults are not permitted 
above IPL 2. 



3.1 ALLOCATION STRATEGY AND IMPLEMENTATION 

Each of the three pool areas has the same structure, so common allocation 
and deallocation routines can be used. The first two longwords of each un- 
used block in one of the pool areas are used to describe the block. As illus- 
trated in Figure 3-1, the first longword in a block contains the virtual address 
of the next unused block in the list. The second longword contains the size in 
bytes of the unused block. Each successive unused block is found at a higher 
virtual address. Thus, each pool area forms a singly linked memory ordered 
list. 
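In C terms, the two-longword header at the start of each unused block might
be pictured as follows (an illustrative sketch, not a VMS definition):

    #include <stdint.h>

    /* Header of every unused block in a pool area. Blocks are linked
       in ascending address order; a zero link terminates the list. */
    struct free_block {
        struct free_block *next; /* address of next unused block, or 0 */
        uint32_t           size; /* size in bytes of this unused block */
    };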






Used 



Size of this Block 



First Unused 
Block 



Used 



Size of this Block 



Next Unused 
Block 



Used 



Beginning of Pool Area 
(Filled in when 
system is initialized) 



Address of First 
Free Block 

(Modified by allocation 
and deallocation routines) 



Size of this Block 




(Zero in pointer 
signifies end of list) 



Figure 3-1 

Layout of Unused Areas in Dynamic Memory Pools 



3.1.1 Allocation of Dynamic Memory 

When the allocation routine is called, it searches from the beginning of the 
list until it encounters the first unused block large enough to satisfy the call. 
If the fit is exact, the allocation routine simply adjusts the previous pointer to 
point to the next free block. If the fit is not exact, it subtracts the allocated 
size from the original size of the block, puts the new size into the remainder 
of the block, and adjusts the previous pointer to point to the remainder of the 
block. The two possible allocation situations (exact and inexact fit) are illus- 
trated in Figure 3-1. 
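The search itself is a classic first-fit walk. The C sketch below uses the
free_block header shown earlier and relies, as the real routine does, on the
listhead's dummy zero size (see Table 3-1) so that the head can be treated
like any other block; the routine name is invented.

    #include <stddef.h>
    #include <stdint.h>

    struct free_block {
        struct free_block *next;
        uint32_t           size;
    };

    /* First-fit allocation from a memory-ordered free list; 'head' is
       the two-longword listhead. Returns 0 if no block is large enough. */
    void *pool_allocate(struct free_block *head, uint32_t request)
    {
        struct free_block *prev = head;
        struct free_block *blk;

        for (blk = head->next; blk != NULL; prev = blk, blk = blk->next) {
            if (blk->size == request) {      /* exact fit: unlink the block */
                prev->next = blk->next;
                return blk;
            }
            if (blk->size > request) {       /* inexact fit: allocate from */
                struct free_block *rem =     /* the low-address end and */
                    (struct free_block *)    /* describe the remainder */
                    ((char *)blk + request);
                rem->next  = blk->next;
                rem->size  = blk->size - request;
                prev->next = rem;
                return blk;
            }
        }
        return NULL;
    }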




3.1.2 Example of Allocation of Dynamic Memory 

The first part of Figure 3-2 (Initial Condition) shows a section of paged pool, 
including the pointers MMG$GL_PAGEDYN, which points to the beginning 
of paged pool, and EXE$GL_PAGED, which points to the first available block 
of paged pool. In this example, allocated blocks of memory are indicated only 
as the total number of bytes being used, ignoring either the number or size of 
the individual data structures within each block. 

Following the allocation of a block of 60 bytes (an exact fit), the structure of 
the paged pool looks like the second part of Figure 3-2 (60 Bytes Allocated). 



[Figure 3-2 shows three views of a section of paged pool: Initial Condition,
60 Bytes Allocated, and 48 Bytes Allocated. MMG$GL_PAGEDYN points to
the beginning of paged pool and EXE$GL_PAGED to the first available
block. In the initial condition the pool contains, in ascending address
order: 176 bytes in use, 32 bytes unused, 96 bytes in use, 60 bytes unused,
68 bytes in use, 48 bytes unused, and 208 bytes in use. In the second view
the 60-byte unused block has been allocated exactly, leaving 224 bytes
(96+60+68) in use. In the third view 48 bytes have been allocated from the
low-address end of the 60-byte unused block, leaving 144 bytes (96+48) in
use followed by a 12-byte (60-48) unused remainder.]

Figure 3-2
Examples of Allocation from Dynamic Memory




Note that the discrete portions of 96 bytes and 68 bytes in use and the 60
bytes that were allocated are now combined to show simply a 224-byte block
of paged pool in use.

The third part of Figure 3-2 (48 Bytes Allocated) shows the case where a 
48-byte block was allocated from the paged pool structure shown in the first 
part of the figure. The 48 bytes were taken from the first unused block large 
enough to contain it. (Note that allocation is done from the low address end 
of the unused block.) Because this allocation was not an exact fit, an unused 
block, 12 bytes long, remains. 



3.1.3 Deallocation of Dynamic Memory 

When a block is deallocated, it must be placed back into the list in its proper 
place, according to its address. This replacement is accomplished by follow- 
ing the unused area pointers until an address larger than the address of the 
block to be deallocated is encountered. If the deallocated block is adjacent to 
another unused block, the two blocks are merged into a single unused area. 
This merging, or agglomeration, can occur at the end of the preceding unused 
block or at the beginning of the following block (or both). Three sample 
deallocation situations, two of which illustrate merging, are shown in Figure 
3-3 and are described in Section 3.1.4. Because merging occurs automatically 
as a part of deallocation, there is no need for any externally triggered cleanup 
routines. 
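A C sketch of this deallocation, again in terms of the free_block header and
with an invented name; the merge tests simply ask whether two blocks are
contiguous in memory.

    #include <stddef.h>
    #include <stdint.h>

    struct free_block {
        struct free_block *next;
        uint32_t           size;
    };

    /* Return a block to the memory-ordered free list, merging it with
       an adjacent preceding and/or following unused block. */
    void pool_deallocate(struct free_block *head, void *addr, uint32_t size)
    {
        struct free_block *blk  = (struct free_block *)addr;
        struct free_block *prev = head;

        /* find the first unused block at a higher address */
        while (prev->next != NULL && (void *)prev->next < addr)
            prev = prev->next;

        blk->size = size;                    /* link the block into place */
        blk->next = prev->next;
        prev->next = blk;

        if ((char *)blk + blk->size == (char *)blk->next) {
            blk->size += blk->next->size;    /* merge with following block */
            blk->next  = blk->next->next;
        }
        if (prev != head && (char *)prev + prev->size == (char *)blk) {
            prev->size += blk->size;         /* merge with preceding block */
            prev->next  = blk->next;
        }
    }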

The deallocation routine assumes that the word at offset 8 from the begin- 
ning of a block contains the size of the block being deallocated. All of the 
dynamically allocated blocks used by the executive adhere to this conven- 
tion. The type code located in the byte at offset 10 is also used by the deallo- 
cation routine to distinguish between structures allocated from local mem- 
ory (type code is positive) and structures allocated from shared memory (type 
code is negative). This size word and the type code stored in the adjacent byte 
at offset 10 allow SDA to correctly interpret the portions of nonpaged pool 
that are currently in use. 

3.1.4 Example of Deallocation of Dynamic Memory 

The first part of Figure 3-3 (Initial Condition) shows the structure of an area 
of paged pool containing logical name blocks for three logical names: ADAM, 
GREGORY, and ROSAMUND. These three logical name blocks are 
bracketed by two unused portions of paged pool, one 64 bytes long, the other 
176 bytes long. 

[Figure 3-3 shows four views of an area of paged pool: Initial Condition,
ADAM Deleted, GREGORY Deleted, and ROSAMUND Deleted. In the
initial condition, 64 unused bytes are followed by logical name blocks for
ADAM (48 bytes), GREGORY (80 bytes), and ROSAMUND (64 bytes), and
then by 176 unused bytes. Deleting ADAM produces a single 112-byte
(64+48) unused block; deleting GREGORY leaves a separate 80-byte unused
block in the middle of the list; deleting ROSAMUND produces a single
240-byte (64+176) unused block.]

Figure 3-3
Examples of Deallocation of Dynamic Memory

If the logical name ADAM were deleted, the structure of the pool would be
altered to look like the structure shown in the second part of Figure 3-3
(ADAM Deleted). Because the logical name block was adjacent to the high
address end of an unused block, the blocks are merged. The size of the
deallocated block is added to the size of the unused block.

If the logical name GREGORY were deleted, the structure of the pool 
would be altered to look like the structure shown in the third part of Figure 
3-3 (GREGORY Deleted). The pointer in the unused block of 64 bytes is 
altered to point to the deallocated block; a new pointer and size longword are 
created within the deallocated block. 

The fourth part of Figure 3-3 (ROSAMUND Deleted) shows the case where 
the logical name ROSAMUND was deleted. In this case the deallocated 
block is adjacent to the low address end of an unused block, so the blocks are 
merged. The pointer to the next unused block that was previously in the 
adjacent block is moved to the beginning of the newly deallocated block. The 
following longword is loaded with the size of the merged block (240 bytes). 



3.1.5 Synchronization 

Some method is required to synchronize access to the pool areas to avoid 
several processes or executive routines searching one of these lists simulta- 
neously. 

There is no locking mechanism currently used for either the process alloca- 
tion region or any of the lists (such as the process logical name table or the 
private mounted volume list) found there. However, the allocation routine 
executes in kernel mode at IPL 2, effectively blocking any other mainline or 
AST code from executing and perhaps attempting a simultaneous allocation 
from the process allocation region. 

Paged pool is protected by a mutex. Before a block of memory is either 
allocated or deallocated from the paged pool, this mutex, found at global label 
EXE$GL_PGDYNMTX, is locked for write access. 

Elevated IPL is used to control allocation of nonpaged pool. The IPL that is
used is stored in the longword immediately preceding the pointer to the first
unused block in the nonpaged pool (see Table 3-1). The allocation routine for
nonpaged pool raises IPL to the value found here before proceeding. While the
system is running, this longword usually contains an 11. The value of 11 was
chosen because device drivers running at fork level frequently allocate dy-
namic storage, and IPL 11 represents the highest fork IPL currently used in
the operating system. (An implication of this synchronization IPL value is
that device drivers must not allocate nonpaged pool while executing at de-
vice IPL in response to a device interrupt.)

During initialization, the contents of this longword are set to 31 because 
the rest of the code in the system initialization routines (module INIT) exe- 
cutes at IPL 31 to block all interrupts. INIT is described in detail in Chapter 
25. Changing the contents of this longword avoids lowering IPL as a side
effect of allocating space from nonpaged pool. The value of this longword is
reset to 11 after INIT has finished its allocation but before INIT passes con-
trol to the scheduler.

Table 3-1: Global Listheads for Each Pool Area

Pool Area        Global Address     Size          Use of These Fields         Static or
                 of Pointer                                                   Dynamic (1)

Nonpaged Pool    EXE$GL_NONPAGED    3 longwords   Synchronization IPL for     Dynamic (2)
                                                  nonpaged pool allocation.
                                                  Address of next (first)     Dynamic
                                                  free block.
                                                  Dummy size (of zero) for    Static
                                                  listhead to speed up
                                                  allocation routine.
                 MMG$GL_NPAGEDYN    longword      Address of beginning of     Static
                                                  nonpaged pool area.

Nonpaged Pool    IOC$GL_LRPSPLIT    longword      Address of beginning of     Static
Lookaside                                         large request packet area.
Lists            EXE$GL_SPLITADR    longword      Address of beginning of     Static
                                                  I/O request packet area.
                 IOC$GL_SRPSPLIT    longword      Address of beginning of     Static
                                                  small request packet area.

Paged Pool       EXE$GL_PAGED       2 longwords   Address of next (first)     Dynamic
                                                  free block.
                                                  Dummy size (of zero) for    Static
                                                  listhead to speed up
                                                  allocation routine.
                 MMG$GL_PAGEDYN     longword      Address of beginning of     Static
                                                  paged pool area.

Process          CTL$GQ_ALLOCREG    2 longwords   Address of next (first)     Dynamic
Allocation                                        free block.
Region                                            Dummy size (of zero) for    Static
                                                  listhead to speed up
                                                  allocation routine.
                                                  There is no global pointer
                                                  that locates the beginning
                                                  of the process allocation
                                                  region.

(1) Static pointers are loaded at initialization time. The contents of these locations
do not change during the life of the system. Dynamic pointers generally change
their contents each time a block is allocated from or deallocated to a pool area.

(2) The synchronization IPL is changed to 31 by INIT while it is executing but is
reset to 11 and remains at that value for the life of the system.

IPL is also a consideration for deallocation of nonpaged pool, but for a dif- 
ferent reason. Although nonpaged pool can be allocated from fork processes 
running at IPL levels up to IPL 11, it cannot be deallocated as a result of an 
interrupt above IPL 7. The reason for limiting the IPL is that nonpaged pool is 
a system-wide resource that processes might be waiting for. The deallocation 
routine notifies the scheduler that a resource is available. The scheduler in 
turn checks whether any processes are waiting for the nonpaged pool re- 
source. All of this scheduling must take place at IPL$_SYNCH, and the in-
terrupt nesting scheme requires that IPL never be lowered below the IPL 
value at which the current interrupt occurred. This rule dictates that all pool 
be deallocated at IPL 7 or lower. 

There may be instances where code executing above IPL 7 must deallocate 
nonpaged pool. Routine COM$DRVDEALMEM exists for this purpose. This 
routine takes the block that is to be deallocated, turns it into a fork block (see 
Figure 6-2), and requests an IPL 6 software interrupt. The code that executes 
as the fork process (the saved PC in the fork block) simply issues a JMP 
to EXE$DEANONPAGED to deallocate the block. However, because 
EXE$DEANONPAGED is entered at IPL 6 and not at fork IPL, the synchro- 
nized access to the scheduler's database is preserved. (This technique is simi- 
lar to the one used by device drivers that need to interact with the scheduler 
by declaring ASTs. The attention AST mechanism is briefly described in 
Chapter 2 and discussed in greater detail in Chapter 7.) 



3.1.6 Granularity of Allocation 

The allocation routines for both paged and nonpaged pool round the re- 
quested size up to the next multiple of 16 bytes to impose a granularity on 
both the allocated and unused areas. Because both pool areas are initially 
page aligned, this rounding causes every structure allocated from one of the 
two system-wide pool areas to be at least quadword aligned. 
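The rounding is the usual power-of-two idiom; a one-line C sketch:

    /* Round a requested size up to the 16-byte pool granularity. */
    #define POOL_ROUND(size) (((size) + 15u) & ~15u)

    /* For example, POOL_ROUND(50) yields 64; POOL_ROUND(48) stays 48. */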

There is no granularity imposed on the allocation size for the process allo- 
cation region. However, the two structures allocated from this pool by the 
system (logical name blocks for process logical names and mounted volume 
list entries for private volumes) are both an integral number of quadwords 
long so that any block allocated from the process allocation region is quad- 
word aligned. Also, the smallest possible size of an unallocated block is eight 
bytes. Any user-written privileged program that allocates space from the 
process allocation region should insure that it requests an integral number of 
quadwords to keep this region quadword aligned. 




3.2 PREALLOCATED REQUEST PACKETS 

While most of the structures found in the nonpaged pool are allocated and 
deallocated infrequently, pool is constantly being allocated and deallocated 
for I/O request packets and other system data blocks. To avoid the overhead 
of searching for blocks of free memory of sufficient size to accommodate 
specific request packets, portions of nonpaged pool (called the lookaside lists) 
are dedicated to the allocation and deallocation of I/O request packets (IRPs), 
small request packets (SRPs), and large request packets (LRPs). 

Specifically, at initialization time, a portion of the nonpaged system space 
following the main portion of pool is partitioned into three pieces. One piece 
is reserved for the IRP list, one is for the LRP list, and one is for the SRP list. 
The pieces are then structured into a series of elements. The size of the IRP 
list element is determined by the symbol IRP$C_LENGTH. The sizes of the 
elements in the LRP and SRP lists are contained in the cells IOC$GL_LRPSIZE 
and IOC$GL_SRPSIZE, which are defined in module SYSCOMMON. INIT 
determines the values for LRPSIZE and SRPSIZE from SYSBOOT parameters. 
In each of the lists, the elements are entered into a doubly linked list (with
the INSQUE instruction) so that each list is a doubly linked list containing
fixed-size list elements.



3.2.1 Allocation from One of the Lookaside Lists 

When a routine (such as the $QIO system service) needs an I/O request 
packet, it simply issues a REMQUE from the beginning of this list (found 
through global label IOC$GL_IRPFL). The SRP and LRP lookaside lists are 
located by the global labels IOC$GL_SRPFL and IOC$GL_LRPFL respec- 
tively. Only if the list is empty (indicated by the V-bit set in the PSW) would 
the more general allocation routine have to be called. Because allocation and 
deallocation from the lookaside list are so much more efficient than the gen- 
eral routines that allow any size block to be allocated or deallocated, a special 
check is built into the general nonpaged pool allocation routine to determine 
whether the requested block can be allocated from one of the lookaside lists. 
The logic of this routine is approximately the following. 

1. The allocation size is rounded up to the next multiple of 16. 

2. If the rounded size is greater than the size of an IRP (IRP$C_LENGTH), an 
attempt is made to allocate a packet from the LRP list. If the rounded size 
is still greater than the size of an LRP, the general allocation routine is 
called to search for the first free block large enough to accommodate the 
request. If the rounded size is less than the smallest request size for which 
an LRP can be allocated (IOC$GL_LRPMIN), the general allocation rou- 
tine is called. 

3. The cell IOC$GL_IRPMIN indicates the smallest request size that can be
allocated an IRP. If the rounded size is less than IOC$GL_IRPMIN, an
attempt is made to allocate a packet from the SRP list. If the rounded size 
is greater than the size of an SRP (IOC$GL_SRPSIZE), the general alloca- 
tion routine is called. 

4. Once the appropriate lookaside list is found, and if the list is not empty, 
the first packet is removed from the list and returned to the caller. 

5. If a lookaside list is empty, an attempt is made to extend the list (see 
Section 3.3.3.2). If the list is extended, the allocation is attempted again. If 
the list cannot be extended, the general allocation routine is called. 
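In outline, the size tests reduce to the following C sketch. The IOC$GL_
cells become plain variables here; all names are illustrative, and the queue
operations and the expansion path are elided.

    #include <stdint.h>

    /* Stand-ins for the executive cells named in the text. */
    extern uint32_t irp_length;  /* IRP$C_LENGTH */
    extern uint32_t irp_min;     /* IOC$GL_IRPMIN */
    extern uint32_t lrp_min;     /* IOC$GL_LRPMIN */
    extern uint32_t lrp_size;    /* IOC$GL_LRPSIZE */
    extern uint32_t srp_size;    /* IOC$GL_SRPSIZE */

    enum pool_list { GENERAL, SRP, IRP, LRP };

    /* Decide which list can satisfy a request of 'size' bytes
       (already rounded up to a multiple of 16). */
    enum pool_list pick_list(uint32_t size)
    {
        if (size > irp_length) {             /* too big for an IRP */
            if (size >= lrp_min && size <= lrp_size)
                return LRP;
            return GENERAL;
        }
        if (size < irp_min) {                /* too small for an IRP */
            if (size <= srp_size)
                return SRP;
            return GENERAL;
        }
        return IRP;
    }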

Note that because allocation is done with a single instruction, there is no 
need for any other synchronization than that provided by the REMQUE in- 
struction; however, IPL is raised to IPL$_SYNCH before determining if the
allocation can be made from one of the lookaside lists or the main portion of 
pool (allocation from the main portion does require synchronization). The 
other concern of the general allocation routines, the block granularity, is also 
irrelevant here because all blocks on the lookaside list are the same size. 



3.2.2 Deallocation to the Lookaside List 

When the routine to deallocate a block of nonpaged pool is called, it first 
checks whether the block was allocated from the main portion of the pool or 
from one of the lookaside lists. The lookaside lists are divided by the follow- 
ing symbols, beginning with the smaller addresses: 

IOC$GL_LRPSPLIT    Boundary between the main part of pool and the LRP list
EXE$GL_SPLITADR    Boundary between the LRP list and the IRP list
IOC$GL_SRPSPLIT    Boundary between the IRP list and the SRP list

These addresses were determined by INIT when the lookaside lists were 
initialized. Figure 3-4 shows the relationship of the lookaside lists to the rest 
of nonpaged pool. 

The deallocation routine determines the list to which the piece of pool is 
being returned by the following steps: 

• The address of the block being deallocated is compared to the contents of
global location IOC$GL_SRPSPLIT. If the address of the block is greater
than IOC$GL_SRPSPLIT, the block came from the SRP list.

• If the address was less than IOC$GL_SRPSPLIT, the address is compared
to EXE$GL_SPLITADR. If the address is greater, the block came from the
IRP list.

• If the address was less than EXE$GL_SPLITADR, the address is compared
to IOC$GL_LRPSPLIT. If the address is greater, the block came from the
LRP list.






IOCSGI LRPBL::»- 



IOC$GL_IRPBL: : • ». 



Rest of 

Nonpaged 

Pool 



-• NEXT 



SIZE 



First 
Unused 

Block 



LRP Lookaside List 



.<* ." .* ," 



/" 



Room for Expansion of LRP List 



: :MMG$GL_NPAGEDYM 



-• : :EXE$GI NONPAGED 



:IOC$GL__LRPSPLIT 

:IOC$GI LRPFL 



\ 



IOCSGI SRPBL: 




:EXE$GL_SPLITADR 

:IOC$GL_IRPFL 



^, 



:IOC$GI SRPSPLIT 

:IOC$GL_SRPFL 



^:,„ 



Figure 3-4 

Preallocated Request Packets 



• If the address was less than IOC$GL_LRPSPLIT, the block came from the
main part of pool.

If the block was originally allocated from one of the lookaside lists, it is
returned there by inserting it at the end of the list with an INSQUE instruc-
tion. The ends of the lookaside lists are indicated by the global labels
IOC$GL_SRPBL, IOC$GL_IRPBL, and IOC$GL_LRPBL. Note that by allo-
cating packets from one end of the list and putting them back at the other
end, a transaction history as long as the list itself is maintained. If the block
was originally allocated from the general pool area, the general deallocation
routine is called. The differences between the lookaside list and the general 
nonpaged pool are summarized in Table 3-2. 
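The comparison cascade reduces to a few lines of C; the boundary cells are
shown as plain variables, and the names are illustrative.

    extern char *lrp_split;   /* IOC$GL_LRPSPLIT */
    extern char *split_adr;   /* EXE$GL_SPLITADR */
    extern char *srp_split;   /* IOC$GL_SRPSPLIT */

    enum pool_list { GENERAL, SRP, IRP, LRP };

    /* Determine the list to which a block being deallocated belongs. */
    enum pool_list region_of(char *block)
    {
        if (block > srp_split) return SRP;
        if (block > split_adr) return IRP;
        if (block > lrp_split) return LRP;
        return GENERAL;       /* main part of nonpaged pool */
    }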

Although allocation from a lookaside list requires no synchronization other
than that provided by the REMQUE instruction, deallocation must
be done at IPL 7 or below, because nonpaged pool is a resource whose avail-
ability must be reported to the scheduler, which will elevate IPL to 7. All
deallocation to nonpaged pool is accomplished through the routines
EXE$DEANONPAGED (which should not be called above IPL 7) and
COM$DRVDEALMEM (which can be called from any IPL).



3.3 USE OF DYNAMIC MEMORY 

Almost all of the data structures that are dynamically configured are placed 
in either the nonpaged or paged pool areas. Only the PFN database, the global 
and system page tables, the system header, and the interrupt stack have sepa- 
rate virtual address space allocated. Most per-process data structures, on the 
other hand, are assigned to dedicated areas of PI space, as defined in the 
module SHELL and illustrated in Figure 1-7 and listed in Table 26-4. One 
per-process data structure, the process header, resides in the area of system 
space called the balance slot area. 



3.3.1 Process Allocation Region 

The process allocation region is currently 46 pages long. Its size is fixed by an 
assembly time parameter in module SHELL. Its protection is set to UREW 
(the page protection codes are described in Table 14-1). That is, it can be 
written from executive and kernel modes and read from any access mode. 
Only the process logical name table and the mounted volume list for private 
volumes are found in the process allocation region. There is enough room in 
the process allocation region for privileged application software to allocate 
reasonably sized process-specific data structures. 



3.3.2 Paged Dynamic Memory 

The following data structures are located in the paged pool area: 

• The group and system logical name tables. 

• Global section descriptors, which are required only when a section is 
mapped or unmapped. 

• Data structures required by the Install Utility to describe known images.
Any image that is installed has a known file entry created to describe it.
Some frequently accessed known images also have their image headers
permanently resident. These data structures are described in more detail
in Chapter 21.

• The mounted volume list for volumes shared among several processes.

Table 3-2: Comparison of Different Pool Areas

Pool Area    Allocation         Type of List   Synchronization  Typical Structures
             Quantum            (1 and 2)      Technique        Allocated Here

Nonpaged     16 bytes           Variable size  Elevated IPL     Buffered I/O buffer
Pool                            (1)                               (GTRU 96 bytes),
                                                                Driver Prolog Table
                                                                  (Driver Structure),
                                                                Job Information Block,
                                                                Network Data Structures,
                                                                Process Control Block,
                                                                Process Quota Block,
                                                                Unit Control Block
                                                                  (Driver Structure)

Lookaside Lists
  SRP        @IOC$GL_SRPSIZE    Fixed size     None required    Buffered I/O buffer
                                blocks (2)                        (LEQU @IOC$GL_IRPMIN bytes),
                                                                Channel Request Block
                                                                  (Driver Structure),
                                                                Device Data Block
                                                                  (Driver Structure),
                                                                File Control Block,
                                                                Interrupt Dispatch Block
                                                                  (Driver Structure),
                                                                Timer Queue Element,
                                                                Window Control Block
  IRP        156 bytes          Fixed size     None required    Buffered I/O buffer
                                blocks (2)                        (GTR @IOC$GL_IRPMIN bytes),
                                                                Common Event Block,
                                                                I/O Request Packet,
                                                                Volume Control Block
  LRP        @IOC$GL_LRPSIZE    Fixed size     None required    DECnet buffer
                                blocks (2)

Paged Pool   16 bytes           Variable size  Mutex            Global Section Descriptors,
                                (1)                             Known File Entries,
                                                                Known File Headers,
                                                                Logical Name Blocks for group
                                                                  and system logical names,
                                                                Mounted Volume List Entry for
                                                                  volumes mounted /SYSTEM
                                                                  or /GROUP

Process      None               Variable size  Access mode      Logical Name Blocks for
Allocation                      (1)                               process logical names,
Region                                                          Mounted Volume List Entry for
                                                                  private volumes (/SHARE OR
                                                                  /NOSHARE)

(1) The general pool areas allow variable sized allocation requests (and contain
variable sized empty areas). The allocation and deallocation routines must search
at least a portion of the empty list. External fragmentation (unused blocks equal
to the allocation quantum) near the beginning of the list can result from this type
of allocation scheme.

(2) The lookaside lists have extremely efficient (single instruction) allocation and
deallocation routines. Because the blocks are fixed size, internal fragmentation
(unused space within individual blocks) can result.

The size of paged dynamic memory is determined by the SYSBOOT parame- 
ter PAGEDYN. Its protection is set to URKW. The pages of paged dynamic 
memory used by RMS for the shared file database have their protection al- 
tered to EW (either read or write access from executive or kernel mode) by 
RMSSHARE, the image that executes as part of STARTUP.COM to initialize 
the shared file database. 



3.3.3 Nonpaged Dynamic Memory 

Nonpaged pool serves several purposes. At initialization time, data structures 
whose size and contents depend on SYSBOOT parameters will be allocated 
from nonpaged pool and initialized. These structures include the PCB vector 
and sequence vector, the swapper's I/O page table, the page file bitmap, modi- 
fied page writer arrays, and the adapter control blocks for all external adapters 
located at bootstrap time. The detailed use of nonpaged pool by the initializa- 
tion routines is described in Chapter 25. 

A second general, somewhat static use of nonpaged pool is to contain de- 
vice driver code and associated data structures for all devices that are either 
located through the autoconfigure phase of SYSGEN or explicitly loaded with 
the SYSGEN commands LOAD or CONNECT. The details of these struc- 
tures are described in the VAX/VMS Guide to Writing a Device Driver. 

3.3.3.1 The Sizes of Nonpaged Dynamic Memory Regions. The sizes of the variable 
nonpaged pool and the lookaside lists are determined by SYSBOOT parame- 
ters. Nonpaged dynamic memory differs from the paged dynamic area (and 
the process allocation area) in that it is potentially extensible during normal 
system operation (see Section 3.3.3.2). For each of the four regions of non- 
paged pool there exist two SYSBOOT parameters, one to specify the initial 
size of the region, and another to specify the maximum size of the region. 
The size in bytes of the variable length region of nonpaged pool is con- 
trolled by the SYSBOOT parameters NPAGEDYN and NPAGEVIR, both of 
which are rounded down to an integral number of pages. During system ini- 
tialization, sufficient contiguous system page table entries (SPTEs) are allo- 
cated for the maximum size of the region (the larger of NPAGEDYN and 
NPAGEVIR). Physical pages of memory are allocated for the initial size of the 
region and are mapped using the first portion of allocated SPTEs. The protec- 
tion of the valid pages is ERKW. The remaining SPTEs are left invalid. SPTEs 
and other memory management data structures are described in Chapter 14. 






Table 3-3: SYSBOOT Parameters Controlling Lookaside List Sizes

List Type    Size of Packet    Initial Count    Maximum Count

IRP          160               IRPCOUNT         IRPCOUNTV
SRP          SRPSIZE           SRPCOUNT         SRPCOUNTV
LRP          LRPSIZE+64        LRPCOUNT         LRPCOUNTV



During system operation, failure to allocate from the variable nonpaged 
pool region will result in an attempted expansion of the region, with physical 
page(s) allocated to fill in the next invalid SPTE(s). The deallocation merge 
strategy described in Section 3.2.2 requires that the newly extended nonpaged 
dynamic area be virtually contiguous with the existing area and that the four 
regions be adjacent. It is because of these restrictions that the maximum 
number of SPTEs are allocated for each region, even if some of them are
initially unused.

The lookaside lists are allocated during system initialization in the same 
manner as the variable length region. Table 3-3 lists the SYSBOOT parame- 
ters for each lookaside list. In each case, the initial count and maximum 
count are maximized. SRPSIZE is rounded up to a 16-byte boundary, and the 
maximum size in bytes of the SRP lookaside list is rounded up to a page 
boundary. The value 64 is added to LRPSIZE and the sum is rounded Up to a 
16-byte boundary, and the maximum size in bytes of the LRP lookaside list 
region is rounded up to a page boundary. 

The parameter LRPSIZE is intended to be the DECnet buffer size, exclusive 
of a 64-byte internal buffer header. (Note that the output of SHOW MEM- 
ORY displays the inclusive packet size.) 
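Expressed as arithmetic, the rounding rules above reduce to the following
C sketch (512-byte VAX pages; the function names are invented):

    #include <stdint.h>

    #define ROUND16(x)    (((x) + 15u) & ~15u)     /* 16-byte granularity */
    #define ROUNDPAGE(x)  (((x) + 511u) & ~511u)   /* 512-byte page */

    /* Element sizes derived from the SYSBOOT parameters. */
    uint32_t srp_element(uint32_t srpsize) { return ROUND16(srpsize); }
    uint32_t lrp_element(uint32_t lrpsize) { return ROUND16(lrpsize + 64u); }

    /* Maximum region size in bytes for a list of 'count' packets. */
    uint32_t list_region(uint32_t element, uint32_t count)
    {
        return ROUNDPAGE(element * count);
    }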

Dynamic nonpaged pool expansion enables automatic system tuning. The 
penalty for setting an inadequate initial allocation size is the increased over- 
head encountered in allocating requests that cause expansion. An additional 
minor physical penalty is that unnecessary PFN database is built for those 
physical pages that are subsequently added to nonpaged pool as a result of 
expansion. The cost is about four percent of the size of the page (18 bytes) per 
added page. The penalty for a maximum allocation that is too large is one 
SPTE for each unused page, or less than one percent. If the maximum size of 
a lookaside list is too small, system performance may be adversely affected 
when the system is prevented from using the lookaside mechanism for pool 
requests. If the maximum size of the variable length region is too small, 
processes may be placed into the MWAIT state, waiting for nonpaged pool to 
become available. 

3.3.3.2 Expansion of Nonpaged Dynamic Pool. When routine EXE$ALONONPAGED 
(in module MEMORYALC) fails to allocate nonpaged pool from any of the
four regions, it attempts to expand nonpaged pool by invoking the routine
EXE$EXTENDPOOL (found in module MEMORYALC). 

EXE$EXTENDPOOL examines each lookaside list in turn. If a list is empty 
and is not at its maximum size, EXE$EXTENDPOOL attempts to allocate a 
page of physical memory. First a check is made to see if a physical page can be 
allocated without reducing the number of physical pages available to the 
system, that is, sufficient pages to accommodate the sum of the maximum 
working set size, the modified list low limit, and the free list low limit. If a 
page can be allocated, EXE$EXTENDPOOL places its page frame number 
(PFN) in the first invalid SPTE for that list, setting the valid bit. The new 
virtual page and any fragment from the previous virtual page are formatted 
into packets of the appropriate size and placed on the list. EXE$EXTENDPOOL 
records the size and address of any fragment left from the new page. 

If EXE$EXTENDPOOL was called due to a failure to allocate space from 
the variable length region, EXE$EXTENDPOOL attempts to expand the re- 
gion by a page and reports that the resource RSN$_NPAGEDYN is available 
for any waiting processes. (See Chapter 10 for more information on schedul- 
ing and event reporting.) 

For proper synchronization of system databases, the resource availability 
report and the allocation of physical memory must not be done from a thread 
of execution running as the result of an interrupt above IPL 7. For this reason, 
EXE$EXTENDPOOL checks to see whether it has been entered in system 
context (that is, on the interrupt stack) as the result of attempted pool alloca- 
tion from a device driver. If the interrupt stack bit in the PSL is set, 
EXE$EXTENDPOOL creates an IPL 6 fork process to expand the lists at some 
later time when IPL drops below 6 and returns an allocation failure status to 
its invoker. 
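In outline, and with every name invented, the expansion decision looks
something like the following C sketch; the fork block construction and the
actual page mapping are reduced to helper declarations.

    #include <stdbool.h>

    extern bool on_interrupt_stack(void);     /* PSL interrupt-stack bit set? */
    extern bool list_needs_page(int list);    /* empty and below its maximum? */
    extern bool page_available(void);         /* beyond the system's minima? */
    extern void map_page_into_list(int list); /* fill next invalid SPTE and */
                                              /* carve the page into packets */
    extern void fork_at_ipl6(void (*)(void));

    static void do_extend(void)
    {
        for (int list = 0; list < 3; list++)  /* SRP, IRP, LRP lists */
            if (list_needs_page(list) && page_available())
                map_page_into_list(list);
        /* also try to extend the variable-length region by a page and
           report resource RSN$_NPAGEDYN available to waiting processes */
    }

    bool extend_pool(void)
    {
        if (on_interrupt_stack()) {  /* entered above IPL 7 (e.g., driver)? */
            fork_at_ipl6(do_extend); /* defer the work to an IPL 6 fork */
            return false;            /* report allocation failure for now */
        }
        do_extend();
        return true;
    }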






PART II/Control Mechanisms 



4 Condition Handling 



"Would you tell me, please, which way I ought to go from here?" 
"That depends a good deal on where you want to get to," said the Cat. 
—Lewis Carroll, Alice's Adventures in Wonderland

One of the design goals of the VAX architecture was a generalized uniform 
condition handling facility for both hardware-detected exceptions and soft- 
ware-generated conditions. In addition to making this facility available to 
users, the VAX/VMS operating system uses many of the features of the condi- 
tion handling facility for its own purposes. 

4.1 OVERVIEW OF THE CONDITION HANDLING FACILITY

The generalized condition handling facility that is included as part of the 
VAX architecture provides users and the system with a powerful tool in han- 
dling exceptional conditions that arise during normal program execution. In 
addition, software-detected conditions (not necessarily indicating an error) 
can be passed to the operating system to allow them to be handled in exactly 
the same manner as hardware-detected exceptions. 

The options that are available to user programs to allow them to use the 
features of the VAX-11 condition handling facility are described in the
VAX/VMS System Services Reference Manual and the VAX-11 Run-Time 
Library Reference Manual. This chapter discusses how the tools described in 
those two manuals actually implement their features. 

4.1.1 Goals of the VAX-11 Condition Handling Facility 

Some of the goals of the VAX-11 condition handling facility reflect goals of 
the VAX-11 procedure calling standard. Other goals reflect the desire to place
an easy-to-use, general purpose mechanism into the operating system so that 
application programs and other layered products such as compilers can use 
this mechanism rather than inventing their own application-specific tools. 
Some of the explicit and implicit goals of the VAX-11 condition handling 
facility are the following. 

1. The condition handling facility should be included in the base machine 
architecture so that it is available as a part of the base machine and not as 
part of some software component. The space reserved for condition han- 
dler addresses in the first longword of the call frame accomplishes this 
goal. 




2. By including the handler specification as a part of the call frame, signal 
handling is an integral part of a procedure, rather than a global facility 
within a process. Including the handler specification as part of the call 
frame contributes to the general goal of modular procedures and allows 
condition handlers to be nested. The nested inner handlers can either serv- 
ice a detected exception or pass it along to some outer handler in the 
calling hierarchy. 

3. Some languages such as BASIC and PL/I have signaling and error handling 
as part of the language specification. These languages can use the general 
mechanism rather than inventing their own procedures. 

4. There should be little or no cost to procedures that do not establish han- 
dlers. Further, procedures that do establish handlers should incur little 
overhead for establishing them, with the expense in time being incurred 
when an error actually occurs. 

5. As far as the user or application programmer is concerned, there should be 
no difference in the appearance of exceptions initially detected by the 
hardware and signals generated by software. 

4.1.2 Features of the VAX-1 1 Condition Handling Facility 

Some of the features of the VAX-11 condition handling facility show how
these goals were attained. Others show the general desire to produce an easy-
to-use but general condition handling mechanism. Features of the VAX-11
condition handling facility include the following. 

1. A condition handler has three options available to it. The handler can fix
the condition (continuing). The handler may not be capable of fixing the 
condition, so it passes the condition on to the next handler in the calling 
hierarchy (resignaling). The handler can alter the flow of control (unwind- 
ing the call stack). 

2. Because condition handlers are themselves procedures, each has its own 
call frame with its own slot for a condition handler address. This condition 
handler address gives handlers the ability to establish their own handlers 
to field errors that they might cause. 

3. The goals related to cost in space and time were realized by using only a 
single longword per procedure activation for handler address storage. 
There is no cost in time for procedures that do not establish handlers. 
Procedures that do establish handlers can do so with a single MOVAx 
instruction. No time is spent looking for condition handlers until a signal 
is actually generated. 

4. The mechanism is designed to work even if a condition handler is written 
in a language that does not produce reentrant code. Thus, if a condition 
handler written in FORTRAN generated an error, that error would not be 
reported to the same handler. 

In fact, the special actions that are taken if multiple signals are active
have a second benefit, namely that no condition handler has to worry
about errors that it generates, because a handler would never be called in 
response to its own signals. 

5. Uniform exception dispatching for hardware and software exceptions is 
accomplished by providing parallel mechanisms for the two forms of ex- 
ceptions. Software-detected exceptions are generated by calling a proce- 
dure in the Run-Time Library. Hardware exceptions transfer control to an 
exception dispatcher in the executive. While the initial execution of these 
two mechanisms differs slightly to reflect their differing initial conditions, 
they eventually execute identical instruction sequences so that the infor- 
mation reported to condition handlers is independent of the initial detec- 
tion mechanism. 

6. By making condition handling a part of a procedure, high level languages
can establish handlers that can examine a given signal and determine 
whether the signal was generated as a part of that language's support li- 
brary. If so, the handler can attempt to fix the error in the manner defined 
by the language. If not, the handler passes the signal along to procedures 
further up the call stack. 



4.2 GENERATION OF EXCEPTIONS 

One way of classifying the conditions that occur in a running VAX/VMS 
system is to separate those conditions that originate in the VAX-11 hardware
from those that are initiated by software. The primary differences between 
the two sets of initial conditions are the initial state of the stack that con- 
tains the exception parameters and the location of the routine that performs 
the dispatching. 



4.2.1 Exceptions That Originate in the Hardware 

When an exception is detected by the hardware, the exception PC and PSL 
(and possible exception-specific parameters) are pushed onto the appropriate 
stack. The appropriate stack is determined by the access mode in which the 
exception occurred and whether the CPU was previously executing on the 
interrupt stack. 

• If the exception occurred in any mode other than kernel and the exception 
was not a CHMU, CHMS, or CHME exception, the kernel stack is used. 
(The interrupt stack is not a consideration in this case because it is impos- 
sible to be on the interrupt stack in other than kernel mode.) 

• If the exception occurred in kernel mode and the kernel stack was in use, 
the kernel stack is also used as the exception stack. 

• If the exception occurred in kernel mode and the interrupt stack was in 
use, the interrupt stack is used as the exception stack. The VMS system
does not expect exceptions to occur when it is operating on the interrupt
stack. If an exception should occur on the interrupt stack, the exception 
dispatcher generates a VMS-requested system crash called a bugcheck (see 
Chapter 8) with a BUG$_INVEXCEPTN code. 

The actual stack (interrupt or kernel) that is used to service an exception 
or interrupt is determined by the low-order two bits in the system control 
block (SCB) entry and whether the interrupt stack is already in use. These 
rules reflect the behavior of the VMS executive, where exceptions are asso- 
ciated with a process and serviced on that process's kernel stack (because 
the low-order two bits in the SCB entry are zero). The interrupt stack is 
only used if it was already in use when the exception occurred. Note that 
two serious aborts (machine check and kernel stack not valid), exceptions 
that also change IPL to 31, are serviced on the interrupt stack by the sys- 
tem. 

After all of the exception information has been pushed onto the stack, 
control is then passed to an exception-specific service routine whose ad- 
dress is stored in the SCB (see Figure 4-1). The use of the first twenty
locations of this table is listed in Table 4-1. Most of the exceptions that
are listed in this table are handled in a uniform way by the operating sys-
tem. The actions that the VMS executive takes in response to these excep-
tions are the subject of most of this chapter. Some of the exceptions, how-
ever, result in special action on the part of the operating system. These
exceptions are discussed in the paragraphs that follow and are indicated in
Table 4-1 by an asterisk.

[Figure 4-1 shows the System Control Block divided into regions: Exceptions
(20 vectors), Processor Faults (12), Software Interrupts (16), Clock and
Console (16), and External Adapter Interrupts. The System Control Block
Base Register (PR$_SCBB) contains the physical address of the page-aligned
SCB; the system virtual address of the SCB is stored in global location
EXE$GL_SCB. The VAX-11/730 and VAX-11/750 system control block is
two pages long; the second page is used for directly vectored UNIBUS device
interrupts. The system control block in a VAX-11/750 with a second
UNIBUS is three pages long. The VAX-11/780 system control block is one
page long.]

Figure 4-1
System Control Block

4.2.1.1 Exceptions That the VMS Executive Treats in a Special Way. Although the 
operating system provides uniform handling of most exceptions generated by 
users, several possible exceptions are used as entry points into privileged 
system procedures. Other exceptions can only be acted upon by the execu- 
tive. It makes no sense for these procedures to pass information about the 
exceptions along to user's programs. 

1. The machine check exception is a processor-specific condition that may or 
may not be recoverable. The machine check exception service routine is 
discussed in Chapter 8. 

2. A kernel-stack-not-valid exception indicates that the kernel stack was not 
valid while the processor was pushing information onto the stack during 
the initiation of an exception or interrupt. The exception service routine 
for this exception generates a fatal bugcheck with a BUG$_KRNLSTAKNV 
code. 

3. The powerfail entry point that appears as one of the first twenty entries in 
the SCB is not an exception. Because a power fluctuation occurs 
asynchronously with respect to the currently executing instruction 
stream, it is actually an interrupt. The fact that powerfail is an interrupt, 
with an associated IPL, implies that the powerfail interrupt can be blocked 
simply by raising IPL to 30 or 31. The steps that the VMS system takes in 
response to power failure as well as on power recovery are described in 
Chapter 27. 

4. The translation-not-valid exception is a signal that a reference was made 
to a virtual address that is not currently mapped to physical memory. The 
page fault handler that is invoked in response to this exception is dis- 
cussed in detail in Chapter 15. 

5. The change-mode-to-kernel and change-mode-to-executive exceptions are 
the mechanisms used by the VMS system services and by RMS to reach a 
more privileged access mode. The dispatching scheme for system services 
and RMS calls is described in Chapter 9. 

The last two exceptions in the list (the two change mode exceptions) are 
paths into the operating system that allow nonprivileged users to reach a 
privileged access mode in a controlled fashion. 



65 



Table 4-1: Use of First 20 Locations in System Control Block

Byte Offset                                Extra       Type (Abort,  Notes on VMS
from SCB Base  Exception Name              Parameters  Fault, Trap)  Dispatching   Comments

  0            Unused
  4            *Machine Check              Note 1      Note 1        Note 1        (See Chapter 8.)
  8            *Kernel Stack Not Valid                 Abort         Note 2        IPL=31, Interrupt Stack
 12            *Powerfail                              Interrupt     Note 3        IPL=30 (See Chapter 27.)
 16            Reserved/Privileged                     Fault
                 Instruction
 20            Customer Reserved                       Fault                       XFC Instruction
                 Instruction
 24            Reserved Operand                        Abort/Fault
 28            Reserved Addressing Mode                Fault
 32            Access Violation            2           Fault
 36            *Translation Not Valid      2           Fault         Note 4        (See Chapter 14.)
 40            Trace Pending                           Fault         Note 5
 44            BPT Instruction                         Fault         Note 5
 48            Compatibility Mode          1           Abort/Fault
 52            Arithmetic                  1           Fault/Trap                  VMS modifies code
                                                                                   (See Table 4-3.)
 56            Unused
 60            Unused
 64            *CHMK                       1           Trap          Note 6        Uses Kernel Stack
                                                                                   (See Chapter 9.)
 68            *CHME                       1           Trap          Note 6        Uses Executive Stack
                                                                                   (See Chapter 9.)
 72            CHMS                        1           Trap                        Uses Supervisor Stack
 76            CHMU                        1           Trap                        Uses User Stack

*These exceptions result in special action on the part of the operating system.

(1) The machine check exception indicates a processor-detected internal error.
    Machine checks in executive and kernel mode cause bugchecks. Machine
    checks in supervisor and user mode are reported through the normal
    exception dispatch method.

(2) The exception service routine for the kernel-stack-not-valid abort issues
    a bugcheck.

(3) Powerfail causes an interrupt that passes control to the powerfail handler.

(4) The translation-not-valid fault is the entry path into the paging facility
    in VMS.

(5) If executive debugging (XDELTA) is selected at SYSBOOT time, the exception
    vectors for BPT and trace pending are altered to point into XDELTA fault
    handlers (see Chapter 25).

(6) The change-mode-to-kernel and change-mode-to-executive traps are the entry
    paths into system service and RMS procedures.








4.2.1.2 Other Hardware Exceptions. The rest of the exceptions detected by hardware 
are handled uniformly by their exception service routines. These exceptions 
are all reported to condition handlers established by the user or by the sys- 
tem, rather than resulting in special system action such as occurs following a 
change-mode-to-kernel exception or a translation-not-valid fault (page fault). 

When a hardware-detected exception occurs, the PSL and PC at the time of
the exception are pushed onto the stack. The usual stack that is used is the
kernel stack, but the CHMx exceptions use the stack of the destination mode.
For example, a CHMS exception pushes the PC and PSL of the exception onto
the supervisor stack. Note that a CHMx instruction issued from an inner
access mode in an attempt to reach a less privileged (outer) access mode will
not have the desired effect. The mode indicated by the instruction is mini-
mized with the current access mode to determine the actual access mode
that will be used (numerically, the new mode is the minimum of the operand
mode and the current mode, where kernel = 0 and user = 3, so the result is
never a less privileged mode). For example, a CHMS instruction issued from
kernel mode will generate an exception through the correct SCB vector (the
one for CHMS), but the final access mode will still be kernel. In other
words, as illustrated in Figure 1-4, the CHMx instructions can only reach
equal or more privileged access modes.

The PC that is pushed depends on the nature of the exception, that is, 
whether the exception is a fault, a trap, or an abort. 

• Exceptions that are faults (see Table 4-1) cause the PC of the faulting in- 
struction to be pushed onto the stack. When faults are dismissed with an 
REI instruction, the faulting instruction will execute again. 

• Exceptions that are traps (see Table 4-1) push the PC of the next instruc- 
tion onto the destination stack. Instructions that cause traps do not 
reexecute when the exception is dismissed with an REI instruction. 

• A third class of exception, an abort, causes a PC in the middle of the in-
struction to be pushed onto the stack. Aborts are not restartable. Some
aborts also raise IPL to 31, blocking all other activity on the system. IPL is
usually not affected when exceptions occur. Independence from IPL is one
of the features that distinguishes exceptions from interrupts. Exceptions
that are aborts include kernel-stack-not-valid, some machine check codes,
and some reserved operand exceptions.

For all exceptions that will eventually be reported to condition handlers, 
the hardware has pushed a PC/PSL pair onto the destination stack. In addi- 
tion, from zero to two exception-specific parameters are pushed onto the 
destination stack (see Table 4-1). Finally, the hardware passes control to 
the exception service routine whose address VMS placed into the SCB 
when the system was initialized. 

4.2.1.3 Initial Action of Exception Service Routines. These exception service rou- 
tines all perform approximately the same action. The exception name (of the 







form SS$_exception-name) and the total number of exception parameters
(from the exception name to the saved PSL inclusive) are pushed onto the
stack so that the destination stack now contains a list, called the signal array,
that resembles a VAX-11 argument list used by the CALLx instructions (see
Figure 4-2). The exceptions that the operating system handles in this uniform
way, including their names and total number of signal array elements, are
listed in Table 4-2.
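In outline, the software contribution amounts to two more pushes on
top of what the hardware left on the stack. The following fragment is
a sketch (not the actual EXCEPTION source) for the access violation
case, whose two hardware parameters make N equal to 5:

            ; The stack already holds, from the top down: the reason
            ; mask, the inaccessible virtual address, the exception PC,
            ; and the exception PSL, all pushed by hardware.
            PUSHL   #SS$_ACCVIO     ; exception name becomes Signal(1)
            PUSHL   #5              ; argument count N becomes Signal(0)
            ; The stack now resembles a CALLx argument list: the signal
            ; array that will be passed to condition handlers.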

After the VMS system has built this array, control is passed to a general 
exception dispatcher that must locate any condition handlers that have been 
established in the access mode of the exception. The search method and the 
list of information passed to condition handlers is described in Section 4.3 
below. 

All hardware exceptions (except for CHME, CHMS, and CHMU) are ini- 
tially reported on the kernel stack (assuming the processor is not already on 
the interrupt stack). In addition, the hardware exception reporting mecha- 
nism assumes that the kernel stack is valid. The decision to use the kernel 
stack was made to avoid the case of attempting to report an exception on, for 
example, the user stack, only to find that the user stack is corrupted in some 
way (invalid or otherwise inaccessible), resulting in another exception. If a 
kernel-stack-not-valid exception is generated while reporting an exception, 
the operating system causes a fatal bugcheck to occur. 

However, the exception must eventually be reported back to the access 
mode in which the exception occurred. Before the dispatcher begins its 
search, it creates space on the stack of the mode in which the exception 
occurred. The exception parameter lists are then copied to that stack, where 
they will become the argument list that is passed to condition handlers. 



[Figure 4-2  Signal Array Built by Hardware and Exception Routines]

        N                                   \
        SS$_exception-name                  / Pushed by software
        From 0 to 2 Exception-Specific      \
          Parameters (Table 4-1)            |
        Exception PC                        | Pushed by hardware
        Exception PSL                       /

N is the number of longwords from SS$_exception-name to the exception
PSL. It ranges from 3 to 5. Arguments are pushed onto the kernel stack
except for CHMS and CHMU exceptions, where the supervisor or user stack
is used.



Table 4-2: Exceptions That Use the Dispatcher in Module EXCEPTION

                               Name in        Notes on VMS      Size of  Extra Parameters in
                               Signal         Dispatching       Signal   Signal Array
Exception Name                 Array          (Section 4.2.1.4) Array    (Note 1)

Access Violation               SS$_ACCVIO     Item 1            5        Signal(2) = Reason Mask
                                                                         Signal(3) = Inaccessible
                                                                           Virtual Address
Arithmetic Exception           (See Table     Item 2            3        Note 2
                                 4-3.)
AST Delivery Stack Fault       SS$_ASTFLT     Item 3c           7        Signal(2) = SP Value at Fault
  (Software exception)                                                   Signal(3) = AST Parameter of
                                                                           failed AST (Note 3)
                                                                         Signal(4) = PC at AST
                                                                           delivery interrupt
                                                                         Signal(5) = PSL at AST
                                                                           delivery interrupt
                                                                         Signal(6) = PC to which AST
                                                                           would have been delivered
                                                                         Signal(7) = PSL at which AST
                                                                           would have been delivered
BPT Instruction                SS$_BREAK                        3
Change Mode to Supervisor      SS$_CMODSUPR   Item 4            4        Signal(2) = Change mode code
Change Mode to User            SS$_CMODUSER   Item 4            4        Signal(2) = Change mode code
Compatibility Mode             SS$_COMPAT     Item 4            4        Signal(2) = Compatibility
                                                                           exception code
Debug Signal                   SS$_DEBUG      Item 3            3
  (Software exception)
Machine Check                  SS$_MCHECK                       3        Note 4
Customer Reserved Instruction  SS$_OPCCUS                       3
Reserved/Privileged            SS$_OPCDEC     Item 5            3
  Instruction
Page Fault Read Error          SS$_PAGRDERR   Item 3b           5        Signal(2) = Reason Mask
  (Software exception)                                                   Signal(3) = Inaccessible
                                                                           Virtual Address
Reserved Addressing Mode       SS$_RADRMOD                      3
Reserved Operand               SS$_ROPRAND                      3
System Service Failure         SS$_SSFAIL     Item 3a           4        Signal(2) = System service
  (Software exception)                                                     final status
Trace Pending                  SS$_TBIT                         3

(1) Additional parameters in the signal array are represented in the following
    way:

        Signal(0) = N     Number of additional longwords in signal array
        Signal(1)         Exception name
        Signal(2)         First additional parameter
        Signal(3)         Second additional parameter
          . . .
        Signal(N-1)       Exception PC
        Signal(N)         Exception PSL

(2) The arithmetic exception has no extra parameters, despite the fact that the
    hardware pushes an exception code onto the kernel stack. VMS modifies this
    hardware code into an exception-specific exception name (see Table 4-3):

        Signal(1) = 8 * code + SS$_ARTRES

(3) The AST delivery code exchanges the interrupt PC/PSL pair and the PC/PSL to
    which the AST would have been delivered.

(4) Machine check exceptions that are reported to a process do not have any
    extra parameters in the signal array. The machine check parameters have
    been examined, written to the error log, and discarded by the machine check
    handler (see Chapter 8).




4.2.1.4 More Special Cases in Exception Dispatching. Although the procedure de- 
scribed above is a reasonable approximation to the operation of the exception 
service routines in the operating system, there are detailed differences that 
occur in the dispatching of several exceptions that deserve special mention. 
These special cases are listed here. 

1. User Stack Overflow is detected by the hardware as an access violation at 
the low address end of PI space. The access violation fault handler tests 
whether the inaccessible virtual address is at the low end of PI space. If it 
is, the stack is expanded and the exception dismissed. User and system 
condition handlers would only be notified about such an exception if the 
stack expansion were unsuccessful. 

2. There are ten possible arithmetic exceptions that can occur. They are dis- 
tinguished in the hardware by different exception parameters. However, 
the exception service routine does not simply push a generic exception 
name onto the stack, resulting in a four-parameter signal array. Rather, the 
exception parameter is used by the exception service routine to fashion a 
unique exception name for each of the possible arithmetic exceptions. The 
exception parameters and their associated signal names are listed in Table 
4-3. 

3. There are three exceptions listed in Table 4-2 that are detected by software 
rather than by hardware. However, these conditions are not generated by 
LIB$SIGNAL. Rather, they are detected by the executive, and control is 
passed to the same routines that are used for dispatching hardware- 
detected exceptions. The conditions are dispatched through the executive, 
because they are typically detected in kernel mode but must be reported 
back to some other access mode. The code to accomplish this access mode 

. switch is contained in EXCEPTION. LIB$SIGNAL has no corresponding 
function. The three exceptions that fall into this category are system serv- 
ice failure exceptions, page fault read errors, and insufficient stack space 
while attempting to deliver an AST. 

• The SS$_SSFAIL exception is reported when a process has enabled sys- 
tem service failure exceptions and a system service returns unsuccess- 
fully with a status of either STS$K_ERROR or STS$K_SEVERE. 

• The SS$_PAGRDERR exception is reported when a process incurs a 
page fault for a page on which a read error occurred in response to a 
previous page fault. 

• The SS$_ASTFLT exception is reported when an inaccessible stack is 
detected while attempting to deliver an AST to a process. 

Table 4-3: Signal Names for Arithmetic Exceptions

                              Code Pushed    Resulting Exception
Exception Type                by Hardware    Reported by VMS       Notes

Traps
  Integer Overflow                 1         SS$_INTOVF            1
  Integer Divide by Zero           2         SS$_INTDIV
  Floating Overflow                3         SS$_FLTOVF            3
  Floating/Decimal
    Divide by Zero                 4         SS$_FLTDIV            3
  Floating Underflow               5         SS$_FLTUND            2,3
  Decimal Overflow                 6         SS$_DECOVF            1
  Subscript Range                  7         SS$_SUBRNG

Faults
  Floating Overflow                8         SS$_FLTOVF_F          3
  Floating Divide by Zero          9         SS$_FLTDIV_F          3
  Floating Underflow              10         SS$_FLTUND_F          3

(1) Integer overflow enable and decimal overflow enable bits in the PSW can be
    altered either directly or through the procedure entry mask.

(2) The floating underflow enable bit in the PSW can only be altered directly.
    There is no corresponding bit in the procedure entry mask.

(3) On the VAX-11/730 and VAX-11/750, these three floating point exceptions
    are faults. On the VAX-11/780 earlier than microcode revision (rev) level
    7, they are traps. The rev level 7 ECO changes them to faults.

A fourth software-detected exception is listed in Table 4-2 although it
does not have a global entry point in module EXCEPTION. The signal
SS$_DEBUG is generated by either the DCL or MCR command language
interpreter in response to a DEBUG command while an image exists in an
interrupted state. The DEBUG command processor pushes the PC and PSL
of the interrupted image, the exception name (SS$_DEBUG), and the size
of the signal array (3) onto the supervisor stack and jumps to
EXE$REFLECT, a global entry address in module EXCEPTION.
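In outline (a sketch based on the description above, not the actual CLI
source; saved_PC and saved_PSL stand for the PC and PSL of the
interrupted image):

            PUSHL   saved_PSL       ; PSL of the interrupted image
            PUSHL   saved_PC        ; PC of the interrupted image
            PUSHL   #SS$_DEBUG      ; exception name
            PUSHL   #3              ; size of the signal array
            JMP     G^EXE$REFLECT   ; enter the common exception dispatcher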

The reason that a CLI uses this mechanism for the DEBUG signal rather 
than simply calling LIB$SIGNAL is that the DEBUG command is issued 
while in supervisor mode but the exception has to be reported back to user 
mode. Reporting information back to user mode involves moving the excep- 
tion parameters from one stack to another (a function that does not exist 
in LIB$SIGNAL but does exist in EXCEPTION), because most hardware- 
detected exceptions are reported on the kernel stack. 
4. The exception dispatching for the CHMS and CHMU exceptions and for
compatibility mode exceptions can be short-circuited by use of the De-
clare Change Mode or Compatibility Mode Handler system service. When
this system service is executed, one of three longword locations in the P1
pointer page (see Appendix A) is loaded with the address of the handler
passed as a parameter to the system service.









When the dispatcher for the change-mode-to-supervisor or change-
mode-to-user exception finds nonzero contents in the associated longword
in P1 space, it transfers control to the routine whose address is stored in
that location with the exception stack (supervisor or user) in exactly the
same state it was in following the exception. That is, the change mode
code is on the top of the stack, and the exception PC and exception PSL
occupy the next two longwords.

The dispatcher for compatibility mode exceptions transfers control to
the user-declared compatibility mode handler (if one was declared) with
the user stack in the same state it was before the compatibility mode
exception occurred. That is, no parameters are passed to the compatibility
mode handler on the user stack. The compatibility mode code, the excep-
tion PC and PSL, and the contents of R0 through R6 are saved in the first
ten longwords of the compatibility mode context page in P1 space at global
location CTL$AL_CMCNTX (see Appendix A).
5. The reserved instruction fault is generated whenever an unrecognized op- 
code is detected by the instruction decoder. The same exception is gener- 
ated when a privileged instruction is executed from other than kernel 
mode. 

VMS uses this fault as a path into the operating system crash code called 
the bugcheck mechanism. Opcode FF, followed by FE or FD, tells the re- 
served instruction exception service routine that the exception is actually 
a bugcheck. Control is passed to the bugcheck routine that is described in 
Chapter 8. 



4.2.2 Exceptions Detected by Software 

One of the goals of the design of the VAX architecture was to have a common 
condition handling facility for both hardware-detected and software-detected 
conditions. The dispatching for conditions that are initially detected by the 
hardware (and for four special software-detected exceptions) is performed by 
the routines in the executive module EXCEPTION. The Run-Time Library 
procedure called LIB$SIGNAL provides a similar capability to any user of a 
VAX/VMS system. 

4.2.2.1 Passing Status from a Procedure. There are usually two methods available
for a procedure to indicate to its caller whether it completed successfully.
One method is to indicate a return status in R0. The other is the signaling
mechanism. The signaling mechanism involves a call to the VAX-11 Run-
Time Library procedure LIB$SIGNAL to initiate a sequence of events exactly
like those that occur in response to a hardware-detected exception. One of





the choices that must be made when designing a modular procedure is the 
method for reporting exceptional conditions back to the caller. 

There are two reasons why signaling may be chosen over completion sta-
tus. In some procedures, such as the mathematics procedures in the Run-
Time Library, R0 is already used for another purpose, namely the return of a
function value, and is therefore unavailable for error return status. In this
case, the procedure must use the signaling mechanism to indicate excep-
tional conditions, such as an attempt to take the square root of a negative
number.

The second common use of signaling occurs in an application that is using
an indeterminate number of procedure calls to perform some action, such as
a recursive procedure that parses a command line, where the use of a return
status is often cumbersome and difficult to code. In this case, the VAX-11
signaling mechanism provides a graceful way not only to indicate that an
error has occurred but also to return control (through SYS$UNWIND) to a
known alternate return point in the calling hierarchy.
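For example, a procedure might report a condition with one additional
parameter as follows (a sketch; the condition value and the parameter
in R3 are illustrative):

            PUSHL   R3                      ; additional parameter for handlers
            PUSHL   #SS$_ROPRAND            ; condition value (signal name)
            CALLS   #2, G^LIB$SIGNAL        ; may return if a handler continues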

4.2.2.2 Initial Operation of LIB$SIGNAL. When the procedure that detects an error 
wishes to signal it, the procedure calls LIB$SIGNAL with the name of the 
exception and whatever additional parameters it wishes to pass to the condi- 
tion handlers that have been established by the user and by the system. The 
state of the stack following a call to LIB$SIGNAL is pictured in Figure 4-3. 
Before LIB$SIGNAL begins its search for condition handlers, it removes the 
call frame (and possibly the argument list) from the stack. Removing the call 
frame causes the stack to appear almost exactly the same to LIB$SIGNAL as 
it does to EXCEPTION following a hardware exception (see Figure 4-3). After 
building the exception argument list, LIB$SIGNAL uses the routines in EX- 
CEPTION to search for condition handlers. The only difference between this 
procedure and the code contained in the executive is that no stack switch is 
required here. The search for condition handlers takes place on the stack of 
the caller of LIB$SIGNAL. 



4.3 UNIFORM EXCEPTION DISPATCHING 

Once information concerning the exception has been pushed onto the stack, 
the differences between hardware and software exceptions are no longer im- 
portant. In the following discussion, the operation of exception dispatching 
will be discussed in general terms and explicit mention of EXCEPTION or 
LIB$SIGNAL will only be made where they depart from each other in their 
operation. 

Before the search for a condition handler begins, the exception dispatcher 
must build a second data structure on the stack that will be used to report the 



exception. The address of this structure, called the mechanism array,
along with the address of the table containing the exception arguments,
will be the two arguments that are passed to any condition handlers that
are called by the dispatcher (see Figure 4-4).

[Figure 4-3  Removal of Call Frame by LIB$SIGNAL. The figure contrasts
the state of the stack immediately after the CALLS to LIB$SIGNAL with
its state after LIB$SIGNAL has removed the call frame. Before removal,
the call frame (a condition handler address of 0, the register save
mask, the saved PSW, AP, FP, and PC, and 0 to 3 stack alignment bytes)
sits on top of the argument list: M, the 32-bit status code (signal
name), and any additional arguments passed to LIB$SIGNAL or LIB$STOP.
(If CALLG is used instead of CALLS, the argument list is copied from
elsewhere to the signal array; the rest of the call frame is discarded
in the same fashion.) After removal, the stack holds the signal/stop
code (1 = LIB$SIGNAL, 2 = LIB$STOP), N, the status code, the additional
arguments, the PC of the instruction following the CALLx, and the PSL
that existed before the CALLx; the mechanism array will be built on top
of this signal array. In removing the frame, the saved PSW becomes the
low 16 bits of the PSL in the signal array; the saved AP and FP are
restored; the saved PC goes into the signal array; and the argument
list is shifted up 8 bytes to make room for the PC/PSL pair so that
hardware and software signal arrays look the same. M is the size of the
argument list and N is the size of the signal array (N = M + 2).
Because the call frame is discarded before handlers are called,
LIB$SIGNAL exits with an REI and not a RET.]

[Figure 4-4  Signal and Mechanism Arrays. The argument list passed to a
condition handler contains the addresses of the signal and mechanism
arrays; these two longwords are used and modified by the handler search
procedure. The mechanism array contains five longwords: an argument
count of 4, the FP of the establisher frame, the depth argument, and
the saved R0 and R1. Because the VAX-11 calling standard dictates that
R0 and R1 are not saved across calls, they must be preserved in some
other way; condition handlers can pass status back to mainline code by
modifying the saved R0 (and R1). The signal array contains an argument
count N (the number of longwords in the signal array, N >= 3), the
exception or signal name, the additional exception parameters pushed by
hardware or the additional arguments passed to LIB$SIGNAL or LIB$STOP,
the exception PC (or the PC following the call to LIB$SIGNAL or
LIB$STOP), and the exception PSL. For an exception dispatched through
the hardware dispatcher, the parameters are pushed initially onto the
kernel stack (except for CHMS and CHMU) by hardware and copied to the
exception stack by software; the exception name and argument count are
added by software before handlers are called. For an exception
generated by a call to LIB$SIGNAL or LIB$STOP, the argument list is
passed by the call and the PC and PSL are added before handlers are
called (see Figure 4-3).]



4.3.1 Establishing a Condition Handler 

The VMS operating system provides two different methods for establishing 
condition handlers. 









• One method uses the call stack associated with each access mode. Each 
call frame includes a longword to contain the address of a condition han- 
dler associated with that frame. 

• The second method uses software exception vectors, set aside in the con-
trol region (P1 space) for each of the four access modes. Vectored handlers
do not possess the modular properties associated with call frame handlers
and are intended primarily for debuggers and performance monitors.

Call frame handlers are established by placing the address of the handler in
the first longword of the currently active call frame. Thus, in assembly lan-
guage, call frame handlers can be established with a single instruction:

        MOVAB   new-handler, (FP)

Because the frame pointer is generally not available to high level language 
programmers, the Run-Time Library procedure LIB$ESTABLISH can be 
called in the following way to accomplish the same result: 

old-handler = LIB$ESTABLISH (new-handler) 

Condition handlers are removed by clearing the first longword of the current
call frame, as in the following assembly language instruction:

        CLRL    (FP)

The Run-Time Library procedure LIB$REVERT removes the condition 
handler established by LIB$ESTABLISH. 
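A condition handler itself is a function procedure whose first argument
is the address of the signal array and whose second is the address of
the mechanism array. The following skeleton is a sketch (the condition
tested, SS$_INTOVF, and the handler's action are purely illustrative):

    HANDLER:
            .WORD   ^M<R2>                  ; entry mask; R2 is preserved
            MOVL    4(AP), R2               ; R2 -> signal array
            CMPL    4(R2), #SS$_INTOVF      ; Signal(1) holds the condition name
            BNEQ    10$                     ; not ours: resignal
            MOVL    #SS$_CONTINUE, R0       ; ours: dismiss the exception
            RET
    10$:    MOVL    #SS$_RESIGNAL, R0       ; continue the search for a handler
            RET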

Exception vector handlers are established and removed with the Set Excep-
tion Vector system service, which simply loads the address of the specified
handler into the specified exception vector, located in the pointer page in P1
space.
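For example, a primary vector handler for user mode might be declared
as sketched below (VH is a hypothetical handler; the arguments follow
the service's interface of vector number, handler address, access mode,
and an optional address of a longword to receive the previous vector
contents):

            PUSHL   #0                  ; prvhnd: do not return old contents
            PUSHL   #PSL$C_USER         ; acmode: user mode vector
            PUSHAB  VH                  ; addres: handler VH (hypothetical)
            PUSHL   #0                  ; vector: 0 selects the primary vector
            CALLS   #4, G^SYS$SETEXV    ; load the vector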



4.3.2 The Search for a Condition Handler 

At this point in the dispatch sequence, the signal and mechanism arrays have 
been set up on the stack of the access mode that the exception will be re- 
ported to. The establisher frame argument in the mechanism array (see Fig- 
ure 4-4) will be used by the search procedure to indicate how far along the 
search has gone. The depth argument in the mechanism array not only serves 
as useful information to condition handlers that wish to unwind but also 
allows the search procedure to distinguish call frame handlers (nonnegative 
depth) from exception vector handlers (negative depth). 

4.3.2.1 Primary and Secondary Exception Vectors. The search for a condition handler 
begins with the primary exception vector of the access mode in which 
the exception occurred. If the vector contains the address of a condition han- 
dler (any nonzero contents), the handler is called with a depth argument of 
-2 (third longword in mechanism array, Figure 4-4). If that handler resignals 



78 



4.3 Uniform Exception Dispatching 

or if none exists, the same step is performed for the secondary exception 
vector, where the depth argument is now -1. 



4.3.2.2 Call Frame Condition Handlers. If the search is to continue (no handler yet
passed back a status of SS$_CONTINUE), the contents of the current call
frame are examined next. If the first longword in the current call frame is
nonzero, that handler is called next. If no handler is found there or if that
handler resignals, the previous call frame is examined by using the saved
frame pointer in the current call frame (see Figure 4-5). As each handler is
called, the depth longword in the mechanism array is set to the number of
frames that have already been examined for a handler.

The search continues until some handler passes back a status code of
SS$_CONTINUE or until a saved frame pointer of zero is found (indicating
the end of the call frame chain). When the exception dispatcher receives a
return status of SS$_CONTINUE (any code with the low bit of R0 set will
do), the stack is cleaned off, R0 and R1 are restored from the mechanism
array, and the exception is dismissed by issuing an REI, using the saved PC
and PSL that form the last two elements of the signal array.

Note that control is passed back with an REI instruction, even if the excep- 
tion was caused by a call to LIB$SIGNAL, because it discarded the call frame 
that was set up when it was called. That is, LIB$SIGNAL modifies its stack to 
look just like the stack used by EXCEPTION (see Figure 4-3). 

4.3.2.3 Last Chance Condition Handler. In the event that all handlers resignal, the 
search terminates when a saved frame pointer of zero is located. The excep- 
tion dispatcher then calls the handler whose address is stored in the last 
chance exception vector with a depth argument of -3. (This handler is also 
called in the event that any errors occur while searching the stack for the 
existence of condition handlers.) The usual handler found in the last chance 
vector is the so-called catch-all condition handler established as part of image 
initiation. The action of this system-supplied handler is described at the end 
of this chapter. 

If the last chance handler returns to the dispatcher (its status is ignored) or 
if the last chance vector is empty, the exception dispatcher indicates that no 
handler was found. This notification is performed by a procedure called 
EXE$EXCMSG (see Chapter 30) in the executive. Its two input parameters 
are an ASCIZ string containing message text and the argument list that was 
passed to any condition handlers. Following the call to EXE$EXCMSG (see 
Chapter 30), the image is terminated with a status indicating either that no 
handler was found or that a bad stack was detected while searching for a 
condition handler. 











[Figure 4-5  Order of Search for Condition Handler. The figure shows
the stack after procedure A has called B, B has called C, and C has
generated signal S. The signal array (N, the name of signal S, other
parameters, the exception PC in C, and the exception PSL) and the
mechanism array (establisher FP, a depth shown as 1, the saved R0 and
R1, and the signal/stop code) lie on top of the call frames for
procedures C, B, and A. Each frame's first longword contains the
address of its handler (CH, BH, and AH), and the saved FP in each frame
links it to the previous frame. The numbered callouts correspond to the
steps listed in Section 4.3.3.2.]







4.3.3 Multiply Active Signals 

If an exception occurs in a condition handler or in a procedure called by a 
condition handler, a situation called multiply active signals is reached. To 
avoid an infinite loop of exceptions, the procedure that searches for condition 
handlers modifies its search algorithm so that frames searched while servic- 
ing the first condition are skipped while servicing the second condition. 

In order for this skipping to work correctly, call frames of condition han- 
dlers must be uniquely recognizable. The frames are made unique by always 
calling the condition handlers from a standard call site, located in the system 
service vector area. 



4.3.3.1 Common Call Site for Condition Handlers. Before the dispatch to the handler
occurs, the stack is set up to contain the signal and mechanism arrays and the
handler argument list (see Figure 4-4). The handler address is loaded into R1
by the handler search procedure, and control is passed to the common dis-
patch site with the following instruction:

        JSB     @#SYS$CALL_HANDL

The code located at SYS$CALL_HANDL simply calls the procedure whose
address is stored in R1 and returns to its caller with an RSB.

SYS$CALL_HANDL::
        CALLG   4(SP), (R1)
        RSB

The CALLG operand 4(SP) addresses the handler argument list just be-
neath the return PC pushed by the JSB. The call instruction leaves the
return address SYS$CALL_HANDL + 4, the address of the RSB instruction,
in its call frame. Thus, the unique identifying characteristic of a condition
handler is the address SYS$CALL_HANDL + 4 in the saved PC of its call
frame. This signature is used not only by the search procedure but also by
the Unwind system service, as described below.



4.3.3.2 Example of Multiply Active Signals. The modified search procedure can best
be illustrated through an example. Figure 4-5 shows the stack after procedure
C, called from B, called from A, has generated signal S. We are assuming that
the primary and secondary condition handlers (if they exist) resignaled. Con-
dition handler CH also resignaled.

(1) Procedure A calls procedure B, which calls procedure C.

(2) Procedure C generates signal S.

(3) The search procedure modifies the depth argument and establisher frame
argument. If handler CH resignals, the depth argument is 1 when BH is
called.

(4) The call frame for handler BH is located (at lower virtual addresses) on
top of the signal and mechanism arrays for signal S (see Figure 4-6). (The
only intervening items are the saved registers and stack alignment bytes


indicated by the register save mask in the upper byte of the second long-
word of the call frame for handler BH.) The saved frame pointer in the call
frame for BH points to the frame for procedure C.

[Figure 4-6  Modified Search with Multiply Active Signals. The figure
continues Figure 4-5: the call frame for handler BH, whose saved PC is
SYS$CALL_HANDL + 4 (the dispatcher call site) and whose saved FP points
to the frame for procedure C, lies above the saved registers and stack
alignment bytes indicated by the register save mask (RSM) in frame BH,
the return PC from the JSB, and the signal and mechanism arrays for
signal S. Above frame BH are the call frames for procedures X and Y and
the signal and mechanism arrays for signal T generated by procedure Y,
whose mechanism array shows a depth of 3. The numbered callouts
correspond to the steps in Section 4.3.3.2.]

(5) Handler BH now calls procedure X, which calls procedure Y (see Figure 
4-6). 

(6) Procedure Y generates signal T. The desired sequence of frames to be 
examined is: frame Y, frame X, frame BH, and then frame A. Frames B and 
C should be skipped because they were examined while servicing condi- 
tion S. 

(7) The search procedure proceeds in its normal fashion. The primary and 
secondary vectors are examined first (no skipping here). Then frames Y, 
X, and BH are examined, resulting in handlers YH, XH, and BHH being 
called in turn. Let us assume that all these handlers resignal. After han- 
dler BHH returns to the dispatcher with a code of SS$_RESIGNAL, the 
search procedure notes that this is the frame of a condition handler, be- 
cause its saved PC is SYS$CALL_HANDL + 4 (see Figure 4-6). 

(8) The skipping is accomplished by locating the frame that established this 
handler. The address of that frame is located in the mechanism array for 
signal S. 

To locate the mechanism array for signal S, the value of SP before the 
call to BH must be calculated, using the register save mask and stack 
alignment bits in the call frame. 

(9) One extra longword, the return PC from the JSB to SYS$CALL_HANDL,
must be skipped to locate the argument list (and thus the mechanism
array) for signal S.

(10) Because the frame pointed to by the mechanism array element has al-
ready been searched, the next frame examined by the search procedure is
the frame pointed to by the saved frame pointer in the call frame of proce-
dure B, which in this case is the frame for procedure A. The depths that
are passed to handlers as a result of the modified search are 0 for YH, 1 for
XH, 2 for BHH, and 3 for AH.

(11) The frame for the search procedure, or for any of the handlers YH, XH,
BHH, and AH when they are called, will be located on top of the signal
and mechanism arrays for signal T (at lower virtual addresses). (One ex-
ample is shown in Figure 4-8, which illustrates the operation of
SYS$UNWIND.)



4.4 CONDITION HANDLER ACTION 

Condition handlers have several options available to them. 
• They can fix the exception and allow execution to continue at the inter- 
rupted point in the program. 









• They can pass the exception along to another handler by resignaling.
• They can allow execution to resume at any arbitrary place in the
calling hierarchy by unwinding a number of frames from the call stack.



4.4.1 Continue or Resignal 

A handler first determines the nature of the exception by examining the sig-
nal name in the signal array (see Figure 4-4). If the handler determines that it
is not capable of resolving the current exception for whatever reason, it in-
forms the exception dispatcher that the search for a handler must go on. This
continuation is called resignaling and is performed by passing a return status
code of SS$_RESIGNAL back to the dispatcher. (Recall that condition han-
dlers are function procedures that return a status to their caller in R0.)

On the other hand, if the condition handler is able to resolve the exception
(in some unspecified way), it indicates to the dispatcher that the program that
was interrupted when the exception occurred can continue. To indicate that
the program can continue, the return status code of SS$_CONTINUE is
passed back to the caller.

When the dispatcher detects this return status code, it removes the argu-
ment list and mechanism array from the stack (see Figure 4-4), restoring R0
and R1 in the process. It then removes all of the signal array except the excep-
tion PC and PSL from the stack. Finally, these are removed with the REI
instruction that dismisses the exception and passes control back to the pro-
gram that was interrupted when the exception occurred.

If the exception that occurred was a hardware fault (such as an access viola- 
tion), the instruction that caused the exception will be repeated because the 
PC of that instruction was pushed onto the stack when the exception oc- 
curred. If the exception was a hardware trap (such as integer overflow), the 
next instruction in the instruction stream will be the first to execute. In the 
event that a condition handler continues from an exception that was initi- 
ated through a call to LIB$SIGNAL, the first instruction to execute will be 
the instruction following the CALLx instruction. 
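Because the dispatcher restores R0 and R1 from the mechanism array
when it dismisses the exception, a handler can pass a status or function
value back to the interrupted code by storing into the saved R0 long-
word, as the following sketch shows (the offsets assume the mechanism
array layout of Figure 4-4: count, establisher FP, depth, saved R0,
saved R1):

            MOVL    8(AP), R2               ; R2 -> mechanism array
            MOVL    #SS$_NORMAL, 12(R2)     ; value restored to R0 at dismissal
            MOVL    #SS$_CONTINUE, R0       ; tell the dispatcher to continue
            RET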



4.4.2 Unwinding the Call Stack 

Another powerful tool available to condition handlers allows them to alter 
the flow of control when an exception occurs. This tool is called unwinding 
and allows the condition handler to pass control back to a previous level in 
the calling hierarchy by throwing away a specified (or default) number of call 
frames. 

The Unwind Call Stack system service is called with two optional argu- 
ments, the first of which indicates the number of frames to remove from the 







call stack and the second of which gives an alternate return PC to which 
control will be returned. 

The Unwind system service does not actually remove frames from the 
stack. Rather, it changes the return PC in the specified number of frames to 
point to a special routine in the executive that will be entered as each proce- 
dure exits with a RET instruction. The effect of calling Unwind is pictured in 
Figure 4-7. If the alternate PC argument has also been passed to Unwind, the 
return PC in the next call frame is altered to the specified argument (see 
Figure 4-7). 

As each procedure issues a RET instruction, control is passed to the execu-
tive routine that examines the current frame for the existence of a condition
handler. If such a handler exists, it is called with the exception name
SS$_UNWIND. When the condition handler returns to the unwind routine, a
RET is issued by the unwind routine on behalf of the procedure to discard the
current call frame. This sequence goes on until the specified number of call
frames have been discarded. This technique of calling handlers as a part of the
unwind sequence allows handlers that previously resignaled an exception to
regain control and perform procedure-specific cleanup.
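A handler that elects to unwind to its establisher might call the
service as sketched here; SYS$UNWIND expects the address of the depth
value, and the depth longword lies at offset 8 in the mechanism array
(see Figure 4-4):

            MOVL    8(AP), R2               ; R2 -> mechanism array
            ADDL3   #8, R2, -(SP)           ; push the address of the depth
            CALLS   #1, G^SYS$UNWIND        ; unwind; alternate PC omitted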



4.4.3 Example of Unwinding the Call Stack 

An example of an unwind sequence is illustrated here with the help of Figure 
4-7. The situation begins with a sequence exactly like the one pictured in 
Figure 4-5. Procedure A calls procedure B, which calls procedure C. Procedure 
C generates signal S. The primary and secondary handlers (if they exist) sim- 
ply resignal. Handlers CH and BH, located next by the search procedure, also 
resignal. 

Finally, handler AH is called. AH decides to unwind the call stack back to
its establisher frame. (This unwinding is not the default case.) To accomplish
the unwinding, AH must call SYS$UNWIND with a depth argument equal
to the value contained in the mechanism array. In this example, the depth
argument is 2. After the call to SYS$UNWIND, which executes in the access
mode of its caller, but before the frame modification occurs, the stack has the
form pictured on the left-hand side of Figure 4-7. The operation of frame
modification by the $UNWIND system service now proceeds as follows.

(1) Unwind looks down the call stack until it locates a condition handler.
Recall that a condition handler is identified by a saved PC of
SYS$CALL_HANDL + 4. If handler AH had called another procedure in
this example, nothing would have happened to that procedure's call
frame. The first call frame modified by Unwind is the frame of the
first handler that it encounters, which in the example in this figure is
the frame for AH.











[Figure 4-7  Call Frame Modification by SYS$UNWIND. The left side of
the figure shows the call frames on entry to EXE$UNWIND, which executes
in the access mode of its caller: the frame for the SYS$UNWIND system
service, the frame for condition handler AH (whose return PC is in the
exception dispatcher, at SYS$CALL_HANDL + 4), the signal and mechanism
arrays for the initial condition (the signal array contains the return
PC in procedure C, which is bypassed if any frames are unwound), and
the frames for procedures C, B, and A, with handlers AHH, CH, and BH
established where indicated. The right side shows the return PCs in
these frames after they have been modified by EXE$UNWIND: the saved PC
in AH's frame is replaced with STARTUNWIND, the saved PCs in the frames
for C and B are replaced with LOOPUNWIND, and the return PC in A would
be replaced with the alternate return PC, if one was specified.
EXE$UNWIND does not modify its own frame.]







(2) Unwind does not modify its own frame. When it issues a RET, control is 
passed back to handler AH. 

(3) The first frame that Unwind modifies is the frame of the first condition 
handler that it encounters by tracing back the call stack. It replaces the 
return address found there with the address of a routine (STARTUNWIND) 
internal to itself. 

When handler AH issues its RET, control will not go back to the excep- 
tion dispatcher. Instead, the instructions beginning at STARTUNWIND 
execute. Note that not returning to the exception dispatcher means that 
control will never get back to procedure C, because its return PC is stored 
in the mechanism array and would be restored by the REI instruction 
issued by the exception dispatcher. 

(4) Unwind continues to modify the saved PC longwords in successive
frames on the call stack until the number of frames specified (or implied)
in the SYS$UNWIND argument list have been modified. All frames ex-
cept the first have their saved PC replaced with address LOOPUNWIND,
another label in the internal unwind routine (see Figure 4-7). It is this
routine that checks whether the current frame has a handler established
and, if so, calls that handler with the signal name SS$_UNWIND to
allow the handler to perform procedure-specific cleanup.

If a handler called in this way calls SYS$UNWIND (with the signal
array containing SS$_UNWIND as the signal name), an error status of
SS$_UNWINDING is returned, indicating that an unwind is already in
progress.

(5) If the alternate PC argument was also supplied to SYS$UNWIND, the
call frame into which this argument would be inserted is the next frame
beyond the last frame specified (or implied) in the first SYS$UNWIND
argument. In this case, if an alternate PC argument were present, it
would be placed into the call frame for procedure A.

Now that all the frames have been modified, the actual unwinding occurs. 
The sequence of steps is approximately the following. 

1. Unwind returns control to handler AH. 

2. Handler AH does whatever else it needs to do to service the condition. 
When it has completed its work, it returns to the code beginning at label 
STARTUNWIND in module SYSUNWIND. (Because none of the unwind 
routines check return status, it does not matter what status is passed back 
by AH as it returns.) 

3. The routine beginning at STARTUNWIND first restores R0 and R1 from
the mechanism array. It then performs the following three steps.
a. If a handler is established for this frame, the handler is called with the 
signal name SS$_UNWIND. 









b. If either R0 or R1 is specified in the register save mask, the unwind
routine replaces the value of that register in the register save area of the
call frame with the current contents of the register. Note that this is
rather an unusual case; the procedure calling standard specifies that R0
and R1 are to be used to return status codes and function values.

c. Control is returned to whatever address is specified in the saved PC 
longword of the current call frame by issuing a RET. 

4. The RET issued in step 3c discards the call frame for procedure C, passing 
control to LOOPUNWIND where the three steps 3a through 3c are again 
executed. 

5. The RET that discards the call frame for procedure B passes control back 
to the point in procedure A following the call to procedure B (if we assume 
no alternate PC argument) where execution will resume. 

In effect, STARTUNWIND and LOOPUNWIND simulate returns from 
each nested procedure that is being unwound. These procedures never receive 
control again. However, the outermost procedure receives control as if all of 
the nested procedures had returned normally. 



4.4.4 Potential Infinite Loop 

There is one possible pitfall that can happen with this implementation. The 
previous section pointed out that the exception dispatcher takes care (when 
multiple signals are active) not to search frames for the second condition that 
were examined on the first pass. If a condition handler generates an excep- 
tion, it is not called in response to its own signal (unless it establishes itself 
to handle its own signals!). 

However, Unwind cannot perform such a check. It must call each condi- 
tion handler that it encounters as it removes frames from the stack. Thus, a 
poorly written condition handler (one that generates an exception) could re- 
sult in an infinite loop of exceptions if a handler higher up in the calling 
hierarchy unwinds the frame in which this poorly written handler is de- 
clared. This loop has no effect on the system but effectively destroys the 
process in which this handler exists. 

4.4.5 Unwinding Multiply Active Signals 

There is a slight change to the Unwind system service when multiple signals 
are active. While modifying saved PCs in call frames, Unwind counts the 
number of frames that have been modified until the requested number has 
been reached. The only change that occurs with multiply active signals is 
that the loop stops counting while the skipped frames are being modified. 
The example of multiply active signals pictured in Figures 4-5 and 4-6 can 







be used to illustrate the unwinding. Recall that procedure A called procedure 
B, which called procedure C, which signaled S. Handler CH resignaled. Han- 
dler BH called procedure X, which called procedure Y, which signaled T. 
Handlers YH, XH, and BHH all resignaled. Finally, handler AH was called for 
signal T with a depth of 3. 

If AH calls SYS$UNWIND, the top of the stack is as pictured in Figure 4-8, 
with the continuations of this figure in Figure 4-6. Assume that the depth 
argument passed to SYS$UNWIND is 3 (taken from the mechanism array 
and meaning unwind to the establisher of AH), and the alternate PC argu- 
ment is not present. 

The end result of the operation of Unwind in this case is as follows. 

1. Unwind looks down the call stack until it locates a condition handler, 
which in this case is AH. The saved PC is modified to STARTUNWIND. 

2. The saved PC longwords in frames Y and X are altered to contain address 
LOOPUNWIND. Note that SYS$UNWIND has now altered three frames. 

3. Because the next frame on the stack, BH, indicates a condition handler
(saved PC of SYS$CALL_HANDL + 4), its associated mechanism array is
located (by climbing over saved registers, stack alignment bytes, and a
saved PC from the JSB instruction). The saved PCs in all frames up to the
frame pointed to by the mechanism array are modified (but not counted
toward the number specified in the argument passed to SYS$UNWIND) to
contain address LOOPUNWIND. This modification causes frames BH and
C to get their saved PCs altered in the example.

4. The saved PC in the frame for procedure B is not altered so that when the 
unwind takes place, control will return to the call site of procedure B in 
procedure A. 



4.4.6 Correct Use of Default Depth in SYS$UNWIND 

A default depth argument to SYS$UNWIND (DEPADR = 0) specifies that the 
stack is to be unwound to the caller of the handler's establisher. In most 
cases, the caller of the handler's establisher is equivalent to the depth of the 
handler plus 1. However, because of an inherent ambiguity in counting the 
stack frames when multiply active signals are present, it is important that 
the default be used when unwinding to the caller of the establisher, rather 
than an explicit depth. 
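In MACRO, the safe idiom is simply to omit both arguments and let them
default (a sketch):

            CALLS   #0, G^SYS$UNWIND        ; default depth: unwind to the
                                            ; caller of the establisher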

Consider the two following cases of nested exceptions. In Figure 4-9, rou- 
tine A calls routine B. An exception causes handler BH to be invoked. An 
exception within BH causes handler AH to be invoked (because frame B is 
skipped, as described in Section 4.3.3). The depth of the mechanism vector in 
AH's argument list is 1. For AH to unwind to its establisher, it must specify 
an explicit depth of 1 to SYS$UNWIND. Then SYS$UNWIND removes one 









[Figure 4-8  Modified Unwind with Multiply Active Signals. The figure
shows the top of the stack after handler AH calls SYS$UNWIND for signal
T: the call frame for the SYS$UNWIND system service, the call frame for
handler AH (return PC in the exception dispatcher), the saved registers
and stack alignment bytes indicated by the register save mask in frame
AH, the return PC from the JSB, and the signal and mechanism arrays for
signal T, continuing into the frames shown in Figure 4-6.]









[Figure 4-9  Nested Exception, Type 1. The figure shows the call stack
for the first nested-exception case described in Section 4.4.6: routine
A calls routine B; an exception invokes handler BH; and an exception
within BH invokes handler AH, whose mechanism array shows a depth of
1.]









frame, as specified by the count. SYS$UNWIND then notices that the next
frame is a handler frame, and therefore continues to remove stack frames
until it finds the establisher of the handler. This discovery completes the
unwind to frame A.

Now consider Figure 4-10, in which routine A incurs an exception, result- 
ing in the invoking of handler AH. Handler AH then causes an exception, 
causing its handler AHH to be invoked. The depth of AHH is zero. Now let us 
suppose that AHH wishes to unwind to the caller of its establisher. Now the 
establisher of AHH is AH. Since AH is a handler, its caller is the condition 
dispatcher, NOT routine A. 

Compare Figure 4-10 with Figure 4-9 carefully and consider what happens
if AHH calls SYS$UNWIND with an explicit depth of 1 (its depth plus 1). The
depth of 1 causes AHH's frame to be removed. SYS$UNWIND then notices
that the next frame is a handler frame and, therefore, unwinds it back to its
establisher (frame A). Note that once AHH's frame is removed, the stack is
indistinguishable from the stack in Figure 4-9 (down to frame B). Thus,
SYS$UNWIND with an explicit depth of 1 results in control returning to
routine A, which is incorrect.

Therefore, for AHH to unwind to the caller of its establisher (the condition 
dispatcher), it must specify a default depth. When this is done, $UNWIND's 
behavior upon encountering a handler frame after the count has been ex- 
hausted is modified so that the stack is not unwound further and control 
passes correctly back to the condition dispatcher. 

Because of the inherent ambiguity of these two cases, it is important that 
handlers always use the default depth when unwinding to the caller of their 
establisher. 



4.4.7 Unwinding ASTs 

In VAX/VMS Version 3.0, the behavior of $UNWIND was changed so that it 
correctly handles unwinding out of ASTs. Doing so requires some special 
processing, because simply peeling off the stack frames ignores the presence 
of the AST and fails to dismiss the AST properly. The result is that execution 
continues in the user's main level code, with delivery of further ASTs 
blocked. 

This situation is depicted in Figure 4-11. If handler XH unwinds to the
caller of its establisher (procedure A), it will also unwind out of the AST. The
problem is handled by having the $UNWIND service recognize the return PC
of the AST call frame, which is set to the value EXE$ASTRET, the AST
return point in the executive. When this PC is seen in a call frame, $UN-
WIND knows that located immediately beneath it is the AST parameter list.
In this case, the unwind PC (STARTUNWIND or LOOPUNWIND) is stored
not in the call frame, but rather in the PC of the AST parameter list.









[Figure 4-10  Nested Exception, Type 2. The figure shows the call stack
for the second nested-exception case described in Section 4.4.6:
routine A incurs an exception, invoking handler AH; an exception within
AH invokes handler AHH, whose mechanism array shows a depth of 0.]









[Figure 4-11  Exception during an AST. The figure shows the call frame
for procedure A (with handler AH established), the AST parameter list
(argument count, AST parameter, R0, R1, PC, and PSL) located
immediately beneath the return PC of EXE$ASTRET, the call frame for AST
routine X, the signal and mechanism arrays generated by AST routine X,
and the call frame for condition handler XH.]



When the AST call returns during the actual unwinding of the stack, it 
returns to EXE$ASTRET, which dismisses the AST and returns to the inter- 
rupted code with an REI. The REI then returns back to STARTUNWIND or 
LOOPUNWIND because of the modified PC. In addition, immediately before 
returning to EXE$ASTRET, $UNWIND stores the current R0 and R1 in the
AST parameter list so that they will propagate through the unwind process. 

While it is technically possible to unwind out of an AST, doing so must be 
done with some caution. If the AST routine has any sort of side effects, it is 
essential to have a condition handler declared by the AST routine to clean up 
the side effects when the AST is unwound. (Note that issuing an I/O opera- 
tion is a side effect of the highest order!) Note also that cleaning up any
subroutines of the main line program from which an unwind was executed
may be more difficult, because the asynchronous nature of ASTs means that 
unwinding could take place at any instant during the execution of a program. 
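
In MACRO-32, establishing such a handler consists of storing its address in
the first longword of the current call frame. A minimal sketch follows; the
routine and handler names here are hypothetical, not from the VMS sources:

ASTRTN: .WORD   ^M<R2,R3>               ; entry mask of the AST routine
        MOVAB   AST_CLEANUP, (FP)       ; establish a cleanup handler
        ...                             ; AST processing with side effects,
                                        ;  such as initiating an I/O request
        RET                             ; normal AST completion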



4.5 DEFAULT (VMS-SUPPLIED) CONDITION HANDLERS 

Although the use of condition handlers is totally general and completely in 
the hands of the user, some actions will always occur as the result of default 
condition handlers that are established by the executive as a part of process 
creation or image activation. 

The discussions of process creation in Chapter 20 and image initiation in 
Chapter 21 point out exactly when and how each of the handlers described in 
this section is established. The action of each of these handlers, once they are 
invoked, is briefly described here. 

4.5.1 Traceback Handler Established by Image Startup 

When an image includes either the debugger or the traceback handler, an- 
other frame is put on the user stack before the image itself is called (see 
Chapter 21). The code that executes before calling the image places the ad- 
dress of a condition handler into this frame so that subsequent conditions 
that are not handled by an intervening condition handler will be picked up by 
this traceback handler. 

This handler first checks whether the exception that occurred was 
SS$_DEBUG. If so, it maps the debugger into P0 space (if not already mapped) 
and passes control to it. This condition is signaled by a CLI in response to a 
DEBUG command. This feature allows an image that was not linked or run 
with debugger support to be interrupted and have that support added. 

For all other exceptions, if the severity level is warning, error, or severe 
error, the handler maps the traceback facility into the top of P0 space and 
passes control to it. The traceback facility passes information about the ex- 
ception to SYS$OUTPUT and terminates the image. 

If the severity level is other than the three listed above, the traceback con- 
dition handler resignals the condition, which usually means that the condi- 
tion is being passed on to the catch-all condition handler. 

4.5.2 Catch-All Condition Handler 

The address of this handler is placed in an initial call frame on the user stack 
and in the last chance exception vector for user mode either by PROCSTRT 
when the process is created or by a command language interpreter before an 
image is called. This handler is always called if no other handlers exist or if 
all other handlers resignal. Because the address of the handler is duplicated in
the last chance vector, it will also be called in the event of some error while
looking through the user stack. 

The first step that this handler takes is to call SYS$PUTMSG (see Chapter
30). If the handler was called through the last chance exception vector (the
depth argument in the mechanism array is -3), or if the severity level of the
exception name in the signal array indicates severe (exception name <2:0>
GEQU 4), then SYS$EXCMSG (see Chapter 30) is called to print a summary
message and the image is terminated. Otherwise, the image is continued.
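
The severity test just described amounts to an unsigned comparison on the
low three bits of the condition value. A minimal sketch in MACRO-32 (the
labels are hypothetical, and the condition value is assumed to be in R0):

        EXTZV   #0, #3, R0, R1          ; R1 = severity field, bits <2:0>
        CMPL    R1, #4                  ; severe error has severity 4
        BGEQU   TERMINATE               ; GEQU 4: print summary message
                                        ;  and terminate the image
        ...                             ; otherwise continue the image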

4.5.3 Handlers Used by Other Access Modes 

In addition to the handlers that the operating system supplies to handle ex- 
ceptions that occur in user mode, it also sets up handlers that will determine 
system behavior if an exception occurs in one of the other three access 
modes. 

4.5.3.1 Exceptions in Kernel or Executive Mode. In response to an exception in kernel 
mode, the exception dispatcher makes special checks to determine 
whether the processor was operating on the interrupt stack when the excep- 
tion occurred, whether the process was the swapper process or null process, 
or whether IPL was above IPL$_ASTDEL (IPL 2). Any of these conditions 
could indicate that the exception is not associated with a normal process. If
any of these conditions holds, an Invalid Exception fatal bug-
check (BUG$_INVEXCEPTN) is generated. Routines that forbid exceptions
include interrupt service routines, device drivers (except for their FDT rou- 
tines), and process-based code that happens to be executing above 
IPL$_ASTDEL (such as portions of certain system services). 

If a kernel mode exception is associated with process-based code for which 
exceptions are allowed (IPL is less than or equal to 2 and the exception oc- 
curred on the kernel stack), then exception dispatching proceeds in its usual 
manner. The last chance exception vectors for both kernel and executive 
modes are initialized in module SHELL (see Chapter 20) to contain the ad- 
dresses of routines that generate a bugcheck code of Unexpected System 
Service Exception. The difference between the bugchecks for the two access 
modes is that the bugcheck generated by the kernel mode primary exception 
handler is fatal while the corresponding bugcheck generated by the executive 
mode primary exception vector is not. Fatal bugchecks cause the system to 
crash. Nonfatal bugchecks generally result in error log entries and the dele- 
tion of the process that caused the bugcheck. The bugcheck operation is de- 
scribed in Chapter 8. 

Routines that execute in executive mode include RMS, parts of the execu- 
tive, and any user-written procedure that is entered through either a user- 
written system service dispatcher or through the Change Mode to Executive
system service. Routines that execute in kernel mode (that can cause this
bugcheck and not the Invalid Exception bugcheck because they execute at
IPL 0 or IPL 2) include portions of all system services, many exception service
routines, device driver FDT routines, including those that are written by
users, and procedures that are called either by the user-written system serv-
ice dispatcher or by the Change Mode to Kernel system service.

4.5.3.2 Condition Handler Used by DCL or MCR. The DCL and MCR command 
language interpreters establish nearly identical condition handlers at the begin- 
ning of their command loops to field exceptions that occur in supervisor 
mode. 

Part of process creation involves image activation of the CLI (DCL or 
MCR). The first step that the CLI takes after image activation is to establish 
the supervisor mode condition handler that the CLI uses to handle its own 
internal errors. The condition handler performs two tasks when it is called: 

• It cancels any exit handlers that have been established. 

• It resignals the error. 

The CLI is then allowed to run to completion, as a result of which the 
process is deleted. 






5 Hardware Interrupts 



While I nodded, nearly napping, suddenly there came a tapping, 
As of some one gently rapping, rapping at my chamber door. 
—Edgar Allan Poe, The Raven 

The VMS operating system is an interrupt-driven operating system. It con- 
tains a collection of interrupt service routines that execute in response to 
hardware interrupts from external devices and internal devices such as the 
clock. The VMS operating system does not have a software-based central 
dispatching module that receives notification of all system events (that is, 
interrupts) and decides what to do next. Instead, the VMS operating system 
relies on a hardware-controlled interrupt dispatching scheme that always 
forces the highest priority interrupt on the system to be serviced first. 



5.1 HARDWARE INTERRUPT DISPATCHING 

The VAX architecture provides 16 hardware interrupt priority levels (IPL), 
from IPL 31 down to IPL 16. The top eight levels are for use by urgent condi- 
tions including serious errors (such as machine check), the system clock, and 
power failure. These conditions are discussed in Chapters 8, 11, and 27 re- 
spectively. The lower eight levels are used by peripheral devices. 

When a peripheral device generates an interrupt, that interrupt is requested 
at a particular hardware IPL (fixed for a given device). As in the case of soft- 
ware interrupts, if the requested IPL value is higher than the level at which 
the processor is currently running (as determined by PSL <20:16>), then the 
interrupt service routine whose address is in the selected vector in the sys- 
tem control block (SCB) is entered immediately. Otherwise, servicing of the 
interrupt is deferred until IPL drops below the level associated with the inter- 
rupt. 

When an interrupt is serviced, the current processor status must be pre- 
served so that the interrupted thread of execution (either process-based code 
or an interrupt service routine executing at lower IPL) can continue normally 
after the interrupt is dismissed. Preserving the processor status is accom- 
plished (by the hardware) by automatically saving the PC and PSL on the 
stack. These are later restored with an REI instruction that dismisses the 
interrupt. Other elements of the process context, such as general registers, 
must be saved and restored by the routine(s) handling the interrupt. In order 
to reduce interrupt overhead, no memory mapping information is changed 
when an interrupt occurs. Therefore, the instructions and data referenced by 
an interrupt service routine must be in system address space. 




5.1.1 Interrupt Dispatching 

The following list outlines the primary sequence of events that occur in in- 
terrupt dispatching. 

1. An interrupt is requested. 

2. The current instruction finishes or reaches a well-defined point where the 
instruction state is completely contained in the general registers, PC, and 
PSL (which happens in the execution of the string instructions). (Some 
instructions can also be interrupted at well-defined points so that, after 
the interrupt dismissal, they are restarted, rather than continued.) 

3. The interrupt sequence is initiated by the hardware, pushing the current 
PC and PSL onto the stack. The VMS operating system uses the interrupt 
stack for all hardware interrupt servicing. Hardware interrupts are indi- 
cated by placing a 01 in bits <1:0> of each hardware interrupt vector in 
the system control block (see Figure 5-1). 

Most software interrupts are also serviced on the interrupt stack. On the 
other hand, the per-process interrupt associated with AST delivery and 
nearly all exceptions are serviced on the per-process kernel stack. 

4. A new PC is loaded (from the appropriate SCB vector), and a new PSL is 
created (with PSL <20:16> containing the IPL associated with the inter- 
rupt, and the previous access mode, current access mode, CM, TP, FPD, 
DV, FU, IV, T, N, Z, and C bits cleared by the hardware). The current 
access mode bits are cleared to indicate that the service routine will run in 
kernel mode. 

5. The interrupt service routine identified by the PC in the SCB executes 
and, eventually, exits with an REI instruction that dismisses the interrupt. 

6. The PC and PSL are restored by the execution of the REI instruction, and 
the interrupted thread of execution (process or less important interrupt 
service routine) continues where it left off. 
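
The overall shape that these steps impose on an interrupt service routine can
be sketched as follows. This is not VMS source, merely the minimal skeleton
implied by steps 3 through 6:

        .ALIGN  LONG            ; SCB vectors require longword alignment
ISR:    PUSHR   #^M<R0,R1>      ; save whatever registers the routine uses
        ...                     ; service the interrupt at the IPL taken
                                ;  from the SCB vector
        POPR    #^M<R0,R1>      ; restore the saved registers
        REI                     ; dismiss the interrupt, restoring the PC
                                ;  and PSL pushed by the hardware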



[Figure 5-1: System Control Block Vector Format. Each SCB vector is a
longword; bits <31:2> hold the address of a longword-aligned interrupt
service routine, and bits <1:0> hold a code with the following meanings:

Code    Meaning
00      Service the event on the kernel stack unless currently on the
        interrupt stack; in that case, use the interrupt stack.
01      Service the event on the interrupt stack; if the event is an
        exception, raise IPL to 31.
10      Service the event in the Writeable Control Store (WCS), passing
        bits <15:2> to the microcode; if the WCS does not exist or is not
        loaded, the operation is undefined (the processor will halt).
11      The operation is undefined (the processor will halt).]






Unlike software interrupt dispatching, there is not a one-to-one corre- 
spondence between hardware IPL and an interrupt service routine vector in 
the SCB (see Figure 5-2). The SCB contains the addresses of several interrupt 
service routines for a given device IPL. There are no registers corresponding 
to the Software Interrupt Request Register (PR$_SIRR) or Software Interrupt 
Summary Register (PR$_SISR); rather, the processor notes that a lower prior- 
ity interrupt has been requested, but not granted. When IPL falls below the 
device interrupt level, and the device is still requesting the interrupt, the 
interrupt will be granted. 

If, however, the device is no longer requesting an interrupt, the system will 
be unable to determine which interrupt service routine to call; such occur- 
rences are called passive releases. If the adapter to which the device is 
connected is still requesting an interrupt, an adapter-specific error routine is 
called. If the adapter is no longer requesting an interrupt, the system is un- 
able to determine which adapter requested the interrupt; in this case a nexus 
interrupt service routine is called. In either case, the system increments the 
counter IO$GL_SCB_INT0. 



5.1.2 System Control Block 

The system control block (SCB) contains the vectors used to dispatch (soft- 
ware and hardware) interrupts and exceptions. The starting physical address 
of the SCB is found in the System Control Block Base Register (PR$_SCBB). 
The size of the SCB varies depending on processor type. The VAX-11/750 and
the VAX-11/730 system control blocks are two pages long; a VAX-11/750
with a second UNIBUS has a three-page system control block; the
VAX-11/780 system control block consists of a single page.

The first page of the system control block is the only page defined by the 
VAX architecture. It contains the addresses of software and hardware inter- 
rupt service routines as well as exception service routines. The layout of the 
first SCB page is pictured in Figure 4-1. Table 6-1 contains more details about 
the SCB vectors used for software interrupts. Figure 5-2 shows how the sec- 
ond half of the first page is divided among 16 possible external devices, each 
interrupting at four possible IPL values. The second SCB page on the VAX- 
11/730 and VAX-11/750 is used for directly vectored UNIBUS device inter-
rupts. The third page on a VAX-11/750 with a second UNIBUS is used for
directly vectored UNIBUS device interrupts to the second UNIBUS. 

Each vector in the SCB is a longword that is examined by the processor
when an exception or interrupt occurs, to determine how to service the
event. Figure 5-1 illustrates the format of a vector in the SCB, and indicates
which stack is used to service an exception or interrupt. In the VAX/VMS
operating system, all hardware interrupts (and all software interrupts above
IPL 3) are serviced on the system-wide interrupt stack. The rescheduling
software interrupt (IPL 3) begins execution on the kernel stack but
immediately changes to the interrupt stack when it executes a SVPCTX
instruction (see Chapter 10). AST delivery (IPL 2) is serviced using a
process-specific kernel stack.

[Figure 5-2: System Control Block Vectors for Hardware Interrupts. SCBB
contains the physical address of the start of the SCB. The first part of the
first SCB page holds vectors for various exceptions and software interrupts;
the second half of the page holds four groups of 16 vectors each, one vector
per slot or TR number, for device interrupts at IPL 20, IPL 21 (beginning at
offset 140 hex), IPL 22 (offset 180), and IPL 23 (offset 1C0). A second SCB
page exists on the VAX-11/730 and VAX-11/750 for directly vectored
UNIBUS device interrupts; a VAX-11/750 with a second UNIBUS has a third
SCB page for interrupts on the second UNIBUS.]



5.1.2.1 VAX-11/730 External Adapters. On the VAX-11/730 the CPU, the UNIBUS
adapter, and the memory controller are connected by the Array Bus. In addi-
tion to the Array Bus, communications between the CPU and the integrated
disk controller (IDC) are performed over the Accelerator Bus (the floating
point accelerator also communicates over the Accelerator Bus). The IDC con-
trols RL02 and R80 disks. The VAX-11/730 is not expandable and does not
use expansion slots.

Because there are no expansion slots in the VAX-11/730, the first page of
the SCB contains only one set of SCB vectors. The longwords located at SCB
+ 08 through SCB + 0B in the first page of the SCB are used as external
adapter vectors, one for each IPL value from 20 to 23. The second SCB page
on the VAX-11/730 is used for directly vectored UNIBUS device interrupts.
Each SCB vector corresponds to a UNIBUS vector in the range from 0 to 774
(octal).

5.1.2.2 VAX-11/750 External Adapters. The backplane interconnect on the 
VAX-11/750, called the CMI (CPU to memory interconnect), connects the
CPU, memory controllers, and UNIBUS or MASSBUS adapters. Each connec- 
tion to the CMI is identified by its slot number. There is a total of 32 slots, 
the first 16 of which are used for the optional writeable control store (WCS). 
The next 10 slots are reserved for memory controllers and UNIBUS or MASS- 
BUS adapters. These 10 slots are called fixed slots because the mapping of 
controller/adapter to slot number is fixed. That is, a particular slot can have 
only a particular adapter placed in it. Five of the ten fixed slots are currently 
used by external adapters. The following list details these adapters: 

Memory Controller               Slot
Up to three MASSBUS Adapters    Slots 4 through 6
UNIBUS Adapter                  Slot 8

The last six slots are reserved for adapters with configuration registers and 
are called floating slots. 

Each slot has four SCB vectors in the first SCB page assigned to it, one for 
each IPL value from 20 to 23. As shown in Figure 5-2, the first 16 vectors are 
assigned to IPL 20. The second SCB page on the VAX-11/750 is used for di- 
rectly vectored UNIBUS device interrupts. Each SCB vector corresponds to a 
UNIBUS vector in the range from to 774 (octal). The third SCB page on a 
VAX-11/750 in a two-UNIBUS configuration is used for directly vectored 
UNIBUS device interrupts on the second UNIBUS. 




5.1.2.3 VAX-11/780 External Adapters. On the VAX-11/780, the Synchronous Back-
plane Interconnect (SBI) connects the CPU, memory controllers (including
MA780s), DR780s, CI780s, and UNIBUS or MASSBUS adapters. Each con-
nection to the SBI is assigned a transfer request (TR) number that identifies
its SBI priority. TR numbers range from 0 (highest priority) to 15 (lowest
priority). There is a limit of 15 connections to the SBI (see Table 5-1). TR
number 14 is reserved for the CI780; TR number 0 is used for a special pur-
pose on the SBI and has no corresponding external adapter. The TR number
defines the physical address space through which the device's registers are
accessed and the vectors through which the device will interrupt.

An adapter is not restricted to having a specific TR number. However, the
relative priorities of the various adapters may not change. That is, a system
cannot have an MBA with a higher priority (lower TR number) than a UBA.
For instance, if a system has two local memory controllers and an MA780
shared memory controller, the first UNIBUS adapter on that system could
have TR number 4, with the MA780 having TR number 3, and the memory
controllers having TR numbers 1 and 2.

Table 5-1: Standard SBI Adapter Assignments on the VAX-11/780

External Adapter Type           Assignment    Comments
                                TR0           Hold line for next cycle. TR 0
                                              is the highest TR level and is
                                              not assigned to a device.
First Memory Controller         TR1
Second Memory Controller        TR2
First MA780 Shared Memory
Second MA780 Shared Memory
First UNIBUS Adapter            TR3
Second UNIBUS Adapter           TR4
Third UNIBUS Adapter            TR5
Fourth UNIBUS Adapter           TR6
                                TR7           Reserved
First MASSBUS Adapter           TR8
Second MASSBUS Adapter          TR9
Third MASSBUS Adapter           TR10
Fourth MASSBUS Adapter          TR11
DR780 SBI Interface             TR12
                                TR13          Reserved
CI780                           TR14
                                TR15          Reserved
                                TR16          The CPU has implicit TR 16.
                                              Level 16 is the lowest TR
                                              level.

5.1.2.4 Adapter Configuration. On the VAX-11/750 and VAX-11/780, the presence of
an adapter at a particular slot or TR number is checked by testing the first 
longword in the adapter's I/O register space, and checking for nonexistent 
memory. The presence or absence of an external adapter is determined by the 
primary bootstrap program VMB (see Chapter 24) as part of that program's 
memory sizing operation. Specifically, VMB loads the machine check vector 
in the SCB with the address of a special routine while it is sizing memory and 
determining which external adapters are present. If a nonexistent memory 
machine check occurs, there is no connected adapter at the location being 
tested. The result of this testing is stored in a 16-byte array in a data structure 
called a restart parameter block (RPB). The later stages of system initializa- 
tion use the information obtained by VMB and stored in the RPB when they 
configure specific adapters into the system. 

On the VAX-11/730, VAX-11/750, and VAX-11/780, only IPL levels 20
through 23 are used for device interrupts. Within the SCB, vectors are re- 
served for each IPL level available to each adapter (see Figure 5-2). Whenever 
an adapter generates an interrupt for a device connected to it, the slot number 
or TR number of the adapter and the device IPL are used by the hardware to 
index into the SCB for the appropriate interrupt service routine. Some adapt- 
ers such as local memory controllers do not generate interrupts. 



5.2 VAX/VMS INTERRUPT SERVICE ROUTINES 

The interrupt service routines used by the VMS operating system operate in 
the limited system context or interrupt context described in Chapter 1. 
These routines execute at elevated IPL on the interrupt stack outside the 
context of a process. 



5.2.1 Restrictions Imposed on Interrupt Service Routines 

There are several restrictions imposed on interrupt service routines either by 
the VAX architecture or by synchronization techniques used by the VMS 
operating system. These restrictions result from the limited context that is 
available to any routine that executes outside the context of a process. The 
following list of items indicates some of the specific operations and data 
references that cannot occur in an interrupt service routine. The description 
of interrupt context in Chapter 1 contains a more general list of these and 
other restrictions. 




• Interrupt service routines should be very short and do as little processing
  as possible at elevated IPL.

• Any registers used by an interrupt service routine must first be saved.

• Although an interrupt service routine can elevate IPL, it cannot lower IPL
  below the level at which the original interrupt occurred.

• The size of the interrupt stack, the stack used by all hardware interrupt
  service routines, is controlled by the SYSBOOT parameter INTSTKPAGES
  (which has a default value of two pages). This parameter determines the
  amount of stack storage available to interrupt service routines.

• Any elements pushed onto the stack by an interrupt service routine must
  be removed before the interrupt is dismissed in order for the REI
  instruction to work correctly.

• Because the low two bits of interrupt service routine addresses in the sys-
  tem control block are used for stack selection, interrupt service routines
  called directly by the hardware must be longword aligned.

• No pageable routines or data structures can be referenced above IPL 2.

• Data structures that are synchronized by either IPL$_SYNCH or by
  mutexes cannot be referenced by interrupt service routines without de-
  stroying the synchronization (unless the interrupt service routine is exe-
  cuting at IPL$_SYNCH with the express purpose of accessing the data
  structure).

• No references to per-process address space (P0 space or P1 space) are al-
  lowed.



5.2.2 Servicing UNIBUS Interrupts 

Each device on the UNIBUS has one (or more) vector number(s) to identify 
the device, and a bus request (BR) priority to allow the UNIBUS to arbitrate 
among devices when multiple interrupts occur. There are 4 BR levels, called 
BR4, BR5, BR6, and BR7. BR7 has the highest priority. If multiple interrupts 
occur for devices with the same BR level, the device electrically closest to the 
UNIBUS interface has the highest priority. The device IPL used equals the BR 
priority +16. For example, BR4 corresponds to IPL 20. 

5.2.2.1 VAX-11/730 and VAX-11/750 UNIBUS Interrupt Service Routines. UNIBUS 
interrupts on the VAX-11/730 and VAX-11/750 are directly vectored through
the second page of the system control block. The system control block con- 
tains separate addresses for the interrupt service routines for all of the UNI- 
BUS interrupt vector locations. When a unit is connected (using SYSGEN), 
the appropriate fields in the SCB are initialized to point to the interrupt serv- 
ice routines for the device vectors. The interrupt service routines eventually 
transfer control to the appropriate device driver interrupt service routines. 
The VAX/VMS Guide to Writing a Device Driver describes the data struc-
tures in the I/O database, and contains a more complete discussion of driver
interrupt service routines than that presented here.

When a UNIBUS device generates an interrupt on the VAX-11/730 or
VAX-11/750, the interrupt is vectored directly through the SCB, and control
is immediately transferred to the following instruction in the appropriate 
device controller's channel request block (CRB). 

PUSHR #^M<R0,R1,R2,R3,R4,R5>

The next instruction in the CRB is a JSB to the driver interrupt service 
routine (see Figure 5-3). The longword following the JSB instruction contains 
the address of another data structure (the IDB, interrupt dispatch block). This 
address is pushed onto the stack (as the return PC for the JSB instruction). 
However, control is never returned there because that address is removed 
from the stack by the driver interrupt service routine. 

After the JSB instruction in the CRB transfers control to the driver inter- 
rupt service routine, the following events take place. 

1. The driver interrupt service routine removes the IDB pointer from the 
stack and uses it to obtain both the address of the device controller's con- 
trol/status register (CSR) and the address of the UCB for the device gener- 
ating the interrupt. 

2. Having found the UCB, the interrupt service routine determines whether 
the interrupt was expected or not, and, if expected, restores the driver 
context stored in the UCB and transfers control to the saved PC. 

3. When the driver finishes processing the interrupt, it issues an RSB. 

4. Control is transferred back to the driver interrupt service routine, which 
restores the registers (R0 through R5) saved by the PUSHR instruction and 
dismisses the interrupt with an REI. 

If the interrupt was unsolicited, the driver may either take some appropriate 
action or simply dismiss the interrupt by restoring R0 through R5 and issuing 
an REI. 
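
The dispatch sequence in steps 1 and 2 can be sketched in MACRO-32 as
follows. This is not VMS source; the IDB layout assumed here (the CSR
address in the first longword and the owner's UCB address in the second) and
the UCB field and bit names follow common driver conventions but should
be treated as assumptions:

DRV$INT:
        MOVL    (SP)+, R4       ; pop the IDB address left by the JSB
        MOVQ    (R4), R4        ; R4 = CSR address, R5 = owner UCB
        BBCC    #UCB$V_INT, UCB$W_STS(R5), 10$  ; unexpected if clear;
                                ;  otherwise clear the flag, fall through
        MOVQ    UCB$L_FR3(R5), R3       ; restore driver's R3 and R4
        JSB     @UCB$L_FPC(R5)          ; reenter driver at its saved PC
10$:                            ; unsolicited, or driver issued its RSB
        POPR    #^M<R0,R1,R2,R3,R4,R5>  ; restore registers saved in CRB
        REI                     ; dismiss the interrupt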



5.2.2.2 VAX-11/780 UNIBUS Interrupt Service Routines. When a device on the
UNIBUS requests an interrupt, the UBA converts that request into an inter- 
rupt on the SBI. The SBI interrupt is vectored through the SCB to a UNIBUS 
adapter interrupt service routine. In the case of interrupts generated by a 
UNIBUS device on the VAX-11/780, the corresponding adapter receives de-
vice interrupt requests, determines which has the highest priority, and gener- 
ates an interrupt of its own for the CPU (on behalf of the interrupting device). 
It is actually the adapter interrupt that is vectored through the SCB (using the 
interrupting device's IPL and the adapter's TR number), to an adapter inter- 
rupt service routine. The adapter interrupt service routine saves registers R0
through R5, determines which device actually requested the interrupt, and
then passes control to an interrupt service routine in the device driver for the
interrupting device. The driver interrupt service routine can then respond to
the interrupt in a device-dependent fashion. After servicing the interrupt, the
registers saved by the adapter interrupt service routine must be restored, and
an REI instruction issued to dismiss the interrupt.

[Figure 5-3: Control Flow in Servicing a UNIBUS Interrupt. On the
VAX-11/780, the SCB vectors the interrupt to a UBA interrupt service
routine, which saves R0 through R5, reads the BRRVR register in the UBA,
and uses the vector read as an index into a vector table containing device
CRB addresses. On the VAX-11/730 and VAX-11/750, the second page of the
SCB (and, for a second VAX-11/750 UNIBUS, an optional third page) vectors
the interrupt directly to the device CRB. The CRB contains a PUSHR of R0
through R5 and a JSB, followed by the IDB address, that enters the device
driver interrupt service routine. That routine uses the IDB address on the
stack to locate the device registers and the device UCB, restores R3 and R4
from the fork block in the UCB, and transfers control via JSB to the PC saved
in the fork block. When the driver issues its RSB, the interrupt service
routine restores R0 through R5 and issues an REI to dismiss the interrupt.]

There are four interrupt service routines for each UBA, one for each BR 
level at which UNIBUS devices request interrupts. They differ only in which 
internal UBA register they read to determine which device requested the 
interrupt. These interrupt service routines are found in a data structure de- 
scribing the UBA (the adapter control block) that is created when the system 
is bootstrapped (from module INITADP). 

UNIBUS interrupt servicing on the VAX-11/780 begins in one of four UNI-
BUS adapter interrupt service routines. 

1. The UBA interrupt service routines (see Figure 5-3) save registers R0 
through R5. 

2. A UBA internal register (BRRVR) is read to determine the identity of the 
interrupting device. Each BRRVR register contains either the vector num- 
ber corresponding to the device interrupt or an indication that the UBA is 
interrupting on behalf of itself, not for some device. (There are four 
BRRVRs in the UBA, one for each BR level.) 

3. If the UBA is interrupting on behalf of itself, it is normally indicating an 
adapter error condition. These errors usually result when a reference is 
made to a nonexistent address in UNIBUS I/O space. They may indicate 
only a transient hardware error or a bug in a device driver. These errors are 
logged, up to a maximum of 3 in any given 15-minute period, and the 
interrupt is dismissed. 

4. For a device interrupt, the vector number is used as an index into a vector 
table. The vector table contains a pointer to the JSB instruction inside the 
CRB. Control is transferred to the JSB instruction by a JMP instruction in 
the adapter interrupt service routine. 

The vector table entry pointing to the CRB, as well as the address fields 
in the CRB, are filled in by SYSGEN at the time the device driver is loaded 
into the system with the SYSGEN command CONNECT. 

The instruction inside the CRB is a JSB to the driver interrupt service routine. 
The longword following the JSB instruction contains the address of another 
data structure (the IDB, interrupt dispatch block). This address is pushed onto 
the stack (as the return PC for the JSB instruction). However, control is never 
returned there because that address is removed from the stack by the driver 
interrupt service routine. 

After the JSB instruction in the CRB transfers control to the driver inter- 
rupt service routine, the following events take place: 




1. The driver interrupt service routine removes the IDB pointer from the 
stack and uses it to obtain both the address of the device controller's con- 
trol/status register (CSR) and the address of the UCB for the device gener- 
ating the interrupt. 

2. Having found the UCB, the interrupt service routine determines whether 
the interrupt was expected or not, and, if expected, restores the driver 
context stored in the UCB and transfers control to the saved PC. 

3. When the driver process finishes processing the interrupt, it issues an RSB. 

4. Control is transferred back to the driver interrupt service routine, which 
restores the registers (RO through R5) saved by the UBA interrupt service 
routine and dismisses the interrupt with an REI. 

If the interrupt was unsolicited, the driver may either take some appropriate 
action or simply dismiss the interrupt by restoring RO through R5 and issuing 
an REI. 

At this point, interrupt dispatching proceeds exactly as it does in the case
of the VAX-11/750. Note that device drivers need not concern themselves
with whether they are on a VAX-11/730, a VAX-11/750, or a VAX-11/780,
because their interrupt service routines will be entered in a transparent man-
ner.



5.2.3 MASSBUS Interrupt Service Routines 

Unlike UNIBUS interrupt dispatching, the MASSBUS interrupt sequences for
the VAX-11/750 and the VAX-11/780 are identical. The
VAX-11/730 has no MASSBUS. When the system is bootstrapped, entries are
made in the SCB to transfer control to locations in the CRB for the MASSBUS 
adapter. The instructions in the MBA CRB are a PUSHR for R2 to R5 and a 
JSB to the MBA interrupt service routine MBA$INT (which is part of module 
MBAINTDSP). 

MBA interrupts are handled differently from UNIBUS interrupts, partly 
because one MBA interrupt may indicate that multiple devices on the adapter 
need servicing. The MBA interrupt service routine reads an attention sum- 
mary register to determine what it must do to respond to an interrupt. 

If the interrupt enable bit in the MBA is set, an MBA interrupt can be 
caused by any of the following operations. 

• A data transfer completes. 

• An attention line is asserted while the MBA is not busy. 

• An MBA error occurs while the MBA is not busy. 

• The power is turned on for the MBA. 

Devices on the MASSBUS can assert the attention line under the following 
circumstances: 






• If an error occurs, whether or not a transfer is taking place 

• When a mechanical motion such as a disk seek or tape rewind completes 

• When a device changes its state 

In general, MASSBUS device drivers do not request ownership of the MBA 
until they need it to perform a transfer. The MBA interrupt service routine 
assumes that if the MBA owner is expecting an interrupt, then the interrupt 
currently being serviced indicates that a transfer has completed or been 
aborted. That is, when an MBA interrupt occurs and the current owner of the 
MBA is expecting an interrupt, MBA$INT dispatches immediately to the 
owner's driver. It then checks whether other devices on the MASSBUS need 
attention. The UCB list contained in the IDB allows MBA$INT to associate 
UCB addresses with devices that are requesting service. 

MBA$INT responds to an interrupt in one of three ways (see Figure 5-4). It 
may perform all three of these actions to service multiple attention requests 
in response to a single interrupt. 

• For an expected interrupt for a single-unit controller (a disk), MBA$INT 
issues a JSB instruction that transfers control directly to the fork PC stored 
in the UCB of the interrupting device. The driver returns to MBA$INT 
when it has completed its work. 

• For an unsolicited interrupt for a single-unit controller, MBA$INT issues a
JSB instruction that transfers control to a driver-supplied unexpected inter-
rupt service routine, which will return to MBA$INT.

• For a multidevice controller (a magtape), MBA$INT transfers control to
the CRB for the device controller. The device controller CRB dispatches to
a controller interrupt service routine that saves R2 to R5 and transfers
control to the driver interrupt service routine. This service routine eventu-
ally returns control to MBA$INT.

The way MBA$INT decides whether an entry in the MBA IDB is a UCB
address (single-unit controller) or a pointer into a CRB (multidevice control-
ler) is by checking the low-order bit of the entry in the MBA IDB for the
controller. If the bit is set, then the entry is for a multidevice controller. If the
bit is clear, the entry represents the UCB address for the device on a single-
device controller. UCBs, like CRBs, are always longword aligned (the low-
order two bits are clear). When a CRB is created for a multidevice controller,
and its address stored in the MBA IDB, the address is incremented by 1 so that
the low-order bit will be set. Control is actually transferred to the PUSHR
instruction in the multidevice controller CRB using the following instruction
(where R5 contains the MBA IDB entry). Because the destination of a JSB is a
byte-context address operand, the autodecrement subtracts only 1 from R5,
clearing the low-order bit before control is actually transferred:

JSB -(R5)



[Figure 5-4: Control Flow in Servicing a MASSBUS Interrupt. The SCB
vectors the interrupt to the MBA interrupt service routine MBA$INT, which
uses the MBA IDB (containing the address of the MBA registers and the list
of CRBs and UCBs for devices on the MASSBUS) to determine the type of
interrupt and executes the appropriate code. Case 1, a single-unit (disk)
controller expecting an interrupt: MBA$INT issues a JSB directly to the PC
stored in the UCB fork block, where the driver awaits the interrupt and exits
with an RSB. Case 2, a single-unit (disk) controller not expecting an
interrupt: MBA$INT issues a JSB to the driver's unsolicited interrupt
routine, which exits with an RSB. Case 3, a multiunit (tape) controller:
MBA$INT pushes the PSL and issues a JSB to the device CRB, whose
PUSHR of R2 through R5 and JSB enter the driver's interrupt service
routine, which restores R2 through R5 and exits with an REI. After
returning from the subroutine, MBA$INT cleans up and determines whether
another interrupt is present; if one exists, it returns to the case dispatching,
and if not, it issues an REI.]



Because data transfer functions block the interrupts from nontransfer func- 
tions until the data transfer completes, MBA$INT always checks the MBA 
attention summary register after a driver interrupt service routine returns 
control. This check is made to determine if another device on the MASSBUS 
requested an interrupt either while the MASSBUS owner was transferring 
data or while the current interrupt was being processed. 



5.2.4 DR32 Interrupt Service Routine 

DR32 (or DR750 and DR780) interrupt dispatching is handled similarly to 
MBA interrupt dispatching. When the system is bootstrapped, entries are 
made in the SCB to transfer control to locations in the CRB for the DR32. 
The instructions in the CRB are a PUSHR for R2 to R5, and a JSB. The DR32 
IDB address follows the JSB instruction in the DR32 CRB (see Figure 5-5). 
Initially, the JSB in the DR32 CRB transfers control to routine DR$INT in 
module DRINTHAND. This routine simply performs the following opera- 
tions: 

1. It clears the adapter power up and power down bits in a DR32 control 
register. 

2. It calls a controller initialization routine to reset the DR32 (and disable 
DR32 interrupts). 

3. It restores registers R2 to R5. 

4. It issues an REI instruction. 

When the DR32 driver (XFDRIVER) is loaded by SYSGEN (as part of 
AUTOCONFIGURE when the system is bootstrapped, or by an explicit 
CONNECT command), the JSB instruction is overwritten to point to the 
interrupt service routine in the driver. This routine performs the following 
operations: 

1. It responds to the various types of DR32 interrupts. 

2. It restores registers R2 to R5. 

3. It issues an REI instruction. 



5.2.5 MA780 Interrupt Dispatching 

Although the standard MS780 memory controller does not generate inter- 
rupts, the shared memory (MA780) controller does. Interrupts are requested 
by a driver or the executive to interrupt another processor connected to the 
shared memory. Interrupts occur whenever a shared memory event flag is set 
or a shared memory mailbox message is written, or whenever there is inter-
processor communication in the VAX-11/782. Note that this discussion de-
scribes the MA780 used as shared memory among VAX-11/780s; interrupt
handling in the VAX-11/782 is somewhat different and is briefly discussed in
Section 5.2.6. Chapter 28 gives a more complete description of MA780 inter-
rupts in the VAX-11/782.

[Figure 5-5: Control Flow in Servicing a DR32 Interrupt. The SCB vectors
the interrupt to the DR32 CRB, whose PUSHR of R2 through R5 and JSB
(followed by the DR32 IDB address) enter one of two routines. Until the
DR32 driver is loaded, the JSB enters DR$INT, which disables DR32
interrupts, restores R2 through R5, and issues an REI. After the driver is
loaded, the JSB enters the DR32 driver interrupt service routine, which
responds to the interrupt (for example, by queuing an AST to the user
process to inform the user of the interrupt), restores R2 through R5, and
issues an REI. The DR32 IDB contains the DR32 CSR address and the
address of the device UCB, whose fork block holds the saved R3, R4, and
PC.]

[Figure 5-6: Control Flow in Servicing an MA780 Interrupt. The SCB
vectors the interrupt to locations in the MA780 ADP, which contain a
PUSHR of R0 through R5 and a JSB to MA$INT. MA$INT computes the
address of the ADP from the pointer on the stack, services the interrupt,
restores R0 through R5, and exits with an REI.]

When the system is bootstrapped, module INITADP places entries into the 
SCB to transfer control to locations in the MA780 ADP when MA780 inter- 
rupts occur (see Figure 5-6). The locations in the ADP contain a PUSHR in- 
struction saving R0 to R5, and a JSB instruction that transfers control to
routine MA$INT (in MAHANDLER). 

1. When MA$INT obtains control, it removes the value pushed onto the 
stack by the JSB instruction in the ADP and uses it to determine the ad- 
dress of the MA780's ADP. 

2. It uses fields in the ADP to locate adapter registers in the MA780 and to 
determine which port requested an interrupt (and what kind of interrupt 
was requested). 

3. If the interrupt is for a processor being connected to the memory, the 
interrupt is dismissed by restoring R0 to R5 and issuing an REI.

4. Otherwise, MA$INT services the interrupt. 

5. Finally, the interrupt is dismissed by restoring R0 to R5 and issuing an
REI. 

5.2.6 MA780 Interrupts on the VAX-11/782

The VAX-11/782 multiprocessing system uses interrupts from the MA780 to
allow the processors to interrupt one another. Thus, the MA780 interrupts
must be handled somewhat differently on the VAX-11/782.

When the multiprocessing code is loaded, the MA780 interprocessor inter- 
rupt vectors in the primary processor's SCB are redirected to point to a multi- 



114 



5.3 Connect-to-Interrupt Mechanism 

processing MA780 interrupt routine (only for the first MA780). The interrupt 
routine serves interrupts from the secondary processor. A new SCB is created 
in nonpaged pool for the secondary processor. The new SCB contains vectors 
that point to multiprocessing MA780 interrupt routines for the secondary 
processor. The interprocessor interrupt vector for the remaining MA780s is 
pointed to an unexpected interrupt handler. 

When multiprocessing code is loaded, the operating system debugger 
(XDELTA) is moved from interrupt vector 5 to interrupt vector 15. Interrupt 
vector 5 is used for the multiprocessing rescheduling routine. 

For more information on the VAX-11/782 multiprocessing system, see
Chapter 28. 



5.3 CONNECT-TO-INTERRUPT MECHANISM 

The connect-to-interrupt mechanism allows a process to be notified of a 
UNIBUS device interrupt by the delivery of an AST, by the setting of an event 
flag, or both. The process can also specify an interrupt service routine that 
will respond to device interrupts. 

A suitably privileged process (with CMKRNL and PFNMAP privileges) can 
respond to an interrupt by reading or writing device registers and, possibly, by 
initiating further device activity. However, in order to directly manipulate 
device registers, the process must first map the UNIBUS I/O page(s) contain- 
ing the registers for the device into its own process space (P0 or P1). The
VAX/VMS Real-Time User's Guide contains a discussion of mapping the 
UNIBUS I/O page and using the connect-to-interrupt capability. Chapter 16 
of this book contains more detailed information on how the mapping is actu- 
ally performed. 

Note that the physical addresses of the UNIBUS I/O page differ among the 
VAX-11/730, VAX-11/750, and VAX-11/780. Therefore, different PFNs must
be used when mapping the UNIBUS I/O page. The details of mapping to the 
I/O page are described in the VAX/VMS Real-Time User's Guide. Appendix B 
contains a list of symbols defined by the $IO730DEF, $IO750DEF, and 
$IO780DEF macros to make this mapping as symbolic as possible. 

The connect-to-interrupt facility is an extension of the interrupt dispatch- 
ing scheme. In order to use it, the connect-to-interrupt driver (CONINTERR) 
must be associated with the interrupt vector. The association is made using 
the SYSGEN command CONNECT, specifying all of the following: 

• A name for the device (to be used by the process that connects to the 
interrupt) 

• The address of the device 

• The interrupt vector at which the device generates interrupts 

• The CONINTERR driver, which initially responds to the device interrupts 



[Figure 5-7: Extending the Interrupt Dispatch Mechanism with the
Connect-to-Interrupt Facility. The JSB in the device CRB, followed by the
IDB address, enters the CONINTERR interrupt service routine. That routine
transfers control, with a JSB or CALL as requested by the user, to the
user-supplied interrupt service routine, if one was supplied; the user routine
responds to the interrupt in a device-dependent fashion and exits with an
RSB (or RET). The CONINTERR interrupt service routine then requests
delivery of an AST to the process or sets an event flag, if so desired by the
user, restores R0 through R5, and issues an REI to dismiss the interrupt.
Only the CONINTERR and user-supplied routines are specific to the
connect-to-interrupt driver; the rest of the dispatching is an explicit example
of the general UNIBUS interrupt dispatch scheme illustrated in Figure 5-3.]



When the device generates an interrupt, the normal UNIBUS interrupt dis- 
patching sequence is followed, as discussed in Sections 5.2.1 and 5.2.2. How- 
ever, the CONINTERR interrupt service routine transfers control to the 
user-supplied interrupt service routine (if one was supplied) using a JSB or 
CALL instruction (as requested by the user). This transfer is illustrated in 
Figure 5-7. When the user-supplied interrupt service routine issues an RSB (or 
RET), the CONINTERR interrupt service routine regains control. Before re- 
storing R0 to R5 and issuing an REI, the CONINTERR interrupt service rou-
tine queues an AST to the process (if requested) to notify the process that an 
interrupt has occurred (via the AST, or by setting an event flag). 

In order for the process-supplied interrupt service routine to be accessible 
to the CONINTERR interrupt service routine, the CONINTERR driver dou- 
ble-maps the user routine into system address space. The double mapping 
requires enough system page table entries (reserved by the REALTIME_SPTS
SYSBOOT parameter) to map the user-supplied routines (other driver rou- 
tines besides an interrupt service routine may be specified when connecting 
to an interrupt). When the process disconnects from the interrupt, the SPTEs 
used to map the routines for that process are made available for later use by 
other processes. 






6 Software Interrupts 



Noise is the most impertinent of all forms of interruption. It is 
not only an interruption, but also a disruption of thought. 
— Schopenhauer, Studies in Pessimism: On Noise 

The software interrupt mechanism that is provided as an integral part of the 
VAX architecture is relied on heavily by the VAX/VMS operating system for 
several purposes. The scheduler is invoked as a software interrupt service 
routine. Software interrupts provide device drivers a clean method for lower-
ing IPL. Several I/O completion routines run as software interrupt service 
routines. This chapter first describes the general software interrupt mecha- 
nism and then lists several uses of software interrupts in the VAX/VMS oper- 
ating system. 



6.1 THE SOFTWARE INTERRUPT 

A software interrupt is actually a hardware mechanism, similar to an inter- 
rupt generated by an external device. It causes a PC/PSL pair to be pushed 
onto an appropriate stack (usually the interrupt stack) and passes control to 
an interrupt service routine whose address is stored in the system control 
block. Like hardware interrupts, the VMS operating system interprets soft- 
ware interrupts as system-wide events that are serviced independently of the 
context of a specific process. The AST interrupt, discussed briefly at the end 
of this chapter and in greater detail in Chapter 7, is the only variation from 
this sequence of events. 

The big difference between software interrupts and hardware interrupts, 
and the reason for the name, is that software interrupts are generated by an 
explicit request from software. The typical software interrupt request occurs 
as the result of a hardware interrupt or within another software interrupt 
service routine. However, there are examples within the VMS operating sys- 
tem of software interrupts being issued from code executing in process con- 
text. 



6.1.1 Hardware Mechanism of Software Interrupts 

The VAX architecture provides 15 software interrupt levels, from IPL 15 
down to IPL 1. There are 15 entries in the system control block (SCB) for 
addresses of software interrupt service routines, one for each IPL level. A 
software routine (usually a hardware or software interrupt service routine)
requests a software interrupt at a given IPL level by writing the desired IPL
value into the privileged register Software Interrupt Request Register 
(PR$_SIRR). Writing to this register causes a bit in the Software Interrupt 
Summary Register (PR$_SISR) to be set. The bit in the SISR is cleared when 
the interrupt is finally taken. The layout of these two processor registers is 
pictured in Figure 6-1. All software interrupt requests in the VMS operating 
system use the SOFTINT macro to write the SIRR. This macro expands into 
the following instruction: 

.MACRO  SOFTINT IPL
MTPR    IPL, S^#PR$_SIRR
.ENDM   SOFTINT

The usual situation in the VMS operating system is that the requested IPL 
level is less than or equal to the current IPL (as determined by PSL <20:16>).
In this case, the interrupt is deferred until the IPL drops below the requested 
level. The deferral of pending software interrupts based on current IPL is 
exactly the way that pending hardware interrupts are treated. This lowering 
of IPL usually occurs as the result of an REI instruction but could also occur if 
privileged code directly altered the current IPL by writing to the PR$_IPL 
register (with the SETIPL or the ENBINT macros, described in Chapter 2). 

If the requested IPL value is higher than the level at which the processor is 
currently running, then the interrupt service routine whose address is in the 
selected slot in the SCB is entered immediately. (This is the same way that 
pending hardware interrupts are treated.) 

[Figure 6-1: Content of the Software Interrupt Request Register and the
Software Interrupt Summary Register. PR$_SIRR is write only; the value
written selects the level of the software interrupt to request. PR$_SISR is
read/write; bits <15:1> record pending software interrupts at IPLs F through
1 (hex), and bits <31:16> and bit 0 must be zero.]

There are a few occurrences in the VMS operating system of a software
interrupt request at an IPL level greater than that at which the processor is
currently running. For example, device driver FDT routines may signal com- 
pletion by calling the routines EXE$FINISHIO or EXE$FINISHIOC. These 
routines execute at IPL 2 and terminate by requesting the I/O postprocessing 
software interrupt at IPL 4. In this case, the interrupt is taken immediately. 
The file system ACP uses the same technique to signal I/O completion for 
requests in which it was involved. 
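
For instance, a minimal request for this interrupt is a single invocation of
the macro shown in Section 6.1.1; the symbol IPL$_IOPOST is assumed here
to be the symbolic name for IPL 4:

        SOFTINT #IPL$_IOPOST    ; expands to MTPR #IPL$_IOPOST, S^#PR$_SIRR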



6.1.2 Software Interrupt Service Routines 

There are several features about the use of software interrupts in the VMS 
operating system that are independent of the purposes of individual interrupt 
service routines. Some of these are dictated by the particular way that soft- 
ware interrupts are treated in the hardware. 

Because the VAX architecture supplies no mechanism for determining how 
many times a software interrupt has been requested before it is taken, soft- 
ware must supply some protocol for determining this number. The VMS op- 
erating system uses queues (doubly linked lists manipulated by the INSQUE 
and REMQUE instructions) for this purpose. In general, each queue element 
represents a specific operation that must be performed. The use of queues, 
particularly the use of the INSQUE and REMQUE instructions, allows other 
optimizations to be made. 

• The software interrupt service routine can use the information provided by 
condition code settings, this time as the result of executing a REMQUE 
instruction. That instruction returns the V-bit set if the queue was empty 
before the instruction began execution, an indication that the work of this 
particular interrupt service routine is complete. 

• By coding software interrupt service routines so that they keep removing
work list elements from a queue until there is no more work to do, it is
possible to simply ignore spurious software interrupt requests. In fact, all
of the software interrupt service routines in the VMS operating system,
including those that do not use queues, handle interrupts correctly even in
the event of spurious interrupt requests (see the sketch following this list).
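
The canonical work loop implied by these two optimizations can be
sketched as follows. This is not VMS source; WORKQUE names a
hypothetical queue listhead:

IOPOST: REMQUE  @WORKQUE, R5    ; remove the packet at the head of the
                                ;  queue; its address is left in R5
        BVS     20$             ; V set: queue was empty, work is done
        ...                     ; process the work packet in R5
        BRB     IOPOST          ; look for more work
20$:    REI                     ; dismiss the interrupt; a spurious
                                ;  request falls through immediately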

6.2 SOFTWARE INTERRUPT LEVELS IN THE VAX/VMS OPERATING SYSTEM

The VMS operating system uses the software interrupt mechanism for sev- 
eral purposes. 

• Mount verification cancellation executes above driver fork IPL and below 
device IPL so that DMA operations will work, yet drivers cannot interfere 
with the device data structures. 

• Device drivers use forks in order to execute at an IPL below device IPL. 






Table 6-1: Software Interrupt Levels Used by the Executive

IPL     Use                                     Stack
15      XDELTA on VAX-11/782                    Interrupt
14-13   Unused                                  Interrupt
12      Mount Verification Cancellation         Interrupt
11      IPL=11 Fork Dispatching                 Interrupt
10      IPL=10 Fork Dispatching                 Interrupt
9       IPL=9 Fork Dispatching                  Interrupt
8       IPL=8 Fork Dispatching                  Interrupt
7       Software Timer Service Routine          Interrupt
6       IPL=6 Fork Dispatching                  Interrupt
5       Used to Enter XDELTA; also              Interrupt
        Scheduling on VAX-11/782
4       I/O Postprocessing                      Interrupt
3       Rescheduling Interrupt                  Kernel
2       AST Delivery Interrupt                  Kernel
1       Unused                                  na



• The software timer service routine performs timer operations that would 
bog the system down (because I/O device interrupts are blocked) if they 
were performed at IPL 24, the level at which the hardware clock interrupts. 

• The need for I/O postprocessing can be flagged by device driver interrupt 
service routines but the actual processing deferred while another pending 
I/O request is started. 

• Rescheduling, the removal of the current process from execution and the 
selection of a new process for execution, is implemented as a software 
interrupt service routine. 

• The AST delivery interrupt is the only software interrupt that is treated as 
a process-specific interrupt rather than a system-wide event. 

Table 6-1 lists all the software interrupt levels used by the VAX/VMS operat- 
ing system. 



6.2.1 Mount Verification Cancellation 

If a Files-11 volume is mounted in a drive, and the corresponding device
driver generates one of a select set of errors, mount verification is invoked. 
Mount verification allows the system to recover gracefully from certain er- 
rors, rather than wait indefinitely or report a bugcheck. While mount verifi- 
cation is in progress on a particular device, no other requests will be serviced 
by the ACP associated with that device. 




If the device undergoing mount verification uses the same ACP as the sys- 
tem disk, mount verification can effectively stall the system until the mount 
verification either completes or times out. This stall can occur because the 
ACP will not service any other requests. 

In order to abort mount verification, an IPL 12 interrupt must be requested
from the console terminal. The interrupt service routine that services the IPL
12 interrupt responds with the following prompt:

IPC>

At this point, commands can be issued to cancel mount verification or enter
XDELTA. More information about canceling mount verification can be found
in the VAX/VMS System Management and Operations Guide.



6.2.2 Fork Processing 

Another use of software interrupts is found in the mechanism called fork 
processing employed by device drivers. The interrupt nesting scheme defined 
by the VAX architecture will not work correctly if an interrupt service rou- 
tine lowers IPL below the level at which the interrupt occurred. However, 
device driver interrupt service routines, initially entered or invoked at device 
IPL (typically 20 to 23 decimal), often must perform lengthy processing that 
does not require device interrupts to be blocked, the usual reason for main- 
taining high IPL. Some mechanism is required to allow device drivers to 
lower IPL without destroying the interrupt nesting scheme. 

Several IPL values (6, and 8 to 11) and their associated SCB slots are used by
device drivers to allow them to continue their execution at lower IPL, as 
so-called fork processes. There are also six quadword listheads associated 
with the fork IPLs. (Because IPL 7 software interrupts are used by the soft- 
ware timer, this listhead is not used by the fork processor but merely serves 
as a place saver so that context indexed addressing can be used by the fork 
processor and the fork dispatcher with the IPL value as an index.) The queue 
elements that describe each individual operation that must be performed at 
lower IPL are called fork blocks and are used to pass context between driver 
interrupt service routines and the fork level software interrupt service rou- 
tines. A fork block (pictured in Figure 6-2) is often part of a larger structure 
such as a unit control block. 

When a driver must lower its IPL (by creating a fork process), it calls rou- 
tine EXE$FORK with R5 containing the address of the fork block. That rou- 
tine saves the driver context (R3, R4, and saved PC) in the fork block, inserts 
the fork block into the appropriate fork queue, and requests a software inter- 
rupt at the requested IPL level. The actual instructions in routine EXE$FORK 
that perform these functions are listed here to illustrate how work queues 
and software interrupt requests are managed. 



Figure 6-2
Layout of Fork Block: the fork queue forward link and the fork queue
backward link, followed by a longword containing the fork IPL, type, and
size fields, and then the saved PC, the saved R3, and the saved R4.

EXE$FORK::
        MOVQ    R3,FKB$L_FR3(R5)        ; Save R3 and R4 in the fork block
        POPL    FKB$L_FPC(R5)           ; Save caller's return PC as fork PC
        MOVZBL  FKB$B_FIPL(R5),R4       ; Get fork IPL from the fork block
        MOVAQ   W^SWI$GL_FQFL-<6*8>[R4],R3 ; Address of fork queue listhead
                                        ;  (quadword listheads indexed by IPL)
        INSQUE  (R5),@4(R3)             ; Insert fork block at tail of queue
        SOFTINT R4                      ; Request interrupt at fork IPL
        RSB



The fork dispatcher, which is the software interrupt service routine that exe- 
cutes in response to the requested interrupt, executes the following sequence 
of instructions (or a sequence much like it), which removes each queue ele- 
ment in turn from the associated queue and processes it. This processing 
continues until the queue is empty, at which time the software interrupt is 
dismissed with an REI. R6 is loaded with the address of the fork queue lis- 
thead before this sequence is executed. 



        .ALIGN  LONG
EXE$FORKDSPTH::
        PUSHL   R5                      ; Save registers used by fork processes
        PUSHL   R4
        PUSHL   R3
        PUSHL   R2
        PUSHL   R1
        PUSHL   R0
        REMQUE  @(R6),R5                ; Remove first fork block from queue
        BVS     50$                     ; Queue was empty; dismiss interrupt
10$:    MOVQ    FKB$L_FR3(R5),R3        ; Restore saved R3 and R4
        JSB     @FKB$L_FPC(R5)          ; Resume fork process at its saved PC
        REMQUE  @(R6),R5                ; Remove next fork block
        BVC     10$                     ; Continue until queue is empty
50$:    POPR    #^M<R0,R1,R2,R3,R4,R5,R6> ; (R6 was saved when it was loaded
                                        ;  with the listhead address)
        REI



6.2.3 Software Timer 

Most of the timer operations in the VMS operating system execute in re- 
sponse to a software interrupt at IPL 7. These operations are described in 
detail in Chapter 11. The use of software interrupts by the timer support 
routines is described here. 

When the hardware clock interrupt service routine (executing at IPL 24) 
determines that further service is required (due to quantum expiration or 
because the first element in the timer queue has come due), it requests a 
software interrupt at IPL 7 (IPL$_TIMER). Unlike the fork queue described in 
the previous section, timer queue elements (TQEs) are not placed into the 
timer queue by an interrupt service routine. Rather, they are usually placed 
there by one of the timer-related system services (such as $SETIMR or 
$SCHDWK). The key to the timer queue is that the queue elements are or- 
dered by expiration time so that only the first TQE has to be examined by the 
hardware clock service routine. 

The software interrupt service routine rechecks for quantum expiration 
and takes action if necessary. After any required quantum end processing has 
occurred, the software timer service routine examines the timer queue for 
any timer requests that have expired. Any timer queue element that has an 
expiration time earlier than the current system time is then removed from 
the timer queue and serviced. Because of the time ordering of the timer 
queue, this removal takes place from the beginning of the list. When no more 
expired timer queue elements remain (the expiration time of the first TQE in 
the queue is later than the current system time), the software interrupt is 
dismissed. Note that a second difference between this software interrupt 
service routine and fork processing is that the software timer service routine 
may leave timer queue elements (the ones that have not yet expired) in the 
queue when it dismisses the interrupt. For more information on timers and 
timer queues, see Chapter 11. 
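The expiration test itself is a comparison of two quadword times. A minimal sketch of the test applied to the TQE at the head of the queue (assuming the timer queue listhead is at EXE$GL_TQFL, the current system time is in EXE$GQ_SYSTIME, and each entry's expiration time is in its TQE$Q_TIME field; register usage and labels are illustrative):

        MOVL    W^EXE$GL_TQFL,R5        ; Get first TQE in the timer queue
        CMPL    TQE$Q_TIME+4(R5),W^EXE$GQ_SYSTIME+4
        BGTR    20$                     ; High-order longwords: not yet due
        BLSS    10$                     ; Expired
        CMPL    TQE$Q_TIME(R5),W^EXE$GQ_SYSTIME
        BGTRU   20$                     ; Low-order longwords: not yet due
10$:    ; ... remove this TQE from the queue and service it ...
20$:    ; ... no more expired entries; dismiss the interrupt ...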



6.2.4 I/O Postprocessing 

When a device driver or FDT routine detects that a particular I/O request is 
complete, it calls a routine that places the I/O request packet (pointed to by 
R3) at the tail of the I/O postprocessing queue (located through global pointer 
IOC$GL_PSBL) and requests a software interrupt at IPL 4 (IPL$_IOPOST) if 
the queue was previously empty. The following instructions (from routine 






IOC$REQCOM in module IOSUBNPAG) show the similarities between the 
software interrupt requests for fork processing and I/O postprocessing. (Other 
routines that request an IPL$_IOPOST software interrupt, $QIO completion 
code and ACP routines, execute similar instructions.) 



        INSQUE  (R3),@W^IOC$GL_PSBL     ; Insert IRP at tail of postprocessing queue
        SOFTINT #IPL$_IOPOST            ; Request IPL 4 software interrupt



The I/O postprocessing software interrupt service routine removes each IRP 
in turn from the beginning of the queue (located through global pointer 
IOC$GL_PSFL) and processes it. When the queue is empty, the IPL 4 soft- 
ware interrupt is dismissed. The similarities between fork processing and I/O 
postprocessing are also found in their respective software interrupt service 
routines. The following instructions from module IOCIOPOST illustrate 
these similarities. 



IOC$IOPOST::
        MOVQ    R4,-(SP)                ; Save R4 and R5
        MOVQ    R2,-(SP)                ; Save R2 and R3
        MOVQ    R0,-(SP)                ; Save R0 and R1
IOPOST: REMQUE  @W^IOC$GL_PSFL,R5       ; Remove next IRP from queue
        BVC     10$                     ; Queue not empty; process this packet
        MOVQ    (SP)+,R0                ; Queue empty; restore registers
        MOVQ    (SP)+,R2
        MOVQ    (SP)+,R4
        REI                             ; Dismiss the IPL 4 interrupt

10$:    .
        .                               ; Complete processing of
        .                               ; this request
        BRB     IOPOST                  ; Look for another packet


6.2.5 Rescheduling Interrupt 

The routine that removes a process from execution and selects the highest 
priority process for execution is invoked as a software interrupt service rou- 
tine at IPL 3 (IPL$_SCHED) by the routine that makes a process computable. 
Whenever the state of a resident process becomes computable and its priority 
is greater than or equal to the priority of the current process, this software 
interrupt is requested. Because several processes could all become computa- 
ble at effectively the same time, there could be multiple requests for this 
software interrupt service routine. 

The rescheduling interrupt is not totally independent of process context 
like the fork processing and I/O postprocessing interrupts. The SCB entry for 




this interrupt indicates that it should be serviced on the kernel stack (see 
Table 6-1). In fact, its first operation is to remove the current process from 
execution with a SVPCTX instruction. However, that instruction performs a 
stack switch from the kernel stack to the interrupt stack so the rest of the 
rescheduling interrupt service routine is performed in system context. The 
operation of the scheduler, including a detailed description of the reschedul- 
ing interrupt, is discussed in Chapter 10. 

Unlike fork processing or I/O postprocessing requests, there is no need to 
count requests for the rescheduling interrupt, because only one process can 
become current at a given time. The software priorities of the computable 
processes determine which of them is chosen for execution. The scheduler 
will select the process with the highest software priority. The rest of the 
processes will remain in the computable state until some system event oc- 
curs that alters the scheduling balance of the system and causes one of these 
processes to be selected for execution. For example, if a higher priority proc- 
ess were to become computable, an IPL 3 software interrupt would be re- 
quested. (If the current process were to enter a wait state, a different path is 
taken through the scheduler, one that bypasses the software interrupt request 
and executes the code contained in the second half of the rescheduling inter- 
rupt service routine.) 



6.2.6 AST Delivery Interrupt 

The software interrupt that indicates that there is an AST to deliver differs in 
several respects from the other software interrupts. 

• The AST delivery interrupt is associated with a specific process and is 
serviced on the kernel stack of that process. 

• The interrupt request is made in two steps. Routines that recognize that 
there is an AST that can be delivered to a process indicate that by writing 
the access mode associated with the AST into a per-process privileged reg- 
ister called the AST level register (PR$_ASTLVL). The REI instruction
compares the contents of this register with the access mode that it is re- 
storing to determine whether to request an IPL 2 software interrupt. 

• As this mechanism suggests, IPL 2 software interrupts have a second di- 
mension associated with them, namely access mode. 

The use of ASTs in the VMS operating system is so important and complex 
that it is described in a separate chapter (Chapter 7). 






AST Delivery 



There's absolutely no reason for being rushed along with the 
rush. Everybody should be free to go very slow. . . . What you 
want, what you're hanging around in the world waiting for, is for 
something to occur to you. 
— Robert Frost 

Asynchronous system traps (ASTs) are a mechanism for signaling asynchro- 
nous events to a process. Specifically, a procedure (or routine) designated by 
either the process or the system executes in the context of the process. ASTs 
are created in response to system services such as $QIO, $SETIMR, and 
$DCLAST. Additionally, unrequested ASTs occur as implicit results of other 
operations such as I/O completion, process suspension, and obtaining infor- 
mation about another process with the Get Job/Process Information 
($GETJPI) system service. The reason that ASTs are used for these operations 
is that it is necessary for code to execute in the context of a specific process. 
ASTs fulfill this need. 

AST enqueuing is a system event that may result in a rescheduling inter- 
rupt. AST delivery occurs in the context of the process that is to actually 
receive the AST. This chapter discusses how ASTs are enqueued and deliv- 
ered to a process. Several examples of how ASTs are used by the VMS operat- 
ing system are also included. 

7.1 HARDWARE ASSISTANCE TO AST DELIVERY 

The delivery of ASTs is an example of the VAX hardware providing assistance 
to the VMS operating system. Three hardware components or mechanisms 
contribute to AST delivery: 

• The REI instruction 

• The PR$_ASTLVL processor register 

• The IPL 2 software interrupt 

The first two features are discussed in this section. The IPL 2 interrupt 
service routine, ASTDEL, is discussed in Section 7.3. 

7.1.1 REI Instruction 

The return from exception or interrupt instruction, REI, provides the
initial step in the delivery of an AST to a process. Among the operations 
performed by the REI microcode are the following. 




1. A check is made to determine which stack will be active after the return. 
No ASTs are delivered if the interrupt stack is active. 

2. The value in the AST level processor register, PR$_ASTLVL, is compared 
with the access mode to which control is being passed. If the destination 
access mode number is less than the value in PR$_ASTLVL (that is, more 
privileged), no ASTs can be delivered. 

3. If the interrupt stack is not going to be used and the access mode number 
is greater than or equal to the PR$_ASTLVL value, then an AST can be 
delivered. The REI instruction microcode requests a software interrupt at 
IPL 2. (Note that the requested IPL 2 interrupt will not actually be granted 
until the IPL drops below 2.) The IPL 2 software interrupt service routine 
is found at global location SCH$ASTDEL (see Section 7.3). 

7.1.2 ASTLVL Processor Register (PR$_ASTLVL) 

The processor register, PR$_ASTLVL, is a per-process hardware register indi- 
cating the deliverability of ASTs to the current process. PR$_ASTLVL is part 
of the hardware context of the process (loaded by LDPCTX) and is recorded in 
the hardware process control block (see Chapter 10). PR$_ASTLVL can con- 
tain the following values: 

0 A kernel mode AST is deliverable.

1 An executive mode AST is deliverable. 

2 A supervisor mode AST is deliverable. 

3 A user mode AST is deliverable. 

4 No AST is deliverable. 

Thus, if multiple ASTs are deliverable, PR$_ASTLVL contains the access
mode value for the AST that has the innermost access mode. The null value 
of four is chosen so that the REI test, described above, will fail, regardless of 
the destination access mode of the REI instruction. If the access mode of the 
deliverable AST is at least as privileged as the destination access mode of the 
REI instruction, the AST delivery interrupt will be requested. 
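For example, if PR$_ASTLVL contains 1 (an executive mode AST is deliverable), an REI whose destination is executive, supervisor, or user mode (access mode 1, 2, or 3) requests the IPL 2 interrupt, while an REI returning to kernel mode (access mode 0) does not.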

7.2 QUEUING AN AST TO A PROCESS 

ASTs are queued to a process as the corresponding events (I/O completion, 
timer expiration, and so on) occur. The AST queue is maintained as a list 
structure of AST control blocks (ACBs) with the listhead contained in the 
software process control block (PCB) (see Figure 7-1). 

7.2.1 AST Control Block 

The AST control block (ACB) contains the following information necessary 
to deliver an AST to a process: 



• The process identification and AST routine address

• The correct access mode

• The appropriate parameter to pass to the routine

Figure 7-1
AST Control Block and AST Queue in Software PCB: the software PCB
contains the AST queue listhead (ASTQFL and ASTQBL), the AST quota
field (ASTCNT), and the per-access-mode bit fields ASTEN and ASTACT.
Each ACB in the queue contains its own ASTQFL and ASTQBL links; the
RMOD, TYPE, and SIZE fields; the PID of the target process; the AST
routine address (AST); the AST parameter (ASTPRM); and the KAST field.
Within the RMOD byte, bits <1:0> are MODE, bits <3:2> are spare, bit <4>
is PKAST, bit <5> is NODELETE, bit <6> is QUOTA, and bit <7> is KAST.

The ACB is allocated from nonpaged dynamic memory before the queuing 
of an AST to a process is requested. 

Figure 7-1 shows the format of an AST control block and the relevant soft- 
ware PCB fields. ACB$L_ASTQFL and ACB$L_ASTQBL link the ACB into 
the AST queue for the process. The listhead of this queue is the pair of 
longwords PCB$L_ASTQFL and PCB$L_ASTQBL. The field ACB$B_RMOD
provides five types of information. 

1. Bits <0:1> (ACB$V_RMOD) contain the value corresponding to the ac- 
cess mode in which the AST routine is to execute. 

2. Bit <4> (ACB$V_PKAST) indicates the presence of a piggyback special 
kernel mode AST (see Section 7.2.4). 




3. Bit <5> (ACB$V_NODELETE) indicates that the ACB should not be 
deallocated after the AST is delivered. Typically this bit indicates that the 
ACB is a portion of a larger structure. 

4. Bit <6> (ACB$V_QUOTA) indicates whether the allocation of the data 
structure is accounted for in the process AST quota, PCB$W_ASTCNT. 

5. Bit <7> (ACB$V_KAST) indicates the presence of a special kernel mode 
AST (see Sections 7.2.3 and 7.4). 

ACB$L_PID identifies which process is to receive the AST. ACB$L_AST 
and ACB$L_ASTPRM are the entry point of the designated AST routine and 
the AST parameter, respectively. ACB$L_KAST contains the entry point of a 
system-requested special kernel mode AST routine if the ACB$V_PKAST or 
ACB$V_KAST bit of ACB$B_RMOD is set (items 2 and 5 above). 
ACBs can be created by three types of action. 

1. The process explicitly declares an AST. The $DCLAST system service 
simply allocates an ACB, fills in the ACB information from its argument 
list, and requests the queuing of the ACB. The following checks are made 
before the ACB is queued: 

• The AST quota for the process is checked to make sure it is not ex- 
ceeded by the request. 

• The access mode in which the AST routine is to execute is checked to 
make sure that it is no more privileged than the access mode from 
which the system service was called. 

The ACB$V_QUOTA bit is set to indicate that this AST is counted
against the process AST quota, PCB$W_ASTCNT.

2. The process requests an AST to be associated with an event such as the 
completion of a request (I/O or update section, lock management, or timer 
requests). System services such as these have arguments that include an 
AST routine entry point and an AST parameter. The delivery of an AST is 
accounted for in the PCB$W_ASTCNT field. The control block (ACB) is 
actually a reuse of the I/O request packet (IRP), lock block (LKB), or timer 
queue element (TQE) used in the initial operation. (Compare the ACB 
format pictured in Figure 7-1 with the TQE format shown in Figure 11-1, 
the LKB format shown in Figure 13-1, or the IRP layout shown in the 
VAX/VMS Guide to Writing a Device Driver.) 

3. The system, or another process, can request an AST to execute code in the 
context of the selected process. Examples of this type of action include I/O 
completion, Get Job/Process Information system service executed from 
another process, Forced Exit system service, expiration of CPU time 
quota, and working set adjustment as part of the quantum end event (see 






Chapter 10). AST control blocks used in these situations are not deducted 
from the AST quota of the target process because of their involuntary
nature.



7.2.2 Access Mode and AST Queuing 

The ACB$V_RMOD bits of the ACB$B_RMOD field determine the inser- 
tion position of an AST control block when it is queued to a process. The 
AST queue is maintained as a first-in/first-out (FIFO) list for each access 
mode. ASTs of different access modes are placed into the queue in ascending 
access mode order, that is, kernel mode ASTs first and user mode ASTs last. 
Special kernel mode ASTs precede normal kernel mode ASTs. 

When the subroutine SCH$QAST (in module ASTDEL) is invoked, the pre-
allocated and preinitialized AST control block is inserted into the AST queue 
of the appropriate process at IPL$_SYNCH. The following steps are then
performed. 

1. If the process is nonexistent, the ACB is deallocated and the AST event is 
ignored. An error status code is returned. 

2. If the AST queue is empty (the contents of PCB$L_ASTQFL are equal to 
its address), the ACB is inserted as the first element in the AST queue. 

3. Otherwise, the queue elements (ACBs) are scanned until either the end of 
the queue is reached or an ACB is found with an access mode less privi- 
leged than the one being inserted (that is, the ACB$V_RMOD value is 
higher). The new AST control block is inserted at this point. Thus, ASTs 
are first-in/first-out within an access mode and grouped by access mode in 
decreasing amount of privilege. User mode ASTs are always placed at the 
tail of the queue. 
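For example, if the queue already contains a special kernel mode ACB followed by a normal kernel mode ACB and an executive mode ACB, a newly queued supervisor mode ACB is inserted after the executive mode ACB, while a newly queued kernel mode ACB is inserted between the existing kernel mode ACB and the executive mode ACB.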

7.2.3 Special Kernel Mode ASTs 

Special kernel mode ASTs represent a fifth type of AST. They are maintained 
as a separate group in the AST queue. Special kernel mode ASTs are indicated 
by the ACB$V_KAST bit of the ACB$B_RMOD field. Insertion of a special 
kernel mode AST will occur after any previous special kernel mode ASTs, 
but before any normal ASTs of any access mode (including kernel). The orga- 
nization of the AST queue is shown in Figure 7-2. 

Section 7.4 discusses special kernel mode ASTs more fully and provides 
several examples. 

Figure 7-2
Organization of the AST Queue: the AST queue listheads in the software
PCB lead to the ACBs in queue order: special kernel mode ASTs first,
followed by normal kernel, executive, supervisor, and user mode ASTs.

7.2.4 Piggyback Special Kernel Mode ASTs

Piggyback special kernel mode ASTs (PKASTs) are a new form of AST deliv-
ery used in VAX/VMS Version 3. PKASTs allow a special kernel mode AST to
ride piggyback in the ACB$L_KAST field of a normal mode AST. Piggyback
special kernel mode ASTs are inserted in the AST queue according to the
mode of the normal mode AST on which they ride.

When the normal AST becomes deliverable, the information in the ACB is 
saved and the special kernel mode AST is delivered first. When the special 
kernel mode AST returns, the normal mode AST is called. 

There are several reasons for using piggyback special kernel mode ASTs:

1. It is faster to deliver two ASTs from one interrupt than to deliver two 
ASTs separately. 

2. There are times when delivering an AST requires some additional work in 
kernel mode in the context of the calling process. Piggyback special kernel 
mode ASTs reduce the work involved in this operation. 

The lock manager uses piggyback special kernel mode ASTs to load the 
fields of the caller's lock status block and lock value block. In order to 
copy the information from the lock manager's database to the caller's 
process space, a piggyback special kernel mode AST is required. 

3. A piggyback special kernel AST can be used to queue other normal mode 
ASTs to a process. The lock manager uses this feature to deliver both 
blocking and completion ASTs to one process. The terminal driver uses 
piggyback special kernel mode ASTs to requeue out-of-band ASTs (thus 
making them repeating). 



7.2.5 Computation of a New Value for ASTLVL 

An AST can be enqueued to a process at any time, because the software PCB 
and the AST control blocks are neither paged nor swapped. Each time an AST 
control block is inserted into the queue, the assignment of a value to 
ASTLVL (processor register and hardware PCB field) is attempted. However, 
the process can be in any one of three possible situations that determine to 
what degree the state of the AST queue can be updated. 

• If a process is outswapped, the ASTLVL cannot be updated because the 
process header (including the hardware process control block) is not availa- 
ble. When the process becomes resident and computable at a later time, 
ASTLVL will be calculated by the swapper (by invoking SCH$NEWLVL in 
module ASTDEL). 

• If the process is memory resident but not currently executing, the new 
value for ASTLVL will be recorded in the hardware PCB field but not in the 
processor register. 

• If the process is currently executing, the new ASTLVL value will be stored 
in both the hardware PCB field and the processor register, PR$_ ASTLVL. 




The ASTLVL value indicates the deliverability and access mode of the first 
pending AST in the queue. There is no indication of the deliverability of any 
other pending ASTs. ASTLVL is calculated in the following steps: 

• If the AST queue is empty, ASTLVL is set to 4. 

• If the AST queue is not empty and the first ACB is for a special kernel 
mode AST (see Sections 7.2.3 and 7.4), then ASTLVL is set to 0. 

• If the AST queue is not empty and the first ACB is for a normal mode AST, 
ASTLVL is set to the access mode of that ACB (the value contained in 
RMOD). 
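A minimal sketch of this computation, modeled on SCH$NEWLVL (register usage and labels are illustrative; the field and bit names are those shown in Figure 7-1, with the PCB address assumed to be in R4):

        MOVAL   PCB$L_ASTQFL(R4),R1     ; Address of AST queue listhead
        MOVL    (R1),R0                 ; Forward link: first ACB, if any
        MOVL    #4,R2                   ; Assume empty queue: ASTLVL = 4
        CMPL    R0,R1                   ; Empty (listhead points to itself)?
        BEQL    20$                     ; Yes: done
        CLRL    R2                      ; Assume special kernel AST: ASTLVL = 0
        BBS     #ACB$V_KAST,ACB$B_RMOD(R0),20$
        EXTZV   #0,#2,ACB$B_RMOD(R0),R2 ; Normal AST: ASTLVL = mode of ACB
20$:    ; ... store R2 into the hardware PCB and, if the process is
        ;     currently executing, into PR$_ASTLVL ...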

7.3 DELIVERING AN AST TO A PROCESS 

An AST is delivered to a process when an REI instruction determines (from 
the destination access mode and the PR$_ASTLVL register) that a pending 
AST is deliverable (see Sections 7.1 and 7.2). A software interrupt is requested 
at IPL 2. The amount of time before the AST is actually delivered is depend- 
ent upon the interrupt activity of the system. When IPL finally drops below 
two, the AST delivery interrupt service routine will be executed. 

Note that a rescheduling interrupt at IPL 3 may be requested and granted, 
prior to the granting of the IPL 2 AST delivery interrupt request. Thus, it is 
possible for a spurious AST delivery interrupt to be granted in the context of 
a different process than was originally requested. Such spurious AST inter- 
rupts are detected and ignored. 

7.3.1 AST Delivery Interrupt

Routine SCH$ASTDEL (in module ASTDEL) is the IPL 2 interrupt service 
routine. Its function is to remove the first pending AST from the queue and 
execute the appropriate AST routine in the correct access mode. 
SCH$ASTDEL performs the following operations:

1. After raising the IPL to SYNCH, the first AST control block is removed 
from the AST queue of the process. If the queue was empty, the routine 
sets ASTLVL to 4 and exits with an REI instruction. This test detects 
spurious AST delivery interrupts. 

2. The removed ACB is tested for a special kernel mode AST (using 
ACB$V_KAST in ACB$B_RMOD). If the AST is a special kernel mode 
AST, a shortened sequence of steps occurs: 

a. IPL is dropped from SYNCH to IPL$_ASTDEL (IPL 2). 

b. The special kernel mode routine is executed by a JSB instruction with 
the ACB address in R5 and the PCB address in R4. 

c. On return from the special kernel mode routine, SCH$ASTDEL returns 
to step 1. 






3. If the AST removed from the queue is not a special kernel mode AST, then 
a check is made to confirm that the mode of the AST is at least as privi- 
leged as the destination of the REI instruction that initiated AST delivery. 
This test is accomplished by checking the saved PSL on the kernel stack. If 
the mode of the AST is not correct, the ACB is reinserted at the head of the 
queue and the routine exits through the REI instruction, setting the new 
ASTLVL; these tests detect spurious AST delivery interrupts. Similar 
checks are made for already active ASTs (PCB$B_ASTACT, which insures 
that an AST is not interrupted by another AST at the same access mode) 
and for disabled access modes (cleared bits in PCB$B_ASTEN indicate 
that the access mode that corresponds to the bit cannot receive ASTs). 

4. If the AST is deliverable, then the following operations are performed be- 
fore dispatching to the AST routine. 

a. The bit corresponding to the current access mode in PCB$B_ASTACT 
is unconditionally set. 

b. If the ACB is accounted for in the PCB$W_ASTCNT quota, then the 
count is incremented to show delivery of the AST and deallocation of 
the ACB to nonpaged pool. 

c. ASTLVL is recomputed because the removal of the first ACB alters the 
state of the AST queue. The new value of ASTLVL is the access mode of 
the current process plus one (the next outer mode). The access mode is 
calculated in this manner in order to prevent another AST interrupt 
when SCH$ASTDEL executes its REI to EXE$ASTDEL. ASTLVL is 
computed more precisely when the AST procedure is done, based on the 
access mode of the first ACB in the queue. 

d. IPL is dropped to ASTDEL. 

e. A kernel mode AST does not require changing access mode, and the 
appropriate stack is already active. For executive, supervisor, and user 
mode ASTs, however, the inactive stack pointer is obtained. 

f. An argument list (described in the next section) is built on the stack of 
the AST's access mode. 

g. For ASTs for the outer three access modes, a PC/PSL pair of longwords 
is built on the kernel stack. The stored PC is the location EXE$ASTDEL, 
the AST dispatcher. The stored PSL contains the access mode in which 
the AST is to be delivered in both its current mode and previous mode 
fields. 

h. If a piggyback special kernel mode AST is associated with the current 
AST, the special kernel mode AST routine is dispatched through a JSB 
instruction with the ACB address in R5 and the PCB address in R4. 
When the AST routine returns, processing continues with the next 
step. 

i. If a piggyback special kernel mode AST does not exist, the bit 




ACB$V_NODELETE is tested. If the bit is set, processing continues 
with the previous step; if the bit is not set, the ACB is deallocated and
returned to nonpaged dynamic memory. 

EXE$ASTDEL executes in the access mode of the AST. For kernel
mode, this merely requires dropping the IPL to zero. For the other ac- 
cess modes, transfer of control and change of access mode is accom- 
plished through an REI instruction, the only way to reach a less privi- 
leged access mode (see Figure 1-4). (The PC and PSL used by the REI 
instruction are described above in item 4g.) A CALLG instruction is 
executed, transferring control to the AST procedure, with the argument 
pointer (AP) pointing to the argument list. 



7.3.2 Argument List 

User-written ASTs are procedures, which means that they can be written in 
any language. The procedures must begin with an entry mask and return 
control to their caller (the AST dispatcher) with a RET instruction. 

Figure 7-3 shows the argument list passed to an AST procedure by the 
interrupt service routine, ASTDEL. The AST parameter is obtained from the 
ACB where it was initially stored by a system service such as $QIO, 
$SETIMR, or $DCLAST. The parameter was originally an argument to that 
system service. The interpretation of the AST parameter is dependent on the 
application. 

The general purpose registers, RO and Rl, are saved in the argument list 
because the procedure calling convention does not require that they be saved. 
The asynchronous nature of ASTs implies that the RO and Rl contents are 
unpredictable and cannot be destroyed. The registers are saved and restored 
by the AST delivery mechanism. 

The saved PC and PSL values are the register contents originally saved
when the IPL 2 interrupt was initiated by the hardware. The values are nor-
mally the pair that was about to be used by the original REI instruction re-
questing the AST delivery.

Figure 7-3
Argument List Passed to AST by Dispatcher: the argument pointer (AP)
locates a five-entry argument list containing, in order, the AST parameter
(ASTPRM), the saved R0, the saved R1, the saved PC, and the saved PSL.
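To make the convention concrete, here is a minimal user mode AST procedure written to it (the procedure name is hypothetical; the argument offsets follow Figure 7-3, with the argument count longword at 0(AP)):

        .ENTRY  EXAMPLE_AST,^M<R2>      ; ASTs are procedures: entry mask first
        MOVL    4(AP),R2                ; First argument is the AST parameter
                                        ; (ASTPRM from the original request)
        ; ... application-specific processing; R0 and R1 may be used
        ;     freely because the dispatcher saved and will restore them ...
        RET                             ; Return through EXE$ASTRET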



7.3.3 AST Exit Path 

When the AST routine issues the RET instruction, control is returned to the 
location EXE$ASTRET in the access mode of the AST. The call frame, but 
not the argument list, was removed from the current stack by the RET in- 
struction. The argument list remains because a CALLG rather than a CALLS 
instruction was used to execute the AST routine. The following steps then 
occur. 

1. The argument count and the AST parameter are removed from the stack, 
leaving the RO, Rl, PC, and PSL values. 

2. The following instruction is executed: 

CHMK #ASTEXIT 

This instruction invokes the change-mode-to-kernel system service dis- 
patcher, CMODSSDSP (described in Chapter 9). The service code of zero 
(ASTEXIT = 0) causes the normal kernel mode dispatching mechanism to 
be bypassed. 

3. In place of the kernel mode dispatching mechanism, the following actions 
are performed while in kernel mode: 

• The IPL is raised to SYNCH. 

• The appropriate PCB$B_ASTACT bit is cleared to signal AST comple- 
tion. 

• The ASTLVL value is recomputed. 

These fields can only be written from kernel mode. Thus, it is necessary 
for the AST dispatcher to reenter kernel mode after the AST returns con- 
trol to the dispatcher and before the AST delivery interrupt is dismissed. 

4. An REI instruction, still in module CMODSSDSP, drops the IPL to zero, 
and returns the access mode to that of the AST. 

5. Code in the module ASTDEL resumes at the previous access mode and IPL 
with the following steps: 

• The saved values in RO and Rl are restored. 

• Another REI instruction is issued. 

The REI instruction returns control to the access mode and location origi- 
nally interrupted by AST delivery. 

Note that the REI instructions in CMODSSDSP and ASTDEL may cause 
another IPL 2 interrupt to occur, depending upon the ASTLVL value and the 
access mode transitions. 




7.4 SPECIAL KERNEL MODE ASTs 

Special kernel mode ASTs are different from normal ASTs in several ways: 

1. The ASTs represent system actions that must occur in the context of the 
process. These actions are frequently requested when the process is not 
currently executing. 

2. The special kernel mode AST routines are dispatched at IPL 2 and execute 
at that level or higher. Synchronization is provided by the interrupt mech- 
anism itself, rather than requiring additional PCB$B_ASTACT and 
PCB$B_ASTEN bits. Only one special kernel mode AST can be active at 
any moment because the AST delivery interrupt is blocked. 

3. The special kernel mode AST routines are invoked by a JSB instruction 
rather than a CALLG instruction. There is no argument list (the PCB ad- 
dress is in R4 and the ACB address is in R5). When the special kernel mode 
AST routine executes its RSB instruction, the stack must be in its original 
state (when the special kernel mode AST routine was called). The routine 
must also save and restore general registers R6 through R11.

4. The AST routine is responsible for the deallocation of the ACB (to non- 
paged pool). (For normal ASTs, this deallocation is done by the AST deliv- 
ery routine.) 

5. On return from the AST routine (with an RSB instruction), the AST queue 
is checked once more (in case a special kernel mode AST queued a normal 
AST to the process). If the queue is empty, an REI instruction is executed. 
This instruction attempts to pass control to the originally interrupted 
PC/PSL pair. IPL will drop from two to zero at the same time. 
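Taken together, these rules give special kernel mode AST routines a common shape. A minimal skeleton (the routine name is hypothetical; EXE$DEANONPAGED is the executive's nonpaged pool deallocation routine, which expects the block address in R0):

EXAMPLE_KAST:                           ; Entered by JSB at IPL 2 with the
                                        ; PCB address in R4, ACB address in R5
        ; ... perform the system action in the context of this process ...
        MOVL    R5,R0                   ; The ACB itself is the block to free
        JSB     G^EXE$DEANONPAGED       ; Deallocate the ACB to nonpaged pool
        RSB                             ; Back to SCH$ASTDEL, which rechecks
                                        ; the AST queue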

The next five sections briefly describe five examples of the special kernel 
mode AST mechanism. 

7.4.1 I/O Postprocessing in Process Context 

Part of the sequence of completing an I/O request involves the delivery of a 
special kernel mode AST to the requesting process. I/O postprocessing is 
described in the VAX/VMS Guide to Writing a Device Driver. This request is 
made by the IPL 4 (I/O postprocessing) interrupt service routine by queuing 
the former I/O request packet as an ACB. The operations performed by the 
I/O completion AST routine are those that must execute in process context, 
particularly those that reference process virtual addresses. The primary oper- 
ations (executed at IPL 2) are the following. 

1. For buffered read I/O operations only, the data is moved from the system 
buffer to the user buffer, and the system buffer is deallocated to nonpaged 
dynamic memory. 

2. The buffered or direct I/O count field of the process header is incremented 
for accounting information. 




3. If a user diagnostic buffer was specified, the diagnostic information is 
moved from the system diagnostic buffer before it is deallocated. 

4. The channel control block (in the control region) is updated to show I/O 
completion. Updating the CCB may make the channel idle. 

5. The event flag associated with the I/O request is set. 

6. If an I/O status block (IOSB) was specified, the IOSB is written using infor- 
mation in the I/O request packet. 

7. If an AST was specified with the $QIO request, then the ACB$V_QUOTA 
bit was set in the IRP. The AST procedure address and the optional AST 
parameter were originally stored in the IRP (now an ACB). The former IRP 
is queued to the process once again in the access mode of the requesting 
process. 

8. Otherwise, the IRP/ACB is deallocated to nonpaged dynamic memory. 

7.4.2 Process Suspension 

When a $SUSPND system service request specifies a process other than the 
requesting process, the suspend mechanism requires a special kernel mode 
AST to enter the context of the target process. 

When the special kernel mode AST is delivered, the following actions are 
performed: 

1. The ACB is deallocated to nonpaged dynamic memory. 

2. After raising IPL from ASTDEL (IPL 2) to SYNCH, the PCB$V_RESPEN 
bit is cleared. If a request to resume from the $RESUME system service 
was pending, then the resume request has precedence. That is, the AST 
routine exits without suspending the process (after dropping IPL back to 
ASTDEL). 

3. If no resume request was pending, then the process is placed into the SUSP 
wait state. The process hardware context is saved with a SVPCTX instruc- 
tion (described in detail in Chapter 10). The process quantum field in the 
process header is charged with a voluntary wait interval (determined by 
the special system parameter IOTA, described in Chapter 10). The time at 
which the process enters the wait state is stored in the process header at 
offset PHD$W_WAITIME. Control is passed to the scheduler at 
SCH$SCHED to select the next process for execution. 

When the process finally executes again (after a $RESUME system service 
call), the PCB$V_SUSPEN bit is unconditionally cleared and the process is 
made computable. 

7.4.3 Process Deletion 

The major portion of the steps involved in process deletion occur in a special 
kernel mode AST routine queued in response to a $DELPRC system service 




call. A detailed explanation of process deletion is provided in Chapter 22. The 
use of the special kernel mode AST mechanism provides the following: 

• Execution as the current process is accomplished by AST delivery. Almost 
all waiting processes are made computable by AST delivery (see Chapter 
10), with the exception of suspended processes. The $DELPRC service en- 
sures the deletion of a suspended process by issuing a $RESUME first. 

Execution as the current process is required for process virtual address 
translation and other operations that require process context (particularly 
in obtaining the information contained in the control region). 

• The delivery of deletion ASTs cannot be prevented by the $SETAST sys- 
tem service. A process can only avoid deletion by raising IPL to ASTDEL 
(IPL 2) or above to prevent all AST deliveries. Because IPL can only be 
elevated while in kernel mode, only privileged processes, or the system 
acting on behalf of some process, can explicitly prevent process deletion. 



7.4.4 $GETJPI System Service 

The $GETJPI system service is described in Chapter 30. When information is 
requested for a process other than the requesting process, the target process 
must execute to establish process context. In addition, if the target process is 
outswapped, the enqueuing of the special kernel mode AST will make the 
process an inswap candidate. This action brings in both the working set and 
the process header (where much of the accounting information is main- 
tained). 
In general terms, the $GETJPI AST activity is as follows. 

1. An ACB is constructed for a special kernel AST. A system buffer is also 
allocated and a pointer to it is placed in the ACB. 

2. When the special kernel mode AST routine executes in the context of the 
target process, the requested information is moved into the system buffer. 
(The requests had been encoded in the ACB.) The ACB is then reset to 
deliver a special kernel mode AST back to the requesting process. 

3. The second special kernel mode AST moves data from the system buffer 
into a user buffer in the requesting process. Other actions include the 
following: 

• Deallocating the system buffer 

• Setting an event flag 

• Delivering an AST in the access mode of the caller, if requested 

4. If an AST is delivered, the ACB is used for the third time. If no AST is 
delivered, then the ACB is deallocated. 




7.4.5 Power Recovery ASTs 

Another example of the use of special kernel mode ASTs occurs in the imple- 
mentation of power recovery ASTs, a tool that enables processes to receive 
notification that a power failure and successful restart have occurred. (Power 
failure and power recovery are described in Chapter 27.) 

When a successful power recovery occurs, all processes that have estab- 
lished a power recovery AST are notified first with a special kernel mode 
AST. This AST retrieves information from the P1 pointer page that allows
the user-requested AST to be delivered. The AST is required because P1 space
information is only available from process context.



7.4.6 Other System Use of ASTs 

Three other features within the executive are implemented through ASTs, 
but these ASTs are not special kernel mode ASTs. The automatic working set 
adjustment that takes place at quantum end is implemented with normal 
kernel ASTs. (See Chapter 10 for information on quantum end activities and 
Chapter 16 for a detailed description of automatic working set adjustment.)
CPU time limit expiration is implemented with potentially multiple ASTs. 
Beginning with user mode, the AST procedure calls the $EXIT system serv- 
ice. If the process is not deleted, a supervisor mode time expiration AST is 
queued. This loop continues with higher access modes until the process is 
deleted. The Force Exit system service (see Chapters 12 and 21) causes a user 
mode AST to be delivered to the target process. 



7.5 ATTENTION AND OUT-OF-BAND ASTs 

Two other categories of AST use are the mechanisms for serving attention 
and out-of-band ASTs. Attention ASTs and out-of-band ASTs are used in 
association with I/O operations to notify processes or routines that an unsolic-
ited event has occurred on a device. Out-of-band ASTs are described in Sec- 
tion 7.5.5. 



7.5.1 Set Attention Mechanism 

In order to establish an attention AST for a particular device (whose driver 
supports this function), the user must issue a $QIO system service request 
with the I/O function IO$_SETMODE (or IO$_SETCHAR for some devices). 
The kind of attention AST requested is indicated by a function modifier. 
The following steps are provided by the routine COM$SETATTNAST in 
module COMDRVSUB. (This routine requires process context and so is 
called only from device driver FDT routines.) 




1. If the user AST routine address (the $QIO P1 parameter) is zero, the re-
quest is interpreted as a flush attention AST list request (see Section 7.5.3). 

2. An expanded ACB is allocated from nonpaged dynamic memory. The ACB 
is deducted from the process quota, PCB$W_ASTCNT. 

3. Information from the I/O request packet (such as the AST routine entry 
point, AST parameter, device channel number, and process ID) is moved 
into the ACB. 

4. IPL is raised to UCB$B_DIPL, the IPL at which this list is synchronized. 
The ACB is linked to the unit control block (UCB) of the associated device 
in a singly linked, last-in/first-out (LIFO) list. 



7.5.2 Delivery of Attention ASTs 

The occurrence of a situation for which attention ASTs have been defined 
causes the delivery of all such attention ASTs. The mechanism of delivery is 
implemented in the routine COM$DELATTNAST of module COM- 
DRVSUB. COM$DELATTNAST is usually invoked by a device driver at de- 
vice IPL (IPL 20 through 23), after specifying which list of attention AST fork 
blocks/ACBs is to be used.

Each ACB is originally formatted as a fork block with the AST information 
located at different offsets. Figure 6-2 shows the layout of a fork block. The 
control block contains relevant additional information such as saved PC, R3, 
and R4 values, the channel number for the device, and the IPL value for 
processing the AST (IPL$_QUEUEAST = IPL 6). During fork processing, the 
control block is reformatted into a standard ACB. 

When COM$DELATTNAST begins execution, the CPU is usually execut- 
ing at device IPL. The queuing of ASTs is an operation using IPL$_SYNCH as 
a synchronization mechanism (see Chapter 2). Specifically, IPL must be 
raised to SYNCH. To accomplish correct synchronization, the IPL 6 fork dis- 
patcher is used. 

The following steps summarize the delivery of attention ASTs: 

1. At IPL 20 through 23, each attention AST fork control block/ACB is re-
moved from the appropriate list in the reverse order of declaration. 

2. The routine invokes the FORK system macro to dispatch to EXE$FORK. 
EXE$FORK queues the fork block to the listhead defined by the fork IPL 
field and requests an interrupt at that IPL. 

3. As the interrupt priority level of the CPU drops below six, the fork inter- 
rupt is taken. The IPL$_QUEUEAST fork dispatcher removes each fork 
control block from its queue and passes the control block back to a loca- 
tion in COM$DELATTNAST at IPL 6. 

4. At IPL 6, the fork control block is then reformatted into an ACB, repre- 
senting an AST in the access mode of the original requesting process. 






5. The ACB is then queued to the process through SCH$QAST (which will 
immediately raise IPL to IPL$_SYNCH in order to synchronize access to 
the ACB listhead and the scheduler database). 



7.5.3 Flushing an Attention AST List 

The list of attention ASTs is flushed as the result of an explicit user request, a 
cancel I/O request, or a deassign channel request for the associated device. 

An explicit user request to flush the attention AST list is performed as the 
result of a set attention AST request with an AST routine address of zero (see 
Section 7.5.1). COM$SETATTNAST then branches to COM$FLUSHATTNS. 

Device drivers can request the flushing of the attention AST list by either 
invoking COM$SETATTNAST with an AST routine address of zero or by 
directly invoking COM$FLUSHATTNS with the channel number of the de- 
vice in R6. 

COM$FLUSHATTNS performs the following operations. 

1. The IPL is raised to the hardware IPL of the device (IPL 20 through 23). 

2. As each control block in the attention AST list is found, the process ID of 
the process requesting the flushing operation is compared with the process 
ID stored in the control block. An AST control block is retained in the 
attention AST list if the process IDs do not match. 

3. If the process IDs match, then the channel numbers must match. One 
channel number is passed in R6 from the flush request, and the other is in 
the control block from the declaration of the AST. If the channel numbers 
do not match, then the control block is retained in the attention AST list. 
Otherwise, the control block is removed from the attention AST list. 
Control blocks are therefore removed for a specific process on a specific 
channel. 

4. IPL is dropped from device interrupt level (IPL 20 through 23). 

5. The ASTCNT quota is incremented to indicate deallocation of the control 
block. 

6. The control block is deallocated to nonpaged dynamic memory. This oper- 
ation requires execution through the fork dispatcher at IPL$_QUEUEAST
to insure proper synchronization with IPL. (Actual deallocation is done at 
IPL 11 as described in Chapter 3.) 

7. Processing continues until the entire attention AST list has been scanned. 



7.5.4 Examples in the VAX/VMS Executive 

Two devices that commonly have attention ASTs associated with them are 
terminals and mailboxes. Brief descriptions of the support for attention ASTs 
in these device drivers are given here. 




7.5.4.1 Terminal Driver and CTRL/Y Notification. The terminal IO$_SETMODE 
and IO$_SETCHAR functions may take either IO$M_CTRLCAST or 
IO$M_CTRLYAST function modifiers. When a CTRL/C is typed on a termi- 
nal, the CTRL/C attention AST list is emptied by delivering each CTRL/C 
AST associated with the terminal. If no CTRL/C attention AST is declared, 
then the CTRL/C is interpreted as a CTRL/Y and the CTRL/Y AST list is 
searched instead. If a CTRL/Y is typed, only the CTRL/Y attention AST list 
is emptied. 

Because the list is emptied each time a CTRL/Y or a CTRL/C is typed, both 
CTRL/C and CTRL/Y attention ASTs must be reenabled each time they are 
delivered to a process. In contrast, out-of-band ASTs are repeating. That is, 
once declared, out-of-band ASTs can be delivered to the process for the life of 
the process, or until the Cancel system service is called to flush the AST list. 

7.5.4.2 Mailbox Driver. The IO$M_READATTN and IO$M_WRTATTN function 
modifiers provide notification of mailbox requests from other processes. 
IO$M_WRTATTN provides notification of unsolicited input to a mailbox. 
IO$M_READATTN notifies the enabling process when any process issues a 
read to a mailbox when no message is available. 

Multiple attention ASTs of each type may be declared by processes for the 
same mailbox. When a condition corresponding to an attention AST occurs 
in a mailbox, all ASTs of the appropriate type are delivered. Only the first 
process to issue a responding I/O request will be able to complete the transfer 
of data signaled by the attention ASTs. 

Read and write attention ASTs must be reenabled after delivery because 
the entire attention AST list is delivered (and removed) after each occurrence 
of the specified condition. 



7.5.5 Out-of-Band ASTs 

In VAX/VMS Version 3.0 a new form of AST mechanism was introduced 
specifically for the terminal driver. Routines establish out-of-band ASTs in 
order to intercept control characters received from the terminal (ASCII codes 
00 through 20 [hex]) and to perform special processing as a result of the con- 
trol character being typed. This mechanism is intended to supplement the 
attention AST mechanism described in Section 7.5, which applies only to the 
characters CTRL/C and CTRL/Y (ASCII codes 03 and 19 [hex]) in the termi- 
nal driver. 

7.5.5.1 Set Out-of-Band AST Mechanism. The mechanism of out-of-band ASTs is 
similar in many ways to that of attention ASTs. Out-of-band ASTs are estab- 
lished by issuing the $QIO system service, specifying IO$_SETMODE (or 
IO$_SETCHAR) with the function modifier IO$M_OUTBAND. Like atten- 




tion ASTs, the list of out-of-band ASTs is linked to the unit control block 
(UCB) of the associated terminal. 

The following steps are performed by the routine COM$SETCTRLAST in 
module COMDRVSUB. (This routine requires process context, so it can be 
called from device driver FDT routines only.) 

• If the user AST routine address (the $QIO P1 parameter) is zero, or if the
character mask (the $QIO P2 parameter) is zero, the request is interpreted 
as a flush out-of-band AST list request (see Section 7.5.5.3). 

• The list of out-of-band ASTs is scanned, searching for an out-of-band AST 
control block with the same characteristics as the caller. The following 
items are checked: 

—The process ID (PID). Out-of-band ASTs can be issued to the same ter- 
minal device from a process and its subprocesses (which will have differ- 
ent PIDs). 

— The channel number. 

— The character mask. 

If an out-of-band AST control block is found with the same characteristics, 
the request is interpreted as a request to modify the existing out-of-band AST 
control block. If a similar out-of-band AST control block is not found, a new 
control block is allocated from nonpaged dynamic memory. The ACB in the 
out-of-band AST control block is deducted from the process AST quota, 
PCB$W_ASTCNT. 

• Information from the I/O request packet (such as the AST routine entry 
point, AST parameter, device channel number, and process ID) is moved 
into the out-of-band AST control block. 

• The out-of-band AST control block is placed on the tail of the control 
block list. 

• The character mask is ORed into the out-of-band AST summary mask. 

7.5.5.2 Delivery of Out-of-Band ASTs. When a control key is typed at a terminal, a 
check must be made to see if an out-of-band AST has been enabled for that 
key. The character typed is compared with the out-of-band AST summary 
mask. If the bit in the summary mask is set, an out-of-band AST has been 
declared for that control character and the AST is delivered. The mechanism 
of delivery is implemented in the routine COM$DELCTRLAST of module 
COMDRVSUB. COM$DELCTRLAST is invoked by the terminal driver at 
device IPL. 
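The test itself can be a single variable-position bit test against the summary mask. A sketch (the mask's location in the terminal UCB, shown here as a hypothetical offset OUTBAND_MASK from the UCB address in R5, depends on the driver data structures; R0 holds the ASCII value of the character typed):

        BBC     R0,OUTBAND_MASK(R5),50$ ; No out-of-band AST declared for
                                        ; this character; skip delivery
        ; ... scan the control block list and deliver the AST ...
50$: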

Each out-of-band AST control block is originally formatted as a fork block 
with the AST fields located at different offsets. (The first six longwords of the 
unit control block pictured in the VAX/VMS Guide to Writing a Device 
Driver are the most common example of a fork block.) The control block 




contains relevant additional information, such as: the saved PC, R3, and R4 
values; the channel number for the device; and the IPL value for processing 
the AST (IPL$_QUEUEAST = IPL 6). During fork processing, the out-of-band
AST control block is reformatted into a standard ACB. 

When COM$DELCTRLAST begins execution, the CPU is executing at 
device IPL. ASTs are queued using IPL$_SYNCH as a synchronization mech-
anism (see Chapter 2). Specifically, IPL must be raised to SYNCH. To accom- 
plish correct synchronization, the IPL 6 fork dispatcher is used. 

The following steps summarize the delivery of out-of-band ASTs. 

1. At device IPL, the list of out-of-band AST control blocks is searched for a 
block whose character mask contains the character typed at the terminal. 
When a match is found, a bit in the out-of-band AST control block is 
checked to see if the control block is already in use. If the block is in use, it 
is skipped; if the block is not in use, it is marked in-use, the control block 
is modified to act as a fork block, and the block is queued to the IPL 6 fork
queue listhead. 

2. The routine invokes the FORK system macro to notify the fork dispatcher 
through the IPL 6 software interrupt. 

3. As the interrupt priority level of the CPU drops below six, the fork inter- 
rupt is taken. The IPL$_QUEUEAST fork dispatcher removes each fork 
control block from its queue and passes the control block back to a loca- 
tion in COM$DELCTRLAST at IPL 6. 

4. At IPL 6 the fork control block is then reformatted into an ACB, represent- 
ing an AST in the access mode of the original requesting process. The no 
delete and piggyback special kernel mode AST flags are set in the ACB, 
and the special kernel mode AST field is loaded with the address of the 
piggyback special kernel mode AST. 

5. The ACB is then queued to the process through SCH$QAST (which will 
immediately raise IPL to IPL$_SYNCH). 

6. When the process receives the ASTs, the piggyback special kernel mode 
AST is delivered first. The piggyback special kernel mode AST performs 
two functions: 

• It clears the busy bit. 

• If the out-of-band AST is marked as "lost," it is deallocated. "Lost" 
control blocks occur when a request to flush the AST list cannot deallo- 
cate a control block because the busy bit is set (see Section 7.5.5.3). 
Once the AST is delivered and the busy bit is clear, the control block is 
no longer needed and can be deallocated. 
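
The summary mask test and the in-use check in step 1 can be expressed in C 
as follows. This is a sketch under stated assumptions: the representation of a 
control character CTRL/x as bit x of a 32-bit mask is an assumption, the names 
are illustrative, and the fork queuing is reduced to a stub. 

    /* Illustrative sketch; names and mask representation are assumed. */
    struct oob_acb {
        struct oob_acb *next;
        unsigned long   mask;       /* control-character mask           */
        int             busy;       /* block already in use?            */
    };

    static void queue_fork_ipl6(struct oob_acb *p)
    {
        (void)p;                    /* stand-in for reformatting the    */
    }                               /*  block and queuing it at IPL 6   */

    void deliver_ctrl_asts(struct oob_acb *head, unsigned long summary,
                           int key)
    {
        unsigned long bit = 1ul << key;   /* CTRL/x as bit x (assumed)  */
        struct oob_acb *p;

        if ((summary & bit) == 0)
            return;                 /* no out-of-band AST for this key  */
        for (p = head; p != NULL; p = p->next)
            if ((p->mask & bit) != 0 && !p->busy) {
                p->busy = 1;        /* mark in use                      */
                queue_fork_ipl6(p); /* deliver through fork dispatcher  */
            }
    }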

7.5.5.3 Flushing an Out-of-Band AST List. The list of out-of-band ASTs is flushed as 
the result of an explicit user request, a cancel I/O request, or a deassign chan- 
nel request for the associated device. 






An explicit user request to flush the out-of-band AST list is performed as 
the result of a set out-of-band AST request with an AST routine address of 
zero or a character mask of zero (see Section 7.5.5.1). COM$SETCTRLAST 
then branches to COM$FLUSHCTRLS. 

Device drivers can request the flushing of the out-of-band AST list either 
by invoking COM$SETCTRLAST with an AST routine address of zero (or a 
character mask of zero) or by directly invoking COM$FLUSHCTRLS with 
the channel number of the device in R6. 

COM$FLUSHCTRLS performs the following operations; a sketch in C 
follows the list. 

1. The IPL is raised to the device IPL for the terminal. 

2. The list of out-of-band AST control blocks is scanned. As each control 
block is found, the process ID of the process requesting the flushing opera- 
tion is compared with the process ID stored in the control block. An AST 
control block is retained in the out-of-band AST list if the process IDs do 
not match. 

3. If the process IDs match, then the channel numbers must match. One 
channel number is passed in R6 from the flush request; the other is in the 
control block from the declaration of the AST. If the channel numbers do 
not match, then the control block is retained in the out-of-band AST list. 

4. If the channel numbers match, the busy bit is checked. If the busy bit is 
set, the "lost" bit is set so that the control block will be deallocated once 
its AST is delivered. Otherwise, the control block is removed from the 
out-of-band AST list. 

5. IPL is dropped from device interrupt level. 

6. The ASTCNT quota is incremented to indicate deallocation of the control 
block. 

7. The control block is deallocated to nonpaged dynamic memory. This oper- 
ation requires execution through the fork dispatcher at IPL$_QUEUEAST 
to ensure proper IPL synchronization. (The actual deallocation is 
done at IPL 11 as described in Chapter 3.) 

8. Processing continues until the entire out-of-band AST list has been 
scanned. 
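
The scan can be summarized in C. The sketch below is illustrative: the names 
are assumed, and the IPL manipulation, quota crediting, and fork-level 
deallocation of steps 1 and 5 through 7 are reduced to a free() call. 

    /* Illustrative sketch; names are assumed, IPL handling omitted. */
    #include <stdlib.h>

    struct oob_acb {
        struct oob_acb *next;
        unsigned long   pid;
        unsigned short  chan;
        int             busy;       /* AST currently being delivered?  */
        int             lost;       /* deallocate after delivery       */
    };

    void flush_ctrl_asts(struct oob_acb **head, unsigned long pid,
                         unsigned short chan)
    {
        struct oob_acb **pp = head, *p;

        while ((p = *pp) != NULL) {
            if (p->pid != pid || p->chan != chan) {
                pp = &p->next;      /* retain: PID or channel mismatch */
            } else if (p->busy) {
                p->lost = 1;        /* "lost": freed after delivery    */
                pp = &p->next;
            } else {
                *pp = p->next;      /* unlink from the list            */
                free(p);            /* stand-in for pool deallocation  */
            }
        }
    }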






8 Error Handling 



There is always something to upset the most careful of human 

calculations. 

— Ihara Saikaku, The Japanese Family Storehouse 

There are several levels for reporting system-wide errors in the VMS operat- 
ing system. (Process-specific and image-specific errors are handled by the ex- 
ception mechanism described in Chapter 4.) 

• The error logging subsystem allows device drivers and other system com- 
ponents to record errors and other events for later inclusion in an error log 
report. 

• The BUGCHECK mechanism is used by the VMS operating system to shut 
down the system in an orderly fashion when internal inconsistencies or 
other irrecoverable errors are detected. 

• A machine check is an exception that indicates that the processor has 
detected some CPU-specific error. 

8.1 ERROR LOGGING 

The error logging subsystem is used to record device errors, processor- 
detected conditions, and other noteworthy events, such as volume mounts 
and system startups. 

8.1.1 Overview of the Error Logging Subsystem 

Error logging occurs in three steps. 

1. Components such as device drivers that wish to log an error call routines 
in the executive that write error messages into one of two buffers perma- 
nently allocated in the executive image. 

2. When the buffer allocation routine detects that a buffer is full, it awakens 
the ERRFMT process so that the buffer contents can be written to the 
error log file SYS$ERRORLOG:ERRLOG.SYS. 

3. The contents of this file can be assembled into a report by the report gener- 
ator utility SYE. 

8.1.2 Device Driver Errors 

There are two routines in the error log subsystem used by device drivers. 
ERL$DEVICERR is used to report device-specific errors. ERL$DEVICTMO 
can be called by a driver to report a device timeout. In either case, the follow- 
ing action is performed by the routines: 

1. An error message buffer is allocated. 

2. The buffer is loaded with information obtained from the unit control 
block and from the current I/O request packet. 

3. The driver is called at its register dump routine entry point to store de- 
vice-specific information into the error message buffer. 



8.1.3 Other Error Log Messages 

The VMS operating system uses the error log subsystem to record other infor- 
mation besides device errors. The kinds of items written to the error log 
include the following: 

• Warm start entries. These entries record successful recoveries from power 
failure. 

• Cold start entries. These entries record all successful system bootstrap 
attempts. 

• All bugchecks, fatal and otherwise. Bugchecks are described in the next 
section. 

• Machine check occurrences. 

• Volume mounts and dismounts. 

• Any messages written to the error message buffer by the Send Message to 
Error Logger system service. The use of this system service requires 
BUGCHK privilege. 



8.1.4 Operation of the Error Logger Routines 

Error message buffer allocation occurs at IPL 31. This high IPL allows the 
allocation routine (ERL$ALLOCEMB) to be called from anywhere in the sys- 
tem (including machine check handlers, which execute at IPL 31) without 
causing IPL problems. IPL is restored to the caller's IPL before control is 
passed back to the caller. 

There are two 512-byte buffers used for holding messages. A flip-flop 
switch (ERL$GB_BUFIND) indicates which of the two buffers is currently 
active. Allocation involves finding enough free space in the buffer indicated 
by ERL$GB_BUFIND to hold a message. When the current buffer is filled, 
the switch is thrown to activate the other buffer and the ERRFMT process is 
awakened to write the filled buffer to the error log file. 
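
The flip-flop allocation scheme can be sketched in C. The buffer sizes and the 
switching behavior follow the description above; the names and the wake-up 
stub are assumptions, and the IPL 31 synchronization is not shown. 

    /* Illustrative sketch of the two-buffer scheme; names assumed. */
    #include <stddef.h>

    #define EMB_BUFSIZE 512

    static char   buf[2][EMB_BUFSIZE];
    static size_t used[2];
    static int    bufind;           /* analogous to ERL$GB_BUFIND      */

    static void wake_errfmt(int which)
    {
        (void)which;                /* stand-in: write filled buffer   */
    }                               /*  to SYS$ERRORLOG:ERRLOG.SYS     */

    /* Allocate len bytes from the active buffer; when it is full,
       throw the switch and wake the ERRFMT analogue. */
    void *alloc_emb(size_t len)
    {
        if (used[bufind] + len > EMB_BUFSIZE) {
            int full = bufind;
            bufind ^= 1;            /* activate the other buffer       */
            wake_errfmt(full);
        }
        if (used[bufind] + len > EMB_BUFSIZE)
            return NULL;            /* message larger than a buffer    */
        void *p = &buf[bufind][used[bufind]];
        used[bufind] += len;
        return p;
    }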

After a message buffer is successfully allocated, its address is returned to 
the caller of the allocation routine, which loads the buffer with information 
specific to the message being logged. Once the information has been stored, a 
second routine (ERL$RELEASEMB) is called to write more information into 
the message header, indicating that the message is valid. 

8.1.4.1 Waking the ERRFMT Process. The routine ERL$WAKE is called at least once 
a second from EXE$TIMEOUT (see Chapter 11). This routine is also called 
when one of the two log buffers is filled. The routine does not automatically 
wake the ERRFMT process. Rather, it decrements a counter (ERL$GB_ 
BUFTIM) and only wakes ERRFMT if the counter goes to zero. 

If the counter goes to zero, it is also reset. The current starting value for the 
error log timer is 30. (This value is an assembly-time parameter, not adjusta- 
ble with SYSGEN.) That is, the routine can be called a maximum of 30 times 
before ERRFMT is awakened. Thus, a maximum of 30 seconds can elapse 
without ERRFMT becoming computable, forcing error messages to be writ- 
ten to the error log file at reasonable intervals, even on systems on which 
very few errors occur. 

This timing mechanism is exploited by the allocation and deallocation 
routines if they wish to force an awakening of ERRFMT. Either of these rou- 
tines simply loads a 1 into ERL$GB_BUFTIM. The next call to ERL$WAKE 
(which must be done at IPL 7 and, thus, cannot be done directly either by the 
allocation or deallocation routine) is guaranteed to wake ERRFMT. 

The allocation routine forces a wake whenever it is forced to switch buffers 
because the current buffer is full. The buffer release routine forces a wake if 
the current message buffer contains ten or more messages. 
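
The countdown logic is simple enough to state in C. The sketch is 
illustrative; the names are assumed, and the wake primitive is a stub. 

    /* Illustrative sketch of the ERL$WAKE countdown; names assumed. */
    static int buftim = 30;         /* analogous to ERL$GB_BUFTIM      */

    static void wake_errfmt_process(void)
    {
        /* stand-in for the actual process wake-up */
    }

    /* Called at least once per second from the timer routine.  A
       routine that stores 1 into buftim forces a wake on the next
       call; a filled or crowded buffer uses exactly this trick. */
    void erl_wake(void)
    {
        if (--buftim <= 0) {
            buftim = 30;            /* assembly-time starting value    */
            wake_errfmt_process();
        }
    }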

8.1.5 Cursory Overview of the ERRFMT Process 

The ERRFMT process copies a previously filled error message buffer to the 
error log file SYS$ERRORLOG:ERRLOG.SYS, as described by the following 
steps: 

• The contents of the message buffer are copied into the P0 space of ERRFMT. 
This copying occurs at IPL 31 to synchronize with the allocation subrou- 
tine. 

• Once the message buffer contents are accessible in ERRFMT's address 
space, they can be put into a format acceptable to SYE, the error log report 
generator. The reformatted error messages are written to SYS$ERRORLOG: 
ERRLOG.SYS. 

• If a process has declared an error log mailbox, each message in the error log 
buffer is also sent to that mailbox. 

• If ERRFMT detects volume mounted or volume dismounted messages 
within the message buffer, it sends corresponding messages to terminals 
enabled as disk or tape operators. 

After ERRFMT has completed its output operations, it reenters the hibernate 
(HIB) state. 




8.1.6 Error Log Mailbox 

The error logging subsystem provides the capability (currently available for 
internal use by DIGITAL) for a process to monitor error logging activity as it 
is happening rather than wait for offline processing with the formatting pro- 
gram SYE. This capability is provided through an unsupported system service 
called Declare Error Log Mailbox (SYS$DERLMB). 

8.1.6.1 System Service Call. A process that has DIAGNOSE privilege can call the 
$DERLMB system service with a single argument, the unit number of the 
mailbox to receive error log messages. If the error log mailbox is not in use 
(the error log mailbox descriptor EXE$GQ_ERLMBX contains a zero), the 
unit number is stored in the first word of the mailbox descriptor and the PID 
of the requesting process is stored in the second longword. 
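
The descriptor check can be expressed in C. The field widths follow the text 
(a unit number word and a PID longword); the names derlmb and erlmbx are 
illustrative stand-ins for the service and for EXE$GQ_ERLMBX. 

    /* Illustrative sketch; names are assumed stand-ins. */
    struct erlmbx_desc {
        unsigned short unit;        /* mailbox unit number (first word) */
        unsigned long  pid;         /* PID of requester (second lword)  */
    };

    static struct erlmbx_desc erlmbx;   /* stand-in for EXE$GQ_ERLMBX   */

    int derlmb(unsigned short unit, unsigned long pid)
    {
        if (unit == 0) {            /* zero unit disables the feature   */
            erlmbx.unit = 0;
            erlmbx.pid  = 0;
            return 1;
        }
        if (erlmbx.unit != 0)       /* descriptor already in use        */
            return 0;
        erlmbx.unit = unit;
        erlmbx.pid  = pid;
        return 1;
    }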

Note that the Declare Error Log Mailbox ($DERLMB) system service is not 
supported by DIGITAL, and is not documented in the VAX/VMS System 
Services Reference Manual. 

If this service is called with a unit number of zero, the descriptor is cleared, 
disabling the error log mailbox feature. The descriptor is also unconditionally 
cleared by the image rundown routine (see Chapter 21). 

8.1.6.2 Action of the ERRFMT Process. If the ERRFMT process detects that the error 
log mailbox feature is enabled, it sends each message that it extracts from the 
error log buffer to that established mailbox. Thus a process can monitor mes- 
sages that the ERRFMT process is writing to the error log file. 



8.2 SYSTEM CRASHES (BUGCHECKS) 

When the VMS operating system detects an internal inconsistency, such as a 
corrupted data structure or an unexpected exception, it declares a bugcheck. 
If the system can continue running, a nonfatal bugcheck is declared, which 
results in an error log entry. Serious errors result in fatal bugchecks, through 
which the system is shut down in a controlled fashion: 

1. The contents of physical memory are written to the system dump file 
(unless inhibited by a SYSBOOT flag, DUMPBUG). 

2. After the system is halted, it may restart itself (again according to the 
setting of a SYSBOOT flag, BUGREBOOT). 



8.2.1 Bugcheck Mechanism 

The path into the bugcheck routine appears in source code as the invocation 
of the BUG_CHECK macro. This macro expands into opcode ^XFF, a byte 
containing ^XFE, and a word containing the particular bugcheck code. 




The execution of opcode ^XFF results in a reserved instruction exception 
(SS$_OPCDEC, opcode reserved to DIGITAL), causing control to be trans- 
ferred through the system control block to an exception-specific service rou- 
tine. This routine checks for both of the following: 

• If the opcode is ^XFF. 

• If the byte following the reserved opcode is either ^XFE or ^XFD. (A ^XFE 
indicates that the bugcheck code is contained in the next word. A ^XFD 
indicates that the bugcheck code is contained in the next longword. The 
VMS operating system does not currently use longword bugcheck codes.) 

If both of these checks succeed, the VMS operating system interprets this 
exception as a bugcheck and transfers control to routine EXE$BUG_CHECK. 
Otherwise, the illegal opcode exception is treated in the usual manner de- 
scribed in Chapter 4. 
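
The check that the exception service routine makes can be modeled in C as a 
decoder over the instruction stream. The function below is an illustrative 
sketch, not the VMS routine; it simply applies the byte layout described above. 

    /* Illustrative decoder for the BUG_CHECK expansion described in
       the text: ^XFF, then ^XFE (word code) or ^XFD (longword code). */
    #include <stdint.h>

    int decode_bugcheck(const uint8_t *pc, uint32_t *code)
    {
        if (pc[0] != 0xFF)
            return 0;               /* not the reserved opcode          */
        if (pc[1] == 0xFE) {        /* word bugcheck code follows       */
            *code = (uint32_t)pc[2] | ((uint32_t)pc[3] << 8);
            return 1;
        }
        if (pc[1] == 0xFD) {        /* longword code (currently unused) */
            *code = (uint32_t)pc[2] | ((uint32_t)pc[3] << 8)
                  | ((uint32_t)pc[4] << 16) | ((uint32_t)pc[5] << 24);
            return 1;
        }
        return 0;                   /* ordinary illegal opcode          */
    }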



8.2.2 Operation of Bugcheck Routine 

The bugcheck routine performs several steps, depending on the access mode 
in which the bugcheck occurred and whether the bugcheck was fatal. (The 
fatality of the bugcheck is determined by the severity field, bits <2:0> in the 
bugcheck code. If the BUG_CHECK macro call includes the parameter 
FATAL, a code of STS$K_SEVERE (value of 4) is placed into this field; other- 
wise, a zero is placed there.) If the SYSBOOT flag BUGCHECKFATAL is set, 
all bugchecks issued from executive or kernel mode are treated as fatal, inde- 
pendent of the severity code in the low-order three bits of the bugcheck code. 
The BUGCHECKFATAL flag is clear by default, which means that nonfatal 
bugchecks do not cause the system to crash. 

8.2.2.1 Bugchecks from User and Supervisor Mode. If a bugcheck is generated from 
either user or supervisor mode, and the process has BUGCHK privilege, a 
message (of type user-generated bugcheck) is written to the error log buffer. 

• If the bugcheck is fatal, the $EXIT system service is called with the code 
SS$_BUGCHECK as the final image status. What happens as a result of 
this call depends on whether the process is executing a single image (no 
supervisor mode termination handler has been established) or the process 
is an interactive or batch job. 

—If the process is executing a single image, a fatal bugcheck from user or 
supervisor mode results in process deletion. 

—With the current use of supervisor mode termination handlers, a fatal 
bugcheck issued from an interactive or batch job causes the currently 
executing image to exit and control to be passed to the CLI to receive 
the next command. 




In either case, the only difference between user and supervisor mode is 
that user mode termination handlers are not called if a fatal bugcheck is 
issued from supervisor mode. 

• If the bugcheck code is not fatal, the exception (the initial path into the 
bugcheck code) is dismissed, and execution continues with the instruction 
following the BUG_CHECK macro. 

The BUGCHECKFATAL flag has no effect on bugchecks issued from user or 
supervisor mode. The severity field in the bugcheck code is used to deter- 
mine whether a given bugcheck is fatal. In addition, neither user nor supervi- 
sor mode bugchecks cause the system to shut down. 

8.2.2.2 VMS Use of Bugchecks. The bugchecks that the VMS operating system uses 
for its own purposes are issued from executive or kernel mode. If the bugcheck 
is not fatal and the SYSBOOT parameter flag BUGCHECKFATAL was turned 
off, the bugcheck routine proceeds as it does for nonfatal bugchecks for the 
outer two access modes. A message is sent to the error logger and the excep- 
tion is dismissed, passing control back to the caller at the instruction follow- 
ing the bugcheck invocation. 

A fatal bugcheck results in an orderly shutdown of the system. Rather than 
describe each step that the bugcheck routine takes to accomplish this shut- 
down, several items of general interest in the operation of the orderly shut- 
down are described. 

• All disk I/O performed by the bugcheck routine uses the bootstrap disk 
driver used by the initialization programs VMB and SYSBOOT (see Chap- 
ter 24) and loaded into nonpaged pool by INIT (see Chapter 25). The use of 
this driver allows a dump file to be written even if the system disk driver is 
corrupted. 

• Most of the bugcheck routine and all the bugcheck codes and associated 
text are not resident. They are stored in the executive image SYS.EXE and 
read into memory (by the boot driver). 

This code and data are read into system space on top of a read-only 
portion of the executive. Global label BUG$FATAL defines the beginning 
of the buffer into which the bugcheck code and data will be read. This label 
immediately precedes the blank program section (named ".BLANK." and 
located at address 80007A6E in VAX/VMS Version 3.0). 

The code and data that are read into memory at this time include the 
following: 

—The bulk of the bugcheck service routine 

—A template for the message that is typed on the console terminal 

—Some primitive console terminal output routines 

—The textual description of all possible bugcheck messages 




There are two implications of reading code into memory on top of existing 
code. 

—None of the routines destroyed by BUGCHECK is available for use by 
the bugcheck code. This requirement is most important in deciding how 
the nonpaged executive is laid out. 

—Portions of the dump may look strange when inspected by SDA. For 
example, it is impossible to determine if a portion of the instruction 
stream is corrupted because SDA displays bugcheck code and data in- 
stead of the original instructions and read-only data. 

• A header block for the dump file is constructed in the 512 bytes immedi- 
ately preceding the area into which the bugcheck code and data were writ- 
ten. This area contains more read-only portions of the nonpaged executive. 
(The system virtual address range whose contents are altered by the opera- 
tion of bugcheck, including the 512-byte dump file header block, extends 
from 8000786E to 8000A26E. These numbers are valid for VAX/VMS Ver- 
sion 3.0 but are almost certain to change with the next major release of the 
system.) 

The contents of the dump file header block are listed in Table 8-1. Note 
that the error log entry associated with this bugcheck is written into the 
header to avoid loss of information if the error log buffers were full when 



Table 8-1: Contents of the Dump File Header Block 

Description                                              Size 
Last error log sequence number (unused)                  Longword 
Dump file flag                                           Word 
  (Low bit set if dump file analyzed) 
Dump file version                                        Word 
  (Contains 1 if Version 2.0 format) 
Contents of SBR, SLR, KSP, ESP, SSP, USP, ISP            7 Longwords 
Quadword memory descriptors for up to eight              8 Quadwords 
  memory controllers (each quadword is 
  broken down as follows): 
    Page count                                           24 Bits 
    TR number for this controller                        8 Bits 
    Base PFN for this controller                         32 Bits 
System version number                                    Longword 
One's complement of previous longword                    Longword 
Error log entry for crash/restart                        125 Words 
  (See Table 8-2) 
Contents of software PCB of current process              156 Bytes 
  (See Table B-2) 






the bugcheck occurred. This error log entry will be written into one of the 
error log buffers by SYSINIT (see Chapter 25) when the rest of the error log 
messages (blocks 2 and 3 in the dump file) are put back into the buffers. (If 
there is no room in the error log buffers, the bugcheck entry will never be 
written to the error log file, although it is preserved in the dump file.) 
• A small amount of information describing the bugcheck is written to the 
console terminal. This information includes the contents of general regis- 
ters, the kernel and executive stacks, the contents of processor internal 
registers, and a summary of the reason for the bugcheck. This output oc- 
curs before the dump file is written and should not be interrupted by halt- 
ing the VAX processor from the console terminal. Such an interruption 
would prevent the dump file from being written. 

• The dump header, the contents of the two error log buffers, and the con- 
tents of physical memory are written to the system dump file. This step 
can be inhibited by clearing the SYSBOOT parameter flag DUMPBUG. 
The system dump file is described in some detail in the next section. 
• The last step in the bugcheck routine reboots the system. This is accom- 
plished by writing a special code (^XF02) into the console transmit data 
buffer (PR$_TXDB). (The special uses of the console registers are described 
in Chapter 19.) After the bootstrap code is written, a HALT instruction is 
executed that allows console microcode to gain control and process the 
bootstrap command. 

—On a VAX-11/730 processor, the AUTO RESTART/BOOT switch must 
be in the AUTO RESTART ON position in order for the system to auto- 
matically reboot following a bugcheck. 

—On a VAX-11/750 processor, the bootstrap device selector switch must 
be properly set and the system disk must be unit 0 in order for the 
system to automatically reboot following a bugcheck. 

—On a VAX-11/780 processor, the contents of the file DEFBOO.CMD on 
the console floppy must contain commands to direct a reboot from the 
system disk. 

The automatic reboot following a bugcheck can be prevented by clearing 
the SYSBOOT parameter flag BUGREBOOT. This flag is also cleared by 
OPCCRASH, the program that executes as part of the orderly shutdown 
procedure SHUTDOWN.COM. When automatic rebooting is inhibited, the 
system loops at IPL 31, waiting for a command to be entered at the console 
terminal. 



8.2.3 System Dump File 

The most important operation that is performed by the bugcheck routine is 
writing the contents of physical memory and other important information to 






Table 8-2: Contents of Error Message Buffer for Crash/Restart Entry 

Description                                              Size 
Error message buffer header                              Longword 
  Size in bytes of buffer                                Word 
  Allocation buffer indicator                            Byte 
  Error message valid indicator                          Byte 
Entry type (contains EMB$K_CR = 37 decimal)              Word 
System time when crash occurred                          Quadword 
  (from EXE$GQ_SYSTIME) 
Error log sequence number                                Word 
  (low-order word of ERL$GL_SEQUENCE) 
Contents of KSP, ESP, SSP, USP, ISP                      5 Longwords 
Contents of R0 to R11, AP, FP, SP, PC, PSL               17 Longwords 
Contents of P0BR, P0LR, P1BR, P1LR, SBR, SLR,            14 Longwords 
  PCBB, SCBB, ASTLVL, SISR, ICCS, ICR, 
  TODR, ACCS 
Contents of CPU-specific registers                       21 Longwords 
  There are no CPU-specific registers saved for 
    the VAX-11/730. 
  For the VAX-11/750 this area contains the following: 
    Translation buffer disable register (PR$_TBDR)       Longword 
    Cache disable register (PR$_CADR)                    Longword 
    Machine check error summary (PR$_MCESR)              Longword 
    Cache error register (PR$_CAER)                      Longword 
    CMI error summary register (PR$_CMIERR)              Longword 
  For the VAX-11/780 this area contains the following: 
    SBI fault status (PR$_SBIFS)                         Longword 
    SBI silo comparator register (PR$_SBISC)             Longword 
    SBI maintenance register (PR$_SBIMT)                 Longword 
    SBI error register (PR$_SBIER)                       Longword 
    SBI timeout address register (PR$_SBITA)             Longword 
Bugcheck crash code                                      Longword 
Length in bytes of software PCB                          Word 

Note: The error log entry for a nonfatal bugcheck contains the same information as the 
entry for a fatal bugcheck except for the 35 longwords set aside for architectural and CPU- 
specific processor registers. 

the dump file. In the case of system crashes, the dump file can be examined 
by the System Dump Analyzer (SDA) to determine the reason for the crash. 
SDA is invoked by the DCL command ANALYZE/CRASH_DUMP. The 
dump file contains three distinct pieces. 

1. The previously constructed dump header (see Table 8-1) is written to the 
first block in the file. 

2. The two error log buffers are written to the next two blocks. These buffers 
will be copied back into the error log buffers in memory from the dump 
file by SYSINIT (see Chapter 25) as part of the initialization code. In this 
way, no error log information is lost across a system crash or an operator- 
requested shutdown. 
3. The rest of the dump file is filled with the current contents of physical 
memory. Bugcheck uses the memory descriptors in the restart parameter 
block (RPB) constructed by VMB (see Chapter 24) to provide an accurate 
layout of physical address space. If a MA780 shared memory adapter is 
present on the system, its contents are also written to the dump file. 

The size of the dump file must be four blocks larger than the number of 
physical pages in the system. (The fourth block is not currently used.) In 
order to ensure that a crash dump can be analyzed with SDA, it is important 
that the dump file be large enough. If the dump file is too small, only the 
physical pages that fit into the underconfigured dump file will be written. In 
a typical VMS configuration, the most crucial contents of physical memory, 
the system page table, are located at the largest physical addresses (see Chap- 
ter 24) and will not be written, making a partial dump useless. That is, SDA 
cannot be used to examine a dump file that does not contain all of physical 
memory. 
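
The size rule translates directly into arithmetic; the following C fragment 
restates it (the names are illustrative). 

    /* Dump file blocks: header + two error log blocks + one unused
       block + one block per physical page. */
    unsigned long dumpfile_blocks(unsigned long physical_pages)
    {
        return physical_pages + 4;
    }

    /* A dump is only fully analyzable when the file holds every page. */
    int dump_is_complete(unsigned long file_blocks,
                         unsigned long physical_pages)
    {
        return file_blocks >= dumpfile_blocks(physical_pages);
    }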



8.3 MACHINE CHECK MECHANISM 

A machine check is an exception that is reported when the CPU or an exter- 
nal adapter detects an internal error. The initial processing of a machine 
check exception is CPU specific. This section contains an overview of ma- 
chine check handling. Consult the VAX Hardware Handbook or other hard- 
ware-related literature for information about a specific type of machine 
check. 

The basic philosophy of any of the machine check handlers is to keep as 
much of the system running as possible. There are two important pieces of 
information that determine how serious a particular machine check is: the 
nature of the machine check itself and the access mode in which the machine 
check occurred. 

• If the machine check is recoverable, the simple action is to log an error. 
This step is taken no matter what access mode was active when machine 
check occurred. In addition, the error time is recorded. If machine checks 
start occurring too quickly (more than one machine check per 10-millisec- 
ond interval), then the handler assumes that something is seriously wrong 
and treats a recoverable machine check in the same way that it treats an 
abort. The distinction between recoverable machine checks and aborts is 
CPU specific. The VAX Hardware Handbook or the module MCHECKxxx 
(where xxx represents the processor number) contains information about 
the machine checks that can occur on a particular processor. 




• If the machine check has put the system into a state from which it cannot 
recover, the action taken by the machine check handler depends on the 
access mode in which the machine check occurred. If the previous mode 
was supervisor or user, a machine check exception is reported to that ac- 
cess mode. (Unless the process has taken special action, this step will re- 
sult in image exit.) If the previous mode was executive or kernel, an irre- 
coverable machine check causes a fatal bugcheck (with the bugcheck code 
BUG$_MACHINECHK). 



8.3.1 VAX-11/730 Machine Check 

When a machine check occurs on a VAX-11/730, IPL is elevated to 31 and the 
interrupt stack contains the following information (a C sketch of this frame 
layout follows the list). 

• The length in bytes of the exception-specific information pushed on the 
stack. (This count does not include either the PC/PSL pair or the count 
longword itself.) There are currently 3 longwords in this list, which results 
in a value of 0C hex on the stack. 

• Machine check error code. 

• Two parameters, the contents of which depend on the machine check error 
code. The machine check codes and the information passed in these two 
parameters are detailed in Table 8-3. 

• PC of aborted opcode. 

• PSL at the time of the abort. 
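
In C terms, the frame pushed on the interrupt stack has the following shape. 
The structure and field names are illustrative; only the layout is taken from 
the list above. 

    /* Illustrative layout of the VAX-11/730 machine check frame. */
    #include <stdint.h>

    struct mchk730_frame {
        uint32_t byte_count;        /* 0x0C: three longwords follow     */
        uint32_t code;              /* machine check error code         */
        uint32_t param1;            /* MC$L_P1 (see Table 8-3)          */
        uint32_t param2;            /* MC$L_P2 (see Table 8-3)          */
        uint32_t pc;                /* PC of aborted opcode             */
        uint32_t psl;               /* PSL at the time of the abort     */
    };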

The machine check error code (the second item on the stack) determines the 
specific action of the machine check handler. If the machine check is an 
abort (PC left in an indeterminate state), then recovery is impossible. In addi- 
tion, a subset of the VAX-11 instruction opcodes on the VAX-11/730 cannot 
be restarted. (The list of these instructions can be found in module 
MCHECK730.) 

In addition to the VAX-11/730 machine checks that appear as exceptions 
(through the SCB vector at offset 4), one type of machine check can appear as 
an interrupt through a dedicated SCB vector. When this machine check oc- 
curs, only the PC and PSL are pushed onto the interrupt stack. 

This machine check is a corrected memory data condition (CRD) and will 
interrupt at IPL 26 through SCB vector 54 (hex). This exception simply causes 
an error log entry (indicating a soft memory error) to be written. (If errors 
occur too quickly, the CRD interrupt bit in the memory controller is turned 
off by the machine check handler.) 



8.3.2 VAX-11/750 Machine Check 

When a machine check occurs on a VAX-11/750, IPL is elevated to 31 and the 
interrupt stack contains the following information. 






Table 8-3: VAX-11/730 Machine Check Codes and Their Associated Parameters 

Code           Explanation                     MC$L_P1                          MC$L_P2 
MICRO_ERRORS   Microcode detected errors       0: No information available      zero 
                                               2: Unable to set PTE modify bit 
                                               3: Bad microprocessor interrupt 
TB_PARITY      Translation Buffer Parity       PTE in error                     VA of PTE in TB 
               Error 
BAD_MEM_CSR    Illegal format for memory CSR   VA referenced                    Bad CSR value 
NO_FAST_INT    Fast interrupts with no IDC     zero                             zero 
               present 
FPA_PARITY     Floating Point Accelerator      FPA parity information           zero 
               Parity Error 
SPTE_READCHK   Hard Memory Error on SPTE read  Physical Address of SPTE         Memory Controller 
                                                                                Diagnostics 
RDATASUBS      Uncorrectable ECC Errors        Physical Address Referenced      Memory Controller 
               (Read Data Substitute)                                           Diagnostics 
NX_MEM         Nonexistent Memory              Physical Address Referenced      zero 
UNALIGNED_IO   Unaligned or non-longword       Physical Address Referenced      zero 
               reference to I/O space 
UNK_IO_ADDR    Illegal I/O space address       Physical Address Referenced      zero 
BAD_UB_ADDR    Illegal UNIBUS reference        Physical Address Referenced      zero 




• The length in bytes of the exception-specific information pushed on the 
stack. (This count does not include either the PC/PSL pair or the count 
longword itself.) There are currently 10 longwords in this list, which results 
in a value of 28 hex on the stack. 

• Machine check error code. 

• Virtual address of the last fetch or store operation. 

• Program counter at the time of the error. 

• Memory data of the last fetch or store operation. 

• Saved mode register. 

• Read lock timeout register. 

• Translation buffer parity error register. 

• Cache error register. 

• Bus error register. 

• Error summary register. 

• PC of aborted opcode. 

• PSL at the time of the abort. 

The machine check error code (the second item on the stack) determines the 
specific action of the machine check handler. If the machine check is an 
abort (PC left in an indeterminate state), then recovery is impossible. In addi- 
tion, a subset of the VAX-11 instruction opcodes on the VAX-11/750 cannot 
be restarted. (The list of these instructions can be found in module 
MCHECK750.) 

In addition to the VAX-11/750 machine checks that appear as exceptions 
(through the SCB vector at offset 4), there are two machine checks that appear 
as interrupts through dedicated SCB vectors. When either of these occurs, 
only the PC and PSL are pushed onto the interrupt stack. 

• A corrected memory data condition (CRD) will interrupt at IPL 26 through 
SCB vector 54 (hex). This exception simply causes an error log entry (indi- 
cating a soft memory error) to be written. (If errors occur too quickly, the 
CRD interrupt bit in the memory controller is turned off by the machine 
check handler.) 

• A write bus error condition will interrupt at IPL 29 through SCB vector 60 
(hex). This error is treated as an irrecoverable error and further processing 
depends on the previous access mode. 



8.3.3 VAX-11/780 Machine Check 

When a machine check occurs on a VAX-11/780, IPL is elevated to 31 and the 
interrupt stack contains the following information. 

• The length in bytes of the exception-specific information pushed on the 
stack. (This count does not include either the PC/PSL pair or the count 
longword itself.) There are currently 10 longwords in this list, which results 
in a value of 28 hex on the stack. 

• Machine check summary parameter. 

• CPU error status. 

• Trapped micro PC, the microcode error location. 

• Virtual address at fault time. 

• CPU D register at fault time. 

• Translation buffer status register 0. 

• Translation buffer status register 1. 

• Physical address causing SBI timeout. 

• Cache parity error status register. 

• SBI error register. 

• PC of instruction that caused the machine check. 

• PSL of machine at fault time. 

The machine check summary parameter determines the specific action of the 
machine check handler. If the machine check is an abort (PC left in an inde- 
terminate state), then recovery is impossible. In addition, a subset of the 
VAX-11 instruction opcodes on the VAX-11/780 cannot be restarted. (The list 
of these instructions can be found in module MCHECK780.) 

There are also several error conditions on the VAX-11/780 that generate 
interrupts instead of machine check exceptions. 

• A corrected read data condition or a read data substitute condition inter- 
rupts through SCB vector 54 (hex) and raises IPL to 26. 

• An SBI alert interrupts through vector 58 at IPL 27. 

• An SBI fault interrupts through vector 5C at IPL 28. 

• An asynchronous write error is reported through SCB vector 60 at IPL 29. 

The first three of these errors result in error log entries. An attempt is made 
to continue from the error. The asynchronous write error causes a fatal bug- 
check if it occurred in kernel or executive mode or if an error occurred while 
updating a page table. 



8.3.4 Machine Check Recovery Blocks 

The VMS operating system provides a capability for a block of kernel mode 
code to protect itself from machine checks while the protected code is exe- 
cuting. For example, the VMS operating system uses this feature if an inter- 
rupt is generated from a previously unconfigured adapter. If the code that read 
the configuration register were not protected and the interrupt were spurious, 
then the configuration register would not exist and the reference to a nonex- 
istent I/O space address would crash the system. 
There are several restrictions on the protected code. 




1. It must be executing in kernel mode. 

2. The stack cannot be used across the entry into or the exit out of the pro- 
tected code block. This restriction exists because a coroutine mechanism 
is used to pass control between the protected block and the VMS routines 
that establish the protected code. 

3. VMS elevates IPL to 31, so only a limited number of instructions should 
be included in the block. 

4. R0 is destroyed by the mechanism. 



8.3.4.1 Using the Recovery Mechanism. Several macros are provided in the macro 
library SYS$LIBRARY:LIB.MLB to use this protection mechanism. The fol- 
lowing macro defines the beginning of the block: 

$PRTCTINI LABEL, MASK 

The label argument is identical to the label argument associated with the 
following macro, which defines the end of the block: 

$PRTCTEND LABEL 

If no error occurred while the protected code was executing, R0 contains the 
success code SS$_NORMAL. Otherwise, the low bit of R0 is clear. 

The mask argument allows the block of code to protect itself from different 
classes of errors. The following list describes the specific types of protection 
that are defined by the $MCHKDEF macro: 

MCHK$M_LOG     Inhibit error logging for the error 
MCHK$M_MCK     Protect against machine checks 
MCHK$M_NEXM    Protect against nonexistent memory 
MCHK$M_UBA     Protect against UNIBUS adapter 
               error interrupts 

Two other features used by the VMS operating system are a part of this pro- 
tection mechanism. The following macro allows the VMS system to deter- 
mine whether a recovery block is in effect and take action accordingly: 

$PRTCTEST ADDRESS, MASK 

The status is returned in R0. The low bit set indicates that a recovery block is 
in effect and that the specified mask is being used. 

The following macro is used by the machine check handlers for the VAX- 
11/730, the VAX-11/750, and the VAX-11/780 before issuing a fatal bugcheck: 

$BUGPRTCT 

If no recovery block is in effect, control is passed back to the location where 
this macro was invoked, where a bugcheck is usually issued. If a recovery 
block is in effect, control is passed to the end of the protected block with R0 
containing an error code of SS$_MCHECK. 






9 System Service Dispatching 



Between the idea 

And the reality 

Between the motion 

And the act 

Falls the Shadow. 

— T.S. Eliot, The Hollow Men 

Many of the operations that the VMS operating system performs on behalf of 
the user are implemented as procedures called system services. Most of these 
procedures are linked as part of the executive and reside in system space; 
others are contained in privileged libraries. System services have global entry 
point names of the form EXE$service and typically execute in kernel or exec- 
utive access mode so that they can read or write data structures protected 
from access by less privileged access modes. Some services are invoked di- 
rectly by application programs. Others are called on behalf of the user by 
components such as RMS. This chapter describes how control is passed from 
a user program to the procedures in the executive that execute service-spe- 
cific code. 



9.1 SYSTEM SERVICE VECTORS 

The addresses 7FFEDE00 to 7FFEE5FF (four pages of P1 space) are reserved for 
entry points to the system services and to RMS service routines. The global 
entry point name of each system service vector is SYS$service, as distin- 
guished from EXE$service, the global name of the procedure in the executive 
image that performs the actual work of the system service. 

Prior to Version 3.0, the system service entry points were maintained in 
the lowest four pages of system virtual address space (addresses 80000000 
to 800005FF). These entry points still exist in this location, so that 
programs linked before VAX/VMS Version 3.0 still refer to the 
correct entry points. The vectors were moved to process space so that system 
services could be intercepted on a per-process basis. 

As new services are added to future releases of the VAX/VMS operating 
system, the vector area will grow to make room for new entry points. In 
addition, the absolute locations of the SYS$service entry points of existing 
services will remain fixed forever, so that existing user programs will not 
have to be relinked each time there is a new release of the VMS operating 
system. 

Each service entry point contains eight bytes of code and data called a 
system service vector. Each vector consists of a global entry point named 
SYS$service, a register save mask, a single instruction that transfers control 
eventually to a service-specific procedure in the executive, and an instruction 
(usually a RET) that passes control back to the caller. 

Note that the vectors for the "composite" system services ($QIOW and 
$ENQW) contain the code required to execute the service, test 
return conditions, conditionally execute the $WAITFR service, and pass con- 
trol back to the caller. 

Most of the system services execute in kernel mode and the vectors for 
these services contain a CHMK instruction. A few system services and all of 
the RMS services contain a CHME instruction. Some services such as the 
text formatting services execute in the access mode of the caller and dispatch 
directly to the service-specific code in the VMS operating system with a JMP 
instruction. The following examples illustrate the three sets of instructions 
found in the system service vector area. The entry mask in each system serv- 
ice vector is identical to the entry mask found at location EXE$service. 
Table 9-1 lists the VMS system services that use each of the three illustrated 
methods of initial dispatch. 

Vectors for system services that change mode to kernel contain the follow- 
ing code: 

SYS$service::                             ;Entry point 
        .WORD   entry-mask 
        CHMK    I^#service-specific-code 
        RET                               ;Return to caller 
        .BLKB   1                         ;Spare byte 

The extra byte here and in the vector for executive mode is used to keep the 
entry points on quadword boundaries. 

Vectors for system services that change mode to executive contain the fol- 
lowing code: 

SYS$service::                             ;Entry point 
        .WORD   entry-mask 
        CHME    I^#service-specific-code 
        RET                               ;Return to caller 
        .BLKB   1                         ;Spare byte 

Most vectors for RMS service calls replace these last two bytes with a branch 
to an RMS synchronization routine. 

Vectors for system services that do not change mode contain the following 
code: 



SYS$service::                             ;Entry point 
        .WORD   entry-mask                ; of the caller 
        JMP     @#EXE$service + 2         ;Transfer control to the 
                                          ; first instruction after 
                                          ; the entry mask at 
                                          ; EXE$service 

This JMP instruction transfers control to the first instruction after the entry 
mask at EXE$service. 




Table 9-1: System Services and RMS Services That Use Each Form of System Service 
Vector 

The following system services execute initially in kernel mode: 

$ADJSTK    $CREMBX    $DEQ       $GETPTI    $SETAST    $SETSSF 
$ADJWSL    $CREPRC    $DERLMB    $GETSYI    $SETEF     $SETSTK 
$ALLOC     $CRETVA    $DGBLSC    $HIBER     $SETEXV    $SETSWM 
$ASCEFC    $CRMPSC    $DLCEFC    $LCKPAG    $SETIME    $SNDERR 
$ASSIGN    $DACEFC    $ENQ       $LKWSET    $SETIMR    $SUSPND 
$BRDCST    $DALLOC    $ENQW      $MGBLSC    $SETPFM    $TRNLOG 
$CANCEL    $DASSGN    $EXIT      $PURGWS    $SETPRA    $ULKPAG 
$CANEXH    $DCLAST    $EXPREG    $QIO       $SETPRI    $ULWSET 
$CANTIM    $DCLCMH    $FORCEX    $QIOW      $SETPRN    $UPDSEC 
$CANWAK    $DCLEXH    $GETCHN    $READEF    $SETPRT    $WAITFR 
$CLREF     $DELLOG    $GETDEV    $RESUME    $SETPRV    $WAKE 
$CMKRNL    $DELMBX    $GETDVI    $RUNDWN    $SETRWM    $WFLAND 
$CNTREG    $DELPRC    $GETJPI    $SCHDWK    $SETSFM    $WFLOR 
$CRELOG    $DELTVA 

The following system services execute initially in executive mode: 

$CMEXEC    $NUMTIM    $SNDOPR 
$GETTIM    $SNDACC    $SNDSMB 
$IMGACT 

The following system services execute in the access mode of the caller. The services marked 
with a (1) can be called from any access mode; the services marked with a (2) can be called 
from executive and outer access modes. Those not marked can only be called from supervi- 
sor and user mode. 

$ASCTIM (1)    $FAOL (1)      $IMGSTA 
$BINTIM (1)    $GETMSG (2)    $PUTMSG 
$EXCMSG (2)    $IMGFIX        $UNWIND 
$FAO (1) 

The following RMS services execute in executive mode and branch to a synchronization 
routine before returning to the caller: 

$CLOSE         $EXTEND    $OPEN       $REWIND 
$CONNECT       $FIND      $PARSE      $SEARCH 
$CREATE        $FLUSH     $PUT        $SPACE 
$DELETE        $FREE      $READ       $TRUNCATE 
$DISCONNECT    $GET       $RELEASE    $UPDATE 
$DISPLAY       $MODIFY    $REMOVE     $WAIT 
$ENTER         $NXTVOL    $RENAME     $WRITE 
$ERASE 

The following RMS services execute in executive mode. The vectors for these RMS services 
contain RET instructions rather than a branch to an RMS synchronization routine: 

$RMSRUNDWN    $SETDDIR    $SETDFPROT    $SSVEXC 




9.2 CHANGE MODE INSTRUCTIONS 

When a change mode instruction is executed, an exception is generated that 
pushes the PSL, the PC of the next instruction, and the code that is the single 
operand of the change mode instruction onto the stack indicated in the in- 
struction. (As pointed out in Chapter 4, the actual access mode is the mini- 
mum of the access mode indicated by the instruction and the current access 
mode contained in the PSL.) For example, the execution of a CHME #5 instruc- 
tion will push a PSL, the PC of the instruction following the CHME instruc- 
tion, and a 5 onto the executive stack. Control is then passed to the exception 
service routine whose address is located in the appropriate entry in the sys- 
tem control block (SCB). 



9.2.1 The CHMK and CHME Instructions 

At initialization time, the VMS operating system fills in the SCB entries for 
CHMK and CHME with the addresses of change mode dispatchers that pass 
control to the procedures that perform service-specific code. The action of 
these two dispatchers is discussed in the next section. 



9.2.2 The CHMS and CHMU Instructions 

The SCB entries for CHMS and CHMU are filled in with the addresses of 
exception service routines that usually pass control to the general exception 
dispatcher described in Chapter 4. In this case, a CHMS or CHMU exception 
would be reported to a process through the normal signal and mechanism 
arrays. The particular exception names are SS$_CMODSUPR and 
SS$_CMODUSER respectively. 

However, a user can short circuit the normal exception dispatching in the 
case of either of these exceptions by using the $DCLCMH system service to 
establish a per-process change-mode-to-supervisor or change-mode-to-user 
exception handler. This service fills location CTL$GL_CMSUPR or 
CTL$GL_CMUSER in the P1 pointer page with the address of the user-writ- 
ten change mode dispatcher. The exception service routines for the 
CHMS and CHMU exceptions check these locations for nonzero contents 
and dispatch accordingly. 
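
The dispatch decision is a one-line test; the C sketch below models it. The 
handler cell stands in for CTL$GL_CMSUPR (or CTL$GL_CMUSER), and the 
fallback routine is a stub for the normal exception dispatching of Chapter 4. 

    /* Illustrative sketch; names are assumed stand-ins. */
    typedef void (*chm_handler_t)(void);

    static chm_handler_t cmsupr_handler;   /* stand-in, CTL$GL_CMSUPR   */

    static void report_cmodsupr(void)
    {
        /* stub: report SS$_CMODSUPR through the normal signal and
           mechanism arrays */
    }

    void chms_exception(void)
    {
        if (cmsupr_handler != 0)
            cmsupr_handler();       /* user-written change mode handler */
        else
            report_cmodsupr();      /* normal exception dispatching     */
    }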

The DCL and MCR command language interpreters use this service to 
create a special change-mode-to-supervisor handler. This handler is used 
when it is necessary to get to supervisor mode from user mode when an 
image is interrupted with a CTRL/Y. The use of the change-mode-to-supervi- 
sor handler is discussed in Chapter 23. The job controller uses a 
change-mode-to-user dispatcher for its processing of error messages. 




9.3 CHANGE MODE DISPATCHING IN THE VMS EXECUTIVE 

The change mode dispatcher that receives control from the CHMK or CHME 
instruction in the system service vector must dispatch to the procedure indi- 
cated by the code that is found on the top of the stack. In addition, because 
the service routines are written as procedures, the dispatcher must construct 
a call frame on the stack. Building the call frame could be accomplished by 
using a CALLx instruction and a dispatch table of service entry points. 

However, the call frame that must be built is identical for each service. In 
addition, the registers that the service-specific procedure will modify have 
already been saved because the register save mask in the vector area (at global 
location SYS$service) is the same as the register save mask at location 
EXE$service. So the dispatcher avoids the overhead of the general purpose 
CALLx instruction and builds its call frame by hand. 

Further speed improvement is achieved in this commonly executed code 



[Figure 9-1 shows the control flow from the user program's CALLx, through 
the system service vector (whose CHMK or CHME instruction raises the 
change mode exception), to the change mode dispatcher EXE$CMODxxxx, 
which builds a call frame, checks the argument list, and dispatches through a 
CASEW instruction to the service-specific procedure EXE$service; the 
procedure's RET passes control to the common exit path at SRVEXIT, which 
issues an REI. Illegal change mode codes are processed at the end of the case 
table.] 

Figure 9-1 
Control Flow of System Services That Change Mode 

path by overlapping memory write operations (building the call frame) with 
register-to-register operations and instruction stream references. The actual 
dispatch to the service-specific procedure is then accomplished with a 
CASEW instruction that uses the CHMx code as its index into the case table. 
Figure 9-1 pictures the control flow from the user program all the way to the 
service-specific procedure. This flow is illustrated for both kernel and execu- 
tive access modes. Figure 9-2 shows the corresponding flow for those services 
that do not change mode. 



9.3.1 Operation of the Change Mode Dispatcher 

The operation of the change mode dispatchers is almost identical for kernel 
and executive modes. This section discusses the common points of the dis- 
patchers for kernel and executive modes. The next sections point out the 
only differences between the dispatchers for the two access modes. 

The first instruction of the dispatcher pops the exception code, unique for 
each service, from the stack into R0. In both the kernel mode dispatcher and 
the executive mode dispatcher, the call frame is built on the stack by the 
following four instructions: 



        PUSHAB  B^SRVEXIT 
        PUSHL   FP 
        PUSHL   AP 
        CLRQ    -(SP) 

[Figure 9-2 shows the corresponding flow for services that do not change 
mode: the user program's CALLx enters the system service vector 
SYS$service in P1 space, whose JMP instruction passes control directly to 
the service-specific procedure EXE$service in system space; that procedure's 
RET returns control to the caller.] 

Figure 9-2 
Control Flow of System Services That Do Not Change 
Mode 




While the call frame is being built, two checks are performed on the argu- 
ment list. The number of arguments actually passed (found in the first byte of 
the argument list) is compared to a service-specific entry in a prebuilt table to 
determine whether the required number of arguments for this service have 
been passed. Read accessibility of the argument list is checked (with the 
PROBER instruction generated by the IFNORD macro). If either of these 
checks fails, control is passed back to the caller, with an error indication in 
R0. 

Finally, a CASEW instruction is executed, using the unique code in R0 as 
an index into the case table. The case table has been set up at assembly time 
to contain the addresses of the first instruction of each service-specific rou- 
tine. Because each service is written as a procedure with a global entry point 
named EXE$service pointing to a register save mask, the case table contains 
addresses of the form EXE$service + 2. This structure is illustrated in the 
following examples of dispatchers. If control is passed to the end of the case 
table, then a CHMx instruction was executed with an improper code and the 
error processing described in the next section is performed. 
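
In C terms, the CASEW dispatch is a bounds-checked table lookup. The sketch 
below is illustrative (the table, the limit, and the status value are assumptions); 
it models only the in-range/out-of-range decision, not the hand-built call frame. 

    /* Illustrative model of the CASEW dispatch; names assumed. */
    typedef int (*service_t)(void *arglist);

    enum { ILLEGAL_SERVICE = 0 };   /* stand-in error status            */

    int dispatch(unsigned code, void *arglist,
                 service_t const table[], unsigned limit)
    {
        if (code >= 1 && code <= limit)     /* CASEW R0,#1,#KCASMAX     */
            return table[code - 1](arglist);
        return ILLEGAL_SERVICE;     /* fell past the end of case table  */
    }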

Code Example 9-1 compares the code for the two dispatchers, copied 
from the module CMODSSDSP. The entries containing the string "******" 
indicate places where the change mode dispatchers differ. The instructions 
are not listed in exactly the same order that they appear in the source mod- 
ule. Rather, the instructions are shown in the order that they are found when 
all the PSECTs have been sorted out at link time. 

The examples shown in Code Example 9-2 contain the error routines to 
which the change mode dispatchers branch. These routines are invoked if the 
argument list is inaccessible or if an insufficient number of arguments was 
passed to the service. 

The routine in Code Example 9-3 is the common exit path for all system 
service and RMS service calls. The usual exit path is the REI instruction. The 
alternate exit path is to report a SS$_SSFAIL exception. 



9.3.2 Change-Mode-to-Kernel Dispatcher 

There are two steps performed by the change-mode-to-kernel dispatcher that 
are not performed by the change-mode-to-executive dispatcher. Before con- 
trol is passed to those services that execute in kernel mode, the address of the 
PCB for the current process (found at global location SCH$GL_CURPCB) is 
placed into R4. The second difference is that CHMK #0 is a special entry 
path into kernel mode that is used by the AST delivery routine following the 
call to the AST procedure. If the CHMK code removed from the stack is a 
zero, control is passed to a routine called ASTEXIT. The action of this routine 
is described in Chapter 7. 






Code Example 9-1 

Change Mode to Kernel Dispatcher             Change Mode to Executive Dispatcher 

EXE$CMODKRNL::                               EXE$CMODEXEC:: 
    POPL    R0                                   POPL    R0 
    BEQL    ASTEXIT                              ****** 
    PUSHAB  B^SRVEXIT                            PUSHAB  B^SRVEXIT 
    MOVZBL  R0,R1                                MOVZBL  R0,R1 
    PUSHL   FP                                   PUSHL   FP 
    MOVZBL  W^B_KRNLARG[R1],R1                   MOVZBL  W^B_EXECNARG[R1],R1 
    PUSHL   AP                                   PUSHL   AP 
    MOVAL   @#4[R1],FP                           MOVAL   @#4[R1],FP 
    CLRQ    -(SP)                                CLRQ    -(SP) 
    IFNORD  FP,(AP),ACCVIO                       IFNORD  FP,(AP),EXACCVIO 
     PROBER #0,FP,(AP)                            PROBER #0,FP,(AP) 
     BEQL   ACCVIO                                BEQL   EXACCVIO 
    MOVL    SP,FP                                MOVL    SP,FP 
    CMPB    (AP),R1                              CMPB    (AP),R1 
    BLSSU   KINSARG                              BLSSU   EXINSARG 
KERDSP:                                      EXEDSP: 
    MOVL    SCH$GL_CURPCB,R4                     ****** 
    CASEW   R0,#1,#KCASMAX                       CASEW   R0,#0,S^#ECASMAX 
    (offsets to EXE$service + 2)                 (offsets to EXE$service + 2) 
    ******                                       JSB     @CTL$GL_RMSBASE 
    (check inhibit bits)                         (check inhibit bits) 
    BSBW    CHECKARGLIST                         BSBW    CHECKARGLIST 
    MOVL    @#CTL$GL_USRCHMK,R1                  MOVL    @#CTL$GL_USRCHME,R1 
    BEQL    10$                                  BEQL    10$ 
    JSB     (R1)                                 JSB     (R1) 
10$: MOVL   L^EXE$GL_USRCHMK,R1             10$: MOVL   L^EXE$GL_USRCHME,R1 
    BEQL    50$                                  BEQL    50$ 
    JSB     (R1)                                 JSB     (R1) 
50$: NOP                                    50$: BRW    ILLSER 
     NOP 
ILLSER: 
    MOVZWL  #SS$_ILLSER,R0 
    RET 






Code Example 9-2 

EXACCVIO:                               ;From EXE$CMODEXEC 
        MOVL    SP,FP                   ;Point FP to call frame 
                                        ; so that RET works 
        CMPW    R0,#ECASCTR             ;Only report ACCVIO for RMS 
        BGEQU   EXEDSP                  ; and built-in functions 
        BRW     ACCVIO_RET              ;Otherwise, get back in line 

EXINSARG: 
        CMPW    R0,#ECASCTR             ;Only report INSARG for RMS 
        BGEQU   EXEDSP                  ; and built-in functions 
        BRB     INSARG                  ;Otherwise, get back in line 
                                        ; and report error to caller 

CHECKARGLIST:                           ;Check argument list for 
                                        ; read accessibility 
        IFNORD  #4,(AP),ACCVIO_RET      ;First check count 
        CVTBL   (AP),R1                 ;Then get count 
        BLSS    10$                     ;Branch if more than 127 arguments 
        ASHL    #2,R1,R1                ;Convert to byte count 
        IFNORD  R1,4(AP),ACCVIO_RET     ;Now check rest of list 
        RSB 
10$:    MOVZBL  R1,R1                   ;Clear high three bytes 
        ASHL    #2,R1,R1                ;Convert to byte count 
        PUSHL   R0                      ;Save registers 
        PUSHL   R2 
        PUSHL   R3 
        MOVAL   4(AP),R0                ;Get beginning of list 
        CLRL    R3                      ;Kernel mode 
        JSB     EXE$PROBER              ;Can addresses be read? 
        POPL    R3                      ;Restore registers 
        POPL    R2 
        BLBC    R0,50$                  ;Address could not be read 
        POPL    R0                      ;Address could be read, 
        RSB                             ; return 
50$:    POPL    R0                      ;Return access violation 
        BRB     ACCVIO_RET 

ACCVIO:                                 ;From EXE$CMODKRNL 
        MOVL    SP,FP                   ;Set FP so that RET works 
ACCVIO_RET: 
        MOVZWL  #SS$_ACCVIO,R0 
        RET 

KINSARG: 
        CMPW    R0,#KCASCTR             ;Is this a recognized code? 
        BGEQU   KERDSP                  ;No. Get back in line 
INSARG: 
        MOVZWL  #SS$_INSFARG,R0 
        RET 






Code Example 9-3 

SRVEXIT: 
        BLBC    R0,SSFAIL               ;Check for error 
SRVREI: 
        REI 

SSFAIL: BITL    #6,R0                   ;Check for mere warning 
        BEQL    SRVREI                  ;If so, do not generate 
                                        ; exception 
        BRW     SSFAILMAIN              ;Go to SSFAIL logic 

SSFAILMAIN: 
        MOVL    G^CTL$GL_PCB,R1         ;Check for ownership of a mutex 
        TSTW    PCB$W_MTXCNT(R1) 
        BNEQ    20$                     ;If so, BUGCHECK 
        EXTZV   #PSL$V_CURMOD,#PSL$S_CURMOD,4(SP),-(SP) 
                                        ;Are system service 
                                        ; failure exceptions enabled 
                                        ; for caller's access mode? 
        ADDL    #PCB$V_SSFEXC,(SP) 
        BBC     (SP)+,PCB$L_STS(R1),10$ ;If not, dismiss the 
                                        ; exception 
        MOVPSL  -(SP)                   ;Get current PSL 
        EXTZV   #PSL$V_CURMOD,#PSL$S_CURMOD,(SP),(SP)+ 
        BNEQ    5$                      ;If the current mode is kernel, 
        SETIPL  #0                      ; IPL must be lowered to 0 
5$:     JMP     EXE$SSFAIL              ;Pass control to the 
                                        ; general exception dispatcher 
10$:    REI                             ;Return from service with 
                                        ; error status 
20$:    BUG_CHECK MTXCNTNONZ,FATAL 



9.3.3 Change-Mode-to-Executive Dispatcher

The change-mode-to-executive dispatcher performs one step unique to exec-
utive mode. If the CHME code is not a recognized system service, the
CASEW instruction passes control to the end of the case table. At that point,
the change-mode-to-executive dispatcher transfers control to the RMS dis-
patcher to determine whether this was a valid RMS call before dropping into
the error processing described in the next section.

9.3.4 RMS Dispatching

The RMS dispatcher, illustrated in Figure 9-3, consists of two instruc- 
tions. The CASEW instruction will dispatch to RMS service-specific proce- 
dures for legitimate RMS service codes. These procedures will exit with a 
RET back to SRVEXIT. If an illegal code (that is, a code not recognized as 
an RMS service call) was issued, the RSB instruction following the CASEW 
instruction will pass control back to EXE$CMODEXEC for normal error 
processing. 
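The two-instruction structure just described can be pictured as the follow-
ing minimal sketch; the label RMS$DISPATCH and the limit symbol
RMSCASMAX are illustrative names, not necessarily those in the RMS
sources.

RMS$DISPATCH::
        CASEW   R0,#0,#RMSCASMAX        ;Dispatch on RMS service code
        ...                             ; (table of word offsets to the
                                        ;  RMS service-specific procedures)
        RSB                             ;Not an RMS code; return to
                                        ; EXE$CMODEXEC error processing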



Figure 9-3
Control Flow of RMS Dispatching
(The figure shows the RMS dispatcher in system space: at RMS$DISPATCH,
a CASEW instruction selects from a table of offsets to the RMS service-
specific procedures, each of which begins at a label of the form
RMS$service:: with an entry mask.)



9.3.5 Return Path for System Services 

When the service-specific procedure has completed its operation, it places a 
status code in RO and issues a RET instruction. This instruction returns con- 
trol to the code at label SRVEXIT (shown in the examples in Section 9.3.1) 
because this address was put into the saved PC area of the call frame built by 
the change mode dispatcher. The routine SRVEXIT first checks whether an 
error occurred. If no error occurred or if the error was merely a warning 
(that is, the severity field R0<2:0> indicates neither an error nor a severe
error), the CHMx exception is dismissed with an REI instruction that
passes control to the instruction following the CHMx in the vector area. This 
instruction is a RET which finally returns control to the user program follow- 
ing the call to SYS$service (see the code examples in Section 9.1). 

One additional step is taken by routine SRVEXIT when it is executed in 
kernel mode: IPL is explicitly lowered to zero. This step is unnecessary un- 
less the process has enabled system service failure exceptions because the
REI instruction that dismisses the CHMK exception will lower IPL. How-
ever, if a system service failure exception is to be generated, the exception 
code must be entered with IPL set to zero. (A similar check is not needed for 
executive mode services because only kernel mode code can execute at ele- 
vated IPL.) 

If an error or severe error occurred, a check is made to see whether the 
process owns any mutex. If so, the system service has not released all of its 
mutexes on exit (an erroneous error path) and a fatal bugcheck is generated. 
(Chapter 8 describes bugcheck processing. Mutexes are described in Chapter 
2.) If the mutex check succeeds, a check is made to determine whether this 
process has enabled system service exceptions for the calling access mode. If 
it has, control is passed to the exception dispatcher at global label 
EXE$SSFAIL. The exception that will be reported to the caller in the signal 
array is SS$_SSFAIL. Otherwise, control is passed back to the caller with RO 
containing the error status code. 

9.3.6 Return Path for RMS Services 

The return path for RMS services is slightly more complicated than the re- 
turn path for system services. The last two bytes of the vector contain a 
branch (BRB) to an RMS synchronization routine (contained in module 
CMODSSDSP). This routine first checks whether the caller of the RMS serv- 
ice wishes to wait. This is the usual case, but RMS does allow asynchronous 
I/O operations. (The return status code is set to RMS$_STALL by RMS in the
usual state, where the process must wait until the completion of the RMS 
operation.) 

9.3.6.1 Wait State Associated with RMS Requests. If a stall is indicated, the caller is 
put into an event flag wait state, waiting for the event flag associated with 
the I/O request that RMS has just issued. The crucial point in this implemen- 
tation is that the caller is waiting at the access mode associated with the 
original call to RMS and not in executive access mode, thus allowing AST 
delivery for all access modes at least as privileged as the caller of RMS. (In the 
usual case where RMS is called from user mode, the access mode of the wait 
state allows both user and supervisor ASTs as well as executive and kernel 
ASTs to be delivered while waiting for the RMS operation to complete.) 
When the original I/O request completes, RMS gains control first in an 
executive mode AST that it associated with its $QIO request. If it determines 
that the original request is complete, it sets final status in the data structure 
(FAB or RAB) associated with the operation and returns from its AST. The 
caller now drops through the event flag wait in the synchronization routine 
(because the I/O completion routine set the event flag). The synchronization 
routine determines that the RMS operation is complete (because the FAB or
RAB status field contains nonzero), and executes a RET, passing control back
to the point where the initial call to RMS was issued. 

If the RMS executive mode AST determines that more I/O is required to 
complete the original request (such as occurs when reading a large record 
from a sequential file with small internal buffers or when operating on an 
ISAM file), RMS issues the next $QIO and returns from its AST. Because the 
previous I/O completion set the associated event flag, the process is now 
computable. However, the RMS operation is not yet complete. For this rea- 
son, the RMS synchronization routine (executing in the caller's access mode) 
checks the status field in the RAB or FAB for zero, indicating that RMS has 
more to do. In this case, the caller is again placed into the LEF state by the 
RMS synchronization routine. In other words, at a primitive level, the proc- 
ess is placed into a LEF state by RMS one or more times. However, the actual 
indication that the RMS operation has completed is nonzero contents in the 
status field of the FAB or RAB. 
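The synchronization loop just described can be sketched as follows. This
is not the actual routine from module CMODSSDSP; the register use (R6
pointing to the FAB) and the event flag number are assumptions made for
illustration.

10$:    TSTL    FAB$L_STS(R6)           ;Has RMS set a final (nonzero)
                                        ; status in the FAB?
        BNEQ    20$                     ;Yes, the operation is complete
        $WAITFR_S EFN=#27               ;No, wait on the I/O event flag
                                        ; (flag number assumed here)
        BRB     10$                     ;Recheck; RMS may have issued
                                        ; another $QIO from its AST
20$:    MOVL    FAB$L_STS(R6),R0        ;Return the RMS status
        RET                             ; to the original caller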

9.3.6.2 RMS Error Detection. When the RMS synchronization routine finally decides 
that RMS has completed its work, it checks the final status. If this status 
indicates either success or warning, a RET is executed. If either an error or a 
severe error occurred, a special RMS call ($SSVEXC) is issued. This service 
simply reports the error status through the normal VMS service exit path 
(SRVEXIT) that determines whether the process has enabled system service 
failure exceptions. Because RMS errors are reported through the system serv- 
ice dispatcher, they are treated in exactly the same manner as system service 
errors. 



9.4 USER-WRITTEN SYSTEM SERVICE DISPATCHING

The VAX architecture reserves CHMx instructions with negative codes for 
customer use. VMS system service dispatching acknowledges this in its dis- 
patch scheme and contains hooks that allow a privileged user to write his 
own system services. The method for doing this is described in the VAX/VMS 
Real-Time User's Guide. This section merely describes how control is passed 
to user-written system services. 

The code examples in Section 9.3.1 illustrate the error processing code that 
follows the case table for the change-mode-to-kernel or change-mode-to-ex- 
ecutive dispatcher. The only differences between these two routines are 
the names of the global pointers that are referenced. 

9.4.1 Per-Process User-Written Dispatcher

If the index into the case table is too large, the CHMK or CHME instruction 
was executed with an invalid code (control is passed to the end of the case
table). The VMS operating system attempts to pass control to a user-written
change mode dispatcher. First, a location in P1 space (CTL$GL_USRCHMK
or CTL$GL_USRCHME) is checked to see whether a per-process dispatcher 
exists. Nonzero contents of this location are interpreted as the address of a 
user-written dispatcher and control is passed to it with the stack as shown in 
Figure 9-4. The assumption being made by the VMS operating system at this 
point is that a valid change mode code will result in the eventual transfer of 
control to SRVEXIT with a RET instruction. If the per-process dispatcher 
rejects the code, it returns control to the code listed in Section 9.3.1 with an 
RSB instruction. 
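A skeletal per-process dispatcher might look like the following sketch.
The service code -1 and the routine MY_SERVICE are hypothetical; the
ADDL2 reflects the removal of the two return PCs shown in Figure 9-4,
after which the service procedure's final RET passes control to SRVEXIT.

USER_KDISPATCH:
        CMPW    R0,#-1                  ;Is this our CHMK code?
                                        ; (hypothetical negative code)
        BEQL    10$
        RSB                             ;Not ours; return to CMODSSDSP
10$:    ADDL2   #8,SP                   ;Remove the two return PCs
                                        ; pushed by the JSB chain
        BRW     MY_SERVICE+2            ;Enter past the entry mask;
                                        ; the RET in MY_SERVICE
                                        ; returns to SRVEXIT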



9.4.2 Privileged Shareable Images 

The usual contents of CTL$GL_USRCHMK and CTL$GL_USRCHME are 
addresses within the two pages in P1 space set aside by the VMS operating
system for user-written system services and image-specific message process- 
ing. Kernel mode and executive mode each have one half page (256 bytes) 
devoted to system service dispatching. The initial content of the first byte of 
each dispatch area (set up by PROCSTRT) is an RSB instruction. With the 
dispatch scheme described in the previous section, there is effectively no 
per-process change mode dispatching. 

However, if an image executes that was previously linked with a privileged 
shareable image (linked with the /PROTECT and /SHAREABLE options and 
installed with the /PROTECTED and /SHARED options), the image activator 
replaces the RSB instruction with a JSB to the user-written change mode 
dispatcher specified as a part of the privileged shareable image (see Figure 
9-5). The VMS operating system allows multiple privileged shareable images 
to be linked into the same executable image. (There is a limit of 42 user-writ- 



        Return PC in Dispatch Vector        <- SP
        Return PC in CMODSSDSP
            (These two longwords are removed by the dispatcher
             before calling the system service code.)
        (Condition Handler Address)         <- FP
        (PSW/Register Save Mask)
        Saved AP
        Saved FP
        SRVEXIT (Return PC)
        PC Following CHMx Instruction
        PSL Following CHMx Instruction

        (The stack grows toward the top of the figure.)

Figure 9-4
State of the Stack within a User-Written Dispatcher



Figure 9-5
Dispatching to User-Written System Services
(The figure traces the flow numbered 1 through 6 in the text: a CALLx from
the user program in P0 space enters the service entry mask and its CHMx
instruction; the change mode dispatcher EXE$CMODxxxx in system space
builds the call frame, checks the argument list, and executes a CASEW; an
unrecognized code reaches the JSB into the P1 space dispatch vector built
by the image activator at CTL$A_DISPVEC, where successive JSB
instructions enter dispatchers A, B, and C in turn, each rejecting the code
with an RSB until one accepts it, dispatches to its service procedure, loads
a status into R0, and issues a RET to the common exit path SRVEXIT,
which executes the REI.)



(There is a limit of 42 user-written dispatchers of each type. How these
dispatchers are collected into privileged shareable images determines the
number of privileged shareable images that can be included in a single
executable image.) An RSB instruction follows the last JSB instruction in the
dispatch area. The example pictured in Figure 9-5 shows three privileged
shareable images.

When the image activator (see Chapter 21) encounters a privileged share- 
able image as a part of the executable image it is activating, it maps the 
section(s) containing the user-written system services in the usual manner.
However, it also uses information stored in a protected image section or in 
the first eight longwords of the image (a privileged library vector pictured in
Figure 9-6) to modify the P1 space dispatch area. For example, if a privileged
shareable image contained a change-mode-to-kernel dispatcher, the image 
activator would insert a JSB instruction in P1 space that transferred control to
the dispatcher specified by the PLV$L_KERNEL longword in the privileged 
library vector. Once the image containing user-written system services is 
activated, execution proceeds normally until one of the services is invoked.
Dispatching proceeds as follows (see Figure 9-5). 

(1) A CALLx instruction transfers control to a service-specific entry mask in
P0 space. The CHMx (CHMK or CHME) instruction located there trans-
fers control to the VMS change mode dispatcher.



Figure 9-6
Structure of Privileged Shareable Image
(The figure shows the pieces of a privileged shareable image: the entry
vectors, one per service, each containing an entry mask, a CHMx #code
instruction, and a RET; the privileged library vector, one per image,
containing the vector type, the system version, an address check, and
pointers to the kernel and executive mode dispatchers; the two
dispatchers, each a CASE on R0 followed by an RSB; and the functional
routines, one per service, each ending by loading a status into R0 and
executing a RET.)

(2) Execution proceeds as if a VMS service was invoked except that the 
change mode code is not recognized by the VMS dispatcher and control 
passes to the end of the case table (see the code examples in Section 
9.3.1). 

(3) The JSB instruction in CMODSSDSP passes control to the P1 space dis-
patch area where another JSB instruction passes control to the first dis- 
patcher. 

(4) The change mode code is rejected by the first dispatcher by simply exe-
cuting an RSB back to the P1 space vector where a second JSB is executed.

(5) The second dispatcher recognizes the change mode code as valid and dis- 
patches (probably with a CASEx instruction) to a service-specific proce- 
dure that is also a part of the second privileged shareable image. 

(6) When the service completes (successfully or unsuccessfully), it loads a 
final status into RO and exits with a RET which passes control to 
SRVEXIT. At this point, user-written system service dispatching merges 
with VMS system service dispatching. 

If each dispatcher rejected the change mode code (by executing an RSB), con-
trol would eventually reach the RSB instruction in the P1 space vector. This
RSB instruction passes control back to the VMS change mode dispatcher in
CMODSSDSP, where a check is made next for a system-wide dispatcher.

9.4.3 System-Wide User-Written Dispatcher

If the P1 space location contains a zero, or if no per-process dispatchers are
invoked, or if the last per-process user-written dispatcher returns to the rou- 
tine in CMODSSDSP with an RSB, a location in system space 
(EXE$GL_USRCHMK or EXE$GL_USRCHME) is checked for the existence 
of a system-wide user-written dispatcher. If none exists (contents are zero, its
usual contents in a VMS system), or if this dispatcher passes control back 
with an RSB, an illegal system service call (SS$_ILLSER) is reported back to 
the user in RO. This scheme assumes that user-written system services that 
complete successfully will exit with a RET back to SRVEXIT, where an REI 
instruction will dismiss the CHMK or CHME exception. Note that there is 
no standard documented way to add a system-wide user-written dispatcher to 
the system. 

9.5 RELATED SYSTEM SERVICES 

There are five system services in the VMS operating system that are closely 
related to system service dispatching and the change mode instructions. The 
$DCLCMH system service was briefly described in Section 9.2.2. This sec- 
tion describes the $SETSFM service, the $SETSSF service, and the change 
mode system services. 




9.5.1 Setting System Service Failure Exceptions 

The $SETSFM system service either enables or disables the generation of 
exceptions when an error is detected by the system service common exit 
path. The service itself simply sets (to enable) or clears (to disable) the bit in 
the process status longword (at offset PCB$L_STS in the software PCB) for 
the access mode from which the system service was called. 
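For example, a program that wants failure exceptions around a critical
sequence of service calls might bracket the sequence as in this sketch; the
condition handler that will receive the SS$_SSFAIL signal is assumed to be
already established.

        $SETSFM_S ENBFLG=#1             ;Enable failure exceptions for
                                        ; this access mode
        ;       ... service calls whose errors should signal ...
        $SETSFM_S ENBFLG=#0             ;Disable them again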



9.5.2 Change Mode System Services 

The $CMKRNL and $CMEXEC system services provide a simple path for 
privileged processes to execute code in kernel or executive mode. These serv- 
ices check for the appropriate privilege (CMKRNL or CMEXEC) and then 
dispatch (with a CALLG instruction) to the procedure whose address is sup- 
plied as an argument to the service. (Note that if $CMKRNL is called from 
executive mode, no privilege check is made.) 

The procedure that executes in kernel or executive mode must load a re- 
turn status code into R0. If not, the previous contents of R0 will be used to
determine whether an error occurred. 
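A minimal sketch of such a call follows; KERNEL_ROUTINE is a
hypothetical procedure. Note the explicit status load into R0, for the
reason just given.

        .ENTRY  KERNEL_ROUTINE,^M<>     ;Executes in kernel mode
        MOVZWL  #SS$_NORMAL,R0          ;Load the return status
        RET

        ;       From code running with CMKRNL privilege:
        $CMKRNL_S ROUTIN=KERNEL_ROUTINE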



9.5.3 System Service Filtering 

In some applications, especially user-written CLIs, it is desirable to deny 
access to system services that can be called from user mode. The Set System 
Service Filter ($SETSSF) system service was provided for this purpose. 

When the module CMODSSDSP is assembled, in order to create the sys- 
tem service vectors, two tables of bytes are created, one for kernel mode 
system services (at the symbol B_KMASK), and one for executive mode sys-
tem services (at the symbol B_EMASK). Each entry in these tables contains a 
mask that indicates whether or not the system service can be disabled by 
$SETSSF. If the service can be disabled by $SETSSF, the mask also indicates 
the system service filter groups for which the service is disabled. Group 0
specifies all services except $EXIT; group 1 specifies most services, with the
exception of $EXIT and those services required for condition handling or 
image rundown. The VAX/VMS System Services Reference Manual lists the 
services that are not disabled by $SETSSF. 

The byte at offset CTL$GB_SSFILTER in the per-process control region 
contains the system service filter mask for a particular process. Usually this 
mask contains the value zero. When $SETSSF is called, the mask value speci- 
fied in the call to $SETSSF is written into this mask. 

When the system is bootstrapped, module INIT checks the bit 
EXE$V_SSINHIBIT at global location EXE$GL_DEFFLAGS. This bit corre- 
sponds to the SYSBOOT parameter SSINHIBIT. If the bit is set, the entry
points in the change mode dispatcher for CHME and CHMK are revectored to
the entry points EXE$CMODEXECX and EXE$CMODKRNLX, respectively. 
When control is passed to these alternate entry points (from a CHME or 
CHMK instruction), the value in CTL$GB_SSFILTER is ANDed with the
value in the system service filter tables (found at locations B_EMASK or 
B_KMASK). The CHMx code is used as an index into these tables. If the 
result of the AND is zero, processing continues and control is passed to the 
system service; if the result of the AND is nonzero, the call to the system 
service fails with the exit status SS$_INHCHME or SS$_INCHMK, depend- 
ing on whether the system service was an executive mode or kernel mode 
service. 
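The check made at these alternate entry points can be sketched as follows
for the kernel mode case; the instruction sequence and the label
INHIBITED are illustrative, not the literal CMODSSDSP code. R0 contains
the CHMK code, used as an index into the kernel mode filter table.

        MOVZBL  W^B_KMASK[R0],R1        ;Get this service's filter byte
        BITB    CTL$GB_SSFILTER,R1      ;AND it with the process's
                                        ; filter mask
        BNEQ    INHIBITED               ;Nonzero: service is disabled;
                                        ; fail with SS$_INHCHMK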






PART III / Scheduling and Timer Support



10 Scheduling 



It is equally bad when one speeds on the guest unwilling to go, 
and when he holds back one who is hastening. Rather one should 
befriend the guest who is there, but speed him when he wishes. 
— Homer, The Odyssey 



Scheduling is concerned with the order of execution of processes and the 
occurrence of events over time. The scheduler identifies and executes the 
highest priority, memory-resident process. Processes may or may not be 
scheduled, depending on the scheduling state of the process and the nature of 
the event or resource for which the process is waiting. Transitions from one 
state to another occur as the result of system events such as the setting of an 
event flag, enqueuing an AST, calling the $WAKE system service, and so 
forth. This chapter describes the interactions of software priorities, process 
states, and system events, as well as the operation of the scheduler. 



10.1 PROCESS STATES 

The state of a process defines the readiness of the process to be scheduled for 
execution. In addition, the process state may indicate whether the process is 
memory resident or outswapped. If a process is waiting for the availability of 
a system resource or the occurrence of an event, then the process state is one 
of several distinct wait states. The wait state reflects the particular condition 
that must be satisfied for the process to become computable again. 



10.1.1 Process Control Block 

The major data structure describing the state and priority of a process is the 
software process control block (PCB). Figure 10-1 illustrates the fields of the 
software PCB that are particularly important to scheduling. The field 
PCB$W_STATE contains a numeric value associated with a particular proc-
ess state. The process state is established by moving the appropriate value
into PCB$W_STATE and inserting the PCB into the corresponding state
queue by means of the state queue link fields, PCB$L_SQFL and 
PCB$L_SQBL. Appendix B contains a complete description of the software 
PCB. Table 10-1 lists the process state names and the corresponding 
PCB$W_STATE values. Other software PCB fields define the scheduling or 
software priority of the process and indicate whether the process is in mem-
ory or outswapped. The location of a data structure containing the hardware
context of the process is also stored in the software PCB (PCB$L_PHYPCB).

Figure 10-1
Process Control Block Fields Used in Scheduling
(The figure shows the software PCB fields important to scheduling: the
state queue links SQFL and SQBL, the current priority PRI, the base
priority PRIB, the status longword STS, the state field STATE, and
PHYPCB, the pointer to the hardware PCB.)
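The two-step state change can be sketched as follows for a process being
made computable resident; R4 contains the PCB address, the symbol
SCH$C_COM is the state value for COM from Table 10-1, and the queue
and bit map symbols are those shown in Figure 10-3 (Section 10.1.3.1). The
literal instruction sequence here is illustrative.

        MOVW    #SCH$C_COM,PCB$W_STATE(R4)      ;Record the new state
        MOVZBL  PCB$B_PRI(R4),R1                ;Internal priority selects
                                                ; the queue
        MOVAQ   W^SCH$AQ_COMH[R1],R2            ;Address of that queue's
                                                ; listhead
        INSQUE  (R4),@4(R2)                     ;Insert PCB at the tail
        BBSS    R1,W^SCH$GL_COMQS,10$           ;Mark the queue nonempty
10$:                                            ; in the queue bit map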



10.1.2 Software Priority 

Software priority (as distinct from interrupt priority, a hardware mechanism) 
is used in determining the relative precedence of processes for execution and 
memory residence. Software priority is a value in the range from 0 to 31. The
null process executes at software priority level 0, and the highest priority 
real-time process executes at software priority level 31. The range of 32 soft- 
ware priority levels is divided evenly between the normal process levels of 0
to 15 and the real-time process levels of 16 to 31. The execution behavior of
a process is significantly affected by the type of process (normal or real time) 
and the assigned software priority level. 

Two fields of the software process control block directly describe the 
scheduling or software priority of the process. The field PCB$B_PRI (see Fig- 
ure 10-1) defines the current software priority of the process, which is used to 
make scheduling decisions. PCB$B_PRIB defines the base priority of the 
process, from which the current priority is calculated. For normal or time- 
sharing processes, these priority values are sometimes different, while real-
time processes always have identical current and base priority values. Each
field may have a value from 0 to 31.

However, the values in these fields are stored internally in an inverted
order. That is, the base and current priorities of 0 for the null process are
stored internally in the PCB fields as 31. The highest priority process possible
would have internally stored software priority values of 0. Thus, the internal
field values are stored as 31 minus the software priority value. This inverted
value causes priority promotions or boosts to be implemented through sub-
tract or decrement instructions. System utilities such as SDA, MONITOR,
and the DCL command SHOW SYSTEM interpret these inverted values and
display external values, where 0 is the lowest priority and 31 is the highest.
External values are also returned by the $GETJPI system service when a proc-
ess priority is requested.

Note that all discussions in this book treat software priority as an increas-
ing entity from 0 (for the null process) to 31 (for the highest priority real-time
process). Please take this convention into account when relating descriptions
in this book to the actual routines in the listings, where inverted priorities
are used.

Table 10-1: Process Scheduling States

State Name                                   Mnemonic    Value
Collided Page Wait                           COLPG         1
Miscellaneous Wait                           MWAIT         2
  (Mutex Wait, Resource Wait)
Common Event Flag Wait                       CEF           3
Page Fault Wait                              PFW           4
Local Event Flag Wait (Resident)             LEF           5
Local Event Flag Wait (Outswapped)           LEFO          6
Hibernate Wait (Resident)                    HIB           7
Hibernate Wait (Outswapped)                  HIBO          8
Suspend Wait (Resident)                      SUSP          9
Suspend Wait (Outswapped)                    SUSPO        10
Free Page Wait                               FPG          11
Computable (Resident)                        COM          12
Computable (Outswapped)                      COMO         13
Currently Executing Process                  CUR          14

10.1.2.1 Real-Time Priority Range. Processes with software priority levels 16 through
31 are considered real-time processes. There are two scheduling characteris- 
tics that distinguish real-time processes. 

1. The software priority of a real-time process does not change over time, 
unless there is a direct program or operator request to change it (with a Set
Priority system service or a SET PROCESS/PRIORITY command). The
fact that the priority does not change implies that the base priority and the 
current priority of a real-time process are identical, and no dynamic prior- 
ity adjustment (see Section 10.1.2.3) is applied by the operating system. 
2. A real-time process executes until it is either preempted by a higher or 
equal priority process or it enters one of the wait states (see Section 
10.1.3.2). Thus, a real-time process is not susceptible to quantum end 
events (see Section 10.1.2.4) and is not removed from execution (resched- 
uled) because some interval of execution time has expired. 

Taken in isolation, the real-time range of VMS software priorities provides 
a scheduling environment like traditional real-time systems: preemptive, pri- 
ority-driven scheduling without time slices or quanta. 

10.1.2.2 Normal Priority Range. Normal processes include interactive terminal ses- 
sions, batch jobs, and all system processes except the swapper. The schedul- 
ing behavior of a normal process is different from that of a real-time process. 

1. The current software priority of the process varies over time while the 
base priority remains constant (unless altered by the Set Priority system 
service or by a SET PROCESS/PRIORITY command). This behavior is the 
result of dynamic priority adjustment applied by the VMS system to favor 
I/O-bound and interactive processes at the expense of compute-bound (and 
frequently also batch) processes. The mechanism of priority adjustment is 
discussed in the following section. Priority adjustment can also occur as a 
result of locking a mutex (see Section 2.3.1) or as a result of action by the 
routine EXE$TIMEOUT (see Section 11.3.5). 

2. Normal processes run in a time-sharing environment that allocates CPU 
time slices (or quanta) to processes in turn. Therefore, an executing nor- 
mal process will control the CPU until one of the following events occurs: 

• It is preempted by a higher or equal priority, computable process (see 
Figure 10-2, event 5, for example). 

• It enters a resource or event wait state (see Figure 10-2, event 7, for 
example). 

• The current quantum or time slice has been used (see Figure 10-2, event 
17, for example). 

3. Processes with identical current priorities are scheduled on a round robin 
basis. That is, each process at a given software priority level executes in 
turn before any other process at that level executes again. Although this 
mechanism applies to real-time processes as well, it generally has no effect 
because real-time processes are usually assigned to unique software prior- 
ity levels and their priorities do not change. Normal processes do experi- 
ence round robin scheduling both because there are usually more of them 



186 



10.1 Process States 



Increasing 
Software 

Priority 

20 



m 



IXI 



m 



.m 



m 



.EEL 



[I 






^ ^A^rJ> 



Time 



Events! © © ©© ©©©©© © ©@®@©©© © 



= SWAPPER 



Process Type Base Priority 



A Compute bound 4 

B I/O bound 4 

C Real time 18 



Events 



(T) I/O request 
M?) Preemption 
(ojQuantum end 



Figure 10-2 

Software Priorities and Priority Adjustments 



on a given system and because the default behavior (from Create Process 
system service arguments or from the user authorization file) is to assign a 
base priority of four to all user processes. Thus software priority levels four 
through nine tend to be occupied by several processes simultaneously. 

10.1.2.3 Priority Adjustment. Normal processes do not generally execute at a single 
software priority level. Rather, a process's software priority changes over time
in a range of zero to six software priority levels above the base process prior- 
ity. Two mechanisms provide this priority adjustment. As a condition for 
which the process has been waiting is satisfied or a needed resource becomes 
available, a boost or priority increment may be applied to the base priority to 
improve the scheduling response for the process (see Section 10.2.4). Each 
time the process executes without further system events (see Section 10.2) or 
quantum expiration (see the next section) occurring, the current priority is 
moved toward the base priority (or demoted) by one priority level (see Section 
10.3). Over time, compute-bound process priorities tend to remain at their
base priority levels, while I/O-bound and interactive processes tend to have
average current priorities somewhat higher than their base priority. An ex- 
ample of priority adjustment that occurs over time for several processes is 
illustrated in Figure 10-2. 
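In terms of the inverted internal values described in Section 10.1.2, where
a boost is a subtraction, the adjustment can be sketched as follows. R4 is
assumed to contain the PCB address and R2 an increment value; the
sequence is illustrative rather than the literal code in module RSE.

        SUBB3   R2,PCB$B_PRIB(R4),R1    ;Candidate = internal base
                                        ; priority minus increment
        CMPB    R1,PCB$B_PRI(R4)        ;Compare with internal current
        BGEQU   10$                     ;Skip if the boosted value is
                                        ; not a higher priority
        MOVB    R1,PCB$B_PRI(R4)        ;Otherwise take the boost
10$: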

10.1.2.4 Quantum Expiration. The SYSBOOT parameter QUANTUM determines, for 
most process states, the minimum amount of time a process can remain in 
memory after an inswap operation, but it is not an absolute guarantee of 
memory residence. (The swapper's use of the initial quantum flag is de- 
scribed in Chapter 17.) The quantum also defines the size of the time slice for 
the round robin scheduling of normal processes. The value of QUANTUM is 
the number of 10-millisecond intervals (clock ticks) in the quantum. The 
default QUANTUM value of 30 therefore produces a scheduling interval of 
300 milliseconds. After each 10-millisecond interval, the hardware clock in- 
terrupt service routine updates the quantum-remaining field in the process 
header of the current process. When this value becomes zero, the software 
timer routine signals a quantum end event by invoking the subroutine 
SCH$QEND in module RSE. 

An additional deduction from the QUANTUM is governed by the special 
SYSBOOT parameter IOTA. This value (in units of 10 milliseconds) is de- 
ducted from the remaining quantum value each time a process enters a wait 
state. Therefore, the default IOTA value of 2 charges 20 milliseconds against 
the quantum of the process. This mechanism is provided to insure that all 
processes experience quantum end events with some regularity. Processes 
that are compute bound experience quantum end as a result of using a certain 
amount of CPU time. Processes that are I/O bound experience quantum end 
as a result of performing a reasonable number of I/O requests. This scheme 
guarantees that processes that spend most of their time in some wait state 
can also accomplish useful work before they are outswapped. 

The routine SCH$QEND is executed at the end of every quantum, regard- 
less of the software priority of the current process. For real-time processes, 
however, the only action performed is to reset the process header quantum 
field to the full quantum value and to clear the initial quantum bit in the PCB 
status vector (bit PCB$V_INQUAN in the field PCB$L_STS, pictured in Fig- 
ure 10-1). The cleared initial quantum bit makes a process more likely to be 
outswapped, if process swap mode has not been disabled. 

The following notes relate to the numbers at the bottom of Figure 10-2:

(1) Process C becomes computable. Process A is preempted.

(2) C hibernates. A executes again, one priority level lower.

(3) A experiences quantum end and is rescheduled at its base priority. B is
computable outswapped.

(4) The Swapper process executes to inswap B. B is scheduled for execution.

(5) B is preempted by C.

(6) B executes again, one priority level lower.

(7) B requests an I/O operation (not terminal I/O). A executes at its base
priority.

(8) A requests a terminal output operation. The Null process executes.

(9) A executes following I/O completion at its base priority + 3. (The applied
boost was 4.)

(10) A is preempted by C.

(11) A executes again, one priority level lower.

(12) A experiences quantum end and is rescheduled at one priority level
lower.

(13) A is preempted by B. A priority boost of 2 is not applied to B because the
result would be less than the current priority.

(14) B is preempted by C.

(15) B executes again, one priority level lower.

(16) B requests an I/O operation. A executes at its base priority.

(17) A experiences quantum end and is rescheduled at the same priority (its
base priority).

(18) A is preempted by C.

For normal processes, however, the occurrence of quantum expiration in- 
volves several different operations. 

1. As with real-time processes, normal processes have the process header 
quantum field reset and the initial quantum bit cleared. 

2. If there are any inswap candidates (SCH$GL_COMOQS is nonzero, indi- 
cating at least one nonempty COMO state queue), the current priority of 
the process is set to its base priority. (If SCH$GL_COMOQS contains a 
zero, the priority is left alone.) 

3. Routine SCH$SWPWAKE is called to determine whether swapper activity 
is required. The swapper process is awakened if any of the following are 
true: 

• There is at least one computable outswapped process. 

• Modified page writing is required as indicated by the upper and lower 
limit thresholds for the free and modified page lists. 

• There is at least one process header of a deleted process still in the 
balance slots. 

• A powerfail recovery has just occurred. 

These checks avoid needless awakening of the swapper, with the associ- 
ated context switch overhead, only to determine that the swapper has no 
useful work to do. 

The swapper process does not execute immediately but must be sched- 
uled for execution. As a computable (after waking), resident, real-time
process of software priority 16, the swapper is very likely to be the next
process scheduled. 

4. The CPU limit field of the process header is next checked to determine if 
a CPU limit has been imposed and if that limit has expired. If the CPU 
limit has expired, each access mode will have an interval of time to clean 
up or run down before the image exits and the process is deleted. The size 
of the warning interval given to each access mode is defined by the 
SYSBOOT parameter EXTRACPU. (This parameter has a default value of 
one second.) 

5. If no CPU limit expiration has occurred, then the automatic working set 
adjustment calculations take place if they are enabled. The size of the 
process working set may be expanded or contracted by amounts specified 
by the SYSBOOT parameters WSINC or WSDEC. Five SYSBOOT parame- 
ters determine threshold values to be applied to the automatic adjust- 
ments: 

• For a new adjustment to take place, this process must have accumu- 
lated AWSTIME units of CPU time (each clock tick accounts for 10 
milliseconds) since the last test for adjustment. 

• The page fault rate must be larger than PFRATH faults per 10 seconds or 
less than PFRATL faults per 10 seconds. 

• The working set cannot be contracted through automatic working set 
adjustment below AWSMIN nor expanded above a process-specific maxi-
mum number of pages (see the next item). 

• If there are more than BORROWLIM free pages, the working set list can 
grow up to WSEXTENT. If there are fewer than BORROWLIM free
pages, the working set list can only grow to WSQUOTA. Note that this 
growth affects the working set list, not the actual working set size. 
Pages can be added to the extended working set list when a page fault 
occurs and there are more than GROWLIM pages on the free page list. 

There are two possible courses of action that will disable automatic 
working set adjustment, and a third method is available to keep working 
set size less than or equal to WSQUOTA (disable borrowing) on a per-proc- 
ess basis: 

• Use the DCL command SET WORKING_SET/NOADJUST to disable it
on a per-process basis. 

• Set the SYSBOOT parameter WSINC to zero to disable it on a system- 
wide basis. 

• Set WSEXTENT equal to WSQUOTA, or set BORROWLIM to -1, to 
disable borrowing on a per-process basis. 

Automatic working set adjustment is discussed from the memory man- 
agement point of view in Section 16.4.1.3. 




6. Finally, a scheduling interrupt at IPL 3 will be requested to remove the 
current process from execution and schedule the highest priority, mem- 
ory-resident, computable process for execution. Note that on a quiet sys- 
tem, the currently executing process may be selected for execution again. 



10.1.3 State Queues 

With the exception of the single process executing at a given moment, all 
processes in the system are in a process wait state, the computable resident 
state, or the computable outswapped state. The process state is indicated by 
the PCB$W_STATE field and the linking of the process control block into a 
queue of similar PCBs. The listheads for all wait queues, computable resident 
(COM) queues, and computable outswapped (COMO) queues, as well as the 
pointer to the PCB of the current (CUR) process, are defined in the module 
SDAT. 

10.1.3.1 Computable States. Processes in the computable or executable state are not 
waiting for events or resources, other than acquiring control of the CPU for 
execution. Computable resident (COM) processes are placed in one of 32 pri- 
ority queues, with the queue chosen by the internal value for the current 
software priority of the process (see Figure 10-3). There is a similar set of 32 
quadword listheads for the computable outswapped (COMO) state. Processes 
in the computable outswapped state are waiting for the swapper process to 
bring them into memory. As computable resident processes, they can then be 
scheduled for execution. Processes must be in the computable resident state 
to be considered for scheduling. Processes are created in the computable out- 
swapped (COMO) state. Deletion of processes occurs from the current (CUR) 
state. 
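The queue bit maps pictured in Figure 10-3 let the scheduler locate the
highest priority computable resident process with a find-first-set
instruction, as in this sketch. IDLE is a hypothetical label, and the actual
scheduler code in module SCHED differs in detail.

        FFS     #0,#32,SCH$GL_COMQS,R2  ;Find first set bit; with
                                        ; inverted priorities, bit 0
                                        ; is the highest priority
        BEQL    IDLE                    ;No computable resident process
        MOVAQ   W^SCH$AQ_COMH[R2],R3    ;Listhead of that priority queue
        REMQUE  @(R3),R4                ;Remove the PCB at the head;
                                        ; R4 now points to it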

10.1.3.2 Wait States. The listheads for the process control block queues corresponding 
to all process wait states except the common event flag wait state (CEF) look 
like Figure 10-4. (Common event flag wait queues are described in Chapter 
12.) The first two longwords are the longword links to the PCBs in this queue. 
The STATE field of the queue header contains the numerical value corre- 
sponding to the process state. All PCBs in a state queue have PCB$W_ STATE 
values identical to the STATE value of the wait state queue header. Recog- 
nized STATE values and the corresponding state names are summarized in 
Table 10-1. The COUNT field of the wait state queue header is simply the 
number of process control blocks currently in this state and queue. 

10.1.3.2.1 Voluntary Wait States. There are two process states associated with local 
event flag waits. Resident processes waiting for local event flags are placed 
into the LEF state, while outswapped processes occupy the LEFO state.

Figure 10-3
Computable (Executable) State Queues
(For the COM state there are 32 quadword queue headers, one per software
priority from 31 down to 0, beginning at SCH$AQ_COMH and ending at
SCH$AQ_COMT, together with a longword queue bit map at
SCH$GL_COMQS in which a clear bit implies an empty queue. A parallel
set of 32 queue headers at SCH$AQ_COMOH and SCH$AQ_COMOT and
a bit map at SCH$GL_COMOQS serve the COMO state.)

There
are separate queues maintained for these states, and an LEF state process 
being outswapped must be removed from the LEF queue and placed into the 
LEFO state queue. Processes enter the LEF state as a result of issuing 
$WAITFR, $WFLOR, and $WFLAND system services directly or indirectly 
(for example, with a $QIOW or $ENQW system service call, issued either by 
the user or on his behalf by some system component such as RMS). Removal 
from the LEF or LEFO states to the computable (COM) or computable 
outswapped (COMO) states can occur as a result of matching the event flag 
wait mask, enqueuing an asynchronous system trap (AST), or process dele- 
tion. 

Similarly, there are separate resident and outswapped states and queues for 
hibernating and suspended processes. The Hibernate and Suspend system 
services cause processes to enter the resident wait states.

   +---------------------------------------+
   |        Wait Queue Forward Link        |
   +---------------------------------------+
   |        Wait Queue Backward Link       |
   +-------------------+-------------------+
   |       State       |       Count       |
   +-------------------+-------------------+

Figure 10-4
Format of Wait State Queue Headers

Hibernating processes can leave the HIB and HIBO states and enter the COM and COMO
states as a result of $WAKE system services, AST enqueuing, or process dele- 
tion. Suspended processes are sensitive only to $RESUME system services
and process deletion (because ASTs cannot be delivered to processes while 
they are suspended). The transitions between states are diagrammed in Figure 
10-5. 

10.1.3.2.2 Memory Management Wait States. Three process wait states are associated
with memory management. Each state is represented by a single queue and 
listhead of the form shown in Figure 10-4. Differentiation of resident and 
outswapped processes in these states is accomplished only by means of the 
PCB$V_RES bit of the PCB$L_STS field. The outswapping of processes in 
these states does not involve removal from and insertion into queues. The 
PCB$V_RES bit is simply cleared in the process control block. (Memory 
management wait states are discussed from another point of view in Chap- 
ter 15.) 

The page fault wait state (PFW) is entered when a process refers to a page 
that is not in physical memory. While the page read is in progress, the process 
is placed into the PFW state. Completion of the page read, AST enqueuing, or 
process deletion can cause the process to become computable (COM) or com- 
putable outswapped (COMO), depending upon its PCB$V_RES bit value 
when the satisfying condition occurs. 

The free page wait state (FPG) is entered when a process requests a page to 
be added to its working set, but there are no free pages to be allocated from 
the free page list. This state is essentially a resource wait until the supply of 
free pages is replenished through modified page writing, process outswap- 
ping, or virtual address space deletion. 

The collided page wait state (COLPG) usually occurs when several proc- 
esses cause page faults on the same shared page at the same time. The initial 
faulting process enters the PFW state, while the second and succeeding
processes enter the COLPG state.

Figure 10-5
State Transition Diagram
(The diagram shows the transitions among the process states listed in
Table 10-1. AST denotes AST enqueuing and DEL denotes process
deletion; outswap transitions connect each resident state with its
outswapped counterpart. Transitions from memory-resident wait states to
COM are not labeled to avoid cluttering the figure; they are caused by the
same events shown for transitions to the COMO state. One symbol
represents a process state with a single queue, another a process state with
multiple queues.)

The COLPG state can also be entered when a
process refers to a private page that is already in transition from the disk. All 
COLPG processes are made computable or computable outswapped when the 
read operation completes. (A more detailed discussion of collided pages is 
contained in Chapter 15.) 

10.1.3.2.3 Miscellaneous Wait State (MWAIT). The miscellaneous wait state (MWAIT) 
is used to indicate processes waiting for resources not managed by any of the 
other process wait states. There is a single MWAIT queue for memory-resi- 
dent and outswapped processes. Table 10-2 lists the resources associated with 
the two forms of the MWAIT state. 

The miscellaneous resource wait state is used to wait for the availability of 
a depleted or locked resource. A process may enter a resource wait if the 
resource requested has already been allocated. Common examples are the 
depletion of nonpaged dynamic memory or no room in mailboxes. The proc- 
ess will become computable when the resource becomes available again. The 
number of the resource (a small integer defined by the $RSNDEF macro) is 
stored in the PCB$L_EFWM field (see Table 10-2), and the PCB$W_STATE is 
changed to MWAIT to indicate a miscellaneous resource wait. Whether a 
process can be made executable by the enqueuing of an AST to the process is 
dependent upon the interrupt priority level of the caller of the routine declar- 
ing the resource wait. If the IPL in the saved PSL in the hardware process 
control block is two or larger, the process will reexecute the resource wait 
code and be placed back into the MWAIT state immediately. If the saved IPL 
is smaller than two, an AST delivery interrupt will occur, resulting in the 
execution of the previously enqueued AST. 

The Set Resource Wait Mode system service ($SETRWM) can force the 
immediate return of an error status code rather than placing the process in 
the MWAIT state. $SETRWM does this by setting the PCB$V_SSRWAIT bit 
of the PCB$L_STS field. Disabling resource waits affects many directly re- 
quested operations (such as I/O requests or timer requests) but has no effect 
on allocation requests by the system on behalf of the user. An example of this 
situation is the pager requiring an I/O request packet to perform a page read 
operation. If nonpaged dynamic memory is depleted, the process will enter 
the MWAIT state, even if $SETRWM had been used to disable resource waits. 
The reason for this distinction is that a process can respond to a depleted 
resource error from a system service call or an RMS request but has no means 
of reacting to a similar error in the event of an unexpected event such as a 
page fault. 

System routines that access data structures protected by mutexes will 
place a process in the MWAIT state if the requested mutex ownership cannot 
be granted (see Chapter 2). Thus, the mutex wait state indicates a locked 
resource and not necessarily a depleted one.

Table 10-2: Types of MWAIT State

Mutex Waits                                       Contents of PCB$L_EFWM (1)
Reason for Wait                                   Symbolic          Numeric (hex)
System Logical Name Table                         LOG$AL_MUTEX      80002750
Group Logical Name Table                          LOG$AL_MUTEX+4    80002754
I/O Database                                      IOC$GL_MUTEX      800028C0
Common Event Block List                           EXE$GL_CEBMTX     800028C4
Paged Dynamic Memory                              EXE$GL_PGDYNMTX   800028C8
Global Section Descriptor List                    EXE$GL_GSDMTX     800028CC
Shared Memory Global Section Descriptor Table     EXE$GL_SHMGSMTX   800028D0
Shared Memory Mailboxes                           EXE$GL_SHMMBMTX   800028D4
(Not used)                                        EXE$GL_ENQMTX     800028D8
Known File Entry Table                            EXE$GL_KFIMTX     800028DC
Line Printer Unit Control Block (2)               UCB$L_LP_MUTEX    (Note 2)

Resource Waits
Reason for Wait                                   Symbolic          Numeric (hex)
AST Wait (Wait for system or special kernel AST)  RSN$_ASTWAIT      00000001
Mailbox Full                                      RSN$_MAILBOX      00000002
Nonpaged Dynamic Memory                           RSN$_NPDYNMEM     00000003
Page File Full                                    RSN$_PGFILE       00000004
Paged Dynamic Memory                              RSN$_PGDYNMEM     00000005
Breakthrough (Wait for broadcast message)         RSN$_BRKTHRU      00000006
Image Activation Lock                             RSN$_IACLOCK      00000007
Job Pooled Quota (Not currently used)             RSN$_JQUOTA       00000008
Lock Identification Database                      RSN$_LOCKID       00000009
Swap File Space                                   RSN$_SWPFILE      0000000A
Modified Page List Empty                          RSN$_MPLEMPTY     0000000B
Modified Page Writer Busy                         RSN$_MPWBUSY      0000000C

(1) The symbolic contents of PCB$L_EFWM will probably remain the same
from release to release. The numeric contents for mutex waits are almost
certain to change with each major release of the operating system.

(2) The mutex associated with each line printer unit does not have a fixed
address like the other mutexes. Its value depends on where the UCB for that
unit is allocated.

The logical name system services operating on the system and group logical
name tables are one example
of this type of operation. When the owner of the requested mutex releases it, 
the requesting process becomes resident computable (COM), or computable 
outswapped (COMO) if it has been outswapped, and requests ownership of 
the mutex again. AST enqueuing cannot make a mutex-waiting process com- 
putable for long because the IPL in the stored PSL is IPL$_ASTDEL (IPL 2), 
blocking the AST delivery interrupt. 

The mutex wait state is distinguished from the resource wait state by stor- 
ing the system virtual address of the requested mutex in the PCB$L_EFWM 
field. (When treated as a signed integer, the contents of this field are positive 
and small when the process is waiting for a resource. When the process is
waiting for a mutex, the contents are negative, as listed in Table 10-2.) For
example, if a process wishes to allocate a block of paged dynamic memory, it 
must first acquire the paged pool mutex to allow it to search the linked list of 
available blocks (see Chapter 3). If another process is already looking at paged 
pool, this process is put into a mutex wait state (with 800028C8, the address
of the paged pool mutex, stored in PCB$L_EFWM). Once the mutex is availa- 
ble and then owned by this process, paged pool is searched for a block of the 
requested size. If there is no block large enough to satisfy the allocation re- 
quest, the process is placed into a resource wait state (with 00000005, the 
value of RSN$_PGDYNMEM, stored in PCB$L_EFWM). The process re- 
mains in this state until a block of paged pool is deallocated. 
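The two interpretations of PCB$L_EFWM can therefore be separated by a
simple sign test, as in this illustrative sketch (R4 is assumed to contain the
PCB address):

        MOVL    PCB$L_EFWM(R4),R0       ;Get the wait reason
        BLSS    10$                     ;Negative: a system virtual
                                        ; address, hence a mutex wait
                                        ;Falls through: a small positive
                                        ; RSN$_ resource number
        ...
10$:                                    ;R0 contains the mutex address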

10.1.3.3 Common Event Blocks. Processes waiting for one or more common event 
flags are enqueued to wait queues in data structures called common event 
blocks (CEBs). These data structures are allocated from nonpaged dynamic 
memory when processes create common event flag clusters. The contents of 
a CEB include three longwords that exactly correspond to a wait state queue 
header (see Figure 10-4). The entire format of the common event block is 
shown in Chapter 12. 

The number of CEF state queues depends upon the number of common 
event flag clusters that exist on a particular system at any given time. (Addi- 
tional processes associating with existing common event flag clusters do not 
create further CEBs or CEF queues.) Outswapped processes waiting for com- 
mon event flags are differentiated from similar memory resident processes by 
the PCB$V_RES bit of the PCB$L_STS field only. In addition to satisfying
the event flag wait mask, the system can also make a CEF process computa- 
ble by AST enqueueing or process deletion. 



10.2 SYSTEM EVENTS 

System events are occurrences of operations that change the states of proc- 
esses. A system event may make a process computable, memory resident, or 
outswapped. System events provide the transitions among the process states 
diagrammed in Figure 10-5. 

A process initially enters a wait state from the current state (CUR). That is, 
a process either directly or indirectly executes a request for a system opera- 
tion for which it must wait. Direct requests such as $QIOW, $HIBER, 
$SUSPND, and $WAITFR place the process in the voluntary wait states LEF, 
CEF, HIB, and SUSP. Subsequent outswapping (from the process viewpoint an 
unrequested system operation) may move a process to the LEFO, HIBO, or 
SUSPO states. 




10.2.1 Process State Changes 

Indirect wait requests occur as a result of paging or contention for sys- 
tem resources. A process does not request PFW, FPG, COLPG, or MWAIT transi-
tions. Rather, the transitions to these wait states occur because direct service 
requests to the system cannot be completed or satisfied at the moment. 
A process can become computable for a variety of reasons. The availability 
of a requested resource or the satisfaction of a wait condition (such as an 
event flag setting or a $WAKE system service call) will make the process 
computable. In all process states except SUSP and SUSPO, the enqueuing of 
an AST will make a process computable even if the wait condition is not 
satisfied. (Because processes are usually put into the MWAIT state at IPL 2, 
the AST is not able to be delivered until the miscellaneous wait is satisfied! 
Thus, the typical process in an MWAIT state will not become comrmtable for 
long, due to the enqueuing of an AST. In particular, processes waiting for 
resources or mutexes typically cannot be deleted.) Process deletion, imple- 
mented with a special kernel mode AST, will make all processes that are 
being deleted computable (including processes in the SUSP or SUSPO states) 
because the target process is resumed before the AST is queued. 

Exchanges of processes between the current executing state (CUR) and the 
computable, memory-resident state (COM) are performed by the scheduler 
routine (see Section 10.3). The movement of a process into and out of the 
balance set is the responsibility of the swapper process (see Chapter 17). 

10.2.2 Wait States and AST Delivery 

One of the responsibilities of the routines that place processes into wait 
states is to insure that these processes will correctly enter their appropriate 
wait states after successful delivery of an AST. There are three different tech- 
niques used, depending on the particular wait state being entered. 

10.2.2.1 System Service Wait States. In the case where a process is entering a wait 
state as a result of executing a system service (HIB, LEF, or CEF), the wait 
routine is entered with the PC and PSL of the system service CHMK
exception (see Chapter 9) on the top of the stack. The first implication of this 
arrangement is that the process will wait in the access mode in which the 
system service was issued. Because ASTs are enqueued and delivered based 
on access mode, a supervisor mode AST can be delivered to a process waiting 
on an event flag as a result of a $QIOW call issued from user or supervisor 
mode. 

In addition, the wait code backs up the saved PC by four so that it points to 
the CHMx instruction in the system service vector (see the code examples in 
Section 9.1). If a process receives an AST while in such a wait state, the AST 
is delivered and executes. When the AST delivery routine releases its inter-
rupt through an REI instruction, the system service executes again, typically
placing the process right back into the wait state it was in before the AST was 
delivered. 
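The PC adjustment amounts to a single instruction in the wait routine,
sketched here under the assumption that the PC saved by the CHMx
exception is on top of the kernel stack at that point:

        SUBL2   #4,(SP)                 ;Back the saved PC up over the
                                        ; four-byte CHMx instruction
                                        ; in the service vector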

10.2.2.2 Memory Management Wait States. The page fault handler (see Chapter 15) is 
solely responsible for placing processes into the three wait states associated 
with memory management. This routine places a process into a wait state 
with the PC and PSL associated with the page fault as the saved process 
context. Once again, because the PSL reflects the access mode in which the 
fault occurred, ASTs can be delivered for that and all inner access modes. 
(Note that this routine does not need to change the PC that it finds on the 
stack because page fault exceptions are faults and not traps. Faults, discussed 
in full in Chapter 4, cause the PC of the faulting instruction and not the PC of 
the next instruction to be pushed onto the exception stack.) 

If an AST is delivered to and executes in such a process, the process will 
execute the faulting instruction again. If the reason for the fault has been 
removed (a free page became available or the page read completed) while the 
AST was being delivered or was executing, the process will simply continue 
with its execution. If, on the other hand, the situation that caused the process 
to wait still exists, the process will reincur the page fault and be placed back 
into one of the memory management wait states. (Note that a process that 
was initially in a PFW state would be placed into a COLPG state by such a 
sequence of events.) 

10.2.2.3 Special Cases. The two remaining wait states (SUSP and MWAIT) are handled 
in a special way by the wait routine. A process suspension occurs as a result 
of executing a special kernel AST. ASTs cannot be delivered to suspended 
processes. That is, an AST queued to a suspended process has its AST control 
block inserted into the AST queue in the software PCB. However, the AST 
event is ignored by the scheduler. (In fact, while a process is suspended, the 
saved PC is an address in the special kernel AST that caused the process to 
enter the suspend state. The saved PSL indicates kernel mode and IPL 2.) 

When a process is placed into a wait state waiting for a mutex (see Chapter 
2), its saved PC is either SCH$LOCKR or SCH$LOCKW, depending on 
whether it is attempting to lock the mutex for read access or write access. 
The saved PSL indicates kernel mode and IPL 2, which implies that processes 
in an MWAIT state waiting for a mutex cannot receive ASTs. 

A process can also be placed into an MWAIT state while waiting for an 
arbitrary system resource. In this case, the caller of SCH$RWAIT controls the 
PC and PSL that are saved when the process is placed into the MWAIT state. 
In particular, the current access mode and IPL in the saved PSL determine 
whether any ASTs can be delivered to a process that is waiting for a resource. 




10.2.3 Event Reporting 

Events are reported to the scheduler from many system routines through the 
RPTEVT macro, which generates the following code: 

BSBW    SCH$RSE

.BYTE   EVT$_event-name

The byte value stored depends upon the event being declared by the system 
routine. The address of the value will be pushed on to the stack by the BSBW 
instruction. Additional parameters (priority increment class and PCB address 
of the affected process) are passed in registers. 
The routine SCH$RSE (in module RSE) performs the following operations: 

1. The event number is loaded into a register and the return PC value (on the 
stack as a result of the BSBW instruction) is adjusted to point to the ad- 
dress after the stored byte event value. 

2. The state and the event are checked for a significant transition. Each event 
(or state transition) has a bit mask defining which states this event can 
affect. The state of the process is obtained from the PCB$W_ STATE field. 

• For example, a wake event is only significant for processes that are 
hibernating (HIB or HIBO states). 

• An outswap event is only significant for the four states (COM, HIB, LEF, 
and SUSP) where a wait queue change is required. 

• The enqueuing of an AST is significant to some process states. If the 
process is in a SUSP or SUSPO, COM or COMO, or CUR state, the 
enqueuing of an AST is ignored by SCH$RSE. If the event is not signifi- 
cant for the current process state, the event is ignored (and SCH$RSE 
simply issues an RSB). 

3. For significant events, one of the following actions is taken: 

• An outswap event producing an LEF to LEFO, HIB to HIBO, or SUSP to 
SUSPO transition simply removes the PCB of the process from the resi- 
dent wait queue and inserts it in the corresponding outswapped wait 
queue. The corresponding wait queue header count fields and the proc- 
ess state (PCB$W_STATE) are also adjusted. 

• An outswap event producing a COM to COMO transition removes the 
PCB from the COM priority queue corresponding to PCB$B_PRI and 
inserts it into the corresponding COMO priority queue. The value in 
PCB$W_STATE is changed to the value SCH$C_COMO. The 
SCH$GL_COMQS status bit vector is also modified if the COM queue 
is now empty. The appropriate SCH$GL_COMOQS bit is uncondition- 
ally set. 

• For transitions from the LEF (implied resident) or CEF resident state to 
the COM state, the saved PC in the hardware PCB stored in the process 




header is incremented by four to point past the CHMx instruction. Sav- 
ing the PC value allows the process to begin execution immediately 
following the system service call rather than going through a Wait for 
Event Flag system service for a flag that is already set. The residence 
check is necessary because the saved PC of nonresident processes is 
usually not available. (The saved PC is stored in the hardware PCB in 
the process header, which may be outswapped if the process is not resi- 
dent.) 
• For the remaining transitions (all of which make a process computable), 
the process is removed from the wait queue and the wait queue header 
count is decremented. The PCB is inserted into a COM or COMO state 
queue depending upon whether the process is memory resident or 
outswapped, and the state field in the PCB is altered. The particular 
priority queue of the COM or COMO state is selected for insertion after 
a priority adjustment is attempted (see the following section). The 
SCH$GL_COMQS or SCH$GL_COMOQS summary bit correspond- 
ing to the selected priority queue is unconditionally set. 

Subsequent scheduling or swapping activity is necessary to execute or 
inswap the now computable process. The swapper is awakened (routine 
SCH$SWPWAKE is called) if the now computable process is presently out- 
swapped (see Section 10.1.2.4, item 3). 

The scheduler is requested, through an IPL 3 software interrupt, if the 
now computable process is memory resident and has a priority greater 
than or equal to that of the currently executing process. This priority 
check avoids needless context switches with their associated overhead, 
only to have the previously executing process again execute. 
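
The decision just described can be condensed into a small function (a C
sketch; the names are illustrative, not VMS symbols, and priorities are
taken in external form, where a higher number is more urgent):

    /* Follow-up action once SCH$RSE has made a process computable:
       outswapped processes need the swapper first; resident processes
       trigger the IPL 3 rescheduling interrupt only when they could
       preempt the current process. */
    enum followup { FOLLOWUP_NONE, FOLLOWUP_WAKE_SWAPPER,
                    FOLLOWUP_RESCHEDULE };

    enum followup after_significant_event(int outswapped,
                                          int new_pri, int current_pri)
    {
        if (outswapped)
            return FOLLOWUP_WAKE_SWAPPER;   /* inswap needed first     */
        if (new_pri >= current_pri)
            return FOLLOWUP_RESCHEDULE;     /* request IPL 3 interrupt */
        return FOLLOWUP_NONE;               /* avoid a needless switch */
    }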



10.2.4 System Events and Associated Priority Boosts 

System routines that report events to the scheduler not only describe the 
event and the process that is responsible, but also specify one of five classes 
of priority increments or boosts that may be applied to the base priority of the 
process. Table 10-3 lists the events, the priority class, and the potential 
amount of priority increment applied to the process. The table does not show 
AST enqueuing because system routines enqueuing ASTs to a process can 
select any of the priority increment classes to be associated with the enqueu- 
ing of an AST. 
The actual software priority of the process is determined by the following
steps:

1. The priority increment for the event class (see Table 10-3) is added to the 
base priority of the process (PCB$B_PRIB). 






Table 10-3: System Events and Associated Priority Boosts

Event                                  Priority Class (1)  Priority Boost

Page Fault Read Complete               0 (PRI$_NULL)       0
Quantum End                            0                   0
Other Events with No Boost             0                   0
Direct I/O Completion                  1 (PRI$_IOCOM)      2
Nonterminal Buffered I/O Completion    1                   2
Update Section Write Completion        1                   2
Set Priority                           1                   2
Resource Available                     2 (PRI$_RESAVL)     3
Wake a Process                         2                   3
Resume a Process                       2                   3
Delete a Process                       2                   3
Timer Request Expiration               2 (PRI$_TIMER)      3
Terminal Output Completion             3 (PRI$_TOCOM)      4
Terminal Input Completion              4 (PRI$_TICOM)      6
Process Creation                       4                   6

(1) Routines that report system events pass an increment class to the sched-
uler. The scheduler uses this class as a byte index into a table of values
(local label B_PINC in module RSE) to compute the actual boost.

2. If the process has a current priority higher than the result of step one, the 
current priority will be retained (such as occurs in Figure 10-2, event 13). 

3. If the higher priority of steps one and two is above 15, then the base prior- 
ity of the process is used. (Note that this test accomplishes two checks at 
the same time. First, all real-time processes fit this criterion, with the 
result that real-time processes do not have their priorities adjusted in re- 
sponse to system events. Second, priority boosts cannot move a normal 
process into the real-time priority range.) 

A side effect of step three is that real-time processes always execute at 
their base priorities. Further, note that normal processes with base priori- 
ties from 10 to 15 will not always receive priority increments as events 
occur. As the base priority of a normal process is moved closer to 15, the 
process will spend a greater amount of time at its base priority. Priority 14 
and 15 processes experience no priority boosts. Thus, this strategy benefits 
those processes that most need it, I/O-bound and interactive processes 
with base priorities of 4 through 9. Processes with elevated base priorities 
do not require this assistance as they are always at these levels. 
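
Collected into a single function, the computation looks roughly as follows
(a C sketch; module RSE performs the equivalent in VAX MACRO using the
B_PINC table, and priorities here are in external form):

    /* The three-step priority computation described above. */
    int boosted_priority(int base_pri, int current_pri, int boost)
    {
        int p = base_pri + boost;    /* step 1: base plus increment   */
        if (current_pri > p)         /* step 2: never lower the       */
            p = current_pri;         /*         current priority      */
        if (p > 15)                  /* step 3: boosts never enter    */
            p = base_pri;            /*         the real-time range   */
        return p;
    }

For example, a base priority 4 process receiving a terminal input
completion (boost 6) moves to priority 10, while a base priority 18
real-time process falls through step three and remains at 18.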



10.3 RESCHEDULING INTERRUPT 

The IPL 3 interrupt service routine, SCHED, schedules processes for execu- 
tion. The actual work of the scheduler is performed at IPL$_SYNCH to block




concurrent access and modification of the scheduler's database by other sys- 
tem components. The principal purpose of this interrupt service routine is to 
remove the currently executing process by storing the contents of the process 
private processor (hardware) registers and replacing the register contents with 
those of the highest priority computable resident process. This operation, 
known as context switching, is accompanied by modifications to the affected 
processes in terms of process state, current priority, and state queue. 



10.3.1 Hardware Context 

The definition of a process from the viewpoint of the hardware is contained 
in the hardware context. This collection of data is the set of hardware proces- 
sor registers whose contents are unique to the process. These include the 
following categories of information: 

• The general purpose registers, R0 through R11, the argument pointer (AP),
the frame pointer (FP), and the program counter (PC). 

• The per-process access mode stack pointers for kernel, executive, supervi- 
sor, and user stacks. One of these four registers contains the current stack 
pointer for the process, as indicated by the current mode field in the saved 
PSL. 

• The processor status longword (PSL). 

• The AST level processor register (ASTLVL). 

• The process page table registers for the program and control regions (P0BR,
P0LR, P1BR, and P1LR).

With the exceptions of the ASTLVL register value and the contents of the 
memory management registers for the program and control regions, the cur- 
rent values for the various registers forming the hardware context of the cur- 
rent process are maintained only in the processor registers. When a process is 
not executing, the complete hardware context is contained in a portion of the 
process header called the hardware process control block. 

The hardware process control block (see Figure 10-6) is a part of the fixed 
portion of the process header for each process. It is resident in memory when- 
ever the corresponding process is in the balance set. Access by the operating 
system occurs normally through offsets from the starting address of the par- 
ticular process header. However, during context switching operations, the 
hardware must access this data structure directly without address transla- 
tion. This access is accomplished by using the current value in the process 
control block base register (PR$_PCBB). This register contains the physical 
address of the hardware process control block for the currently executing 
process. The VMS operating system stores the physical address of the hard- 
ware process control block for each resident process (calculated when the 
process is swapped into memory) in the PCB$L_PHYPCB field of the corre- 
sponding software process control block (see Figure 10-1). 






[Figure 10-6, Hardware Process Control Block: the structure contains, in
order, the four per-process stack pointers (KSP, ESP, SSP, and USP); the
general registers R0 through R11, AP, FP, and PC; the PSL; P0BR; a longword
holding the ASTLVL field in bits <26:24> and P0LR in bits <21:0>; P1BR; and
a longword holding P1LR in bits <21:0>. The process control block base
register (:PR$_PCBB) contains the physical address of this structure for
the currently executing process.]
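
The same layout can be approximated with a C structure declaration (a
reading aid only; the authoritative layout is defined by the VAX
architecture, and the ASTLVL/P0LR packing is taken from the figure):

    #include <stdint.h>

    struct hardware_pcb {
        uint32_t ksp, esp, ssp, usp; /* per-process stack pointers    */
        uint32_t r[12];              /* R0 through R11                */
        uint32_t ap, fp, pc, psl;    /* AP, FP, saved PC, and PSL     */
        uint32_t p0br;               /* program region base register  */
        uint32_t astlvl_p0lr;        /* ASTLVL in <26:24>, P0LR below */
        uint32_t p1br;               /* control region base register  */
        uint32_t p1lr;               /* control region length register */
    };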



10.3.2 Removal of Current Process from Execution 

The entry point SCH$RESCHED in the module SCHED performs the opera- 
tions of rescheduling, preserving the hardware context of the currently exe- 
cuting process, and removing it from execution. Rescheduling is accom- 
plished by the following steps: 

1. The hardware context of the current process is saved by the SVPCTX in- 
struction. The destination of the data is the hardware process control 
block whose physical address is contained in the process control block 
base register, PR$_PCBB. Additional operations of the SVPCTX instruc- 
tion are described in Section 10.3.5.1. 

2. The address of the software process control block for the current process is 
obtained from the pointer SCH$GL_CURPCB in the module SDAT. (A 




single longword pointer is required for the current state (CUR), rather than 
a quadword listhead, because there is only one current process and not a 
queue of several such processes.) 

3. The current priority of the process is determined from the PCB$B_PRI 
field. The current priority is used to determine which of the resident com- 
putable state queues is to include this PCB. The process is inserted at the 
tail of the corresponding priority queue. 

4. The state of the process is changed to computable (COM) by updating the 
PCB$W_STATE field. 

At this point, there is no current process, and the search for the next proc- 
ess to execute begins. 
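
Steps 2 through 4 can be sketched in C as follows (queue and PCB shapes
are reduced to the fields involved, and SCH_C_COM stands in for the real
state value):

    #include <stdint.h>

    #define SCH_C_COM 4                /* illustrative state value    */

    struct pcb { struct pcb *fl, *bl; int state; unsigned pri; };

    /* VAX INSQUE-style insertion at the tail of a queue whose
       listhead is a self-referencing pair of links. */
    static void insque_tail(struct pcb *head, struct pcb *p)
    {
        p->fl = head;
        p->bl = head->bl;
        head->bl->fl = p;
        head->bl = p;
    }

    /* Requeue the already-saved current process at the tail of the
       COM queue for its current priority, mark that queue nonempty
       in the summary longword, and set the state to COM.  Priorities
       are in the inverted internal form, in which bit 0 of the
       summary longword corresponds to software priority 31. */
    void requeue_current(struct pcb *cur, struct pcb comqueue[32],
                         uint32_t *comqs)
    {
        *comqs |= 1u << cur->pri;
        cur->state = SCH_C_COM;
        insque_tail(&comqueue[cur->pri], cur);
    }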



10.3.3 Selection of Next Process for Execution 

The entry point SCH$SCHED begins the portion of code that searches for the 
next process to be scheduled for execution. Under some circumstances (such 
as system initialization, placing the previous process into a wait state, or 
deletion of the previous process) there may not be a current process to be 
saved by SCH$RESCHED. In these cases, system routines transfer control 
directly to SCH$SCHED for process selection. (The difference between the 
two entry points is determined by whether the previous process is still com- 
putable. Typically, a process entering a wait state will cause entry at 
SCH$SCHED, while a higher priority process becoming computable will 
cause entry, through a software interrupt, at SCH$RESCHED.) 

The SCH$RESCHED logic flows directly into SCH$SCHED. As with re- 
scheduling, the search for and modification of the next process to be executed 
must be performed at IPL$_SYNCH to block other potential system opera- 
tions on the scheduler database. 

The following operations are involved in selecting and executing the next 
process: 

1. The first software process control block (PCB) in the highest priority, non- 
empty, computable resident (COM) state queue is removed from the 
queue and pointed to by SCH$GL_CURPCB as the current process. Con- 
sistency checks are made to insure that the queue really had at least one 
PCB and that the data structure removed was actually a PCB. Failure of 
either of these tests results in a fatal bugcheck (BUG$_QUEUEMPTY). 

2. The state of the process is made current by inserting the appropriate value 
(SCH$C_CUR) into the PCB$W_STATE field. 

3. The current process priority is examined and potentially modified. If the 
process is a real-time process or if it is a normal process already at its base 
priority, then the process is scheduled at its current or base priority (they 
are the same). If the current process is a normal process above its base 






priority, then a decrease of one software priority level is performed before 
scheduling. Thus, priority "demotions" always occur before execution, 
and a process executes at the priority of the queue to which it will be 
returned (and not the priority of the queue from which it was removed). 
See Figure 10-2, event 2, for an example.

4. The physical address of the hardware process control block for the sched- 
uled process is loaded into the PR$_PCBB register from the software proc- 
ess control block PCB$L_PHYPCB field, and a load process context, 
LDPCTX, instruction is executed (see Section 10.3.5.2). 

5. Control is passed to the scheduled process by executing an REI instruc- 
tion. This transfer of control is possible because the LDPCTX instruction 
left the PC and PSL of the scheduled process on the kernel stack. When 
control is passed to the process through the REI instruction, the following 
operations are performed: 

• The interrupt priority level is dropped from IPL$_SYNCH.

• The access mode is typically changed from kernel to a less privileged 
one. 

• If ASTs are queued to the process control block, they are likely to be 
delivered at this time, depending on their access mode and the access 
mode at which the process is reentered (see Chapter 7). 



10.3.4 Summary Longword and Computable State Queues 

The search for the highest priority computable resident process and the re- 
moval of its PCB from the computable state (COM) queue is achieved in 
three instructions (see Figure 10-7). The efficiency of this operation is due to 
the instruction set and the design of the scheduler database for the computa- 
ble (COM) and computable outswapped (COMO) states (see Figure 10-3). 

(1) A find first set (FFS) instruction will locate the least significant set bit in
the longword SCH$GL_COMQS. The located bit position indicates the 
highest priority nonempty computable resident state queue. The 
swapper's search for the first PCB in the highest priority nonempty com- 
putable outswapped (COMO) queue uses the same operations (see Chap- 
ter 17). 

One reason for storing the software priority in inverted or 31-comple-
ment form is the following. By making bit 0 correspond to software prior-
ity 31, and so on, the highest priority queues will be scanned first. Con-
version in the various user interfaces occurs because systems and users
generally associate higher priority numbers with higher priority jobs,
tasks, or processes.
(2) The listhead of the selected computable resident queue is found by using 




the nonempty queue bit position as an index into the contiguous list- 
heads. 

(3) The first PCB in the selected queue is removed by indirect reference 
through the forward link of the listhead. 

If the removed PCB was the only one in the queue, the corresponding 
SCH$GL_COMQS bit must now be cleared because the queue is now 
empty. 
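
The same three-step sequence might be rendered in C as shown below (the
POSIX ffs routine plays the role of the VAX FFS instruction, and the
queue shape is reduced to its links):

    #include <stdint.h>
    #include <strings.h>                 /* ffs() */

    struct pcb { struct pcb *fl, *bl; };

    /* Find the lowest set bit of the summary longword (the highest
       priority, since priorities are stored in inverted form), index
       the corresponding listhead, and remove the PCB at the head of
       that queue, clearing the summary bit if the queue empties. */
    struct pcb *select_next(uint32_t *comqs, struct pcb comqueue[32])
    {
        int bit = ffs((int)*comqs);      /* 1-based; 0 if none set   */
        if (bit == 0)
            return 0;                    /* no computable process    */
        struct pcb *head = &comqueue[bit - 1];
        struct pcb *p = head->fl;        /* first PCB in the queue   */
        head->fl = p->fl;                /* REMQUE from the head     */
        p->fl->bl = head;
        if (head->fl == head)            /* queue now empty?         */
            *comqs &= ~(1u << (bit - 1));
        return p;
    }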



10.3.5 Hardware Assistance in Context Switching 

The VAX architecture was designed to assist the software in performing criti- 
cal, commonly performed operations. One example is the delivery of asyn- 
chronous system traps through the REI instruction (see Chapter 7). The 
mechanism of replacing the hardware context of the current process with the 
context of the highest priority resident process is another example of hard- 
ware assistance to the operating system. The switching of hardware context 
is performed by two special purpose instructions, SVPCTX and LDPCTX. 

10.3.5.1 SVPCTX Instruction. The save process context instruction, SVPCTX, per- 
forms several operations and assumes a special set of initial and final condi- 
tions. The following initial conditions are assumed: 

• The current access mode must be kernel. 

• The program counter (PC) and processor status longword (PSL) are on the 
current stack (either kernel or interrupt stack). If the SVPCTX instruction 
that executes is the one in the rescheduling interrupt service routine, both 
the PC and PSL are on the kernel stack as a result of the IPL 3 software 
interrupt. 

• The process control block base register (PR$_PCBB) contains the physical 
address of the hardware PCB for the current process. 

• The current values of ASTLVL, POBR, POLR, P1BR, and P1LR are already 
stored in the hardware PCB. 

When the SVPCTX instruction is executed, the following operations are 
performed by the VAX hardware: 

1. The per-process stack pointers for the four access mode stacks are moved 
to the hardware PCB. 

2. The general purpose registers, R0 through R11, the argument pointer (AP),
and the frame pointer (FP) are moved to the hardware PCB.

3. The program counter (PC) and the process status longword (PSL) are 
popped from the current stack and moved to the hardware PCB. 



        .SBTTL  SCH$RESCHED RESCHEDULING INTERRUPT HANDLER
;++
; SCH$RESCHED - RESCHEDULING INTERRUPT HANDLER
;
; THIS ROUTINE IS ENTERED VIA THE IPL 3 RESCHEDULING INTERRUPT.
; THE VECTOR FOR THIS INTERRUPT IS CODED TO CAUSE EXECUTION
; ON THE KERNEL STACK.
;
; ENVIRONMENT:
;       IPL=3  MODE=KERNEL  IS=0
; INPUT:
;       00(SP)=PC AT RESCHEDULE INTERRUPT
;       04(SP)=PSL AT INTERRUPT.
;--
        .ALIGN  LONG

MPH$RESCHED::                           ; MULTI-PROCESSING CODE HOOKS IN HERE
SCH$RESCHED::                           ; RESCHEDULE INTERRUPT HANDLER
        SETIPL  #IPL$_SYNCH             ; SYNCHRONIZE SCHEDULER WITH EVENT REPORTING
        SVPCTX                          ; SAVE CONTEXT OF PROCESS
        MOVL    W^SCH$GL_CURPCB,R1      ; GET ADDRESS OF CURRENT PCB
        MOVZBL  PCB$B_PRI(R1),R2        ; CURRENT PRIORITY
        BBSS    R2,W^SCH$GL_COMQS,10$   ; MARK QUEUE NON-EMPTY
10$:    MOVW    #SCH$C_COM,PCB$W_STATE(R1) ; SET STATE TO RES COMPUTE
        MOVAQ   W^SCH$AQ_COMT[R2],R3    ; COMPUTE ADDRESS OF QUEUE
        INSQUE  (R1),@(R3)+             ; INSERT AT TAIL OF QUEUE

; SCH$SCHED - SCHEDULE NEW PROCESS FOR EXECUTION
;
; THIS ROUTINE SELECTS THE HIGHEST PRIORITY EXECUTABLE PROCESS
; AND PLACES IT IN EXECUTION.

MPH$SCHED::                             ; MULTI-PROCESSING CODE HOOKS IN HERE
SCH$SCHED::                             ; SCHEDULE FOR EXECUTION
        SETIPL  #IPL$_SYNCH             ; SYNCHRONIZE SCHEDULER WITH EVENT REPORTING
        FFS     #0,#32,W^SCH$GL_COMQS,R2 ; FIND FIRST FULL STATE
        BEQL    SCH$IDLE                ; NO EXECUTABLE PROCESS??
        MOVAQ   W^SCH$AQ_COMH[R2],R3    ; COMPUTE QUEUE HEAD ADDRESS
        REMQUE  @(R3)+,R4               ; GET HEAD OF QUEUE
        BVS     QEMPTY                  ; BR IF QUEUE WAS EMPTY (BUG CHECK)
        BNEQ    50$                     ; QUEUE NOT EMPTY
        BBCC    R2,W^SCH$GL_COMQS,50$   ; SET QUEUE EMPTY
50$:    CMPB    #DYN$C_PCB,PCB$B_TYPE(R4) ; MUST BE A PROCESS CONTROL BLOCK
        BNEQ    QEMPTY                  ; OTHERWISE FATAL ERROR
        MOVW    #SCH$C_CUR,PCB$W_STATE(R4) ; SET STATE TO CURRENT
        MOVL    R4,W^SCH$GL_CURPCB      ; NOTE CURRENT PCB LOC
        CMPB    PCB$B_PRIB(R4),PCB$B_PRI(R4) ; CHECK FOR BASE PRIORITY=CURRENT
        BEQL    30$                     ; YES, DONT FLOAT PRIORITY
        BBC     #4,PCB$B_PRI(R4),30$    ; DONT FLOAT REAL TIME PRIORITY
        INCB    PCB$B_PRI(R4)           ; MOVE TOWARD BASE PRIO
30$:    MOVB    PCB$B_PRI(R4),W^SCH$GB_PRI ; SET GLOBAL PRIORITY
        MTPR    PCB$L_PHYPCB(R4),#PR$_PCBB ; SET PCB BASE PHYS ADDR
        LDPCTX                          ; RESTORE CONTEXT
        REI                             ; NORMAL RETURN

SCH$IDLE:                               ; NO ACTIVE, EXECUTABLE PROCESS
        SETIPL  #IPL$_SCHED             ; DROP IPL TO SCHEDULING LEVEL
        MOVB    #32,W^SCH$GB_PRI        ; SET PRIORITY TO -1 (32) TO SIGNAL IDLE
        BRB     SCH$SCHED               ; AND TRY AGAIN

QEMPTY: BUG_CHECK QUEUEMPTY,FATAL       ; SCHEDULING QUEUE EMPTY

        .END

Figure 10-7
Scheduler Routine That Selects Next Execution Candidate

Finally, if the current stack is the kernel stack, the SVPCTX instruction 
saves the current stack pointer (SP) in the kernel stack field of the hardware 
process control block and switches to the interrupt stack (by setting the 
PSL$V_IS bit and copying the PR$_ISP register contents into the SP register). 
Switching to the system-wide interrupt stack is essential because there is no 
current process once the instruction completes. 

The ASTLVL, P0BR, P0LR, P1BR, and P1LR fields of the hardware process
control block are not changed. It is the responsibility of the various system 
components that alter these fields to always update both the hardware proc- 
ess control block fields and the per-process processor registers. ASTLVL is 
unusual in that it can be altered even when the process is not current. In that 
case, only the hardware PCB field is altered. The processor register is not 
altered because the process does not own that register when it is not the 
current process. These fields do not change frequently compared to the fre- 
quency of context switching. The overhead of storing these fields in the hard- 
ware process control block is incurred only when the field values change. 
The SVPCTX instruction occurs in several locations in the executive: 

• The rescheduling interrupt service routine contains an instance of this 
instruction when the current process remains computable after it is re- 
moved from execution. 

• Module SYSWAIT contains another example of the instruction when the 
current process is being placed into a scheduling wait state. 

• The pager (module PAGEFAULT) issues a SVPCTX instruction directly 
when it places a process into one of the memory management wait states 
(PFW, FPG, COLPG). 

• One of the last steps of process deletion involves removing the process 
being deleted from execution with a SVPCTX instruction. 

10.3.5.2 LDPCTX Instruction. The load process context instruction, LDPCTX, per- 
forms the operations required in establishing the hardware context of the 
process. As with the SVPCTX instruction, assumptions are made about the 
initial and final conditions of the instruction. The following initial condi- 
tions are assumed: 

• The processor must be in kernel mode, using either the kernel or the inter- 
rupt stack. (The processor is always on the interrupt stack for the one 
occurrence of the LDPCTX instruction in the VMS executive.) 

• The process control block base register (PR$_PCBB) must contain the 
physical address of the hardware process control block to be used (from the 
PCB$L_PHYPCB field of the software process control block). 

When the LDPCTX instruction is executed, the following operations are 
performed by the VAX hardware: 




1. The per-process half of the translation buffer is invalidated. All of the 
previous translation buffer entries belonged to the previous process. They 
are invalidated to prevent mistranslation of virtual addresses and to pro- 
tect the data of the previous process. 

2. The per-process access mode stack pointers (KSP, ESP, SSP, and USP) are 
loaded from the hardware process control block. 

3. The general purpose registers, R0 through R11, the argument pointer (AP),
and the frame pointer (FP) are loaded into the corresponding processor
registers.

4. The memory management mapping registers (P0BR, P0LR, P1BR, and
P1LR) are checked for legal values and loaded from the hardware process
control block. Note that although the SVPCTX instruction does not save 
these registers, the LDPCTX must load them. Until they are loaded, the 
values in the registers belong to the previous process. 

5. The ASTLVL register is loaded. This register was also not saved by the 
SVPCTX instruction. 

6. If the instruction began execution using the interrupt stack, then the fol- 
lowing operations are performed: 

• The contents of the current stack pointer register (SP) are saved in the 
interrupt stack pointer register (ISP). 

• The PSL$V_IS bit is cleared to indicate the switch to the kernel stack. 

• The current stack pointer is updated with the contents of the kernel 
stack pointer register (KSP). 

7. Finally, the saved program counter (PC) and processor status longword 
(PSL) are pushed onto the kernel stack from the hardware process control 
block. These values are not stored into the appropriate registers. This par- 
ticular operation occurs because the next instruction (in the scheduler 
routine) is expected to be an REI instruction. The REI pops the two long- 
words, verifies the PSL format, and inserts the two longwords into the 
appropriate registers. 

The only occurrence of a LDPCTX instruction in the entire VMS system is 
the one shown in Figure 10-7, the second half of the rescheduling interrupt 
service routine. 






11 Timer Support 



Love, all alike, no season knows, nor clime, 

Nor hours, days, months, which are the rags of time. 

—John Donne, The Sun Rising 

Support for time-related activities that require either the time of day and date 
or the measurement of an interval of time is implemented both in the 
VAX-11 hardware and in the VAX/VMS operating system.

11.1 TIMEKEEPING IN THE VAX/VMS OPERATING SYSTEM 

Two hardware clocks are updated at regular intervals, the interval clock and 
the time-of-day clock. These clocks are used by the VMS system to manage 
two different times, the system time and the time since the system was last 
bootstrapped. Additionally, the software timer interrupt service routine pro- 
vides timer services, such as scheduled wakeups, by maintaining a time-or- 
dered queue of requests and delivering them as the expiration times occur. 

11.1.1 Hardware Clocks 

The hardware clocks are a set of processor registers that are used or updated 
regularly by timing circuitry. Initialization, calibration, and interpretation of 
the registers are performed by VMS routines during system initialization and 
normal operations. 

The processor registers that implement the hardware clocks are summa- 
rized in Table 11-1, along with the memory locations that implement the 
various software time values. 

11.1.1.1 Interval Clock. The interval clock is implemented as a set of three 32-bit 
processor registers. The clock "ticks" at one microsecond intervals with an 
accuracy of at least 0.01 percent (an error of less than nine seconds per day). 
The frequency at which the interval clock causes an interrupt is determined 
by the value in one of the processor registers, PR$_NICR. 
The three interval clock registers (see Table 11-1) are used as follows. 

1. The interval clock control/status register (PR$_ICCS) controls the inter- 
rupt status of the interval clock. This register is set by the CPU hardware 
and then reset by the hardware clock interrupt service routine (see Section 
11.2) each time the interval clock interrupts. 






Table 11-1: VAX/VMS Hardware Clocks and Software Timers

Name              Use              Size    Units            Frequency        Updated by
                                   (bits)
PR$_ICR           Interval clock   32      1 microsecond    1 microsecond    CPU hardware
PR$_NICR          Next interval    32      1 microsecond    (1)              System initialization
PR$_ICCS          Interval clock   32      control/status   10 milliseconds  Hardware clock interrupt
                  control/status           bits                              service routine
PR$_TODR          Time-of-day      32      10 milliseconds  10 milliseconds  CPU hardware,
                  clock                                                      $SETIME system service
EXE$GQ_SYSTIME    System time      64      100 nanoseconds  10 milliseconds  Hardware clock interrupt
                                                                             service routine,
                                                                             $SETIME system service
EXE$GL_ABSTIM     System absolute  32      1 second         1 second         System initialization,
                  time                                                       EXE$TIMEOUT repeating
                                                                             system subroutine
EXE$GL_TODR       Time-of-year     32      10 milliseconds  (2)              $SETIME system service
                  base value
EXE$GQ_TODCBASE   Time-of-year     64      100 nanoseconds  (2)              $SETIME system service
                  base value
                  (in system
                  time format)

(1) PR$_NICR is written only at system initialization time and after powerfail recovery.

(2) EXE$GL_TODR and EXE$GQ_TODCBASE are modified only when one of the following is true:

    • The time-of-day value is changed by a $SETIME system service request (either
      explicitly or as an integral part of the system bootstrap operation).

    • The PR$_TODR has been lost due to a prolonged power failure.



2. The next interval count register (PR$_NICR) defines how often the inter- 
val clock will cause a hardware interrupt. During system initialization, 
the routine INIT loads this processor register with a value of - 10000. This 
value defines the hardware clock interrupt interval to be 10 milliseconds 
(10000 microseconds). 

3. The interval count register (PR$_ICR) is incremented every microsecond 
from the PR$_NICR value toward zero. When PR$_ICR becomes zero, 
the register overflows, causing the following actions: 

a. The PR$_NICR value is copied into PR$_ICR to define the next inter- 
val. 

b. The PR$_ICCS register is set to indicate the overflow condition. This 
operation causes a hardware interrupt (IPL 24) to occur, serviced by the 
hardware clock interrupt service routine. 

The PR$_ICCS is reset by the hardware clock interrupt service routine 
to indicate servicing of the interrupt and reenabling of the hardware 
clock. 
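
The register interplay can be modeled in a few lines of C (a behavioral
sketch, not a hardware description; the interrupt flag below stands in
for the PR$_ICCS interrupt request bit):

    #include <stdint.h>

    struct interval_clock {
        int32_t nicr;       /* next interval count register         */
        int32_t icr;        /* interval count register              */
        int     interrupt;  /* stands in for the ICCS request bit   */
    };

    /* One microsecond of simulated time: ICR counts up toward zero;
       overflow reloads it from NICR and posts the IPL 24 interrupt. */
    void tick_one_microsecond(struct interval_clock *c)
    {
        if (++c->icr == 0) {
            c->icr = c->nicr;    /* start the next 10 ms interval   */
            c->interrupt = 1;    /* request the hardware clock interrupt */
        }
    }

With nicr and icr initialized to -10000, this posts one interrupt request
per 10000 calls, that is, one per 10 milliseconds of simulated time.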



11.1.1.2 Time-of-Day Clock. The time-of-day clock is a hardware component consist- 
ing of one 32-bit processor register and a battery backup supply for at least 
100 hours of operation (the battery backup is not a standard feature on the 
VAX-11/730). The time-of-day clock has an accuracy of at least 0.0025 per-
cent (an error of about 65 seconds per month) and a resolution of 10 millisec- 
onds. The base time for the time-of-day clock is 00:00:00.00 hours on Janu-
ary first of the current year. The time-of-day clock overflows after 497
days. 

Values in PR$_TODR are biased by 10000000 [hex]. Values smaller than 
this indicate loss of power or time-of-day overflow, conditions causing the 
system to prompt the operator to reset the time (through the $SETIME sys- 
tem service). 

The validity of the time-of-day clock is determined at system initialization 
time. If the contents of the time-of-day clock are valid, the initialization 
process, SYSINIT, will not prompt the operator for the time. If the contents of 
the time-of-day clock are not valid (the value is less than 10000000 [hex]), the 
value of the SYSBOOT parameter TIMEPROMPTWAIT determines the proc- 
essor action on recovery from a power failure (see Section 27.2.2). 

Because the time-of-day clock has a better accuracy than the interval 
clock, the time-of-day clock is used for recalibrating the system time 
(EXE$GQ_SYSTIME) at system initialization and at other times when the 
$SETIME system service is called (see Section 11.1.3). In addition, because 
the time-of-day clock has battery backup (except on the VAX-11/730), it is
used to reset the system time after a power failure or after the machine has 
been turned off. 




11.1.2 Software Time 

Software time is managed by VMS routines as a result of changes in the 
hardware clocks. The system time is defined by a quadword value measuring 
the number of 100-nanosecond intervals since 00:00 hours, November 17, 
1858 (the time base for the Smithsonian Institution astronomical calendar). 
EXE$GQ_SYSTIME (see Table 11-1) is updated every 10 milliseconds by the 
hardware clock interrupt service routine (see Section 11.2). This quadword is 
the reference for nearly all time-related software activities in the system. For 
example, the $GETTIM system service simply writes this quadword value 
into a user-defined buffer. 

EXE$GL_ABSTIM measures the number of one-second intervals that have 
elapsed since the system was last bootstrapped. This absolute time is used to 
periodically check for I/O device and lock request timeouts. The absolute 
time is also the value for "system uptime" interpreted and displayed by the 
DCL command SHOW SYSTEM. 

EXE$GL_TODR contains the base 32-bit time value. EXE$GQ_ 
TODCBASE contains the base quadword system time value. These base time 
values represent the more recent of the following times: 

• 00:00 hours on January 1 of the current year 

• The last time that the time-of-day was redefined by $SETIME 

PR$_TODR (and EXE$GL_TODR) are biased by a factor of 10000000 (hex). 
If a power failure occurs, the value in PR$_TODR will be zeroed and the 
clock will start to count from there. If the value in PR$_TODR is less than 
10000000 (hex), it can safely be assumed that a power failure has occurred.

Both the values in EXE$GQ_TODCBASE and EXE$GL_TODR are main- 
tained in the system image file as a semipermanent record of the base system 
time on which the contents of the time-of-year clock (PR$_TODR) are based. 
Both represent the same time (the last time they were adjusted), in different 
formats. EXE$GQ_TODCBASE represents the time of last adjustment in 
standard 64-bit time; EXE$GL_TODR represents the time of last adjustment
in the same 32-bit format as the time-of-year clock (PR$_TODR).
PR$_TODR cannot be set to zero (because of the 10000000 hex bias); rather,
it is initialized to the contents of EXE$GL_TODR.

When a new system time is specified, EXE$GQ_TODCBASE, 
EXE$GL_TODR, and PR$_TODR are modified, and the new base values are 
written to the system image file. When the system time (EXE$GQ_ 
SYSTIME) is recalibrated, the values are modified only when more than a 
year has passed since the last recalibration. 



11.1.3 Set Time System Service 

The $SETIME system service allows a system manager or operator to change 
the system time while the operating system is running. This may be neces- 




sary because of a power failure longer than the battery backup time of the 
time-of-day clock or because of changes between standard and daylight sav- 
ing time, for example. The new system time (absolute time, not relative 
time) is passed as the optional single argument of the system service. The 
$SETIME system service is also invoked during system initialization to reset 
the system time (and possibly the time-of-day clock). 

If the requesting process does not have the process privileges OPER and 
LOG_IO, the routine returns with an SS$_NOPRIV error status code. If the 
input quadword cannot be read, the routine returns with an SS$_ACCVIO 
error status code. 

11.1.3.1 $SETIME System Time Recalibration Requests. If no argument was passed to 
the system service or the time argument is a zero value, then the request is 
considered a request to recalibrate the system time (EXE$GQ_SYSTIME). 
The following actions take place. 

1. The new system time, EXE$GQ_SYSTIME, is computed by the following 
equation: 

EXE$GQ_SYSTIME = EXE$GQ_TODCBASE +
                 ((PR$_TODR - EXE$GL_TODR) x 100000)

EXE$GQ_SYSTIME and EXE$GQ_TODCBASE are quadword system 
times in units of 100 nanoseconds. PR$_TODR and EXE$GL_TODR are 
longword time-of-day times in units of 10 milliseconds. The multiplier of 
100000 is the number of 100-nanosecond intervals in 10 milliseconds. 

2. The values in PR$_TODR, EXE$GL_TODR, and EXE$GQ_TODCBASE 
are corrected if more than one year has passed since the system time was 
recalibrated (in order to prevent PR$_TODR from overflowing its 497-day 
limit). 

3. Each element in the timer queue (see Section 11.3.2) that specified a delta
time has its expiration time adjusted by the difference between the previ- 
ous system time and the new system time. This modification prevents the 
actual delta time value from being changed by a modification to system 
time. TQEs containing absolute times are not adjusted so that the TQE 
will come due at the time that was specified by the user. 

4. The entire collection of system parameters, including EXE$GQ_ 
TODCBASE and EXE$GL_TODR, is written back to the system image 
file. 

11.1.3.2 $SETIME Time-of-Day Readjustment Requests. If a nonzero time value is 
supplied as an argument to $SETIME, then the following operations occur. 

1. The input argument, specified in system time units of 100 nanoseconds, is 
converted into time-of-day units (the number of 10-millisecond intervals 
after 00:00 hours on January 1 of the base year). 




2. The converted specified time is written into PR$_TODR and 
EXE$GL_TODR. 

3. The unconverted specified time is written into EXE$GQ_TODCBASE and 
EXE$GQ_SYSTIME. 

4. Finally, the timer queue is updated and the new values for the time-of-day 
clock base are written to the system image file (along with the system 
parameters). (See steps 3 and 4 described above in Section 11.1.3.1). 



11.2 HARDWARE CLOCK INTERRUPT SERVICE ROUTINE

The hardware clock interrupt service routine, EXE$HWCLKINT in module 
TIMESCHDL, services the IPL 24 hardware interrupt signaled when the in- 
terval clock, PR$_ICR, reaches zero. The interval clock is set (through 
PR$_NICR) to interrupt every 10 milliseconds. 
The hardware clock interrupt service routine has two major functions. 

• Updating the system time (and possibly process accounting) 

• Checking the timer queue for timer events that have timed out 



11.2.1 System Time Updating 

The updating of the system time and the potential updating of process ac- 
counting fields requires several distinct actions. 

1. The PR$_ICCS register is reset to indicate the servicing of the interrupt 
and the reenabling of the hardware clock. 

2. The system time, EXE$GQ_SYSTIME, is updated by adding the equiva- 
lent of 10 milliseconds to the quadword value. 

3. If the hardware clock interrupts while a process is executing (the former 
current stack was not the interrupt stack), then the accumulated CPU 
utilization and quantum value are incremented in the process header. The 
quantum value is used to determine quantum end (see Section 11.3.1 and 
Chapter 10). If the quantum value reaches zero, an IPL 7 software inter- 
rupt, serviced by the software timer routine, is requested. The check for 
whether the interrupt occurred while already on the interrupt stack pre- 
vents a process from being charged for CPU time that the system was 
using to service interrupts. 



11.2.2 Timer Queue Testing

The timer queue is discussed with the software timer in the next section. 
The hardware clock interrupt service routine has the responsibility to deter- 
mine if the software timer must be requested to service the timer queue. If 
the first timer queue element has an expiration time less than or equal to the 






newly updated system time, then the timer event is due. The software timer 
routine is requested through the IPL 7 interrupt. 
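
The two functions together can be sketched in C (names and argument
shapes are invented for the sketch; the real routine operates on the
process header and the timer queue listhead):

    #include <stdint.h>

    struct tqe { struct tqe *tqfl; int64_t time; }; /* reduced TQE   */

    /* One clock tick: advance system time by 10 ms worth of 100-ns
       units, charge the interrupted process unless the CPU was on the
       interrupt stack, and report whether the IPL 7 software timer
       interrupt is needed for quantum end or for a due timer event.
       The quantum is kept as a negative count rising toward zero. */
    int hwclk_tick(int64_t *systime, int on_interrupt_stack,
                   int16_t *quantum, const struct tqe *tq_listhead)
    {
        int need_ipl7 = 0;

        *systime += 100000;
        if (!on_interrupt_stack && ++*quantum == 0)
            need_ipl7 = 1;                    /* quantum end         */
        if (tq_listhead->tqfl->time <= *systime)
            need_ipl7 = 1;                    /* first TQE is due    */
        return need_ipl7;
    }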



11.3 SOFTWARE TIMER INTERRUPT SERVICE ROUTINE

The software timer interrupt service routine, EXE$SWTIMINT in module 
TIMESCHDL, is invoked through the IPL 7 software interrupt. The software 
timer is requested because either the current process has reached quantum 
end or the first timer queue element must be serviced. 

11.3.1 Quantum Expiration 

The expiration of the quantum interval for the current process is determined 
by testing the PHD$W_QUANT field. This field is incremented by the hard- 
ware clock service routine. A zero quantum value indicates quantum expira- 
tion. The processing of the quantum end event is performed by the scheduler 
in routine SCH$QEND, which is described in Chapter 10. 

11.3.2 Timer Queue and Timer Queue Elements

If the system time, EXE$GQ_SYSTIME, is greater than or equal to the expi- 
ration time of the first element in the timer queue, then the timer event is 
due. The comparison with the system time must be performed at IPL 24 to 
block the hardware clock interrupt. 

If a timer request is due, then the TQE is removed from the timer queue, 
the IPL dropped back to IPL$_TIMER (IPL 7), and one of three sequences of 
code is performed (depending upon the type of request). 

Timer requests are maintained in a doubly linked list that is ordered by the 
expiration time of the requests. EXE$GL_TQFL and EXE$GL_TQBL are a 
pair of longwords (defined in the module SYSCOMMON) that form the list- 
head of the timer queue. Elements in the timer queue are data structures that 
are generally allocated from nonpaged dynamic memory and initialized as a 
result of $SETIMR system service calls (see Section 11.4.1). The allocation of 
timer queue elements (TQEs) is governed by the pooled job quota 
JIB$W_TQCNT. 

The format of the timer queue element is shown in Figure 11-1. The link 
fields (TQE$L_TQFL and TQE$L_TQBL), the TQE$W_SIZE field, and the 
TQE$B_TYPE field are characteristic of system data structures allocated 
from dynamic memory. The TQE$B_RQTYPE field defines the type of timer 
request (process timer request, periodic system routine request, or process 
wake request) and whether the request is a one-time or repeating request (see 
the list of TQE request types in Figure 11-1). Bit <6> of TQE$B_RMOD is 
set if an AST is to be delivered when the timer event occurs. This bit is 



218 



1 1.3 Software Timer Interrupt Service Routine 



[Figure 11-1, Layout of a Timer Queue Element: the TQE contains, in order,
the TQFL and TQBL queue links; the RQTYPE, TYPE, and SIZE fields; PID/FPC;
AST/FR3; ASTPRM/FR4; the quadword expiration TIME; the quadword repeat
interval DELTA; the EFN and RMOD fields; and RQPID.

TQE$B_RQTYPE request type values and flag bits:

    0  Process timer request
    1  System subroutine request
    2  Scheduled wake request

    0  One-time request
    1  Repeat request (not allowed for process timer requests)

    0  Relative time request
    1  Absolute time request]



equivalent to the ACB$V_QUOTA bit of the AST control block described in 
Chapter 7. 

The interpretation of the next three longword fields depends upon whether 
the request is from a system subroutine or a user process. For system subrou- 
tine requests, the fields contain the PC, R3, and R4 register values to be 
loaded before passing control to the subroutine. For process timer requests, 
the fields define the process ID of the process to report the event, the address 
of an AST routine to execute (if requested), and an optional AST parameter. 

TQE$Q_TIME is the quadword absolute system time at which a particular 
timer event is to occur. TQE$Q_DELTA is the quadword delta time for re- 




peating requests. The access mode of the requesting process is stored in 
TQE$B_RMOD. The event flag to set when the timer event occurs is defined 
by TQE$B_EFN. The TQE$L_RQPID contains the process ID of the process 
that made the initial timer request. (The requesting process is not necessarily 
the same as the target process.) 

If an AST is requested, the timer queue element will be reformatted into an 
AST control block (ACB) when the event occurs. 
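
Gathered into a C declaration, the fields discussed here and shown in
Figure 11-1 look roughly like this (a reading aid that follows the
discussion, not an authoritative definition; the unions reflect the dual
interpretation for process and system subroutine requests):

    #include <stdint.h>

    struct tqe {
        struct tqe *tqfl, *tqbl;   /* expiration-ordered queue links */
        uint8_t  rqtype;           /* request type and repeat bits   */
        uint8_t  type;             /* structure type code            */
        uint16_t size;             /* structure size                 */
        union { uint32_t pid;    uint32_t fpc; } u1; /* target PID or subroutine PC */
        union { uint32_t ast;    uint32_t fr3; } u2; /* AST address or R3 value     */
        union { uint32_t astprm; uint32_t fr4; } u3; /* AST parameter or R4 value   */
        int64_t  time;             /* absolute expiration time       */
        int64_t  delta;            /* repeat interval                */
        uint8_t  rmod;             /* access mode; bit <6> = AST     */
        uint8_t  efn;              /* event flag to set              */
        uint32_t rqpid;            /* PID of the requesting process  */
    };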

11.3.3 Timer Request Servicing 

If the TQE is a process timer request (created by a $SETIMR system service 
call and indicated by a TQE$B_RQTYPE value of zero), then the following 
operations are performed: 

1. The event flag associated with this timer event is set by using the 
TQE$L_PID and TQE$B_EFN fields and invoking the SCH$POSTEF rou- 
tine. A software priority increment of three may be applied when the proc- 
ess next executes (see Chapter 10). 

2. If the target process is no longer in the system, the TQE is simply deallo- 
cated without further action. 

3. Otherwise, the JIB$W_TQCNT quota is incremented to indicate the de- 
livery of the timer event and the impending deallocation of the TQE. 

4. If an AST was requested (indicated by bit <6> of TQE$B_RMOD), then
the TQE$B_RMOD field is moved to TQE$B_RQTYPE to reformat the
TQE into an AST control block (ACB). The ACB is then queued to the
target process, in the access mode of the original timer request, by calling
the routine SCH$QAST (see Chapter 7).

When the processing of this timer queue element has been completed, the 
software timer routine checks to see if another TQE element can be removed 
from the queue. 

Note that process timer requests are strictly one-time requests. Any repeti- 
tion of timer requests must be implemented within the requesting process. 

11.3.4 Scheduled Wakeup 

The second type of timer queue element is associated with a request for a 
scheduled $WAKE to a hibernating process. This type of request may be ei- 
ther one-time or repeating and may be requested by a process other than the 
target process. 
The following operations are performed for scheduled wake TQEs. 

1. The target process (indicated by TQE$L_PID) is awakened by executing 
the routine SCH$WAKE. If the target process is no longer in the system, 
the PCB$W_ASTCNT quota of the requesting process (TQE$L_RQPID) is 




incremented and the control block is deallocated to nonpaged dynamic 
memory. 

2. If the request is a one-time request (indicated by a cleared TQE$V_ 
REPEAT bit in the TQE$B_RQTYPE field), then the deallocation opera- 
tion is the same as that described in item 1. 

3. If the request is a repeating type, then the repeat interval (TQE$Q_ 
DELTA) is added to the request time (TQE$Q_TIME), and the timer queue 
element is reinserted in the timer queue. 

The software timer routine then checks to see if the next timer request can 
also be performed at this time. 
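
The dequeue-and-reinsert logic can be sketched as follows (struct tqe is
reduced to the fields used, TQE_REPEAT stands in for the TQE$V_REPEAT
bit, and the four helper routines are assumptions of the sketch, not VMS
entry points):

    #include <stdint.h>

    #define TQE_REPEAT 0x04        /* stand-in for TQE$V_REPEAT      */

    struct tqe { struct tqe *tqfl, *tqbl; uint8_t rqtype;
                 uint32_t pid; int64_t time, delta; };

    extern void remove_from_queue(struct tqe *t);           /* assumed */
    extern void insert_in_order(struct tqe *h, struct tqe *t);
    extern void wake_target(uint32_t pid);         /* SCH$WAKE-like   */
    extern void deallocate(struct tqe *t);

    /* Service every due scheduled-wakeup TQE: wake the target, then
       either reinsert the element at TIME + DELTA (repeating request)
       or return it to nonpaged pool (one-time request). */
    void service_wakeups(struct tqe *head, int64_t systime)
    {
        while (head->tqfl != head && head->tqfl->time <= systime) {
            struct tqe *t = head->tqfl;
            remove_from_queue(t);
            wake_target(t->pid);
            if (t->rqtype & TQE_REPEAT) {
                t->time += t->delta;       /* next expiration        */
                insert_in_order(head, t);  /* back into the queue    */
            } else {
                deallocate(t);             /* one-time request       */
            }
        }
    }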



11.3.5 Periodic System Procedures 

The third type of timer queue element defines a system subroutine request. A 
request of this type is not the result of any process request, but is a system- 
requested time-dependent event. The software timer interrupt service rou- 
tine handles this type of TQE by the following action: 

• Loading R3 and R4 from the TQE$L_FR3 and TQE$L_FR4 fields (nor- 
mally defined as the TQE$L_AST and TQE$L_ASTPRM fields) 

• Executing a JSB instruction using the TQE$L_FPC field (normally defined 
as the TQE$L_PID field) 

On return from the system subroutine, the TQE$V_REPEAT bit is tested. 
If the bit is set, then the TQE is reinserted in the timer queue using the 
TQE$Q_DELTA time field. If the request was a nonrepeating one, then the 
timer routine immediately checks the timer queue for further TQEs to serv- 
ice. The TQE is not deallocated because these requests do not use dynamic 
memory. This type of TQE is defined in static nonpaged portions of system 
space, such as the module SYSCOMMON in the case of the EXE$TIMEOUT 
subroutine. 

One example of this type of request, a repeating system subroutine request,
is the once-per-second execution of the subroutine EXE$TIMEOUT:

1. The routine SCH$SWPWAKE is called to possibly awaken the swapper 
process (see Chapter 17). 

2. The EXE$TIMEOUT subroutine updates the EXE$GL_ABSTIM field to 
indicate the passing of one second of system uptime. 

3. The routine ERL$WAKE is called to possibly awaken the ERRFMT process 
(see Chapter 8). 

4. This subroutine scans the I/O database for devices that have exceeded 
their timeout intervals. Drivers for such devices are called at their timeout 
entry points at device IPL. A path through this subroutine checks for ter- 
minal timed reads that have expired. 




5. The first entry on the lock manager time out queue is checked to see if it 
has expired. If it has, a deadlock search is initiated. 

6. The PCB pointer list is searched for normal-priority (priority less than 16) 
processes in the COM or COMO state, whose priority is less than that of 
the current process (or the highest priority computable process). The cur- 
rent priority of these lower priority processes is boosted so that they be- 
come the highest priority COM or CUR process. This feature was imple- 
mented to prevent a high-priority, compute-intensive job from causing 
other processes to be unable to release system (or other) resources. The 
number of processes that can receive this boost is determined by the spe- 
cial SYSBOOT parameter PIXSCAN. The PCB pointer list is searched in a 
circular fashion, in order that all processes will eventually receive the 
priority boost. 

The TQE for this subroutine is permanently defined in the module SYS- 
COMMON, and the timer queue is initialized at bootstrap time with this 
data structure as the first element in the queue. 

The terminal driver also uses a repeating system timer routine to imple- 
ment its modem polling. The controller initialization routine in the terminal 
driver loads the expiration time field in a TQE in the terminal driver 
with the current system time, sets the repeat bit, and loads the repeat 
interval with the SYSBOOT parameter TTY_SCANDELTA. When the
timer request expires, the routine polls each modem looking for state changes.

11.4 TIMER SYSTEM SERVICES

Two system services are used to insert entries in the timer queue, Schedule 
Wakeup request ($SCHDWK) and Set Timer request ($SETIMR). Both of 
these services are contained in the module SYSSCHEVT. Two comple- 
mentary services delete entries from the timer queue, $CANWAK and 
$CANTIM. These system service routines are in the module SYSCANEVT. 

11.4.1 $SETIMR Requests 

The $SETIMR system service calls produce timer queue entries of the single 
process request type, TQE$C_TMSNGL. The following steps are performed: 

1. The event flag specified as an argument to the system service is cleared in 
preparation for subsequent setting at expiration time. 

2. The request is checked to make sure that the following are true: 

• The delta time location is accessible by the requesting process. 

• The PCB$W_ASTCNT of the requesting process is not exceeded (if an 
AST is to be associated with this timer request). 

• The JIB$W_TQCNT of the requesting job is not exceeded.




3. A timer queue element is allocated from nonpaged dynamic memory and 
the TQE is initialized from the system service arguments (delta time, re- 
quest type, and process ID). 

4. If the expiration time was expressed as an interval (a negative argument),
then the absolute expiration time of the request is calculated by adding the
delta time of the request to the current system time, EXE$GQ_SYSTIME
(see the sketch following this list). The absolute expiration time is stored
in the TQE$Q_TIME field.

5. The JIB$W_TQCNT field of the pooled job quotas is decremented to indi- 
cate the allocation of the TQE. 

6. The access mode of the system service caller is stored in the 
TQE$B_RMOD field. If an AST routine was specified as an argument to 
the $SETIMR call, then the process PCB$W_ASTCNT is decremented to 
indicate the future AST delivery and bit <6> of TQE$B_RMOD is set to 
indicate the AST accounting. 

7. The AST parameter (request identification) and event flag number argu- 
ments are copied to the TQE. 

8. The TQE is then inserted into the timer queue and the routine returns. 
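
The delta-versus-absolute convention in step 4 reduces to a one-line
computation (a C sketch of the rule that negative quadword times denote
intervals):

    #include <stdint.h>

    /* A negative 64-bit time argument is a delta time; convert it to
       an absolute expiration time by adding its magnitude to the
       current system time.  A positive argument is already absolute. */
    int64_t expiration_time(int64_t time_arg, int64_t systime)
    {
        return (time_arg < 0) ? systime - time_arg : time_arg;
    }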

The $CANTIM system service removes one or more timer queue elements 
before expiration. Two arguments, the request identification parameter and 
the access mode, control the actions taken by this routine. 

1. The access mode requested is maximized with that of the caller. (That is, 
no requests can be deleted for access modes more privileged than the 
caller.) 

2. Each TQE in the timer queue that meets all of the following criteria is 
removed and deallocated: 

• The process ID of the $CANTIM system service caller is the same as 
the process ID stored in the TQE. 

• The access mode of the caller is at least as privileged as the access mode 
stored in the TQE. 

• The request identification parameter argument is the same as that 
stored in the TQE. If the argument value is zero, then all TQEs meeting 
the first two criteria are removed. 

11.4.2 Scheduled Wakeup Operations

The logic for managing scheduled wakeup requests is similar to that for 
$SETIMR requests. Two differences are the ability to specify repeating sched-
uled wakeup requests and the ability to schedule wakeup requests for an- 
other process. The following steps create a scheduled wakeup request. 

1. The target process ID is verified from a system service argument. If the 
target process is not in the system, the scheduled wakeup request is ig- 
nored. 






2. If the target process exists, and if the current process is suitably privileged 
(GROUP or WORLD) with respect to it, then the repeat time is tested to 
determine whether the request is a one-time or repeating scheduled 
wakeup, TQE$C_WKSNGL or TQE$C_WKREPT of the TQE$B_RQTYPE 
field. 

3. The requested repeat time is formatted for insertion in the TQE. If the 
repeat time is less than 10 milliseconds, it is increased to that value (the 
resolution of the hardware clock interrupt). 

4. A TQE is allocated from nonpaged dynamic memory. 

5. The repeat time, request type, and target process ID are inserted into the 
TQE. 

6. If the initial scheduled wakeup time is expressed as an interval, then the 
initial absolute expiration time is calculated as in $SETIMR from the ini- 
tial delta time and the current system time. 

7. The ASTCNT quota of the requesting process is decremented to account 
for the allocation of the TQE. 

8. The TQE is inserted into the timer queue. 

When the expiration time is reached, a wakeup request is posted for the target 
process (see Section 11.3.4). Deallocation of the TQE occurs after delivery of a 
one-time scheduled wakeup request or as a result of a $CANWAK system 
service call. 
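
For a repeating request, the TQE is not deallocated at expiration; instead its
expiration time is advanced by the repeat interval and the entry is
reinserted. The following C fragment sketches this behavior (names invented;
the real work is done in the timer expiration code described in Section
11.3.4):

    /* Sketch of scheduled-wakeup expiration handling.                 */
    #define WKSNGL 1                      /* one-time (TQE$C_WKSNGL)    */
    #define WKREPT 2                      /* repeating (TQE$C_WKREPT)   */

    typedef struct tqe {
        struct tqe *flink, *blink;
        long long   time;                 /* absolute expiration        */
        long long   repeat;               /* repeat interval (>= 10 ms) */
        int         rqtype;               /* request type               */
        int         pid;                  /* target process ID          */
    } tqe_t;

    extern void post_wakeup(int pid);     /* report wake event to scheduler */
    extern void insert_tqe(tqe_t *t);     /* time-ordered insertion         */
    extern void deallocate_tqe(tqe_t *t); /* return to nonpaged pool        */

    static void expire_wakeup(tqe_t *t)
    {
        post_wakeup(t->pid);
        if (t->rqtype == WKREPT) {
            t->time += t->repeat;         /* next expiration            */
            insert_tqe(t);                /* back into the timer queue  */
        } else {
            deallocate_tqe(t);            /* one-time request           */
        }
    }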

The $CANWAK system service cancels all one-time and repeat scheduled 
wakeup requests for a target process. Each canceled TQE is deallocated to 
nonpaged dynamic memory and the PCB$W_ASTCNT of the initial request- 
ing process is incremented to indicate the deallocation. 






12 Process Control and 
Communication 



I claim not to have controlled events, but confess plainly that 

events have controlled me. 

— Abraham Lincoln, letter to A.G. Hodges, April 4, 1864 



The VMS operating system provides many services that allow processes to 
communicate with one another and allow one process to control the execu- 
tion of another. Event flags are the most primitive control and communica- 
tion tool available (in terms of amount of information). Other communica- 
tion techniques include logical names, mailboxes, the VAX/VMS lock 
management system services (lock manager), global shared data sections, and 
shared files. (The lock manager is discussed only briefly here; for a full de- 
scription, see Chapter 13.) System services allow a process to alter some of its 
parameters (such as name or priority). Other services allow a process to affect 
its own scheduling state or that of another process. A summary of process 
control system services is listed in Table 12-1. 



12.1 EVENT FLAG SERVICES 

Event flags are used within a single process for synchronization of I/O re- 
quests, enqueue lock requests, $GETJPI system service calls, and timer re- 
quests. They can also be used either within a single process or among several 
processes in the same group as application-specific synchronization tools. 
System services are provided to read, set, or clear collections of event flags. 
Other services allow a process to wait for one event flag or a collection of 
event flags. 



12.1.1 Local Event Flags 

Each process has available to it 64 local (process-specific) event flags and 64 
shareable event flags (among processes in the same group). The 64 local event 
flags are stored directly in the software PCB, at offsets PCB$L_EFCS and 
PCB$L_EFCU (see Figure 12-1). Local event flags 0 to 31 are located in long- 
word PCB$L_EFCS. Local event flags 32 to 63 are located in longword 
PCB$L_EFCU. 
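
Given a local event flag number, selecting the correct longword and bit is a
simple computation, sketched below in C (illustrative only; the field names
follow Figure 12-1):

    #include <stdint.h>

    typedef struct {
        uint32_t efcs;          /* local event flags 0-31  (PCB$L_EFCS) */
        uint32_t efcu;          /* local event flags 32-63 (PCB$L_EFCU) */
    } pcb_flags_t;

    /* Return 1 if local event flag efn (0-63) is set. */
    static int test_local_flag(const pcb_flags_t *pcb, unsigned efn)
    {
        uint32_t cluster = (efn < 32) ? pcb->efcs : pcb->efcu;
        return (cluster >> (efn & 31)) & 1;
    }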






Table 12-1: Summary of Process Control System Services

Service Name                          Affect Other Processes   Privilege Checks

Create Common Event Flag Cluster      Same group only          PRMCEB (for permanent
                                                               clusters only)
Delete Common Event Flag Cluster      Same group only          PRMCEB
Wait for Single Event Flag            No                       None
Wait for Logical AND of Event Flags   No                       None
Wait for Logical OR of Event Flags    No                       None
Hibernate                             No (1)                   None
Wake                                  Yes                      GROUP or WORLD
Schedule Wakeup                       Yes                      GROUP or WORLD
Cancel Wakeup                         Yes                      GROUP or WORLD
Suspend                               Yes                      GROUP or WORLD
Resume                                Yes                      GROUP or WORLD
Exit                                  No                       None
Forced Exit                           Yes                      GROUP or WORLD
Create Process                        Yes                      DETACH for other than
                                                               subprocesses
Delete Process                        Yes                      GROUP or WORLD
Set AST Enable                        No                       Access Mode Check
Set Power Recovery AST                No                       Access Mode Check
Set Priority                          Yes                      ALTPRI and GROUP or
                                                               WORLD
Set Process Name                      No                       None
Set Resource Wait Mode                No (2)                   None
Set Swap Mode                         No (2)                   PSWAPM
Set System Failure Mode               No (2)                   Access Mode Check
Get Job/Process Information           Yes                      GROUP or WORLD

(1) As part of the Create Process system service, a process can specify that
the process being created hibernate before a specified image executes.

(2) These three features can each be specified as a part of the Create Process
system service.



12.1.2 Common Event Flags 

Common event flag clusters do not initially exist. They must be created by 
the first process that calls the Associate Event Flag Cluster system service for 
a given cluster. This service allocates a structure called a common event 
block (see Figure 12-2) from nonpaged pool and loads its address into the PCB 
pointer field (either PCB$L_EFC2P or PCB$L_EFC3P). The common event 
block is linked into a system-wide list of common event blocks located by 
global listhead SCH$GQ_CEBHD (see Figure 12-3). 

As additional processes associate with this cluster, the CEB list is searched 
in order to locate the CEB, the event flag cluster pointers in their PCBs are 
updated, and the reference count for that cluster is updated. As processes 
disassociate from a cluster (with the $DACEFC system service), the reference 
count is decremented. When the reference count for a temporary cluster goes 
to zero, the cluster is automatically deleted and the CEB deallocated. 

Figure 12-1 
Software PCB Fields That Support Event Flags 
(Fields: SQFL, SQBL, WEFC, STATE, EFWM/PQB, EFCS, EFCU, EFC2P, EFC3P) 

Figure 12-2 
Layout of Common Event Block 
(Fields: CEB forward and backward links, status, type, size, process ID of 
creator, event flags, wait queue forward and backward links, CEF state 
number, wait count, UIC of creator, reference count, protection mask, count, 
and cluster name (up to 15 characters)) 

Permanent clusters must be explicitly deleted (using the $DLCEFC system 
service) in order to cause the CEB to be deallocated when the reference count 
goes to zero. Alternatively, permanent clusters can continue to exist without 
requiring that they be associated with any processes. In fact, the only opera- 
tion performed by the Delete Common Event Flag Cluster system service is 
to turn off the CEB$V_PERM bit. (If the reference count of the cluster is zero 
when the permanent bit is turned off, the cluster is deleted.) 



Figure 12-3 
Common Event Flag Wait Queues 
(The CEBs are linked into a list headed by SCH$GQ_CEBHD. Each CEB contains 
a wait queue linking the PCBs of processes waiting for flags in that 
cluster; a CEB whose wait queue is empty has no processes waiting for flags 
in its common event flag cluster.) 



12.1.3 Event Flag Wait States 

Processes are placed into event flag wait states implicitly when any of the 
following actions are performed: 

• Executing a $QIOW or $ENQW system service 

• Using the RMS services as synchronous operations (the usual way they are 
called) 

• Executing one of the three event flag wait services ($WAITFR, $WFLOR, 
$WFLAND) 

If the flag or flags in question are already set, the system service immedi- 
ately returns successfully to its caller. Otherwise, the process is placed into 
either a local or common event flag wait state. The saved PC in the hardware 
PCB is backed up by 4 (see Chapter 10) to allow ASTs to be delivered to the 
process while it is waiting for the flag(s) to be set. The event flag cluster 
number (0 or 1 for local clusters and 2 or 3 for global clusters), indicating 
which flags are being waited for, is stored in the PCB (at offset 
PCB$B_WEFC). The list (mask) of event flags being waited for is stored (in 
one's complement form) in PCB$L_EFWM; a sketch of how these masks are 
built follows the list below. 

• If the process is waiting for a single event flag (SYS$WAITFR), the 
PCB$L_EFWM mask contains a 1 in every bit except the bit number corre- 
sponding to the specified flag. 

• If the process is waiting for any one of several flags to be set (SYS$WFLOR), 
the PCB$L_EFWM mask contains the one's complement of the mask 
passed to the $WFLOR system service. (The $WAITFR mask is thus a spe- 
cial case of a wait for any one of a group of flags to be set.) If any of the flags 
in the requested mask is set when $WFLOR is called, the process is not 
placed into a wait state. Instead, the service immediately returns a success 
code to its caller. 

• If a process calls the $WFLAND system service, indicating a wait for all 
flags in a given mask to be set, the wait mask is modified so that event 
flags that are set when the service is called are not represented in the wait 
mask. In addition, a bit in the process status longword (PCB$V_WALL in 
PCB$L_STS) is set, indicating that all flags represented by the mask must 
be set before the wait is satisfied. 
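
The construction of these masks reduces to a few complement operations. The
following C fragment is purely illustrative (the executive performs the
equivalent operations in MACRO-32):

    #include <stdint.h>

    /* Building the wait mask (PCB$L_EFWM); all masks are stored in
       one's complement form, so a clear bit in EFWM means "this flag
       is being waited for."                                           */

    uint32_t efwm_waitfr(unsigned efn)        /* $WAITFR: one flag     */
    {
        return ~(1u << (efn & 31));           /* all ones except flag bit */
    }

    uint32_t efwm_wflor(uint32_t mask)        /* $WFLOR: any of a mask */
    {
        return ~mask;                         /* $WAITFR is the special
                                                 case mask == 1 << efn */
    }

    uint32_t efwm_wfland(uint32_t mask, uint32_t flags_now)
    {
        /* $WFLAND: flags already set are removed from the wait mask;
           PCB$V_WALL is also set (not shown), so all remaining flags
           must be set before the wait is satisfied.                   */
        return ~(mask & ~flags_now);
    }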

There exist two local event flag wait states (LEF and LEFO) and two corre- 
sponding wait queue listheads (SCH$GQ_LEFWQ and SCH$GQ_LEFOWQ) 
for the entire system. On the other hand, there exists one common event flag 
wait queue listhead for each common event cluster that currently exists. 
Each common event flag wait queue listhead is located in the corresponding 
common event block (see Figure 12-2) and has the same overall structure as 
any other wait queue listhead (see Figure 12-3). 



12.1.4 Setting and Clearing Event Flags 

Event flags can be set directly by a process by calling the Set Event Flag 
system service. A process could use this service at AST level to communicate 
with its mainline code. It can also set common event flags to communicate 
with other processes. Event flags are also set in response to I/O completion, 
timer expiration, the granting of a lock request, and delivery of a $GETDVI, 
$GETTPI, or $GETSYI request. 

It should be noted here that when the VAX/VMS operating system uses 
shared event flags to communicate information between processes, a strict 
set of ownership rules is used. When a controlling process is getting ready to 
set an event flag, it owns the flag. When the process has set the flag (thereby 
allowing waiting processes to become computable), it relinquishes its owner- 
ship of the flag to the other processes. It is then the responsibility of the other 
processes to clear the flag and notify the controlling process that it has re- 
gained ownership of the flag. In this scheme, ownership is maintained by 
convention alone; it is not enforced by the software. DIGITAL recommends 




that applications that use shared event flags as a communications tool adhere 
to these same conventions. 

Both the system service and the special paths call the same routine 
(SCH$POSTEF) to perform the actual event flag setting and check for possi- 
ble scheduling implications. 

The operation of SCH$POSTEF depends on what kind of event flag is being 
set; the wait-satisfaction test that it applies is sketched after the 
following list. 

• If the event flag that is being set is local, a check is made to determine 
whether this flag satisfies the process's wait request. In a $WFLOR wait, 
this flag merely has to match one of the flags being waited for. In a 
$WFLAND wait, all of the flags being waited for must be set in order to 
satisfy the process's wait request and report an event to the scheduler. 

• When a common event flag is set, the list of PCBs in the common event 
block wait queue is scanned to determine if any of the processes waiting 
for flags in this cluster satisfy its wait request as a result of setting this flag. 
A system event is reported for each such process. 

All such processes are made computable. If the priority of any one of 
them is greater than the priority of the currently executing process, a re- 
scheduling interrupt is requested. As with all other cases in the system 
where several processes become computable as a result of the same sys- 
tem-wide event, the process with the highest software priority will be se- 
lected for execution. 

• For common event flags located in shared memory, there is one more level 
of complication. The event flag must be set in the master CEB located in 
shared memory, and other processors connected to this shared memory 
unit must be notified that a shared memory common event flag was just 
set. (Shared memory common event flag data structures are discussed at 
the end of this chapter. Other shared memory data structures are described 
in Chapter 14.) 

Any other processor connected to the same global event flag cluster re- 
ceives initial notification through an MA780 interrupt. The interrupt serv- 
ice routine determines that the interrupt was due to an event flag in shared 
memory being set, copies the entire set of event flags from the master CEB 
to the slave CEB, and checks whether any of the processes waiting for flags 
in this cluster are now computable. 
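
The wait-satisfaction test mentioned in the first two cases reduces to a pair
of mask operations, sketched here in C (an illustrative model of the logic in
SCH$POSTEF, not a transcription of it):

    #include <stdint.h>

    /* efwm:  wait mask in one's complement form (PCB$L_EFWM)
       flags: current contents of the event flag cluster
       wall:  PCB$V_WALL, set for $WFLAND waits                       */
    static int wait_satisfied(uint32_t efwm, uint32_t flags, int wall)
    {
        if (wall)
            /* $WFLAND: every flag absent from efwm must now be set.  */
            return (flags | efwm) == 0xFFFFFFFFu;
        else
            /* $WAITFR/$WFLOR: any flag absent from efwm will do.     */
            return (flags & ~efwm) != 0;
    }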

12.1.4.1 Other Event Flag Services. The Clear Event Flag system service simply clears 
the specified event flag. Note that when clearing a flag in a common event 
flag cluster in shared memory, only the event flag in the master CEB is cleared. It 
is not necessary to copy the set of flags from the master CEB to the slave 
CEBs on other processors when an event flag is cleared for the following two 
reasons: 




• The event flag wait services only use the master CEB when checking 
whether to place a process into a wait state or return immediate success. 

• The event flag posting routine copies the master set of flags to the local 
slave CEB before testing whether any process wait requests are satisfied. 
The master set of flags is copied into all other slave CEBs as a result of 
notifying other processors that a flag has been set. 

The Read Event Flag system service is simply informational. It has no 
effect on the computability of any process on any processor. The event flag 
cluster is read from the same destinations as those affected by the Clear 
Event Flag system service. 

• Local event flag clusters are read from the software PCB. 

• Regular common event flag clusters are read from the CEB. 

• Common event flag clusters located in shared memory are read from the 
master CEB located in shared memory. 



12.2 AFFECTING THE COMPUTABILITY OF ANOTHER PROCESS 

In any multiprocessing application, it is necessary for one process to control 
whether and when other processes in the application can execute. The VMS 
operating system contains several services that provide this control. 



12.2.1 Common Event Flags 

Common event flags described in the previous section are one method of 
synchronization control. One process can reach a critical point in its execu- 
tion and wait on a global event flag. Another process can allow this process to 
continue its execution by setting the flag in question. 

Common event flags are also used as semaphores for more complicated 
forms of interprocess communication that use logical names or global sec- 
tions. 



12.2.2 Process Control Services 

Several system services allow one process to directly alter the scheduling 
state of another process. 

12.2.2.1 Privilege Checks. All system services that permit one process to directly af- 
fect another allow the process to be specified either by process name or by 
process identification (PID). In either case, the VMS operating system must 
determine whether the specified process exists and whether the caller has the 
proper privilege (GROUP, WORLD) or is part of the same process tree and can 
thus affect the other process. This work is centralized in a routine called 
EXE$NAMPID that is called by all such system services. 




If the specified process exists, and the caller can affect the specified proc- 
ess, EXE$NAMPID returns successfully (at IPL 7) with the PCB address of the 
specified process in R4. Note that this return condition alters the contents of 
R4, which usually contains the caller's PCB address. If the specified process is 
a part of the same process tree as the caller (the JIB address is identical), 
EXE$NAMPID will return successfully. A second important use of 
EXE$NAMPID is in obtaining a PID when the process name is known. If a 
process name is specified and the PID address argument points to a zero long- 
word, the PID of the named process is returned to the caller at the 
designated location. 
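
The checks made by EXE$NAMPID can be summarized with a sketch. The following
C model is illustrative only; the self-test and the order of the checks are
assumptions for clarity, not a transcription of the routine:

    #include <stdint.h>

    struct jib;                              /* job information block      */

    typedef struct pcb {
        struct jib *jib;                     /* common to a process tree   */
        uint32_t    uic_group;               /* UIC group number           */
    } pcb_t;

    static int can_affect(const pcb_t *caller, const pcb_t *target,
                          int has_group, int has_world)
    {
        if (caller == target)           return 1;  /* a process may always
                                                      affect itself        */
        if (caller->jib == target->jib) return 1;  /* same process tree
                                                      (identical JIB)      */
        if (has_world)                  return 1;  /* WORLD: any process   */
        if (has_group &&
            caller->uic_group == target->uic_group)
            return 1;                              /* GROUP: same group    */
        return 0;
    }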

12.2.2.2 Process Creation and Deletion. A first step in a multiprocess application 
requires that a controlling process create other processes for designated work. 
These processes may be deleted when they have completed their work or 
they may exist in some wait state in anticipation of additional work. The 
detailed operation of process creation is described in Chapter 20. Process de- 
letion is described in Chapter 22. 

12.2.2.3 Hibernate/Wake. There are two different ways that a process can be tempo- 
rarily halted, called hibernation and suspension. The differences between 
these two wait states are described in the VAX/VMS System Services Refer- 
ence Manual. 

A process can only put itself into the hibernate state. (That is, a process 
cannot put another process into the HIB state.) If the wake pending flag is not 
set (this flag check also clears the flag), indicating that an associated wake has 
not preceded the hibernate call, the process is placed into the hibernate wait 
state. As described in Chapter 10, the saved PC is backed up by 4 so that the 
process will be put back into the hibernate state in case it receives ASTs 
while it is hibernating. (Note that the check of the wake pending flag by the 
Hibernate system service includes the case where a process first hibernates 
and then is awakened by a wake call issued from an AST.) 

The $WAKE system service is the complementary service to Hibernate. A 
process may awaken itself (by calling $WAKE from an AST) or it may be 
awakened when another process calls $WAKE with the target process speci- 
fied either by name (if the target process is in the same group, and the caller 
has GROUP privilege) or by process ID (if the caller has GROUP or WORLD 
privilege). This service sets the wake pending flag in the software PCB and 
reports the awakening event to the scheduler. The process is removed from 
the HIB or HIBO queue and placed into the COM or COMO state in the 
queue corresponding to its updated priority. (A wake event results in a prior- 
ity boost class of PRI$_RESAVL, which is equivalent to a boost of 3.) 

The next time the process executes, the hibernate service executes again 
(because the PC was backed up by 4). Because the wake pending flag is now 




set, the process returns immediately from the hibernate call (with the wake 
pending flag now clear). Notice that if the process is in any state other than 
HIB or HIBO when it is awakened, the net result is to leave the wake pending 
flag set with no other change in its scheduling state. 
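
The wake-pending handshake just described can be sketched as follows (a C
model with invented helper routines; the backed-up PC is what makes the
hibernate check reexecute):

    typedef struct pcb pcb_t;

    extern int  test_and_clear_wakepen(pcb_t *p);  /* PCB$V_WAKEPEN       */
    extern void set_wakepen(pcb_t *p);
    extern void back_up_pc(pcb_t *p);              /* saved PC minus 4    */
    extern void enter_wait_hib(pcb_t *p);          /* HIB/HIBO wait state */
    extern int  in_hib_state(const pcb_t *p);      /* HIB or HIBO         */
    extern void report_wake_event(pcb_t *p);       /* boost PRI$_RESAVL   */

    void hibernate(pcb_t *p)                  /* runs in p's own context  */
    {
        if (test_and_clear_wakepen(p))        /* wake already pending?    */
            return;                           /* return immediately       */
        back_up_pc(p);                        /* reexecute $HIBER on wake */
        enter_wait_hib(p);                    /* wait until awakened      */
    }

    void wake(pcb_t *p)                       /* p is the target process  */
    {
        set_wakepen(p);
        if (in_hib_state(p))
            report_wake_event(p);             /* HIB to COM, HIBO to COMO */
    }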

12.2.2.4 Suspend/Resume. Process suspension is slightly more complicated internally 
than hibernation because a process can be placed into the SUSP state by other 
processes. The scheduling philosophy of the VMS operating system, illus- 
trated in Figure 10-5, assumes that processes enter various wait states from 
the state of being the current process and in no other way. This assumption 
requires that the process being suspended (the target) become current, replac- 
ing the currently executing process, the caller of the Suspend system service. 
The VMS operating system accommodates this scheduling constraint by 
using a special kernel AST, the same tool that it uses when it needs access to 
a portion of process address space. In this case, it is not the process address 
space that is so important. Rather, the process must first be made current 
before it is placed into the SUSP state. 

12.2.2.4.1 Process Suspension. Process suspension occurs in two pieces. The portion of 
the service that executes in the context of the caller sets the suspend pending 
bit in the software PCB of the target process and queues the special kernel 
AST (the routine that performs the actual suspension) to that process. This 
implementation includes the special case where a process suspends itself. 

Through the normal scheduling selection process, the target process even- 
tually executes. The special kernel AST that performs the suspension exe- 
cutes first unless there are previously queued special kernel ASTs. This AST 
first checks (and clears) the resume pending flag in PCB$L_STS. (This check 
avoids the deadlock that could otherwise occur if the associated call to the 
$RESUME service preceded the call to $SUSPEND.) If the resume pending 
flag is set, the process simply clears the suspend pending bit, returns from the 
AST, and continues with its execution. 

Otherwise, it is placed into the SUSP wait state. The saved PSL contains 
IPL 2, preventing delivery of ASTs while a process is suspended. (In addition, 
the AST system event is ignored for processes in either the SUSP or the 
SUSPO state.) The saved PC is an address within the suspend special kernel 
AST. When the process is resumed (the only way that a suspended process 
can continue with its execution), it reexecutes the check of the resume pend- 
ing flag, which is now set, causing the process to return successfully from the 
special AST. 
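
The deadlock-avoiding check in the suspend special kernel AST can be sketched
the same way (a C model with invented names):

    typedef struct pcb pcb_t;

    extern int  test_and_clear_respen(pcb_t *p);  /* resume pending flag  */
    extern void clear_suspen(pcb_t *p);           /* suspend pending flag */
    extern void enter_wait_susp(pcb_t *p);        /* SUSP wait; saved PSL
                                                     contains IPL 2, and the
                                                     saved PC reenters the
                                                     flag check below     */

    void suspend_kast(pcb_t *p)     /* special kernel AST, in p's context */
    {
        while (!test_and_clear_respen(p))     /* reexecuted when resumed  */
            enter_wait_susp(p);               /* SUSP wait state          */
        clear_suspen(p);                      /* skip or end suspension   */
    }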

12.2.2.4.2 Operation of the Resume System Service. The Resume system service is 
very simple. The resume pending flag in PCB$L_STS of the target process is 
set and (if the target process of the resume request is in either the SUSP or 




SUSPO state) a resume event is reported to the scheduler. As with all other 
system events, this report may result in a rescheduling pass, a request to 
wake the swapper process, or nothing at all. 

12.2.2.5 Exit and Forced Exit. The Exit system service terminates the currently exe- 
cuting image. If the process is executing a single image (it is neither an inter- 
active nor batch job), image exit usually results in process deletion. A de- 
tailed discussion of the Exit system service, including the calling sequence of 
termination handlers, is given in Chapter 21. 

The Force Exit system service is a tool that allows one process to execute 
the Exit system service on behalf of another process. The service simply sets 
the force exit pending flag in PCB$L_STS and queues a user mode AST to the 
target process. This AST, executing in user mode, calls the Exit system serv- 
ice after clearing the AST active flag by executing the following instruction: 

CHMK #ASTEXIT 

(For more information on this instruction, see Chapter 7). The call to Exit is 
executed in the context of the target process. Execution proceeds in exactly 
the same manner as it would if the target process had called Exit itself. 



12.2.3 Miscellaneous Process Attribute Changes 

Finally, there are several system services that allow a process to alter its 
characteristics, such as its response to system service failures, its software 
priority, and its process name. Some of these changes (such as priority eleva- 
tion or swap disabling) require privilege. The Set Priority system service is 
the only service described in this section that can be issued for a process 
other than the caller. 

12.2.3.1 Set Priority. The Set Priority system service allows a process to alter its own 
software priority or the priority of other processes that it is allowed (through 
GROUP or WORLD privileges) to affect. If a process has the ALTPRI privi- 
lege, it can change priority to any value between 0 and 31. A process without 
this privilege is restricted to the range between 0 and its own base priority. In 
VAX/VMS Version 3.0, the cell PHD$B_AUTHPRI was added to the process 
header. Storing a process's base priority in this cell allows the process to 
lower its priority below its base priority and raise it again up to its base 
priority. 

For most scheduling states (everything except COM, COMO, and CUR), 
the Set Priority system service simply changes the base software priority in 
the software PCB (at offset PCB$B_PRIB). If a process alters its own priority, 
not only its base but also its current priority (at offset PCB$B_PRI) is 
changed. When the priority of a computable process (either COM or COMO) 




is altered, the process is removed from the COM or COMO queue corre- 
sponding to its current priority and placed into a COM or COMO queue 
corresponding to its new priority (the new base with a boost of 2). In addition, 
a scheduling event is reported. If the new process priority (new base plus a 
boost of 2) is greater than or equal to the current priority of the current proc- 
ess, a rescheduling interrupt is requested. 
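
A sketch of this logic for a computable target follows (a C model with
invented names; software priority is expressed here in its external 0 to 31
form):

    typedef struct pcb {
        int prib;                     /* base priority    (PCB$B_PRIB)    */
        int pri;                      /* current priority (PCB$B_PRI)     */
    } pcb_t;

    extern int    is_computable(const pcb_t *p);    /* COM or COMO       */
    extern void   requeue_computable(pcb_t *p, int newpri);
    extern pcb_t *current_process(void);
    extern void   request_resched(void);            /* rescheduling
                                                       interrupt          */

    void set_priority(pcb_t *caller, pcb_t *target, int newbase)
    {
        target->prib = newbase;                     /* new base priority  */
        if (target == caller)
            target->pri = newbase;                  /* current one too    */
        if (is_computable(target)) {
            int newpri = newbase + 2;               /* base plus boost 2  */
            requeue_computable(target, newpri);     /* move COM/COMO queue */
            if (newpri >= current_process()->pri)
                request_resched();
        }
    }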

12.2.3.2 Set Process Name. Both the Set Process Name system service and the DCL 
command SET PROCESS/NAME= allow a process to change its process 
name. The new name cannot contain more than 15 characters. If no other 
process in the same group has the same name, the new name is placed into 
the software PCB (at offset PCB$T_LNAME). (Note that this service allows 
more flexibility in establishing a process name than is available from the 
usual channels, such as the authorization file or a $JOB card, because there 
are no restrictions imposed by the service on characters that can make up the 
process name. Even the DCL command is limited by characters that are un- 
acceptable to DCL.) 

12.2.3.3 Process Mode Services. There are several miscellaneous system services 
whose only action is to set or clear a bit in some field in the software PCB. In 
particular, the software PCB contains a status longword (not to be confused 
with the hardware entity, the PSL or processor status longword) that records 
the current software status of the process. Table 12-2 lists each of the flags in 
this longword, and the direct or indirect ways that these flags can be set or 
cleared. 

The Set Resource Wait Mode, Set System Service Failure Exception Mode, 
and Set Swap Mode system services all set (or clear) bits in this status long- 
word. The ability to disable swapping is protected by the PSWAPM privilege. 
The other two services require no privilege. Several other system services 
(such as $DELPRC, $FORCEX, $RESUME, or $SUSPND) set or clear bits in 
the status longword as an indication of their primary operation. 

The Set AST system service sets or clears (enables or disables) delivery of 
ASTs for a given access mode. The AST enable flags are stored at offset 
PCB$B_ASTEN within the PCB. These flags are discussed in Chapter 7. 



12.3 INTERPROCESS COMMUNICATION 

In any application involving more than one process, it is necessary for data to 
be shared among the several processes or for information to be sent from one 
process to another. The VMS operating system provides several services that 
accomplish this information exchange. The services vary in the amount of 
information that can be transmitted, the transparency of the transmission, 
and the amount of synchronization provided by the VMS operating system. 



Table 12-2: Meanings of Flags in PCB Status Longword (PCB$L_STS)

Symbolic Name    Meaning of Flag if Set             Flag Set by          Flag Cleared by

PCB$V_RES        Process is resident                Swapper              Swapper
                 (in the balance set)

PCB$V_DELPEN     Process deletion is pending        $DELPRC

PCB$V_FORCPEN    Forced exit is pending             $FORCEX              Image and process
                                                                         rundown

PCB$V_INQUAN     Process is in its initial          Swapper              Quantum end routine
                 quantum (following inswap)

PCB$V_PSWAPM     Process swapping is disabled       $SETSWM, $CREPRC     $SETSWM

PCB$V_RESPEN     Resume is pending (skip suspend)   $RESUME              Suspend special AST

PCB$V_SSFEXC     Enable system service exceptions   $SETSFM              $SETSFM, process
                 for kernel mode                                         rundown

PCB$V_SSFEXCE    Enable system service exceptions   $SETSFM              $SETSFM, process
                 for executive mode                                      rundown

PCB$V_SSFEXCS    Enable system service exceptions   $SETSFM              $SETSFM, process
                 for supervisor mode                                     rundown

PCB$V_SSFEXCU    Enable system service exceptions   $SETSFM, $CREPRC     $SETSFM, image and
                 for user mode                                           process rundown

PCB$V_SSRWAIT    Disable resource wait mode         $SETRWM, $CREPRC     $SETRWM

PCB$V_SUSPEN     Suspend is pending                 $SUSPND              Suspend special AST

PCB$V_WAKEPEN    Wake is pending (skip hibernate)   $WAKE, expiration    $HIBER
                                                    of scheduled wakeup

PCB$V_WALL       Wait for all event flags in mask   $WFLAND              Next $WFLOR or
                                                                         $WAITFR

PCB$V_BATCH      Process is a batch job             $CREPRC

PCB$V_NOACNT     Do not write an accounting         $CREPRC
                 record for this process

PCB$V_SWPVBN     Modified page write to the swap    Modified page        Modified page
                 file is in progress                writer               writer

PCB$V_ASTPEN     AST is pending (no longer used)

PCB$V_PHDRES     Process header is resident         Swapper              Swapper

PCB$V_HIBER      Hibernate after initial image      $CREPRC
                 activation

PCB$V_LOGIN      Login without reading the          $CREPRC
                 authorization file

PCB$V_NETWRK     Process is a network job           $CREPRC

PCB$V_PWRAST     Process has declared a power       $SETPRA              Routine that queues
                 recovery AST                                            recovery ASTs, image
                                                                         and process rundown

PCB$V_NODELET    Do not delete this process
                 (not used)

PCB$V_DISAWS     Do not perform automatic working   SET WORKING_SET/     SET WORKING_SET/
                 set adjustment on this process     NOADJUST, $CREPRC    ADJUST

12.3.1 Event Flags 

Global or common event flags can be treated as a method for several proc- 
esses to share single bits of information. In fact, the typical use of common 
event flags is as a synchronization tool for other more complicated communi- 
cation techniques. The internal operations of common event flags are de- 
scribed in the beginning of this chapter. 



12.3.2 VAX/VMS Lock Management System Services 

The lock management system services allow processes to name a shared re- 
source and request locks on that resource. If access to a resource cannot be 
immediately granted to a lock, a queuing mechanism is provided for a process 
to wait until it can be granted access to the resource. The lock manager 
provides a number of lock modes to control how the resource is to be shared 
with other processes. Blocking ASTs and a lock value block are also provided 
to pass information about, or synchronize access to, a resource. The internals 
of the lock manager are described in Chapter 13. 



12.3.3 Mailboxes 

Mailboxes are I/O devices in that they are written to and read from by the 
normal VMS I/O system, either through RMS or with the $QIO interface. 
Although process-specific or system-wide parameters may control the 
amount of data that can be written to a mailbox in one operation, there is no 
limit to the total amount of information that can be passed through a mail- 
box with a series of reads and writes. 

There are two forms of synchronization provided for mailbox I/O. Because 
mailboxes are I/O devices, a simple but restrictive technique would have the 
receiving process issue a read from the mailbox and wait until the read com- 
pletes. Of course, the read could not complete until the process writing to the 
mailbox completed its transmission of data. The limitation of this technique 
is that the receiving process cannot do anything else while it is waiting for 
data. Even if the process issues asynchronous I/O requests, an I/O request 
must be outstanding at all times in order to receive notification when some 
other process writes to the mailbox. In some applications, these limitations 
may be acceptable and so this technique can be used. 

Other applications may have a receiving process that can perform different 
tasks, depending on the information available to it. Putting such a process 
into a wait state for one task prevents it from servicing any of its other tasks. 
For such applications, the VMS operating system provides a special $QIO 
request called Set Attention AST that allows a process to receive notification 
through an AST when anyone writes into its mailbox. This technique allows 




a process to continue its mainline processing and handle requests from other 
processes only when such work is needed, without having an I/O request 
outstanding at all times. 



12.3.4 Logical Names 

Logical names (see Chapter 29) are used extensively by the VMS operating 
system to provide total device independence in the I/O system. However, 
logical names can be used for many other purposes as well. Specifically, one 
process can pass information to another process by creating a logical name (in 
the group or system table) with information stored in the equivalence string. 
The receiving process simply translates the name to retrieve the data. 

Although some form of synchronization is provided by an error return 
(SS$_NOTRAN) from the Translate Logical Name system service, processes 
using such a technique should use event flags (or an equivalent method) to 
synchronize this communication technique. One use of this technique where 
synchronization is not required occurs when a process creates a subprocess or 
detached process and passes the new process data in the equivalence strings 
for SYS$INPUT, SYS$OUTPUT, or SYS$ERROR. Using this method, there is 
no possibility for the translation to occur before the creation. 



12.3.5 Global Sections 

Global sections provide the fastest method for one process to pass informa- 
tion to another process. Because the two processes have the data area mapped 
into their address space, no movement of data takes place. Instead, the 
method provides for a sharing of the data. The method is not transparent 
because each process must map the global section that will be used to share 
data. In addition, the processes must use event flags, the lock management 
system services, or their own synchronization to prevent the receiver from 
reading data before it has been made available by the sender. 



12.3.6 Interprocessor Communication with the MA780 

VMS support for the MA780 shared memory unit provides a transparent com- 
munication path for interprocess communication even when processes are 
located on different processors connected through a shared memory unit 
(MA780). The three communication paths provided are common event flags, 
mailboxes, and global sections. 

Each of these entities is described by a name. When a process connects to 
one of them (with the Associate Common Event Flag Cluster system service, 
the Create Mailbox system service, or the Create and Map Section or Map 
Global Section system services), a logical name translation is performed on 




the name of the object. If the equivalence name is of the following form, the 
service makes the appropriate connection between the process and the data 
structure describing the object that exists in shared memory. 

shared-memory-name:object-name 

If the shared memory data structure does not exist, it is created (except that 
the Map Global Section system service does not create global sections that do 
not exist). The data structures that the VMS operating system uses to de- 
scribe shared memory are pictured in Chapter 14. In addition, memory man- 
agement data structures, including those structures that describe shared 
memory global sections, are found in that chapter. 

• For a common event flag cluster in shared memory, the event flag cluster 
in the software PCB (PCB$L_EFC2P or PCB$L_EFC3P) points to the slave 
CEB for the local processor. The slave CEB contains information that de- 
scribes the master CEB that is located in the shared memory (see Figure 
12-4). The following procedures are used to identify the slave PCB: 

—If the slave CEB already exists, the system service simply points the 

PCB to the CEB. 
—If the slave CEB does not exist but the master does (there are currently 

no references to this cluster on this CPU), then a slave CEB is created; 

the address of the master is stored in the slave; and the address of the 

slave is stored in the PCB. 
—If the master CEB does not exist either, it is created first in the shared 

memory. Then the slave is created and execution proceeds as described 

in the previous case. 

The way in which common event flags are set and cleared is described in 
the beginning of this chapter. The differences between shared memory 
common event blocks (master and slave) and local memory common event 
blocks are pictured in Figure 12-5. (A local memory common event block 
is pictured in Figure 12-2). 

• For a mailbox in shared memory, there are also three cases. 

—If the mailbox already exists on this port, the Create Mailbox system 
service simply assigns a channel to it. (The UCB pointer in an available 
channel control block is loaded with the address of the UCB describing 
the shared memory mailbox.) 

—If the mailbox is being created on this node for the first time, a UCB is 
allocated and loaded with parameters that describe the mailbox. A bit is 
set in a mailbox-dependent field indicating that this mailbox UCB de- 
scribes a mailbox in shared memory. Finally, the address of the shared 
memory mailbox control block is loaded into the UCB. 






Figure 12-4 
Relationship between Master and Slave CEB 
(Each processor's local memory holds a CEB list headed by its own 
SCH$GQ_CEBHD. Local CEBs, such as ALPHA, describe clusters used only on 
that processor. For a shared memory cluster, such as BETA or GAMMA, each 
participating processor has a slave CEB on its local list that points to 
the master CEB in shared memory.) 



Figure 12-5 
Shared Memory Common Event Flag Data Structures 
(The master CEB, which resides in shared memory, contains the valid and 
interlock bits, status, type, size, the event flags, the creator and deleter 
port numbers, the number of processes, the interprocessor lock, UIC of 
creator, protection mask, count, the cluster name (up to 15 characters), the 
virtual address of each processor's slave CEB, and a reference count for 
each processor. The slave CEB, which resides in processor local memory, has 
the same layout as a local memory common event block, plus the virtual 
address of the shared memory control block, the index to the master CEB, 
and the virtual address of the master CEB.) 



— If the shared memory mailbox control block (see Figure 18-2) does not 
exist, it is created before the rest of the operations described in the previ- 
ous step are performed. 

Shared memory mailbox data structures are pictured in Figures 18-2 and 
18-3. Mailbox creation is described in more detail in Chapter 18. 

• For a global section in shared memory, a special global section descriptor is 
allocated that describes the global section in shared memory. Unlike glo- 
bal sections that exist in local memory, there are no global page table 
entries set up for global sections in shared memory. 

When a process maps to the shared memory global section, its process 
page tables are set up to contain the PFNs of the shared memory pages and 




marked as valid. Such pages are not counted against the process working 
set. That is, pages in shared memory do not incur page faults. They are 
always valid, and therefore they can be described with a simple descriptor 
that is contained in the global section descriptor, rather than a set of global 
page table entries required for global pages that exist in local memory. 
Memory management data structures are described in Chapter 14. The 
memory management system services are discussed in Chapter 16. 






13 VAX/VMS Lock Manager 



'Tis in my memory lock'd, 

And you yourself shall keep the key of it. 

—Hamlet 1, 3 

The VAX/VMS lock manager provides semaphores that cooperating processes 
can use to synchronize access to shared resources. The lock manager allows 
callers to specify one of six degrees of shareability (lock modes) ranging from 
no access to exclusive access. Once the lock is granted, the owning process 
can request a lock conversion to change the lock mode. The lock manager 
provides a queuing mechanism by which processes can wait in turn until a 
shared resource becomes available. Two queues are available: a waiting 
queue for new locks and a conversion queue for lock conversions. 
The lock modes are: 

NL Null lock. Owner can neither read nor write; compatible with all 

other locks. 

CR Concurrent read. Read access and sharing with other readers and 

writers. 

CW Concurrent write. Write access and sharing with other readers and 
writers. 

PR Protected read. Read access and sharing with other readers; no writ- 

ers allowed. 

PW Protected write. Write access and sharing with CR mode readers; no 
other writers allowed. 

EX Exclusive access. Write access; denies access to any other readers or 

writers. 

This chapter first discusses the data structures used by the lock manager. The 
action of the lock manager when locks are queued and dequeued is then 
described. The last section in this chapter describes deadlock detection. The 
treatment in this chapter assumes that the reader is familiar with the descrip- 
tion of the VAX/VMS lock management system services found in the VAX/ 
VMS System Services Reference Manual. 



13.1 LOCK MANAGER DATA STRUCTURES 

Essentially the lock database consists of the following four structures: 

• Lock blocks that describe the locks requested by processes 




• Resource blocks that describe the resource names for which locks have 
been requested 

• The lock ID table that locates the lock blocks 

• The resource hash table that locates the resource blocks 



13.1.1 Lock Blocks 



Figure 13-1 shows the structure of the lock block (LKB). The lock block is 
allocated from nonpaged pool, and is composed of two overlaying structures. 
The first structure in the lock block contains an AST control block (ACB). 
When a lock is granted, the ACB is used to queue a kernel mode AST to 
perform kernel mode operations in the context of the caller; the ACB is also 
used to queue completion ASTs. When a blocking AST is required, the ACB 
is used to queue the blocking AST. 

The second part of the lock block describes the information specific to the 
lock request (for example, a blocking AST address, the event flag number, and 
the address of the lock status block) and the current state of the lock (for 
example, the lock mode and the queue links used to locate the lock). The 
state queue links in the lock block are used to link the LKB into a resource's 
state queue. 

Figure 13-1 
Layout of a Lock Block 
(The ACB portion comprises ASTQFL, ASTQBL, RMOD, TYPE, SIZE, PID, AST, 
ASTPRM, and KAST. The remainder of the LKB holds CPLASTADR, BLKASTADR, 
LKSB, STATUS, FLAGS, LKST1, LKST2, EFN, STATE, GRMODE, RQMODE, the state 
queue links SQFL and SQBL, the owner queue links OWNQFL and OWNQBL, 
PARENT, REFCNT, and RSB.) 

The lock block is created when a process requests a new lock and is owned 
only by that process. When a process dequeues a lock, the lock block is deal- 
located. 



13.1.2 Resource Blocks 

A resource block describes a resource and contains listheads for the granted, 
conversion, and waiting queues for the resource. The state queue links in the 
lock block (LKB$L_SQFL and LKB$L_SQBL) link the lock blocks to these 
queues. Note that the conversion and waiting queues are ordered first-in/ 
first-out; the granted queue has no order. Figure 13-2 shows the structure of 
the resource block. The resource blocks are allocated from nonpaged pool. In 
addition to queue heads, a resource block contains the lock value block for 
the resource, the address of the resource's parent resource block (if any), and 
the number of sublocks owned by the resource. Only one resource block will 
exist for each resource being locked. 

Figure 13-2 
Layout of a Resource Block 
(Fields: the hash chain links HSHCHN and HSHCHNBK, TYPE, SIZE, PARENT, 
REFCNT, BLKASTCNT, DEPTH, the granted queue head GRQFL/GRQBL, the 
conversion queue head CVTQFL/CVTQBL, the waiting queue head WTQFL/WTQBL, 
VALBLK, PROT, RSNLEN, RMOD, CGMODE, GGMODE, GROUP, and RESNAM 
(31 bytes).) 

Resource blocks are deallocated when there are no locks associated with 
the resource (the state queues in the resource block are empty). 



13.1.3 Accessing the Lock and Resource Blocks 

The VAX/VMS lock manager has two ways in which information in the lock 
management database can be located, the lock ID table and the resource hash 
table. The lock ID table is used to locate lock blocks; the resource hash table 
is used to locate resource blocks. Both of these structures are allocated from 
nonpaged pool. 

Once a resource block has been located through the resource hash table, 
the lock blocks associated with the resource can be found through the state 
queue pointers. Conversely, once a lock block has been located through the 
lock ID table, the name of the resource that is locked can be located by the 
resource block address field in the lock block. (A third way to locate informa- 
tion in the lock management database using process control blocks is dis- 
cussed in Section 13.1.4.) 

13.1.3.1 The Lock ID Table. The lock ID table is used to locate locks when the lock ID 
is known. When a caller requests a new lock, the $ENQ system service re- 
turns a lock ID to the caller. The lock ID is actually an index into the lock ID 
table. The caller can then use the lock ID to identify a specific lock when 
performing conversions or dequeuing locks. The lock ID table is located by 
the global symbol LCK$GL_IDTBL. Figure 13-3 shows the structure of the 
lock ID table. 

When an entry in the lock ID table is in use, it contains the address of the 
lock block that is associated with the lock ID. When an entry in the lock ID 
table is not used, the low-order word contains an index to the next unused 
entry in the lock ID table. When the VAX/VMS operating system is initial- 
ized, the module INIT loads each entry in the lock ID table with the index of 
the subsequent entry in the table. The first entry in the table is initialized to 
zero and is not used. A zero entry indicates an unusable lock ID table entry. 

The global symbol LCK$GL_NXTID contains a lock ID table index that 
points to the first free lock ID table entry. When a caller requests a new lock, 
LCK$GL_NXTID is used to locate the new lock ID table entry. The low- 
order word of LCK$GL_NXTID is returned to the caller as the new lock ID. 
Two actions are then performed on the new lock ID table entry (sketched in 
the fragment following this list). 

• The contents of the new lock ID table entry (which contains a pointer to 
the next free lock ID table entry) are copied into LCK$GL_NXTID. 

• The address of the new lock block is written into the lock ID table entry. 
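
A C sketch of this free-list manipulation follows. It is illustrative only
and models each table entry as a single value; it ignores any information
carried in the high-order word of a lock ID:

    #include <stdint.h>

    extern uintptr_t lck_idtbl[];   /* models LCK$GL_IDTBL               */
    extern uint32_t  lck_nxtid;     /* models LCK$GL_NXTID               */

    /* Allocate a lock ID for a newly created lock block. */
    static uint32_t allocate_lockid(void *lkb)
    {
        uint32_t id = lck_nxtid & 0xFFFF;       /* low word is the new ID */
        if (lck_idtbl[id] == 0)
            return 0;                           /* zero entry: unusable   */
        lck_nxtid     = (uint32_t)lck_idtbl[id]; /* next free index       */
        lck_idtbl[id] = (uintptr_t)lkb;         /* entry now locates LKB  */
        return id;
    }

    /* Return a lock ID when its lock block is deallocated. */
    static void free_lockid(uint32_t id)
    {
        lck_idtbl[id] = lck_nxtid;              /* thread back on free list */
        lck_nxtid     = id;
    }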

Figure 13-3 
Structure of the Lock ID Table 
(The table, located by LCK$GL_IDTBL, contains one longword per lock ID. An 
entry in use holds the address of an LKB; a free entry holds the index of 
the next free entry, so the indexes do not always point forward. 
LCK$GL_NXTID holds the index of the first free entry and LCK$GL_MAXID the 
index of the last entry.) 

Because it is possible that an error in a calling routine could pass an errone- 
ous value as the lock ID, the lock manager compares the caller's process 
identification and access mode with the process identification and access 
mode stored in the lock block. If the comparison fails, the lock manager exits 
with the return status code SS$_IVLOCKID. 

When a lock block is deallocated, the lock ID table entry is located by its 
lock ID. The contents of LCK$GL_NXTID are written into the lock ID table 
entry (replacing the address of the deallocated lock block) and the lock ID is 
written into LCK$GL_NXTID. 

The global symbol LCK$GL_MAXID contains the index to the last entry 
in the lock ID table. The lock ID table entry at that location always contains 
a zero. The size of the lock ID table is controlled by the SYSBOOT parameter 
LOCKIDTBL. 

13.1.3.2 The Resource Hash Table. The resource hash table is used to locate resource 
blocks. The resource name is hashed and the result of the hash is used as an 



248 



13.1 Lock Manager Data Structures 

index into the resource hash table. Note that the entries in the resource hash 
table are longword addresses, not quadword queue heads; the resource hash 
table contains only forward pointers to the lists. The table is located by the 
global symbol LCK$GL_HASHTBL. The size of the hash table is determined 
by the SYSBOOT parameter RESHASHTBL. The hashing algorithm is similar 
to the algorithm used for hashing logical names (see Section 29.1.4). 

Each longword entry in the resource hash table points to the first resource 
block in a resource hash chain. Because the resource blocks are maintained in 
a list that is doubly linked, but not circular (the resource hash table contains 
no backward pointers), the list of resource blocks is termed a chain. The first 
two longwords in each resource block contain the forward and backward 
pointers for the resource hash chain. The last block in the chain has a 
zero forward pointer. If a longword entry in the resource hash table con- 
tains a zero, there are no resource blocks associated with that hash table 
entry. 

Figure 13-4 shows the structure of the resource hash table and its relation- 
ships to hash chains. 



Figure 13-4 
Resource Hash Table and Hash Chains 
(Each longword entry in the table, located by LCK$GL_HASHTBL, either 
contains zero or points to the first RSB in a hash chain. The RSBs in a 
chain are doubly linked, and the last RSB in a chain has a zero forward 
pointer.) 




13.1.4 Relationships in the Lock Database 

There are three ways in which the lock manager can access the lock database. 

• Given a resource name, the lock manager can locate the RSB through the 
resource hash table. Using the state queue heads, all locks associated with 
the resource can be located. 

• Given a lock ID, the lock manager can locate the lock block through the 
lock ID table. Using the resource address field in the lock block, the re- 
source associated with the lock can be located. 

• Given a process control block, the lock manager can locate the lock queue 
header (at offsets PCB$L_LOCKQFL and PCB$L_LOCKQBL). Using the 
lock queue links, all locks owned by a specific process can be located. 

A lock with a parent lock and resource is termed a sublock. When a sublock 
is requested, the new lock block will contain the address of the parent lock 
block (at offset LKB$L_PARENT); the resource block associated with the 
sublock will point to the parent resource (at offset RSB$L_PARENT). This 
relationship is shown in Figure 13-5. When a sublock is created, the reference 
count fields in the parent lock block and resource block are incremented to 
account for the sublocks. A lock block or resource block cannot be deal- 
located unless the reference count equals zero. By the reference count, parent 
locks can tell the number of sublocks they own; they do not have a list of 
their sublocks. 



13.2 QUEUING AND DEQUEUING LOCKS 

The lock manager becomes active only when calls are made to the $ENQ or 
$DEQ system services. When the $ENQ service is called, the lock manager 
attempts to grant the requested new lock or the lock conversion immedi- 
ately. If the new lock or conversion cannot be granted, the lock block is 
placed on the waiting or conversion queue. When the $DEQ service is called, 
the lock manager dequeues the lock from the resource and then searches the 
resource's state queues for locks that are compatible with the currently 
granted locks. Lock compatibility is described fully in the VAX/VMS System 
Services Reference Manual. The following sections describe the action of the 
$ENQ and $DEQ services. 

13.2. 1 The $ENQ System Service 

When a process calls the $ENQ system service, the event flag and lock mode 
are validated and the lock status block is checked for read/write access. If 
these checks are successful, the request type is checked (new lock or conver- 
sion). Section 13.2.2 discusses in detail the action of the lock manager for 
lock conversions. 






Figure 13-5 
Relationships between Locks and Sublocks 
(A parent RSB is located through the resource hash table and a parent LKB 
through the lock ID table. A sublock's LKB points to its parent LKB and to 
its own RSB, which in turn points to the parent RSB; each LKB is linked 
into its RSB's state queues and into the owner queue headed in the PCB.) 




If a new lock is requested, a lock block and a resource block are allocated. 
The fields of the lock block are initialized, including the fields in the ACB at 
the top of the lock block. A new resource block for the resource is allocated 
and initialized (even if the resource exists already). After hashing the new 
resource name and finding an index into the resource hash table, the lock 
manager searches the hash chain for a resource block with the same resource 
name. For each resource block encountered on the hash chain, the following 
fields are compared with the new resource block: 

• Parent resource block address 

• UIC group number (the UIC group number is zero for system locks) 

• Access mode (user through kernel mode) 

• Name space (system or group wide) 

• Length of the resource name string 

• Resource name string 
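
A C sketch of this search follows (illustrative only; the field names
approximate those in Figure 13-2):

    #include <string.h>

    typedef struct rsb {
        struct rsb *hshchn;          /* forward hash chain link            */
        struct rsb *parent;          /* parent RSB address                 */
        unsigned    group;           /* UIC group (zero for system locks)  */
        unsigned    rmod;            /* access mode                        */
        unsigned    sysnam;          /* name space: system or group wide   */
        unsigned    rsnlen;          /* resource name length               */
        char        resnam[31];      /* resource name string               */
    } rsb_t;

    extern rsb_t *lck_hashtbl[];     /* models LCK$GL_HASHTBL              */

    static rsb_t *find_resource(const rsb_t *new_rsb, unsigned index)
    {
        for (rsb_t *r = lck_hashtbl[index]; r != 0; r = r->hshchn)
            if (r->parent == new_rsb->parent &&
                r->group  == new_rsb->group  &&
                r->rmod   == new_rsb->rmod   &&
                r->sysnam == new_rsb->sysnam &&
                r->rsnlen == new_rsb->rsnlen &&
                memcmp(r->resnam, new_rsb->resnam, r->rsnlen) == 0)
                return r;            /* existing resource found            */
        return 0;                    /* not found: caller links the new
                                        RSB to the end of the chain        */
    }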

If the resource block for the named resource is not found, the new resource 
block is added to the end of the hash chain and the new lock is granted (see 
Section 13.2.1.1). If the flag bit LKB$M_SYNCSTS is set, the success status 
code SS$_SYNCH is returned to the caller. 

If the named resource block is found in the search for the resource name, 
the new resource block is deallocated and the existing one is used. The re- 
quested mode in the lock block is tested for compatibility with the currently 
granted locks. If the new lock is compatible, the new lock is granted. Again, if 
the bit LKB$M_SYNCSTS is set, the success status code SS$_SYNCH is 
returned to the caller. 

In order to speed checks for compatibility with the currently granted locks, 
each resource block contains a field indicating the highest granted lock mode 
of all locks in the granted and conversion queue for that resource. This field is 
termed the group grant mode. Note that locks on the conversion queue retain 
their granted mode; it is the granted mode of these locks that is used in 
calculating the group grant mode, not their requested mode. The value of the 
group grant mode is stored in the resource block at offset RSB$B_GGMODE. 
Because this value is calculated only when a new lock is granted and is main- 
tained in the resource block, compatibility checking involves only one com- 
pare operation; the lock manager does not have to spend time comparing lock 
modes each time it attempts to grant a lock. 
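
The compatibility relation itself can be written as a six-by-six table, shown
below in C. The matrix given here is derived from the lock mode descriptions
at the beginning of this chapter (an illustrative model, not a transcription
of the executive's table); with RSB$B_GGMODE cached in the resource block,
the grant decision reduces to one table lookup.

    /* Lock mode compatibility (order: NL, CR, CW, PR, PW, EX).
       compatible[rq][gr] is 1 if a lock requested in mode rq can be
       granted while gr is the group grant mode of the resource.     */
    enum { NL, CR, CW, PR, PW, EX };

    static const unsigned char compatible[6][6] = {
    /*            NL CR CW PR PW EX                                  */
    /* NL */    { 1, 1, 1, 1, 1, 1 },
    /* CR */    { 1, 1, 1, 1, 1, 0 },
    /* CW */    { 1, 1, 1, 0, 0, 0 },
    /* PR */    { 1, 1, 0, 1, 0, 0 },
    /* PW */    { 1, 1, 0, 0, 0, 0 },
    /* EX */    { 1, 0, 0, 0, 0, 0 },
    };

    static int can_grant(int rqmode, int ggmode)
    {
        return compatible[rqmode][ggmode];   /* one compare, as described */
    }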

13.2.1.1 Granting a Lock. The action of granting a lock involves five steps: 

1. The compatibility of the locks (group grant mode) is recomputed. 

2. The lock block is placed on the granted queue. 

3. The event flag is set. 




4. If a completion AST was specified, it is queued. 

5. If a blocking AST was specified and the lock is blocking another lock 
request, the blocking AST is queued. 

To place a lock on the granted queue, the lock manager locates the queue listheads in the resource block at offsets RSB$L_GRQFL and RSB$L_GRQBL and links the lock block into the queue. The order in which locks are placed on the queue is unimportant. The only time that the granted queue is traversed is when the group grant mode is computed, and, in that case, no particular order is required.

The event flag number is stored in the lock block at offset LKB$B_EFN. 
The global routine SCH$POSTEF is called to set the event flag. 
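Taken together, the grant operation might be sketched as follows. Only SCH$POSTEF is an actual routine name; the rest are hypothetical stand-ins for the operations described above.

    struct rsb;
    struct lkb;

    extern void recompute_group_grant(struct rsb *r);   /* updates RSB$B_GGMODE  */
    extern void insert_granted_queue(struct rsb *r, struct lkb *l);
    extern void sch_postef(struct lkb *l);              /* SCH$POSTEF, LKB$B_EFN */
    extern int  has_completion_ast(struct lkb *l);
    extern int  has_blocking_ast(struct lkb *l);
    extern int  is_blocking_another(struct lkb *l);
    extern void queue_completion_ast(struct lkb *l);
    extern void queue_blocking_ast(struct lkb *l);

    void grant_lock(struct rsb *rsb, struct lkb *lkb)
    {
        recompute_group_grant(rsb);        /* step 1: recompute compatibility   */
        insert_granted_queue(rsb, lkb);    /* step 2: queue order unimportant   */
        sch_postef(lkb);                   /* step 3: set the event flag        */
        if (has_completion_ast(lkb))       /* step 4 */
            queue_completion_ast(lkb);
        if (has_blocking_ast(lkb) && is_blocking_another(lkb))
            queue_blocking_ast(lkb);       /* step 5 */
    }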

13.2.1.2 ASTs and the Lock Manager. Because the lock manager must modify informa- 
tion in per-process space, a special kernel mode AST routine is required to 
perform some actions when granting a lock. The following operations are 
performed by the special kernel mode AST routine. 

• The contents of the lock status block (and optionally the contents of the 
lock value block) are copied to the caller's lock status block. 

• If a completion AST has been queued and if a blocking AST is required at 
this time, the blocking AST is queued. 

• If the NODELETE bit is clear in the ACB, the ACB is deallocated. 

If no completion AST or blocking AST routine is specified by the caller, a 
special kernel mode AST is used to perform these actions. However, if an 
AST routine was specified by the caller, the special kernel AST is queued as a 
piggyback special kernel AST in the caller's ACB (see Section 7.2.4). 

Because the ACB can contain the address of only one AST routine, special treatment is required when the lock manager must signal both a completion AST and a blocking AST. When the lock is granted, the AST routine field in the lock block ACB (offset LKB$L_AST) is loaded with the address of the completion AST routine (stored at offset LKB$L_CPLASTADR). When the completion AST is delivered, the contents of the ACB are saved on the stack and the piggyback special kernel AST is delivered. Because the contents of the ACB were saved, it can now be modified to contain the address of the blocking AST. The special kernel mode AST routine loads offset LKB$L_AST with the address of the blocking AST routine (stored at offset LKB$L_BLKASTADR) and requeues the AST. When the special kernel mode AST routine exits, the completion AST routine is executed.

13.2.1.3 Waiting Locks. Before an incompatible lock can be placed on the waiting 
queue, the flag LKB$M_NOQUEUE is checked. If the flag is set, the lock is 






not queued and the failure return status SS$_NOTQUEUED is returned to 
the caller. If the flag is not set, the lock block is queued to the end of the 
waiting queue for the resource. The queue headers for the waiting queue are 
found at offsets RSB$L_WTQFL and RSB$L_WTQBL. 



13.2.2 Lock Conversions 

When a caller requests a lock conversion, the lock manager is passed the lock 
ID of the lock to be converted and the new lock mode for the conversion. The 
new lock mode is compared with the value of the group grant mode. If the 
new lock mode is compatible with the current granted locks, the lock is 
granted (see Section 13.2.1.1). 

If the requested mode of the conversion is not compatible with the group 
grant mode, the requested lock mode is compared to the value of the conver- 
sion grant mode (stored at offset RSB$B_CGMODE). If the lock is compatible 
with the conversion grant mode, the lock is granted. If the lock is incompati- 
ble, it is placed at the tail of the conversion queue. 

Most of the time the conversion grant mode contains the same value as the 
group grant mode. The only time the conversion grant mode is different from 
the group grant mode is when both of the following are true: 

• The current lock mode of the lock at the head of the conversion queue is 
the most restrictive lock mode for the resource. 

• That lock is the only lock at the current mode. 

If both of these conditions are true, the granted lock mode of the lock on 
the conversion queue is omitted from the calculation of the conversion grant 
mode. The use of the conversion grant mode insures that lock conversions 
between incompatible lock modes will not block themselves. 

Suppose that a resource has one lock in its granted queue at null (NL) 
mode. If a lock request is issued for the resource at protected write (PW) 
mode, the group grant mode is NL mode, so the PW mode lock is granted. 
When the new lock is granted, the group grant and conversion grant modes 
are recalculated; both equal PW mode. 

Now the PW mode lock requests a conversion to exclusive (EX) mode. If 
the group grant mode was used to determine compatibility, the conversion to 
EX mode could not be granted, because the PW mode lock is actually block- 
ing its own conversion (remember that group grant mode includes both the 
granted and conversion queues). However, the lock at the head of the conver- 
sion queue has the most restrictive lock mode currently granted. In calculat- 
ing the conversion grant mode, the lock at the head of the conversion queue 
is omitted. Thus, the conversion grant mode is NL mode, and the conversion 
can be granted. 




13.2.3 The $DEQ System Service 

When making a call to the $DEQ system service, the caller passes the lock ID 
of the lock to be dequeued to the lock manager. The $DEQ system service 
uses the lock ID to locate the lock block and then verifies that the caller has 
the correct access mode and PID to access the lock. The resource block ad- 
dress in the lock block is used to locate the resource block. If the reference 
count in the lock block is zero, the lock block is dequeued from its current 
state queue and is deallocated. The lock manager then checks the state queue 
headers in the resource block to which the lock was queued. If all of the state 
queues in the resource block are empty and the reference count is zero, the 
resource block is removed from the hash chain and is deallocated. 

If the resource block reference count is nonzero, the lock manager attempts 
to grant locks waiting on the conversion or waiting queues. 

• The lock mode of the first lock in the conversion queue is compared with the conversion grant mode.

  — If the lock is incompatible, the $DEQ system service exits and returns control to the user.
  — If the lock is compatible, it is dequeued from the conversion queue and is granted.
  — When the lock is dequeued from the conversion queue, a new lock takes its place as the first lock on the conversion queue.

  This step is repeated for the new first entry in the conversion queue until either the conversion queue is emptied or an incompatible lock is found and the lock manager exits.

• If the conversion queue is emptied, the lock mode of the first lock in the waiting queue is compared against the group grant mode.

  — If the lock is incompatible, the $DEQ system service exits and returns control to the user.
  — If the lock is compatible, it is dequeued from the waiting queue and granted.
  — When the lock is dequeued from the waiting queue, a new lock takes its place as the first lock on the waiting queue.

  This step is repeated on the new first entry in the waiting queue until either the waiting queue is emptied or an incompatible lock is found; a sketch of this two-pass loop follows.
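In this sketch, queue and helper names are invented for the illustration.

    struct lkb { int reqmode; struct lkb *next; };
    struct rsb {
        struct lkb *cvtq;      /* conversion queue head */
        struct lkb *waitq;     /* waiting queue head    */
        int ggmode, cgmode;    /* group and conversion grant modes */
    };

    extern int  compatible_with(int grant_mode, int req_mode);
    extern void grant_from_queue(struct rsb *r, struct lkb *l);

    void regrant_after_deq(struct rsb *r)
    {
        struct lkb *l;

        /* Pass 1: the conversion queue, strictly in order. */
        while ((l = r->cvtq) != 0) {
            if (!compatible_with(r->cgmode, l->reqmode))
                return;                  /* head is blocked; $DEQ exits */
            r->cvtq = l->next;           /* dequeue the head lock       */
            grant_from_queue(r, l);      /* and grant it                */
        }

        /* Pass 2: only when the conversion queue empties,
           the waiting queue, against the group grant mode. */
        while ((l = r->waitq) != 0) {
            if (!compatible_with(r->ggmode, l->reqmode))
                return;
            r->waitq = l->next;
            grant_from_queue(r, l);
        }
    }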

13.3 HANDLING DEADLOCKS 

A deadlock occurs when several locks are waiting for each other in a circular 
fashion. The VAX/VMS lock manager resolves deadlocks by choosing a par- 
ticipant in the deadlock cycle (a lock request that is waiting on the conver- 






sion or waiting queue) and refusing that participant's lock request. The participant that is chosen to break the deadlock is termed the victim. The victim's lock or conversion request fails, and the error status code SS$_DEADLOCK is returned in the victim's lock status block.

There are three parts to deadlock handling in the VAX/VMS lock manager.

• The lock manager suspects that a deadlock exists. 

• A deadlock search proves that a deadlock actually exists. 

• The victim is chosen. 



13.3.1 Initiating a Deadlock Search 

Because deadlock detection is a time-consuming task, it is not desirable to 
search for deadlocks every time a lock or conversion is requested. It is far 
better to search for a deadlock only when the system suspects that a deadlock 
exists. The VAX/VMS lock manager searches for a deadlock only when a 
process has been waiting for a resource for a specified amount of time. The 
SYSBOOT parameter DEADLOCK_WAIT specifies the amount of time to 
wait before initiating a deadlock search. 

Whenever a lock is placed in the conversion or waiting queue, the lock block is also queued to the lock manager timeout queue (located by the global symbol LCK$GL_TIMOUTQ). The AST queue fields in the lock block are used to link the lock block into the timeout queue. When a lock must wait on the conversion or waiting queue, the value of DEADLOCK_WAIT is added to the current absolute system time (EXE$GL_ABSTIM), and the result is stored in the lock block at offset LKB$L_DUETIME. (LKB$L_DUETIME is actually a double use of the special kernel AST routine address field, LKB$L_KAST.)

Once every second, the VAX/VMS operating system executes the routine EXE$TIMEOUT. In addition to checking for device timeouts, this routine checks whether the first entry in the lock manager timeout queue has timed out. The value in LKB$L_DUETIME is compared with the absolute system time. If the due time has not been reached, the routine exits. However, if the due time has passed, a deadlock search is initiated.
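The timing mechanism might be sketched as follows. EXE$GL_ABSTIM, DEADLOCK_WAIT, and LKB$L_DUETIME are the names used in the text; the queue handling here is simplified.

    extern unsigned exe_gl_abstim;     /* absolute system time          */
    extern unsigned deadlock_wait;     /* SYSBOOT parameter             */

    struct lkb { unsigned duetime; struct lkb *tmo_next; };
    extern struct lkb *lck_gl_timoutq; /* lock manager timeout queue    */
    extern void initiate_deadlock_search(struct lkb *l);

    /* When a lock must wait, stamp its due time; insertion into the
       timeout queue itself is elided.                                 */
    void stamp_waiter(struct lkb *lkb)
    {
        lkb->duetime = exe_gl_abstim + deadlock_wait;  /* LKB$L_DUETIME */
    }

    /* Called once per second from EXE$TIMEOUT: only the first entry
       in the timeout queue need be examined.                          */
    void check_lock_timeouts(void)
    {
        struct lkb *first = lck_gl_timoutq;
        if (first != 0 && first->duetime <= exe_gl_abstim)
            initiate_deadlock_search(first);  /* due time has passed */
    }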

13.3.2 Deadlock Detection 

There are two separate forms of deadlock that can occur in the VAX/VMS 
lock manager. Each requires a different form of detection. One form (a con- 
version deadlock) is easily detected, because it is restricted to a single re- 
source. Multiple resource deadlocks require a more complex search to locate. 

13.3.2.1 Conversion Deadlocks. Conversion deadlocks occur when there are at least 
two locks in the conversion queue for a resource. When the requested mode 




of the first lock in the conversion queue is incompatible with the granted 
mode of the second lock in the conversion queue, a deadlock exists. 

For example, assume that there are two protected read (PR) mode locks on a 
resource. One PR mode lock requests a conversion to exclusive (EX) mode. 
Because PR mode is incompatible with EX mode, the conversion request 
must wait. While the first conversion request is waiting, the second PR mode 
lock also requests a conversion to EX mode. Now, the first lock will never get 
granted because its requested mode (EX) is incompatible with the second 
lock's granted mode (PR). The second conversion request will never get 
granted because it is waiting behind the first. 

In detecting a conversion deadlock, the search begins with the lock block 
indicated by the lock manager timeout queue. The state queue backward link 
is used to locate the previous lock in the conversion queue. The granted 
mode of the previous lock is compared with the requested mode of the lock 
that timed out. If the modes are compatible, the previous lock in the conver- 
sion queue is located using the state queue backward link. The test is re- 
peated until an incompatible lock is found or the beginning of the queue is 
found. 

If an incompatible lock is found, a deadlock exists and a victim is selected 
(see Section 13.3.3). If the beginning of the queue is reached, a conversion 
deadlock does not exist, and a search for a multiple resource deadlock is 
initiated. 
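In outline, the conversion deadlock check reduces to a backward walk, as in this sketch; the compatibility test stands for the lock mode comparison described above, and the backward link of the head of the queue is assumed to yield zero here.

    struct lkb { int grmode, reqmode; struct lkb *state_blink; };
    extern int compatible_modes(int granted, int requested);

    /* Walk backward up the conversion queue from the lock that timed
       out, comparing its requested mode against each predecessor's
       granted mode.                                                   */
    int conversion_deadlock(struct lkb *timed_out)
    {
        for (struct lkb *prev = timed_out->state_blink;
             prev != 0;
             prev = prev->state_blink) {
            if (!compatible_modes(prev->grmode, timed_out->reqmode))
                return 1;   /* deadlock exists: a victim is selected */
        }
        return 0;           /* reached head of queue: no conversion
                               deadlock; search for a multiple
                               resource deadlock instead              */
    }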

13.3.2.2 Multiple Resource Deadlocks. Multiple resource deadlocks occur when a circular chain of processes each wait for one another on two or more resources.

For example, assume Process A locks Resource 1 and Process B locks Resource 2. Process A then requests a lock on Resource 2 that is incompatible with B's lock on Resource 2, and thus, Process A must wait. Note that at this point, a circular list does not exist. When Process B then requests a lock on Resource 1 that is incompatible with A's lock on Resource 1, it must wait. A multiple resource deadlock now exists. Processes A and B are both waiting for each other to release different resources. These steps are shown in Figure 13-6. In the figure, locks that are blocking a resource (incompatible with waiting locks) are shown beneath the resource block; locks that are waiting on a resource are shown above the resource block.

This type of deadlock normally involves two or more resources, unless one 
process locks the same resource twice. (Usually a process will not lock the 
same resource twice; however, if the process is multithreaded, double 
locking may occur. Double locking also represents a multiple resource 
deadlock.) 

To verify that a multiple resource deadlock exists, a recursive algorithm is used. The approach is summarized as follows:






[Figure 13-6: Example of a Deadlock Occurring. Three panels show Resource 1 and Resource 2: first Process A's lock blocks Resource 1 and Process B's lock blocks Resource 2; then Process A waits on Resource 2; finally Process B waits on Resource 1, completing the circular wait.]



• A waiting lock will be waiting for locks owned by other processes. 

• Any of the other processes might themselves have waiting locks. 

• Those waiting locks will be waiting for locks owned by other blocking 
processes. 

In implementation, the lock manager starts with the lock that timed out on the lock manager timeout queue. The address of the PCB associated with the lock that timed out is saved and the multiple resource deadlock routine (SEARCH_RESDLCK) is called. If a lock with the same owner PCB can be found blocking a resource, a deadlock exists.

Each time SEARCH_RESDLCK is called, a stack frame is pushed onto the stack. Each stack frame contains information on the current position in the search. Figure 13-7 shows the contents of the stack frame.

Each call to SEARCH_RESDLCK specifies the address of a waiting lock block. The resource associated with the lock block is located and the resource state queues are searched for lock blocks whose granted or requested lock mode is incompatible with that of the waiting lock block. If an incompatible lock block is found, that lock is considered to be blocking the waiting lock block.

When a blocking lock is found, the owner PCB of the blocking lock is 
located. If the owner PCB is the same as the PCB of the lock that initiated the 
deadlock search, the list is proven to be circular and a deadlock exists. A 
victim is chosen (see Section 13.3.3 for details on victim selection), and dead- 






Saved R2
Saved R3
Saved R4 (PCB + LOCKQFL)
Saved R5
Saved R6 (Address of LKB)
Return Address

Figure 13-7
Stack Frame Built by the Lock Manager



lock detection returns control to EXE$TIMEOUT. If the PCB of the blocking lock is not the same as the saved PCB, another call is made to SEARCH_RESDLCK, specifying the address of the new blocking lock block.

Each time SEARCH_RESDLCK is called, it searches the state queues asso- 
ciated with the specified lock block, to see if the lock block is waiting on a 
resource. 

When all the state queues for a given resource have been searched and no 
blocking lock has been found for that lock block, the routine removes the 
stack frame and returns control to its caller. If the caller itself was 
SEARCH_RESDLCK, the previous search for blocked locks on the resource 
can now be resumed. 

A process bitmap is maintained by the VAX/VMS lock manager in order to reduce the number of repeated searches for blocking locks on a particular process. Each time a new blocking PCB is located, a bit corresponding to that process is set. If the bit for the PCB is set already, the search for locks blocking that process is terminated, because its locks have been searched already.
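The recursion and the bitmap cutoff can be sketched as follows. The real routine saves its context in the stack frame of Figure 13-7; the C recursion below carries that context implicitly, and every name is illustrative.

    struct pcb;
    struct rsb;
    struct lkb {
        struct lkb *state_flink;  /* next lock on the resource's queues */
        struct lkb *owner_flink;  /* next lock owned by the same process */
        struct pcb *pcb;          /* owner of this lock                  */
        struct rsb *rsb;          /* resource this lock is queued to     */
        int grmode, reqmode;
    };
    struct rsb { struct lkb *state_queues; };
    struct pcb { unsigned pid; struct lkb *lockq; };

    static struct pcb *initiator;   /* owner of the lock that timed out */
    extern int compatible_modes(int granted, int requested);
    extern int bitmap_test_and_set(unsigned pid);
    extern int is_waiting(struct lkb *l);

    int search_resdlck(struct lkb *waiter)
    {
        /* Scan the state queues of the waiter's resource for locks
           whose mode is incompatible with the waiter's.             */
        for (struct lkb *b = waiter->rsb->state_queues; b; b = b->state_flink) {
            if (compatible_modes(b->grmode, waiter->reqmode))
                continue;                     /* not blocking this waiter */
            if (b->pcb == initiator)
                return 1;                     /* list is circular: deadlock */
            if (bitmap_test_and_set(b->pcb->pid))
                continue;                     /* process already searched */
            /* Recurse on each lock the blocking process is waiting for. */
            for (struct lkb *w = b->pcb->lockq; w; w = w->owner_flink)
                if (is_waiting(w) && search_resdlck(w))
                    return 1;
        }
        return 0;   /* no blocker found here; the caller resumes its scan */
    }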

13.3.2.3 Unsuspected Deadlocks. Note that the use of the process bitmap speeds the
location of the suspected deadlock, but prevents the accidental detection of 
unsuspected deadlocks. An unsuspected deadlock is one that exists within 
the lock management database, but has not been detected so far, because 
none of its locks have timed out on the lock manager timeout queue. This 
behavior is acceptable in the VAX/VMS lock manager for the following rea- 
sons: 

• Deadlocks should be rare. 

• Finding a process a second time in a deadlock search does not necessarily 
indicate that an unsuspected deadlock exists. 

• The occurrence of unsuspected deadlocks should be rarer still. 






• Any deadlock search that does not find a deadlock is a waste of processor 
time. 

• The unsuspected deadlock will become a suspected deadlock when one of 
its own locks times out on the lock manager timeout queue and a deadlock 
search is initiated on its behalf. 

Figure 13-8 shows two deadlocks. One deadlock is suspected and a search is in progress (the path with the heavy arrows); the other is unsuspected. This figure is an extension of the deadlock cycle shown in Figure 13-6. In this case, the deadlock search was initiated as a search for the locks blocking Process A. Because Process C is the first process found with a granted lock on Resource 2, its lock is the first investigated for participation in the deadlock cycle. Process C is waiting for Resource 3. The bit corresponding to Process C is set in the process bitmap. The context of the search is saved on the stack and SEARCH_RESDLCK is called to search for processes blocking Process C's lock.

Process D has a blocking lock on Resource 3. Process D is also waiting for 
Resource 2. The bit corresponding to Process D is set in the process bitmap. 
The context of the search is saved on the stack and SEARCH_RESDLCK is 
called to search for processes blocking Process D's lock. Process C has a 
blocking lock on Resource 2. This situation is a deadlock. However, because 
the bit corresponding to Process C was set in the process bitmap, the dead- 
lock search for Process C is abandoned. One by one the stack frames are 
removed and the search whose context was saved continues. Eventually the 



[Figure 13-8: Suspected and Unsuspected Deadlocks. Resources 1, 2, and 3 with their waiting and blocking locks; heavy arrows mark the suspected deadlock path under investigation, while the cycle between Processes C and D remains unsuspected.]




deadlock search will continue with locks blocking Resource 2 and the dead- 
lock cycle of Processes A and B will be discovered. 

Eventually one of the locks requested by Processes C and D will time out, 
and a deadlock search will be initiated for that deadlock. 

13.3.2.4 Example of a Search for a Multiple Resource Deadlock. Figure 13-9 shows a 
series of locks that result in a deadlock. The heavy arrows in the figure show 
the path of the deadlock cycle. 

[Figure 13-9: Example of a Multiple Resource Deadlock. Resources 1, 2, and 3 with granted and waiting locks owned by Processes A through G; heavy arrows trace the deadlock cycle.]

Assume that the lock owned by Process A timed out on the lock manager timeout queue. Process A is waiting for a lock on Resource 1. The deadlock search routine saves Process A's PCB and calls SEARCH_RESDLCK, passing the address of Process A's LKB.

The incompatible lock on Resource 1 is owned by Process C. Process C has no other waiting locks, so SEARCH_RESDLCK moves on to the next incompatible lock. This lock is owned by Process D. When SEARCH_RESDLCK follows the PCB queue for Process D, it finds that this process is waiting for a lock on Resource 3.

SEARCH_RESDLCK calls itself, passing the address of the lock block owned by Process D. The new invocation of SEARCH_RESDLCK pushes a stack frame detailing the position of the search on Resource 1, and SEARCH_RESDLCK starts to search for locks on Resource 3 that are incompatible with Process D's lock. Resource 3 has two incompatible locks, owned by Processes E and F. Neither of these processes is waiting for a lock, so the search on Resource 3 terminates. The contents of the stack frame are restored and SEARCH_RESDLCK returns to its previous invocation. The search for processes blocking Process A resumes.

The next incompatible lock found on Resource 1 is owned by Process G. Process G has no waiting locks, so the search continues with Process B. The PCB queue for Process B shows that it is waiting for a lock on Resource 2.

Again, SEARCH_RESDLCK calls itself, passing the address of the lock block owned by Process B. The new invocation of SEARCH_RESDLCK pushes a new stack frame onto the stack, and SEARCH_RESDLCK finds that Process D owns a lock that is incompatible with the lock owned by Process B. However, because locks owned by Process D have been searched already (the bit for Process D is set in the lock manager process bitmap), the search moves on to the next process.

The next incompatible lock is owned by Process A. Because the PCB ad- 
dress of Process A matches the PCB address that was saved initially, the list 
is proven to be circular and a deadlock exists. Now a victim must be chosen. 



13.3.3 Victim Selection 

Because conversion deadlocks involve only two processes, the victim selec- 
tion routine simply chooses the process with the lower deadlock priority 
(stored in the PCB at offset PCB$L_DLCKPRI). 

For multiple resource deadlocks, the victim selection routine is only 
slightly more complicated. The frames that were pushed onto the stack in 
each recursion into the deadlock location routine are searched for the lowest 
deadlock priority. Each time a lower deadlock priority value is found, the 
priority and the owner PCB are noted. If a deadlock priority of zero is found, 
that process is immediately chosen as the victim. When all frames have been 
searched, or a deadlock priority of zero is found, the stack pointer is restored 
and the process whose PCB had the lowest deadlock priority is chosen as the 
victim. 
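Victim selection can then be sketched as a scan of those saved frames; the frame layout is that of Figure 13-7, reduced here to the owner PCB, and PCB$L_DLCKPRI appears as dlckpri.

    struct pcb { unsigned dlckpri; /* PCB$L_DLCKPRI */ };
    struct frame { struct pcb *pcb; /* plus the saved registers of Figure 13-7 */ };

    struct pcb *choose_victim(struct frame *frames, int nframes)
    {
        struct pcb *victim = frames[0].pcb;
        for (int i = 0; i < nframes; i++) {
            if (frames[i].pcb->dlckpri == 0)
                return frames[i].pcb;       /* zero: chosen immediately  */
            if (frames[i].pcb->dlckpri < victim->dlckpri)
                victim = frames[i].pcb;     /* note new lowest priority  */
        }
        return victim;                      /* lowest deadlock priority  */
    }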

Note that the current implementation of the VAX/VMS operating system initializes the deadlock priority of all new processes to zero. Thus, it is not possible to predict which process will be chosen as the victim. With the current implementation, victim selection depends primarily on timing. However, other applications or implementations of the VAX/VMS operating system may use the deadlock priority to determine victim selection. If other applications need to use the deadlock priority scheme, they must write a privileged shareable image that accesses the PCB and loads a value into the deadlock priority field (PCB$L_DLCKPRI).

A last note on victim selection may be of interest to users intending to implement a binary victim selection scheme. In such a scheme, specific processes are always victims (their deadlock priority is zero); other processes are never selected as victims (their deadlock priority is always set to a predetermined nonzero value). If this victim selection scheme is used, the implementation must make sure that at least one process that can be chosen as the victim exists in each deadlock cycle to break the deadlock. Otherwise, the victim will be chosen at random.



PART IV/Memory Management

14 Memory Management Data Structures



. . . but there's one great advantage in it, that one's memory works both ways.

— The Queen in Lewis Carroll, Through the Looking Glass

Virtual memory support in the VAX/VMS operating system is implemented 
by several distinct pieces of the executive. The translation-not-valid fault 
handler (pager) is the exception service routine that responds to page faults 
and brings process virtual pages into memory on behalf of a process. The 
swapper process keeps the highest-priority computable processes in physical 
memory. In order to keep processes in memory, the swapper is responsible for 
shrinking process working set sizes and removing processes that are blocked 
for some reason in order to gain more pages of memory. Several system serv- 
ices allow a program to exercise some control over its behavior in memory 
while it is executing. 

The system maintains many tables, some process-specific and others sys- 
tem-wide, that must be manipulated by the major components of the mem- 
ory management subsystem. Before these components are described in the 
following chapters of this section, this chapter will describe the tables used 
by the components. The following structures are presented and described in 
this chapter: 

• The process-specific data, found mostly in the process header. 

• The data that is used to account for physical memory, the so-called PFN 
database. 

• The special structures that are used for system and global pages. 

• The structures that are required to keep track of processes in memory. 

• The structures that are required to swap processes out of memory. 

• The structures that are required to describe the page and swap files. 

• The structures that support the MA780 shared memory. 

14.1 PROCESS DATA STRUCTURES (PROCESS HEADER) 

The most important process-specific data structures used by the memory 
management subsystem are contained in the process header (Figure 14-1). 
The process header contains all of the process-specific data that can be re- 
moved from memory when, a process is outswapped. The address of the proc- 
ess header is stored in the software PCB. 






[Figure 14-1: Discrete Portions of the Process Header. The fixed portion contains pointers to the variable portions of the process header; the working set list contains valid page table entries that can become invalid; the process section table describes pages in image files; empty pages are reserved for expansion of the working set list; arrays for process header pages describe pages in the process header itself; and the P0 and P1 page tables describe the virtual address space used by the process.]



Figure 14-1 shows the portions of the process header that are of special 
interest to memory management. Chapter 26 describes how the sizes of the 
pieces of the process header are related to SYSBOOT parameters. The smaller 
figure to the right of the process header shows the relative sizes of the por- 
tions of the process header on a typical system. The following pieces of the 
process header are of interest to this discussion: 

• The P0 and P1 page tables are the largest contributors to the size of the process header and contain the complete description of the virtual address space currently being used by the process.

• The working set list describes the subset of process page table entries that 
are currently valid but can become invalid in the future. PFN-mapped 
pages and pages in shared memory are valid for the entire time that they 
are mapped and do not appear in the working set list. 

• The process section table contains information used by the pager when a 
page resides in an image file. 




• Because the sizes of the different pieces of the process header vary from system to system, there must be some method of determining where each piece is located. Pointers or indexes in the fixed portion of the process header serve this purpose. Process accounting information, some of which is used by the pager or the swapper, is also located in this area.

• There are several arrays that contain information about each process header page. This information is used by the swapper when it is necessary to outswap the process header.



14.1.1 Process Page Tables 

The process page tables are the first memory management data structures 
encountered by either hardware or software. The contents of the page table 
entries are used by the hardware to translate a virtual address to its physical 
counterpart. When translation fails to determine the physical location of a 
page, the page table entries are used by the page fault handler to locate the 
invalid page. 

Figure 14-2 shows the portion of the process header devoted to the P0 and P1 page tables. The figure also shows those fields in the fixed portion that are used to locate different pieces of the P0 or P1 page table.

• The P0 page table contains page table entries for all pages currently defined in P0 space. The number of pages in P0 space is stored at offset PHD$L_P0LR (and moved into PR$_P0LR by LDPCTX when the process is selected for execution). The virtual page number of the first unmapped page in P0 space (the index of the first nonexistent P0 PTE) is stored at offset PHD$L_FREP0VA.

• In a similar manner, the P1 page table contains page table entries for the pages currently defined in P1 space. Like P1 space itself, the P1 page table grows toward smaller addresses. To simplify the address translation logic, the P1 base register contains the virtual address of the page table entry that would map virtual address 40000000. The P1 length register contains the number of P1 page table entries that do not exist. The virtual page number of the high address end of the unmapped portion of P1 space (Figure 14-2) is stored at offset PHD$L_FREP1VA.

• The number of page table entries available for the expansion of either P0 space or P1 space is stored at offset PHD$L_FREPTECNT. This count is the value of the SYSBOOT parameter VIRTUALPAGECNT minus the current sizes of the P0 and P1 page tables, as the sketch below illustrates.
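The arithmetic can be condensed into a short sketch. The structure below is not the actual PHD layout; it simply names the fields discussed above, and P1 space is taken to be 2^21 pages of 512 bytes each.

    #define P1_PAGES 0x200000u   /* 2^21 512-byte pages in P1 space */

    struct phd {
        unsigned p0lr;     /* PHD$L_P0LR: pages mapped in P0 space          */
        unsigned p1lr;     /* P1 length register: P1 PTEs that do not exist */
        unsigned frep0va;  /* PHD$L_FREP0VA: first unmapped P0 page         */
        unsigned frep1va;  /* PHD$L_FREP1VA: end of unmapped P1 portion     */
    };

    /* PTEs left for expansion of either page table (PHD$L_FREPTECNT):
       VIRTUALPAGECNT less what P0 and P1 already use.                  */
    unsigned free_pte_count(const struct phd *phd, unsigned virtualpagecnt)
    {
        unsigned p0_in_use = phd->p0lr;
        unsigned p1_in_use = P1_PAGES - phd->p1lr;  /* P1LR counts missing PTEs */
        return virtualpagecnt - p0_in_use - p1_in_use;
    }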

When a process references a virtual address that is not valid, it incurs a page 
fault, an exception that transfers control to the page fault handler. One of the 
exception-specific parameters pushed onto the stack by the page fault handler 
is the invalid virtual address. This address enables the pager to retrieve the 






[Figure 14-2: Process Page Tables. PCB$L_PHD locates the process header. The P0 page table maps virtual addresses from 0 to FREP0VA - 1; room for expansion of either page table follows; and the P1 page table maps virtual addresses from (FREP1VA + 200 hex) to 7FFFFFFF. P0BR, P0LR, P1BR, and P1LR are stored in the hardware PCB, a part of the fixed portion of the process header. The figure notes that FREP0VA = P0BR + 4 x P0LR and FREP1VA = P1BR + 4 x P1LR.]



page table entry for the invalid page in order to determine where the page is 
located. 

The page table entries for invalid pages are set up in such a way that they 
contain either the location of the page or a pointer to further information 
about the page. Figure 14-3 shows the different forms that an invalid page 
table entry can take. A valid page table entry is included for comparison. 
Notice that bits <31> (valid bit), <30:27> (protection code), and <24:23> 
(owner access mode) have the same meaning in all possible forms of page 
table entry. Table 14-1 lists the symbolic and numeric forms of possible pro- 
tection codes. 

The pager uses bits <26> and <22> in the invalid page table entry to 
distinguish the different PTE forms. (Because protection checks are made 
before the valid bit is checked, PTE <30:27> must contain a protection code, 
even when the valid bit is clear.) The various forms are described in the 
following paragraphs, starting with the entry at the bottom of the figure. 






[Figure 14-3: Different Forms of Page Table Entry. In all forms, bit <31> is the valid bit, bits <30:27> the protection code (see Table 14-1), and bits <24:23> the owner access mode. A valid PTE also contains the modify bit <26> (set by hardware on a write or modify access to the page), the window bit <22> (page mapped by PFN), and the page frame number (PFN). In an invalid PTE, bit <26> (TYP1, the high-order type bit) and bit <22> (TYP0, the low-order type bit) distinguish the forms: a page in transition (both type bits clear, PFN field nonzero), a demand zero page (both type bits clear, PFN field zero), an invalid global page (global page table index), a page in a page file (page file virtual block number), and a page in an image file (process section table index).]
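A sketch of the classification the pager performs follows. The exact bit combinations for the global index, page file, and section index forms are an assumption consistent with Figure 14-3's legend; the PFN field width and the enum names are likewise illustrative.

    #include <stdint.h>

    #define PTE_V        (1u << 31)    /* valid bit                       */
    #define PTE_TYP1     (1u << 26)    /* high-order type bit             */
    #define PTE_TYP0     (1u << 22)    /* low-order type bit              */
    #define PTE_PFN_MASK 0x001FFFFFu   /* PFN field (assumed bits <20:0>) */

    enum pte_kind { PTE_VALID, PTE_TRANSITION, PTE_DEMAND_ZERO,
                    PTE_GPT_INDEX, PTE_PAGE_FILE, PTE_SECTION_INDEX };

    enum pte_kind classify_pte(uint32_t pte)
    {
        if (pte & PTE_V)
            return PTE_VALID;
        if (pte & PTE_TYP1)                 /* assumed: TYP1 set forms     */
            return (pte & PTE_TYP0) ? PTE_SECTION_INDEX : PTE_PAGE_FILE;
        if (pte & PTE_TYP0)                 /* assumed: TYP0-only form     */
            return PTE_GPT_INDEX;
        /* Both type bits clear: a transition page, or a demand zero
           page when the PFN field is zero.                           */
        return (pte & PTE_PFN_MASK) ? PTE_TRANSITION : PTE_DEMAND_ZERO;
    }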



14.1.1.1 Process Section Table Index. When a page is located in an image file, the page 
table entry contains an index into the process section table. This index lo- 
cates a process section table entry, which contains information about where 
the image file is located and which block in the image file contains the fault- 
ing page. Control bits in the process section table entry indicate whether the 
section is a global section <0> (process section table entries always have this 
bit clear), whether it is writeable <3>, and whether the section is copy on 
reference <1>. Process section tables are discussed in Section 14.1.3 and 
further in Chapter 15. 

14.1.1.2 Page File Virtual Block Number. When a virtual page resides in a page file, its associated page table entry contains the virtual block number within the page file where the page is located. The page file that is used by this process is indicated by the field PHD$B_PAGFIL in the process header. PHD$L_PAGFIL, a longword field that contains zero in its low-order three bytes and overlaps PHD$B_PAGFIL in the high-order byte, is a skeleton for any page table entry that acquires a page file backing store address. A virtual block






Table 14-1: Memory Access Protection Codes in Page Table Entries

Protection                           Symbol = binary value       Protection Mask
No Access Allowed                    PRT$C_NA       = 0000       PTE$C_NA   = 00000000
Reserved                             PRT$C_RESERVED = 0001
Kernel Write (Kernel Read)           PRT$C_KW       = 0010       PTE$C_KW   = 10000000
Kernel Read (No Write)               PRT$C_KR       = 0011       PTE$C_KR   = 18000000
User Write (User Read)               PRT$C_UW       = 0100       PTE$C_UW   = 20000000
Executive Write (Executive Read)     PRT$C_EW       = 0101       PTE$C_EW   = 28000000
Executive Read, Kernel Write         PRT$C_ERKW     = 0110       PTE$C_ERKW = 30000000
Executive Read (No Write)            PRT$C_ER       = 0111       PTE$C_ER   = 38000000
Supervisor Write (Supervisor Read)   PRT$C_SW       = 1000       PTE$C_SW   = 40000000
Supervisor Read, Executive Write     PRT$C_SREW     = 1001       PTE$C_SREW = 48000000
Supervisor Read, Kernel Write        PRT$C_SRKW     = 1010       PTE$C_SRKW = 50000000
Supervisor Read (No Write)           PRT$C_SR       = 1011       PTE$C_SR   = 58000000
User Read, Supervisor Write          PRT$C_URSW     = 1100       PTE$C_URSW = 60000000
User Read, Executive Write           PRT$C_UREW     = 1101       PTE$C_UREW = 68000000
User Read, Kernel Write              PRT$C_URKW     = 1110       PTE$C_URKW = 70000000
User Read (No Write)                 PRT$C_UR       = 1111       PTE$C_UR   = 78000000

Note that the following rules govern memory access protection:

• If a given access mode has write access to a specific page, then that access mode also has read access to that page.

• If a given access mode can read a specific page, then all more privileged access modes can read the same page.

• If a given access mode can write a specific page, then all more privileged access modes can write the same page.

Access that is implied (rather than explicitly a part of the symbolic protection name) is included in parentheses.

number of zero indicates that a block in the page file will exist for the page, 
but has not yet been reserved. 
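The skeleton's use might look like this sketch; phd_pagfil stands for the PHD$L_PAGFIL longword, and everything else is illustrative.

    #include <stdint.h>

    /* PHD$L_PAGFIL: page file index in the high-order byte, zero in
       the low-order three bytes.                                     */
    extern uint32_t phd_pagfil;

    /* Build a page-file backing store PTE for a given virtual block
       number; a VBN of zero means no block has been reserved yet.    */
    uint32_t pagefile_bak(uint32_t vbn)
    {
        return phd_pagfil | vbn;
    }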

14.1.1.3 Global Page Table Index. An invalid process page table entry mapped to a global page contains an index into the global page table, where an associated global page table entry contains further information used to locate the page. The global page table is described in Section 14.3. Page faults involving global pages are discussed in Chapter 17.

14.1.1.4 Page in Transition. There are several situations in which a virtual page can be associated with a physical page and yet not be valid, that is, not in the process working set. For example, when a page is removed from a process working set, it is not discarded but put on the free page list or modified page list. Such a page is called a transition page. The process page table entry contains a PFN, but the valid bit is clear. The two type bits (PTE<26> and PTE<22>) are also clear.

Transition pages are described by the entries for the physical page found in 
the PFN database (see Section 14.2). In particular, the PFN STATE array des- 
ignates the particular transition state the physical page is in. 

14.1.1.5 Demand Zero Pages. A special form of the transition page table entry format 
has a zero in the PFN field. This zero indicates a special form of page called a 
demand-allocate zero-fill page or demand zero page for short. When a page 
fault occurs for such a page, the pager allocates a physical page, fills the page 
with zeros, inserts the PFN into the PTE, sets the valid bit, and dismisses the 
exception. (For this reason, and a second reason explained in Section 14.2.5, 
physical page zero cannot be used by memory management.) 

14.1.2 Working Set List 

The working set list contains the subset of a process's page table entries that 
are currently valid. The working set list is used by the pager and swapper to 
determine which virtual page to discard (to mark invalid) when it is neces- 
sary to take a physical page away from the process. The swapper also uses the 
working set list to determine which virtual pages need to be written to the 
swap file when the process is outswapped. 

Figure 14-4 shows the working set list in the process header and the various 
fields in the fixed portion that locate different pieces of the list. Each of these 
fields, including the quota fields, contains a longword index (multiply con- 
tents by four or use context index addressing) to the working set list entry in 
question. 

[Figure 14-4: Working Set List. PCB$L_PHD locates the process header. WSLIST, WSLOCK, WSDYN, DFWSCNT, WSNEXT, WSLAST, WSQUOTA, WSAUTH, WSAUTHEXT, WSEXTENT, and WSSIZE in the fixed portion are longword indexes from the top of the process header: PHD + 4 x WSLIST locates the pages permanently locked in the working set, PHD + 4 x WSLOCK the pages locked by user request ($LKWSET), PHD + 4 x WSDYN the dynamic space of the working set list, and PHD + 4 x WSLAST the last entry, with room for expansion of the WSL beyond it.]

14.1.2.1 Division of the Working Set List. The working set list consists of three pieces: the permanently locked portion of the working set list, the pages that are locked by user request, and the dynamic portion of the working set. The quota fields in the fixed portion of the process header determine how large the working set list may grow in response to different working set size adjustments. The contents of the three pieces are as follows:

• The permanently locked portion of the working set list (from WSLIST to WSLOCK) contains the pages that are forever a part of the process working set. These include the following structures:

  — The kernel stack.
  — The P1 pointer page.
  — The P1 page table page that maps the kernel stack and the P1 pointer page.
  — The P1 page table page that maps the P1 window to the process header.
  — The process header pages that are not page table pages. These include the fixed portion, the working set list, the process section table, and the process header page arrays.

• The portion of the working set list between WSLOCK and WSDYN con- 
tains all pages that are locked by user request, specifically with the Lock 
Pages in Working Set or Lock Pages in Memory system services. 

• The dynamic portion of the working set list is the portion that is used for 
page replacement. It is delimited by WSDYN and WSEXTENT. The entry 
that was just put into the table is pointed to by WSNEXT. The replacement 
algorithm, explained in detail in Chapter 15, is a modified first-in/first-out 
scheme. 

The current size of the working set list is WSSIZE. The actual number of pages that a process is currently occupying is the sum of the process private page count (PCB$W_PPGCNT) and the global page count (PCB$W_GPGCNT).

Normally, the maximum size to which the working set can grow is WSQUOTA. However, if there are more than BORROWLIM pages on the free page list, the working set list can be extended up to WSEXTENT (at quantum end). If there are more than GROWLIM pages on the free page list, pages can be added to a process's working set above WSQUOTA (on resolution of a page fault). WSQUOTA can be altered in interactive and batch jobs by the SET WORKING_SET/QUOTA command. Part of the image reset logic, invoked at image exit, resets the end of the working set list to DFWSCNT. The meanings of the various working set list quotas and limits are summarized in Table 16-1.
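The growth rules can be condensed into a sketch like the following, with free_cnt standing for the current length of the free page list; the predicate names are invented.

    /* May pages be added above WSQUOTA on resolution of a page fault? */
    int may_grow_past_quota(unsigned wssize, unsigned wsquota,
                            unsigned free_cnt, unsigned growlim)
    {
        if (wssize < wsquota)
            return 1;                 /* still under quota: always allowed */
        return free_cnt > growlim;    /* above quota: only when the free
                                         list exceeds GROWLIM             */
    }

    /* May the working set list be extended toward WSEXTENT at
       quantum end?                                              */
    int may_extend_list(unsigned wssize, unsigned wsquota,
                        unsigned wsextent, unsigned free_cnt,
                        unsigned borrowlim)
    {
        return wssize >= wsquota &&
               wssize <  wsextent &&
               free_cnt > borrowlim;
    }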

The format of a working set list entry (WSLE) is shown in Figure 14-5. 
Notice that the virtual page number is contained in the upper 23 bits, in the 
same location that virtual page numbers are found in virtual addresses. The 
placement of the virtual page number allows the WSLE to be passed to sev- 
eral utility routines as a virtual address, where the byte offset bits (WSLE 
control bits) are not looked at. The meanings of the various control bits are as 
follows: 

<0>    When the WSL Entry Valid bit is clear, the working set list entry can be used without removing a page from the working set.

<1:3>  The Page Type field (a duplicate of the contents of the PFN TYPE array) distinguishes pages that require different action when removed from a process working set.

<4>    The Page Locked in Memory bit indicates that this page is locked into physical memory with the Lock Pages in Memory system service. Such pages are also locked into the process working set. (The working set lock bit is not set but the WSLEs are moved into the portion of the working set list that contains pages locked by user request.)






[Figure 14-5: Format of Working Set List Entry. Bits <31:9> contain the virtual page number; the low-order control bits are WSL Entry Valid <0>, Page Type <3:1>, Page Locked in Memory <4>, Page Locked in Working Set <5>, and Saved Modify Bit <8>. Page type codes: 0 process page, 1 system page, 2 global read-only page, 3 global read/write page, 4 process page table page, 5 global page table page.]



<5>    The Page Locked in Working Set bit indicates those pages that are permanently or dynamically locked into the process working set. The only pages that can be dynamically locked are page table pages that map currently valid pages. (Pages that are permanently locked or locked into the working set by user request also have this bit set in their working set list entries.)

<8>    The Saved Modify bit in the WSLE is used when the process is outswapped to record the logical OR of the modify bit in the page table entry and the saved modify bit in the PFN STATE array.

14.1.3 Process Section Table

The process section table contains process section table entries (PSTEs). 
PSTEs are data structures used to locate image sections within image files. 
The location of the process section table within the process header is pic- 
tured in Figure 14-6. Offset PHD$L_PSTBASOFF contains the byte offset to 
the bottom of the process section table. All process section table entries 
within the table are then located through negative longword indexes from the 
bottom of the PST. 

The PSTEs are maintained in two doubly linked lists. One list of PSTEs 
contains those that are in use. The negative index PHD$W_PSTLAST points 
to the most recent addition to the in-use list. Figure 14-6 shows a hypotheti- 
cal list of free and allocated PSTEs; the allocated PSTEs are shaded. When a 
section is deallocated, the PSTE that mapped the section is placed on a free 
list so that it can be reused. The negative index PHD$W_PSTFREE points to 






[Figure 14-6: Process Section Table. PCB$L_PHD locates the process header; PSTBASOFF is the byte offset to the bottom of the process section table, which lies between the rest of the fixed portion and working set list above it and the empty pages below it. PSTLAST and PSTFREE are both negative longword indexes from the bottom of the process section table, pointing to the most recently allocated and most recently freed PSTEs; PSTBASMAX marks the point beyond which the process section table cannot extend. Allocated PSTEs are shaded in the figure.]



the most recent addition to the free list. The first longword in the PSTEs on 
the free list contains a negative index that can be used to find the previous 
element on the free list. When sections are created, the allocation routine for 
PSTEs first checks the free list. If there are no free PSTEs, a new PSTE is 
created from the expansion region between the working set list and the PST. 

When it is necessary to expand the working set list into the area already 
occupied by the process section table, space is allocated from the empty page 
area (if it exists). Then the entire PST is moved into the allocated space and a 
new value of PSTBASOFF is inserted into the fixed portion of the process 
header. All other references to individual process section table entries are 
unaffected by this change. For more information on expansion of the working 
set list see Chapter 15. 

The format of a process section table entry is pictured in Figure 14-7.

[Figure 14-7: Layout of Process Section Table Entry. A PSTE contains a pointer to the channel control block, backward and forward link indexes, the page fault cluster, the starting virtual page number (22 bits), the address of the window control block, the base virtual block number for the section, a control flags word, the count of PTEs referencing the section, and the number of pages in the section. The control flags word contains the global <0>, copy on reference <1>, demand zero, writeable <3>, shared memory global, access mode for writing, access mode of section, permanent, and system global (set)/group global (clear) flags.]

The following steps are used to locate a block in an image file:

1. The WCB address points to the window control block for the image file. The WCB contains the mapping information that relates virtual block numbers in a file to logical block numbers on a volume.

2. The starting virtual page number for the section, when subtracted from the virtual page number of the faulting page, gives the page offset into the section.

3. The starting virtual block number of the section is added to the difference computed in step 2 to give the virtual block number of the faulting page within the image file.
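In code, the three steps reduce to simple arithmetic (512-byte pages and 512-byte blocks make the page offset a block offset directly); the structure below is an illustrative subset of the PSTE of Figure 14-7.

    struct pste {
        void    *wcb;       /* step 1: window control block, VBN-to-LBN map */
        unsigned startvpn;  /* starting virtual page number of the section  */
        unsigned basevbn;   /* base virtual block number for this section   */
    };

    /* Virtual block number in the image file of the faulting page. */
    unsigned image_file_vbn(const struct pste *pste, unsigned faulting_vpn)
    {
        unsigned offset = faulting_vpn - pste->startvpn;   /* step 2 */
        return pste->basevbn + offset;                     /* step 3 */
    }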



14.1.4 Process Header Page Arrays 

When a process header is outswapped, some information about each process 
header page must be stored in the outswapped process header. The process 
header page array portion of the process header provides an area where this 
information can be stored (Figure 14-8). Two of the arrays, the BAK array and 
the WSLX array, save information from the PFN database about each process 
header page in the working set. The other two arrays (locked WSLE count and 
valid WSLE count) keep statistics about each page table page. These four 
arrays are described in greater detail in Chapter 17. 

14.2 PFN DATABASE 

The memory management data structures include information about the 
available pages of physical memory. The fact that this information must be 
available while the page is being used prevents this information from being 
stored in the page itself. In addition, the caching strategy of the free page list 
and modified page list requires physical page information to be available even 
when pages are not currently active and valid. A portion of the nonpaged 
executive is set aside for this accounting data, called the PFN database. 

The PFN database, unlike many of the other executive data structures, is 
not a table-oriented structure. Rather, the same item of information about all 
physical pages is stored in successive elements of an array (see Figure 14-9). 
The page frame number is then used as an index into each array. Table 14-2 
lists each item of information in the PFN database, including the global name 
of the pointer to the beginning of each array. 



14.2.1 PTE Array 

When a physical page is assigned to another use, the pager must be able to 
find the PTE that maps the page. The PFN PTE longword array contains the 
system virtual address of the page table entry that maps each physical page. 




[Figure 14-8: Process Header Page Arrays. In the fixed portion of the process header, WSLX and BAK (longword indexes from the top of the process header) and PTWSLELCK and PTWSLEVAL (byte indexes) locate the arrays; PTCNTVAL, PTCNTMAX, PTCNTLCK, and PTCNTACT are nearby counters. The arrays are the WSL index save area (one word for each process header page), the backup address save area (one longword for each process header page), the locked WSLE counts array (one byte per page table page, -1 means none), and the valid WSLE counts array (one byte per page table page, -1 means none) — eight bytes per process header page in all, rounded up to a page boundary — followed by the P0 and P1 page tables.]



PFN PTE array elements for global pages point to the global page table 
entries. 

14.2.2 BAK Array 

The PFN BAK longword array stores the original contents of the PTEs. When 
a physical page is assigned to another use, all links with the PTE that cur- 
rently maps the page must be broken. The PTE is set to indicate where the 
contents of the page can be obtained the next time that they are needed. The 
BAK array element contains the information that goes back into the PTE. 
The PFN PTE array element is used to locate the PTE that must be altered. 
Figure 14-10 shows the possible contents of a PFN BAK array element. In 
terms of page table entry contents (see Figure 14-3), the only forms of PTE 
that can go into the BAK array are a process section table index or a page file 
virtual block number. 



[Figure 14-9: PFN Database Arrays. For a process or global page in a process working set, the arrays indexed by PFN are PTE and BAK (longwords), STATE and TYPE (bytes), and REFCNT, SHRCNT, WSLX, and SWPVBN (words). For a page on the free or modified page list, FLINK and BLINK take the place of SHRCNT and WSLX: both pairs of arrays are overlaid.]





Table 14-2: PFN Database Arrays

Array Element Contents           Global Address of Pointer   Size of Array    Comment
                                 to Start of Array           Element
System Virtual Address of       PFN$AL_PTE                  Longword Array
  Page Table Entry
Backing Store Address           PFN$AL_BAK                  Longword Array   (Figure 14-10)
Physical Page State             PFN$AB_STATE                Byte Array       (Figure 14-11)
Page Type                       PFN$AB_TYPE                 Byte Array       (Figure 14-12)
Forward Link                    PFN$AW_FLINK                Word Array       (Figure 14-13) Overlays the SHRCNT array
Backward Link                   PFN$AW_BLINK                Word Array       (Figure 14-13) Overlays the WSLX array
Reference Count                 PFN$AW_REFCNT               Word Array
Global Share Count              PFN$AW_SHRCNT               Word Array       Overlays the FLINK array
Working Set List Index          PFN$AW_WSLX                 Word Array       Overlays the BLINK array
Swap File Virtual Block Number  PFN$AW_SWPVBN               Word Array



14.2.3 STATE Array 

The PFN STATE array (see Figure 14-11) indicates the physical state of each 
physical page. The low three bits contain the page location code. The upper 
bit in a STATE array element is extremely important. It is the setting of this 
bit that determines whether a physical page is put on the free page list or the 
modified page list when the page is released. 

There are a number of paths that can cause the modify bit in the STATE 
array to be set. 



[Figure 14-10: Possible Contents of PFN BAK Array Element. One form holds a page file index in bits <31:24> and a page file virtual block number in the low-order bits; the other form holds a process section table index — the two PTE forms that can serve as backing store addresses.]






[Figure 14-11: Contents of PFN STATE Array Element. Bit <7> is the saved modify bit from the PTE; another bit requests deletion of the PFN contents when the reference count goes to 0; the low three bits contain the page location code:

0  Page on free page list
1  Page on modified page list
2  Page on bad page list
3  Release pending (when reference count goes to 0, put page on free or modified page list)
4  Read error occurred while page read was in progress
5  Write in progress by modified page writer
6  Read in progress by page fault handler
7  Page is active and valid]



• When a page is removed from a process working set, the modify bit in the 
page table entry is logically ORed into the saved modify bit in the STATE 
array. 

• When pages are to be used as read buffers in direct I/O, the executive rou- 
tine that locks down pages (IOLOCK) sets the modify bit in the PTE. When 
the page is removed from the process's working set, the OR operation will 
cause the bit to be set in the PFN STATE array. 

• When copy-on-reference pages are faulted into a process's working set, the 
modify bit in the STATE array is set. The set bit forces a write to the page 
file when the page is removed from the process working set. 

The delete bit in the PFN STATE array element affects physical page con- 
tents. When the reference count of a physical page goes to zero, all ties with a 
virtual page (PFN PTE array contents) are destroyed. The physical page is 
then put at the front of the free page list where it will be reused as quickly as 
possible. 



14.2.4 TYPE Array 

The PFN TYPE array (see Figure 14-12) distinguishes the different types of 
valid pages. The reason for this distinction is that either the pager or swapper 
must take different action depending on what type of page is being acted on. 
The collided page bit in the TYPE array element is set when a page fault 
occurs while the page is already being read in from its backing store address. 
Collided pages are described briefly in Chapter 17. 






[Figure 14-12: Contents of PFN TYPE Array Element. The low bits contain the page type code (0 process page, 1 system page, 2 global read-only page, 3 global read/write page, 4 process page table page, 5 global page table page); the remaining bits are the collided page bit (empty the COLPG state when the page read completes), the bad page bit (when the reference count reaches 0, put the page on the bad page list), and the report event on I/O completion bit.]



14.2.5 Forward and Backward Links



The three page lists (free page list, modified page list, and bad page list) must 
all be doubly linked lists because an arbitrary page is often removed from the 
middle of the list. However, the links cannot exist in the pages themselves 
because the original contents of each page must be preserved. Two word ar- 
rays, the FLINK array and the BLINK array, contain elements that are inter- 
preted as the physical page numbers of the successor and predecessor to a 
given physical page. 

A zero in one of the link fields indicates the end of the list (and is not a 
pointer to physical page zero). For this reason, physical page zero cannot be 
used in any dynamic function by the VMS operating system but may be 
mapped by some system virtual page that is always resident. The usual con- 
tents of physical page zero are the restart parameter block (see Chapter 24). 

Figure 14-13 shows an example of pages on the free list, along with the 
corresponding FLINK and BLINK array elements. The STATE array elements 
for all of these pages contain zero, indicating that the physical pages are on 
the free page list. 
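A sketch of the list removal in C follows; it merely restates the linkage
rule (word arrays of successor and predecessor PFNs, with zero terminating
the list) and is not the VMS implementation.

    #include <stdint.h>

    extern uint16_t pfn_flink[];   /* successor PFN; 0 means end of list   */
    extern uint16_t pfn_blink[];   /* predecessor PFN; 0 means end of list */

    /* Remove an arbitrary page from the middle (or either end) of a
       page list whose head and tail hold PFNs, 0 when the list is empty. */
    void unlink_page(uint16_t pfn, uint16_t *head, uint16_t *tail)
    {
        uint16_t next = pfn_flink[pfn];
        uint16_t prev = pfn_blink[pfn];

        if (prev) pfn_flink[prev] = next; else *head = next;
        if (next) pfn_blink[next] = prev; else *tail = prev;
    }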



[Figure 14-13, Example of Free Page List Showing Linkage Method: the head
and tail of the free page list point to physical pages 5, 11, 15, 28, and
33, which are chained together through the PFN FLINK array (PFN$AX_FLINK)
and BLINK array (PFN$AX_BLINK); the PFN STATE array elements (PFN$AB_STATE)
for these pages contain zero.]

14.2.6 REFCNT Array

The PFN REFCNT array counts the number of reasons why a page should not
be put on the free or modified page list. One reason for incrementing the
reference count is that a page is in a process working set. Pages are locked
down for direct I/O by incrementing the reference count.



I/O completion and working set replacement use the same routine to dec- 
rement the reference count. If the reference count goes to zero, the physical 
page is released to the free or modified page list as indicated by the saved 
modify bit in the PFN STATE array. Manipulations of the reference count are 
illustrated in the discussion of paging dynamics in Chapter 15.
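The decrement path just described can be sketched in C; all names here
(the arrays and release routines) are assumptions of the sketch rather
than VMS symbols.

    #include <stdint.h>

    extern uint16_t pfn_refcnt[];   /* reasons the page must stay off the lists   */
    extern uint8_t  pfn_state[];    /* STATE array; bit 7 is the saved modify bit */
    extern void release_to_free_list(uint32_t pfn);
    extern void release_to_modified_list(uint32_t pfn);

    void decrement_refcnt(uint32_t pfn)
    {
        if (--pfn_refcnt[pfn] == 0) {
            /* No reasons remain: the saved modify bit selects the list. */
            if (pfn_state[pfn] & 0x80)
                release_to_modified_list(pfn);  /* must be written first */
            else
                release_to_free_list(pfn);
        }
    }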



14.2.7 SHRCNT Array 

A second form of reference count is kept for global pages. The PFN SHRCNT 
array counts the number of process page table entries that are mapped to a 
particular global page. When the SHRCNT for a particular page goes from 
zero to one, the reference count is incremented. Further additions to the 
share count do not affect the reference count. 

As the global page is removed from the working set of each process mapped 
to the page, the share count is decremented. When the share count finally
reaches zero, the reference count for the page is also decremented.

When a physical page has a nonzero share count, it cannot be on one of the 
page lists. The forward and backward link words are not needed. The global 
share count array overlays the forward link array. (PFN$AX_FLINK and 
PFN$AX_SHRCNT are the same global location in system space.) The global 
share count is only used for global pages. 

The SHRCNT array is used for a second purpose when the physical page in 
question is a process page table page or a global page table page. In either of 
these cases, the array element counts the number of active page table entries 
in the process or global page table page. When this value passes from zero to 
nonzero, process page table pages are dynamically locked into the process 
working set and global page table pages are locked into the system working 
set. 
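The interplay between the share count and the reference count can be
summarized with a short C sketch (illustrative names; the real array
overlays FLINK, as noted above).

    #include <stdint.h>

    extern uint16_t pfn_shrcnt[];   /* overlays FLINK for valid global pages */
    extern void increment_refcnt(uint32_t pfn);
    extern void decrement_refcnt(uint32_t pfn);

    /* A process working set has just mapped this global page. */
    void global_page_mapped(uint32_t pfn)
    {
        if (pfn_shrcnt[pfn]++ == 0)   /* 0 -> 1: first sharer */
            increment_refcnt(pfn);    /* one REFCNT reason covers all sharers */
    }

    /* A process working set has just given up this global page. */
    void global_page_unmapped(uint32_t pfn)
    {
        if (--pfn_shrcnt[pfn] == 0)   /* last sharer is gone */
            decrement_refcnt(pfn);
    }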



14.2.8 WSLX Array 

The working set list index array contains an index into a process or system 
working set list for valid pages. The content of an array element is a longword 
index from the beginning of the process (or system) header to the working set 
list element in question. 

Because a physical page that is in some working set is not on one of the 
page lists, the link words are available for other uses. The working set list 
index array overlays the backward link array. (PFN$AX_BLINK and 
PFN$AX_WSLX are the same global location in system space.) The WSLX 
array is not used for global pages. 

14.2.9 SWPVBN Array 

The swap virtual block number array is used to support the outswap of a 
process with I/O in progress. When such an outswap occurs, the virtual block 
number in the swap file where the locked-down page would go is recorded in 
the SWPVBN array. The modified page writer checks this array for nonzero 
contents and, if they are nonzero, diverts the page from its normal backing 
store address to the designated block in the swap file. 

14.3 DATA STRUCTURES FOR GLOBAL PAGES 

The treatment of global pages is not much different from that of process 
private pages. However, the system is required to keep some system-wide 
database of the various global pages in the system. 

14.3.1 Global Section Descriptor 

When a global section is created, a structure called a global section descriptor 
(GSD) is allocated from paged dynamic memory and loaded with information 



that describes the section (see Figure 14-14). The information about the
section stored in the GSD is only used when the section is created or
deleted, or when some process attempts to map to the section. The pager does
not use this data structure.

[Figure 14-14, Layout of Global Section Descriptor: a regular GSD contains
the GSD forward and backward links, type and size fields, the UIC of the
creator of the section, the UIC of the file owner, a global section table
index, a protection mask, the global section ident, a count, the section
name (a counted ASCII string of up to 15 characters), and the section
flags. An extended GSD for map-by-PFN global sections adds the base PFN,
the number of pages in the section, and a reference count; this extended
portion also appears in shared memory GSDs (see Figure 14-27).]

The GSD is linked into one of two GSD lists maintained by the system. All 
system global sections are put into one list; group global sections (independ- 
ent of group number) are put into the other list. The global section table 
index field of the GSD contains an index that allows a second structure 
(called a global section table entry) to be located. 



14.3.2 The System Header and Global Section Table Entries 

The system maintains two data structures for itself that parallel structures 
maintained for each process in the system. The system PCB and system 
header are used by the pager to allow page faults of system pages to be treated 
almost identically to page faults for process pages. 

The system header (see Figure 14-15) contains the working set list that 
governs page replacement for system pages. The section table area in the 
system header contains section table entries for the image files that contain 
pageable system pages. These include the executive image (SYS.EXE), the 
record management services image (RMS.EXE), and the system message file 
(SYSMSG.EXE). 






[Figure 14-15, The System Header Containing the System Working Set List and
the Global Section Table: MMG$GL_SYSPHD points to the system header, which
contains the system working set list, room for expansion of the global
section table, the global (system) section table itself (delimited by
PSTBASOFF, PSTFREE, and PSTLAST, with a GSTE located by its GSTX), and the
system page table. The boundary between the system working set list and the
global section table is movable.]



The section table area in the system header serves a second purpose. When 
a global section is created, a section table entry that describes the global 
image file is created. The new section table entry is placed into an area of the 
system header called the global section table. The format of a global section 
table entry (see Figure 14-16) is nearly identical to the format of a process 
section table entry. The only difference is that the first longword points to 
the global section descriptor (instead of the channel control block). 

Global section table entries are accessed in exactly the same way as process 
section table entries, with a negative longword index from the bottom of the 
global section table. The global section table index in the global section de- 
scriptor is such an index, associating a GSTE with a GSD. 
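The negative-index access can be written in one line of C; this is a sketch
under the assumption that the index counts longwords backward from the
bottom of the table.

    #include <stdint.h>

    /* Locate a section table entry from a section table index (GSTX or
       PSTX); gst_bottom is the address of the bottom of the table. */
    static inline uint32_t *gste_from_gstx(uint32_t *gst_bottom, uint32_t gstx)
    {
        return gst_bottom - gstx;   /* negative longword index */
    }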



14.3.3 Global Page Table Entries 

A third set of data is also created for each global section. Each page in the 
global section is described by a global page table entry in the global page table 
(see Figure 14-17). The pager uses global page table entries just like process 
page table entries to locate global pages. 

Global page table entries are restricted to a subset of the forms illustrated 
in Figure 14-3. 






[Figure 14-16, Layout of Global or System Section Table Entry: the GSTE
contains the global section descriptor address; forward and backward link
indexes; the starting virtual page number (22 bits); a pointer to the window
control block (for virtual-to-logical mapping); the base virtual block
number for the section; the count of PTEs referencing the section; and the
number of pages in the section.]



• The global page table entry can be valid, indicating that the global page is 
in at least one process working set. 

• The global page table entry can indicate a demand zero page. Global de- 
mand zero pages are used to initialize global page file sections. 

• The global page table entry can indicate some transition state. The 
PFN STATE array indicates which transition state is involved in the usual 
way. 

• The global page can be in a global image file, in which case the global page 
table entry contains a global section table index. 



14.3.4 Global Page Table and System Page Table 

Global page table entries are located in exactly the same manner as process or 
system page table entries. Location MMG$GL_GPTBASE contains the ad- 
dress of the base of the global page table. All references to global page table 
entries use what can be thought of as a virtual page number as an index into 
the global page table. 

The interesting thing to note about this approach is that the base of the 
global page table coincides with the base of the system page table. Further, 
the virtual page numbers that are used as indexes into the global page table 
are system virtual page numbers. In fact, when looking at system virtual 
address space, the global page table simply appears as an extension to the 
system page table. The global page table index associated with the first
global page is one greater than the largest system virtual page number for a
given configuration.

[Figure 14-17, Location of Global Page Table at Virtual End of System Page
Table: MMG$GL_SYSPHD locates the system header, MMG$GL_SPTBASE the system
page table, and MMG$GL_GPTE the global page table that follows it. Global
page table entries are located with a virtual page number from the beginning
of the system page table, and may indicate pages that are (1) valid, (2) in
transition, or (3) in a global image file, in which case the GPTE contains
an index into the global section table in the system header.]

This logical extension of the system page table exists only when looking at 
system virtual address space. The global page table does not exist in physical 
pages adjacent to the system page table. The system length register only rec- 
ords the number of real system page table entries, not the logical extensions. 
In other words, global pages are not mapped into system virtual address space 
and are not accessible through system virtual addresses. This pseudoexten- 
sion to the system page table is only available to the software routines in the 
memory management subsystem. 
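In C terms the lookup is ordinary array indexing from MMG$GL_GPTBASE; the
sketch below uses illustrative names on the C side.

    #include <stdint.h>

    extern uint32_t *mmg_gptbase;   /* stands in for MMG$GL_GPTBASE */

    /* A GPTX is a pseudo system virtual page number used as a longword
       index; the arithmetic is identical to a system page table lookup. */
    static inline uint32_t *gpte_from_gptx(uint32_t gptx)
    {
        return &mmg_gptbase[gptx];
    }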

[Figure 14-18, Relationships among Global Section Data Structures: the
figure shows a global section descriptor (carrying the global section table
index, labeled (A)), the global section table entry in the system header
(carrying the WCB address, the base VBN, and the starting virtual page
number, labeled (B)), and the global page table entries at the virtual end
of the system page table that map the section.]

Figure 14-18 shows how the global page table relates to the system page

table. It also shows the relationship among the global section descriptor, the 
global section table entry, and the global page table entries for a given sec- 
tion. There are several relationships among these three structures. 

• The central structure is the global section table entry (see Figure 14-16).
The first longword in the GSTE points to the global section descriptor.

• The virtual page number field (labeled (B) in Figure 14-18) contains the 
pseudo system virtual page number that serves as a longword index to the 
first global page table entry that maps this section. 

• The global section descriptor contains a global section table index (labeled 
(A) in the figure) that allows the GSTE to be located from the GSD. 

• The original form of each global page table entry is a section table index 
(identical to the GSTX found in the global section descriptor), effectively 
pointing to the GSTE. When any given GPTE is either valid or in transi- 
tion, the GSTX is stored in the PFN BAK array. Note that GPTEs for global 
page file sections contain the page file backing store address. 






[Figure 14-19, Relationship between Process PTEs and Global PTEs: successive
process page table entries contain global page table indexes N, N+1, and so
on, which select successive GPTEs located from MMG$GL_GPTBASE; a section of
Z pages occupies Z successive global page table entries.]



14.3.5 Process PTEs for Global Pages 

When a process maps a portion of its virtual address space to a global section, 
its process page table entries that map the section are in the form used for 
global page table indexes. The process PTE that maps the first global section 
page contains the GPTX of the first page in the global section. Each succes- 
sive process page table entry contains the next pseudo system virtual page 
number (GPTX), so that each PTE effectively points to the GPTE that maps 
that particular page in the global section. This concept is shown in Figure 
14-19. Assume that the section shown in the figure contains Z number of 
pages. 

Figure 14-3 shows the possible forms for process page table entries. 

All of the data structures associated with global sections will be described 
in detail in Chapter 15, where page faults for global pages are discussed. The
initial allocation of these structures is briefly described along with the Create 
and Map Section and Map Global Section system services in Section 16.3.1. 



14.4 SWAPPING DATA STRUCTURES 

There are three data structures that are used primarily by the swapper but 
indirectly by the pager. The SYSBOOT parameter BALSETCNT determines 
the maximum number of concurrently resident processes. In particular, it 
determines the amount of system address space set aside for process headers. 



14.4.1 Balance Slots 

When the system is initialized, an amount of virtual address space equal to 
the size of a process header times BALSETCNT is allocated exclusively for 
process headers (see Figure 14-20). Each of these process header areas is called 



a balance slot. The location of the first balance slot is stored in global
location SWP$GL_BALBASE. The size of a process header (in pages) is stored
in global location SWP$GL_BSLOTSZ. The calculations that are performed by
SYSBOOT to determine the size of the process header are described in
Chapter 26.

[Figure 14-20, Balance Slots Contain Process Headers: SWP$GL_BALBASE locates
balance slot 0; all balance slots are exactly the same size, the size of a
balance slot in pages is stored in global location SWP$GL_BSLOTSZ, and there
are BALSETCNT slots. Each process header contains PHVINDEX, the working set
list, the process section table, the process header page arrays, and the P0
and P1 page tables.]



14.4.2 Balance Slot Arrays 

The system maintains two word arrays describing each process with a proc- 
ess header stored in a balance slot (see Figure 14-21). Both of the word arrays 
are indexed by the balance slot number occupied by the resident process. The 
balance slot number is stored in the fixed portion of the process header at 
offset PHD$W_PHVINDEX. Entries in the first array contain the number of 
references to each process header; entries in the second array contain an
index into a longword array that points to the process control block for each 
process header. 

The entries in the reference count array (based at the global pointer 
PHV$GL_REFCBAS) count the number of reasons why the process header 
cannot be removed from memory. Specifically, this array element counts the 
number of page table pages that contain either valid or transition PTEs. 

The entries in the process index array (based at the global pointer 
PHV$GL_PIXBAS) contain an index into the longword array based at the 



global pointer SCH$GL_PCBVEC. The entries in the longword array contain
pointers to the process control blocks of the processes with a process
header in a balance slot. Figure 14-21 illustrates how the executive turns
the address of a process header into the address of the PCB for that
process, using the entry in the process index array.

[Figure 14-21, Process Header Vector Arrays: the contents of
PHD$W_PHVINDEX are used as a word index into the reference count array
(based at PHV$GL_REFCBAS) and the process index array (based at
PHV$GL_PIXBAS), each with BALSETCNT entries. The process index entry,
multiplied by 4 and added to the contents of SCH$GL_PCBVEC, selects the
entry in the PCB vector (MAXPROCESSCNT entries) that points to the PCB of
the process whose PHD is in that balance slot.]

If the process header address is known, the balance slot index can be calcu- 
lated (as described in the next section). By using this as a word index into the 
process index array, the longword index into the PCB vector is found. The 
array element in the PCB vector is the address of the PCB (whose PCB$L_PHD 
entry points back to the process header). A more detailed description of the 
PCB vector can be found in Chapter 20, where its use by the Create Process 
system service is discussed. 

14.4.3 Comment on Equal-Size Balance Slots 

The choice of equal-size balance slots, at first sight seemingly inefficient, has 
some subtle benefits to portions of the memory management subsystem. 




There are several instances, most notably within the modified page writer, 
when it is necessary to obtain a process header address from a physical page's 
page frame number (PFN). With fixed size balance slots, this operation is 
straightforward. 

The contents of the PFN PTE array point to a page table entry somewhere 
in the balance slot area. Subtracting the contents of SWP$GL_BALBASE 
from the PFN PTE array contents and dividing the result by the size of a 
balance slot (the size of a process header) in bytes produces the balance slot 
index. If this index is multiplied by the size of the process header in bytes and 
added to the contents of SWP$GL_BALBASE, the final result is the address of 
the process header that contains the page table entry that maps the physical 
page in question. 
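The arithmetic in the preceding paragraph can be restated as a small C
function; the parameter names are illustrative and the slot size is assumed
to be given in bytes.

    #include <stdint.h>

    /* From the PFN PTE array contents (the address of a PTE somewhere in
       the balance slot area), compute the address of the containing PHD. */
    uint32_t phd_from_pte_address(uint32_t pte_address,
                                  uint32_t balbase,     /* SWP$GL_BALBASE    */
                                  uint32_t slot_bytes)  /* PHD size in bytes */
    {
        uint32_t slot = (pte_address - balbase) / slot_bytes; /* slot index */
        return balbase + slot * slot_bytes;   /* base of the process header */
    }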



14.5 DATA STRUCTURES THAT DESCRIBE THE PAGE AND SWAP FILES

Page and swap files are used by the memory management subsystem to save 
physical page contents or process working sets. Page files are used to save the 
contents of modified pages that are not in physical memory. Both the swap 
and page files are used to save the working sets of processes that are not in the 
balance set. 



14.5.1 Structure of Page and Swap Files 

Figure 14-22 illustrates the data structures used to access page and swap files. 
Location MMG$GL_PAGSWPVC contains the address of an array of long- 
word pointers, called the page and swap file vector. The number of pointers in 
the array is the maximum number of page and swap files allowed on the 
system (SYSGEN parameters SWPFILCNT and PAGFILCNT) plus one. 

INIT initializes the page and swap file vector and loads the pointers with 
the address of a null page file control block. The first pointer in the array is 
loaded with the address of the page file control block for the shell process. 
When SYSINIT initializes the primary page file control blocks, the pointer 
located by the index SWPFILCNT + 1 is redirected to the control block for the 
primary page file (SYS$SYSTEM:PAGEFILE.SYS). 

The second pointer in the page and swap file vector is redirected to point to 
the control block for the primary swap file (SYS$SYSTEM:SWAPFILE.SYS). If 
there is no swap file, or if the value of the SYSGEN parameter SWPFILCNT 
equals zero, this pointer is not redirected. In this case all swap operations are 
performed to the primary page file. 

The page file control blocks and pointers for the alternate page and swap 
files are created by SYSGEN. 

[Figure 14-22, Page and Swap File Database: MMG$GL_PAGSWPVC points to the
page and swap file vector, which holds an entry for SHELL (not otherwise
used), an entry for SWAPFILE.SYS and entries for the alternate swap files,
and an entry for PAGEFILE.SYS and entries for the alternate page files
(primary entries initialized by SYSINIT, alternates by SYSGEN).
PHD$B_PAGFIL and PCB$L_WSSWP select entries in the vector. Each page file
control block contains the address of the start of the bitmap, the starting
byte offset to scan, the page fault cluster, type and size fields, a pointer
to the window control block, the base virtual block number, the size in
bytes of the bitmap, the count (minus 1) of pages that may be allocated, the
count (minus 1) of pages that may be reserved, and the bitmap itself, with
one bit per block in the page or swap file (a set bit means the block is
available).]

Page file control blocks are used to describe both page and swap files. When

the SYSINIT process initializes the page file control blocks for the primary 
page and swap files, the following operations are performed: 

1. The file is opened. 

2. The address of the window control block is stored in the control block. 

3. The page file bitmap is allocated from nonpaged pool and initialized to all 
bits set. 

4. The address of the control block is stored in the appropriate location in the 
page and swap file vector. 

The SYSINIT process is described in more detail in Chapter 25. 

Note that the locations of the window control block field, the virtual block 
number field, and the page fault cluster factor field are in the same relative 
offsets in these structures as they are in a section table entry. Because the 
offsets are the same, I/O requests can be processed by common code, inde- 
pendent of the data structure that describes the file being read or written. 

When any page or swap file is opened, all mapping information for the file 
is copied into the window control block. These so-called cathedral windows 
insure that the memory management subsystem does not have to take a 
window turn (see Section 19.1.4), which could lead to system deadlock. 



14.5.2 The Shell Process 

The first longword in the page and swap file vector points to the control block 
for the shell process. This control block is initialized by the module INIT (see 
Chapter 25) and contains the starting VBN of the shell process and the sys- 
tem window control block. This information is used in process creation to 
read copies of the shell process into the system. When INIT initializes the 
shell control block, it adds one to the value of the SYSGEN parameter 
SWPFILCNT and stores the result in the global location SGN$GW_ 
SWPFILCT. For more information on the shell process, see Chapter 20. 

14.5.3 Structure of Swap Files 

When a process is created, it is assigned a swap space within the swap or page 
file. This swap space contains room for the process header and the process 
body (the P0 and PI pages belonging to the process). The initial size of the 
swap space is equal to the value of the SYSGEN parameter MPW_ 
WRTCLUSTER. If the value of MPW_WRTCLUSTER is less than the size of 
the shell process, the initial size of the swap space is set to the size of the 
shell (16 pages). This initial swap space size insures that a system being 
bootstrapped can create processes. The structure of swap spaces is illustrated 
by Figure 14-23. 
[Figure 14-23, Swap File Database: the upper byte of PCB$L_WSSWP contains an
index into the page and swap file vector (Figure 14-22), and the lower three
bytes contain the virtual block number of the beginning of the slot
allocated to this process. The slot holds the non-page-table process header,
the active page tables, and the process body (P0 and P1 pages), a total of
PCB$W_APTCNT pages. Bit PCB$V_RES in PCB$L_STS indicates the residency of
the process: 1 = resident, 0 = outswapped.]

If a process's working set list grows so that it no longer fits its swap space,

the process is reassigned to a new swap space, which is MPW_WRTCLUSTER
pages bigger. In this manner, the process's swap space is increased in
multiples of MPW_WRTCLUSTER. A process's swap space can grow up to
WSQUOTA pages. At image exit, the process's working set is reduced back to
PHD$W_DFWSCNT, and the process is reassigned to an initial size swap
space.

Dynamically allocated swap spaces represent a significant change from 
previous versions of the VAX/VMS operating system. Previously, swap files 
were composed of a number of fixed size areas known as swap slots. These 
swap slots were permanently allocated. The size of the swap slots was tied 
directly to the SYSGEN parameter WSMAX. This rigidity placed some re- 
strictions on the system. The fixed size of the swap slots limited the possible 
growth of process working sets; because each swap slot was the maximum 
required size (for WSMAX), this limited the number of processes that could 
be created. VAX/VMS Version 3.0 decoupled the link with WSMAX, in part 
to accommodate the new working set expansion provided with the new sys- 
tem. Now the size of the swap spaces is limited only by WSQUOTA. 



14.5.4 Alternate Page and Swap Files 

Alternate page and swap files can be created by the SYSGEN commands 
INSTALL/PAGEFILE and INSTALL/SWAPFILE. A system with alternate 
swap files can support a greater number of processes or processes with larger 
working sets. In a system with alternate page files, newly created processes are
assigned to the page file that contains the most free pages. The assignment 
lasts for the life of the process. Thus, adding alternate page files enhances 
system performance by reducing paging activity to the existing page files (and 
again, making more space available for swap spaces). 



14.6 SWAPPER AND MODIFIED PAGE WRITER PAGE TABLE ARRAYS

The VAX/VMS I/O subsystem allows direct I/O requests (DMA transfers) to 
virtually contiguous buffers. There is no requirement that pages in the buffer 
be physically contiguous or even have any relationship to each other. 



14.6.1 Direct I/O and Scatter/Gather 

The I/O locking mechanism invoked at the FDT level brings each page into 
the working set of the requesting process, makes it valid, and increments that 
page's reference count (in PFN REFCNT array) to reflect the pending read or 
write. The buffer is generally described in the I/O request packet through 
three fields. 



299 



Memory Management Data Structures 

• IRP$L_SVAPTE contains the system virtual address of the first PTE that 
maps the buffer. 

• IRP$W_BOFF and IRP$W_BCNT together describe the buffer size that is 
used to calculate how many PTEs are required to map the buffer. 
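The PTE count implied by the second item is simple arithmetic on 512-byte
VAX pages; the following one-line C sketch is illustrative, not driver
source.

    #include <stdint.h>

    /* Number of PTEs needed to map a buffer that starts boff bytes into
       its first page and is bcnt bytes long (512-byte pages). */
    uint32_t ptes_to_map(uint16_t boff, uint16_t bcnt)
    {
        return ((uint32_t)boff + bcnt + 511) / 512;
    }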

When a driver processes this I/O request, it allocates the required number of 
MBA or UBA mapping registers and loads them with the page frame numbers 
found in the page table entries. The adapter hardware handles the mapping 
from its address space to VAX physical addresses. The ability to transfer to 
discontiguous physical pages (the so-called scatter-read/gather- write capabil- 
ity) is a beneficial side effect of this mapping. 

14.6.2 Swapper I/O 

The swapper is presented with a more difficult problem. It must write a col- 
lection of pages to disk that are not even virtually contiguous. It solves this 
problem elegantly. 

When the system is initialized, an array of WSMAX longwords is allocated 
from nonpaged pool for use as the swapper's I/O table. The starting address of 
this array is stored in global pointer SWP$GL_MAP. (The address is also 
stored in the saved PO base register in the swapper's process header so the 
pages mapped by this array are effectively the swapper's PO space. This use is 
discussed in Chapter 20.) 

When the swapper scans the working set list of the process being 
outswapped, the page frame numbers in each valid PTE are moved to succes- 
sive entries in the swapper's I/O table. The address of the base of the table is 
put into the SVAPTE field of the IRP by the swapper before the IRP is passed 
on to the driver. (The swapper can exercise this control because it builds a 
portion of its own IRP, rather than using the entire $QIO mechanism.) The 
I/O table looks just like any other page table to the mapping register subrou- 
tines called by the driver. The PFNs are extracted from this array and loaded 
into adapter mapping registers. 
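The scan just described amounts to copying the valid PTEs into successive
map entries. The C sketch below is a rough analogy (the actual code is VAX
MACRO, and the names and field positions are assumptions of this sketch).

    #include <stdint.h>

    #define PTE_VALID    (1u << 31)
    #define PTE_PFN_MASK 0x001FFFFFu   /* PTE<20:0>, the page frame number */

    /* Copy the PTE of each valid working set page into successive
       longwords of the swapper's I/O table; the driver then treats the
       table as the page table mapping one virtually contiguous buffer. */
    uint32_t build_swapper_map(const uint32_t *ws_ptes, uint32_t count,
                               uint32_t *swp_map)
    {
        uint32_t n = 0;
        for (uint32_t i = 0; i < count; i++)
            if (ws_ptes[i] & PTE_VALID)
                swp_map[n++] = ws_ptes[i];   /* PFN is in the low 21 bits */
        return n;    /* number of pages to be written */
    }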

What the swapper has succeeded in doing is making pages that are not 
virtually contiguous appear to be virtually contiguous to the I/O subsystem. 
(A different interpretation is that the pages are virtually contiguous in the PO 
space of the swapper, the process that is actually performing the I/O.) At the 
same time that each PTE is being processed, any special actions based on the 
type of page are also taken care of. The whole operation of outswap and the 
complementary steps taken when the process is swapped back into memory 
are discussed in Chapter 17. 

14.6.3 Modified Page Writer PTE Array 

[Figure 14-24, Swapper and Modified Page Writer PTE Arrays: SWP$GL_MAP
points to the swapper's I/O page table entry array of longwords, WSMAX
elements long (this address and length are stored in the swapper's P0 base
and length registers). MPW$AL_PTE points to the modified page writer's I/O
page table entry array of longwords, and MPW$AW_PHVINDEX to its process
header vector index array of words, each MPW_WRTCLUSTER elements long.]

The modified page writer, in its attempt to write many pages to backing store
with a single write request (so-called modified page write clustering), is faced



with a problem similar to the swapper's problem, with one additional twist. 
When the modified page writer is building an I/O request, there are three 
forms of page that it can encounter. Pages that are bound for the swap file 
(SWPVBN nonzero) are written individually. Pages that are bound for an 
image file are not necessarily virtually contiguous, these pages will be writ- 
ten as a group only if they are contiguous. Pages on the modified page list that 
are to be written to a page file may be not only discontiguous within a process 
address space but may also belong to several processes. The modified page 
writer builds a table of PTEs in a manner similar to the swapper. 

At initialization time (in module INIT), two arrays are allocated from 
nonpaged pool for the modified page writer (see Figure 14-24). Each array 
contains MPW_WRTCLUSTER elements. The longword array will be filled
with page table entries containing PFNs analogous to the swapper map. The 
word array contains an index into the process header vector for each page in 
the map. In this way, each page that is put into the map and written to its 
backing store location is related to the process header containing the PTE 
that maps this page. The operation of the modified page writer, including its 
clustered writes to a page file, is discussed in detail in Chapter 15.

14.6.4 Nonreentrancy of Swapper and Modified Page Writer 

The use of these arrays to hold page table entries for the I/O subsystem 
makes the swapper and the modified page writer not reentrant. That is, the 
swapper process can perform only the following simultaneous operations: 

• An inswap or outswap operation that uses the swapper map. This action is 
recorded by setting the swap in progress flag (SCH$V_SIP) in location 
SCH$GB_SIP. 




• A modified page write to a page file, an image file, or a swap file VBN. The 
modified page write in progress flag (SCH$V_MPW) in the same global 
location (SCH$GB_SIP) records this action. 

14.7 DATA STRUCTURES USED WITH SHARED MEMORY 

The MA780 shared memory unit can be used as an interprocessor communi- 
cation path with common event flags, mailboxes, or global sections. This 
VMS support requires data structures located in the shared memory that de- 
scribe the shared memory itself and the shared memory common event flag 
clusters, mailboxes, or global sections used. In addition, each processor con- 
nected to the shared memory requires data structures located in local mem- 
ory that describe processor-specific information (such as the starting PFN or 
port number). Information common to both processors (for example, the size 
of the global section descriptor tables) is maintained in the shared memory 
data structures. 

Note that the shared memory described in this section differs significantly 
from the MA780 shared memory used in the VAX-11/782. In the VAX-11/ 
780, shared memory is used as a common data area or communications path 
between two processors; in the VAX-11/782, the MA780 is used as main 
memory. 

14.7.1 Shared Memory Control Structures 

The shared memory unit consists of a series of pages of physical memory. 
The bootstrap sequence records the presence of the shared memory unit but 
does not configure the physical pages into the system (unless the processor is 
a VAX-11/782), allowing the user to include shared memory in a site-specific
way (for example, whether to reinitialize the MA780 shared memory after 
each reboot or not). In either case, the physical memory pages must be virtu- 
ally mapped so that they are accessible to program code (because memory 
management is enabled). 

The virtual mapping used by one processor to access shared memory pages 
may be different from the virtual mapping used by another processor. For this 
reason, some of the data structures that the VMS operating system uses to 
manipulate its data structures located in shared memory are self-relative 
queue elements. (Self-relative queue elements are described in the VAX-11
Architecture Reference Manual.) 

Note that the VMS operating system cannot use one of its usual synchroni- 
zation techniques, elevated IPL, to control access to shared memory data 
structures. Elevated IPL blocks interrupts, but only on one processor. Instead, 
all accesses to shared memory data that must be synchronized are done with 
one of the interlocked instructions provided for just this purpose in the VAX 
architecture. These instructions are: 




INSQHI Insert Entry into Queue at Head, Interlocked 

INSQTI Insert Entry into Queue at Tail, Interlocked 

REMQHI Remove Entry from Queue at Head, Interlocked 

REMQTI Remove Entry from Queue at Tail, Interlocked 

BBSSI Branch on Bit Set and Set Interlocked 

BBCCI Branch on Bit Clear and Clear Interlocked 

ADAWI Add Aligned Word Interlocked 

The four instructions that manipulate self-relative queues actually provide
two levels of interlocking. Because self-relative queue elements must be
quadword aligned, the low three address bits (all zero) are available for other
uses. The low-order bit in the forward link is used as a secondary interlock.
When this bit is set, interlocked access to the head or tail of the queue is
denied. This interlock bit is read and set in an interlocked fashion, like that
used by the other three instructions in the list (BBSSI, BBCCI, and ADAWI).
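The secondary interlock can be imitated with C11 atomics; this sketch is an
analogy to the VAX interlocked instructions, not the VMS implementation.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Queue elements are quadword aligned, so bit 0 of the forward link
       is free to serve as the secondary interlock. */
    int try_lock_queue(_Atomic uint32_t *flink)
    {
        /* BBSSI-like: set bit 0 interlocked; fail if it was already set. */
        return (atomic_fetch_or(flink, 1u) & 1u) == 0;
    }

    void unlock_queue(_Atomic uint32_t *flink)
    {
        atomic_fetch_and(flink, ~1u);   /* BBCCI-like: clear the interlock */
    }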

14.7.1.1 Physical Layout of Shared Memory. If the shared memory is to be supported 
by the VMS operating system, it must be configured into the system with the 
SYSGEN utility. This installation step is described in the VAX/VMS System 
Management and Operations Guide. The resulting physical layout of shared 
memory is illustrated in Figure 14-25. The VMS data areas are initialized 
when the first processor (port) connects the shared memory unit. As other 
ports make their connection, their local memory data structures are simply 
initialized to point to the shared structures. 



[Figure 14-25, Physical Layout of Shared Memory: from the lowest physical
address, the balance of memory is available for shared memory global section
pages; at the high end follow the global page allocation bitmap, the pool
space, the table for shared memory CEBs, the mailbox table, the table for
shared memory GSDs, and, at the highest physical address, the shared memory
common data page.]

14.7.1.2 Shared Memory Common Data Page. The shared memory page with the high-
est physical address is used by the VMS operating system to contain the 
information that describes this shared memory unit. This page is called the 
common data page. Because this page may be virtually mapped in different 
ways on each port (and may not even exist at the same physical address), each 
pointer in the common data page is a relative pointer from the base virtual 
address of the common data page. The contents of the common data page are 
listed in Table 14-3. 

14.7.1.3 Processor-Specific Control. As each processor connects itself to the shared 
memory unit, a data structure in processor local memory is initialized that 
allows that processor to locate the common data page. That structure also 
contains physical page information that allows the shared physical memory 
to be virtually mapped on that processor. The layout of the shared memory 
control block is pictured in Figure 14-26. 



14.7.2 Global Sections in Shared Memory 

The creation and mapping of a global section in shared memory are slightly 
different from the corresponding actions for local memory global sections. 
The global section is recognized as a shared memory global section because 
its name translates to an equivalence name of the form: 

shared-memory-name : section-name 

The Create and Map Section system service then creates the data structures 
necessary to describe this section. 

• The global section descriptor for such a section (see Figure 14-27) is located 
in shared memory and contains information used to map the section. 

• Only the port that creates the global section has a global section table 
entry (in the local memory of the creating processor) describing the sec- 
tion. This section table entry is used by the VMS operating system to load 
the physical pages of the section with the contents of the designated file 
when the section is created. The GSTE is also used if the Delete Global 
Section or Update Section system services are called to write the contents 
of a writeable global section located in shared memory back to its original 
file. (Either system service will not have any effect if it is issued from any 
port other than the creator port.) 

• Because the pages of a shared memory global section are always valid, 
there is no need to page those pages; therefore, no global page table entries 
are created for the section. Instead, when a process maps to such a section, 
its process page table entries are loaded with the page frame numbers of 
the shared memory section pages and marked valid. These pages are not 
charged against the process's working set. 



304 



14.7 Data Structures Used with Shared Memory 



Table 14-3: Contents of Shared Memory Common Data Page

Mnemonic          Item                                           Size

SHD$L_MBXPTR      Relative Pointer to Mailbox Table              Longword
SHD$L_GSDPTR      Relative Pointer to GSD Table                  Longword
SHD$L_CEFPTR      Relative Pointer to CEB Table                  Longword
SHD$L_GSBITMAP    Relative Pointer to Global Page Bitmap         Longword
SHD$L_GSPAGCNT    Total Count of Pages for Global Sections       Longword
SHD$L_GSPFN       Relative PFN of First Global Section Page      Longword
SHD$W_GSDMAX      Number of Entries in GSD Table                 Word
SHD$W_MBXMAX      Number of Entries in MBX Table                 Word
SHD$W_CEFMAX      Number of Entries in CEB Table                 Word
                  (spare word for alignment)                     Word
SHD$T_NAME        Name of Shared Memory (counted ASCII string)   16 Bytes
SHD$Q_INITTIME    Initialization Time                            Quadword



This is the end of the constant area of the shared memory common data page. 



SHD$L_CRC         CRC of Fields in Constant Area                 Longword
SHD$W_GSDQUOTA    Count of GSDs Created (one word per port)      16 Words
SHD$W_MBXQUOTA    Count of Mailboxes Created (one word per port) 16 Words
SHD$W_CEFQUOTA    Count of CEBs Created (one word per port)      16 Words
SHD$B_PORTS       Number of Ports                                Byte
SHD$B_INITLCK     Owner of Initialization Lock                   Byte
SHD$B_BITMAPLCK   Owner of Global Page Bitmap Lock               Byte
SHD$B_FLAGS       Flags for Locking Data Structures              Byte
SHD$B_GSDLOCK     Owner of GSD Table Lock                        Byte
SHD$B_MBXLOCK     Owner of MBX Table Lock                        Byte
SHD$B_CEFLOCK     Owner of CEF Table Lock                        Byte
                  (spare byte for alignment)                     Byte
SHD$W_PRQWAIT     Ports Waiting for Interprocessor Request
                  Blocks (one bit per port)                      Word
SHD$W_POLL        Ports Actively Using the Memory
                  (one bit per port)                             Word
SHD$W_RESWAIT     Ports Waiting for a Resource (one bit per
                  port; one word mask per resource)              16 Words
SHD$W_RESAVAIL    Ports Needing to Report Resource Available
                  (one bit per port; one word mask per resource) 16 Words
SHD$W_RESSUM      Ports with Resources to Report
                  (one bit per port)                             Word
                  (three spare words for alignment)              3 Words
SHD$Q_PRQ         Free Interprocessor Request Block Listhead     Quadword
SHD$Q_POOL        Free Pool Block Listhead                       Quadword
SHD$Q_PRQWRK      Interprocessor Request Work Queue Listheads
                  (one listhead per port)                        16 Quadwords




[Figure 14-26, Contents of Shared Memory Control Block: the SHB contains a
link to the next SHB; the virtual address of the common data page; flags,
type, and size fields; a reference count; the base PFN for global section
pages; the address past the last byte of the shared memory pool; and the
address of the adapter control block.]



[Figure 14-27, Contents of Shared Memory Global Section Descriptor: the
shared memory GSD contains shared memory flags; the rest of a regular global
section descriptor (see Figure 14-14); the deleter port, creator port,
number of processor reference counts, and interprocessor lock; up to
GSD$C_PFNBASMAX (an assembly-time parameter, currently 4) base PFN and page
count pairs, defining the number of discontiguous pieces in a single
section; and a PTE count for each of the four processors.]

Because of the way in which the VMS operating system uses shared memory
for global sections, putting global sections into shared memory, even when
the memory unit is not connected to another processor, improves system
utilization. Each process using the shared sections is getting a free
extension to its working set. There is no demand placed on the global page
table. The local physical memory that would otherwise be required to contain
such entities as DCL or the Run-Time Library is available for other uses,
such as an expanded physical page cache (free page list).



14.7.3 Mailboxes in Shared Memory 

When a mailbox is created in shared memory, it is described by a shared 
memory mailbox descriptor block (MBX) located in the shared memory (see 
Figure 18-2). In addition, each port connected to the shared memory mailbox 
has a unit control block (UCB) in its local memory I/O database that makes 
the connection between the local I/O system and the shared memory mail- 
box. The relationships of shared memory mailbox data structures are pic- 
tured in Figure 18-3. 



14.7.4 Common Event Flag Clusters in Shared Memory 

As with global sections and mailboxes (and the shared memory itself), there 
are data structures in shared memory and other structures in local memory 
required to fully describe a common event flag cluster located in shared 
memory. The shared memory data structure is called a master CEB (common 
event block) and contains the only valid set of event flags. Each port con- 
nected to this common event flag cluster has a slave CEB that locates the 
master. The relationship between the master CEB and the slave CEBs is pic- 
tured in Figure 12-4. The layouts of the master and slave common event 
blocks are pictured in Figure 12-5. 






15 Paging Dynamics 



I consider that a man's brain originally is like a little empty attic, 
and you have to stock it with such furniture as you choose. . . . 
Now, the skillful workman is very careful indeed as to what he 
takes into his brain-attic. He will have nothing but the tools 
which may help him in doing his work, but of these he has a large 
assortment, and all in the most perfect order. It is a mistake to 
think that the little room has elastic walls and can distend to any 
extent. Depend upon it, there comes a time when for every 
addition of knowledge you forget something that you knew 
before. It is of the highest importance, therefore, not to have useless
facts elbowing out the useful ones. 
— Sir Arthur Conan Doyle, A Study in Scarlet 



In the previous chapter, the various data structures that are maintained by 
memory management were described apart from the context in which they 
are used. This chapter shows how the various structures are manipulated by 
the pager in response to different forms of page faults. 

Although pager action is described here, it is not presented in a flowchart 
or decision fashion. Rather, the actions are described in terms of modifica- 
tions to data structures. 



15.1 OVERVIEW OF PAGER OPERATION 

Before discussing how the pager reacts to different forms of page faults, this 
chapter will briefly describe the overall operation of the pager. 



15.1.1 Hardware Action 

All program references generated by the CPU are virtual addresses. Each ad- 
dress must be translated to a physical address before a reference to memory 
(or an I/O space page) can be made. The virtual address (see Figure 15-1) is 
used by the address translation mechanism to find the page table entry that 
will be used to translate the address. 

If the page table entry is valid, its contents are used to translate the virtual 
address to a physical address and execution continues. If the page table entry 
is invalid (PTE<31> = 0), then a translation-not-valid fault is generated. 
Figure 15-2 shows the state of the kernel stack following a page fault. 






[Figure 15-1, Format of Virtual Address Showing Fields Used to Locate Page
Table Entry That Maps the Page: VA<31:30> selects the page table (0 = P0
page table, 1 = P1 page table, 2 = system page table, 3 = reserved);
VA<29:9> is the virtual page number, used as a longword index into the
selected table; the low bits of the address are the byte offset within the
page.]




[Figure 15-2, State of the Kernel Stack Following Translation-Not-Valid
Fault: the kernel stack holds, starting at SP, a reason mask, the invalid
virtual address, the PC of the faulting instruction, and the PSL at the time
of the fault. In the reason mask, bit 0 is always 0 for
translation-not-valid faults; the PTE reference bit is 0 if the virtual
address itself was not valid and 1 if the associated PTE was not valid; the
intended access type bit is 0 for read access and 1 for modify or write
access.]

15.1.2 Initial Pager Action 

Before the pager does any work, it performs a consistency check by demand- 
ing that the IPL be no higher than 2. If the IPL is higher than 2, a fatal bug- 
check is generated. This check is made for the following two reasons: 






• Code that is executing at a higher IPL needs to perform a series of instruc- 
tions without being interrupted. If a page fault happens, the faulting proc- 
ess might be removed from execution, allowing another process to execute 
the same routine or access the same protected data structure. 

• Page faults are exceptions that happen to a process. When the system is 
executing at IPL higher than 2, it is often on the interrupt stack, acting in 
response to an external trigger. There is not necessarily a process that can 
be charged for the page fault. 

The next step that the pager takes is to retrieve the invalid virtual address 
from the kernel stack. It uses this address to locate the page table entry that 
maps this page by performing the same operations that the address transla- 
tion mechanism uses. 

1. The upper two bits of the virtual address (VA<31:30>) select which page 
table (or which base register) to use. 

2. The virtual address field (VA<29:9>) is used as a longword index into the 
page table. 

Before the page table entry is examined, the pager determines whether the 
system virtual page containing the page table entry is itself valid. (This check 
avoids the necessity of making the pager recursive.) If not, the page table page 
is made valid first. Note that the pager does not perform this check using the 
page table valid bit in the exception parameter; rather, it checks the valid bit 
in the page table entry for the system virtual page. 

Once the page table entry is available, the pager takes different actions 
depending on the nature of the invalid page table entry. (See Figure 14-3 for 
the different forms of invalid page table entry.) The next several sections 
describe some of the major paths through the pager. Extraordinary conditions 
such as read and write errors are only mentioned in passing. 



15.2 PAGE FAULTS FOR PROCESS PRIVATE PAGES 

The first set of page faults concern process private pages. The different path 
through the pager when sharing is involved is discussed in the next section. 
There are four cases that must be described. 

• Two of the cases involve a page that is originally faulted from an image 
file. The two cases are distinguished by whether or not the section is copy 
on reference. 

• A third case involves a private section consisting of a series of demand zero pages.

• Finally, an intermediate state that can result from both copy-on-reference 
pages and demand zero pages has the faulting page residing in a page file. 




15.2.1 Page Located in an Image File 

There are two different types of page that can initially reside in a private
image file: pages that are copy-on-reference, and those that are not. The page
table entry for either page contains a process section table index. The only 
initial difference between the two pages is the setting of the copy-on-refer- 
ence bit in the page table entry (see Figure 14-3). 

15.2.1.1 Image Page That Is Not Copy on Reference. The first type of page fault in- 
volves a page in an image file that is not copy on reference. The various 
transitions that such a page can possibly make are illustrated in Figure 15-3. 
The numbers in circles are keyed to explanations of each transition listed 
below. (For simplicity, clustered reads and writes are ignored in the discus- 
sion that follows. Section 15.5 discusses all aspects of paging I/O.) The page 
table entry is initially set to the form illustrated at the top of Figure 15-3. It 
contains a process section table index (PSTX) with the copy-on-reference bit 
(PTE<16>) clear. 

(1) A page fault occurs. The pager uses the virtual address exception parame-
ter to locate the page table entry. The page table entry contains a process 
section table index. Information contained in the process section table 
entry indicates which virtual block in the image file should be read. The 
pager allocates a physical page from the head of the free page list. The 
page is added to the process working set. This step may require the pager 
to remove another page from the working set in order to make room for 
the page currently being added. 

The PFN arrays are initialized. The STATE array element indicates 
that a read is in progress. The PTE array element points to the process 
page table entry. The working set list index array element locates the 
working list entry just set up. The BAK array element is loaded with the 
initial contents of the page table entry, the process section table index. 
The reference count array element contains a two, one for being in the 
working set and one for the read in progress. 

The pager builds an I/O request packet (see Section 15.5) that describes 
the read that is being done. The process is placed into a page fault wait 
state. 

(2) Because most of the work was done in response to the initial fault, there 
is little left to do when the page read completes. The reference count is 
decremented (but stays above zero, so nothing special happens). The state 
of the page is changed to active and valid. Finally, the valid bit is set in 
the process page table entry and the process is removed from the page 
fault wait state. The next time that the process is selected for execution, 
it will execute the same instruction that caused the initial page fault. 



[Figure 15-3, State Diagram Showing Page Transitions for Private Section
Page That Is Not Copy on Reference: the page starts with its PTE containing
a process section table index (PSTX) and no PFN data. Page fault transitions
and other transitions move it through the states: read in progress
(REFCNT = 2, BAK = PSTX); active and valid, with the modify bit clear or
set; release pending (REFCNT > 0); the free or modified page list
(REFCNT = 0), according to the saved modify bit; and write in progress by
the modified page writer (REFCNT = 1), after which the page returns to the
free page list.]

(3) One transition that a valid page can undergo (and still remain valid) oc- 
curs when the page is modified as a result of instruction execution. The 
hardware sets the modify bit in the page table entry. The change is not 
noted at this time in the PFN database. 

(4) When the page is removed from the process working set, several things 
happen. 

a. The working set list entry is made available. 

b. The WSLX array element is cleared. 

c. The modify bit in the page table entry is logically ORed into the PFN 
state array element. 

d. The VALID, TYP0, and TYP1 bits in the PTE are all cleared. The PFN
field is left alone. 

e. The REFCNT array element is decremented. If the reference count 
goes to zero, the page is put on the free or modified page list, according to
the setting of the saved modify bit in the PFN STATE array element. 
The new location of the page is inserted into the STATE array. 

Note that pages are not removed from the working set until room is 
required for other pages, until the virtual pages are deleted, or in response 
to a $PURGWS system service call. 

(5) If the reference count does not go to zero, there is outstanding I/O for this 
page. The state is changed to release pending. The ultimate destination 
for the page (free or modified list) is recorded in the saved modify bit in 
the STATE array. 

(6) The I/O completion routine decrements reference counts for pages that 
are locked down. When this routine detects that the count has gone to 
zero, it places the page on either the free list or the modified list as appro- 
priate. The STATE array element is changed. 

If the page is placed on the modified list and if it has a backing store 
address already, the page file index is cleared and the page file dealloca- 
tion routine is called to release the page in the page file. Because the page 
has been modified, it is assumed that the contents at its backing store
address are now invalid. 

(7) The modified page writer will eventually write this physical page to its 
backing store address, which is located in the PFN BAK array. Writeable 
pages that are not copy on reference are written back to the image file 
from which they originally came. 

The state of the page is set to write in progress. The saved modify bit is 
cleared. The reference count of one reflects this outstanding output oper- 
ation. 

It is worth noting at this time that writeable private pages that are not 
copy on reference are not usual products of the linker. Such sections must 
be created with the Create and Map (Private) Section system service. 




(8) When the modified page write completes, the page is placed on the free 
page list. The same routine decrements the reference count, notes that 
the reference count went to zero, and notes that the saved modify bit is 
clear. 

(9) While the physical page has remained attached to the process, the page 
table entry has always contained a PFN and the PFN PTE array has al- 
ways contained the address of the process page table entry. 

When the physical page is reused for another purpose, several steps 
must be taken to break the ties between the process virtual page and the 
physical page that is about to be reused. 

The process PTE must be altered to reflect the backing store address of 
the page. (The PFN PTE array is used to locate the page table entry.) In 
this case, the PTE is reset so that it contains a process section table index 
(PSTX), the same contents that it had before the initial page fault. 
The PFN array elements for this physical page are all cleared before the 
page is passed on to the new owner of the physical page. In particular, the 
PTE array element, the only connection from the PFN database to the 
process page table, is cleared. 
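The PFN database arrays that appear throughout these transitions (the
STATE, PTE, BAK, and reference count arrays, and the working set list index
array) are parallel arrays, each indexed by page frame number. The following
C fragment is a simplified sketch of that layout and of the removal step in
transitions 4 and 5; all names, types, and widths are illustrative and are
not the actual VMS definitions.

    /* Simplified sketch of the PFN database as parallel arrays indexed
       by page frame number (PFN).  Illustrative only. */

    #define MAX_PFN 8192               /* hypothetical amount of memory */

    enum pfn_state {                   /* locations a physical page     */
        PFN_FREE_LIST,                 /* can be in                     */
        PFN_MODIFIED_LIST,
        PFN_RELEASE_PENDING,
        PFN_READ_IN_PROGRESS,
        PFN_WRITE_IN_PROGRESS,
        PFN_ACTIVE_AND_VALID
    };

    struct pfn_database {
        enum pfn_state state[MAX_PFN]; /* STATE array                   */
        int   saved_modify[MAX_PFN];   /* saved modify bit (in STATE)   */
        void *pte[MAX_PFN];            /* PTE array: back pointer       */
        unsigned bak[MAX_PFN];         /* BAK array: backing store      */
        int   refcnt[MAX_PFN];         /* reference count array         */
        int   wslx[MAX_PFN];           /* working set list index array  */
    };

    /* Transitions 4 and 5: a page leaves the working set.  With no
       outstanding I/O it goes to the free or modified page list,
       according to the saved modify bit; otherwise it becomes
       release pending. */
    void remove_from_working_set(struct pfn_database *db, unsigned pfn)
    {
        db->wslx[pfn] = 0;
        if (--db->refcnt[pfn] == 0)
            db->state[pfn] = db->saved_modify[pfn]
                                 ? PFN_MODIFIED_LIST : PFN_FREE_LIST;
        else
            db->state[pfn] = PFN_RELEASE_PENDING;
    }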

15.2.1.2 Page Faults Out of Transition States. Figure 15-3 also shows the transitions 
that a page makes when a page fault occurs while the physical page is in the 
transition state. While the changes back to the active state are somewhat 
straightforward, there are details about each fault that should be mentioned. 
Note that each of these page faults requires that a new working set list entry 
be acquired, and the acquisition may involve the removal of some other page 
from the process working set. 

1. A page fault from the free page list is resolved by placing the page back 
into the active and valid state, resetting the PTE, and incrementing the 
reference count. 

2. A page fault from the modified list has exactly the same effect. The fact 
that the page was previously modified but never written to its backing 
store address is shown in the figure by putting the page back into its modi- 
fied state. 

In fact, the modify bit in the PTE is not actually turned on by the pager. 
Rather, the saved modify bit in the PFN STATE array records the fact that 
the page has not been backed up. 

3. A page fault from the release pending state has no special effects. Again, 
the state is changed to active, the valid bit in the PTE is turned on, and the 
reference count is incremented. 

Artistic license is taken in the figure to differentiate physical pages that 
were modified from pages that were not. Again, the only difference be- 
tween the two pages is the setting of the saved modify bit in the PFN 
STATE array, not the setting of the modify bit in the PTE. 




4. The transition that deserves special comment is a page fault that occurs 
while the modified page writer is writing the page to its backing store 
address. The saved modify bit is cleared before the write begins so that the 
page will be placed on the free list when the write completes. Although 
the page has not yet been completely backed up, the assumption is made 
that the write will complete successfully. Page faults can thus put the page 
into the active but unmodified state. The only difficulty occurs in the 
event of a write error. The I/O completion routine detects this state of 
affairs and turns the saved modify bit back on. 

15.2.1.3 Copy-on-Reference Page. A more common type of writeable process private 
page is called copy on reference. Figure 15-4 illustrates the transitions that 
such a page makes from its initial page fault until it is written to some back- 
ing store address. 

Many of the transitions that occur here are no different from the case just 
described. This section will note each transition but only elaborate on those 
areas that are different. 

(1) The initial setting of the page table entry (START in the figure) is again
the process section table index, but the copy-on-reference bit (PTE<16>)
is now set. When a page fault occurs, the pager again allocates a physical
page, sets its PFN into the PTE, and initiates the read. Two important 
steps are taken at this time that differ from the previous case. 

First, the saved modify bit in the PFN STATE array is turned on. Set- 
ting the bit guarantees that the page will be written to its backing store 
address when removed from the process working set, regardless of what 
instructions or I/O operations the process chooses to execute. 

Second, the BAK array element is set to point to the page file, with an 
indication that no block has yet been allocated. At this time, all ties to 
the original image file are broken. When the modified page writer wants 
to write this page to its backing store address (as it certainly will because 
the saved modify bit was just turned on), it will allocate a block in the 
page file and write the contents of the physical page there. 

(2) When the read completes, the page is marked as active and valid (and 
effectively modified). 

(3) When the page is removed from the process working set (and the refer- 
ence count is zero), the page is unconditionally placed on the modified 
page list. 

(4) If the reference count did not go to zero when the page was removed from
the process working set, the physical page is placed into the release pend- 
ing state until the I/O completes. 

(5) At that time, the page is placed on the modified page list. 

A page fault from either the release pending state or from the modified page
list puts the page back into the active (but effectively modified) state. That is,
the saved modify bit in the PFN STATE array remains set, causing the page to
be put back on the modified page list when it is removed from the working
set again.

[Figure 15-4 appears here: State Diagram Showing Page Transitions for
Private and Global Copy-on-Reference Pages and for Demand Zero Pages. Its
three starting states are a PTE containing PSTX with the CRF bit set, a PTE
containing a GPTX whose GPTE contains GSTX with the CRF bit set, and a
demand zero PTE. All three converge on Read in Progress (REFCNT=2,
BAK=PGFLX,0), then Active and Valid (REFCNT>0, BAK=PGFLX,0), Release
Pending (REFCNT>0, BAK=PGFLX,0), and Modified Page List (REFCNT=0,
BAK=PGFLX,0). Path C continues in Figure 15-5; the areas within dotted
lines are also shown in Figures 15-7 and 15-8.]

The transition from the modified page list that is taken when the modified 
page writer writes the page to its backing store address (in the page file) fits 
into the transition diagram for faults from the page file (see Figure 15-5). The 
connection between Figure 15-4 and Figure 15-5 is indicated by path C in the 
two figures. 



15.2.2 Demand Zero Pages 

A page table entry can be initially set to demand zero as a result
of a Create Virtual Address Region system service. One of these services can
be issued explicitly by the process or on its behalf by the system (as part of 
image activation or in the LIB$GET_VM Run-Time Library procedure). 

When the pager detects a page fault for a demand zero page, it takes the 
following steps. 

1. A physical page is allocated from the beginning of the free page list. 

2. The PFN array elements are initialized. The PTE array element points to 
the process page table entry. 

3. The BAK array element denotes a not-yet-allocated block in the page file. 

4. The page is filled with zeros. This is done with a MOVC5 instruction that
uses a zero-length source string and a null fill character (a C sketch of
the effect follows this list).

5. The reference count is incremented; the page is added to the process work- 
ing set; and the state is set to active. 

6. Finally, the fault is dismissed and control is passed back to the user proc- 
ess without interruption. 

These steps all take place along path 3 in the upper righthand portion of 
Figure 15-4. 
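The zero fill in Step 4 relies on the semantics of MOVC5: once the source
string is exhausted, the remainder of the destination is filled with the fill
character, so a zero-length source and a fill byte of zero clear the entire
page in one instruction. A rough C equivalent, assuming the 512-byte VAX
page size, is the following sketch:

    #include <string.h>

    #define PAGE_SIZE 512   /* VAX page size in bytes */

    /* Rough C equivalent of MOVC5 #0, src, #0, #512, dst:
       with a zero-length source, every destination byte is
       taken from the fill character (zero). */
    void zero_fill_page(void *page)
    {
        memset(page, 0, PAGE_SIZE);
    }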



15.2.3 Global Copy-on-Reference and Page-File Pages 

There are two forms of pages that merge into the same set of state transitions 
as private copy-on-reference sections and demand zero pages. These forms are 
global copy-on-reference pages and global page-file backing-store pages. The 
details of global page fault resolution are discussed in Section 15.3. 

Suffice it to say here that global copy-on-reference pages are initially
faulted from a global image file but, from that time on, are indistinguishable 
from other global writeable pages. Global page-file backing-store pages are 
initially faulted as global demand zero pages and from then on are indistin- 
guishable from private demand zero pages. 



[Figure 15-5 appears here: Transitions for Pages Located in a Page File.
The states shown are PTE contains Page File Virtual Block Number (PGFLVB);
Read in Progress (REFCNT=2, BAK=PGFLVB); Active and Valid (REFCNT>0,
BAK=PGFLVB); Release Pending (REFCNT>0, BAK=PGFLVB); Modified Page List
(REFCNT=0, BAK=PGFLVB); Write in Progress (REFCNT=1, BAK=new PGFLVB);
and Free Page List (REFCNT=0, BAK=new PGFLVB). Entry into this diagram is
from Figure 15-4.]

15.2.4 Page Located in the Page File 

The transitions that a page faulted from the page file goes through (see Figure 
15-5) are no different from the transitions described for pages that are not 
copy on reference (see Figure 15-3). The only difference in the PFN data be- 
tween the two figures is that the BAK array element in Figure 15-5 indicates 
that the page belongs in the page file. The BAK array element in Figure 15-3 
contains a process section table index. 

The other difference between the two figures is the entry point into the 
transition diagram. Pages can start out in an image file (PTE contains PSTX) 
but pages can never start out in a page file. The entry into Figure 15-5 is from 
Figure 15-4, from one of three initial states that eventually result in the phys- 
ical page contents being written to the page file. 



15.3 PAGE FAULTS FOR GLOBAL PAGES 

The page fault resolution for global pages can be described in exactly the 
same way as process private pages are described. Following the transition of a 
global page table entry and its associated PFN database entries adds nothing 
to the information already presented in Figure 15-3. 

A more interesting approach is to look at the interaction of the process 
page table entries and the global page table entries that they point to. The 
following discussion uses a specific example rather than a general case, to 
allow specific numbers to be used. 



15.3.1 Page Fault for Global Read-Only Page 

Figure 15-6 illustrates the transitions that occur for a global read-only page 
that is mapped by two processes. The mapping is shown separately from the 
operation of section creation to simplify the figure. A second simplification 
in the figure is that the page is assumed to be read only. The implications of 
a read/write global page are described in the next section without the benefit 
of a figure. 

(START) 

When the global section is initially created, the data structures described 
in the previous chapter are all set up. The global page table entry for the 
page we will follow contains a global section table index, which locates 
the global section table entry containing information about the global 
image file. 

(1) When Process A maps to the section, the process page table entry con-
tains a global page table index, effectively a pointer to the global page 
table entry. 



[Figure 15-6 appears here: Example of Page Transitions Made by a Global
Page Mapped by Two Processes. Columns track Process A's PTE and Process
B's PTE (GPTX, then valid), the GPTE (GSTX, transition, then valid), and
the PFN data (Read in Progress with REFCNT=2, SHRCNT=1, BAK=GSTX, PTE
array pointing to the GPTE; Active and Valid with SHRCNT rising to 2 and
falling again; Free Page List with REFCNT=0, SHRCNT=0) through the
numbered steps described in the text.]

(2) When Process B maps to the section, its page table entry contains exactly 
the same global page table index as found in Process A's PTE. 

(3) Process B happens to incur a page fault on this global page first. Several 
things happen. 

a. The pager notes that the process PTE contains a global page table index 
(GPTX). This index is used to locate the global page table entry 
(GPTE). 

b. The GPTE contains a global section table index (GSTX), indicating 
that the global page resides on disk somewhere. Exactly the same 
things are done to initiate the read here as in the case of a process 
private page. 

c. A physical page is allocated. 

d. The state of that page is set to read in progress. 

e. The reference count is incremented. 

f. The BAK array element is loaded with the GSTX. 

g. Note that the PFN PTE array element is loaded with the address of the 
GPTE, not the address of the process PTE. Note also that, while the 
read is in progress, the GPTE contains the transition PTE but the proc- 
ess PTE still contains the GPTX. 

h. The reference count is two, one for the read in progress and one for 
recording the fact that the page is in some process working set (the 
global share count is nonzero). The global share count array element 
contains a one while the read is in progress. 

(4) Several steps are taken when the read completes. 

a. The state of the page is changed to active and valid. 

b. The global page table entry is set to valid, to record the fact that this 
page is in some process working set. 

c. The process page table entry, located through its address stored in the 
I/O request packet, is set up to contain the low-order 21 bits from the 
global page table entry, with the valid bit set and bits 21 and 26 
cleared. 

d. The reference count and share count are both one at this point. 

(5) When Process A faults the same global page, the initial pager action is the 
same as it was in Step 3, because the page table entry is again a global 
page table index. Now, however, the pager finds a valid GPTE. Resolution 
of this page fault is simple. 

A working set list entry is created for Process A. The global page table entry
is simply copied to Process A's page table. The share count is incre- 
mented, and the fault is dismissed. 

(6) When the global page is removed from Process B's working set, the share
count is decremented. Because the share count is still positive, nothing
dramatic happens to the physical page. 

At this time, Process B's page table entry must be restored to its previ- 
ous state. (The page table entry does not assume some transition form.) 
The PTE array element contains the address of the global page table entry 
so the global page table index must be recalculated. 

The calculation is straightforward. The contents of MMG$GL_ 
GPTBASE are subtracted from the PTE array element, the result is di- 
vided by four (to create a longword index), and the quotient stored in the 
process page table entry in the GPTX field. 
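The recalculation can be written out directly. In this C sketch, gpt_base
stands for the contents of MMG$GL_GPTBASE and pte_element for the contents
of the PFN PTE array element; the divisor of four is the size in bytes of a
longword page table entry:

    /* Sketch of recomputing a global page table index (GPTX) from the
       address stored in the PFN PTE array element. */
    unsigned compute_gptx(const char *pte_element, const char *gpt_base)
    {
        /* byte offset into the global page table, divided by the
           size of a page table entry (one longword = 4 bytes) */
        return (unsigned)(pte_element - gpt_base) / 4;
    }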

(7) When the global page is removed from Process A's working set, the proc- 
ess page table entry is restored as described in Step 6. 

The share count is decremented. Now the share count reaches zero, so 
the reference count is also decremented. If the page is unmodified and 
there is no outstanding I/O, the physical page is placed on the free page 
list. 

The GPTE contains a transition PTE. The STATE array element indi- 
cates the free page list. The other PFN array elements are unchanged. 

(8) When the physical page is reused, the ties must be broken between the 
physical page and, in this case, the global page table entry. (None of the 
processes mapped to this page are affected in any way by this step.) 

The contents of the BAK array element (a GSTX) are inserted into the 
GPTE located by the contents of the PFN PTE array element. The PFN 
PTE array element is then cleared, breaking the connection between the 
physical page and the global page table. 

These steps put the process and global page tables back to the state they 
were in following Step 2 (although it is pictured here as a different state to 
make the figure simpler). 



15.3.2 Global Read/Write Pages 

The transitions that occur for global writeable pages are no different from the 
transitions for a process private page that is not copy on reference. The only 
difference between such transitions and the transitions illustrated in Figure 
15-3 is that the global page table entry, not the process page table entry, is 
affected by the transitions of the physical page. 

The process page table entry for global pages contains a global page table 
index up until the time that the page is made valid. Only then is a PFN 
inserted into the process PTE. As soon as the page is removed from the proc- 
ess working set, the GPTX is placed back into the process PTE. All ties to the 
PFN database are made through the global page table entry, which retains the 
PFN while the physical page is in the various transition states. 




15.3.3 Global Copy-on-Reference Pages 

The global pages previously described are all shared pages. One form of global 
page is shared only in its initial state. As soon as the fault occurs, the page is 
treated exactly like a process private page. 

These pages are global copy-on-reference pages and commonly occur in 
shareable images that contain impure data areas. For example, all of the local 
variables in a FORTRAN shareable image would be in a global copy-on-refer- 
ence section. Each process that uses the image would get its own private copy 
of the local variables, but all processes would get the same initial values for 
the variables. 

Figure 15-7 illustrates the transitions that occur for a global copy-on-refer- 
ence page. 



(1) The initial conditions are identical to those used in Figure 15-6. The sec-
tion is created and the GPTEs contain a GSTX, although here the copy- 
on-reference bit is set. 

(2) Process A maps the page and has its PTE set to contain a GPTX. 

(3) Process B maps the page and gets the same GPTX in its PTE. Up to this 
point nothing is different from Figure 15-6. 

(4) Now when Process B incurs a page fault, the pager follows the GPTX to 
the GPTE, noting that the page is located in a global image file and is 
copy on reference. A read is initiated and the following modifications are 
made to the process PTE and the PFN database. 

a. The global page table entry is not touched. It retains its GSTX con- 
tents. 

b. The process page table entry is set to a transition PTE. 

c. The state of the physical page is set to read in progress. 

d. The BAK array element contains a page file index (with no block allo- 
cated yet). 

e. The PTE array element contains the address of Process B's PTE. 

Note that all ties between Process B and the global section are broken. 
The page is now treated exactly like a private copy-on-reference page. The 
two boxes outlined for Process B in Figure 15-7 are the boxes within the 
dashed outline in Figure 15-4. 

(5) When Process A faults the same page, exactly the same steps are taken, 
this time with a totally different physical page. 

Thus, both Process A and Process B get exactly the same initial copy of 
the global page from the global image file, but, from that point on, each 
process has its own private copy of the page to modify as it wishes. 



[Figure 15-7 appears here: Example of Page Transitions for Global
Copy-on-Reference Pages. Columns track Process A's and Process B's PTEs
(GPTX, then transition with the page in the working set and the saved
modify bit set) and the GPTE (GSTX with the CRF bit set throughout). The
PFN data show Read in Progress (REFCNT=2, BAK=PGFLX,0) with the PTE array
element pointing to the faulting process's own page table entry; each
process's page then continues in Figure 15-4.]

15.3.4 Global Page-File Backing-Store Pages 

Global page-file backing-store pages provide a means by which processes can 
share global pages without requiring a file for backing store. By their nature 
these pages have no initial contents, and are thus initialized as demand zero 
pages. 

Figure 15-8 illustrates the transitions that occur for a global page-file back- 
ing-store page. 

(1) The initial conditions are identical to those used in Figure 15-6. The sec-
tion is created and the GPTEs contain a zero in the PFN field. 



[Figure 15-8 appears here: Example of Page Transitions for Global Page-File
Backing-Store Pages. Columns track Process A's and Process B's PTEs (GPTX,
then valid with the modify bit set) and the GPTE (zero, then valid). The
PFN data show Active and Valid (REFCNT>0, BAK=PGFLX,0), with the same PFN
in both processes' PTEs; further transitions continue in Figure 15-4.]

(2) Process A maps the page and has its PTE set to contain a GPTX.

(3) Process B maps the page and has its PTE set to contain a GPTX.

(4) When Process B incurs a page fault, the pager follows the GPTX to the
GPTE and notes that the GPTE is demand zero. The following modifica-
tions are made to the PTEs and to the PFN database.

a. An entry in the PFN database is allocated. 

b. The PTE array element in the PFN database points to the GPTE. 

c. The BAK array element in the PFN database contains the system page 
file index (with no block allocated). 




d. The new PFN is stored in the GPTE. 

e. The valid bit is set in the GPTE. 

f. The PFN is inserted into Process B's PTE and the valid bit is set.

(5) When Process A incurs a fault on the page, the pager follows the GPTX to 
the GPTE and finds that the GPTE is valid. The valid GPTE is copied to 
Process A's PTE. 

Transitions for a global page-file backing-store page are no different from the 
transitions for a page located in a page file (see Figure 15-5). However, for
global page-file backing-store pages, the GPTE, not the process PTE, is af-
fected by the transitions that the physical page makes. Once the global page 
is removed from the working set, the process PTE reverts to the GPTX form. 

15.4 WORKING SET REPLACEMENT 

The working set list replacement algorithm that the VMS executive uses is a 
modified first-in/first-out scheme. The page that has been in the working set 
list for the longest time is the one first considered for replacement. 

15.4.1 Scan of Working Set List 

When the pager needs an empty working set list entry, it calls routine
MMG$FREWSLE. This routine manipulates the working set list (see Figure
14-4) in the following fashion (a sketch in C follows the list):

1. If the WSLE indexed by PHD$W_WSNEXT is already available (contents 
are zero), that entry is used. (For details on checks that are made before a 
page is used, see Section 15.4.3.) 

2. If not, the WSNEXT pointer is incremented. If the WSNEXT pointer ex- 
ceeds the end of the list (WSLAST), it is reset to the beginning of the 
dynamic working set list (WSDYN), thus implementing the working set 
list as a circular buffer. 

3. If the newly indexed WSLE is available, then it is simply used. (Again, see 
the checks made before it can be used.) 

4. If the new WSLE is locked into the dynamic portion of the working set list,
that entry is skipped (which means going back to Step 2). Only process
page table pages can be locked into the dynamic portion of the working set
list. Pages locked by user request result in a shuffling of the working set
list (see Chapters 14 and 16).
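The following C fragment restates the circular scan under simplifying
assumptions: a zero WSLE is available, a LOCKED flag marks page table pages
locked into the dynamic region, and wsdyn, wslast, and wsnext stand for the
corresponding process header fields. It illustrates the algorithm above and
is not the actual MMG$FREWSLE code.

    #define LOCKED 0x1   /* illustrative flag for a locked WSLE */

    /* Returns the index of the next WSLE to use; *must_evict is set
       when the chosen entry is occupied and its page must first be
       removed from the working set. */
    int find_wsle(unsigned wsl[], int wsdyn, int wslast, int *wsnext,
                  int *must_evict)
    {
        if (wsl[*wsnext] == 0) {            /* step 1: entry available   */
            *must_evict = 0;
            return *wsnext;
        }
        for (;;) {
            if (++*wsnext > wslast)         /* step 2: wrap the circular */
                *wsnext = wsdyn;            /* buffer at WSLAST          */
            if (wsl[*wsnext] & LOCKED)      /* step 4: skip locked page  */
                continue;                   /* table pages               */
            *must_evict = (wsl[*wsnext] != 0);  /* step 3: free entry,   */
            return *wsnext;                     /* or reuse this one     */
        }
    }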

15.4.2 Reusing Working Set List Entries 

Dropping through the previous checks indicates that the virtual page indi- 
cated by the WSLE must be removed before this WSLE can be reused. If
working set list skipping (described in Section 15.4.4) is disabled, the working
set list entry is reused, whatever its state.

For global pages, the share count is decremented. If the share count goes to 
zero, the reference count is decremented. 

For process private pages, the reference count is decremented. If the page is 
placed into a transition state, the balance slot reference count for this process 
header is incremented to prevent the outswap of the process header. 



15.4.3 Using an Available Entry in the Working Set List 

If an available WSLE is found, checks must be made to see if the page can be 
added to the working set. If there are fewer pages in the working set than are 
indicated by WSQUOTA, a new physical page can always be added to the 
working set. It may also be possible to add physical pages to the working set 
list above WSQUOTA (up to WSEXTENT), depending on the size of the free 
page list. 

The following checks are made before an available working set entry can be
used (the growth decision is sketched in C after the list):

1. If the size of the working set (process page count plus global page count) 
equals the size of the working set list (WSSIZE), the next WSLE is reused. 
(In other words, the working set is full.) 

2. If the WSNEXT pointer exceeds the end of the list (WSLAST), WSNEXT is 
reset to the beginning of the dynamic working set list. If an available 
WSLE is found at the end of the list, and if the working set is full, WSLAST 
is reset to point to the last unavailable (nonzero) WSLE in the working set 
list. In other words, the working set list is shrunk if it contains more 
entries than the size of the working set will allow. 

3. If the working set is not full, the size of the working set is compared to 
WSQUOTA. If the size of the working set is less than WSQUOTA, a new 
page is allowed in the working set. 

4. If there are more than WSQUOTA pages in use, the number of pages on 
the free page list is compared to the SYSBOOT parameter GROWLIM. If 
there are more than GROWLIM pages on the free page list, a new page is 
allowed in the working set. 

Note that in order to extend the working set above WSQUOTA, the 
working set list itself must have been extended above WSQUOTA. To 
extend the working set list above WSQUOTA, the free page list must con- 
tain more than the SYSBOOT parameter BORROWLIM pages. For more 
information on BORROWLIM and automatic working set adjustment, see 
Section 16.4.1.3.

5. If there are fewer than GROWLIM pages on the free page list, the next 
WSLE in the working set list is reused. Again, if the WSNEXT pointer
exceeds the end of the list, the pointer is reset to the beginning of the list
and WSLAST is shrunk back over available entries at the end of the list (as 
in Step 2). 
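Steps 3 through 5 amount to a simple decision, sketched here in C with
illustrative names for the quantities involved:

    /* May a new page be added to the working set?  ws_count is the
       process plus global page count; growlim is the SYSBOOT
       parameter GROWLIM. */
    int may_add_page(int ws_count, int wsquota,
                     int free_page_count, int growlim)
    {
        if (ws_count < wsquota)      /* step 3: below WSQUOTA */
            return 1;
        /* steps 4 and 5: above quota, grow only if the free page
           list holds more than GROWLIM pages */
        return free_page_count > growlim;
    }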

15.4.4 Skipping Working Set List Entries 

The special SYSBOOT parameter TBSKIPWSL (which has a default value of 
eight) is used by the working set removal routine to permit frequently refer- 
enced pages to remain in the working set, thereby allowing the operating 
system to modify its strict first-in/first-out page replacement algorithm with 
some frequency of use information. 

The modified algorithm works in the following manner. Before a WSLE can 
be reused, a check is made to see if the virtual address contained in that 
WSLE is still valid in the translation buffer. If the virtual address is valid, the 
search for an available WSLE starts again with the next WSLE. After 
TBSKIPWSL WSLEs have been skipped in this manner, the translation buffer 
checks are abandoned and the next WSLE is simply reused. If the value of 
TBSKIPWSL is set to zero, no entries are checked in the translation buffer and 
the scheme is defeated. 

The following pages in the working set are skipped over in this scan: 

• Pages that are valid in the translation buffer 

• Pages that are locked in the working set 
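A minimal sketch of the skipping algorithm follows; tb_valid is a
hypothetical stand-in for the translation buffer probe, and the simple
modulo wrap stands for the circular scan described earlier:

    /* Skip up to tbskipwsl entries whose virtual addresses are still
       valid in the translation buffer; then simply reuse the next one.
       With tbskipwsl set to zero, no entries are checked at all. */
    extern int tb_valid(unsigned va);   /* hypothetical TB probe */

    int next_wsle_with_skip(unsigned wsle_va[], int candidate,
                            int nwsle, int tbskipwsl)
    {
        int skipped = 0;
        while (skipped < tbskipwsl && tb_valid(wsle_va[candidate])) {
            candidate = (candidate + 1) % nwsle;  /* frequently used:   */
            skipped++;                            /* leave it in the    */
        }                                         /* working set        */
        return candidate;
    }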



15.5 INPUT AND OUTPUT THAT SUPPORT PAGING 

There is very little special-purpose code in the I/O subsystem to support 
pager I/O and swapper I/O. The pager and swapper each build their own I/O 
request packets, but these packets are queued to the device driver in the
normal fashion. The only differences are the following:

• Module SYSQIOREQ contains special entry points for pager and swapper 
I/O that insert special I/O function codes into the I/O request packet. 

• These codes are detected by the I/O postprocessing service routine. There 
are special completion paths for page read (the process is removed from 
PFW state and made computable) and for other forms of I/O (the address of 
a special kernel mode AST stored in the IRP$L_ASTPRM field is used to no-
tify modified page writer or swapper that I/O has completed). 

In order to make reading and writing as efficient as possible, the pager 
supports a feature called clustering, where it checks to see whether pages 
adjacent to the virtual page that it is reading are located in the same file in 
adjacent virtual blocks. If so, a multiple block read is issued and several 
pages are brought into the working set at one time. 

The modified page writer and the Update Section system service also
cluster their write operations, both to make their writes as efficient as
possible and to allow subsequent clustered reads for the pages that are 
being written. 

15.5.1 Page Reads and Clustering 

When the pager determines that a read is required to satisfy a page fault, it 
allocates an I/O request packet and fills it with parameters that describe the 
read. Table 15-1 lists those fields that are used for special purposes by the 
pager. 

The pager attempts to create a cluster of pages to read. The manner in 
which this cluster is formed depends on the initial state of the faulting page 
table entry. 

15.5.1.1 Terminating Conditions for Clustered Reads. The pager scans PTEs that map
larger virtual addresses, checking for more virtual pages that are located in 
the same backing-store location, until the desired cluster size is reached or 
until one of the following other terminating conditions is reached: 

• A page table entry different from the original faulting PTE is encountered. 

• The page table page is itself not valid. (Satisfying this fault would offset the 
benefits gained by clustering.) 

• No more working set list entries are available. (Each page in the cluster is 
added to the working set.) 

• No physical page is available. 

If, after scanning the adjacent page table entries toward higher virtual ad- 
dresses, no pages have been clustered, the process is repeated toward lower 
virtual addresses with the same terminating conditions. The scan is made 
initially toward higher virtual addresses because programs typically execute 
sequentially toward higher virtual addresses and these pages are likely to be 
needed soon. If the forward attempt fails, the pager attempts to read pages 
adjacent to the faulting page on the assumption that even pages at lower 
virtual addresses but near the faulting page are likely to be needed soon. 
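The scan order just described might be sketched as follows; ptes_match
stands for the form-dependent matching rules of Section 15.5.1.2 and
resources_ok for the working set list entry and physical page checks. It
illustrates the forward-then-backward order, not the pager's actual code.

    extern int ptes_match(int fault_pte, int candidate);
    extern int resources_ok(void);

    /* Build a cluster of up to max_pfc pages around fault_pte; returns
       the cluster size and sets *start to its first page. */
    int build_read_cluster(int fault_pte, int max_pfc, int *start)
    {
        int count = 1, i;

        *start = fault_pte;
        /* scan toward higher virtual addresses first */
        for (i = fault_pte + 1;
             count < max_pfc && ptes_match(fault_pte, i) && resources_ok();
             i++)
            count++;
        if (count > 1)
            return count;

        /* nothing clustered forward: repeat toward lower addresses */
        for (i = fault_pte - 1;
             count < max_pfc && ptes_match(fault_pte, i) && resources_ok();
             i--) {
            count++;
            *start = i;
        }
        return count;
    }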

15.5.1.2 Matching Conditions While Scanning Page Table. The match that is looked 
for when scanning the adjacent page table entries depends on the form of the 
initial page table entry. 

• If the original PTE contains a process section table index, successive PTEs 
must contain exactly the same PSTX. 

• If the original PTE contains a page file virtual block number, successive 
PTEs must contain PTEs with successively increasing (or decreasing) vir- 
tual block numbers. 



• If the original page table entry contains a global page table index, succes-
sive PTEs must contain successively increasing (or decreasing) indexes. In
addition, the global page table entries must all contain exactly the same
global section table index.



Table 15-1
Description of I/O Requests Issued by Memory Management

Process Page Read: 1. Page in Image File (1); 2. Page in Page File;
3. Page Table Page
    Priority (IRP$B_PRI):           priority of faulting process
    Process ID (IRP$L_PID):         PID of faulting process
    SVA of PTE (IRP$L_SVAPTE):      1. P0PT/P1PT  2. P0PT/P1PT  3. SPT
    AST Address (IRP$L_AST):        1. 0  2. 0  3. 0
    AST Parameter (IRP$L_ASTPRM):   1a. 0  1b. PSTX  2. 0  3. 0
    Window Block (IRP$L_WIND):      1. from PSTE  2. from PFL  3. from PFL (5)
    Cluster Factor:                 1. pfc/PFCDEFAULT (6)  2. PFCDEFAULT
                                    3. PAGTBLPFC
    Priority Boost at Completion:   Class = 0, Boost = 0

System Page Read: 1. System Page (2); 2. Global Page; 3. Global CRF Page;
4. Global Page Table Page
    Priority (IRP$B_PRI):           priority (16) of the "system" process
    Process ID (IRP$L_PID):         PID of the "system" process
    SVA of PTE (IRP$L_SVAPTE):      1. SPT  2. GPT  3. process page table
                                    4. SPT
    AST Address (IRP$L_AST):        1. 0  2. slave PTE address (<0)
                                    3. master PTE contents (>0)  4. 0
    AST Parameter (IRP$L_ASTPRM):   1. 0  2. 0
                                    3. GSTX (PFN$V_GBLBAK is set)  4. 0
    Window Block (IRP$L_WIND):      1a. from SSTE  1b. from PFL  2. from GSTE
                                    3. from GSTE  4. from PFL (5)
    Cluster Factor:                 1a. SYSPFC  1b. PFCDEFAULT
                                    2. pfc/PFCDEFAULT (6)
                                    3. pfc/PFCDEFAULT (6)  4. 1
    Priority Boost at Completion:   Class = 0, Boost = 0

Modified Page Write: 1. To Page File; 2. To Image File (3);
3. To Swap File (SWPVBN nonzero)
    Priority (IRP$B_PRI):           MPW_PRIO
    Process ID (IRP$L_PID):         PID of modified page writer
                                    (PID of swapper)
    SVA of PTE (IRP$L_SVAPTE):      points to modified page writer's map
    AST Parameter (IRP$L_ASTPRM):   address of MPW's special kernel mode AST
                                    (WRITEDONE)
    Window Block (IRP$L_WIND):      1. from PFL  2a. from PSTE  2b. from GSTE
                                    3. from SFTE
    Cluster Factor:                 1. MPW_WRTCLUSTER  2. MPW_WRTCLUSTER
                                    3. 1
    Priority Boost at Completion:   none (7)

Update Section Page Write (4): a. Process Page Table; b. Global Page Table
    Priority (IRP$B_PRI):           priority of caller
    Process ID (IRP$L_PID):         PID of caller
    SVA of PTE (IRP$L_SVAPTE):      a. process page table  b. global page table
    AST Address (IRP$L_AST):        AST address (if specified)
    AST Parameter (IRP$L_ASTPRM):   AST parameter (if specified)
    Window Block (IRP$L_WIND):      a. from PSTE  b. from GSTE
    Cluster Factor:                 MPW_WRTCLUSTER
    Priority Boost at Completion:   Class = 1, Boost = 2

Swapper I/O
    Priority (IRP$B_PRI):           SWP_PRIO
    Process ID (IRP$L_PID):         PID of swapper
    SVA of PTE (IRP$L_SVAPTE):      points to swapper map
    AST Parameter (IRP$L_ASTPRM):   swapper's kernel mode AST (IODONE)
    Window Block (IRP$L_WIND):      SFTE
    Cluster Factor:                 not applicable
    Priority Boost at Completion:   none (7)

(1) One field in the I/O request packet (IRP$L_ASTPRM) for page reads from a
private section is sensitive to whether the section is copy on reference.
These two cases are distinguished as:
    a. Not Copy on Reference
    b. Copy on Reference

(2) Pageable executive routines originate in one of three image files
(SYS.EXE, RMS.EXE, and SYSMSG.EXE) described by three system section table
entries (SSTE) located in the system header. The static executive data is
all located in the nonpaged executive. The only pageable writeable data is
the paged pool area, which starts out as a series of demand zero pages.
Paged pool pages are written to and subsequently faulted from the page file.
These two cases are distinguished as:
    a. Pageable executive routines
    b. Paged pool pages

(3) The modified page writer takes special note of whether pages that are
written back to an image file are part of a
    a. Private section
    b. Global section

(4) In a similar manner, the Update Section system service behaves
differently depending on whether the pages are part of a
    a. Private section
    b. Global section

(5) Process page tables and global page tables originate as demand zero
pages that are written to and faulted from the page file.

(6) The cluster factor for a private section or a global section can be
specified at link time or when the section is mapped by explicitly declaring
a cluster factor (pfc). In the absence of such a specification, the pager
uses the default system cluster factor determined by the SYSBOOT parameter
PFCDEFAULT.

(7) The swapper (and by implication the modified page writer) is a real-time
process and is therefore not subject to priority boosts.




15.5.1.3 Maximum Cluster Size for Page Read. The maximum number of pages that 
can be in a cluster is determined in several ways, depending on the type of 
page being read. 

• Global page table pages are not clustered. 

• The cluster factor for process page table pages is taken from offset 
PHD$B_PGTBPFC in the fixed portion of the process header. Unless some 
user-written kernel mode routine has modified this field, the value of this 
field is taken from the special SYSBOOT parameter PAGTBLPFC for all 
processes in the system. The default value for this parameter is two. This 
value is chosen to avoid an artificial end to building a cluster when the 
page table page also had to be faulted. Two page table pages are guaranteed 
to span 127 pages, regardless of the initial faulting virtual address. Decreas- 
ing this value may defeat clustered reads. Increasing it above two is likely 
to have negligible effect in most systems. 

• The cluster factor for page file pages is taken from the PFL$B_PFC field of 
the page file control block (see Figure 14-22). The usual contents of this 
field are zero. In that case the cluster factor is taken from the 
PHD$B_DFPFC field of the process header. In the absence of user-written 
modification, the value placed into this field is the SYSBOOT parameter 
PFCDEFAULT. 

• The cluster factor for process or global sections is taken from the 
SEC$B_PFC field of the process or global section table entry (see Figures 
14-7 and 14-16). These fields usually contain values of zero, in which case 
the default page fault cluster is used. (Just as for clustered reads from the 
page file, this default is taken from the PHD$B_DFPFC field in the process 
header. The value of this field is usually equal to the PFCDEFAULT SYS- 
BOOT parameter.) 

There are two methods available to the user to control the cluster factor 
of process or global sections. By including the following line in the linker 
options file, the page fault cluster factor in the image section descriptor 
can be set to nonzero contents: 

CLUSTER = cluster-name, [base-address], [pfc], [file-spec,...]

Sections that are mapped by the user (with a Create and Map [Private or 
Global] Section system service) can have their page fault cluster factor 
specified by including the optional PFC argument in the system service 
call. 
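For example, a linker options file might contain a line such as the
following, in which the cluster name, the cluster factor of 32, and the
object file name are purely illustrative:

    CLUSTER = MYCLUS, , 32, MAIN.OBJ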

15.5.1.4 Page Read Completion. The page read completion is detected by the I/O post- 
processing routine (IPL 4 software interrupt service routine) by the special 
code inserted in the IRP before the request was queued. 




Page read completion is not reported to the faulting process in the normal 
fashion with a special kernel mode AST because none of the postprocessing 
has to be performed in the context of the faulting process. Instead, the work is 
done by this service routine and the process made computable by reporting a 
page read completion event to the scheduler. 

The details that the service routine takes care of when a page read success-
fully completes include the following steps for each page (sketched in C
below):

1. The reference count is decremented, indicating that the read in progress
has completed. 

2. The physical page state is set to active and valid. 

3. The valid bit in the page table entry is set. 

4. If the page is a global page, the valid bit set in Step 3 was in the global page 
table entry. In this case, the process (slave) PTE must be loaded with the 
PFN and made valid. 

After the individual pages have been tended to, the scheduler is notified that 
a page read has completed (by reporting a page fault completion event with a 
null priority increment) so that the process that was put into a page fault wait 
state when the read was initiated can be made computable. (If any of the 
pages just read were collided pages, the collided page wait queue is also emp- 
tied. That is, all processes in that state are made computable. Collided pages 
are discussed in Section 15.6.3.) 
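The per-page work in steps 1 through 4 can be summarized in the following C
sketch; the arrays stand for PFN database elements, and the valid bit
position is written symbolically (PTE<31> on the VAX):

    #define PTE_VALID        0x80000000u  /* PTE<31>, the valid bit   */
    #define ACTIVE_AND_VALID 1            /* illustrative STATE code  */

    /* Per-page processing when a page read completes successfully. */
    void page_read_done(int refcnt[], int state[], unsigned pfn,
                        unsigned *pte, unsigned *slave_pte)
    {
        refcnt[pfn]--;                     /* step 1: read has finished */
        state[pfn] = ACTIVE_AND_VALID;     /* step 2                    */
        *pte |= PTE_VALID;                 /* step 3: set the valid bit */
        if (slave_pte != 0)                /* step 4: global page; load */
            *slave_pte = pfn | PTE_VALID;  /* PFN into the slave PTE    */
    }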



15.5.2 Modified Page Writing 

The modified page writer (a subroutine of the SWAPPER process) also at- 
tempts to cluster when writing modified pages to their backing store ad- 
dresses. There are not so many special cases here as there are in the page read 
situation. The three different cases encountered by the modified page writer 
depend on the three possible backing store locations that pages on the modi- 
fied page list can have. 

15.5.2.1 Operation of the Modified Page Writer. The modified page writer proceeds in 
approximately the following fashion (a sketch in C follows the list):

1. The first page is removed from the modified page list. Its page table entry 
address is retrieved from the PFN PTE array. 

2. Adjacent page table entries are scanned (first toward lower virtual ad- 
dresses and then toward higher virtual addresses) to look for transition 
page table entries that map pages on the modified page list either until the 
desired cluster size is reached or until one of the other terminating condi- 
tions is reached. 

This scan begins first toward smaller virtual addresses for the same rea- 
son that the read cluster routine begins toward larger addresses. If the
program is more likely to reference higher addresses, the modified page
writer does not want to initiate a write operation, only to have the page 
immediately faulted (and likely modified again). The modified page writer 
chooses to first write those pages with a smaller likelihood of being refer- 
enced in the near future. 

3. The write is initiated, the state of all of the pages is changed to write in 
progress, and their reference counts are incremented. 

4. The modified page writer returns to the SWAPPER process until notified
by its special kernel mode AST that the modified page write has com- 
pleted. 
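In outline, one pass of this loop might look like the following C sketch,
in which the helper routines are illustrative stand-ins for the operations
described in steps 1 through 3:

    extern unsigned remove_first_modified(void);       /* step 1        */
    extern void    *pfn_to_pte(unsigned pfn);          /* PFN PTE array */
    extern int      scan_for_cluster(void *pte);       /* step 2        */
    extern void     start_write(void *pte, int count); /* step 3        */

    void modified_page_write_pass(void)
    {
        unsigned pfn = remove_first_modified();
        void *pte = pfn_to_pte(pfn);
        int count = scan_for_cluster(pte);  /* lower, then higher VAs  */
        start_write(pte, count);            /* pages become write in   */
                                            /* progress; REFCNTs rise  */
        /* step 4: return to the swapper until the write completion
           AST reports that the write has finished */
    }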

15.5.2.2 Modified Page Write Clustering. The terminating conditions for the scan of 
the page table include the following: 

• The page table page is not valid, implying that there are no transition pages 
in this page table page. The special check is made to avoid an unnecessary 
page fault. 

• The page table entry does not indicate a transition format. 

• The page table entry indicates a page in transition, but the physical page is 
not on the modified page list. 

• The physical page number is greater than the contents of global location 
MMG$GL_MAXPFN. This check avoids pages in shared memory, which 
have no PFN data associated with them. 

• The SWPVBN array element must be zero. Pages with nonzero SWPVBN 
contents are treated in a special way by the modified page writer. 

• If the contents of the BAK array indicate that the backing store location for 
the page is a (private or global) image file, the section index must be the 
same for all pages in the cluster. 

• If the BAK array element indicates that the pages are to be written to the 
page file, the contents of the virtual block number field are ignored. How- 
ever, all pages must contain the same page file index in their BAK array 
elements. 



15.5.2.3 Backing Store Addresses for Modified Pages. There are three different kinds of 
backing store address that the modified page writer encounters as the modi- 
fied page writer removes pages from the modified page list. 

• If the SWPVBN array element is nonzero, this indicates that the process is 
outswapped and this page remained behind, probably due to an outstand- 
ing read request. The modified page writer does not attempt to cluster. 
Instead, a write of a single page to the designated block in the swap file is 
issued. A description of how the SWPVBN array element can be loaded is 
found in Chapter 17, where the entire outswap operation is discussed. 

• If the backing store address is a section, the modified page writer creates a
cluster (up to the value of the SYSBOOT parameter MPW_WRTCLUSTER).
Any of the terminating conditions listed in the previous section will limit 
the size of the cluster. 
• If the backing store address is a page file, adjacent pages bound for the same 
page file are also written at the same time. 

The modified page writer attempts to allocate a number of blocks in the 
page file equal to MPW_WRTCLUSTER. The desired cluster factor is re- 
duced to the number of blocks actually allocated. Section 15.5.2.4 de- 
scribes allocation of space within the page file. 

The actual cluster created for a write to the page file consists of several 
smaller clusters, each one representing a series of virtually contiguous 
pages (see Figure 15-9). 

— The modified page writer creates a cluster of virtually contiguous pages, 
all bound for the same page file. 

— If the desired cluster size has not yet been reached, the modified page 
list is searched until another physical page bound for the same page file 
is found. 

— Pages virtually contiguous to this page form the second minicluster that 
is added to the eventual cluster to be written to the page file. 

— This process continues until either the cluster size is reached or no 
more pages on the modified page list have the designated page file as 
their backing store address. The modified page writer is building a large 
cluster that consists of a series of smaller clusters. The large cluster 
terminates only when the desired size is reached or the modified page 
list contains no more pages bound to the page file in question. Each 
smaller cluster can terminate on any of the conditions listed in the pre- 
vious section, or on the two terminating conditions for the large cluster. 

15.5.2.4 Page File Space Allocation. Before the modified page writer searches for pages 
to write, it must first determine the size of the write cluster. To do this, it 
must determine the number of contiguous blocks in the page file that can be 
allocated. 

When the modified page writer attempts to allocate blocks in the page file, 
it looks for a cluster of blocks that is the current allocation size in length (the 
current allocation size is stored in the page file control block at the offset 
PFL$L_ALLOCSIZ and is usually equal to MPW_WRTCLUSTER). If the de- 
sired number of blocks is not available, the allocation size is reduced by 16 
blocks and the search for contiguous blocks starts again at the beginning of 
the page file. If the page file deallocation routine determines that it has freed 
a large enough cluster, it increases the allocation size by 8 (up to 
MPW_WRTCLUSTER). 
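The adjustment of the allocation size can be sketched as follows; allocsiz
stands for the PFL$L_ALLOCSIZ field, and the constants 16 and 8 come from
the description above:

    /* Called when an allocation request fails: retry with a smaller
       cluster of contiguous blocks. */
    void shrink_alloc_size(int *allocsiz)
    {
        if (*allocsiz > 16)
            *allocsiz -= 16;
    }

    /* Called when deallocation frees a large enough cluster. */
    void grow_alloc_size(int *allocsiz, int mpw_wrtcluster)
    {
        *allocsiz += 8;
        if (*allocsiz > mpw_wrtcluster)    /* never above the write    */
            *allocsiz = mpw_wrtcluster;    /* cluster size             */
    }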

When the allocation size for the page file is less than or equal to 16, a 
special-case allocation routine is called. This special-case allocation routine
searches for and allocates the first available cluster of blocks that it encoun-
ters. The routine can allocate between 1 and 16 contiguous blocks. If the
special-case allocation routine determines that more than 65 percent of the
page file is in use, the following message is issued on the console terminal:

SYSTEM-W-PAGEFRAG, Page file 65% full, system continuing

If the allocation routine determines that more than 90 percent of the page
file is in use, the following message is issued on the console terminal:

SYSTEM-W-PAGECRIT, Page file 90% full, system trying to continue

If you see either of these messages on the console terminal, it is a good indica-
tion that the system requires an(other) alternate page file.

[Figure 15-9 appears here: Example of Clustered Write to a Page File. The
diagram shows the modified page list, the balance slot area (located through
SWP$GL_BALBASE), two processes' page tables with the BAK contents (pgflx,
pstx, gptx) of their transition PTEs, and the modified page writer's map
(MPW$AL_PTE) holding, in order, PFNs H, F, A, E, G, B, J, and D.]

15.5.2.5 Example of Modified Page Write to a Page File. Figure 15-9 illustrates a sample 
cluster for writing to a page file. The modified page list (pictured in the upper 
right-hand corner of the figure) is shown as a sequential array to simplify the 
figure. 

1. The first page on the modified page list is PFN A. By scanning backward, 
first PFN F and then PFN H are located. The PTE preceding the one that 
contains PFN H is also a transition PTE, but the page is on the free page 
list. This page terminates the backward search. 

2. The modified page writer map begins with PFN H, PFN F, and PFN A. The 
search now goes in the forward direction, with each page bound for the 
page file added to the map up to and including PFN E. The next page table 
entry is valid so the first minicluster is terminated. 

3. The next page on the modified page list, PFN B, leads to the addition of a 
second cluster to the map. This cluster begins with PFN G and ends with 
PFN J. The backward search was terminated with a PTE containing a sec- 
tion table index. The forward search terminated with a demand zero PTE. 

Note that this second cluster consists of pages belonging to a different 
process from the first cluster. The difference is reflected in the word array 
element for each PTE in the map that contains a process header vector 
index for each page (see Figure 14-24). 

4. The next page on the modified page list is PFN C. This page belongs in a 
global image file and is skipped over during the current write attempt. 

5. PFN D leads to a third cluster that was terminated in the backward direc- 
tion with a page table entry that contains a global page table index. The 
search in the forward direction terminated when the desired cluster size 
was reached, even though the next PTE was bound to the same page file. 
This size is either MPW_WRTCLUSTER or a number of virtually contigu-
ous blocks available in the page file, whichever is smaller. In any case, this 
cluster will be written with a single write request. 

6. Note that reaching the desired size resulted in leaving some pages on the 
modified page list bound for the same page file, such as PFN I in the figure. 




15.5.2.6 Modified Page Write Completion. The modified page writer is notified that 
the write is complete by a special kernel mode AST (whose address was 
stored in the ASTPRM field of the IRP while the write was in progress). 
Modified page writing is recorded in the IRP as a swap write to allow this 
completion method to be used. For the purposes of the I/O postprocessing 
routine, the only form of page write request is the one issued by the Update 
Section system service. 

This kernel mode AST decrements various reference counts that indicated 
the write in progress. If the reference count is now zero, the pages are placed 
on the free page list. If the number of pages on the modified page list 
(SCH$GL_MFYCNT) is still above the low limit threshold for the modified 
page list (SCH$GL_MFYLOLIM), then the modified page writer removes the 
new first page from the modified page list and starts all over. 



15.5.3 Update Section System Service 

The Update Section system service allows a process to write pages in a sec- 
tion to their backing store addresses in a controlled fashion, without waiting 
for the modified page writer to do the backup. This system service is espe- 
cially useful for frequently accessed pages that may never be written by the 
modified page writer, because they are always being faulted from the modi- 
fied page list back into the working set before they are backed up. 

This system service is a cross between modified page writing and a normal 
write request. Like any Queued I/O request, this service can receive comple- 
tion notification with an event flag, an AST, or through an I/O status block. 
The number of pages written is specified by the address range passed as an 
input parameter to the service. The cluster factor is the minimum of 
MPW_WRTCLUSTER and the number of pages in the input range. The di- 
rection of search for modified pages is determined by the order that the ad- 
dress range is specified to the service. 

15.5.3.1 Page Selection. If the section that is being backed up is a process private 
section, only those pages that have the modified bit set in the page table entry 
(or in the PFN state array for transition pages) are written out. 

If the section is a global section, information about whether a page is
modified is spread across both the PFN database and the page table entries of
all processes mapped to the global page. (The modify bit in the global page
table entry is inaccessible to the hardware and contains no useful
information.) Because there are no back pointers for valid global pages, this
scattered information cannot be collected. Therefore, all pages in a global
section are written to their backing store location, regardless of whether the
pages have been modified.

If the flags parameter passed to Update Section has its low bit set, the set
bit indicates that the caller is the only process capable of modifying the
section. In that case, the process page table entries (and the PFN database)
are used to select candidate pages for backing up, and only modified pages are
written.

15.5.3.2 Write Completion. The process that issued the Update Section system serv- 
ice is first notified about write completion with a special kernel mode AST. 
This AST first checks whether all the pages requested by the original call 
have been written or whether another write is required. If more pages have to 
be written, another cluster is set up and queued. If all requested pages have 
been written, the normal I/O completion path involving event flags, I/O sta- 
tus blocks, and user-requested ASTs is entered, and the process is notified. 



15.6 PAGING AND SCHEDULING 

Page fault handling can influence the scheduling state of processes in several 
different ways. If a read is required to satisfy a page fault, the faulting process 
is placed into a page fault wait state. If a resource such as physical memory or 
page file space is not available, the process is placed into an appropriate wait 
state. There are several other wait states that a process may be placed into as 
a result of a page fault. 



15.6.1 Page Fault Wait State 

The most obvious wait state is page fault wait (PFW), which is required if a 
read is required to resolve the fault. The process that requires the read to 
resolve its page fault is placed into a page fault wait state. The I/O comple- 
tion routine detects that a page read has completed and reports a page fault 
completion event to the scheduler. The scheduler removes the process from 
the page fault wait state and makes it computable. There is no priority incre- 
ment due to page fault read completion so the scheduling decision is made 
based on the process's current priority. 



15.6.2 Free Page Wait State 

If there is not enough physical memory available to satisfy the page fault, the 
process is placed into a free page wait state (FPG). The physical page manager 
(module ALLOCPFN) checks for processes in this state whenever pages are 
added to an empty list. If the free page wait state is not empty, all processes in 
the state are made computable. 

The physical page manager makes no scheduling decision about which
process will get a page. There is no first-in/first-out approach to the free
page wait state. Rather, all waiting processes are made computable. The next
process to execute will be chosen by the scheduler, using the normal
algorithm: the highest priority resident computable process executes next.
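A minimal C sketch of this wake-everyone policy follows; the queue structure
and the make_computable routine are hypothetical stand-ins for the scheduler's
actual data structures.

    struct pcb { struct pcb *next; /* ... scheduling state ... */ };

    extern struct pcb *fpg_wait_queue;           /* head of FPG wait state */
    extern void make_computable(struct pcb *p);  /* report system event    */

    /* Called by the physical page manager when pages are added to an
     * empty free page list.  No first-in/first-out selection is made:
     * every waiter becomes computable, and the normal scheduling
     * algorithm later picks the highest priority resident computable
     * process. */
    void wake_free_page_waiters(void)
    {
        while (fpg_wait_queue != NULL) {
            struct pcb *p = fpg_wait_queue;
            fpg_wait_queue = p->next;
            make_computable(p);
        }
    }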



15.6.3 Collided Page Wait State 

It is possible for a page fault to occur for a page that is already being read
from disk. Such a page is referred to as a collided page. The collided bit (in
the PFN TYPE array) is set, and the process is placed into the collided page
(COLPG) wait state.

One of the details that the page read completion routine checks is the
collided bit in the TYPE array element for the page. If the collided bit is
set, the collided page wait state is emptied. As each waiting process is made
computable, no check is made that the page it is waiting for is the one whose
read just completed.

This lack of check has two advantages. 

• As was the case for free page availability, there is no special code to deter- 
mine which process will get the page first. All processes are made comput- 
able, and the normal scheduling algorithm selects the process that exe- 
cutes next. 

• The probability of a collided page is small. The probability of two different 
collided pages is even smaller. If a process waiting for another collided page 
is selected for execution, that process will incur a page fault and get put 
right back into the collided wait state. Nothing unusual occurs and the 
operating system avoids a lot of special-case code to handle a situation that 
rarely, if ever, occurs. 






16 Memory Management System 
Services 



Confusion now hath made his masterpiece! 

—Macbeth 2,3 

The previous two chapters discussed the data structures used by the memory 
management subsystem to describe physical and virtual memory and the
action of the page fault handler when a reference was made to a page whose
valid bit was not set. This chapter describes the system services available to
the user (and also used internally by the operating system) to allocate these 
structures and initialize their contents. 

1. Some system services create or delete virtual address space within 
the limitations imposed by process quotas and limits and SYSBOOT 
parameters. 

2. Private and global sections can be created that allow the blocks of a file to 
be mapped as a portion of a process address space. Although the section 
services are also associated with the layout of virtual address space, they 
are treated separately because of their added level of complexity. 

3. System services allow users to lock portions of their working sets into 
memory, avoiding the overhead of page faults or allowing portions of code 
to execute at elevated IPL. A process can also disable swapping, preventing 
itself from being removed from memory. 

4. There are other miscellaneous operations associated with the memory 
management available to a process. For example, a process may force the 
contents of all modified pages to be written to their backing store ad- 
dresses (Update Section system service) or purge some or all pages from its 
working set (Purge Working Set system service). 



16.1 DISPATCH METHOD FOR MEMORY MANAGEMENT 

SYSTEM SERVICES 

Almost all of the memory management system services specify a desired 
address range as an input parameter. The page table entries associated with 
these addresses contain an owner field (see Figure 14-3), indicating whether 
the caller of each service can manipulate the pages in the desired fashion. 
Another peculiarity of the memory management system services is that 
many of the services can partially succeed (because they are done on a
page-by-page basis). This partial success is indicated by returning an error
code
combined with the address range over which the operation was completed (in 
the retadr argument). 

A common dispatch method is used by most of the memory management 
system services to reflect the similarity of the services: 

• Information about the specific service, including the input parameters, is 
placed on the stack for later retrieval. 

• Page ownership is checked to insure that a less privileged access mode is 
not attempting to alter the properties of some pages owned by a more 
privileged access mode. 

• The address of a page-by-page routine to accomplish the desired action of 
the original service is placed into R6. 

• A common routine is called that performs general page processing and 
calls the single page service-specific routine for each page in the desired 
range. 

• The address range actually operated on is returned to the caller (if it is 
requested). 
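The same flow can be expressed as a C sketch, with a function pointer playing
the role that R6 plays in the actual VAX MACRO code. All names here are
hypothetical; only the partial success convention (an error status plus the
range actually processed) comes from the services themselves.

    #include <stddef.h>

    #define SS_NORMAL 1                   /* stand-in for SS$_NORMAL */

    typedef int status_t;
    typedef status_t (*page_action)(void *pte, void *context);

    /* Apply the service-specific 'action' to each page in the range,
     * stopping at the first failure and reporting how far the
     * operation got (the idea behind the retadr argument). */
    status_t for_each_page(void *start_pte, void *end_pte, size_t pte_size,
                           page_action action, void *context,
                           void **first_done, void **last_done)
    {
        status_t status = SS_NORMAL;
        char *p;

        *first_done = *last_done = NULL;
        for (p = start_pte; p <= (char *)end_pte; p += pte_size) {
            status = action(p, context);
            if (status != SS_NORMAL)
                break;                    /* partial success */
            if (*first_done == NULL)
                *first_done = p;
            *last_done = p;               /* last page completed */
        }
        return status;
    }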



16.2 VIRTUAL ADDRESS CREATION AND DELETION 

The first level of memory management available to a process is the creation 
or deletion of virtual address space. These services are also used by the sys- 
tem when an image first begins executing (the image activator calls several 
services to create process address space) and as part of image exit (the image 
reset routine deletes all of P0 space and a small part of P1 space). The memory
management performed by the system as part of image activation or process 
deletion is described in Chapter 21. 



16.2.1 Address Space Creation 

Address space creation is essentially a simple operation. A series of demand 
zero pages is created, either at the end of the designated address space (the 
Expand Region [$EXPREG] system service) or in the specified address range 
(the Create Virtual Address Space [$CRETVA] system service). If any pages 
already exist in the requested range, they must be deleted first. 

These two system services can partially succeed. That is, a number of 
pages smaller than the number originally requested may be created. Once the 
specified address range is determined, the demand zero pages are created one 
at a time. It is possible to run into one of the limits on the number of pages 
that can be created after several pages have already been successfully created. 
For this reason, it is especially important for the caller of either $CRETVA or 
$EXPREG to look at the retadr argument to determine whether the service 
($CRETVA or $EXPREG) was partially successful. 
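The following C fragment sketches the recommended check. The two-longword
address range arrays follow the documented service interface; the prototype
and the status value are declared locally for illustration (a real program
would take them from the system-supplied definition files; VAX C permits $ in
identifiers).

    #include <stdio.h>

    #define SS$_NORMAL 1          /* success status, by VMS convention */

    extern int sys$cretva(void *inadr, void *retadr, unsigned acmode);

    int create_pages(char *start, char *end)
    {
        char *inadr[2]  = { start, end };   /* requested address range */
        char *retadr[2] = { 0, 0 };         /* range actually created  */

        int status = sys$cretva(inadr, retadr, 0);
        if (status != SS$_NORMAL) {
            /* Partial success: some demand zero pages may exist even
             * though an error status was returned. */
            fprintf(stderr, "created only %p..%p (status %d)\n",
                    (void *)retadr[0], (void *)retadr[1], status);
        }
        return status;
    }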




16.2.1.1 Limits on Virtual Address Space Creation. There are three limitations on the 
amount of virtual address space that can be created. 

• The SYSBOOT parameter VIRTUALPAGECNT controls the total number
of page table entries (P0 PTEs plus P1 PTEs) that any process can have in its
process header. The division of these pages between P0 space and P1 space
is totally arbitrary and process specific. It is only the sum of P0 and P1
pages that is limited by the SYSBOOT parameter.

• The size of a process working set also controls the size of that process's 
address space. When a process page is valid, the page table page for that 
page is not only valid but also dynamically locked into the working set. For 
small address spaces, the set of valid process pages can be represented by a 
small number of page table pages. 

As the address space grows, the probability that a given page table page 
maps more than one valid process page decreases. (The limiting case, one 
that can usually be reached only with very large process address spaces, 
requires two working set list entries for each valid process page.) In any 
case, there is an implicit limit to the process address space imposed by the 
process working set quotas. 

The specific check that is made is whether the size of the dynamic
working set list can lock down all the page table pages necessary to map
the process address space and still leave enough fluid working set
(PHD$W_WSFLUID), plus the worst case number of page table pages required
to map PHD$W_WSFLUID pages, to allow the process to perform useful
work. The number of page table pages used in this check is the minimum
of PHD$W_WSFLUID and the number of page table pages not already locked
down. If this check fails, the working set list is expanded. If the working
set is at its limit, the virtual address creation fails with the status
SS$_INSFWSL.
• The third constraint on the total size of the process address space is the 
page file quota. Each demand zero page and each copy-on-reference section 
page is charged against the job's page file quota (JIB$L_PGFLCNT). 
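Taken together, the three checks above can be sketched in C. Every variable
name below is a hypothetical stand-in for the quantity described in the list,
and the real tests are made page by page inside the executive.

    extern unsigned total_ptes;       /* current P0 + P1 PTE count        */
    extern unsigned virtualpagecnt;   /* SYSBOOT parameter VIRTUALPAGECNT */
    extern unsigned pgflcnt;          /* remaining page file quota        */
    extern unsigned dynamic_wsl;      /* size of dynamic working set list */
    extern unsigned locked_pt_pages;  /* page table pages already locked  */
    extern unsigned ws_fluid;         /* PHD$W_WSFLUID                    */
    extern unsigned fluid_pt_pages;   /* worst case page table pages      */
                                      /* needed to map the fluid set      */

    int can_create_demand_zero_page(void)
    {
        if (total_ptes + 1 > virtualpagecnt)
            return 0;                 /* SS$_VASFULL                      */
        if (dynamic_wsl < locked_pt_pages + ws_fluid + fluid_pt_pages)
            return 0;                 /* SS$_INSFWSL (after an attempt    */
                                      /* to expand the working set list)  */
        if (pgflcnt == 0)
            return 0;                 /* page file quota exhausted        */
        return 1;
    }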

16.2.1.2 Expand Region System Service. The Expand Region system service is a special 
case of the Create Virtual Address Space system service. The requested num-
ber of pages is simply converted into a P0 or P1 page range and control is
passed to a page creation routine that is common between the two services. 

16.2.1.3 Automatic User Stack Expansion. A special form of P1 space expansion oc-
curs when a request for user stack space exceeds the remaining size of the 
user stack. Such a request can be reported by the hardware as an access viola- 
tion exception or by software when insufficient user stack space is detected. 
(Software detection is done by the AST delivery routine and the Adjust Stack 
system service if the request is for user mode stack space.) 




The routine EXE$EXPANDSTK is called directly by the two software rou- 
tines and invoked by the access violation exception handler if the access 
violation occurred in user mode. This routine checks that a length violation
(as opposed to a protection violation) occurred and that the inaccessible
address is in P1 space. If so, P1 space is expanded from its current low
address end to the specified inaccessible address. For the usual case, one in
which a program requires more user stack space than requested at link time,
the expansion typically occurs one page at a time.
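A C sketch of the decision EXE$EXPANDSTK makes is shown below. The P1
address bounds are those of the VAX architecture; the helper names and the
representation of the fault are hypothetical.

    #define P1_BASE 0x40000000u    /* VAX P1 space: 40000000-7FFFFFFF hex */
    #define P1_END  0x80000000u

    extern unsigned p1_low_end;    /* current low address end of P1 space */
    extern int length_violation;   /* set for length (not protection)     */
                                   /* violations                          */
    extern int create_p1_pages(unsigned low, unsigned high); /* $CRETVA   */
                                                             /* core      */

    int expandstk(unsigned fault_addr)
    {
        /* Only a length violation for a P1 space address qualifies
         * for automatic user stack expansion. */
        if (!length_violation)
            return 0;
        if (fault_addr < P1_BASE || fault_addr >= P1_END)
            return 0;

        /* Expand P1 space downward from its current low address end
         * to include the inaccessible address (typically one page). */
        return create_p1_pages(fault_addr, p1_low_end - 1);
    }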

Because this automatic expansion cannot be disabled on a process-specific
or system-wide basis, a runaway program (one that is using stack space with-
out returning it) will not be aborted until it exceeds the virtual address size
determined by the SYSBOOT parameter VIRTUALPAGECNT (a quota viola-
tion indicated by $CRETVA returning the error status SS$_VASFULL). In
addition, a program that makes a random (and probably incorrect) reference
to an arbitrary P1 address smaller than the top of the user stack will probably
continue to execute (after the creation of many demand zero pages) rather
than exiting with some error status.

If the stack expansion fails for whatever reason (the Create Virtual Address 
system service can fail for several reasons), the process is notified in a way 
that depends on who originally called EXE$EXPANDSTK. 

• The Adjust Stack system service for user mode can fail with several of the 
error codes returned by $CRETVA. 

• An attempt to deliver an AST to a process with insufficient user stack 
space results in an AST delivery stack fault exception being reported to the 
process. (Enough information is removed from the stack by the error rou- 
tine that the exception dispatcher can at least get started in reporting the 
exception.) 

• If the user stack cannot be expanded in response to a P1 space length viola-
tion, then an access violation fault is reported to the process. If there is not 
enough user stack to report the exception, the normal condition handler 
search is bypassed and the exception is reported directly to the last chance 
handler (see Chapter 4). In the default case, this handler causes the cur- 
rently executing image to terminate. 

16.2.2 Address Space Deletion 

For a couple of reasons, page deletion is more complicated than page creation. 

• Creation involves taking the process from one known state (address space 
does not yet exist) to another known state (the page table entries contain 
demand zero PTEs). Page deletion must deal with initial conditions that 
include all the possible states that a virtual page can be in. 

• Page creation may first require that the specified pages be deleted in order
to put the process page tables into their known state. That is, page deletion
is often an integral part of page creation.

16.2.2.1 Delete Virtual Address Space System Service. When a page is deleted, all proc- 
ess and system resources associated with the page must be returned. These 
include the following forms: 

• A page frame for valid and transition pages 

• A page file virtual block for pages whose backing store address indicates an 
already allocated block 

• A working set list entry for a page in the process working set list 

• Page file quota for all pages with a page file backing store address, includ- 
ing pages that have not yet allocated a block in the page file 

Private section pages that are deleted cause the reference count in the process 
section table entry (see Figure 14-7) to be decremented. If the reference count 
goes to zero, the PSTE itself can be released. 

In addition, valid or modified pages with a section backing store address (as 
opposed to a page file backing store address) must have their latest contents 
written back to the section file. (The contents of pages with a page file back- 
ing store address are unimportant after the virtual page is deleted and do not 
have to be saved before the physical page is reused.) 

16.2.2.2 Page Deletion and Scheduling. Pages that have I/O in progress cannot be
deleted until the I/O completes. A process deleting such a page is placed into
a page fault wait state (requesting that a system event be reported when the
I/O completes) until the page read or write completes. Pages in the
write-in-progress transition state have the same effect. Pages in the
read-in-progress transition state are faulted, with the immediate result that
the process is placed into the
collided page wait state. Special action must be taken for global pages with 
I/O in progress because there is no way to determine if the process deleting 
the page is also responsible for the I/O. In such cases, the process is placed 
into a miscellaneous wait state (MWAIT) until its direct I/O completes. (If 
the process has no direct I/O in progress, the problem does not arise in the 
first place, and the deletion is allowed to proceed.) 

Once all reasons for keeping the page around have been taken care of, the 
page is deleted. Deletion of a physical page means that the contents of the 
PFN PTE array are cleared, destroying all ties between the physical page and 
any process virtual address. In addition, the page is placed at the head of the 
free page list, causing it to be used before other pages whose contents are still 
useful. 

16.2.2.3 Contract Region System Service. The Contract Region system service is a
special case of the Delete Virtual Address Space system service. The re-
quested number of pages is simply converted into a P0 or P1 page range and
control is passed to a page deletion routine that is common between the two
services.



16.2.3 Controlled Allocation of Virtual Memory 

There is a second level of memory management available to a process. The 
Run-Time Library procedures LIB$GET_VM and LIB$FREE_VM provide a 
mechanism for allocating small blocks of virtual memory in a controlled 
fashion. Allocation from the free memory pool is performed in much the 
same way as pool space is allocated by the VMS operating system (see Chap- 
ter 3). If there is not a block of memory in the pool large enough to satisfy the 
request, P0 space is expanded (by calling $EXPREG), and the pool is extended
to include the newly created virtual address space. 
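A hedged usage sketch of these Run-Time Library procedures appears below.
The by-reference calling convention and the odd-means-success status
convention are standard; the prototypes are declared locally for illustration
rather than taken from the system definition files.

    extern int lib$get_vm(unsigned *nbytes, void **base);
    extern int lib$free_vm(unsigned *nbytes, void **base);

    int vm_example(void)
    {
        unsigned size = 512;       /* request 512 bytes from the pool */
        void *block;

        int status = lib$get_vm(&size, &block);
        if (status & 1) {          /* odd status value means success  */
            /* ... use the block ... */
            status = lib$free_vm(&size, &block);
        }
        return status;
    }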



16.3 PRIVATE AND GLOBAL SECTIONS 

A second method of creating address space is available. The Create and Map 
Section system service allows a process to associate a portion of its address 
space with a specified portion of a file. The section may be specific to a 
process (private section) or shared among several processes (global section). 
The Map Global Section system service allows a process to map a portion of 
its virtual address space to an already existing global section. These two ser- 
vices are used by the image activator (see Chapter 21) to map portions of 
process address space to either the image file or previously installed global 
sections. 

The Create and Map Section system service also provides two special op- 
tions. Rather than mapping a portion of process address space to a file, a 
suitably privileged process (with PFNMAP privilege) can associate (map) vir- 
tual addresses to specific physical addresses. Global sections can be created 
and mapped in shared memory as well as in local memory. 

16.3.1 Create and Map Section System Service 

The Create and Map Section system service is the system service that per- 
forms all of these operations. (In a sense, the Map Global Section system 
service is a special case of $CRMPSC where the section does not have to be 
created.) The particular path that is taken through the service is determined 
by the contents of the flags argument passed to the service. (The VAX/VMS 
System Services Reference Manual lists those flags that can be used together 
and those that are incompatible.) One way of looking at the action of this 
service is to examine the data structures that are created as a result of exercis- 
ing one of the several options available to it. 




16.3.1.1 Private Section Creation. When a process private section is created, a process 
section table entry (see Figure 14-7) is allocated from the area of the process 
header set aside for PSTEs. The information that associates the virtual ad- 
dress range with virtual blocks in the file is loaded into the PSTE. (When the 
private section is being created as a part of image activation as described in 
Chapter 21, the original source for much of the data stored in the PSTE is an 
image section descriptor contained in the image file.) In addition, each proc- 
ess page table entry in the designated address range is loaded with identical 
contents, namely a process section table index (see Figure 14-3). 

The memory management subsystem cannot take a window turn on pages 
within a section (see Section 19.1.4). Therefore, it requires that all the map- 
ping information for the newly mapped file be available in the window con- 
trol block. If the Create and Map Section system service determines that not 
all mapping information is available, its operations are temporarily sus- 
pended while a request is made to the ACP for all mapping information for 
the file. Because the window control block occupies nonpaged pool, the ex- 
tension of the window control block is charged against the process's BYTLM 
quota. 

Because of the way space is allocated in the process header (see Chapter 26), 
it is possible that the space to hold a section table entry may extend into the 
working set list. When this occurs, the entire process section table can slide 
down into one of the empty pages set aside in the process header for exactly 
this purpose. All references to process section table entries are relative to the 
bottom (high address end) of the table that is located through offset 
PHD$L_PSTBASOFF. That is, the entire structure is position independent. 
Header expansion involves mapping the first empty page, moving the entire 
structure down one page, and changing PHD$L_PSTBASOFF to locate the 
new bottom of the table. 

16.3.1.2 Global Section Creation. The creation of a global section (located in local 
memory) is similar to the creation of a private section except that the data 
structures are located in the system header (see Figures 14-15 and 14-18) in- 
stead of the process header: 

1. A global section descriptor (see Figure 14-14) is allocated from paged dy- 
namic memory and loaded with information that describes the name and 
protection attributes of the section. This data structure is used by subse- 
quent Map Global Section system service calls to determine whether the 
named section exists and to locate the global section table entry in the 
system header that more fully describes the section. 

2. A global section table entry (see Figure 14-16) in the system header (see
Figure 14-15) is the analogous structure to the process section table entry. 

3. A series of global page table entries is created in a virtual extension to the
system header (see Figure 14-17). These page table entries contain infor-
mation that describes the current state of each global page in the section.
They are not available to the memory management hardware but are used 
by the page fault handler when a process incurs a page fault for a global 
page. 
4. A global section can be created and mapped by a single system service call. 
Alternatively, the section can be created in one step and mapped later on 
by either the creating process or by any other process allowed to map the 
section. In any case, mapping to a global section results in no changes to 
the global database. Rather, the process page table has a series of page table 
entries that contain a global page table index (see Figure 14-19) added to 
describe the designated address range. The process page table entries for 
global pages can be in one of two states, either valid or containing the 
appropriate global page table index. 

16.3.1.3 Global Sections in Shared Memory. Global sections that are located in shared 
memory are treated in a slightly different fashion from local memory global 
sections. The sections are created by the Install Utility (INSTALL) after 
shared memory has been initialized. (See Chapter 14 for a description of the 
data structures that describe global sections in shared memory.) Global sec- 
tions in shared memory have the following characteristics: 

1. A special global section descriptor (see Figure 14-27) is created that 
contains, among other things, a list of the physical pages in shared memory 
that will contain the section. The section is temporarily mapped by 
INSTALL and each page of the section is loaded from the image file. 

2. A global section table entry is created only on the CPU that originally 
creates the section. This GSTE allows the initial read to be performed and 
allows subsequent section updates (with SYS$UPDSEC) for writeable sec- 
tions. Pages are also written back to the image file on the creating CPU 
when the section is deleted. 

3. No global page table entries are needed for global sections in shared mem- 
ory because the state of each page is known to be valid. The PFN informa- 
tion necessary to allow processes to map into this section is contained in 
the shared memory GSD. 

4. When a process maps to the shared memory global section, the process 
page table entries are set to valid with the appropriate page frame numbers 
loaded into the PTEs. These pages are not counted against the process 
working set. 

16.3.1.4 Map by PFN. The Create and Map Section system service allows a privileged 
process (one with PFNMAP privilege) to map a portion of its virtual address 
space to specific physical addresses. Although the primary intention of this
service is to allow process address space to be mapped to I/O addresses, it can
also be used to map specific physical memory pages.

When a private PFN-mapped section is created, the only effect is to add a 
series of valid PTEs to the process page table. The PFN fields in these PTEs 
contain the requested physical page numbers. The PTE$V_WINDOW bit in 
the PTE (see Figure 14-3) is set in each PTE to indicate that each of these 
virtual pages is PFN mapped. These pages are not counted against the process 
working set. In addition, no record is maintained in the PFN database that 
such pages are PFN mapped. 

When a global PFN mapped section is created, the only data structure cre- 
ated to describe such a mapping request is a special form of global section 
descriptor (see Figure 14-14). There are no global page table entries nor is 
there a global section table entry. When a process maps to such a section, its
process page table entries are set to valid and mapped by PFN
(PTE$V_WINDOW is set), and the PFN fields are filled in according to the
contents of the extended GSD (see Figure 14-14).



16.3.2 Map Global Section System Service 

The Map Global Section system service can be considered a special case of 
the Create and Map (Global) Section system service, where the global section 
already exists. This service usually has no effect on the global database (other 
than to include the latest mapping in various reference counts). Rather, this 
service allows a range of process addresses to become mapped to the named 
global section. 

The actual effect of this service is to load each of the designated process 
PTEs with a global page table index (see Figures 14-3 and 14-19). These global 
page table indexes are effectively pointers to global page table entries in the 
system header, where the current state of each global page is actually re- 
corded. 

When a process maps to a global section in shared memory or to a section 
that is PFN-mapped, there are no global page table entries to be pointed to. 
Instead, each process page table entry is set to valid with the PFN field con- 
taining a physical page number either in shared memory (for shared memory 
global sections) or anywhere in physical address space (as indicated by the 
extended GSD for PFN-mapped global sections). 



16.3.3 Delete Global Section System Service 

Like the Delete Virtual Address Space system service, the Delete Global Sec- 
tion system service is more complicated than global section creation because 
the section must be reduced from one of many states to nothing. In addition, 
global writeable pages must be written to their backing store addresses before
a global section can be fully deleted. For these reasons, global section
deletion is often separated in time from the system service call.

When the Delete Global Section system service is called, the named sec- 
tion is marked for deletion, which means that the GSD is moved from the 
normal doubly linked GSD list to the delete pending list. The delete pending 
bit in the GSD is set. In addition, the permanent indicator in the GSD is 
turned off. However, the actual section deletion cannot occur until the
reference count in the global section table entry, the count of process page
table entries mapped to the section, goes to zero. Although it is possible for
the reference
count to be zero when the section is marked for deletion, the more typical 
global section deletion occurs as a side effect of virtual address deletion 
(which itself might occur as a result of image exit or process deletion). 

A reference count of zero indicates that no more process page table entries 
are mapped to the section. At that time, the following data structures that
describe the section can be deallocated:

• The global page table entries in the system header are freed for further use. 
If an entire page of global page table entries is freed, that page can be un- 
locked from the system working set. 

• The global section table entry in the system header is removed from the 
active list and placed on the free list of system section table entries for 
possible later use. 

• The global section descriptor is placed on the free list of GSDs. When a 
global section is later created, this list is checked for a GSD before a new 
structure is allocated from paged dynamic memory. 

Global sections in shared memory and PFN-mapped global sections exercise 
some of the same logic when the sections are deleted, but the effects are 
different because not all of the global data structures exist for these special 
global sections. A PFN-mapped section is described entirely by an extended 
global section descriptor (see Figure 14-14). In addition, no reference counts 
are kept for such sections, so the GSD can be placed on the free list of GSDs 
immediately. 

When a shared memory global section is deleted, there are no global page 
table entries to delete. In addition, a global section table entry only exists on 
the port from which the section was created (to allow the section to be loaded 
when it was initially created and to allow the Update Section system service 
or Delete Global Section system service to preserve its contents). 



16.3.4 Update Section System Service 

The Update Section system service requests that a specified range of process 
private or global pages be written to their backing store addresses. When a 
private section is being updated, only those pages that have been modified (as
indicated either by the PTE$V_MODIFY bit in the PTE or by the
PFN$V_MODIFY bit in the PFN STATE array) are written. With global
pages, the modify state of a physical page is the logical OR of the PFN STATE 
array modify bit and the modify bits in all of the process page table entries 
mapped to the section. Because there are no back pointers to all of these 
PTEs, this information is not available. Instead, when a global section is up- 
dated, all pages in the designated address range are written back to the global 
image file. (When the "exclusive writer" flag is passed to the Update Section 
system service, only those pages modified by the caller are written.) The 
interaction between the Update Section system service and the I/O subsys- 
tem is described in Chapter 17. 



16.4 RELATED SYSTEM SERVICES 

Other memory management system services allow a process to control its 
working set, alter page protection, and lock pages into the working set or into 
physical memory. 



16.4.1 Working Set Size Adjustment 

It is possible to make the process working set either larger or smaller, either 
manually with the Adjust Working Set Limit system service or automatically 
as a part of the quantum end routine. When the working set is expanded, new 
pages can be added to the working set without removing already valid entries. 
Adding pages to a process's working set decreases the probability that the 
process will incur a page fault. 

It is unlikely that a program will voluntarily reduce its working set limit, 
unless it has a good understanding of its paging behavior. The system reduces 
a process working set as a part of the automatic working set adjustment. The 
swapper process can shrink a process's working set in an attempt to gain
more pages, before resorting to swapping the process out of memory. In
addition, a process working set limit is reset to its default value as a part of 
the image rundown procedure (see Chapter 21) that is invoked when an 
image exits. Table 16-1 lists the process-specific and system-wide working 
set list parameters. 

Table 16-1: Working Set Lists: Limits and Quotas

Description                                Location or Name  Comments

Beginning of Working Set List              PHD$W_WSLIST      Always has the value 60 (hex)
                                                             (this is PHD$K_LENGTH / 4)

Size of the entire working set             PHD$W_WSSIZE      Set by LOGINOUT, altered by call
                                                             to SYS$ADJWSL or by automatic
                                                             working set adjustment

Beginning of list of permanently           PHD$W_WSLOCK      The same for all processes
locked entries                                               in a given system

Beginning of dynamic portion               PHD$W_WSDYN       Identical to WSLOCK unless this
of working set list                                          process has called SYS$LKWSET
                                                             or SYS$LCKPAG

Index of most recently inserted            PHD$W_WSNEXT      Updated each time an entry
working set list entry                                       is added to the working set

End of current working set list            PHD$W_WSLAST      Updated by calling SYS$ADJWSL,
                                                             by image exit, by pager, or
                                                             by automatic working set
                                                             adjustment

Default working set size                   PHD$W_DFWSCNT     Set by LOGINOUT, altered by
                                                             SET WORKING_SET/LIMIT command

Normal limit to working set size           PHD$W_WSQUOTA     Set by LOGINOUT, altered by
                                                             SET WORKING_SET/QUOTA command

Maximum limit to working set size          PHD$W_WSEXTENT    Set by LOGINOUT, altered by
                                                             SET WORKING_SET/EXTENT command

Upper limit to working set quota           PHD$W_WSAUTH      Set by LOGINOUT, cannot be altered

Upper limit to working set extent          PHD$W_WSAUTHEXT   Set by LOGINOUT, cannot be altered

Lower limit to size of dynamic             PHD$W_WSFLUID     Set up by SHELL, equal to the value
working set                                                  of the MINWSCNT SYSBOOT parameter

Size of dynamic working set after          PHD$W_EXTDYNWS    Updated each time size of dynamic
allowing room for PHD$W_WSFLUID                              working set is changed
process page entries and a reasonable
number of page table pages

Number of pages in use by process          PCB$W_PPGCNT +    Updated each time a page is
                                           PCB$W_GPGCNT      added to or removed from
                                                             the working set

Authorized default working set size        UAF$W_DFWSCNT     Loaded into PHD$W_DFWSCNT

Authorized default working set limit       UAF$W_WSQUOTA     Loaded into both PHD$W_WSQUOTA
                                                             and PHD$W_WSAUTH

Authorized default working set maximum     UAF$W_WSEXTENT    Loaded into both PHD$W_WSEXTENT
                                                             and PHD$W_WSAUTHEXT

System-wide minimum working set size       MINWSCNT          SYSBOOT parameter

System-wide maximum working set size       WSMAX             SYSBOOT parameter

Working set size for system paging         SYSMWCNT          SYSBOOT parameter

Default value for working set size         PQL_DWSDEFAULT    SYSBOOT parameter
default (used by SYS$CREPRC)

Minimum value for working set size         PQL_MWSDEFAULT    SYSBOOT parameter
default (used by SYS$CREPRC)

Default value for working set quota        PQL_DWSQUOTA      SYSBOOT parameter
(used by SYS$CREPRC)

Minimum value for working set quota        PQL_MWSQUOTA      SYSBOOT parameter
(used by SYS$CREPRC)

16.4.1.1 Adjust Working Set Size System Service. The effective result of altering the
process working set size is to change the value of the WSSIZE working set list
counter (see Figure 14-4).

In the case of working set list expansion, the working set size is limited by
the maximum working set size (PHD$W_WSEXTENT). If the expanded
working set list extends into the process section table (see Figure 14-1), the
process section table is moved down in exactly the same manner as is done to
accommodate process section table expansion. However, there is not always
enough room in the process header to accommodate the expanded working
set list. The process header size is determined by WSMAX (and
PROCSECTCNT), and the working set parameters (PHD$W_WSEXTENT
and PHD$W_WSAUTHEXT) are limited to a maximum of WSMAX. (The
calculation of the size of each piece of the process header is described in
Chapter 26.) Note that there is no check of how many process section table
entries in the process header are actually allocated; thus, the process section
table can grow so large that there is not enough working set list area available.

In the case of working set list contraction, the working set cannot be con- 
tracted below MINWSCNT. In addition, the extra dynamic working set size 
(PHD$W_EXTDYNWS) cannot be reduced below zero. If the 
PHD$W_WSNEXT pointer locates an entry beyond the new end of the list, it 
is reset to point to the new end. The contracted list can have holes in it; the 
PHD$W_WSLAST pointer is only moved back as a side effect of freeing ex- 
cess working set list entries (above the new limit). 

16.4.1.2 SET WORKING_SET Command. The SET WORKING_SET command al-
lows the default working set size (PHD$W_DFWSCNT) or the working set
maximum (PHD$W_WSEXTENT) to be altered at the command level. Nei- 
ther the default size nor the maximum can be set to a value larger than the 
authorized upper limit (PHD$W_WSAUTHEXT). 

If the working set maximum is altered, it changes the upper limit for future 
calls to the Adjust Working Set Limit system service. If the limit (default 
size) is altered, it affects the working set list reset operation performed by the 
routine MMG$IMGRESET invoked as a result of image exit. If the limit is set
to a value larger than the current quota, both the quota and the limit are 
altered to the new value. (Note that automatic working set adjustment is 
disabled for any process that has its quota and default (limit) set to the same 
value.) 

16.4.1.3 Automatic Working Set Size Adjustment. In addition to working set adjust- 
ment as a result of explicit calls to SYS$ADJWSL or as a side effect of image 
exit, the operating system also provides automatic working set adjustment to 
keep a process's page fault rate within limits set by one of several SYSBOOT 
parameters (see Table 16-2). All of the SYSBOOT parameters listed in this 
table are dynamic and can be altered without rebooting the system. 

Table 16-2: Automatic Working Set Size Adjustments: Process and System Parameters

Description                                Location or Name  Comments

Total amount of CPU time charged           PHD$L_CPUTIM      Updated by hardware clock
to this process                                              service routine

Amount of CPU time when last               PHD$L_TIMREF      Updated by quantum end routine
adjustment took place                                        when adjustment check is made

Total number of page faults                PHD$L_PAGEFLTS    Updated each time this
for this process                                             process incurs a page fault

Number of page faults when last            PHD$L_PFLREF      Updated by quantum end routine
adjustment took place                                        when adjustment check is made

Most recent page fault rate                PHD$L_PFLTRATE    Recorded but not used each time
for this process                                             an adjustment check is made

Amount of CPU time that process            AWSTIME (S)
must accumulate before a page
fault rate check is made

Lower limit page fault rate                PFRATL (S)

Amount by which to decrease                WSDEC (S)
working set list size

Lower bound for decreasing                 AWSMIN (S)        Do not adjust if PCB$W_PPGCNT is
working set list size                                        less than or equal to this value

Upper limit page fault rate                PFRATH (S)

Amount by which to increase                WSINC (S)         Disables automatic adjustment for
working set list size                                        entire system if zero

Free page list size to allow               GROWLIM (S)       Do not adjust working set size if
growth of working set                                        @SCH$GL_FREECNT is less than
                                                             or equal to this value

Free page list size to allow               BORROWLIM (S)     Do not adjust working set list size
extension of working set list                                if @SCH$GL_FREECNT is less than
                                                             or equal to this value

(S) These values are SYSBOOT parameters.

The automatic working set adjustment takes place as part of the quantum
end routine (see Chapter 10), because a process that cannot execute for even a
single quantum will not benefit from an increased working set size. (Note
that no adjustment takes place for real-time processes.) The adjustment takes
place in several steps:

1. If the WSINC parameter is set to zero, the adjustment is disabled on a 
system-wide basis, so nothing is done. If automatic working set adjust- 
ment has been turned off by the DCL command SET WORKING_SET/ 
NOADJUST, the adjustment is disabled for the process, and, again, noth- 
ing is done. 

2. If the process default working set size (PHD$W_DFWSCNT) is equal to its 
quota (PHD$W_WSQUOTA), then adjustment is disabled for this process, 
so, again, nothing is done. 

3. If the process has not been executing long enough since the last adjust- 
ment (the difference between accumulated CPU time, PHD$L_CPUTIM, 
and the time of the last adjustment attempt, PHD$L_TIMREF, is less than 
the SYSBOOT parameter AWSTIME), no adjustment is done at this time.
If the process has accumulated enough CPU time, the reference time is
updated (PHD$L_CPUTIM is loaded into PHD$L_TIMREF), and the rate
checks are made. 

4. The current page fault rate is calculated. The philosophy for automatic 
working set adjustment consists of two premises. If the page fault rate is 
too low, the system can benefit from a smaller working set size (because 
more physical pages become available) without harming the process (by 
causing it to incur many page faults). If the page fault rate is too high, the 
process can benefit from a larger working set size (by incurring fewer 
faults), without degrading the system. 

• If the current page fault rate is too high (greater than or equal to 
PFRATH), a determination is made to see if the working set list can be 
extended. If the size of the working set list is below WSQUOTA, the 
working set list is extended by WSINC. If the size of the working set list 
is greater than or equal to WSQUOTA, the number of pages on the free 
page list is compared to the SYSBOOT parameter BORROWLIM. If 
there are more than BORROWLIM pages on the free page list, the work- 
ing set list is increased by WSINC. However, if there are fewer than 
BORROWLIM pages on the free page list, the working set list is not 
extended. The working set list can only be extended up to WSEXTENT. 

Note the adjustment taking place here affects only the working set 
list, not the working set itself. Once the working set list has been ex- 
tended, newly faulted pages can be added to the working set. The page 
fault exception handler will add pages to the working set above 
WSQUOTA only when there are more than the SYSBOOT parameter 
GROWLIM pages on the free page list (see Section 15.4.3). 

• If the current page fault rate is too low (strictly, less than PFRATL), the 
working set is decreased (by WSDEC). However, if the contents of 
PCB$W_PPGCNT are less than or equal to AWSMIN, no adjustment 
takes place. This decision is based on the assumption that many of the 
pages in the working set are global pages and that therefore the system 
will not benefit (and the process may suffer) if the working set is de- 
creased. Note that in the update for VAX/ VMS Version 3.1, PFRATL 
was set to zero, effectively turning off this method of working set reduc- 
tion in favor of swapper working set trimming. The rationale for this 
change is explained at the end of this list. 

5. The actual working set adjustment is accomplished by a regular kernel 
mode AST that executes an Adjust Working Set system service. The AST 
parameter passed to this AST is the amount of previously determined in- 
crease or decrease. This step is required because the system service must 
be called from process context (at IPL 0) and the quantum end routine is 
executing in response to the IPL 7 software timer interrupt. 
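The decision logic of the steps above is summarized in the following C
sketch. The lowercase names stand for the process header fields and SYSBOOT
parameters of Table 16-2; the rate calculation is deliberately simplified, and
the actual routine is VAX MACRO.

    extern unsigned wsinc, wsdec;         /* WSINC, WSDEC                 */
    extern unsigned pfrath, pfratl;       /* PFRATH, PFRATL               */
    extern unsigned awstime, awsmin;      /* AWSTIME, AWSMIN              */
    extern unsigned borrowlim;            /* BORROWLIM                    */
    extern unsigned free_count;           /* @SCH$GL_FREECNT              */
    extern unsigned cputim, timref;       /* PHD$L_CPUTIM, PHD$L_TIMREF   */
    extern unsigned pageflts, pflref;     /* PHD$L_PAGEFLTS, PHD$L_PFLREF */
    extern unsigned wssize, wsquota, wsextent, dfwscnt;
    extern unsigned ppgcnt;               /* PCB$W_PPGCNT                 */

    /* Returns the adjustment passed to the kernel mode AST that calls
     * the Adjust Working Set Limit service, or 0 for no adjustment. */
    int quantum_end_adjust(void)
    {
        if (wsinc == 0)                   /* step 1: disabled system-wide */
            return 0;
        if (dfwscnt == wsquota)           /* step 2: disabled for process */
            return 0;
        if (cputim - timref < awstime)    /* step 3: too little CPU time  */
            return 0;

        unsigned rate = (pageflts - pflref) / (cputim - timref);
        timref = cputim;                  /* update the reference values  */
        pflref = pageflts;

        if (rate >= pfrath) {             /* step 4: faulting too often   */
            if (wssize < wsquota || free_count > borrowlim)
                return (wssize + wsinc <= wsextent) ? (int)wsinc : 0;
            return 0;
        }
        if (rate < pfratl && ppgcnt > awsmin)
            return -(int)wsdec;           /* faulting rarely: shrink      */
        return 0;
    }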




Two other pieces of the executive control the size of a process's working set: 
the page fault routines and the swapper. As described in the previous list, the 
page fault handler can add a page to a process's working set if the size of the 
free page list is greater than GROWLIM. In an effort to gain pages, the swap- 
per will reduce the working sets of processes in the balance set before actu- 
ally removing processes from the balance set. This working set reduction is 
known as swapper trimming or working set shrinking. Process selection is 
performed by a table-driven, prioritized scheme (see Section 17.2.2). 

Two problems are inherent in the quantum end scheme of automatic
working set adjustment: processes that are compute-intensive will reach
quantum end many times, and images that have been written to be efficient
with respect to page faults (a low page fault rate) will qualify for working set
reduction, because their page fault rate is lower than PFRATL. In both of 
these cases, working set reduction is not desirable. By contrast, swapper trim- 
ming selects its processes starting with those that are least likely to need 
large working sets. 

In what can be seen as an evolutionary change to the operating system, 
working set reduction at quantum end was turned off in the VAX/VMS Ver- 
sion 3.1 update. The default value of PFRATL has been set to zero. In this 
manner, swapper trimming and the image exit reset are the only methods 
used to reduce working set size. 

16.4.1.4 Purge Working Set System Service. The Purge Working Set system service 
requests that all virtual pages in the specified address range that happen to be 
in the working set be removed from the working set. A program could use 
this service if it recognized that a certain set of routines or data was no longer 
required. By voluntarily removing entries from the working set, a process can 
exercise a little control over the working set list replacement algorithm, in- 
creasing the chances for frequently used pages to remain in the working set. 
The VMS executive uses this service as part of the image startup sequence
(see Chapter 21) to insure that a program starts its execution without unnec-
essary pages (such as CLI command processing routines) in its working set.



16.4.2 Locking and Unlocking Pages 

For time-critical applications and other situations where a program wishes to 
access code or data without incurring a page fault, system services are pro- 
vided to lock pages into the process working set or into memory. 

16.4.2.1 Locking Pages in the Working Set. A set of virtual pages can be locked into the
process working set to prevent page faults from occurring on references to 
these pages. Locking pages in the working set guarantees that when this proc- 
ess is executing (is the current process), the locked pages are always in the
process working set. In addition to the obvious benefit of this service, it can
also be used by routines that execute at elevated IPL (above IPL 2), because 
the operating system does not allow page faults to occur above IPL 2. There is 
no implication that these pages remain resident when the process is not cur- 
rent because the entire working set can be outswapped. (Residency is guaran- 
teed by either a combination of this system service and the Set Swap Mode 
system service or by using the Lock Pages in Memory system service.) 

All pages in the specified range are faulted into the working set if they are 
not already valid. The working set list (see Figure 14-4) must be reorganized 
so that the locked pages appear in the list following the WSLOCK pointer. 
This reorganization is accomplished by exchanging the locked WSLE with 
the entry pointed to by WSDYN, and then incrementing WSDYN to point to 
the next element in the list. The WSLX PFN array elements for the two valid 
pages must also be exchanged. In addition, the WSL$V_WSLOCK bit is set in 
the working set list entry. 
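The exchange can be pictured with the following C sketch; the array
representations of the working set list and the PFN WSLX array are
simplified, hypothetical stand-ins for the structures of Chapter 14.

    typedef struct { unsigned pfn; unsigned flags; } wsle_t;

    #define WSL_WSLOCK 0x1            /* stand-in for WSL$V_WSLOCK */

    extern wsle_t   wsle[];           /* the working set list           */
    extern unsigned wsdyn;            /* PHD$W_WSDYN as an array index  */
    extern unsigned pfn_wslx[];       /* PFN WSLX array, indexed by PFN */

    void lock_wsle(unsigned target)   /* target: dynamic entry to lock  */
    {
        /* Exchange the entry to be locked with the first dynamic
         * entry, then grow the locked region by advancing WSDYN. */
        wsle_t tmp = wsle[wsdyn];
        wsle[wsdyn] = wsle[target];
        wsle[target] = tmp;

        /* The WSLX PFN array elements for the two valid pages must
         * also be exchanged, so that each physical page still locates
         * its working set list entry. */
        pfn_wslx[wsle[wsdyn].pfn]  = wsdyn;
        pfn_wslx[wsle[target].pfn] = target;

        wsle[wsdyn].flags |= WSL_WSLOCK;
        wsdyn++;
    }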

A check is made to insure that the process will be left with enough dy- 
namic working set after the specified number of pages are locked. "Enough
dynamic working set" means that the extra dynamic working set size (the size
of the dynamic working set after space has been allocated for page table pages
and a minimum working set size) is greater than zero. (Like most of the
memory management system services, this service can partially succeed. In 
this case, the address range that is actually locked is returned to the caller by 
means of the retadr argument.) 

When a process is being outswapped, global read/write pages are dropped 
from the process working set (see Chapter 17) to avoid cumbersome account- 
ing problems about whether the outswapped page contains the most up-to- 
date information. For this reason, global read/write pages cannot be locked 
into the process working set. (Such pages can be locked into memory because 
the Lock Pages in Memory system service prevents outswap of either the 
process header or the locked pages, avoiding the swapping situation alto- 
gether.) The swapper also performs an optimization with global read-only 
pages by dropping them from the working set on outswap if the global share 
count is larger than one. If such pages are locked into the working set, they 
are not dropped from the working set, regardless of the contents of the PFN 
SHRCNT array. 

16.4.2.2 Locking Pages in Memory. The Lock Page in Memory system service is simi- 
lar to the Lock Page in the Working Set service except that the 
WSL$V_PFNLOCK bit in the WSLE is set and the process header is locked 
into memory. This service performs an implicit working set lock in addition 
to guaranteeing permanent residency to the specified virtual address range. 
Because this operation is permanently allocating a system resource, physical 
memory, it requires a privilege (PSWAPM). 




16.4.2.3 Unlocking Pages. The converse of either of the two locking services unlocks 
pages from either the working set or physical memory. In addition, the work- 
ing set list entries may have to be exchanged with other locked entries to 
place the unlocked entries back into the dynamic portion of the list. As with 
the exchange associated with locking pages, the WSLX PFN array elements 
must also be exchanged. Finally, the appropriate bit in the WSLE 
(WSL$V_WSLOCK or WSL$V_PFNLOCK) is cleared. 



16.4.3 Process Swap Mode 

A process with PSWAPM privilege can prevent itself from being removed 
from memory. The Set Process Swap Mode ($SETSWM) system service simply
sets the PCB$V_PSWAPM bit in the status longword (PCB$L_STS) in the 
software PCB. When the swapper is searching for suitable outswap candi- 
dates, processes with this bit set are passed over. 

16.4.4 Altering Page Protection 

It is possible for a process to alter the page protection of a set of pages in its 
address range with the Set Protection on Pages system service ($SETPRT). In 
general, the operation of this service is straightforward. However, there is one 
interesting side effect. If a section page for a read-only section has its protec- 
tion set to writeable, the copy-on-reference bit is set. This set bit will force 
the page to have its backing store address changed to the page file when the 
page is faulted, preventing a later attempt to write the modified section pages 
back to a file to which the process may be denied write access. 

The symbolic debugger uses this service to implement its watchpoint facil- 
ity. The page containing the data element in question is set to no write access 
for user mode. When the program attempts to access the page, an access 
violation occurs, which is fielded by the debugger's condition handler. This 
handler performs the following actions: 

1. Checks whether the inaccessible address is the one being watched and 
reports the modification if it is 

2. Sets the page protection to PRT$C_UW to allow the modification 

3. Sets the TBIT in the PSL to give the debugger control after the instruction 
completes 

4. Dismisses the exception 

When the instruction completes, the debugger's TBIT handler gains control, 
sets the page protection back to no write access for user mode, and allows the 
program to continue its execution. 
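The same technique can be demonstrated on a modern POSIX system. The
following self-contained C program arms a watchpoint by write-protecting a
page and fielding the resulting fault, just as the debugger's condition handler
does. It omits the T-bit single-step that lets the debugger re-protect the page
after the one instruction completes, so here the page simply stays writable
after the first report.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *page;                       /* the watched page */
    static long  pagesize;

    static void handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        /* Report the access (write() is async-signal-safe). */
        static const char msg[] = "watched page accessed\n";
        write(STDERR_FILENO, msg, sizeof msg - 1);

        /* Make the page writable so the faulting instruction can
         * complete when the handler returns. */
        if ((char *)si->si_addr >= page &&
            (char *)si->si_addr < page + pagesize)
            mprotect(page, pagesize, PROT_READ | PROT_WRITE);
        else
            _exit(1);                        /* a genuine access violation */
    }

    int main(void)
    {
        pagesize = sysconf(_SC_PAGESIZE);
        page = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        mprotect(page, pagesize, PROT_READ);  /* arm the watchpoint */
        page[0] = 42;          /* faults, is reported, then completes */
        printf("modification completed: %d\n", page[0]);
        return 0;
    }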






17 Swapping 



A time to cast away stones and a time to gather stones 
together. . . 
— Ecclesiastes 3:5 

The VAX/VMS operating system does not allow the amount of physical
memory to place an absolute limit on the number of processes allowed in the
system. Physical memory is effectively extended by keeping only a subset of
the total number of active processes resident at a given time. This number is
kept at a maximum by controlling the number of pages that any one process
has in memory at any given time. The remaining processes work with reduced
working sets or reside in backing store locations. The reduction in size of low 
priority working sets, movement of low priority processes to backing store, 
and the subsequent filling of memory with high priority computable proc- 
esses is the responsibility of the swapper. In fact, the swapper process can be 
viewed as the system-wide memory manager. 

In VAX/VMS Version 3.0, the responsibilities of the swapper changed con-
siderably. Prior to Version 3.0, the swapper was solely responsible for
moving processes in and out of physical memory. The swapper in Version 3.0
attempts not to swap processes out of physical memory. Rather, it will shrink
process working sets in order to gain free pages.

17.1 SWAPPING OVERVIEW 

Before discussing the details of swapper operation (moving a process into or 
out of memory), some basic swapper concepts will be reviewed. The specific 
uses of each of the memory management data structures manipulated by the 
swapper will be pointed out. 

17.1.1 Swapper Responsibilities 

The swapper has two main responsibilities: 

• The subset of processes that are currently resident should represent the 
highest priority executable processes in the system. When nonresident 
processes become computable, the swapper must bring them back into 
memory. 

• The swapper is also responsible for keeping the number of pages on the free 
page list above the low limit threshold established by the SYSBOOT pa- 
rameters FREELIM and FREEGOAL. Requests for physical pages come 




from several sources. One request comes from the pager in resolving a page 
fault for a page that is not currently in memory. Another originates with 
an attempt by the swapper to acquire enough physical pages to inswap a 
computable but outswapped process. There are four operations that the 
swapper performs to keep pages on the free page list. 

1. Process headers of previously outswapped process bodies may be eligi-
ble for outswap. If so, they will be outswapped. (Process headers for 
already deleted processes are simply deleted.) 

2. The swapper will write modified pages until the number of pages on the
modified list falls below the low limit threshold stored in global loca- 
tion SCH$GL_MFYLOLIM. However, the swapper will not write modi- 
fied pages if there are fewer than the SYSBOOT parameter MPW_ 
THRESH pages on the modified list. The value of SCH$GL_MFYLOLIM 
ensures that a certain number of pages will be available on the modified 
list for page faults; MPW_THRESH simply sets a lower bound to be 
met before the swapper can write the modified page list to gain pages. 

3. In an attempt not to outswap processes, the swapper will shrink work- 
ing set sizes. The table used to determine outswap selection is also used 
to determine the order by which working sets will be reduced. See Sec- 
tion 17.2.2 for more information on outswap selection. 

4. As a last resort to maintain the size of the free page list, the swapper
will select an eligible process for outswap and remove that process from 
memory. The table used to determine outswap selection is also used in 
reducing working set sizes. 



17.1.2 Swapper Implementation 

The swapper is a separate process in the operating system. As such, it can be 
selected for execution just like any other process in the system. It also has its 
own resources and quotas that are charged when the swapper does I/O. 

By making the swapper a separate process, the pieces of the system that 
detect a need for one of the swapper's duties simply have to wake the swapper 
up (by issuing a JSB to routine SCH$SWPWAKE). As already noted in Chapter 
10 this routine does not simply wake the swapper. Instead, it performs a 
series of checks to determine whether there is a need for swapper activity. If 
so the swapper process is awakened. If not, the routine simply returns. By 
performing these checks in this routine rather than in the swapper process 
itself, the overhead of two needless context switches is avoided. 
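A minimal sketch in C of this check-before-waking logic follows. The structure fields, thresholds, and routine names are hypothetical stand-ins for the scheduler and memory management globals that SCH$SWPWAKE actually examines.

#include <stdio.h>

/* Hypothetical snapshot of the globals consulted before a wake. */
typedef struct {
    int free_page_count;   /* free page list size                   */
    int freelim;           /* low limit (SYSBOOT parameter FREELIM) */
    int modified_count;    /* modified page list size               */
    int mfylim;            /* modified list upper threshold         */
    int como_nonempty;     /* computable outswapped processes exist */
} system_state;

static void swapper_wake(void) { printf("swapper awakened\n"); }

/* Wake the swapper only if there is really work for it to do, so that
 * a needless wake (and two context switches) is avoided. */
void swp_wake_if_needed(const system_state *s)
{
    if (s->free_page_count < s->freelim ||
        s->modified_count  > s->mfylim  ||
        s->como_nonempty)
        swapper_wake();
    /* otherwise simply return to the caller */
}

int main(void)
{
    system_state s = { 50, 100, 0, 100, 0 };   /* free list below FREELIM */
    swp_wake_if_needed(&s);
    return 0;
}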

When the swapper is the current process, it executes entirely in kernel 
mode. All of the swapper code resides in system space. (The swapper makes 
use of its P0 space when it creates a new process by using the module SHELL
in the executive image. This operation is described in Chapter 20.) 




1 7. 1 .3 Comparison of Paging and Swapping 

The VMS operating system uses two different techniques to make efficient 
use of available physical memory. The ability to support programs with vir- 
tual address spaces larger than physical memory is the responsibility of the 
pager. The swapper allows a running system to support more active processes 
than can fit into physical memory at one time. The swapper's responsibilities 
are more global or system wide than the pager's. Table 17-1 compares and 
contrasts the pager and swapper in several details. 



17.2 SWAP SCHEDULING 

The swapper is a part of the system that performs both memory management 
and scheduling functions. The scheduling aspects of the swapper are here 
discussed from two points of view. First, the actions that the swapper takes 
to determine whether to inswap, outswap, or shrink a particular process are 
discussed. Then, those system events that trigger swapper activity are briefly 
described. 



1 7.2. 1 Selection of Inswap Candidate 

The scheduler maintains 32 quadword listheads for outswapped computable 
(COMO) processes, one for each software priority (see Figure 10-3). These 
queues are identical to the 32 queues maintained for the computable resident 
(COM) processes. The steps that the swapper takes to locate an inswap candi- 
date (once it has decided that an inswap can be performed) exactly parallel the 
steps that the rescheduling interrupt service routine takes (see Chapter 10) to 
select the next candidate for execution. 

1. An FFS instruction on the COMO queue summary longword (SCH$GL_
COMOQS) locates the highest priority nonempty COMO queue. 

2. The first process in this queue is removed and prepared for being swapped 
into memory. 

Figure 17-1 shows the parallel between the inswap candidate selection and 
the operation of the rescheduling interrupt service routine. The key instruc- 
tions in the two routines are identical. The only differences are in the global 
data items referenced by the instructions. 

After a process has been chosen for inswap, the swapper checks if there are 
enough pages on the free page list to hold the inswap candidate and leave at 
least FREELIM pages remaining on the list. If so, the inswap proceeds. If not, 
the swapper attempts to make more pages available by shrinking working 
sets, outswapping one or more processes, writing modified pages, or deleting 
process headers of already deleted process bodies. 
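The selection itself can be sketched in a few lines of C. The find-first-set loop below plays the role of the FFS instruction; the queue layout is schematic, and the sketch assumes, as the VMS queues are arranged, that the lowest-numbered set bit corresponds to the highest software priority.

#include <stdio.h>
#include <stdint.h>

struct pcb { struct pcb *flink, *blink; int pid; };

static struct pcb *como_head[32];   /* stand-in for the COMO listheads  */
static uint32_t como_summary;       /* stand-in for SCH$GL_COMOQS       */

/* Equivalent of the FFS instruction: the lowest set bit identifies the
 * highest priority nonempty COMO queue. */
static int ffs32(uint32_t v)
{
    for (int bit = 0; bit < 32; bit++)
        if (v & (1u << bit))
            return bit;
    return -1;                       /* all queues empty */
}

struct pcb *select_inswap_candidate(void)
{
    int q = ffs32(como_summary);
    if (q < 0)
        return NULL;                 /* no computable outswapped process */
    struct pcb *p = como_head[q];    /* REMQUE of the first PCB          */
    como_head[q] = p->flink;
    if (como_head[q] == NULL)        /* queue now empty: clear its bit   */
        como_summary &= ~(1u << q);
    return p;
}

int main(void)
{
    struct pcb a = { NULL, NULL, 42 };
    como_head[5] = &a;               /* one COMO process in queue 5 */
    como_summary = 1u << 5;
    printf("inswap pid %d\n", select_inswap_candidate()->pid);
    return 0;
}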






Table 17-1: Comparison of Paging and Swapping

Differences

Paging: The pager is a process-wide component of the executive that moves
pages into and out of process working sets.
Swapping: The swapper is a system-wide component of the executive that moves
entire processes into and out of physical memory.

Paging: The page fault handler is an exception service routine that executes
in the context of the process that incurred the page fault.
Swapping: The swapper is a separate process that is awakened from its
hibernating state by components that detect a need for swapper activity.

Paging: The unit of paging is the page, although the pager attempts to read
more than one page with a single disk read.
Swapping: The unit of swapping is the process (or more accurately, the
process working set).

Paging: Page read requests for process pages are queued to the driver
according to the base priority of the process incurring the page fault.
Modified page write requests are queued according to the SYSBOOT parameter
MPW_PRIO.
Swapping: Swapper I/O requests are queued according to the value of the
SYSBOOT parameter SWP_PRIO.

Paging: Paging supports programs with very large address spaces.
Swapping: Swapping supports a large number of concurrently active processes.

Similarities

1. The pager and swapper work from a common database. The most important
structures that are used for both paging and swapping are the process page
tables, the working set list, and the PFN database.

2. The pager and swapper do conventional I/O. There are only slight
differences in detail between pager I/O and swapper I/O on the one hand and
normal Queued I/O requests on the other.

3. Both components attempt to maximize the number of blocks read or written
with a given I/O request. The pager accomplishes this with read and write
clustering. The swapper attempts to inswap or outswap the entire working set
in one (or a small number of) I/O request(s).



The routine SCH$SCHED, which selects the next execution candidate, has an
exact parallel in the swapper. The first half of Figure 17-1 shows the
swapper's selection of the next inswap candidate beside the nearly identical
instructions in the scheduler: in each routine an FFS instruction locates the
highest priority nonempty queue, and a REMQUE instruction removes the first
PCB in that queue. The key instructions in the two routines are identical;
the only differences are in the global data items referenced by the
instructions.

(1) IPL is raised to synchronize access to the scheduler's database.

(2) The highest priority nonempty (COMO/COM) queue is selected.

(3) The address of its forward pointer is loaded into R3.

(4) The address of the selected PCB is loaded into R4.

At this point, the swapper has found an inswap candidate. It then takes the
steps necessary to bring this process into memory. The scheduler, on the
other hand, continues execution. While a long time elapses between inswap
candidate selection and completion of the inswap, there is no time lapse for
execution selection. Some time later, the inswap operation completes. The
swapper rebuilds the working set list and the process page tables. The
parallel resumes when the swapper calls the scheduler to make the newly
inswapped process computable; the second half of the figure shows the state
change from COMO to COM beside the state change from computable to current.

(5) Remove the selected PCB from its former state queue (COMO/COM).

(6) Bias R1 so that it points to SCH$GL_COMOQS, the summary longword for the
COMO state. (This is noted so the BBCC instruction makes sense.)

(7) If the removal of the PCB emptied the queue, clear the associated
priority bit in the summary longword.

(8) Load the STATE field in the PCB with the new state (COM/CUR) of the
process.

(9) Finally, place the PCB into its new scheduling queue.

At this point, the parallel ends. If the process just made computable is of
higher priority than the swapper, that process will be scheduled as soon as
the IPL is lowered below 3 and the rescheduling interrupt occurs. In other
cases, the process will not execute until it becomes the highest priority
computable process. The scheduler's service routine continues its operation,
placing the selected process into execution.

Figure 17-1
Parallels between Inswap Candidate Selection by the Swapper and Execution
Candidate Selection by the Scheduler
(Side-by-side VAX MACRO listings not reproduced.)



There is one optimization that the swapper performs that may prevent an 
eventual outswap. The swapper only inswaps compute-bound low priority 
processes at a rate determined by the special SYSBOOT parameter SWPRATE. 
(Such a process is one whose current priority is equal to its base priority,
which in turn is less than or equal to the SYSBOOT parameter
DEFPRI.) The inswap is abandoned if all of the following are true:

• The swapper is attempting to inswap such a process. 

• The process will not fit. 

• The SWPRATE interval has not yet expired. 

Each time that the swapper successfully inswaps one of these so-called 
cruncher processes, it resets its inswap clock to contain the current time plus 
SWPRATE. 
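The throttling rule just described can be summarized in a short C sketch; the names and time units below are illustrative, not VMS definitions.

#include <stdio.h>

typedef struct { int cur_pri, base_pri; } pcb_t;

static long now;                 /* current system time (ticks)     */
static long next_cruncher_time;  /* the swapper's inswap clock      */
static const int  DEFPRI  = 4;   /* SYSBOOT parameter (stand-in)    */
static const long SWPRATE = 100; /* SYSBOOT parameter (stand-in)    */

static int is_cruncher(const pcb_t *p)
{   /* compute-bound low priority process as defined in the text */
    return p->cur_pri == p->base_pri && p->base_pri <= DEFPRI;
}

/* Returns 1 if the inswap should be abandoned. */
int throttle_inswap(const pcb_t *p, int fits)
{
    if (is_cruncher(p) && !fits && now < next_cruncher_time)
        return 1;                /* all three conditions true: give up */
    return 0;
}

void note_successful_cruncher_inswap(void)
{
    next_cruncher_time = now + SWPRATE;   /* reset the inswap clock */
}

int main(void)
{
    pcb_t p = { 4, 4 };
    now = 50; next_cruncher_time = 100;
    printf("abandon=%d\n", throttle_inswap(&p, 0));   /* prints 1 */
    return 0;
}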



1 7.2.2 Selection of Shrink or Outswap Candidates 

When the swapper must resort to shrinking or swapping resident processes to 
make room for a computable (but outswapped) process, it must determine 
which process to select first. The examination order for potential outswap 
candidates attempts to modify last those processes that would suffer the 
most from a working set reduction or an outswap. Note that this algorithm is 
not altogether straightforward; some processes benefit from being swapped, 
rather than having their working sets reduced. 

Any time that free pages are gained by action of the swapper, a check is 
made to see if there are enough pages on the free and modified page lists to 
satisfy the deficit. If enough pages are available, the swapper completes its 
actions and hibernates. 

The swapper maintains a table (in module OSWPSCHED) that determines 
the order and conditions for which the various resident scheduling states are 
examined. When the swapper searches for candidates, it starts at the first 
section in its table and evaluates all the processes indicated by that section. 
For each section in the table, the swapper makes three passes looking for 
candidates. On each pass, the criteria for a process to remain inswapped in- 
crease in severity. When all three passes have been completed for all the 
processes represented by the section, the swapper evaluates the next section 
in the table. 

The selection table is shown in Table 17-2. Note that the table may have 
more than one scheduling state in each section of the table. These states are 
viewed by the determination algorithm as being more or less equivalent in 
their requirements. Processes cannot be outswapped if they have locked 
themselves into the balance set. 

In addition to the process's scheduling state, the following characteristics 
can be used to select processes: 



Table 17-2: Selection of Shrink and Outswap Candidates

                         Selection dependent on:            FLAGS
Process     Direct              Initial
State       I/O      Priority   Quantum    LONGWAIT  SWAPASAP  SWPOGOAL

SUSP        No       No         No
LEF         No       No         No         X                   X
HIB         No       No         No         X                   X
CEF         No       No         No                             X
LEF         No       No         No                             X
HIB         No       No         No                             X
FPG         No       Yes        No                   X
COLPG       No       Yes        No                   X
MWAIT       No       No         No                             X
CEF         Yes      Yes        Yes
LEF         Yes      Yes        Yes
PFW         No       Yes        Yes                  X
COM         No       Yes        Yes                  X






• In some entries, processes that have not completed their initial quantum
(those that have the initial quantum flag PCB$V_INQUAN set in 
PCB$L_STS) are not considered as candidates for outswap. There are two 
circumstances under which the swapper does not make the initial quan- 
tum check: a real-time process (a process whose priority is greater than or 
equal to 16) must be swapped in, or the swapper has failed to swap out a 
process on the SYSBOOT parameter SWPFAIL number of tries. 

The swapper maintains a failure counter that records the number of 
times that it attempted to locate an outswap candidate and failed. When 
this count reaches a value equal to SWPFAIL, the swapper ignores the 
setting of the initial quantum flag. The counter is reset each time that an 
outswap candidate is successfully located. 

• In some entries, processes can be considered for swapper action if their
priority is less than or equal to that of the potential inswap process
(stored in global location SWP$GB_ISWPRI).

• Processes that are performing direct I/O are selected later than those
that are not. If a process is doing direct I/O and is waiting on an event
flag, the swapper assumes that the event flag wait is associated with the
direct I/O. The motivation behind delaying direct I/O process selection is
the desire to avoid the overhead of swapping the process, only to have the
process's state change to COM, even before the outswap completes.

• The following three flags are used in the selection of processes. The
flags are maintained for table entries and direct the swapper to include
specific processes in the table entry or to take specific action on one of
the passes through the table entry.






LONGWAIT When this flag is set, processes can be included in the 

table entry if they have been waiting in a scheduling state 
for longer than the SYSBOOT parameter LONGWAIT. 
This flag is only applicable to processes in the LEF or HIB 
scheduling states. 

The effect of the LONGWAIT flag is to subdivide the 
processes in LEF and HIB scheduling states into processes 
that have been waiting a long time to become computable 
and those that have been waiting a short time. The philos- 
ophy here is that processes that have been waiting a long 
time will probably wait longer still, whereas those that 
have only been waiting a short time could become com- 
putable rather quickly. 

SWAPASAP This flag indicates that the swapper must swap out proc-

esses indicated by this state, after reducing their working 
set to WSQUOTA. The processes indicated by a table 
entry with SWAPASAP set are computable or are likely to 
become computable very soon. If the system needs mem- 
ory badly enough, one of these processes will be swapped 
out at its current size. When the outswapped process be- 
comes computable again, it will not have to waste com- 
pute time rebuilding its working set. 

SWPOGOAL This flag indicates that the swapper must shrink the 

working set size of processes indicated by the table entry 
to SWPOUTPGCNT. 

The three passes made on each table section are as follows: 

1. The first pass reduces extended working sets to WSQUOTA. If the 
SWAPASAP flag is set for the table section, processes are shrunk and then 
outswapped as they are processed. 

2. If the current section of the selection table is affected by the SWPOGOAL 
flag, the second pass reduces the working set size of processes indicated by 
this section. Working sets are reduced to the SYSBOOT parameter 
SWPOUTPGCNT. 

3. In the third pass, processes selected by this section are swapped out of 
physical memory. 

When the swapper scans a series of processes queued to a particular priority 
within a scheduling state, the scan begins with the most recently queued 
entry (at the tail of the queue). This starting point ensures that the longer a
process has been waiting in a queue, the less chance it has of being shrunk or 
swapped. 
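A schematic C rendering of one table section's three passes follows. The table layout and helper routines are invented for the example; the real table and code live in module OSWPSCHED.

#include <stdio.h>

enum { LONGWAIT_F = 1, SWAPASAP_F = 2, SWPOGOAL_F = 4 };

typedef struct { int flags; } table_section;   /* one section, schematic */

/* Hypothetical helpers standing in for working set and swap routines. */
static void shrink_to_wsquota(int pid)     { printf("shrink %d to WSQUOTA\n", pid); }
static void shrink_to_swpoutpgcnt(int pid) { printf("shrink %d to SWPOUTPGCNT\n", pid); }
static void outswap(int pid)               { printf("outswap %d\n", pid); }
static int  deficit_satisfied(void)        { return 0; }   /* keep scanning */

/* Evaluate every process selected by one table section, three passes,
 * with the criteria growing more severe on each pass. */
void scan_section(const table_section *ts, const int *pids, int n)
{
    for (int pass = 1; pass <= 3 && !deficit_satisfied(); pass++) {
        for (int i = 0; i < n && !deficit_satisfied(); i++) {
            switch (pass) {
            case 1:                          /* trim extended working sets */
                shrink_to_wsquota(pids[i]);
                if (ts->flags & SWAPASAP_F)
                    outswap(pids[i]);        /* swap at current size */
                break;
            case 2:                          /* shrink toward SWPOUTPGCNT */
                if (ts->flags & SWPOGOAL_F)
                    shrink_to_swpoutpgcnt(pids[i]);
                break;
            case 3:                          /* last resort: outswap */
                outswap(pids[i]);
                break;
            }
        }
    }
}

int main(void)
{
    table_section lef_hib = { LONGWAIT_F | SWPOGOAL_F };
    int pids[] = { 21, 22 };
    scan_section(&lef_hib, pids, 2);
    return 0;
}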






Table 17-3: Events That Cause the Swapper or Modified Page Writer to Be Awakened

Event (detecting module) and additional comments:

• Process that is outswapped becomes computable (RSE). The swapper will
attempt to make this process resident.

• Quantum end (RSE). An outswap previously blocked by initial quantum flag
setting may now be possible.

• CPU time expiration (RSE). The process may be deleted, allowing a
previously blocked inswap to occur.

• Process enters wait state (SYSWAIT). The process that entered a wait state
may be a suitable outswap candidate. (For example, priority may not be
important for this wait state.)

• Modified page list exceeds upper limit threshold (ALLOCPFN). Modified page
writing is performed by the swapper.

• Free page list drops below low limit threshold (ALLOCPFN). The swapper
must balance the free page count by:
1. Writing modified pages
2. Swapping headers of previously outswapped process bodies
3. Swapping more processes

• Free page list exceeds upper limit threshold (ALLOCPFN). A process that
could not be inswapped due to lack of physical pages may now fit.

• Balance slot of deleted process becomes available (SYSDELPRC). A
previously blocked inswap may now be possible.

• Process header reference count goes to zero (PAGEFAULT). A process header
can now be outswapped to join a previously outswapped process body.

• System timer subroutine executes (TIMESCHDL). The swapper is awakened
every second to check if there is any work to be done.



17.2.3 System Events that Trigger Swapper Activity 

The swapper spends its idle time in a hibernating state. Those components 
that detect a need for swapper activity wake the swapper (by calling routine 
SCH$SWPWAKE). Table 17-3 lists the system events that trigger a need for 
swapper activity, the module that contains the routine that detects each 
need, and the reason why the swapper needs to be informed about these sys- 
tem events. 

The swapper does not worry about why it was awakened. Every time that it 
is awakened, it tends to all of its responsibilities. The main loop of the swap- 
per performs the following steps (a sketch of the loop follows the list):






1. If the free page count is too low, the list is replenished, which might result
in an outswap of a process if modified page writing (Step 2) will not free 
enough physical pages. 

2. Modified pages are written. Every time the swapper is awakened, the mod- 
ified page writer is called. If the size of the modified page list exceeds its 
upper limit threshold (SCH$GL_MFYLIM), modified pages will be written 
until the size of the list falls below the low limit threshold (SCH$GL_ 
MFYLOLIM). 

There are times when the swapper wants to flush the entire modified 
page list. The logic of the modified page writer requires that both of these 
threshold parameters be zeroed for the list to be flushed. The last step that 
the modified page writer takes before exiting is to restore the two modified 
page list thresholds to the values described by the SYSBOOT parameters 
MPW_HILIMIT and MPW_LOLIMIT.

3. The swapper attempts to inswap a process in the COMO state (if one 
exists). This attempt can fail if there are not enough physical pages to 
accommodate the outswapped process and none of the resident processes 
are suitable outswap candidates. 

4. The fact that the swapper is a separate process that executes fairly fre- 
quently (at least once a second) makes it a convenient vehicle for testing 
whether a powerfail recovery has occurred and, if so, notifying all proc- 
esses that have requested power recovery AST notification (with the Set 
Powerfail Recovery AST system service). The details of this delivery 
mechanism are described in Chapter 27. 

5. Finally, the swapper puts itself into the hibernate state, after checking its 
wake pending flag. If anyone (including the swapper itself in one of its 
three main subroutines) has requested swapper activity since the swapper 
began execution, the hibernate is skipped and the swapper goes back to 
Step 1. 
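These five steps can be condensed into a C sketch of the loop; every routine name below is an illustrative stand-in, and the hibernate is reduced to a process exit.

#include <stdio.h>
#include <stdlib.h>

static int wake_pending;                /* swapper's wake pending flag     */

static void replenish_free_list(void)  { /* step 1: shrink/outswap/etc.   */ }
static void write_modified_pages(void) { /* step 2: modified page writer  */ }
static void attempt_inswap(void)       { /* step 3: highest COMO process  */ }
static void check_powerfail_asts(void) { /* step 4: power recovery ASTs   */ }

static void hibernate(void)
{   /* in this sketch the wait is replaced by process exit */
    printf("swapper hibernates\n");
    exit(0);
}

void swapper_main_loop(void)
{
    for (;;) {
        replenish_free_list();
        write_modified_pages();
        attempt_inswap();
        check_powerfail_asts();

        /* step 5: skip the hibernate if anyone requested swapper
         * activity since this iteration began */
        if (!wake_pending)
            hibernate();
        wake_pending = 0;
    }
}

int main(void) { swapper_main_loop(); return 0; }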



17.3 SWAPPER'S USE OF MEMORY MANAGEMENT DATA STRUCTURES

In Chapter 16, the memory management data structures that are used by 
both the pager and the swapper were described. The discussion here will 
review those structures and add descriptions of those structures that are used 
exclusively by the swapper. 



17.3.1 Process Header 

The bulk of information that the swapper uses in managing the details of 
either inswapping or outswapping is contained in the process header. The 
process page tables contain a complete description of the address space for a 
given process. 




The working set list describes those PTEs that are valid. This list is crucial 
for the swapper because it is only the process working set that will be written 
to backing store when the process is outswapped. In a similar fashion, when 
it is time for a process to be inswapped, the working set list in the process 
header of an outswapped process describes what the rest of the process looks
like in the swap file. 

17.3.1.1 Working Set List. The working set list describes the portion of a process vir- 
tual address space that must be written to the swap file when the process is 
outswapped. A page in the process working set can be in one of the following 
three states: 

1. The page is valid. 

2. The page is currently being read into memory. The swapper treats page 
reads like any other I/O in progress when swapping a process. This treat- 
ment is described in Section 17.4. 

3. The process page table contains a global page table index and the indexed 
global page table entry indicates a transition state. The swapper handles 
global pages in a special manner when outswapping a process. This treat- 
ment is also described in Section 17.4. 

The operation of the swapper's scan of the process working set list at outswap 
is discussed in Section 17.4. 

17.3.1.2 Process Page Tables. The working set list does not supply the swapper with 
all the information necessary to outswap a process. Other information is con- 
tained in either the valid (or transition) PTE or in one of the PFN array ele- 
ments associated with the physical page. Each working set list entry effec- 
tively points to a different process (or system) page table entry that contains a 
page frame number. The PTE is copied to the swapper's I/O map and then the 
contents of the BAK array element for this physical page are put back into the 
process PTE. These actions eliminate any ties between an outswapped 
process's page tables and physical memory. 

17.3.1.3 Process Header Page Arrays. The breaking of ties between process PTEs and 
physical memory is straightforward for process pages. The contents of the 
BAK array element are simply merged into the PTE. However, process header 
pages are also a part of the process working set. These pages reside in system 
space and are mapped by system page table entries that map the balance slot 
in which the process header resides. 

The relinquishing of the balance slot implies that these SPTEs must also be 
surrendered. There is no analogous way to store the BAK array contents for 
process header pages. For this reason, the process header page arrays (see Fig- 
ure 14-8) exist in the process header. There exists an array element for each 




page in the process header. When a process is outswapped, those process 
header pages currently in the working set have their BAK addresses put into 
the corresponding array elements in the process header page BAK array. 
When the process is swapped back into memory, the process header pages can 
be scanned and the BAK contents copied from the array back into the PFN 
BAK array elements for the physical pages that contain the process header. 
In a similar manner, it is necessary to remember where each process header 
page fits into the working set. This record keeping is done by storing the 
WSLX PFN array element into the corresponding process header page WSLX 
array element. The use of this array while the process header is being rebuilt 
following inswap prevents a prohibitively long search of the working set list 
for each process header page. 
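The record keeping for both directions can be sketched as two small C routines operating on schematic arrays; the array names mirror the text, but the layout is invented for the example.

#include <stdio.h>

#define PHD_PAGES 8

/* per-physical-page PFN database (indexed by PFN) */
static unsigned pfn_bak[1024], pfn_wslx[1024];

/* arrays inside the process header (indexed by header page number) */
static unsigned phd_bak[PHD_PAGES], phd_wslx[PHD_PAGES];

void save_header_page(int hdr_page, unsigned pfn)
{   /* outswap: no SPTE survives to hold BAK, so save it in the header */
    phd_bak[hdr_page]  = pfn_bak[pfn];
    phd_wslx[hdr_page] = pfn_wslx[pfn];
}

void restore_header_page(int hdr_page, unsigned new_pfn)
{   /* inswap: rebuild the PFN database without searching the WSL */
    pfn_bak[new_pfn]  = phd_bak[hdr_page];
    pfn_wslx[new_pfn] = phd_wslx[hdr_page];
}

int main(void)
{
    pfn_bak[100] = 0xBA; pfn_wslx[100] = 3;
    save_header_page(0, 100);       /* header page 0 was in PFN 100   */
    restore_header_page(0, 200);    /* after inswap it is in PFN 200  */
    printf("BAK %x WSLX %u\n", pfn_bak[200], pfn_wslx[200]);
    return 0;
}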

17.3.2 Swapper I/O Data Structures 

Like the pager, the swapper uses the conventional VMS I/O subsystem. It 
allocates its own I/O request packet and fills in some of the fields that will be 
interpreted in a special manner by the I/O postprocessing routine. After these 
fields have been filled in, it jumps to one of the swapper I/O entry points in 
module SYSQIOREQ (EXE$BLDPKTSWPR or EXE$BLDPKTSWPW) that fills 
in an appropriate function code and queues the packet to the appropriate disk 
driver. Table 15-1 shows how the I/O request packet is used by the swapper
for its I/O activities. 

Two other structures are used by the swapper. The system maintains a 
page file control block for each page and swap file in the system. The swapper 
uses a special I/O array that allows it to read or write a process working set, a 
collection of virtually discontiguous pages, in one or a small number of I/O 
requests. 

17.3.2.1 Page File Control Blocks Used by the Swapper. Figure 14-23 shows the layout 
of a page file control block, the structure that allows a page or swap file to be 
located on disk. Notice that the window control block pointer and virtual 
block number field are located at the same offsets in page file control blocks 
and in process or global section table entries, which allows these data struc-
tures to be used by common routines that need not distinguish the type of 
structure being used to describe a memory management I/O request. 

17.3.2.2 Swap File Initialization. When the system is initialized, the SYSINIT process 
initializes the swap file SYS$SYSTEM:SWAPFILE.SYS. If alternate swap files 
are installed (with the SYSGEN command INSTALL), the page file control 
block for the new swap file is initialized by SYSGEN. 

17.3.2.3 Allocation of Swap Space. For each process, the indication of which page file 
control block to use is contained in the software PCB in field PCB$L_ 




WSSWP. The page file control block then indicates the file in which swap- 
ping space is assigned to the process. The upper byte is a longword index into 
the array of pointers to page file control blocks (see Figure 14-22). 

When a process is first created, its initial swap space is allocated for the 
process in a call to the Create Process ($CREPRC) system service. The initial 
size of the swap space is the SYSBOOT parameter MPW_WRTCLUSTER 
(minimized by the size of the SHELL process). The page file index and the 
virtual block number of the beginning of the space are recorded in the process 
control block as negative values. A negative value indicates to the swapper 
that this PCB requires an inswap from the SHELL. After the SHELL has been 
swapped in, the values are restored to their positive form. 

If a process control block contains a zero at location PCB$L_WSSWP, the 
swapping and paging systems assume that the process is permanently mem- 
ory resident. Only the processes that are created before the page and swap 
files are located (NULL process, SWAPPER process, and SYSINIT process) are 
permanently memory resident. 

When a process's working set list is extended, a check is made to see if the
new working set will fit in the currently allocated swap space. If the newly
sized working set list will not fit in the current swap space, a new swap
space (that is MPW_WRTCLUSTER pages larger) is allocated. The old swap space
is deallocated.
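A C sketch of this growth check follows; alloc_swap and free_swap are hypothetical stand-ins for the actual swap file allocation routines.

#include <stdio.h>

static const int MPW_WRTCLUSTER = 120;    /* SYSBOOT parameter (stand-in) */

static int  alloc_swap(int pages) { printf("alloc %d pages\n", pages); return 1; }
static void free_swap(void)       { printf("free old space\n"); }

static int swap_space_pages = 240;        /* current swap allocation */

void extend_working_set(int new_ws_pages)
{
    if (new_ws_pages <= swap_space_pages)
        return;                           /* still fits: nothing to do */

    /* allocate a space MPW_WRTCLUSTER pages larger, then release the old */
    int new_size = swap_space_pages + MPW_WRTCLUSTER;
    if (alloc_swap(new_size)) {
        free_swap();
        swap_space_pages = new_size;
    }
}

int main(void)
{
    extend_working_set(300);              /* grows 240 -> 360 */
    printf("swap space now %d pages\n", swap_space_pages);
    return 0;
}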



17.3.2.4 Swapper PTE Array. The swapper PTE array allows the swapper to read or
write pages that are virtually discontiguous in the context of the process
being swapped; the need for this array was described in Chapter 16. This array
contains WSMAX longwords and is used for both outswap and inswap operations.

At outswap, the PFN of each page that will be written to the swap file is 
loaded into the array. This array is then passed on to the I/O system to per- 
form the write. At inswap, the swapper allocates a number of PFNs to hold 
the process and reads the swap image into these pages. Each PFN is then 
placed into the appropriate page table as the working set list and process page 
tables are rebuilt. 



17.4 OUTSWAP OPERATION 

Outswap is described before inswap because it is easier to explain inswap in 
terms of what the swapper put into the swap file. The swapper does not 
remove processes from the balance set indiscriminately. In fact, the swapper 
tries hard not to swap. Processes are only removed if there is a need for physi- 
cal pages that cannot be satisfied by shrinking working sets and flushing the 
modified page list. 




17.4.1 Selection of Outswap Candidate 

As is mentioned in Section 17.2, the outswap selection is driven by tables 
that contain a weight for each resident scheduling state. The swapper selects 
the process that it judges will benefit the least from remaining in memory. 
Once a candidate is selected, the swapper prepares the working set of that 
process for outswap. 

1 7.4.2 Outswap of the Process Body 

The swapper outswaps the process body (P0 and P1 pages) separately from the
process header. There are two reasons for doing this: 

• Fields in the process header (most notably working set list entries and 
process page table entries) are modified as the working set list is processed. 

• The process header may not be swappable at this time due to outstanding 
I/O, pages on the modified page list, or some other reason. 

17.4.2.1 Scanning the Working Set List. The process body is prepared for outswap by 
scanning the working set list. Each page in the working set list must be 
looked at to determine if any special action is required. The swapper looks at 
a combination of the page type (found in the working set list entry as well as 
the PFN TYPE array) and the valid bit. Table 17-4 lists all combinations of 
page type and valid bit setting that the swapper encounters and the action 
that it takes for each. Several cases are discussed further here. 

The basic step that the swapper must take as it scans the working set list is 
to move each swappable page into the swapper's I/O map. This causes the 
virtually discontiguous pages in the process's working set to appear virtually 
contiguous to the I/O system (see Figures 17-3 and 17-6). For each page, the 
swapper performs the following steps: 

1. Locates the page table entry from the virtual page number field in the 
working set list entry. 

2. Determines any special action based on page validity and page type. 

3. Moves the PFN from the page table entry to the swapper map. 

4. Records the modify bit (the logical OR of the PTE modify bit and the PFN
STATE array saved modify bit) in the working set list entry.

5. Sets the Delete Contents bit in the PFN STATE array element. This set bit 
will cause the page to be placed at the head of the free page list when its 
reference count goes to zero (which in normal circumstances will be when 
the swap write completes). 

Note that the swapper does not have to explicitly put the contents of the PFN 
BAK array into each PTE. The contents are replaced when the page is released 
(after the swap write completes and all other references to the page have been 
eliminated). 
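The per-page steps can be approximated in C as follows. The structures are schematic, and the page type special cases of step 2 are deliberately omitted.

#include <stdio.h>

typedef struct { unsigned vpn; unsigned modify; } wsle_t;
typedef struct { unsigned pfn; unsigned modify; } pte_t;

static pte_t    page_table[64];
static unsigned pfn_saved_modify[256];    /* PFN STATE saved modify bits */
static unsigned pfn_delete_contents[256];
static unsigned swapper_map[64];          /* the swapper's PTE array     */

void scan_one_page(wsle_t *w, int map_index)
{
    pte_t *pte = &page_table[w->vpn];             /* 1. locate the PTE   */
    /* 2. page validity and type cases omitted in this sketch            */
    swapper_map[map_index] = pte->pfn;            /* 3. PFN into the map */
    w->modify = pte->modify | pfn_saved_modify[pte->pfn];   /* 4. record */
    pfn_delete_contents[pte->pfn] = 1;            /* 5. free on REFCNT=0 */
}

int main(void)
{
    page_table[7] = (pte_t){ .pfn = 42, .modify = 1 };
    wsle_t w = { .vpn = 7, .modify = 0 };
    scan_one_page(&w, 0);
    printf("map[0]=pfn %u, wsle modify=%u\n", swapper_map[0], w.modify);
    return 0;
}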






Table 17-4: Scan of Working Set List at Outswap

The scan of the working set list at outswap is determined by a combination of
the physical page type (WSL<3:1>) and the valid bit (PTE<31>).

1. Process Page (transition)
a. (STATE = Read in Progress) Treat as page with I/O in progress. Special
action may be taken at inswap or by the modified page writer.
b. (STATE = Active) Outswap. The page will be put back into the active
transition state at inswap time.
c. (STATE = Read Error) Drop from working set.
d. No other transition states are possible for a page in the working set.

2. Process Page (valid)
Outswap the page. If there is outstanding I/O and the page is modified, load
the SWPVBN array element with the block in the swap file where the updated
page contents should be written when the I/O completes.

3. System Page
It is impossible for a system page to be in a process working set. The
swapper generates an error.

4. Global Read-Only Page (transition)
a. If the process page table entry still contains a PFN, this page is in the
active transition state. Outswap the page.
b. If the process page table entry contains a global page table index, then
the global page table entry must contain a transition PTE. The page is
dropped from the process working set.

5. Global Read-Only Page (valid)
a. If SHRCNT = 1, then outswap.
b. If SHRCNT > 1, drop from working set. It is highly likely that a process
can fault the page later without I/O. This check avoids multiple copies of
the same page in the swap file.

6. Global Read/Write Page (valid)
Drop from working set. It is extremely difficult to determine whether the
page in memory was modified after this copy was written to the swap file.

7. Page Table Page
Not part of the process body. However, while the swapper is scanning the
process body, the VPN field in the WSL is modified to reflect the offset from
the beginning of the process header, because page table pages will probably
be located at different virtual addresses following inswap.




17.4.2.2 Pages with Direct I/O in Progress. If a (modified) page has outstanding I/O 
while the process is being outswapped, the swapper takes note of this by 
loading the SWPVBN PFN array element with the virtual block number in 
the swap file where the page is being written to. The page is nevertheless 
swapped at this time to reserve a place for it in the swap file. 

If the I/O operation is a read (or it is a write and some other action has 
caused the page to be modified), the physical page will be placed on the modi- 
fied page list when the I/O completes. MMG$RELPFN, the routine that re- 
leases the page, puts pages on the modified page list either if the modify bit 
in the PFN STATE array is set or if the PFN SWPVBN array has nonzero 
contents. 

The modified page writer takes special action for modified pages with non- 
zero contents in the SWPVBN array. That is, it writes each page to the desig- 
nated block in the swap file rather than to its normal backing store address. 

If the I/O operation is a write (from memory to mass storage) and the page 
was not otherwise modified, the contents that are currently being written to 
the swap file are good. The page will be placed on the free list when the write 
completes. 
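The release decision reduces to a two-condition test, sketched here in C with schematic PFN arrays.

#include <stdio.h>

static unsigned pfn_saved_modify[256];
static unsigned pfn_swpvbn[256];

const char *release_target(unsigned pfn)
{
    if (pfn_saved_modify[pfn] || pfn_swpvbn[pfn] != 0)
        return "modified page list";   /* the modified page writer will
                                          rewrite it, using the SWPVBN
                                          block if one is recorded */
    return "free page list";
}

int main(void)
{
    pfn_swpvbn[10] = 4096;             /* block reserved in the swap file */
    printf("pfn 10 -> %s\n", release_target(10));
    printf("pfn 11 -> %s\n", release_target(11));
    return 0;
}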

17.4.2.3 Global Pages. Global pages are also given special treatment at outswap. If the 
global page is writeable, it is dropped from the process working set before the 
process is swapped to disk. The task of recording whether the contents that 
are swapped are up to date when the process is brought back into memory is 
more complicated than simply refaulting the page (often without I/O) when 
the process is swapped back into memory. 

Global read-only pages are only swapped if the global share count (PFN 
SHRCNT array) is one. In all other cases, the page is dropped from the work- 
ing set and must be refaulted (most likely without I/O) when the process is 
inswapped. (Global pages that are explicitly or implicitly locked into the 
process working set are not dropped from the working set.) Global transition 
pages are also dropped from the process working set. 

17.4.2.4 Example of Process Body Outswap. Figures 17-2 through 17-4 show some of 
the special cases encountered by the swapper while it is scanning the process 
working set list. As mentioned in connection with Table 17-4, the key infor- 
mation about each page is a combination of the PTE valid bit and the physical 
page type. The order of the scan is determined by the order defined by the 
working set list. Figure 17-2 shows the process working set, the process page 
tables, and the associated PFN database entries before the swapper begins its 
working set scan. Figure 17-3 shows the modified working set and the 
swapper map after the working set list scan but before the I/O request is 
initiated. Figure 17-4 shows the state of the page table entries after the swap 
write has completed and the physical pages have been released. 



Figure 17-2
Example Working Set List before Outswap Scan
(The figure diagrams the process header of the process being outswapped: the
working set list entries for virtual pages W, X, Y, and Z; the P0 and P1 page
tables; the global page table; the PFN database arrays, including the WSLX,
PTE, BAK, STATE, TYPE, SHRCNT, and REFCNT elements for physical pages A
through D; and the swapper's I/O map at SWP$GL_MAP, which is still empty.)



1. The first working set list entry is a global read-only page. The VPN field of
the working set list entry locates the page table entry. The PFN field of the 
PTE locates the PFN data associated with this physical page. In particular, 
the global share count for this page is one. (This process is the only process 
that currently has this page in its working set.) The swapper will write this 
page out as part of the swap image for this process. Thus, PFN A is the first 
page in the swapper's PTE array (see Figure 17-3). 

When the swapper's write operation completes, the page will be deleted. 
That is, the PTE array element will be cleared and the page will be placed 
at the head of the free page list (see Figure 17-4). 

2. The second working set list entry is a process page that also has I/O in 
progress (REFCNT = 2). This page will be swapped. This fact is illustrated 
by the inclusion of PFN C in the swapper map. 

If the page was previously modified (either the PTE modify bit or saved 
modify bit in the PFN STATE array was set), the virtual block number in
the swap file will be loaded into the SWPVBN array.

Figure 17-3
Example Working Set List after Outswap Scan
(The figure shows the same structures after the scan: PFNs A, C, and D have
been loaded into the swapper's I/O map, and the global read/write page B has
been dropped from the working set, its process page table entry again
containing a global page table index.)

Loading the SWPVBN
array will force the page to the modified page list when it is released. If the 
process is still outswapped by the time that the modified page writer gets 
around to writing this page, the page will be written to the block reserved 
for it when the process was first outswapped.

The page is marked for deletion. That is, when the reference count for 
the page reaches zero (due to completion of both the outstanding I/O and 
the swapper's write), the page is placed at the head of the free page list and 
its PTE array element cleared. 

3. The third working set list entry is a global read/write page. The page is
dropped from the process working set (see Figure 17-3), meaning that the 
process page table entry is replaced with a global page table index (that 
locates global page table entry R) and the share count for PFN B is decre- 
mented. Notice that PFN B is not a part of the swapper map, which con- 
tains a list of the physical pages that will be written to the swap file. 



Figure 17-4
Process Page Table Changes after Swapper's Write Completes
(The figure shows the process page table entries after the physical pages
have been released: the PTEs for virtual pages W and Y contain global page
table indexes, the PTE for page X contains a process section table index, and
the PTE for page Z still contains a transition PFN because its direct I/O has
not yet completed.)



4. The last working set list entry in this example is a process page with 
nothing special about it. This page is added to the swapper map (PFN D) 
and its contents marked for deletion. The deletion will actually occur 
when the swapper's write operation completes. 



17.4.3 Outswap of Process Header 

The process header is not outswapped until after the process body has been 
successfully written to the swap file. The reason for this illustrates two other 
cases that can keep the process header in memory. Before the process header 
can be outswapped, all ties to physical memory that exist in the process page 
tables must be severed, including not only those pages that were in the proc- 
ess working set and written to the swap file but also those pages that are in 
some transition state, most notably pages on the free and modified page lists. 




17.4.3.1 Partial Outswap. After the process body has been outswapped, the process 
header becomes eligible for outswap. In fact, the header of an outswapped 
process is the first thing that the swapper looks for in an attempt to balance 
the free page list. 

The indication that the process header cannot be outswapped yet is found 
in the process header vector reference count array (see Figure 14-21). This 
array counts the number of reasons (transition pages, active page table pages, 
and so on) that prevent the process header from being outswapped. 

Because the outswap of the header does not have to immediately follow the 
body outswap, it is possible (even probable) that a process header will not be 
swapped in the time between when a process body is outswapped and when 
that process is brought back into memory. Such a situation is referred to as a 
partial outswap. It has an obvious counterpart, a partial inswap, where the 
swapper does not have to allocate a balance slot and bring the process header 
into memory because the header is already resident. 

An important system management point is illustrated here. Process bodies, 
which consume physical memory, are relatively easy to remove from mem- 
ory. Process headers consume a smaller amount of physical memory but they 
also occupy a balance slot. The balance slot is not freed for other use until the 
entire header is outswapped. If the SYSBOOT parameter BALSETCNT is set 
to too small a value, the system can reach the unfortunate state where there 
is more than enough physical memory, but computable processes cannot be 
brought into memory because the balance slots are still tied to already 
outswapped processes. This situation can be avoided by setting BALSETCNT 
to an adequate value. See the VAX/VMS System Management and Oper- 
ations Guide for details on determining the correct value for SYSBOOT 
parameters. 

17.4.3.2 Scanning the Free Page List. When the swapper locates a process header that 
can be removed from its balance slot, it takes whatever actions are required 
to remove the ties that bind the process header to physical memory. The first 
such step is to eliminate any transition PTEs where the physical page is on 
the free page list. 

Transition PTEs are located by scanning the entire free page list and look-
ing for pages whose PTE array contents lie within the P0 or P1 page tables of
the process header being examined. Whenever such a page is found, the proc-
ess PTE is reset to the contents of the BAK array; the reference count and PTE
array elements are cleared, and the page is moved from its current location to
the head of the free page list.

17.4.3.3 Flushing the Modified Page List. Because the free page list is only one of 
several transition states, the scan of the free page list may not free the process 
header for removal. Pages may be in some other transition state. Transition 




states that represent some form of I/O in progress (release pending, read in 
progress, write in progress) are left alone because there is nothing that the 
swapper can do until the I/O completes. 

However, the modified page list can be manipulated. The desired effect is 
removal of all pages from the modified page list, which is triggered by setting 
to zero both the lower and upper limit thresholds for the modified page list. 
Clearing the upper limit guarantees that a nonempty list has exceeded its 
threshold, initiating a request for modified page writing. Clearing the lower 
limit causes modified page writing to continue until the list is empty (below 
the low limit threshold). 
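The flush technique can be sketched in C; the variable names shadow the globals named in the text, and the writer loop is schematic.

#include <stdio.h>

static int sch_gl_mfylim   = 100;   /* stand-in for SCH$GL_MFYLIM   */
static int sch_gl_mfylolim = 30;    /* stand-in for SCH$GL_MFYLOLIM */
static int modified_count  = 57;

static void write_one_modified_page(void) { modified_count--; }

void flush_modified_page_list(void)
{
    sch_gl_mfylim = sch_gl_mfylolim = 0;       /* trigger a full flush */

    if (modified_count > sch_gl_mfylim)        /* a nonempty list now
                                                  exceeds its threshold */
        while (modified_count > sch_gl_mfylolim)
            write_one_modified_page();         /* runs until empty */

    /* the modified page writer restores the SYSBOOT values on exit */
    sch_gl_mfylim   = 100;   /* MPW_HILIMIT */
    sch_gl_mfylolim = 30;    /* MPW_LOLIMIT */
}

int main(void)
{
    flush_modified_page_list();
    printf("modified list count = %d\n", modified_count);
    return 0;
}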

17.4.3.4 Outswap of the Process Header. Once the reference count for the process 
header reaches zero, the header can be outswapped and the balance slot freed. 
The outswap of the process header is entirely analogous to the outswap of a 
process body. That is, the header pages that are not page table pages and the 
active page table pages are scanned and put into the swapper's PTE array to 
form a virtually contiguous block for the I/O subsystem. 

There are several differences between the outswap of a process header and a 
process body. When a process body is outswapped, the header that maps that 
body is still resident. When the swapper's write completes and each physical 
page is deleted, the contents of the BAK array element for each page are put 
back into the process PTE. 

Process header pages are mapped by system page table entries for that bal- 
ance slot. The SPTEs are not available to hold the BAK array contents be- 
cause they will be used by the next occupant of this balance slot. One of the 
process header page arrays (see Chapter 14) is set aside for exactly this pur- 
pose. As the process header is processed for outswap, the contents of the BAK 
array for each active header page are stored in the corresponding process 
header page array element. 

At the same time, the location of each header page within the working set 
list is stored in the WSLX array. This array prevents a prohibitively long 
search to rebuild the process header when the process is swapped back into 
memory. 

Once the header is successfully outswapped, the header resident bit 
(PCB$V_PHDRES) in the PCB is cleared and the balance slot is available for 
further use. 



17.5 INSWAP OPERATION 

The inswap is exactly the opposite of the outswap operation. The swapper 
brings the process header, including active page tables, and the process body 
back into physical memory. It then uses the contents of the working set list 
to rebuild the process page tables, an operation that primarily involves updat- 




ing each valid PTE to reflect the new PFN used by that PTE. At the same time 
that each page is being processed, the swapper can resolve any special cases 
that existed when the process was outswapped. 

17.5.1 Selection of an Inswap Candidate 

As mentioned earlier in the chapter, the swapper selects a process for inswap 
exactly as the scheduler selects a candidate for execution. The following 
processes may be potential candidates for inswap: 

• Newly created processes 

• Processes in some outswapped wait state that were just made computable 

• Processes that were outswapped while in the computable state 

The highest priority process in this collection is the one selected for inswap. 

17.5.2 Inswap of the Process Header 

If the process header was outswapped when the body was outswapped, it 
must be brought back into memory before the process body can be recon- 
structed. Unlike the special operations that took place when the process was 
outswapped, an outswapped process header merely adds two details to the 
inswap operation. 

1. If the header is resident, the number of header pages is subtracted from the 
size of the outswap image in the swap file. That is, whether the header is 
resident or not determines the total number of blocks that must be read 
from the swap file and the virtual block number where the read should 
begin. 

2. If the header was swapped, those process parameters that are tied to a 
specific balance slot (that is, specific system virtual or physical addresses) 
must be adjusted to reflect the new locations in virtual or physical address 
space. These include the following: 

• Each SPTE must be loaded with the PFN that contains the contents of 
each process header page. 

• The virtual addresses of the P0 and P1 page tables must be calculated
and loaded into their locations in the hardware PCB.

• The physical address of the hardware PCB must be calculated and 
loaded into the software PCB (in field PCB$L_PHYPCB). 

• Finally, the P1 page table entries that double map the process header pages that are
not page table pages must be loaded with the new page frame numbers 
that contain these pages. 

17.5.2.1 Rebuilding the Process Header. When a process header is read from the swap 
image into a new balance slot, the SPTEs that map each balance slot page 




must be loaded with the PFNs from the swapper map that contain each 
header page. In addition, the PFN database must be set up for each of these 
physical pages. The swapper does all this work in a very simple loop that it 
executes for each header page. 

The simplicity (and speed) of the loop results from the use of the two proc- 
ess header page arrays that exist in the process header. These arrays allow the 
PFN BAK and WSLX arrays to be loaded with their previous contents (be- 
cause the two header arrays were loaded when the process was outswapped). 

17.5.2.2 P1 Window to the Process Header. All of the process header pages except
process page tables are double mapped with a range of P1 addresses. This
double mapping is done for the following reason. When a process header is
outswapped and subsequently inswapped, it probably resides in a different
balance slot. Any routine that stores that process header address in a register
and then references header locations with a displacement from this register
might be referencing the header of another process if some scheduling and
swapping occurred between obtaining the header base address and later refer-
ences using it.

To avoid this problem, a range of P1 space is set up by the swapper to map
these same header pages. The P1 pages are mapped in such a way that, even if
an outswap and later inswap occur between two instructions, the P1 virtual
addresses of the process header pages do not change. The conventions that
the operating system observes about header references are these:

• Any reference to the process header should use the P1 address (CTL$GL_
PHD contents point to the P1 map of the process header).

• Any reference to the system space header must execute at IPL 7 (IPL$_
SYNCH) to prevent a swap.

• Any reference to process page tables must execute at IPL 7 because the
page table pages are not double mapped.

There are two implications for the operating system here.

• These physical pages are not kept track of in any way through reference
counts or any other technique. However, all of these header pages are a
permanent part of the process working set.

• The P1 page table page that maps these pages must also be a permanent
member of the process working set.

17.5.3 Rebuilding the Process Body 

The process header must be put into a known state before the process body 
can be put back into the approximate shape it was in before the process was 
outswapped. If the header was never outswapped, there is very little that has 
to be done. If the header was outswapped, the steps just described are taken to 
put the process header back together again. 




17.5.3.1 Rebuilding the Working Set List and Process Page Tables. The rebuilding of 
the process body involves a simple scan of both the swapper map and the 
process working set list. Recall that at outswap, the key to each special case 
was the combination of physical page type and the setting of the valid bit in 
the page table entry. On inswap, the key to each special case is the contents 
of the page table entry located by the virtual page number field in the work- 
ing set list entry. An approximation of swapper activity for each page is as
follows (a sketch follows the list):

1. The page table entry is located from the VPN field of the WSLE. 

2. In the usual case, the original contents of the PTE are put into the PFN 
BAK array and the PFN from the swapper map is loaded into the now valid 
PTE. 

3. If for some reason a copy of the page already exists in memory, then that 
page is put into the process working set, and the duplicate page from the 
swapper map is released to the front of the free page list. 
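A C approximation of the rebuild step, covering the usual case and the duplicate-page case, follows; all structure layouts and helpers are schematic.

#include <stdio.h>

typedef struct { unsigned vpn; } wsle_t;
typedef struct { int valid; int transition; unsigned pfn; unsigned bak; } pte_t;

static pte_t    page_table[64];
static unsigned pfn_bak[256];
static unsigned swapper_map[64];

static void release_to_free_list_head(unsigned pfn)
{   printf("release duplicate pfn %u\n", pfn); }

void rebuild_one_page(const wsle_t *w, int map_index)
{
    pte_t *pte = &page_table[w->vpn];     /* 1. locate PTE from the WSLE */

    if (pte->transition) {                /* page never left memory: the
                                             transition PTE already holds
                                             the resident PFN            */
        pte->valid = 1;
        release_to_free_list_head(swapper_map[map_index]);
        return;
    }

    /* 2. usual case: save the backing store address, then make the PTE
     *    valid with the physical page the swap image was read into      */
    unsigned new_pfn = swapper_map[map_index];
    pfn_bak[new_pfn] = pte->bak;
    pte->pfn   = new_pfn;
    pte->valid = 1;
}

int main(void)
{
    page_table[7] = (pte_t){ .valid = 0, .transition = 0, .pfn = 0, .bak = 0xBA };
    swapper_map[0] = 42;
    wsle_t w = { .vpn = 7 };
    rebuild_one_page(&w, 0);
    printf("vpn 7 -> pfn %u valid=%d\n", page_table[7].pfn, page_table[7].valid);
    return 0;
}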

Table 17-5 contains a detailed list of the different cases that the swapper can 
encounter when rebuilding the process page tables. Three of the cases deserve 
special comment. 

17.5.3.2 Pages with I/O in Progress When Outswap Occurred. Pages that had I/O in 
progress when the process was outswapped were written to the swap file 
anyway to reserve space. If the page was previously unmodified, then it 
would be put onto the free page list when both the swap write and the out- 
standing write operation completed. If the page was previously modified, 
then it would be put onto the modified page list when both the swap write 
and the outstanding write operation completed (because the contents of the 
SWPVBN array were nonzero). 

In either case, it is possible for the process to be swapped back in before one
of these physical pages is reused. The swapper uses the physical page that is
already contained in the process PTE (as a transition page) and releases the 
duplicate physical page from the swapper map to the front of the free page 
list. 

In the case of a page on the free page list, this decision is simply one of 
convenience. In the case of a page on the modified page list, the contents of 
the page in the swap image are out of date and the swapper has no choice but 
to use the physical page that is already in memory. 

17.5.3.3 Resolution of Global Read-Only Pages. The only possible global page that 
could be in the swap file is a global read-only page that had a share count of 
one when the process was outswapped (or a page that was explicitly locked). 
All other global pages were dropped from the process working set before the 
process was outswapped. 






Table 17-5: Rebuilding the Working Set List and the Process Page Tables at Inswap 
At inswap time, the swapper uses the contents of the page table entry to determine what 
action to take for each particular page. 



1. PTE is valid.
   Action: Page is locked into memory and was never outswapped.

2. PTE indicates a transition page (probably due to outstanding I/O when
   process was outswapped).
   Action: Fault transition page into process working set. Release the
   duplicate page that was just swapped in.

3. PTE contains a global page table index (GPTX). (Page must be global
   read-only because global read/write pages were dropped from the working
   set at outswap time.)
   Action: Swapper action is based on the contents of the global page
   table entry (GPTE).
   a. If the global page table entry is valid, add the PFN from the GPTE
      to the process working set and release the duplicate page.
   b. If the global page table entry indicates a transition page, make the
      global page table entry valid, add that physical page to the process
      working set, and release the duplicate page.
   c. If the global page table entry indicates a global section table
      index, then keep the page just swapped in, and make that the master
      page in the global page table entry as well as the slave page in the
      process page table entry.

4. PTE contains a page file index or a process section table index. (These
   are the usual contents for pages that did not have outstanding I/O or
   other page references when the process was outswapped.)
   Action: The PFN in the swapper map is inserted into the process page
   table. The PFN arrays are initialized for that page.



There are two different cases that the swapper will find when rebuilding 
the process page tables. In either case, the process page table entry contains a 
global page table index so the determining factor is the contents of the global 
page table entry. 

1. The global page table entry contains a global section table index. In this 
case, the physical page from the swapper map is added to the global page 
table entry as well as the process page table entry. 

2. It is possible that the global page was referenced by some other process 
while this process was outswapped. In that case, the global page table 
entry might contain a transition or valid PTE. In either case, the PFN that 
is already in the global page table entry is kept. (If the GPTE is in transi-
tion, it is made valid.) The duplicate PFN from the swapper map is re-
leased to the front of the free page list. 



17.5.3.4 Example of an Inswap Operation. To illustrate at least some of the special
cases that the swapper encounters when a process body is swapped back into 
memory, Figures 17-5 through 17-7 contain an example of an inswap opera- 
tion. Note that this example is not related to the outswap example used 
before (see Figures 17-2 to 17-4). This example is tailored to illustrate the 
interesting cases the swapper can encounter during an inswap operation. 
Figure 17-5 shows the state of the process header after the process has been
selected as an inswap candidate. Figure 17-6 shows that four physical pages
have been allocated to contain the four working pages that the example is
describing. Figure 17-7 shows the rebuilt process page tables and the PFN
database changes that result from rebuilding the working set and process
page tables.

[Figure 17-5: Working Set List and Swapper Map before Physical Page Allocation]








[Figure 17-6: Working Set List and Swapper Map after Physical Page Allocation]



[Figure 17-7: Working Set List and Rebuilt Page Tables]

1. The first working set list entry locates virtual page number X. This PTE
contains a global page table index. The referenced global page table entry 
(GPTE T) contains a global section table index, indicating that the global 
page table entry is not valid. 

The page frame number (PFN D) is put into the process page table. It is 
also added to the global page database by making the GPTE valid (see 
Figure 17-7), putting PFN D into the GPTE, and updating the PFN data for 
physical page D to reflect its new state. 

2. The next working set list entry is a process page mapped by PTE W (see 
Figure 17-6). This PTE contains a process section table index. The PTE is
updated to contain PFN C and the PSTX is stored in the BAK array ele-
ment for that page (see Figure 17-7). Other PFN arrays are updated accord-
ingly. 

3. The next working set list entry (that locates PTE Y) is exactly like the
first, as far as the process data is concerned. However, the global page table 
entry (GPTE S) is valid, indicating that another copy of this page already 
exists. (This second copy could only have happened if another process 
faulted the page while this process was outswapped.) 

The duplicate page (PFN E) is released to the front of the free page list. 
The process page table entry is updated to contain the physical page that 
already exists (PFN B) and the share count for that page is incremented 
(from three to four). 

4. The fourth working set list entry looks just like the second. However, the
process page table entry indicates a transition page. (This implies that the 
header in this example was never outswapped.) 

The action taken here is similar to step 3, where a duplicate global page 
was discovered. The page just read (PFN F) is released to the head of the 
free list. The transition page (PFN A) is faulted back into the process work- 
ing set by removing the page from the free list, setting its state to active, 
and turning the valid bit in the PTE back on. 

17.5.3.5 Final Processing of the Inswap Operation. After the working set list has been 
scanned and the process page tables rebuilt, the process is ready to have its 
state changed from computable but outswapped to computable and resident. 
Several other scheduling details must be taken care of before the scheduler is 
notified. 

1. A new value of ASTLVL is calculated and loaded into the hardware PCB in 
the process header. ASTs may have been enqueued to the process while it 
was outswapped. The hardware PCB, which contains a copy of the 
ASTLVL register, was not available while the header was not resident. 

2. The resident bit and the initial quantum bit in the status longword in the 
software PCB are set. 

3. A new quantum interval is loaded into the process header. 

4. Finally, the scheduler is called to make the process computable. 






PART V/Input/Output 



18 I/O System Services 



Delay not, Caesar. Read it instantly. 
— Julius Caesar 3, 1

Here is a letter, read it at your leisure. 

— Merchant of Venice 5, 1 

All I/O operations performed on a device are requested using the I/O system 
services. Sometimes, in addition to being called directly by the user, the I/O 
system services are called on behalf of a user by system components, such as 
RMS. 
This chapter describes the following topics: 

• What must be done before an I/O request can be made (channel assignment 
and device allocation) 

• How an I/O request is sent to a device driver 

• How a user is notified of the completion of an I/O request 

• How a user can obtain information about a particular device or I/O request 



18.1 ASSIGNING AND DEASSIGNING CHANNELS 

In order to request an I/O operation on a device, a process needs to identify 
the device to the system. The software mechanism used to link a process to a 
device is called a channel. Once a user establishes a channel to a device (using 
the $ASSIGN system service), the user may issue I/O requests (with the 
$QIO system service) for that device by specifying the channel number as- 
signed to the device. If the user no longer wants to use the device, the 
$DASSGN system service can be used to deallocate the channel assigned to 
the device. 



18.1.1 Channel Assignment 

A channel is described by a channel control block (CCB); the CCBs form a
table located in a dedicated portion of P1 space (see Figure 1-7 and Table
26-4). When a channel
is assigned to certain nonshareable devices, the user may also associate a 
mailbox with that device to receive status information such as the arrival of 
unsolicited input from a terminal. It is up to the device driver for each device 
to either use or ignore this associated mailbox. The VAX/VMS Guide to Writ- 
ing a Device Driver contains a complete description of the CCB. 
The $ASSIGN system service calls on the system routines IOC$FFCHAN
and IOC$SEARCHDEV (in IOSUBPAGD) to find a free I/O channel (CCB),
and to find the unit control block (UCB) for the device that is being assigned. 
After that, one of the paths described in the following sections is taken, de- 
pending on whether the device is one of the following: 

• A local device (not located on another node) 

• A spooled device 

• The network device NET 

• A remote process or task (located on another node) 

18.1.1.1 Local Device Assignment. This is the normal path through the Assign Chan- 
nel system service. 

1. A check is made to see if the device is allocated to another process that is
not a parent process of the process assigning the channel.

2. The DEV$V_SHR bit in UCB$L_DEVCHAR is checked to see if the de-
vice is a shareable device. If the device is nonshareable and the volume
protection and owner UIC allow it, the device is implicitly allocated to the
process (by placing the process ID, from PCB$L_PID, into UCB$L_PID).
The UCB address is stored in CCB$L_UCB. Whenever the user issues an 
I/O request, this pointer is used to locate the device. 

3. If an associated mailbox was requested, it is identified by placing the UCB 
address (of the mailbox) in the UCB$L_AMB field of the UCB for the de- 
vice to which the channel is being assigned. The UCB$W_REFC field of 
the associated mailbox is incremented, and the CCB$V_AMB flag is set in 
CCB$B_STS to indicate that an associated mailbox is present. Note that 
no association is made if one of the following is true: 

— The device is a file-oriented device (identified by the DEV$V_FOD bit 

in UCB$L_DEVCHAR). 
—The device is shareable (DEV$V_SHR in UCB$L_DEVCHAR). 
—The device already has an associated mailbox (the UCB$L_AMB field is 

nonzero). 

4. The device reference count (UCB$W_REFC) is incremented. 

5. The access mode (plus one) at which the channel is being assigned is 
stored in CCB$B_AMOD. IOC$FFCHAN identifies an unused CCB by 
looking in the CCB$B_AMOD field. If the value stored there is a zero, the 
CCB is not being used. 

6. Any flags associated with the channel (such as CCB$V_AMB, indicating
that an associated mailbox is present) are stored in CCB$B_STS. 

7. The channel number (really an index into the CCB table in process PI 
space, provided by IOC$FFCHAN) is returned to the user at the address 
specified in the CHAN argument to $ASSIGN. 

8. The normal successful completion code (SS$_NORMAL) is returned to 
the user. 
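
The CCB fields mentioned in this chapter can be pictured with the following
C sketch. The layout is an assumption for illustration (the authoritative
definition is in the VAX/VMS Guide to Writing a Device Driver); '$' is
rendered as '_'.

    struct ucb; struct wcb; struct irp;   /* opaque here */

    struct ccb {
        struct ucb    *ccb_l_ucb;   /* CCB$L_UCB:  UCB of assigned device    */
        unsigned char  ccb_b_amod;  /* CCB$B_AMOD: access mode + 1; 0 = free */
        unsigned char  ccb_b_sts;   /* CCB$B_STS:  flags, e.g. CCB$V_AMB     */
        unsigned short ccb_w_ioc;   /* CCB$W_IOC:  outstanding I/O count     */
        struct wcb    *ccb_l_wind;  /* CCB$L_WIND: open-file (window) block  */
        struct irp    *ccb_l_dirp;  /* CCB$L_DIRP: pending deaccess IRP      */
    };

The channel number returned in the CHAN argument is simply the index of
such a block in the P1 space CCB table, which is why IOC$FFCHAN can find
a free channel by scanning for a zero CCB$B_AMOD field.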




18.1.1.2 Special Action When Assigning a Spooled Device. If the DEV$V_SPL bit in
UCB$L_DEVCHAR is set, then the device being assigned is a spooled device. 
The only difference in channel assignment for spooled devices is that the 
status field in the channel control block (CCB$B_STS) is cleared. The device 
associated with the spooled device had its UCB address stored in the 
UCB$L_AMB field when the device was set to spooled. When an I/O request 
is passed to a spooled device, the $QIO system service recognizes that the 
device is spooled and actually performs the I/O request to the associated 
device. 

18.1.1.3 Assigning a Channel to the Network Device. If the device being assigned is a 
network device (that is, the user is assigning a channel to the NET device, 
probably to perform task-to-task communication), the following steps are 
taken: 

1. A check is made to see that the calling process has NETMBX privilege.

2. A network UCB is created by IOC$CREATE_UCB (in IOSUBPAGD). 

3. The UCB is made to look like a mailbox UCB that is marked for deletion 
(the UCB$V_DELMBX bit in UCB$W_DEVSTS is set). When the user
deassigns the channel, the UCB will be deleted. 

4. The user's byte count quota and limit are reduced by the size of the UCB. 

5. The NETDRIVER unit initialization routine is called. 

6. Further processing proceeds as in the case of a local, nonshareable device. 



18.1.2 Channel Deassignment 

The $DASSGN system service deassigns a previously assigned I/O channel 
and clears the linkage and control information in the corresponding CCB. 
These tasks are accomplished with the following steps: 

1. Any outstanding I/O is canceled. 

2. If a file is open on the channel (indicated by CCB$L_WIND being non- 
zero), then that file is closed (by issuing a $QIOW with the 
IO$_DEACCESS function code, and specifying event flag number 30). 
This method is also used to dissolve logical links. 

3. If any I/O is still outstanding (indicated by CCB$W_IOC being nonzero), 
the process is placed into an RSN$_ASTWAIT wait state (waiting for the 
I/O completion AST(s) to be delivered). Chapter 10 discusses wait states in 
detail. 

4. The channel is actually deassigned by clearing the CCB$B_AMOD field. 

5. If this was the last channel assigned to the device (UCB$W_REFC con- 
tains a 0), the device is implicitly deallocated (by clearing UCB$L_PID). 

6. If the device is marked for dismount (the DEV$V_DMT bit in
UCB$L_DEVCHAR is set) and it was not mounted with a VMS ACP (the
foreign bit DEV$V_FOR is set), the dismount (DEV$V_DMT), mounted
(DEV$V_MNT), read check (DEV$V_RCK), write check (DEV$V_WCK),
and software write locked (DEV$V_SWL) bits in UCB$L_DEVCHAR are 
cleared. The UCB$L_VCB field is cleared, and if that field was not zero, 
the volume control block pointed to by that field is deallocated. Also, the 
volume protection mask (UCB$W_PROT) and the software volume valid 
bit (UCB$V_VALID in UCB$W_STS) are cleared. 

7. If UCB$W_REFC equals zero, or if the calling process has allocated the 
device, the associated device driver's cancel I/O routine is called to per- 
form any device-dependent operations (see the VAX/VMS Guide to Writ- 
ing a Device Driver). The reason code CAN$C_DASSGN is passed to the 
cancel I/O routine. 

8. If a mailbox was associated with the device when the channel was as- 
signed (indicated by CCB$V_AMB in CCB$B_STS), then the linkage with 
the mailbox is cleared by taking these steps: 

a. Clearing UCB$L_AMB 

b. Decrementing UCB$W_REFC for the mailbox UCB 

c. Calling IOC$DELMBX (in IOSUBNPAG) to see if the mailbox UCB 
should be deleted (in case this was the last process referencing a tempo- 
rary mailbox) 

9. If the device to which the channel was assigned was a mailbox (indicated 
by the DEV$V_MBX bit in UCB$L_DEVCHAR), IOC$DELMBX is called 
to see if that mailbox should be deleted. 



18.2 DEVICE ALLOCATION AND DEALLOCATION 

A process allocates a device (using the $ALLOC system service) to reserve 
that device for exclusive use. A process deallocates a device (using the 
$DALLOC system service) to relinquish exclusive ownership. The code 
for $ALLOC and $DALLOC is found in module SYSDEVALC.

18.2.1 Device Allocation 

The following steps are taken by EXE$ALLOC to allocate a device: 

1. The generic allocation routine IOC$SEARCHGEN is called to perform 
logical name translation and select a device, if generic allocation was re- 
quested. 

2. The process ID (PCB$L_PID) is stored in the device owner field 
(UCB$L_PID). 

3. The device allocated bit (DEV$V_ALL in UCB$L_DEVCHAR) is set. 

4. The device reference count (UCB$W_REFC) is incremented. 

5. The access mode at which the device is allocated is placed in 
UCB$B_AMOD. 




Any of the following conditions will prevent device allocation: 

• The device is already allocated by another process (UCB$L_PID is non- 
zero). 

• The device reference count (UCB$W_REFC) is nonzero. 

• The mounted bit (DEV$V_MNT in UCB$L_DEVCHAR) is set.

• The spooled bit (DEV$V_SPL in UCB$L_DEVCHAR) is set, and the proc-
ess does not have ALLSPOOL privilege.

• The device is nonshareable, and the requesting process does not have ac- 
cess rights (located through PCB$L_ARB) allowing it to allocate the de- 
vice, as determined by the device's owner UIC and volume protection 
(UCB$L_OWNUIC and UCB$W_VPROT). 
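
Collected into one place, the conditions above amount to the following
check, shown here as a simplified sketch; the field and helper names are
illustrative, and the bit fields stand in for the DEV$V_x tests in
UCB$L_DEVCHAR.

    struct pcb;                          /* process control block (opaque) */
    struct ucb {                         /* only the fields tested here    */
        unsigned       ucb_l_pid;        /* UCB$L_PID                      */
        unsigned short ucb_w_refc;       /* UCB$W_REFC                     */
        unsigned dev_mnt : 1, dev_spl : 1, dev_shr : 1;
    };

    extern int has_allspool_priv(struct pcb *);
    extern int protection_allows(struct pcb *, struct ucb *);

    int can_allocate(struct pcb *pcb, struct ucb *ucb)
    {
        if (ucb->ucb_l_pid != 0)  return 0;   /* owned by another process */
        if (ucb->ucb_w_refc != 0) return 0;   /* channels still assigned  */
        if (ucb->dev_mnt)         return 0;   /* volume mounted           */
        if (ucb->dev_spl && !has_allspool_priv(pcb))
            return 0;                         /* spooled, no ALLSPOOL     */
        if (!ucb->dev_shr && !protection_allows(pcb, ucb))
            return 0;                         /* owner UIC/protection     */
        return 1;
    }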

18.2.2 Device Deallocation 

A process may choose to deallocate a single device or all devices allocated to 
it. For each device that is to be deallocated, EXE$DALLOC finds its UCB 
address either directly, from the DEVNAM argument in the $DALLOC call, 
or by examining each UCB in the system. The routine IOC$SEARCHDEV is 
used to relate device names to UCB addresses and to perform logical name 
translations. 

Each UCB in the system can be found by following a linked list of device 
data blocks (DDBs) that name each device controller in the system (the first
DDB is pointed to by global symbol IOC$GL_DEVLIST). Each DDB contains 
a pointer to the first device UCB on the controller, and all of the UCBs for the 
devices on a given controller are linked together. 

A device is deallocated when the following are true: 

• The UCB$L_PID field matches the PCB$L_PID field of the process issu- 
ing the $DALLOC. 

• The access mode at which the deallocate request is being made is at least 
as privileged as the access mode at which the device was allocated. 

• The allocated bit (DEV$V_ALL in UCB$L_DEVCHAR) is set. 

• The device mounted bit (DEV$V_MNT in UCB$L_DEVCHAR) is clear. 

• The reference count (UCB$W_REFC) equals 1, indicating that no more 
channels are assigned to the device. 

The device is deallocated by taking these steps: 

1. Clearing the device allocated bit (DEV$V_ALL in UCB$L_DEVCHAR) 

2. Clearing the device owner process id field (UCB$L_PID) 

3. Decrementing the device reference count (UCB$W_REFC) 

4. Calling the device driver's cancel I/O routine with the reason code 
CAN$C_CANCEL 

5. Returning the normal successful completion code to the user in R0
(SS$_NORMAL)




18.3 $QIO SYSTEM SERVICE 

The $QIO system service (in module SYSQIOREQ) allows a user to initiate 
an I/O operation by queuing a request to the device's associated driver. Once 
the I/O operation has been initiated, control will be returned to the user, who 
can synchronize I/O completion in one of three ways: 

• The process can enter an event flag wait state until the I/O request com- 
pletes, waiting for the specified event flag to be set. 

• The address of an AST routine that will be executed when the I/O com- 
pletes can be passed to $QIO. In this case, the process can continue execut- 
ing or wait, depending on the particular method of synchronization. 

• The I/O status block can be polled for a completion status. The status field 
in the IOSB is cleared by $QIO and set by the special kernel mode AST that 
completes an I/O request in process context. This last method is not rec- 
ommended. 

As an alternative to $QIO, the $QIOW system service may be used, which is 
equivalent to the $QIO system service followed by a $WAITFR system serv- 
ice. Using the $QIOW system service guarantees that the I/O operation will 
complete before control is transferred back to the user. 
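
For illustration, the following user-mode fragment assigns a channel and
issues a synchronous write with $QIOW. It uses the C bindings of the system
services; the 1984-era examples would have been written in VAX MACRO, so
treat the header names and the IOSB type as modern conveniences.

    #include <descrip.h>
    #include <iodef.h>
    #include <iosbdef.h>
    #include <starlet.h>

    int main(void)
    {
        $DESCRIPTOR(term, "SYS$OUTPUT");
        static char msg[] = "hello from $QIOW\r\n";
        unsigned short chan;
        IOSB iosb;                       /* I/O status block           */
        int status;

        status = sys$assign(&term, &chan, 0, 0);
        if (!(status & 1)) return status;

        /* $QIOW = $QIO followed by $WAITFR on the same event flag:
         * control comes back only after the I/O has completed.      */
        status = sys$qiow(0, chan, IO$_WRITEVBLK, &iosb, 0, 0,
                          msg, sizeof msg - 1, 0, 0, 0, 0);
        if (status & 1)
            status = iosb.iosb$w_status; /* final status of the I/O   */

        sys$dassgn(chan);
        return status;
    }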



18.3.1 Device-Independent Preprocessing

EXE$QIO begins preprocessing an I/O request with the following steps: 

1. Clearing the specified event flag (or event flag number 0 if no event flag
was specified)

2. Validating the device-independent $QIO parameters (event flag number, 
channel number, I/O function code, and I/O status block) 

3. Verifying that the device is online (UCB$V_ONLINE in UCB$W_STS 
must be set) 

4. Clearing the I/O status block (if one was specified) 

An I/O request packet (IRP) is allocated from nonpaged pool. If possible, this 
allocation is done from a queue of preallocated IRPs (pointed to by 
IOC$GL_IRPFL). Otherwise, routine EXE$ALLOCIRP in MEMORYALC is 
called to allocate an IRP from the general nonpaged pool area. Obtaining an 
IRP from the preallocated queue takes less time than calling the allocation 
routine. 
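
The allocation policy can be sketched as follows. The real list is an
interlocked queue manipulated at elevated IPL with the interlocked queue
instructions, so this single-threaded C model shows only the fast-path/
slow-path choice, and the names are stand-ins.

    struct irp { struct irp *flink; /* ...device-independent fields... */ };

    extern struct irp *ioc_gl_irpfl;        /* models IOC$GL_IRPFL      */
    extern struct irp *exe_allocirp(void);  /* general-pool fallback    */

    struct irp *allocate_irp(void)
    {
        struct irp *irp = ioc_gl_irpfl;
        if (irp != NULL) {                  /* fast path: preallocated  */
            ioc_gl_irpfl = irp->flink;      /* unlink from lookaside    */
            return irp;
        }
        return exe_allocirp();              /* slow path: nonpaged pool */
    }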

The device-independent section of the IRP is initialized, including the fol- 
lowing fields: 

• The device-independent $QIO parameters 

• The process base priority (from PCB$B_PRIB) 




• The process ID 

• The device UCB address 

• The IRP$V_BUFIO flag in IRP$W_STS (which is set for a buffered I/O 
operation, and cleared for a direct I/O operation) 

The process's privileges are checked to guarantee that it may perform the 
requested I/O function. In the course of checking process privileges, 
EXE$QIO converts a read or write virtual I/O request function code into the
corresponding read or write logical function code (unless the virtual request
is for a file-oriented device, that is, one with DEV$V_FOD set in
UCB$L_DEVCHAR).

If an AST was requested, the AST quota (PCB$W_ASTCNT) is decre- 
mented, and the AST quota update flag (ACB$V_QUOTA) is set in 
IRP$B_RMOD. 

Control is then transferred to a function decision table (FDT) routine (by a 
JSB) in the selected device driver. This routine is responsible for interpreting 
the device-dependent $QIO parameters (P1 to P6). If the FDT routine returns
control to EXE$QIO (by issuing an RSB), EXE$QIO calls another FDT routine 
in the driver. Successive FDT routines are called until an FDT routine exits 
turning control over to a subroutine other than EXE$QIO (for example, 
EXE$QIODRVPKT, EXE$QIOACPPKT, or the user's routine). 



18.3.2 FDT Routines 

Function decision table (FDT) routines are device-specific extensions to 
$QIO. Their primary purpose is to validate the device-dependent $QIO pa- 
rameters (P1 to P6). A device driver can include customized FDT routines or
use some of the general purpose routines that are a part of the system image. 
Although some FDT routines are included in a driver image, they are logi- 
cally device-dependent extensions of the $QIO system service. 

FDT routines execute in the context of the process that issued the $QIO 
request. Therefore, they have access to data in the user's P0 and P1 address
space. FDT routines communicate information about the I/O request to the 
driver by passing information in the device-dependent section of the IRP. 
FDT routines for direct I/O (I/O done directly to a user buffer) ensure that 
each buffer page is valid and locked into memory. (Buffer pages are locked 
into memory by incrementing the reference count in the PFN database for 
each physical page involved in the transfer.) FDT routines for buffered I/O 
operations must allocate a buffer from nonpaged pool that will be used by the 
driver for the actual transfer. If the operation is a buffered write, the data that 
is being written is copied into this buffer. System space buffers are required 
because the driver processes the I/O request in system context and only has 
access to system virtual address space. FDT routines are described in detail in 
the VAX/VMS Guide to Writing a Device Driver. 




18.3.3 I/O Postprocessing 

After a device driver completes an I/O operation, it invokes the REQCOM 
macro. This macro jumps to the routine IOC$REQCOM, which places the 
IRP on the I/O postprocessing queue and requests a software interrupt at 
IPL$_IOPOST (IPL 4). The I/O postprocessing routine (IOC$IOPOST, in
IOCIOPOST) runs as a response to the software interrupt. It implements the 
device-independent facets of I/O completion, and handles paging I/O comple- 
tion as well (see Chapter 15). 

Some of the I/O postprocessing operations (for example, unlocking buffer 
pages, and deallocating buffers) are performed in the I/O postprocessing inter- 
rupt service routine (IOC$IOPOST), while other operations (such as writing 
the I/O status block and setting event flags) are performed by a special kernel 
mode AST routine (which executes in process context, and therefore has ac- 
cess to process address space). 

When an IRP is removed from the I/O postprocessing queue (with list head 
IOC$GL_PSFL), IOC$IOPOST first determines if the I/O operation was a 
buffered or direct request. 

18.3.3.1 Direct I/O Completion. Portions of a direct I/O request can be completed in 
the IPL 4 I/O postprocessing interrupt service routine without the benefit of 
process context. The following steps are performed in the interrupt service 
routine: 

1. The process direct I/O count in the software PCB (at offset 
PCB$W_DIOCNT) is incremented, indicating one less outstanding direct 
I/O request. 

2. The buffer pointed to by IRP$L_SVAPTE is unlocked, using the 
IRP$L_BCNT and IRP$W_BOFF fields to determine the size of the locked 
buffer. Buffer pages are unlocked by decrementing their associated refer- 
ence counts in the PFN database. This step may result in their being 
placed on the free or modified page list. 

3. The IRP$V_EXTEND bit in IRP$W_STS is checked. If that bit is set, 
it indicates an IRP extension (IRPE) is pointed to by IRP$L_EXTEND. 
The IRPE may contain up to two locked buffers (pointed to by 
IRPE$L_SVAPTE1 and IRPE$L_SVAPTE2, with sizes determined by 
IRPE$W_BOFF1 and IRPE$L_BCNT1, and IRPE$W_BOFF2 and
IRPE$L_BCNT2, respectively). These buffers, if present, are unlocked, 
and a check is made to see if the IRPE$V_EXTEND bit in IRPE$W_STS is 
set. If so, the same procedure is repeated, until the last IRPE in the linked 
list is found, and its buffers unlocked. 

4. The direct I/O special kernel mode AST (DIRPOST in IOCIOPOST) is 
queued to the process (using the IRP$L_PID field to identify the process to 
which the AST should be queued). The IRP is used as the AST control 
block for routine SCH$QAST (as described in Chapter 7).




The remainder of I/O completion for a direct I/O request takes place in proc- 
ess context in the special kernel AST called DIRPOST, as follows: 

1. The accumulated direct I/O count (stored in PHD$L_DIOCNT) is incre- 
mented. This count is an accounting statistic that is reported to the ac- 
counting manager (the job controller) when the process is deleted. 

2. The I/O in progress counter in the channel control block (CCB$W_IOC) is 
decremented. 

3. If this was the last I/O for the channel, and there is a deaccess request for 
the channel pending (CCB$L_DIRP does not equal zero), that deaccess 
request is queued to the ACP (so that a file can be properly closed or some 
similar operation performed), by calling routine IOC$WAKACP. 

4. If an I/O status block was requested by the user, it is written using the 
quadword starting at IRP$L_IOST1 (same offset as IRP$L_MEDIA).

5. If any IRP extensions (IRPEs) were used, they are deallocated. 

6. The event flag specified in the $QIO call is set (by calling routine 
SCH$POSTEF, whose operation is discussed in Chapter 12). 

7. If the user requested an AST for the $QIO call, the IRP is again used as an 
AST control block, and is queued to the user (the IRP will be deallocated 
by the normal AST processing scheme, as discussed in Chapter 7). 

8. If the user did not request an AST to be delivered upon the completion of 
the $QIO call, the IRP is deallocated. 

18.3.3.2 Buffered I/O Completion. The portions of buffered I/O completion that take 
place in the IPL 4 interrupt service routine differ from the direct I/O case 
because of the differences in the way the two kinds of requests are processed. 
The following steps are accomplished by the IPL 4 interrupt service routine: 

1. The process buffered I/O count (PCB$W_BIOCNT) is incremented, indi-
cating one less outstanding buffered I/O operation.

2. The byte count quota that was allocated for the system buffer is given 
back by adding IRP$W_BOFF to JIB$L_BYTCNT. 

3. If the I/O function was a read (bit IRP$V_FUNC in IRP$W_STS is set), the 
BUFPOST routine (in module IOCIOPOST) is used as the special kernel 
mode AST routine address. 

4. Otherwise, DIRPOST is used as the special kernel mode AST routine ad- 
dress, and the buffer used to hold the data written to the device, if any, is 
deallocated (the buffer's address is found in IRP$L_SVAPTE). 

The special kernel mode AST called BUFPOST is used for the case of a buf- 
fered read operation, because the data must be copied from the system buffer 
to the buffer specified in the original $QIO request. BUFPOST performs the 
following steps: 

1. After the data is copied, the system buffer is no longer needed so it is 
deallocated to nonpaged pool. 






2. The accumulated buffered I/O count accounting statistic (stored in 
PHD$L_BIOCNT) is incremented. 

The remaining steps that this routine must perform are identical to the oper- 
ations performed by DIRPOST. BUFPOST continues at step 2 in that routine. 
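
The extra work BUFPOST does for a buffered read can be sketched in a few
lines: copy the data from the nonpaged-pool buffer into the caller's buffer
(now addressable, since the special kernel mode AST runs in process
context), then free the system buffer. The names below are illustrative
stand-ins, not the routine's actual code.

    #include <stddef.h>
    #include <string.h>

    extern void exe_deanonpaged(void *);    /* models EXE$DEANONPAGED   */

    void bufpost_copy(void *user_buf, void *sys_buf, size_t nbytes)
    {
        memcpy(user_buf, sys_buf, nbytes);  /* requires process context */
        exe_deanonpaged(sys_buf);           /* buffer no longer needed  */
        /* ...then continue with step 2 of DIRPOST, as stated above. */
    }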



18.4 I/O CANCELLATION 

The $CANCEL system service cancels all I/O issued to a device from a speci- 
fied channel by scanning all of the IRPs queued to the device UCB (starting at 
UCB$L_IOQFL). Several conditions must hold for an I/O request to be can- 
celed. 

• The request cannot be a virtual request (indicated by the setting of the
IRP$V_VIRTUAL bit in IRP$W_STS). In general, I/O cannot be canceled
on disk or tape devices. Drivers for these devices ensure that the
IRP$V_VIRTUAL bit is set on all requests that cannot be canceled.

• The requesting process ID (PCB$L_PID) matches the stored process ID in 
IRP$L_PID. 

• The requested channel number in the CHAN argument to $CANCEL 
matches the stored channel number in IRP$W_CHAN. 

The I/O is canceled by taking the following steps: 

1. Clearing the buffered read bit (IRP$V_FUNC in IRP$W_STS) for buffered 
I/O functions (identified by IRP$V_BUFIO in IRP$W_STS) 

2. Placing the SS$_CANCEL status code in the low-order word of
IRP$L_IOST1 and clearing the high-order word

3. Placing the IRP in the I/O postprocessing queue, and requesting an I/O 
postprocessing software interrupt 

The driver cancel I/O routine is called to allow the driver to perform any 
desired cleanup operations, and to cancel the I/O request currently in prog- 
ress. 
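
A condensed model of the scan follows. Queue linkage details, IPL
synchronization, and the treatment of the request currently in progress are
omitted, and all names are stand-ins for the IRP fields cited above.

    struct irp {
        struct irp    *flink;     /* queue linkage                     */
        unsigned       pid;       /* IRP$L_PID                         */
        unsigned short chan;      /* IRP$W_CHAN                        */
        unsigned       virt : 1;  /* IRP$V_VIRTUAL in IRP$W_STS        */
    };

    extern void post_cancelled(struct irp *);  /* IOST1 <- SS$_CANCEL,
                                                  queue to IPL 4 post  */

    void cancel_io(struct irp *ioqfl, unsigned pid, unsigned short chan)
    {
        struct irp *irp;
        for (irp = ioqfl; irp != NULL; irp = irp->flink)
            if (!irp->virt && irp->pid == pid && irp->chan == chan)
                post_cancelled(irp);
    }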

If there is a file open on the channel, EXE$CANCEL allocates and initial- 
izes an IRP on behalf of the user (and charges the user's buffered I/O quota, 
PCB$W_BIOCNT, for an I/O request). The IRP is queued to the ACP for 
further processing (using routine EXE$QIOACPPKT in SYSQIOREQ). The 
IRP specifies a function code of IO$_ACPCONTROL and uses event flag 
number 31 to indicate I/O completion. 



18.5 MAILBOX CREATION AND DELETION 

Mailboxes are virtual devices used for interprocess communication. They are 
created by the $CREMBX system service. There are two kinds of mailboxes,
temporary and permanent. Temporary mailboxes are deleted automatically
when no more processes have channels assigned to them, while permanent 
mailboxes must be explicitly marked for deletion using the $DELMBX sys- 
tem service. (After being marked for deletion, permanent mailboxes are dele- 
ted when no more processes have channels assigned to them). 



18.5.1 Mailbox Creation 

The $CREMBX system service (located in module SYSMAILBX) creates a 
virtual mailbox device named MBn: and assigns an I/O channel to it. 

The routine EXE$CREMBX begins by translating the logical name speci- 
fied by the user in the LOGNAM parameter (if any), and finding a free chan- 
nel (CCB) to assign to the mailbox (using IOC$FFCHAN). It also verifies that 
the user has the appropriate privilege(s) for the type of mailbox being created: 

• PRMMBX for a permanent mailbox 

• TMPMBX for a temporary mailbox 

• SHMEM for a mailbox in shared memory 

If a logical name has been specified, EXE$CREMBX searches all existing 
mailbox UCBs to see if a mailbox with that name already exists. If a match is 
found and the caller has privilege to access the mailbox (or owns the mail- 
box), the reference count for that mailbox (UCB$W_REFC) is incremented, 
and a channel is assigned by taking the following steps: 

1. Placing the mailbox UCB address in CCB$L_UCB 

2. Placing the access mode at which the channel was assigned (plus one) in 
CCB$B_AMOD 

3. Returning the channel number to the user in the CHAN parameter 

4. Returning with an SS$_NORMAL completion status code 

If the mailbox being created did not previously exist and is a temporary mail- 
box, the process buffered I/O byte count quota (JIB$L_BYTCNT) is checked 
to determine if the process has enough quota to do the following:

• Support the creation of a mailbox UCB 

• Buffer messages (according to the value specified in the BUFQUO parame- 
ter to $CREMBX) 

• Allow for overhead (256 bytes) in case of process deletion 

If the BUFQUO parameter is not specified, the SYSBOOT parameter 
DEFMBXBUFQUO (stored at IOC$GW_MBXBFQUO) is used for the amount 
of space reserved to buffer messages. 

A logical name block is allocated, if required, which will contain the logi- 
cal name specified for the mailbox by the user in the $CREMBX call. Routine 
IOC$CREATE_UCB (in IOSUBPAGD) is called to actually create the mail-
box UCB. The routine allocates space for the UCB from nonpaged pool and
initializes fields in the UCB (using a template UCB found through MB$UCB0 
in DEVICEDAT). IOC$CREATE_UCB performs the following actions: 

1. The mailbox is marked online (the UCB$V_ONLINE bit is set in
UCB$W_STS).

2. The reference count (UCB$W_REFC) is set to 1. 

3. The UIC of the creating process (PCB$L_UIC) is established as the owner 
of the mailbox (by loading UCB$L_OWNUIC). 

4. The UCB is identified as being a shareable mailbox (the DEV$V_SHR and 
DEV$V_MBX bits are set in UCB$L_DEVCHAR). 

5. The UCB is linked into the mailbox controller's device list (with 
UCB$L_LINK). 

6. A unit number is assigned to the UCB (in UCB$W_UNIT). The number is
in the range of 1 to 65535; when all unit numbers in the range have been
used, the unit numbers start again at 1. Unit numbers that are still in use
are skipped (see the sketch after this list).

7. The mailbox controller's device count (CRB$W_REFC) is incremented. 
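
The wrap-around numbering of step 6 might look like this in outline;
in_use() stands in for the scan of the controller's UCB list, and the
function is a sketch rather than IOC$CREATE_UCB's actual code.

    extern int in_use(unsigned unit);    /* scan controller's UCB list */

    unsigned next_unit(unsigned last)    /* 'last' = previous number   */
    {
        unsigned unit = last;
        do
            unit = (unit == 65535) ? 1 : unit + 1;
        while (in_use(unit) && unit != last);
        return unit;                     /* always in 1..65535         */
    }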

After IOC$CREATE_UCB returns control, EXE$CREMBX performs the fol- 
lowing steps: 

1. It places the buffer quota calculated earlier in UCB$W_BUFQUO. 

2. It places the protection mask specified by the user in the PROMSK param- 
eter in UCB$W_VPROT. 

3. It clears the device owner process ID field (UCB$L_PID). 

4. It places the quota charge for the mailbox, the sum of UCB$W_BUFQUO
and UCB$W_SIZE, in UCB$W_CHARGE.

5. It places the maximum message size specified by the user in the
MAXMSG parameter in UCB$W_DEVBUFSIZ. (If MAXMSG was not
specified, the SYSBOOT parameter DEFMBXMXMSG, stored at
IOC$GW_MBXMXMSG, is used.)

If the mailbox being created is a permanent mailbox, the UCB$V_PRMMBX 
bit in UCB$W_DEVSTS is set. Three other steps are taken if the mailbox is 
a temporary mailbox: 

• The UCB$V_DELMBX bit in UCB$W_DEVSTS is set to mark the mail- 
box for deletion. It will be deleted when the last channel assigned to it is 
deassigned. 

• The process byte count limit (JIB$L_BYTLM) is reduced by
UCB$W_CHARGE.

• The process byte count quota (JIB$L_BYTCNT) is reduced by
UCB$W_CHARGE.






[Figure 18-1: Data Structures Associated with Mailbox Creation]



If a logical name was specified for the mailbox, a logical name is created using 
the logical name block allocated earlier. The association with the logical 
name is made through UCB$L_LOGADR. If no logical name was specified, 
UCB$L_LOGADR is cleared. Finally, a channel is assigned to the mailbox in 
the same way as if the mailbox had already existed. The relationships among 
the data structures associated with mailbox creation are pictured in Figure 
18-1. 



18.5.2 Mailbox Creation in Shared Memory 

Note that although the format of a shared memory mailbox UCB is some- 
what different from a local memory UCB, the general steps involved in the 
creation of the mailbox are the same. All of the logic is contained within the 
same module (SYSMAILBX). 

One extra level of data structure is required to describe a shared memory
mailbox. This structure, called a shared memory mailbox control block (Fig-
ure 18-2), is located in the shared memory. The UCBs on each port associated
with the shared memory mailbox contain the (processor-specific) virtual ad-
dress of the mailbox. There are three cases that the Create Mailbox system
service can encounter when creating a mailbox in shared memory.

[Figure 18-2: Contents of a Shared Memory Mailbox Control Block]

• If the shared memory mailbox control block (Figure 18-2) does not exist (if 
the mailbox does not already exist on this processor or another), it is cre- 
ated first. Then, the unit control block in local memory is created. A logi- 
cal name block is allocated because shared memory structures always have 
a name associated with them. Finally, a channel is assigned for the creat- 
ing process. 

• If the mailbox is being created on this processor for the first time (but 
already exists on another), a UCB is allocated and loaded with parameters 
that describe the mailbox. A bit is set in a mailbox-dependent field indicat- 
ing that this mailbox UCB describes a mailbox in shared memory. Finally, 
the address of the shared memory mailbox control block is loaded into the 
UCB. 




• If the mailbox already exists on this processor, the Create Mailbox system 
service simply assigns a channel to it. 

The data structures required to describe a shared memory mailbox are pic-
tured in Figure 18-3.

[Figure 18-3: Shared Memory Mailbox Creation]



18.5.3 Mailbox Deletion 

The $DELMBX system service (located in module SYSMAILBX) is used to 
mark a permanent mailbox for deletion. The mailbox is actually deleted by 
IOC$DELMBX (in IOSUBNPAG) when its reference count (UCB$W_REFC) 
goes to zero (after the last channel assigned to it has been deassigned, as 
described in Section 18.1.2). 

The mailbox to be marked for delete is identified by the CHAN argument 
in the $DELMBX call. The channel number is used to locate the CCB, from 
which the mailbox UCB address can be found (in CCB$L_UCB). 




The routine EXE$DELMBX verifies the following: 

1. The UCB is for a mailbox (that the DEV$V_MBX bit is set in 
UCB$L_DEVCHAR). 

2. The mailbox is a permanent mailbox (that the UCB$V_PRMMBX bit is 
set in UCB$W_DEVSTS). 

3. The process has PRMMBX privilege. 

If the above conditions are met, the mailbox is marked for deletion by setting 
the UCB$V_DELMBX bit in UCB$W_DEVSTS. 

The routine IOC$DELMBX actually deletes a mailbox, whether it was 
temporary or originally permanent by taking the following steps: 

1. Verifying that the device to be deleted is a mailbox (DEV$V_MBX is set in 
UCB$L_DEVCHAR), that the reference count (UCB$W_REFC) is zero, 
and that the mailbox has been marked for deletion (UCB$V_DELMBX is 
set in UCB$W_DEVSTS) 

2. Unlinking this UCB from the other mailbox UCBs (using the 
UCB$L_LINK field) for this mailbox controller (because the UCBs for a 
controller are linked together) 

3. Decrementing the controller's device reference count (CRB$W_REFC) 

4. Removing the logical name for the mailbox (if one was specified, indi-
cated by a nonzero value in UCB$L_LOGADR) from the logical name table

5. Deallocating the logical name block used for the mailbox 

If the mailbox was a temporary mailbox (UCB$V_PRMMBX clear in 
UCB$W_DEVSTS), the byte count limit (JIB$L_BYTLM) and the byte count
quota (JIB$L_BYTCNT) are updated (because the creation of a temporary 
mailbox required those resources). Any unprocessed messages that were 
queued to the mailbox (and are still stored in nonpaged pool) are deallocated 
(by calling EXE$DEANONPAGED in MEMORYALC). The UCB for the mail- 
box is deallocated (by calling EXE$DEANONPAGED). 

18.6 BROADCAST SYSTEM SERVICE 

The $BRDCST system service (EXE$BRDCST in SYSBRDCST) allows mes- 
sages to be sent to one or more terminals (even if an I/O operation is currently 
in progress on the terminal). 

After checking the buffer quota (to make sure enough quota is available to
buffer the message), a broadcast descriptor block (BRD) is allocated from
nonpaged pool and initialized. (See Figure 18-4 for the format of a BRD.)

[Figure 18-4: Layout of a Broadcast Descriptor Block]

If the message is to be sent to a single terminal, then EXE$BRDCST performs 
the following actions: 

1. Locates the UCB address for the terminal (specified by the DEVNAM pa- 
rameter) by calling IOC$SEARCHDEV 






2. Verifies that the process (or any parents of the process) either owns the 
terminal (UCB$L_PID equals PCB$L_PID) or has OPER privilege 

3. Verifies that the UCB is for a terminal (DEV$V_TRM set in UCB$L_ 
DEVCHAR), and that the terminal is online (UCB$V_ONLINE in 
UCB$W_STS) 

4. Places the BRD in a queue of BRDs to be broadcast 

5. Starts a broadcast 

If the message is to be sent to all terminals, EXE$BRDCST first checks for 
OPER privilege and then performs steps 3 to 5 above for each terminal UCB. 






Before the BRD is placed in the queue of BRDs (step 4) and if the terminal is
unowned (UCB$W_REFC is zero), EXE$BRDCST verifies that the termi-
nal is not set to AUTOBAUD (TT2$V_AUTOBAUD clear in 
UCB$L_TT_DEVDP2). The rationale behind this step is to make sure that
broadcast messages are not sent to terminals having an unknown baud rate 
(resulting in garbage on the screen). 
Starting a broadcast involves several steps: 

1. Mailbox-specific information is loaded into the mailbox portion of the 
BRD (BRD$W_TRMUNIT and BRD$T_TRMNAME). 

2. If the specified terminal has enabled broadcast to mailbox (bit 
TT2$V_BRDCSTMBX set in UCB$L_TT_DEVDP1), the broadcast mes- 
sage is written to the mailbox associated with the terminal (by calling 
routine EXE$WRTMAILBOX in module MBDRIVER). 

3. A write buffer packet that points to the BRD (see Figure 18-5) is allocated 
from nonpaged pool and initialized. 

4. The write buffer packet is passed to the terminal driver's alternate start 
I/O entry point (by calling routine EXE$ALTQUEPKT in SYSQIOREQ). 
This routine activates the driver regardless of whether or not an I/O re- 
quest is in progress for the device. 

5. The terminal driver then accepts the broadcast message, or indicates that 
the message cannot be broadcast (because, for example, the user issued a 
SET TERMINAL/NOBROADCAST or /PASSALL command). 

6. If the message is not accepted by the driver, the write buffer packet is 
deallocated. 



[Figure 18-5: Layout of a Write Buffer Packet]

7. If the message is accepted by the driver, the broadcast reference count is 
incremented (BRD$W_REFC). 

While the driver is writing the message to the specified terminal(s), the proc-
ess issuing the $BRDCST call is placed in an RSN$_BRKTHRU wait state.
As soon as BRD$W_REFC goes to zero, indicating all of the broadcast mes-
sages have been sent to the specified terminal(s), the process is removed from
the wait state, the BRD is deallocated, and the system service completes.
The write buffer packet is deallocated after the message is output to the 
terminals. 



18.7 INFORMATIONAL SERVICES 

Application programs frequently require information about particular de- 
vices on the system. The VMS operating system allows a user to obtain spe- 
cific information about a particular device using one of several system ser- 
vices ($QIO, $GETDVI, $GETDEV, and $GETCHN). The information 
obtained may be either common to all the devices on the system (device 
independent), or specific to a particular device type (device dependent). 

18.7.1 Device-Independent Information 

Device-independent information refers to information that is present for each 
device on the system (such as the device unit number, device characteristics, 
and the device type). It is obtained by reading fields in the UCB that have the 
same interpretation for all devices on the system. 

18.7.1.1 Get Device/Volume Information. The Get Device/Volume Information 
($GETDVI) system service (located in SYSGETDEV) is provided to obtain 
device-independent information about a device (see the VAX/VMS System 
Services Reference Manual for a listing of the fields that can be returned). 
Support still exists for the older services $GETCHN and $GETDEV for up- 
ward compatibility. In the development of VAX/VMS Version 3.0, it was de- 
termined that the functions of $GETCHN and $GETDEV could not be ex- 
tended without affecting users. $GETDVI was written to replace $GETCHN 
and $GETDEV, using the item list argument mechanism implemented in 
$GETJPI. In this way $GETDVI can be extended as much as necessary in the
future. 

Two sets of information, called the primary device characteristics and the 
secondary device characteristics, can be requested. These two sets of charac- 
teristics are identical unless one of the following conditions holds: 

• The device has an associated mailbox (nonzero entry in UCB$L_AMB), in 
which case the primary characteristics are those of the device, and the 
secondary characteristics are those of the associated mailbox. 






• The device is spooled (DEV$V_SPL is set in UCB$L_DEVCHAR), in 
which case the primary characteristics are those of the intermediate de- 
vice, and the secondary characteristics are those of the spooled device. 

• If the device represents a logical link in a network, the secondary charac- 
teristics contain information about the link. 

Before it can locate the desired device's UCB address, $GETDVI must first 
determine whether it was passed a channel number or a device name. Once 
the source is determined, $GETDVI locates the UCB address in the same way 
that the UCB is located by $GETCHN and $GETDEV. The item list of re- 
quested information is then processed serially. The item codes are used to 
index a table that determines the location of the desired information within 
the UCB. If the low bit in the word containing the item code is clear, the 
primary UCB is used; if the bit is set, the secondary UCB is used. When an 
item is successfully located, it is copied into the user's buffer for that item. 
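
In outline, the item-list loop behaves like the following sketch. The
item-list layout is simplified (the real $GETDVI item descriptor also
carries a buffer length and a return-length address), and the table-indexing
scheme is an assumption for illustration.

    #include <string.h>

    struct item  { unsigned short code; void *buf; };   /* simplified   */
    struct entry { unsigned short offset, size; };
    extern struct entry item_table[];    /* where each item lives in UCB */

    void getdvi_items(struct item *list, char *primary, char *secondary)
    {
        struct item *it;
        for (it = list; it->code != 0; it++) {
            /* low bit of the item code selects primary or secondary UCB */
            char *ucb = (it->code & 1) ? secondary : primary;
            struct entry *e = &item_table[it->code >> 1];
            memcpy(it->buf, ucb + e->offset, e->size);
        }
    }
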
The routines EXE$GETCHN and EXE$GETDEV differ only in how they 
initially find the desired device's UCB address. In the $GETCHN case, the 
CCB$L_UCB field for the CCB identified by the CHAN argument is used. In 
the $GETDEV case, routine IOC$SEARCHDEV is called to find the UCB 
address from the DEVNAM argument. Once the UCB address is found, the 
device-independent information is copied from the primary UCB to the user 
buffer (if a primary buffer was specified). After that, the device-dependent 
information is copied from the secondary UCB (located by UCB$L_AMB in 
the primary UCB, or, if that value is 0, the primary UCB is again used) into 
the user buffer (if a secondary buffer was specified). 



18.7.2 Device-Dependent Information 

Device-dependent information refers to information that is present for a par- 
ticular device type on the system, but not for every device on the system. (For 
example, a unit control block for a card reader indicates whether that card 
reader is translating cards according to the 026 keypunch code or the 029 
keypunch code.) 

Device-dependent information can be made available to a user process by 
placing that information into the high-order longword of the I/O status block 
for a $QIO request. The information is placed there by the driver (by placing 
that information in Rl before issuing the REQCOM macro to complete the 
I/O request), and can be anything the driver writer feels is appropriate for a 
particular $QIO function code. That is, the information placed there can take 
on different meanings for different function codes. 

Often, device drivers support special function codes that only return de- 
vice-dependent information in the high-order longword of the I/O status 
block and that do not initiate any device activity. The function codes most
frequently used in this way are IO$_SENSEMODE and IO$_SENSECHAR.
For example, the magtape driver responds to the IO$_SENSEMODE $QIO by 
returning the tape characteristics in the I/O status block. Corresponding 
IO$_SETMODE and IO$_SETCHAR function codes are also usually pro- 
vided so that the user can change the device mode or characteristics if the 
current ones are not acceptable. 

In addition, the $GETDVI system service can return two longwords of 
device-dependent information (UCB$L_DEVDEPEND and UCB$L_ 
DEVDEPND2), which can be used for different purposes by different devices. 
The VAX/VMS I/O User's Guide contains complete descriptions of how the 
information in that field should be interpreted for every supported device 
type. That manual also contains a detailed explanation of what information 
is returned by the IO$_SENSEMODE and IO$_SENSECHAR $QIOs for each 
device that supports those function codes. 






19 VAX/VMS Device Drivers



"Open the pod-bay doors, HAL." 

—Arthur C. Clarke, 2001: A Space Odyssey

A VAX/VMS device driver is a collection of tables and routines used to con- 
trol I/O operations on a peripheral device. The VAX/VMS Guide to Writing a 
Device Driver describes the general structure of a driver and introduces the 
system routines commonly called by device drivers. This chapter highlights 
various techniques used by selected system drivers and documents some of 
the device-specific processing performed by them. The intent is to present 
those techniques that are helpful in understanding the VAX/VMS I/O subsys- 
tem but are not described in the VAX/VMS Guide to Writing a Device Driver. 
No attempt is made to discuss each VAX/VMS device driver, nor is every 
feature of a particular driver described. For detailed descriptions of the fea- 
tures and capabilities provided by each supported device driver, see the VAX/ 
VMS I/O User's Guide. 

19.1 DISK DRIVERS 

Disks are random access mass storage devices placed either on the MASS- 
BUS, UNIBUS, UNIBUS through the UDA50, IDC (VAX-11/730 only), or CI
through the HSC50. The drivers written for these devices are designed to do 
the following: 

• Take advantage of the hardware error recovery and correction capabilities 
such as data checking, offset recovery, and error code correction (ECC) 

• Optimize controller operations by overlapping seek and data transfer oper- 
ations (although this is not true for all drivers) 

• Perform dynamic bad block handling (in conjunction with the ACP)

• Support online diagnostics and error logging 

• Support I/O requests at the logical and physical levels (non-DSA disks 
only), and cooperate with an ancillary control processor (ACP) to support 
virtual I/O requests 

The VAX/VMS I/O User's Guide contains a general discussion of some of the 
disk driver characteristics listed above. The following sections supplement 
the information presented there. 

19.1.1 ECC Error Recovery 

ECC (error correcting code) errors occur only on read operations (read data, 
read header and data, write check data, and write check header and data). 




They are corrected by applying a hardware-specified correction mask to the 
appropriate memory data. The transfer is then continued as if an error never 
occurred. Note that all RA-type disks have a different ECC scheme, which is 
implemented within their controllers (the UDA or the HSC). 
The actual error correction code consists of the following: 

• An 11-bit mask that must be XORed with the appropriate memory data

• A bit number within the sector that specifies the start of the error burst 

Disk drivers call routine IOC$APPLYECC (in module IOSUBRAMS) to actu- 
ally apply the ECC correction. IOC$APPLYECC requires the use of a system 
page table entry (SPTE). Device drivers that support ECC recovery specify the 
DPT$V_SVP flag in the flags argument to the DPTAB macro. When this flag 
is set, the SYSGEN command CONNECT allocates one SPTE for each unit 
and stores the system virtual page number in field UCB$L_SVPN in the unit 
control block. The system page table entry is used to double map a byte to be 
corrected. The driver must also specify the number of bytes that were trans- 
ferred into memory (up to, but not including, the block to be corrected). This 
number can be calculated by adding the remaining byte count (loaded by the 
driver from a MASSBUS adapter control register, MBA$L_BCR, into the unit 
control block, in field UCB$W_BCR) to the transfer byte count 
(UCB$W_BCNT). The following steps are performed to apply the correction: 

1. The transferred byte count is decremented and then ANDed with the
complement of 1FF (hex) to calculate the byte offset from the start of the
buffer to the block that contains the data to be corrected.

2. The starting bit number of the error burst (a number in the range from 1 to
4096) is decremented to convert it to a relative bit number, and the
result is separated into a byte offset within the block and a mask shift 
count. 

3. The byte offset within the block is added to the byte offset from the buffer 
calculated in step 1. The result is the byte offset within the buffer to the 
start of the error burst. 

4. The exclusive OR pattern mask is shifted left by the mask shift count 
calculated in step 2. 

At this point, the longword exclusive OR pattern and the byte offset 
within the buffer to the first byte to be corrected have been calculated. All 
that remains is to double map the data block to be corrected and XOR the 
pattern mask with memory. However, the following considerations must 
be accounted for. 

a. The transfer may have been satisfied part way through the last block, and 
the error correction is outside the data of interest. For example, suppose 
the byte count terminated after 20 bytes into the sector, and the correcta- 
ble data starts at byte 35. 

b. The transfer may have been satisfied part way through the last block, and 



the error correction is partly inside and partly outside the data of
interest. For example, the byte count terminated after 20 bytes into the 
sector, and the correctable data started at byte 19. 

Thus, the correction must be applied one byte at a time. Steps 5 through 7 
are repeated four times, if necessary. 

5. The offset to the next byte to be corrected is compared with the transfer 
byte count. If the offset byte count is greater than or equal to the transfer 
byte count, remaining corrections are outside the area of interest. Step 8 is 
executed next. 

6. The byte to be corrected is double mapped using the system virtual page 
number stored in UCB$L_SVPN, and the translation buffer is invalidated 
for that page. 

7. The next byte (lowest) of the longword pattern mask is XORed with the 
memory data, the offset in the buffer is incremented, and the pattern mask 
is right shifted 8 bits. If all four correction bytes have not been applied, 
steps 5, 6, and 7 are repeated. 

8. The transfer is continued by reexecuting the appropriate function after 
updating the current transfer parameters (byte count, disk address, and 
system virtual address of the next page table entry that maps the transfer). 
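The arithmetic of steps 1 through 4 might be sketched as follows. The register
assignments are illustrative, ERR_BITNO and ERR_MASK stand for the
hardware-supplied bit number and pattern mask, and the fragment is not the
actual IOC$APPLYECC code.

        MOVZWL  UCB$W_BCNT(R5),R0       ; Transfer byte count...
        ADDW    UCB$W_BCR(R5),R0        ; ...plus remaining count gives
                                        ;  bytes transferred into memory
        DECL    R0                      ; Step 1: back up to the last byte
        BICL    #^X1FF,R0               ;  and truncate to the block start
        MOVL    ERR_BITNO,R1            ; Step 2: starting bit number,
        DECL    R1                      ;  made zero-relative
        EXTZV   #3,#9,R1,R2             ;  byte offset within the block
        EXTZV   #0,#3,R1,R3             ;  mask shift count (low 3 bits)
        ADDL    R2,R0                   ; Step 3: offset within the buffer
        MOVZWL  ERR_MASK,R4             ; Step 4: 11-bit correction pattern,
        ASHL    R3,R4,R4                ;  shifted left into position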

19.1.2 Offset Recovery 

Offset recovery is a technique whereby the drive read heads are moved in 
small increments (usually 200 to 400 microinches) from the track centerline 
in an attempt to pick up a stronger reading signal. The technique is per- 
formed only for read operations such as read header and data, write check 
data, and write check header and data. This technique is not implemented for 
RA-type disks; instead, it is performed by their controllers (the UDA and the HSC).
Upon encountering an error that may be correctable using offset recovery, 
the following steps are taken by a disk driver: 

1. The read heads are returned to the centerline. 

2. Up to 16 attempts are made to read the data at the centerline. 

3. The heads are offset an increment, and 2 retries are performed at that 
offset. This procedure is repeated up to 6 times. 

4. If after 28 attempts (16 at the centerline, and 2 at each of 6 offset positions)
the data still cannot be retrieved, a failure is returned. 
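The retry schedule amounts to the following table of attempts; the table is
simply a restatement of the counts above, not a structure in any driver.

RETRY_TABLE:
        .BYTE   16                      ; Reads attempted at the centerline
        .BYTE   2,2,2,2,2,2             ; Two retries at each of six offsets
                                        ;  (28 attempts in all)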

19.1.3 Dynamic Bad Block Handling 

Dynamic bad block handling is implemented as a cooperative effort between 
driver FDT routines, I/O postprocessing routines, and ACPs. FDT routines 
for IO$_READVBLK and IO$_WRITEVBLK construct an I/O packet (IRP), 




and set the virtual bit in the IRP status word (IRP$V_VIRTUAL in
IRP$W_STS). The I/O postprocessing routines (in module IOCIOPOST) dis-
cover transfer errors on virtual I/O functions and route the IRP to the appro-
priate ACP.
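The FDT step amounts to a single bit operation. In the following hypothetical
fragment, R3 addresses the IRP, as is conventional in FDT routines.

        BISW    #IRP$M_VIRTUAL,IRP$W_STS(R3)    ; Transfer errors on this
                                                ;  request go to the ACP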

The ACP, using information in the IRP, calculates the bad block address 
and stores that information in [0,0]BADLOG.SYS. In addition, a bit is set in 
the file control block (FCB) and in the file's header. When the file is deleted, 
the ACP creates a process running the image BADBLOCK.EXE, which diag-
noses the file. If the bad block is found, the image uses privileged ACP func-
tions to mark the block as bad in the bad block file ([0,0]BADBLK.SYS;1).

Note that a bad block is not discovered until it is already part of a file and is 
not recorded in the bad block file until that file is deleted. When a bad block 
is discovered while writing a file, the bad block information is recorded; a bit 
is set in the FCB for the file, and an error indication is returned to the request- 
ing process. 

Bad block support is restricted to virtual I/O functions (that is, file I/O). 
Processes performing logical or physical I/O functions must provide their 
own bad block handling. 



19.1.4 Multiple-Block Noncontiguous Virtual I/O 

When a read or write virtual I/O function is processed by the $QIO system 
service (by routine EXE$QIO in module SYSQIOREQ), an attempt is made to 
perform the transfer without the intervention of an ACP. Conversion of vir- 
tual block numbers to logical block numbers is accomplished using mapping 
information contained in a data structure called a window control block 
(WCB) that was previously created by an ACP when the corresponding file 
was first accessed. If the WCB contains enough mapping information to con- 
vert the entire virtual range of the transfer into corresponding logical block 
numbers on the volume, then the virtual I/O transfer will be handled directly 
by the driver and I/O completion routines, even if the transfer consists of 
several noncontiguous pieces. If the WCB does not contain enough informa- 
tion to entirely map the virtual range of the transfer, the intervention of an 
ACP will be required at some time in order to complete the transfer. This 
intervention is known as a window turn. The number of window turns per 
unit of time can be displayed by the Monitor Utility with the DCL command 
MONITOR FCP. 

Because a deadlock situation could occur when a page mapped by the mem- 
ory management subsystem required a window turn, the memory manage- 
ment subsystem must avoid window turns. In order to do this, all files 
mapped by the memory management subsystem must have all their mapping 
information in the window control block. These large window control blocks 
are called cathedral windows. 




19.1.4.1 Mapping Information. The WCB is pointed to by the channel control block 
(CCB), which is established by the $ASSIGN system service (as described in 
Chapter 18). The WCB contains a base virtual block number and a variable 
number of map entries (controlled by the /WINDOWS=n qualifier to the
DCL command INITIALIZE, by the SYSBOOT parameter ACP_WINDOW
for disks mounted with the /SYSTEM qualifier, and by the FAB field RTV at 
file open time). The map entries form a subset of the file retrieval informa- 
tion for the file. Each map entry consists of an extent size and a starting 
logical block number. The map entries represent a virtually contiguous set of 
blocks that are not necessarily physically contiguous on the disk. 
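One map entry can be pictured as follows. The two components are those just
named; the field sizes shown are illustrative rather than the actual WCB
definitions.

MAP_ENTRY:
        .WORD   0                       ; Extent size (count of blocks)
        .LONG   0                       ; Starting logical block number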

When a virtual read or write request is specified, FDT routines initialize 
two fields in the IRP that will be used by the I/O postprocessing routines. The 
total byte count in the original request is stored in the original byte count 
field (IRP$L_OBCNT). The accumulated byte count field (IRP$L_ABCNT), a 
count of bytes actually transferred, is set to zero. 

Routine IOC$MAPVBLK is then called to convert the virtual range speci- 
fied in the transfer to a logical block range, using information in the WCB. 
There are three possible cases that can occur here: 

• The virtual range is logically contiguous and mapping information is con- 
tained in the window control block. 

• The window control block contains mapping information for the begin-
ning of the virtual range, but the virtual range is not logically contiguous.

• The mapping information that maps the first virtual block in the range to 
its logical counterpart is not in the WCB. 

19.1.4.2 No ACP Intervention. In either of the first two cases, IOC$MAPVBLK returns 
a nonzero number of bytes mapped and a starting logical block number. 
These are loaded into the IRP (at fields IRP$L_BCNT and IRP$L_MEDIA 
respectively), and the I/O request packet is queued to the driver. Further proc- 
essing of this request takes place in the I/O postprocessing routines. These 
routines (found in module IOCIOPOST) provide the additional processing 
necessary to effect the total transfer. They are responsible for accumulating 
the total number of bytes transferred and for propagating further processing
of the request, if necessary. 

Whenever the I/O postprocessing code encounters an I/O request packet 
(IRP) with the virtual bit set (IRP$V_VIRTUAL in IRP$W_STS), it updates
the accumulated byte count (stored in IRP$L_ABCNT) by adding the number 
of bytes just transferred (IRP$L_BCNT). This updated accumulated byte 
count is then compared with the original byte count (stored in 
IRP$L_OBCNT). If the two numbers agree, the request is completed exactly 
like other direct I/O requests (as described in Chapter 18). 
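The bookkeeping reduces to a comparison of two IRP fields, as in the following
sketch (R3 addresses the IRP; the label COMPLETE_REQUEST is hypothetical, and
this is not the actual IOCIOPOST code).

        ADDL    IRP$L_BCNT(R3),IRP$L_ABCNT(R3)  ; Accumulate bytes done
        CMPL    IRP$L_ABCNT(R3),IRP$L_OBCNT(R3) ; Entire request moved?
        BEQL    COMPLETE_REQUEST                ; Yes - finish like any
                                                ;  other direct I/O request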

In the second case, the remaining byte count is placed into IRP$L_BCNT, 




and the segment starting virtual block number (IRP$L_SEGVBN) is re- 
trieved. Routine IOC$MAPVBLK is again called to map the remaining virtual 
range. If the mapping is successful (a nonzero count of the number of bytes 
mapped is returned), the IRP$L_BCNT and IRP$L_MEDIA fields are up- 
dated, and the IRP is again queued to the driver. In this way, the virtual 
request continues until it completes or until a virtual range that cannot be 
mapped by information in the WCB is encountered. 

19.1.4.3 ACP Intervention. If routine IOC$MAPVBLK cannot convert a virtual range 
to its logical counterpart, the files ACP associated with the volume involved 
in the transfer must be called upon to obtain the required mapping informa- 
tion. Note that this failure can be detected by FDT routines at the beginning 
of the transfer or by the I/O postprocessing routines after the request has been 
partially satisfied. In either case, the IRP is placed into a work queue and the 
associated ACP is awakened. 

When the ACP processes this IRP, it reads the file header to obtain the 
mapping information necessary for the transfer in question. This information 
is stored in the WCB, perhaps replacing other mapping information already 
contained there. The ACP then updates the BCNT and MEDIA fields in the 
IRP in order to transfer the first piece of the remaining virtual range and 
queues the IRP to the driver to continue the transfer. When the I/O 
postprocessing routine receives this packet, it will usually find that the re- 
maining virtual range can be mapped, allowing the request to complete with- 
out further ACP intervention (even though several discrete transfers may still 
be required). The only time that more than one window turn occurs is when a 
file is so badly fragmented that it cannot be mapped by the number of re- 
trieval pointers established for this volume. 



19.2 MAGNETIC TAPE DRIVERS 

Magnetic tapes are sequential access mass storage devices placed either on 
the MASSBUS or the UNIBUS. In order to perform data transfer operations, 
the MASSBUS magnetic tape driver (in TMDRIVER or TFDRIVER) has to 
obtain ownership of both the TM03 or TM78 controller (primary channel) 
and the MASSBUS Adapter (secondary channel) by issuing the REQPCHAN 
and REQSCHAN macros, respectively. At times, the secondary channel may 
be released (using the RELSCHAN macro) so that other disks may use the 
MASSBUS. The VAX/VMS Guide to Writing a Device Driver contains infor- 
mation on how drivers are written for devices on the MASSBUS. 
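Reduced to its macro calls, the channel ownership protocol looks like the
following; the macros are those named above, and the placement of the calls is
illustrative.

        REQPCHAN                ; Acquire the TM03 or TM78 (primary channel)
        REQSCHAN                ; Acquire the MASSBUS adapter (secondary)
        ; ... perform the data transfer ...
        RELSCHAN                ; Release the MASSBUS for use by disks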

The VAX/VMS I/O User's Guide describes the features and capabilities 
provided by the magnetic tape drivers, and discusses the general error recov- 
ery and data check logic employed by them. The specific algorithm used to 
correct NRZI (non-return-to-zero-inverted) read errors is the following: 






1. If the error occurred while reading in the forward direction, the tape is 
backspaced, and the record is read again. 

2. If an error occurs while reading in the reverse direction (as the result of a 
read physical block reverse function), the following steps are taken: 

a. The record is read in the forward direction to set up the error correction 
in the hardware. 

b. The tape is backspaced over the record just read. 

c. The record is reread in the forward direction to apply the error correc- 
tion. 

d. The tape is backspaced over the record to position the tape properly 
(because the initial request was for a read in the reverse direction). 

A magnetic tape ACP is called from various driver FDT routines to perform 
functions like writing tape labels. 



19.3 CLASS AND PORT DRIVERS 

VAX/VMS Version 3.0 introduced a layered approach to device drivers and 
I/O. A number of drivers have been written (or rewritten) in two pieces: a 
class driver and a port driver. The reason for dividing the device drivers is to 
separate their functions into operations that depend on the protocol and hard- 
ware used to communicate with a device (the communications layer) and 
those operations that depend on the actual device (the function layer). The 
class and port strategy has been adopted by the terminal driver (see Section 
19.4) and by the SCA-type drivers. SCA-type drivers are class and port drivers 
written for devices that communicate using a DIGITAL standard architecture 
known as systems communication architecture (SCA). 

19.3.1 Implementation of SCA on the VAX/VMS Operating System 

SCA defines a communications layer and the external interface to that layer. 
Systems communication services (SCS) are a VMS-specific implementation 
of SCA. SCA port drivers implement SCS on specific port devices. In VAX/ 
VMS Version 3.0, SCA port drivers are provided for the CI (PADRIVER) and 
the UDA50 (PUDRIVER). SCA class drivers use SCS as a communications 
medium for some higher-level functions or protocols. The class drivers im- 
plement a function layer of the layered strategy and perform operations on a 
user-visible device without regard for the SCA communications medium 
used. 

Currently there are two protocols in the function layer that call SCS to 
communicate information: DECnet-VAX and mass storage control protocol 
(MSCP). DECnet-VAX uses SCS for communication over the CI; the 
CNDRIVER is the DECnet class driver. MSCP is a general mass storage
protocol intended to be sufficient to describe all types of disk operation.
MSCP is implemented by controllers for RA-type disks. The DUDRIVER is
the MSCP class driver.

The class and port drivers supported in VAX/VMS Version 3.0 are shown in
Table 19-1. Figure 19-1 shows a conceptual diagram of SCA.

Table 19-1  Names of SCA Class and Port Drivers

Type             Name        Application/Device

Class Drivers    CNDRIVER    DECnet on the CI
                 DUDRIVER    MSCP Disks

Port Drivers     PADRIVER    CI port device
                 PUDRIVER    UDA50 port device

The MSCP disk class driver (DUDRIVER) can use either the CI port driver
(PADRIVER) or the UDA50 port driver (PUDRIVER). The DECnet class
driver (CNDRIVER) uses the CI port driver (PADRIVER) exclusively.

[Figure 19-1  Conceptual Diagram of SCA. On the host, a process issues a
$QIO to a class driver; the class driver calls SCS, which in turn uses a
port driver to operate the port device. A communications mechanism
connects the host port device to the port device of the remote system,
where a port driver and SCS pass the request to a server and thence to
the remote application or device. It is possible for the remote device
to implement the port driver and server in hardware.]



19.3.2 I/O Processing 

When a user application performs I/O through a class and port driver, a chan- 
nel must be assigned to the class driver; $QIOs are issued to that channel. 
The following sequence illustrates how class and port drivers are used to 
communicate information from a process on a host system to a remote de- 
vice. The MSCP class driver is used as an example. 

1. The process on the host system issues a $QIO to a class driver. The $QIO 
initializes an IRP and passes it to the class driver. 

2. The class driver translates portions of the IRP to an MSCP request. The 
driver then builds an appropriate class driver request packet (CDRP). The 
CDRP contains information necessary for SCS to perform its operations 
(see Figure 19-2). As a convenience to the $QIO/class driver interface, 
CDRPs have been designed to be an extension of an IRP. 

3. The class driver then calls SCS to transmit the MSCP request to the MSCP 
server (UDA50 or HSC50). 

4. The SCS operations are interpreted by the port driver, which then commu- 
nicates the I/O request to a remote port driver through the communica- 
tions mechanism. 

5. The remote port driver communicates the request to the MSCP server 
using SCS operations. 

6. The server acts on the MSCP request and passes the I/O request to the 
remote application or device. 



19.4 TERMINAL DRIVER 

The terminal I/O subsystem is a collection of routines (in separate modules) 
that provide a flexible approach to terminal input and output (as described in 
the VAX/VMS I/O User's Guide). The terminal driver was rewritten in VAX/ 
VMS Version 3.0 using the class and port driver strategy. Note that the termi- 
nal class and port drivers do not communicate using the SCS protocol, nor do 
the terminal port devices conform to the SCA standards. The terminal class 
driver (TTDRIVER.EXE) contains FDT routines and device-independent rou- 
tines. The port drivers (DZDRIVER.EXE, YCDRIVER.EXE, and the routine 
CONINTDSP in SYS.EXE) contain interrupt service routines and controller- 
specific control subroutines for DZ-11, DZ-32, DMF-32, and the console ter- 
minal interface. 

The logical components of the terminal I/O subsystem are illustrated in 
Figure 19-3. (The console interface is discussed in Section 19.6.) 



[Figure 19-2  Portions of a Class Driver Request Packet. The IRP lies at
negative offsets from the CDRP. The CDRP proper contains the fork queue
FLINK and BLINK, FIPL, type, and CDRP size fields, the fork PC, fork R3,
and fork R4, a saved return address, the address of the allocated MSCP
buffer, the allocated request ID, and the address of the connection
descriptor table. A block transfer extension holds an RWAITCNT pointer,
the local buffer handle address and local byte offset, the remote buffer
handle address and remote byte offset, the transfer length (in bytes),
and a 12-byte local buffer handle. A class driver extension records the
UNIBUS mapping resources allocated. Either of the extensions may be
used.]



The class and port driver images are separate, loadable images. Therefore, 
changes can be made to the driver modules, and those modules can then be 
assembled and linked independently of the executive. The following steps are 
taken in assembling and linking the terminal driver. 

• First the library for the terminal driver is created: 

$ LIBRARY/CREATE/MACRO SYS$SYSTEM:TTYLIB SYS$SYSTEM:TTYUCBDEF.MAR

• Next, the modules in the terminal driver are assembled: 

$ MACRO/LIST=SYS$SYSTEM:'module'/OBJECT=SYS$SYSTEM:'module' -
        SYS$SYSTEM:'module'+-
        SYS$LIBRARY:LIB/LIBRARY



This is done for each of the following modules: 






$! TTYCHARI 
$! TTYCHARO
$! TTYDRVDAT 
$! TTYFDT 
$! TTYSTRSTP 
$! TTYSUB 
$! DZDRIVER 
$! YCDRIVER 



• Finally, the object modules are linked into the terminal class driver 
(TTDRIVER) and the terminal port drivers (DZDRIVER and YCDRIVER). 

$! In the link phase the file OPTIONS.OPT contains the single
$! line:
$! BASE = 0
$!
$! Link the terminal class driver (TTDRIVER).

$ LINK/SHARE=SYS$SYSTEM:TTDRIVER/CONTIGUOUS-
        /MAP=SYS$SYSTEM:TTDRIVER/FULL/CROSS-
        SYS$SYSTEM:TTYDRVDAT,-
        TTYFDT,-
        TTYSTRSTP,-
        TTYCHARI,-
        TTYCHARO,-
        TTYSUB,-
        SYS$SYSTEM:SYS.STB/SELECTIVE_SEARCH,-
        SYS$SYSTEM:OPTIONS/OPTIONS
$! 

$! Link port drivers. Done for DZDRIVER and YCDRIVER. 
$! 
$ LINK/SHARE=SYS$SYSTEM:'driver'/CONTIGUOUS-
        /MAP=SYS$SYSTEM:'driver'/FULL/CROSS-
        SYS$SYSTEM:'driver',-
        SYS$SYSTEM:SYS.STB/SELECTIVE_SEARCH,-
        SYS$SYSTEM:OPTIONS/OPTIONS

When the system is bootstrapped, the module SYSBOOT reads the terminal 
class driver (TTDRIVER.EXE) image into nonpaged pool. INIT later creates 
the necessary linkages between the class and port drivers by first linking the 
console port driver with the terminal class driver. The device-specific exten- 
sion of a terminal UCB contains cells intended to contain pointers to the 
class and port vector dispatch tables. INIT locates the address of the dispatch 
tables for the terminal class driver and console port driver and loads these 
addresses into the console UCB. Later in system initialization, the SYSGEN 
command AUTOCONFIGURE determines the terminal controllers used by 
the system and loads the appropriate driver (DZDRIVER for DZ-11 and 
DZ-32 controllers, YCDRIVER for DMF-32 asynchronous lines). The control- 
ler and unit initialization routines of these port drivers initialize the UCB 
extensions. 

The relationships among the terminal class driver, console port driver, and 
the console UCB are shown in Figure 19-4. 



[Figure 19-3  Terminal I/O System. A user's $QIO request enters the
terminal class driver, TTDRIVER.EXE, which contains the FDT and
device-independent routines and is linked against TTYLIB. Beneath the
class driver sit the port drivers, each containing device-dependent
control subroutines and interrupt service routines: DZDRIVER.EXE handles
terminal interrupts from DZ-11 and DZ-32 controllers, YCDRIVER.EXE
handles DMF-32 asynchronous lines, and module CONINTDSP (the console
port driver) handles the console interface.]



The fact that the terminal class driver is loaded by SYSBOOT has implica-
tions for anyone who writes a new terminal class driver. It is a good
idea to maintain a good copy of TTDRIVER in SYS$SYSTEM with a different 
name. In the event that the modified terminal driver contains errors that 
prevent the system from completing its initialization sequence, the SYS- 
BOOT parameter TTY_CLASSNAME can be set during a conversational 
bootstrap to contain the name of the good TTDRIVER. 

Normally, the only module that will need to be altered (or replaced) is the 
terminal port driver, in order to provide the device-dependent processing for a 
specific device (such as a DL11). 

To test a new terminal class driver on a system that has already autocon- 
figured the terminal devices, the system must be rebooted. A reboot is also 
necessary to use a new terminal port driver (for example, on autoconfigured 
DZ-11s), because the SYSGEN command RELOAD will not reload terminal
class or port drivers. 



[Figure 19-4  Terminal Driver Initialization. The terminal class driver
contains a null DPT, a DDT, and its class vector dispatch table, which
is located by the global cell TTY$GL_DPT. The console port driver
contains a DPT, a null DDT, and its port vector dispatch table. The
console UCB locates the DDT through UCB$L_DDT, and the class and port
vector tables through UCB$L_TT_CLASS and UCB$L_TT_PORT.]



19.4.1 Full Duplex Operation 

The terminal driver implements full duplex operation (unless specifically 
asked to operate in half duplex mode for a particular terminal) by utilizing an 
alternate start I/O entry point (specified as the ALTSTART parameter to the 
DDTAB macro). Whenever a write request is issued to a full duplex terminal, 
the write FDT routine (TTY$FDTWRITE in TTYFDT) allocates and initial- 
izes a write buffer packet to describe the write request, and calls routine 
EXE$ALTQUEPKT (in SYSQIOREQ) to enter the alternate start I/O routine 
of the driver. In the half duplex case, routine EXE$QIODRVPKT, also in 
SYSQIOREQ, is called. 

Normally, FDT routines call on EXE$QIODRVPKT to invoke the start I/O 
routine of the driver, if the unit is not busy, or to queue the IRP to the UCB if 
the unit is busy. EXE$ALTQUEPKT differs from EXE$QIODRVPKT in the 
following respects: 

1. No check is made to see if the UCB is busy (UCB$V_BSY set in 
UCB$W_STS). Therefore, EXE$ALTQUEPKT never queues the request to 




the UCB. It is desirable not to check the UCB busy bit because a read 
request may be in progress; if the IRP waited on the UCB queue until the 
read request finished (and the busy bit was cleared), full duplex operation 
would not be possible. 

2. The cancel and timeout bits in the UCB (UCB$V_CANCEL and
UCB$V_TIMOUT in UCB$W_STS) are unaffected (not cleared) because 
they may be in use by the current IRP, if the UCB is busy. 

3. The SVAPTE, BCNT, and BOFF fields are not copied from the IRP to the 
UCB because this would affect the current I/O operation if the UCB is 
busy. 

4. The alternate start I/O routine in the driver is entered (rather than the 
regular start I/O routine). 

TTY$WRTSTARTIO (in TTYSTRSTP) is the alternate start I/O routine entry 
point. This entry point is also used by the broadcast system service, as de- 
scribed in Chapter 18. This routine raises IPL to device IPL to block device 
interrupts from the current I/O operation, in case the device is busy, and 
processes the packet as follows: 

1. If a write is currently in progress, the write buffer packet is queued. 

2. If a read is occurring, but the buffer header specifies write breakthrough, 
the write is started. 

3. If a read is occurring, but no read data has echoed yet, the write is started. 

4. Otherwise, the write buffer is queued. 

In order to complete write I/O requests for full duplex operation, the driver 
exits by calling routine COM$POST (in COMDRVSUB) rather than issuing 
the REQCOM macro. COM$POST places the I/O request packet in the 
postprocessing queue, requests an IPL$_IOPOST software interrupt (see 
Chapter 6), and returns. Routine IOC$REQCOM is avoided so that the next 
IRP queued to the UCB (which must be a read request) is not initiated (be- 
cause the current read request, if any, has not yet terminated). Also, the sta- 
tus of the UCB busy bit is unaltered by COM$POST. However, all read re- 
quests (and half duplex writes) are terminated by invoking the REQCOM 
macro, so that the next request of this type may be processed in the normal 
fashion. 

In full duplex operation, the device can be expecting more than one inter- 
rupt at a time (one for a read request, and one for a write request). Therefore, 
two fork PCs must be stored. (Usually drivers only expect one interrupt at a 
time, and store the fork PC in UCB$L_FPC.) The terminal driver stores more 
than one fork PC by altering the value of R5 (which normally points to the 
UCB), to point to the write buffer packet or the IRP before forking (by invok- 
ing the FORK macro). A fork block is therefore formed in the write buffer 
packet or in the IRP (containing R3, R4, and the fork PC). The fork block in 






the UCB is not used for read or write requests, although it is used at other 
times, such as when allocating a type-ahead buffer or when handling unsolic- 
ited data. 
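The R5-switching technique can be sketched as follows. WRITE_BUF_PKT is a
hypothetical cell holding the address of a write buffer packet; the actual
terminal driver code differs in detail.

        MOVL    WRITE_BUF_PKT,R5        ; Fork context will live in the
                                        ;  packet rather than in the UCB
        FORK                            ; Saves R3, R4, and the fork PC in
                                        ;  the fork block addressed by R5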

The technique of altering R5 before forking can easily be extended by any 
driver to allow more than one outstanding interrupt for a particular device, 
provided the driver can distinguish which interrupt is associated with which 
fork block. Therefore, any number of outstanding I/O requests may be han- 
dled by a driver entered at the alternate start I/O entry point. Of course, the 
driver must maintain queues for outstanding I/O requests and synchronize 
I/O operations. The driver should operate almost exclusively at device IPL (as 
the terminal port drivers do), to block out device interrupts in order to 
achieve synchronization with multiple I/O request processing. 



19.4.2 Channels and Terminal Controllers 

VMS terminal controllers have no controller channel concept. Therefore, the 
terminal driver never requests or releases a controller channel (with the 
REQCHAN and RELCHAN macros). The locations normally used in the 
CRB as list heads for the controller channel wait queue (CRB$L_WQFL and 
CRB$L_WQBL) are instead used to contain modem control status informa- 
tion. 



19.4.3 Type-Ahead Buffer

A type-ahead buffer is allocated from nonpaged pool for each terminal. The 
size of the type-ahead buffer is determined by the SYSBOOT parameter 
TTY_TYPAHDSZ. Every character typed is placed into the buffer, even if a 
read request is active. If the buffer is within 8 characters (or the value of the 
SYSBOOT parameter TTY_ALTALARM) of being full and the terminal is in 
host-sync mode, the driver sends an XOFF character to the terminal to tell it 
to stop sending data. An XON character is not sent to the terminal to tell it 
to start sending data until the buffer is emptied. Using this technique pre- 
vents characters from being lost in block I/O transmissions from high-speed 
terminals. 
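The high-water check reduces to simple arithmetic on the type-ahead counts. In
the following sketch the UCB field names are hypothetical; only the policy is
taken from the description above.

        SUBW3   UCB$W_TAH_CNT(R5),-     ; Free space remaining in the
                UCB$W_TAH_SIZ(R5),R0    ;  type-ahead buffer
        CMPW    R0,#8                   ; Within TTY_ALTALARM (default 8)
        BGTR    10$                     ;  of full? If not, nothing to do
        ; ... send XOFF if the terminal is in host-sync mode ...
10$: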



19.5 PSEUDO DEVICE DRIVERS 

The VMS operating system supports drivers for virtual devices (pseudo de- 
vices), including the null device (NL:), the network device (NET:), remote 
terminal devices (RT:), and mailboxes (MB:). Users can assign channels to 
these devices and issue I/O requests, just as though they were real devices. 
The following sections highlight some of the features of these pseudo device 
drivers. 




19.5.1 Null Device Driver 

The null device driver (in NLDRIVER) is assembled and linked with the sys- 
tem image (SYS.EXE). It is a very simple driver, consisting of two FDT rou- 
tines (one to complete read requests, and one to complete write requests). 
The FDT routines in the null driver respond to read requests by returning an 
SS$_ENDOFFILE status code to the user, and they respond to write requests 
by returning an SS$_NORMAL status code. No data is transferred, nor are 
any privilege or quota checks made. 
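An FDT routine of this style requires only a few instructions, as in the
following sketch; the entry label is hypothetical, and the fragment is not the
actual NLDRIVER source.

NL_READ:
        MOVZWL  #SS$_ENDOFFILE,R0       ; Status for the IOSB
        JMP     G^EXE$FINISHIOC         ; Complete the request without
                                        ;  transferring any data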

19.5.2 Network Device Driver 

The network device (NET:) is best viewed as a mechanism for DECnet-VAX 
users to access network functions. When a process assigns a channel to NET, 
a network UCB is created and given a unique number, such as NET100. The 
channel number returned to the user points to the newly created UCB. This 
channel can then be used to perform access, control, and I/O operations on 
the network. When the user deassigns the last channel to the network UCB, 
the UCB is deleted. 

The network device driver and the communication drivers support two I/O 
request interfaces: $QIOs and "internal" IRPs. 

• When a user issues a $QIO, the executive and the driver's FDT routines 
cooperate to build an IRP. The driver then processes the IRP (normally by
passing it to its own STARTIO routine). 

• So-called internal IRPs are built by kernel mode modules (device drivers) 
and passed to another driver's alternate start I/O interface. 

The remote terminal driver (RTTDRIVER) uses NETDRIVER's internal 
IRP interface in communication across the network. 

NETDRIVER uses the internal IRP interface to pass I/O requests to com- 
munication device drivers. 

There are actually two images that are used for network communication: 
the network device driver (NETDRIVER) and the network ACP (NETACP). 
NETDRIVER creates links to other CPUs, performs routing and switching 
functions, breaks user messages into manageable pieces on transmission, and 
reassembles the messages on reception. The actual I/O in network communi- 
cation is performed by the communication device driver (for example, 
XMDRIVER performs network communication through DMC-11s).

NETACP performs the following tasks: 

• Creates processes to accept inbound connects 

• Parses network control blocks and supplies defaults when a user issues an 
IO$_ACCESS function code to create a logical link

• Transmits and receives routing messages to maintain a picture of the net- 
work 

• Maintains the volatile network database 




Figure 19-5 illustrates some network I/O functions. For more information
on DECnet, see the DECnet-VAX User's Guide and the DECnet-VAX System
Manager's Guide.

[Figure 19-5  Processing Network I/O Requests. Process A assigns a
channel to NET: and issues $QIOs on that channel to NETDRIVER; a
"remote" process assigns a channel to RT: and issues $QIOs to
RTTDRIVER, which creates the remote process and exchanges "internal"
IRPs with NETDRIVER. NETDRIVER performs routing and switching
functions, maintains logical links, and packs and unpacks information
from "internal" IRPs; NETACP maintains a "picture" of the network and
parses and supplies defaults for IO$_ACCESS functions. NETDRIVER passes
requests by $QIO to the communication device driver, which performs the
device-specific functions for the communications device.]

19.5.3 Remote Terminals 

DECnet-VAX allows users to log in on a remote VAX/VMS processor and 
perform operations on that remote processor, just as they would at the local 
processor. The communication from the remote process to the controlling 
terminal is performed through a pseudo device on the remote processor called 
a remote terminal. The driver for remote terminals is RTTDRIVER.EXE. 
(Note that while DECnet-VAX can communicate with other DIGITAL oper- 
ating systems running DECnet, the focus of this discussion is on DECnet 
communication between two VAX-11 processors running the VAX/VMS op-
erating system.)

In addition to DECnet, three images are required to support remote termi- 
nals: the local processor uses the image RTPAD.EXE; the remote processor 
uses the images REMACP.EXE and RTTDRIVER.EXE. 

When a user on a local system issues the DCL command SET HOST, 
RTPAD uses DECnet-VAX to request a connection to a network object on the 
specified node. On remote processors running the VAX/VMS operating sys- 
tem, the object is REMACP. The image REMACP creates a UCB for the re- 
mote terminal and links the UCB into the driver tables by calling 
RTTDRIVER at its unsolicited input entry point. REMACP then returns in- 
formation about the remote processor to RTPAD. RTPAD has routines for 
communicating with a number of different DIGITAL operating systems (in-
cluding RSTS, RSX-11M, TOPS-20, and VAX/VMS). The information re- 
turned from REMACP is used to determine which operating system is com- 
municating with the local processor. In the VAX/VMS operating system, 
RTPAD sends unsolicited data to RTTDRIVER; sending this data to 
RTTDRIVER is equivalent to pressing the RETURN key on a terminal that is 
not logged in. RTTDRIVER creates a detached process running LOGINOUT. 
The user is now logged in to the remote system. 

In communicating information across the network, RTTDRIVER receives 
$QIOs from the remote process, packs the information into a block, and uses 
the "internal" IRP interface to pass the request to NETDRIVER. RTPAD 
unpacks the information and reissues the $QIO for the local terminal. If the 
$QIO is a read, RTPAD packs the input information into a block and passes 
the packet(s) of information back to RTTDRIVER. 

When the user logs off from the remote system, REMACP deletes the re- 
mote terminal UCB. 

19.5.4 Mailbox Driver 

Mailboxes are software-implemented devices that can be read and written to. 
Normally, mailboxes are used for communication between processes. Al-
though mailboxes transfer information in much the same way that other I/O
devices do, they are not actual devices. The following sections describe how 
the mailbox driver (in MBDRIVER, a module in the system image) buffers 
messages written to mailboxes and serializes mailbox read requests. Note 
that mailboxes in shared memory are supported by a separate, loadable 
driver, MBXDRIVER. 

19.5.4.1 Processing Set Mode Requests. A process may request notification of a mail- 
box read or write request by issuing a $QIO request with an IO$_SETMODE 
function code (and an IO$_READATTN or IO$_WRTATTN function code 
modifier). See the VAX/VMS I/O User's Guide for details. The mailbox driv- 
er's FDT routines respond to these requests by taking the following steps: 

1. Verifying that the process may access the mailbox. 

2. Queuing the request to the appropriate list head (UCB$L_MB_W_AST for
write requests, or UCB$L_MB_R_AST for read requests) by calling on 
routine COM$SETATTNAST in COMDRVSUB (which allocates, initial- 
izes, and queues an AST control block to the specified list head, as de- 
scribed in Chapter 7). 

3. Raising IPL to IPL$_MAILBOX (IPL 11) and checking to see if the notifica-
tion condition requested is present (current read or write request outstand- 
ing). If so, routine COM$DELATTNAST in COMDRVSUB is called to 
queue the attention AST to the requesting process (see Chapter 7). Other- 
wise, the attention AST request remains queued to the mailbox UCB, but 
the I/O request is completed by calling EXE$FINISHIOC. The attention 
AST will be queued to the process when a read or write request, as appro- 
priate, is issued for the mailbox. 

Note that mailboxes use fork IPL$_MAILBOX (IPL 11, the highest fork
IPL), to avoid possible synchronization problems with other drivers that 
reference mailboxes while at their respective fork IPLs (for example, to 
send a "device is off line" message to the operator's mailbox). 

19.5.4.2 Processing a Mailbox Read Request. When a user issues a read mailbox $QIO, 
the mailbox driver FDT routines perform the following general functions: 

1.