*-----------------------------------------------------------
Title[BitBlt.mc...November 7, 1982  4:06 PM...Taft];
* Bit-boundary block-transfer (PrincOps, chapter 8)
*-----------------------------------------------------------

* Refer to the PrincOps for primary documentation.
* All numbers are octal except timings, which are decimal.

* Data structures:

* BitBltArg: TYPE = MACHINE DEPENDENT RECORD [
MSC[BB.dstLo, 0];		* dst: BitAddress,
MSC[BB.dstHi, 1];
MSC[BB.dstBit, 2];
MSC[BB.dstBpl, 3];		* dstBpl: INTEGER,
MSC[BB.srcLo, 4];		* src: BitAddress,
MSC[BB.srcHi, 5];
MSC[BB.srcBit, 6];
MSC[BB.srcBpl, 7];		* srcBpl: INTEGER,
MSC[BB.width, 10];		* width: CARDINAL,
MSC[BB.height, 11];		* height: CARDINAL,
MSC[BB.flags, 12];		* flags: BitBltFlags ];

* BitAddress: TYPE = MACHINE DEPENDENT RECORD [
*	word: LONG POINTER,
*	reserved: [0..7777B],
*	bit: [0..17B] ];

* BitBltFlags: TYPE = MACHINE DEPENDENT RECORD [
MC[BBF.direction, 100000];	* direction: {forward, backward},
MC[BBF.disjoint, 40000];	* disjoint: BOOLEAN,
MC[BBF.disjointItems, 20000];	* disjointItems: BOOLEAN,
MC[BBF.gray, 10000];		* gray: BOOLEAN,
MC[BBF.srcFunc, 4000];		* srcFunc: {normal, complement},
MC[BBF.dstFunc, 3000];		* dstFunc: {null, and, or, xor},
				* reserved: [0..777B] ← 0 ];

* -- re-interpretation of srcBpl if gray = TRUE
* GrayParm: TYPE = MACHINE DEPENDENT RECORD [
*	reserved: [0..17B] ← 0,
*	yOffset: [0..17B],
*	widthMinusOne: [0..17B] ← 0,	-- present spec requires 0
*	heightMinusOne: [0..17B] ];

%
The 20 BitBlt Functions (combinations of gray, srcFunc, and dstFunc) are
divided into 4 classes:
	A	dest ← f(gray)
	B	dest ← f(gray, dest)
	C	dest ← f(source)
	D	dest ← f(source, dest)

The distribution of functions into classes is:
    Function (class)	gray	srcFunc		dstFunc
	 0 (C)		FALSE	normal		null
	 1 (D)		FALSE	normal		and
	 2 (D)		FALSE	normal		or
	 3 (D)		FALSE	normal		xor
	 4 (C)		FALSE	complement	null
	 5 (D)		FALSE	complement	and
	 6 (D)		FALSE	complement	or
	 7 (D)		FALSE	complement	xor
	10 (A)		TRUE	normal		null
	11 (B)		TRUE	normal		and
	12 (B)		TRUE	normal		or
	13 (B)		TRUE	normal		xor
	14 (A)		TRUE	complement	null
	15 (B)		TRUE	complement	and
	16 (B)		TRUE	complement	or
	17 (B)		TRUE	complement	xor


Terminology: when referring to words in a scan line, "left" and "right"
refer to the words with lower and higher addresses, respectively, independent
of the direction of processing the scan line;  "first" and "last" refer
to the first and last words encountered in the direction of processing.

Since the Dorado can do BitBlt in either direction at equal cost,
this implementation always uses the direction specified by the programmer.
The disjoint bit is ignored, and disjointItems is used only to determine
whether or not "touching" of the data is required in some cases (see below).

The destination item is considered to consist of a left partial word,
some number of full (body) words, and a right partial word.  If the item
begins or ends on a word boundary, the left or right word is still
considered to be a partial word.

The destination bits preserved in the left and right partial words
are determined by the SHC register's LMask and RMask, respectively,
where LMask = dst.bit, RMask = 17-((dst.bit+width-1) mod 20).

The destination width in words, including the first and last partial words,
is computed by DWidth = (width + dst.bit + 17)/20.  Similarly,
the source width in words, including the first and last partial words,
is computed by SWidth = (width + src.bit + 17)/20.
DWidth and SWidth differ by at most +/-1.

If DWidth > 2 then the destination consists of a left partial word, DWidth-2
full body words, and a right partial word.
If DWidth = 2 then the item consists of a left partial word immediately
followed by a right partial word.
If DWidth = 1 then the entire destination item lies within a single word,
not crossing a word boundary.  Effectively, the left and right partial words
are one and the same.  This case (called the "thin" case) requires
special handling, as both LMask and RMask must be applied simultaneously.

Processing of the scan line is controlled by two counters: the Cnt register
(loaded from ICnt), which counts the inner loop (20 words or fewer), and
MCount, which counts the outer loop.  Roughly speaking,
Cnt is loaded initially with DWidth mod 20 and MCount with DWidth/20.
The inner loop is executed Cnt times; then, until MCount is exhausted,
Cnt is reloaded with 20 and the inner loop is executed 20 more times.
The reason for this arrangement is to permit a PreFetch to be executed
every 20 words.

More precisely: Before each horizontal loop, the Cnt register is loaded with
  if DWidth < 23 then DWidth-1 else (DWidth-3) mod 20 +2, so:
	Cnt > 1 if the body contains one or more words.
	Cnt = 1 if there are no body words.
	Cnt = 0 in the thin case.
Cnt is decremented twice before the main loop is reached, so upon entry
to the main loop it contains DWidth-3 (assuming DWidth<23), which is precisely
the correct value for going around the loop DWidth-2 times.  (The extra 2
in DWidth are the first and last partial words, which are handled outside
the word loop.)

MCount is loaded with (DWidth-3)/20 -1, so:
	MCount < 0 if DWidth < 23
	MCount = 0 if 23 <= DWidth < 42, etc.
Each time the inner loop terminates, if MCount<0 the scan line is finished,
but if MCount>=0, MCount is decremented, Cnt is loaded with 17, and the
main loop is reentered for another 20 iterations.

When working left-to-right, the source may be thought of as T,,SrcWd --
that is, with T on the left and SrcWd on the right.  In any given operation,
T contains the previous source word and SrcWd contains the current one, and
the shift operation consists of left-shifting T,,SrcWd by 0 to 17 bits
(to align it with the destination) and storing the leftmost 20 bits of the
result.  Since the shifter is actually a cycler and produces the rightmost
20 bits (rather than the leftmost) of the cycled result, the shifter must
be set up to left-cycle an additional 20 bits (by exchanging SHA and SHB).
Hence SHA=R and SHB=T.

When working right-to-left, the source may be thought of as SrcWd,,T,
where T contains the previous source word and SrcWd the current, as before.
This is exactly symmetric with the left-to-right case and requires only
that SHA and SHB be exchanged.  Hence SHA=T and SHB=R.

Regardless of the direction, if there are at least as many data bits in the
first word of the source item as in the first word of the destination item,
then all the destination bits come from SrcWd during the first shift and
T need not be loaded at all.  On the other hand, if there are fewer bits
in the first word of the source item, the destination bits come from
both SrcWd and T.  In this case, T must contain the first source word and
SrcWd the second.  These two situations must be distinguished, since one
requires fetching only a single word whereas the other requires fetching
two words before operating the shifter for the first time.

A similar situation arises at the end of an item; that is, all the bits
in the last partial destination item may come from the leftover source
word in T, or it may be necessary to fetch an additional word into SrcWd.
These situations must be distinguished also.  An earlier version of this
microcode omitted the test and simply fetched the "extra" word always;
unfortunately, this sometimes resulted in touching a word outside the bit map,
which caused problems in a paged environment when the bit map happened
to be page-aligned.

Two bits in BBCtrl control the fetching of the "extra" word in the above
two cases.

In the normal case in which both srcBpl and dstBpl are multiples of 16,
the ShC register and all horizontal loop control information can be computed
just once for all scan lines.  If either srcBpl or dstBpl is not a multiple
of 16, this doesn't work, and all this stuff has to be recomputed for
each scan line.  I consider this case to be anomalous -- the PrincOps
specification is overly general, or perhaps was motivated by clever hacks
such as being able to BitBlt parallelograms.  Therefore, this implementation
takes the easy way out in this case by doing just one scan line, saving state
as if an interrupt had occurred, and restarting BitBlt from the beginning.
This is slow as blazes, but the user gets what he deserves.......

All the horizontal loops work in both directions by use of a trick:
ALUF[15] is redefined to be A+1 if moving left-to-right but A-1 if
right-to-left.  This ALU function is invoked by the operation "A+/-1",
and is used to advance source and destination pointers along the scan line.
A consequence of this is that ALUF[15] (normally A AND NOT B) cannot be
used by I/O tasks.

BBDst and BBSrc are 16-bit displacements relative to base registers
BBDstBR and BBSrcBR.  If moving top-to-bottom they start near zero
and count up; if bottom-to-top they start at ~2↑15 and count down.
If more than ~2↑15 words of bit map are processed, one of these displacements
becomes negative.  This is detected and causes BitBlt to restart, recomputing
the base registers and displacements.  BR displacement overflow is handled
this way because it takes too much code to recompute the BRs and displacements
in mid-stream.  The cost of restarting BitBlt from scratch in this case is
unimportant, since such a large amount of data is involved.  (Indeed, in
normal programs an interrupt will occur before 2↑15 words are processed,
so BitBlt will be restarted anyway.)

BitBlt is normally invoked with the stack containing a single item:
	TOS:	POINTER TO BitBltArg;

However, when resuming from an interrupt or fault, the stack looks like this:
	TOS:	(POINTER TO BitBltArg)+1;
	TOS+1:	CARDINAL;	-- number of scan lines already done

Note that the PrincOps requires BitBltArg to be 16-word aligned, so the
two cases can be distinguished by looking at the low-order bit of the pointer.
(Note that we distinguish the cases differently from the manner specified
in the PrincOps, which examines the stack depth to see whether intermediate
state is present or absent.  The latter is messy, especially in face of the
fact that a page fault restores StkP to its initial value.)

BitBlt exits in one of three ways:

1. Normal completion: BitBlt renders the stack empty and exits.

2. Interrupt request pending: BitBlt leaves at TOS+1 the number of scan lines
already processed and branches to the interrupt handler.  BitBlt is
subsequently restarted with the value left in TOS+1.

3. Page fault: BitBlt is careful to ensure restartability in the face of
page faults.  That is, any time a page fault can occur, TOS+1 contains the
number of scan lines already done, and restarting BitBlt with that argument
will cause the correct thing to happen.

Local stack usage:
	TOS:	BitBltArg
	TOS+1:	number of scan lines completed
	TOS+2:	saved ALUFM[17]
	TOS+3:	saved ALUFM[15]

This is, of course, relative to the initial StkP value, which is the one
that will be restored by RestoreStkP if a page fault occurs.  During
the main loops, StkP usually points to TOS+1.  Also, remember that
traps and faults save and restore two words above TOS.

Approximate timing for initialization and cleanup (excluding main loops):
	 65 cycles
	+10 if direction=backward
	+29 if resuming from interrupt

	+ one of the following:

	gray:	 13 cycles minimum
		 +6 if resuming from interrupt
		+37 if resuming and (heightMinusOne+1) is not a power of 2

	~gray:	 21 cycles minimum
		+29 if resuming from interrupt
		 +2 if direction=backward

Totals:	gray:	78 cycles minimum, 160 maximum 
	~gray:	86 cycles minimum, 156 maximum 

Note: these overheads are incurred on every scan line if dst.bit or src.bit
is not a multiple of 16.

See comments above main loops for main loop times.
%

*-----------------------------------------------------------
* R-register assignments:
*-----------------------------------------------------------

SetRMRegion[BBRegs];	* The RMRegion itself is defined in RegisterDefs.mc

RVN[BBDst];		* Address of next word to process
RVN[BBSrc];

RVN[DstInc];		* Address increment between scan lines --
RVN[SrcInc];		*  negative if direction = backward

RVN[DRast];		* Raster length (bits, later words) --
RVN[SRast];		*  negative if direction = backward

RVN[PrefDst];		* Address of next munch to PreFetch
RVN[PrefSrc];

RVN[DPrefOffset];	* Offset of leftmost word of next scan line relative
RVN[SPrefOffset];	*  to first word of current scan line

RVN[VCount];		* Vertical line count
RVN[MCount];		* Horizontal munch count
RVN[ICnt];		* Initial value of Cnt register for word loops

RVN[BBDisp];		* Control flags and horizontal loop dispatch
			* B0=1 => direction = backward
			* B8:15 = dispatch value relative to HorizontalDisp
			* (see below for details)

RVN[BBCtrl];		* Control flags:
			* B0=1 => 2 source words required for first dest word
			* B15=1 => 2 source words required for last dest word

RVN[SrcWd];		* Leftover source word -- must be RVREL 17

* Alias used when the source is gray:
RME[GrayBump, SRast];	* Size of gray brick in words -1

* Aliases used during initialization:
RME[Width, DPrefOffset]; * Width of item in bits
RME[DWidth, ICnt];	* Width of destination item in words
RME[SWidth, PrefSrc];	* Width of source item in words
RME[Skew, DstInc];	* Destination-source skew, mod 20
RME[BBMasks, PrefDst];	* LMask and RMask values to be loaded into SHC
RME[BBFlags, SrcInc];	* BitBlt flags word
RME[DstX, BBDst];	* Destination starting X in bits, later in words
RME[SrcX, BBSrc];	* Source starting X in bits, later in words
RME[IBBSrc, SPrefOffset]; * Initial BBSrc (gray only)
RME[BBTemp, SrcWd];	* Must be RVRel 17 because it is an arg to MulSub.

*-----------------------------------------------------------
* Other definitions
*-----------------------------------------------------------

* Base-register assignments

% -- Actually defined in ADefs.mc.  First two must be an even-odd pair.
BR[BBDstBR, ?];		* BitBlt destination base
BR[BBSrcBR, ?];		* BitBlt source base
BR[LPtr, ?];		* Scratch (emulator use only)
%


* ALU functions defined by BitBlt.
* The ALUF Ram is loaded by BitBlt with the desired operations.
XALUOP[,BBOp,,17,E];	* A BBOp B -- logical operation invoked with shifter

XAOP[,+/-1,15,E];	* A +/-1 -- A+1 or A-1 depending on horizontal
			* direction.  This value of ALUF is normally
			* A AND NOT B and is restored by BitBlt when done.
			* This means, however, that A AND NOT B cannot be used
			* by other tasks.


* Layout of BBDisp register:
* B0 = 0 if direction = forward, 1 if backward
* B8-15: BigBDispatch value for setup and body dispatches.
* The following addressing constraints apply:
* (1) B13 = 1 and B15 = 1 if a real source bit map is used (0 if gray block).
* (2) B14 = 0 if the item(s) must be touched before beginning a transfer.
* (3) B11 = 1 if the destination is an operand.
* (4) Certain targets are tied together by Call constraints.
* (5) Must not stomp on too many JCN conditional branch targets.
* These bits are rather carefully selected to permit the same BBDisp to be
* used in three different dispatches.

Set[SrcFlg, 5];
Set[NoTouchFlg, 2];
Set[DstFlg, 20];

* BBDispX defines the page used for dispatches on BBDisp.
Set[BBDispX, 1200];
M[BBAt, At[BBDispX, Add[#1, #2]]];

* Page-relative entry points for setup routines.
* All must have B11 = 1, to neutralize body dispatch (DstFlag).
Set[GrayDstSetup&TouchLoc, 70];
Set[GrayDstSetupLoc, 72];
Set[SrcDstSetup&TouchLoc, 75];
Set[SrcDstSetupLoc, 77];	* Must be SrcDstSetup&TouchLoc + 2

Set[HorizontalDispLoc, 70];	* Target of setup dispatch

* Page-relative entry points for body routines.
* Each pair must differ only in the value of DstFlag, and the addresses
* must neutralize any one bits that may have been present for setup dispatch.
Set[GrayBodyLoc, 12];
Set[GrayDstBodyLoc, 32];

Set[SrcBodyLoc, 17];		* These must be xxxx1111 because they contain
Set[SrcDstBodyLoc, 37];		*  conditional Call instructions

*-----------------------------------------------------------
NewOp; ESCEntry[BITBLT];
*-----------------------------------------------------------

	T← MDSHi, MemBase← LPtr, Call[BRHiGetsT];

* See whether this is an initial call or a resumption, and set low bit
* of BitBltArg pointer to indicate that further restarts are resumptions.
	T← Stack&+1← (Stack&+1) OR (1C), Branch[.+2, R odd];
	Stack← A0;			* Initial call, zero # of lines done
	T← T-1, StkP+1, RBase← RBase[BBRegs];
	BRLo← T;			* Set up BR to point to BitBltArg
	Fetch← BB.flags;
	BBFlags← MD, Fetch← BB.height;

* Dispatch on gray, srcFunc, dstFunc
	T← LDF[BBFlags, 4, 11];
	BigBDispatch← T;
	Branch[BBFunctionTable];

*-----------------------------------------------------------
BBFunctionTable: DispTable[20];		* gray, srcFunc, dstFunc, ALU op
	T← 1C, Branch[BBFC];		* 0, 0, 0	NOT A
	T← 21C, Branch[BBFD];		* 0, 0, 1	NOT A AND B
	T← 5C, Branch[BBFD];		* 0, 0, 2	NOT A OR B
	T← 15C, Branch[BBFD&T];		* 0, 0, 3	A EQV B
	T← 37C, Branch[BBFC];		* 0, 1, 0	A
	T← 35C, Branch[BBFD];		* 0, 1, 1	A AND B
	T← 27C, Branch[BBFD];		* 0, 1, 2	A OR B
	T← 23C, Branch[BBFD&T];		* 0, 1, 3	A XOR B
	T← 1C, Branch[BBFA];		* 1, 0, 0	NOT A
	T← 21C, Branch[BBFB];		* 1, 0, 1	NOT A AND B
	T← 5C, Branch[BBFB];		* 1, 0, 2	NOT A OR B
	T← 15C, Branch[BBFB&T];		* 1, 0, 3	A EQV B
	T← 37C, Branch[BBFA];		* 1, 1, 0	A
	T← 35C, Branch[BBFB];		* 1, 1, 1	A AND B
	T← 27C, Branch[BBFB];		* 1, 1, 2	A OR B
	T← 23C, Branch[BBFB&T];		* 1, 1, 3	A XOR B
*-----------------------------------------------------------

* Select the appropriate dispatch word for the function.
BBFA:	BBDisp← Add[NoTouchFlg]C, Branch[SetBBF];
BBFB:	BBDisp← Add[DstFlg, NoTouchFlg]C, Branch[SetBBF];
BBFB&T:	BBDisp← Add[DstFlg]C, Branch[SetBBF];
BBFC:	BBDisp← Add[SrcFlg, NoTouchFlg]C, Branch[SetBBF];
BBFD:	BBDisp← Add[SrcFlg, DstFlg, NoTouchFlg]C, Branch[SetBBF];
BBFD&T:	BBDisp← Add[SrcFlg, DstFlg]C, Branch[SetBBF];

SetBBF:	Stack&-1← ALUFMRW← T, ALUF[17],	* Set ALU function and save old
		T← MD;			* T← height in scan lines

* VCount← (scan lines left to do)-1.  Stack has scan lines already done.
	VCount← T-(Stack)-1;		* height - (scan lines done) -1

*-----------------------------------------------------------
* Set up destination base address
*-----------------------------------------------------------

* Registers in use: (permanent) BBDisp, VCount; (temporary) BBFlags.

	T← (Fetch← BB.dstBpl)-1, Branch[BitBltDone1, ALU<0];
	Fetch← T, DRast← MD;		* DRast← bits per line
	PD← (DRast) AND (17C);		* See if integral words per line
	T← Stack, Branch[.+2, ALU=0];
	T← Stack, Reschedule;		* No, process only one scan line
	Q← T, Branch[DstMult, ALU#0];	* Branch if resuming from interrupt

	T← (Fetch← BB.dstLo)+1, DstX← MD; * DstX← dst.bit
	T← MD, Fetch← T;		* T← low base word address
	BBTemp← MD, Branch[DstSetupBR];	* BBTemp← high base word address

* Following garbage is only executed when resuming BitBlt from an interrupt
* or fault.  We must compute the base address for the next scan line.
* n.b. our life is made very difficult by the presence of the BitAddress
* structure and the specification of bit map width in bits rather than words,
* to say nothing of features such as bits/line being negative if direction =
* backwards.  I assume these were intended to help the Dolphin implementation,
* but they are definitely not helpful!  Alto BitBlt both had a cleaner
* interface and was easier to implement.
DstMult:
	DstX← MD;			* DstX← dst.bit
	T← DRast, Call[MulSub];		* T,,Q ← Q*T, clobbers BBTemp

* Now have 32-bit bit offset in T,,Q.  Add it to bit map base address.
	Fetch← BB.dstLo, DRast, Branch[.+2, R>=0],
		DispTable[1, 2, 2];	* Squash final Multiply dispatch
	T← T-(Stack);			* DRast negative, adjust high result
					* to make it 32-bit 2's complement
	DstX← (DstX)+Q;			* Add low bit offset to dst.bit
	T← BBTemp← A← T, XorSavedCarry;	* BBTemp← high bit offset
	T← RCY[T, DstX, 4];		* Convert to word offset
	T← T+MD;			* Add to low base word address
	DstX← (DstX) AND (17C);		* Mask new BitAddress.bit
	Fetch← BB.dstHi;
	BBTemp← RSH[BBTemp, 4];		* BBTemp← high word offset
	BBTemp← (BBTemp)+MD, XorSavedCarry; * Add to high base word address

* Together again here.  Now BBTemp,,T = adjusted base value.
DstSetupBR:
	MemBase← BBDstBR, DRast, Branch[DstReallySetupBR, R>=0];

* Processing bottom-to-top.
* Decrease base by 2↑15, and (later) increase initial displacement by 2↑15.
	T← T-(100000C);
	BBTemp← (BBTemp)-1, XorSavedCarry;
	BBDisp← (BBDisp) OR (100000C);	* Set flag to remember direction

* Now finally ready to load the base register!
DstReallySetupBR:
	BRLo← T;
	BRHi← BBTemp;

*-----------------------------------------------------------
* Set up source base address
*-----------------------------------------------------------

* Registers in use: (permanent) BBDisp, VCount, DRast;
* (temporary) BBFlags, DstX.

	MemBase← LPtr;
	T← (Fetch← BB.srcBpl)-1;
	Fetch← T, SRast← MD, BBDisp,	* SRast← bits per line
		Branch[SrcIsGray, R even]; * Branch if source bit map is gray

	PD← (SRast) AND (17C);		* See if integral words per line
	T← Stack, Branch[.+2, ALU=0];
	T← Stack, Reschedule;		* No, process only one scan line
	Q← T, Branch[SrcMult, ALU#0];	* Branch if resuming from interrupt

	SrcX← MD, T← (Fetch← BB.srcLo)+1; * SrcX← src.bit
	T← MD, Fetch← T;		* T← low base word address
	BBTemp← MD, Branch[SrcSetupBR];	* BBTemp← high base word address

* Following garbage is only executed when resuming BitBlt from an interrupt
* or fault.  We must compute the base address for the next scan line.
SrcMult:
	SrcX← MD;			* SrcX← src.bit
	T← SRast, Call[MulSub];		* T,,Q ← Q*T, clobbers BBTemp

* Now have 32-bit bit offset in T,,Q.  Add it to bit map base address.
	Fetch← BB.srcLo, SRast, Branch[.+2, R>=0],
		DispTable[1, 2, 2];	* Squash final Multiply dispatch
	T← T-(Stack);			* SRast negative, adjust high result
					* to make it 32-bit 2's complement
	SrcX← (SrcX)+Q;			* Add low bit offset to src.bit
	T← BBTemp← A← T, XorSavedCarry;	* BBTemp← high bit offset
	T← RCY[T, SrcX, 4];		* Convert to word offset
	T← T+MD;			* Add to low base word address
	SrcX← (SrcX) AND (17C);		* Mask new BitAddress.bit
	Fetch← BB.srcHi;
	BBTemp← RSH[BBTemp, 4];		* BBTemp← high word offset
	BBTemp← (BBTemp)+MD, XorSavedCarry; * Add to high base word address

* Together again here.  Now BBTemp,,T = adjusted base value.
SrcSetupBR:
	MemBase← BBSrcBR, SRast, Branch[.+3, R>=0];

* Processing bottom-to-top.
* Decrease base by 2↑15, and (later) increase initial displacement by 2↑15.
	T← T-(100000C);
	BBTemp← (BBTemp)-1, XorSavedCarry;

* Now finally ready to load the base register!
	PD← (BBFlags) AND (Or[BBF.disjoint!, BBF.disjointItems!]C);
	BRLo← T, Branch[.+2, ALU#0];
	BBDisp← (BBDisp) AND (Not[NoTouchFlg]C); * Not disjoint, must touch
	BRHi← BBTemp, Branch[BBSetupX];

*-----------------------------------------------------------
* Set up gray source block
*-----------------------------------------------------------

* Registers in use: (permanent) BBDisp, VCount, DRast;
* (temporary) DstX.

* Re-interpret srcBpl as a GrayParm record.
* Note: assume direction = forward always (required by PrincOps).
* BBSrcBR will end up pointing one beyond the end of the gray brick MOD 2↑16,
* and BBSrc will address it using the negative indexing trick.
SrcIsGray:
	SrcX← MD, T← (Fetch← BB.srcLo)+1; * SrcX← src.bit
	BBTemp← MD, Fetch← T;		* BBTemp← low base address
	T← LDF[SRast, 4, 10];		* Extract yOffset
	GrayBump← (SRast) AND (17C);	* GrayBump← heightMinusOne --
					* GrayBump and SRast are same register
	T← (GrayBump)-T;		* T← heightMinusOne-yOffset
	BBTemp← (BBTemp)+T+1, MemBase← BBSrcBR; * End of brick +1
	BBTemp← MD, BRLo← BBTemp;	* Load up the base register
	BBTemp← (BBTemp)-1, XorSavedCarry; * high base-1 for negative indexing
	BRHi← BBTemp;
	PD← Stack;			* Resuming from interrupt?
	T← (GrayBump)-T,		* Recover yOffset
		Branch[NoGrayDiv, ALU=0]; * Branch if no division required

* Following garbage is only executed when resuming BitBlt from an interrupt
* or fault.  We must compute the grayIndex for the next scan line.
* In general this is (yOffset + scanLinesDone) MOD height.  I expect height
* usually to be a power of two, so optimize this case and only do the
* full-blown MOD operation when necessary.
	T← (Stack)+T;			* scanLinesDone + yOffset
	T← (GrayBump)+1, Q← T;		* T← height, Q← scanLinesDone+yOffset
	BBTemp← T;
	PD← (GrayBump) AND T;		* = 0 iff height is power of 2
	T← (GrayBump) AND Q, Branch[.+2, ALU=0];
* DivSub performs [T: remainder, Q: quotient] ← (T,,Q)/BBTemp
* and clobbers BBCtrl -- beware!
	T← A0, Call[DivSub];
	Nop;

* Together again here.  Now T = appropriate initial y coordinate.
NoGrayDiv:
	IBBSrc← T-(GrayBump)-1;		* Init negative index for references

*-----------------------------------------------------------
* X-coordinate setup: widths, margins, skew, masks, etc.
*-----------------------------------------------------------

* Registers in use: (permanent) BBDisp, VCount, DRast, SRast (= GrayBump);
* (temporary) DstX, SrcX (= BBSrc).
BBSetupX:
	MemBase← LPtr;
	Fetch← BB.width;
	Width← MD, T← A0;

* Compute LMask and RMask.
* LMask = dst.bit.
* RMask = 17-((dst.bit+width-1) mod 20) = (-dst.bit-width) mod 20.
* For reference, SHC fields are:
*  B2: SHA=T, B3: SHB=T, B4-7: count, B8-11: RMask, B12-15: LMask
	T← T-(Width);			* T← -width
	T← T-(DstX),			* T← -dst.bit-width
		Branch[BitBltDone2, ALU>=0]; * Stop here if width <= 0
	T← DPF[T, 4, 4], StkP+2;	* B8-11←((-dst.bit-width)mod 20) lsh 4
	BBMasks← T OR (DstX);		* B12-15← dst.bit

* Compute destination width in words, including first and last partial words.
* DWidth← (width + dst.bit +17) / 20.
	T← (Width)+(17C);
	T← T+(Q← DstX);
	DWidth← RSH[T, 4];

	BBCtrl← (BBDisp)-(BBDisp)-1,	* BBCtrl[0] ← 1 in case gray source
		Branch[BBNoSetupSrcX, R even]; * Branch if gray

* Compute source width in words, including first and last partial words.
* SWidth← (width + src.bit +17) / 20.
	T← (SrcX)+(17C);
	T← (Width)+T;
	SWidth← T← RSH[T, 4];

* Set flags to control fetching of "extra" first and last words.
* Except in the "thin" case (DWidth=1), the setup/finish routines for the
* horizontal loops nominally fetch 1 word and store 2 words; an extra fetch
* may be required at the beginning, the end, or both, depending on the number
* of words in the source and destination items (see introductory comments).
* DWidth and SWidth differ by at most +/-1.
* If SWidth = DWidth+1, an extra source word must be fetched at both ends.
* If SWidth = DWidth-1 or SWidth = 1, no extra source words need be fetched.
* If SWidth = DWidth # 1, an extra source word must be fetched at one end:
*   if src.bit mod 20 > dst.bit mod 20 then left else right.
* Set BBCtrl[0]← extraLeft, BBCtrl[1:15]← extraRight.
* (these flags are exchanged later if working right-to-left.)
* T still has SWidth; Q still has dst.bit.
* The following 2 instructions set BBCtrl← -1 if SWidth>DWidth, 0 otherwise.
	PD← (DWidth)-T;
	BBCtrl← T-T-1, XorSavedCarry, Branch[SetupSkew, ALU#0];

* Set BBCtrl← 100000 if src.bit > dst.bit, 77777 otherwise.
	PD← (SrcX)-Q-1;			* Q=DstX; carry iff src.bit > dst.bit
	BBCtrl← 100000C;
	BBCtrl← (BBCtrl)-1, XorSavedCarry, Branch[SetupSkew];

BBNoSetupSrcX:
	Nop;

* X-coordinate setup (cont'd)

* Compute skew = (src.bit-dst.bit) mod 20, and determine horizontal direction.
* Still have Q = dst.bit.
SetupSkew:
	T← (SrcX)-Q;
	PD← BBDisp, Branch[.+2, R even];
	Skew← T AND (17C), DblBranch[SetupRtoL, SetupLtoR, ALU<0];

	BBSrc← IBBSrc;			* Gray, init source ptr
	Skew← T AND (17C);
	T← 200C, Branch[NoSrcThinChk];	* ALUFM control for "A+1"

* direction = forward: work from left to right.
* Set SHA=R, SHB=T, ALUF[15]="A+1".
* Note: if skew = 0, set SHA=R, SHB=R.
SetupLtoR:
	T← 200C, Branch[NoSrcThinChk, ALU=0]; * ALUFM control for "A+1"
	Skew← (Skew) OR (20C), Branch[SetupALU&ShC]; * SHB=T

* direction = backward: work from right to left.
* Set SHA=T, SHB=R, ALUF[15]="A-1".
* Advance starting X coordinates to rightmost ends of items.
* Note: if skew = 0, set SHA=R, SHB=R, and do not exchange extra-word flags.
SetupRtoL:
	BBDisp← (BBDisp) OR (100000C), Branch[.+3, ALU=0];
	Skew← (Skew) OR (40C);		* SHA=T
	BBCtrl← (BBCtrl) LCY 1;		* Exchange source extra-word flags
	T← (Width)-1;
	DstX← (DstX)+T;			* Advance to rightmost X-coordinates
	SrcX← (SrcX)+T;			* Advance non-gray source
	T← 36C;				* ALUFM control for "A-1"

* Have ALUFM control in T for "A+/-1" operation.
SetupALU&ShC:
	PD← (SWidth)-1;			* Thin source check (see below)
NoSrcThinChk:
	Stack&-2← ALUFMRW← T, A+/-1,	* Set ALU function, save old value.
		Branch[SetupShC, ALU#0]; * Leave StkP -> TOS (scan line count)

* If we would have fetched an extra source word, but there is only one source
* word to fetch, then reset source extra-word flags and set SHA=R, SHB=R.
	BBCtrl← A0, Branch[.+2, R>=0];
	Skew← (Skew) AND (17C);
	T← LSH[BBMasks, 10], Branch[.+2]; * Placement

* Merge skew with masks and load SHC.
SetupShC:
	T← LSH[BBMasks, 10];		* Shift masks to B0-7
	T← LCY[T, Skew, 10];		* Concatenate SHA, SHB, count, masks
	T← DWidth, ShC← T;

*-----------------------------------------------------------
* Compute word addresses and increments
*-----------------------------------------------------------

* Registers in use: (permanent) BBDisp, VCount, DRast, SRast, Width, BBCtrl;
* (temporary) DstX, SrcX, DWidth, SWidth.
* Note: at this point DRast and SRast are still in units of bits.

* Convert DstX to X word displacement relative to leftmost word of
* first scan line.  Note: BBDst is the same register as DstX.
	BBDst← RSH[DstX, 4];		* BBDst← DstX/20

* Compute destination inter-scan-line word address increment.
* DstInc← DRast + (if direction=forward then -DWidth else DWidth).
* Also compute first PreFetch offset (from BBDst).
* = leftmost word on next scan line, even if processing right-to-left.
* DPrefOffset← if direction=forward then DRast else DRast-DWidth.
* Finally, if direction=backward then add 2↑15 to the word address,
* which will be counted down toward the base register which was initialized
* to the leftmost word of the first scan line -2↑15.
* Note: T = DWidth here.
	DRast← RSH[DRast, 4],		* Convert DRast to words
		Branch[DstIncRtoL, R<0]; * Branch if direction=backward

DstIncLtoR:
	DstInc← (DRast)-T;		* L to R: DRast-DWidth
	DPrefOffset← T← DRast, Branch[DstIncDone];

DstIncRtoL:
	DRast← (DRast) OR (170000C);	* Extend sign
	DstInc← (DRast)+T;		* R to L: DRast+DWidth
	DPrefOffset← T← (DRast)-T;
	BBDst← (BBDst)+(100000C);

DstIncDone:
	PrefDst← (BBDst)+T;		* Initial PreFetch address

* Is a source item required?  If not, skip source setup.
	BBDisp, Branch[BBNoSource, R even];

* Convert SrcX to X word displacement relative to leftmost word of
* first scan line.  Note: BBSrc is the same register as SrcX.
	BBSrc← RSH[SrcX, 4];		* BBSrc← SrcX/20

* Compute source inter-scan-line word address increment.
* SrcInc← SRast + (if direction=forward then -SWidth else SWidth).
* Also compute first PreFetch offset (from BBSrc).
* = leftmost word on next scan line, even if processing right-to-left.
* SPrefOffset← if direction=forward then SRast else SRast-SWidth.
* Finally, if direction=backward then add 2↑15 to the word address,
* which will be counted down toward the base register which was initialized
* to the leftmost word of the first scan line -2↑15.
	T← SWidth;
	SRast← RSH[SRast, 4],		* Convert SRast to words
		Branch[SrcIncRtoL, R<0]; * Branch if direction=backward

SrcIncLtoR:
	SrcInc← (SRast)-T;		* L to R: SRast-SWidth
	SPrefOffset← T← SRast, Branch[SrcIncDone];

SrcIncRtoL:
	SRast← (SRast) OR (170000C);	* Extend sign
	SrcInc← (SRast)+T;		* R to L: SRast+SWidth
	SPrefOffset← T← (SRast)-T;
	BBSrc← (BBSrc)+(100000C);

SrcIncDone:
	PrefSrc← (BBSrc)+T;		* Initial PreFetch address

*-----------------------------------------------------------
* Final adjustments prior to entering vertical loop
*-----------------------------------------------------------

* Compute ICnt, the initial value of the Cnt register for each loop.
* ICnt← if DWidth <= 22B then DWidth-1 else NOT (DWidth-3) [= 2-DWidth].
BBNoSource:
	PD← (DWidth)-(23C);
	ICnt← T← (DWidth)-1, Branch[.+2, ALU<0];
	ICnt← (1S)-T;			* = (2S)-(DWidth)

* Never need to "touch" data in the "thin" case.
* ALU=0 here iff the "thin" case applies (DWidth=1).
	Stack← (Stack)-1, Branch[.+2, ALU#0];
	BBDisp← (BBDisp) OR (Add[NoTouchFlg]C);

* Enter vertical loop with (scan lines done)-1 on TOS and ALU>=0.
	MCount← T-T-1, MemBase← BBSrcBR;
	PD← A0, Branch[BBVerticalLoop];

*-----------------------------------------------------------
* BitBlt vertical loop (per-scan-line)
* At top of loop, the following invariants hold:
*	VCount = (number of scan lines remaining)-1
*	Stack = (number of scan lines done)-1
*	MemBase = BBSrcBR
*	ALU<0 iff destination BR displacement has overflowed
* Vertical loop overhead:
*	 6 cycles for loop control and destination pointer update
*	+4 cycles for source pointer update if source item is used
*	+4 cycles if item is greater than 20B words wide
*-----------------------------------------------------------

BBVerticalLoop:
	VCount← (VCount)-1, T← MD,	* Force faults from previous item
		Branch[BBDoneOrOverflow, ALU<0, R<0];

* If positive, ICnt has the desired initial value of Cnt (DWidth-1).
	T← NOT (Cnt← ICnt), Branch[SmallBlock, R>=0];

* Item greater than one munch wide.  Set up separate munch and word counts.
* T now has DWidth-3, where DWidth is the number of words in the destination
* item, including first and last partial words.
* MCount ← (DWidth-3)/20 -1, Cnt ← (DWidth-3) mod 20 +2.
	MCount← RSH[T, 4];
	T← T AND (17C);
	T← T+(2C);
	MCount← (MCount)-1, Cnt← T;

* This dispatch goes to one of: GrayDstSetup, GrayDstSetup&Touch,
* SrcDstSetup, or SrcDstSetup&Touch.
SmallBlock:
	BigBDispatch← BBDisp;
	Stack← (Stack)+1,		* Advance scan lines done
		BranchExternal[Add[BBDispX, HorizontalDispLoc]];


*-----------------------------------------------------------
AdvanceSrcDst:
* Control returns here at the end of individual horizontal loops that
* involve both source and destination items.
* BBSrc, BBDst point one beyond last word processed.
*-----------------------------------------------------------
	T← SrcInc;
	BBSrc← T← (BBSrc)+T;
	PrefSrc← T+(SPrefOffset);
	T← BBDst, Branch[SrcBROverflow, ALU<0];

*-----------------------------------------------------------
AdvanceDst:
* Control returns here at the end of individual horizontal loops that
* involve only a destination item (and gray, which is handled separately).
* BBDst and T point one beyond last word processed.
*-----------------------------------------------------------
	BBDst← T← T+(DstInc);
	PrefDst← T← T+(DPrefOffset),
		DblBranch[BBReschedPending, BBVerticalLoop, Reschedule];

SrcBROverflow:
	Branch[BBDoneOrOverflow];

*-----------------------------------------------------------
* Either VCount is exhausted or one of the BR displacements overflowed
* (or conceivably both events occurred at the same time).
* If VCount is exhausted then return normally; otherwise restart BitBlt.
* Note: scan line count at TOS is one behind.
*-----------------------------------------------------------
BBDoneOrOverflow:
	Branch[BBReschedPending];	* Handle same as Reschedule request


*-----------------------------------------------------------
* Reschedule pending.  Branch to the reschedule trap handler.
* After handling the interrupt, it will restart the BitBlt with the state
* left on the stack.
*-----------------------------------------------------------
BBReschedPending:
	T← MD, VCount, Branch[.+2, R>=0];

* We just processed the last scan line, so exit normally.
	StkP+2, Branch[BitBltDone];

* Not done yet: either BR displacement overflowed or Reschedule is pending.
* Restore clobbered registers, check for interrupts, and then restart the instruction.
* Note: scan line count at TOS is one behind, so fix it.
	Stack&-1← (Stack&-1)+1;
	Call[RestoreStdRegs];
	Branch[.+2, Reschedule];	* Start instruction over if no interrupt pending
	Branch[@BITBLT];

* Reschedule request is pending.  Give interrupt a chance to take.  If it doesn't,
* restart the instruction from the current point.
	Call[InterruptLongRunningOpcode];
	Branch[@BITBLT];


*-----------------------------------------------------------
* Really done.  Restore clobbered ALUFM, flush stack, and exit.
*-----------------------------------------------------------
* Get here during initialization when only ALUFM[17] has been clobbered.
BitBltDone1:
	Nop;
BitBltDone2:
	StkP+1, Branch[.+2];

* Get here at end when both ALUFM[17] and ALUFM[15] have been clobbered.
BitBltDone:
	ALUFMRW← Stack&-1, ALUF[15];
	ALUFMRW← Stack&-2, ALUF[17];
* Must clear the low-order bit of BitBltArg so that if the caller does
* a REC to recover it, he will see the right thing.
	Stack&-1← (Stack&-1)-1, RBase← RBase[MesaRegsMain];
	ShC← StdShC, NextOpcode;

*-----------------------------------------------------------
* Horizontal loops
*-----------------------------------------------------------

%
Organization of horizontal (per-word) loops:

There are a number of variations, depending on the following:
	1. Whether the source is normal or gray;
	2. Whether or not the destination is an operand;
	3. Whether or not the destination is "thin" (one word per scan line);
	4. Whether or not data need be "touched" before doing the transfer.

There are fewer than 16 total cases because some of these combinations
cannot occur (for example, the thin case never requires touching).
Each case has a "setup" routine, a "body" routine, and a "finish" routine.
Many of these routines are shared among cases, and the flow of control
is determined by a complicated network of dispatches on BBDisp.
Finish routines exit to the vertical loop by a branch to AdvanceDst or
AdvanceSrcDst.

It is never necessary to "touch" if any of the following is true:
	a. The destination is "thin";
	b. The destination is not an operand of the BitBlt function;
	c. The function can be performed multiple times with the same effect
	   as performing it only once (i.e., anything except XOR,
	   assuming the source and destination items don't overlap).
The "touch" and "no touch" cases are distinguished as part of the dispatch.

To reduce miss wait and increase performance, while processing each
scan line we fall out of the main loop once per munch and PreFetch
one munch for the next scan line.  This strategy depends on the
assumption that each scan line is less than 100 munches long (for a 100-row
cache, which is what the Dorado has at present).  Note that PreFetches
are done left-to-right even if transfers are done right-to-left.

Note that the current implementation is imperfect in that the last munch
of the next scan line may not be prefetched, or an extra munch prefetched
unnecessarily, because the main loop does not terminate at munch boundaries
but rather at multiples of 20 words from the end of the scan line.
Also, the first scan line is not prefetched, and a line past the end of
the last scan line is prefetched unnecessarily.  These defects should not
affect performance noticeably in normal use.
%

TopLevel;
KnowRBase[BBRegs];

%*-----------------------------------------------------------
Case A:  dest ← f(gray)

BBOp is taken from the following table:

    Function (class)	gray	srcFunc		dstFunc		ALU operation
	10 (A)		TRUE	normal		null		NOT A
	14 (A)		TRUE	complement	null		A

Entry point to this code is at GrayDestSetup.
It dispatches here after setup.

Timing, per scan line, including vertical loop overhead:
	19 cycles minimum in the normal case
	+1 cycle per full word (excluding first and last partial words)
	+4 + (2 * # of munches) cycles if item is wider than 20B words

	12 cycles total in the "thin" case
%*-----------------------------------------------------------

* Entry point for srcFunc = normal.
* T has first partial word to be stored, and SrcWd has gray word.
GrayBody: BBAt[GrayBodyLoc],
	BBDst← (Store← BBDst)+/-1, DBuf← T;
	T← ShiftNoMask[SrcWd], Branch[GrayEnd, Cnt=0&-1];

* Just replicate the shifted gray word throughout the destination body.
GrayLoop:
	BBDst← (Store← BBDst)+/-1, DBuf← T, Branch[GrayLoop, Cnt#0&-1];

GrayEnd:
	MCount← (MCount)-1, Cnt← 17S, Branch[GrayLast, R<0];
	PrefDst← (PreFetch← PrefDst), Carry20, Branch[GrayLoop];

GrayLast:
	Fetch← BBDst, Branch[DstFinish];

* Store last partial word.  This is the tail of cases A and B.
* Have already fetched the last word (at BBDst).
DstFinish:
	BBDisp, Branch[.+2, R<0];
	T← XShMDRMask[SrcWd], B← MD, Branch[StoreLastDst];
	T← XShMDLMask[SrcWd], B← MD, Branch[StoreLastDst];

* Thin destination slice, for cases A and B.
DstThin:
	T← XShMDBothMasks[SrcWd], B← MD;
StoreLastDst:
	BBDst← T← (Store← BBDst)+/-1, DBuf← T, FlipMemBase,
		Branch[AdvanceDst];

%*-----------------------------------------------------------
Case B:  dest ← f(gray, dest)

BBOp is taken from the following table:

    Function (class)	gray	srcFunc		dstFunc		ALU operation
	11 (B)		TRUE	normal		and		NOT A AND B
	12 (B)		TRUE	normal		or		NOT A OR B
	13 (B)		TRUE	normal		xor		A EQV B
	15 (B)		TRUE	complement	and		A AND B
	16 (B)		TRUE	complement	or		A OR B
	17 (B)		TRUE	complement	xor		A XOR B

The gray word is kept in SrcWd and put directly into the shifter.

Case A also dispatches to one of case B's entry points, GrayDstSetup.
The setup code dispatches to the correct body routine.

Timing, per scan line, including vertical loop overhead:
	18 cycles minimum in the normal case
	+3 cycles per full word (excluding first and last partial words)
	+6 + (2 * # of pages) cycles if data needs to be "touched"
	+4 + (2 * # of munches) cycles if item is wider than 20B words

	13 cycles total in the "thin" case
%*-----------------------------------------------------------

GrayDstSetup&Touch: BBAt[GrayDstSetup&TouchLoc],
	T← DRast, FlipMemBase, Call[TouchDst];
	FlipMemBase;

GrayDstSetup: BBAt[GrayDstSetupLoc],
	T← BBSrc← (Fetch← BBSrc)+1, FlipMemBase; * Fetch gray word
	SrcWd← MD, Fetch← BBDst,	* SrcWd← gray word
		Branch[.+2, ALU<0];	* Exhausted gray block?
	BBSrc← T-(GrayBump)-1;		* Yes, reset for next scan line

	PrefDst← (PreFetch← PrefDst), Carry20, Branch[DstThin, Cnt=0&-1];
* The following dispatch may modify GrayBody to GrayDstBody.
	BigBDispatch← BBDisp, Branch[.+2, R<0];
	T← XShMDLMask[SrcWd], B← MD, Branch[GrayBody];
	T← XShMDRMask[SrcWd], B← MD, Branch[GrayBody];

* Body routine for case B only.
GrayDstBody: BBAt[GrayDstBodyLoc],
	BBDst← T← (Store← BBDst)+/-1, DBuf← T;
	PrefSrc← (Fetch← T)+/-1, Branch[GrayDstEnd, Cnt=0&-1];

* Inner loop runs with one word fetched ahead (now in MD).
* PrefSrc is used as a temporary -- runs one word ahead of BBDst.
GrayDstLoop:
	PrefSrc← (Fetch← PrefSrc)+/-1, T← MD;
	T← XShiftNoMask[SrcWd], B← T;
	BBDst← (Store← BBDst)+/-1, DBuf← T, Branch[GrayDstLoop, Cnt#0&-1];

GrayDstEnd:
	MCount← (MCount)-1, Cnt← 17S, Branch[DstFinish, R<0];
	PrefDst← (PreFetch← PrefDst), Carry20, Branch[GrayDstLoop];

%*-----------------------------------------------------------
Case C:  dest ← f(source)

BBOp is taken from the following table:

    Function (class)	gray	srcFunc		dstFunc		ALU operation
	 0 (C)		FALSE	normal		null		NOT A
	 4 (C)		FALSE	complement	null		A

Entry points to this code are at SrcDstSetup and SrcDstSetup&Touch.
They dispatch here after setup.

Timing, per scan line, including vertical loop overhead:
	25 cycles minimum in the normal case
	+4 cycles per full word (excluding first and last partial words)
	+12 + (4 * # of pages) cycles if data needs to be "touched"
	+1 cycle if 2 source words are required for the first destination word
	+3 cycles if gray is required
	+4 + (5 * # of munches) cycles if item is wider than 20B words

	17 cycles minimum in the "thin" case
	+1 cycle if 2 source words are required
%*-----------------------------------------------------------


* Body routine for case C only.
SrcBody: BBAt[SrcBodyLoc],
	T← BBDst← (Store← BBDst)+/-1, DBuf← T, FlipMemBase,
**** Program around MicroD problem.  Desired branch clause is:
****		Call[SrcEnd, Cnt=0&-1];
**** but we must actually write:
		BRGO@[0] RETCL@[2] JCN[43];
	BBSrc← (Fetch← BBSrc)+/-1, FlipMemBase,
		BBAt[SrcBodyLoc, 1]; **** MicroD problem
	SrcWd← MD, T← SrcWd;
	T← XShiftNoMask[SrcWd], Branch[SrcBody];

* This is called as a subroutine at the end of each munch.
* It either exits the horizontal loop or returns to do another munch.
SrcEnd:
Subroutine;
	MCount← (MCount)-1, Cnt← 17S,
		DblBranch[SrcDstFinish, SrcDstMore, R<0],
		BBAt[SrcBodyLoc, 2]; **** MicroD problem
SrcDstMore:
	PrefSrc← (PreFetch← PrefSrc), Carry20;
	FlipMemBase;
	PrefDst← (PreFetch← PrefDst), Carry20;
	FlipMemBase, Return;
TopLevel;

* Store last partial word, for cases C and D.
SrcDstFinish:
	BBCtrl, Branch[.+2, R even];	* Fetch extra word at end?
	BBSrc← (Fetch← BBSrc)+/-1, FlipMemBase, Branch[.+2];
	FlipMemBase;
	SrcWd← MD, T← SrcWd, Fetch← T;
	BBDisp, Branch[.+2, R<0];
	T← XShMDRMask[SrcWd], B← MD, Branch[StoreLastSrcDst];
	T← XShMDLMask[SrcWd], B← MD, Branch[StoreLastSrcDst];

* Thin destination slice, for cases C and D.
SrcDstThin:
	T← XShMDBothMasks[SrcWd], B← MD;
StoreLastSrcDst:
	BBDst← (Store← BBDst)+/-1, DBuf← T, FlipMemBase,
		Branch[AdvanceSrcDst];

%*-----------------------------------------------------------
Case D:  dest ← f(source, dest)

BBOp is taken from the following table:

    Function (class)	gray	srcFunc		dstFunc		ALU operation
	 1 (D)		FALSE	normal		and		NOT A AND B
	 2 (D)		FALSE	normal		or		NOT A OR B
	 3 (D)		FALSE	normal		xor		A EQV B
	 5 (D)		FALSE	complement	and		A AND B
	 6 (D)		FALSE	complement	or		A OR B
	 7 (D)		FALSE	complement	xor		A XOR B

Case C also dispatches to these entry points.
The setup code dispatches to the correct body routines.

Timing, per scan line, including vertical loop overhead:
	 25 cycles minimum in the normal case
	+ 5 cycles per full word (excluding first and last partial words)
	+12 + (4 * # of pages) cycles if data needs to be "touched"
	+ 1 cycle if 2 source words required for the first destination word
	+ 3 cycles if gray is required
	+ 4 + (5 * # of munches) cycles if item is wider than 20B words

	 17 cycles minimum in the "thin" case
	+ 1 cycle if 2 source words are required
%*-----------------------------------------------------------

SrcDstSetup&Touch: BBAt[SrcDstSetup&TouchLoc],
	T← DRast, FlipMemBase, Call[TouchDst];
	T← SRast, FlipMemBase, Call[TouchSrc];
SrcDstSetup: BBAt[SrcDstSetupLoc],
	BBCtrl, DblBranch[SrcDstSetup2, SrcDstSetup1, R<0];

SrcDstSetup2:
	BBSrc← (Fetch← BBSrc)+/-1;
SrcDstSetup1:
	PrefSrc← (PreFetch← PrefSrc), Carry20;
	T← MD, BBSrc← (Fetch← BBSrc)+/-1, FlipMemBase;
	SrcWd← MD, Fetch← BBDst;
	PrefDst← (PreFetch← PrefDst), Carry20, Branch[SrcDstThin, Cnt=0&-1];
* The following dispatch may modify SrcBody to SrcDstBody.
	BigBDispatch← BBDisp, Branch[.+2, R<0];
	T← XShMDLMask[SrcWd], B← MD, Branch[SrcBody];
	T← XShMDRMask[SrcWd], B← MD, Branch[SrcBody];

* Body routine for case D only.
SrcDstBody: BBAt[SrcDstBodyLoc],
	T← BBDst← (Store← BBDst)+/-1, DBuf← T, FlipMemBase,
**** Program around MicroD problem.  Desired branch clause is:
****		Call[SrcDstEnd, Cnt=0&-1];
**** but we must actually write:
		BRGO@[0] RETCL@[2] JCN[103];
	BBSrc← (Fetch← BBSrc)+/-1, FlipMemBase,
		BBAt[SrcDstBodyLoc, 1]; **** MicroD problem
	SrcWd← MD, T← SrcWd, Fetch← T;
	T← XShiftNoMask[SrcWd], B← MD, Branch[SrcDstBody];

* This is called as a subroutine at the end of each munch.
* It either exits the horizontal loop or returns to do another munch.
SrcDstEnd:
Subroutine;
	MCount← (MCount)-1, Cnt← 17S,
		DblBranch[SrcDstFinish, SrcDstMore, R<0],
		BBAt[SrcDstBodyLoc, 2]; **** MicroD problem
TopLevel;

*-----------------------------------------------------------
TouchSrc:
* Touches every page of the source item for this scan line.
* Entry: BBSrc = address of first word of source item
*	T = SRast = words per source scan line; negative => bottom-to-top
*	SrcInc = distance between last word of one scan line and
*		first word of the next; negative => bottom-to-top
*	MemBase = BBSrcBR
* Exit: T and BBTemp clobbered
* Method: touch the last words of the item.  If the item is >401B
* words long, also touch intermediate words at multiples of 400B words from
* the last word of the item.
* Note: we don't also need to touch the first word of the item, since if a fault
* occurs on it during the BitBlt word loop, nothing will have changed yet.
* Note: |SRast-SrcInc| gives the number of words touched in the item;
* the value is positive if moving left-to-right, negative if right-to-left.
* Timing: 5 cycles for first page or less
*	 +1 cycle if right-to-left
*	 +2 cycles per additional page
*-----------------------------------------------------------
Subroutine;

	T← T-(SrcInc)-1;
	BBTemp← (BBSrc)+T, DblBranch[STouchRtoL, STouchLtoR, ALU<0];

STouchLtoR:
* T has (# words in source item)-1, BBTemp points to last (rightmost) word.
	T← T-(400C)-1;
	BBTemp← (Fetch← BBTemp)-(400C), Branch[.+2, ALU<0];
	T← T-(400C), Branch[.-1];
	Return;

STouchRtoL:
* T has -(# words in source item)-1, BBTemp points to last (leftmost) word -2.
	BBTemp← (BBTemp)+(2C);		* Fix the off-by-2 address
	T← T+(400C)+1;			* Should be T+402, but no matter
	BBTemp← (Fetch← BBTemp)+(400C), Branch[.+2, ALU>=0];
	T← T+(400C), Branch[.-1];
	Return;

*-----------------------------------------------------------
TouchDst:
* Touches every page of the destination item for this scan line.
* Entry: BBDst = address of first word of destination item
*	T = DRast = words per destination scan line; negative => bottom-to-top
*	DstInc = distance between last word of one scan line and
*		first word of the next; negative => bottom-to-top
*	MemBase = BBDstBR
* Exit: T and BBTemp clobbered
* Note: must dirty any words touched, to force write-protect fault.
* Timing: 6 cycles for first page or less
*	 +1 cycle if right-to-left
*	 +3 cycles per additional page
*-----------------------------------------------------------
Subroutine;

	T← T-(DstInc)-1;
	BBTemp← (BBDst)+T, DblBranch[DTouchRtoL, DTouchLtoR, ALU<0];

DTouchLtoR:
* T has (# words in source item)-1, BBTemp points to last (rightmost) word.
	T← T-(400C)-1;
DTouchLoopLtoR:
	Fetch← BBTemp, PD← T;
	T← T-(400C), Branch[DTouchDoneLtoR, ALU<0];
	Store← BBTemp, DBuf← MD;
	BBTemp← (BBTemp)-(400C), Branch[DTouchLoopLtoR];

DTouchDoneLtoR:
	Store← BBTemp, DBuf← MD, Return;

DTouchRtoL:
* T has -(# words in source item)-1, BBTemp points to last (leftmost) word -2.
	BBTemp← (BBTemp)+(2C);		* Fix the off-by-2 address
	T← T+(400C)+1;			* Should be T+402, but no matter
DTouchLoopRtoL:
	Fetch← BBTemp, PD← T;
	T← T+(400C), Branch[DTouchDoneRtoL, ALU>=0];
	Store← BBTemp, DBuf← MD;
	BBTemp← (BBTemp)+(400C), Branch[DTouchLoopRtoL];

DTouchDoneRtoL:
	Store← BBTemp, DBuf← MD, Return;