*-----------------------------------------------------------
Title[AltoBitBlt.mc...February 17, 1983  10:38 AM...Taft];
* Bit-boundary block-transfer
* This version emulates Alto BitBlt, and has been extended to
* permit use of long pointers.
*-----------------------------------------------------------

%
Refer to the Alto Hardware Manual for primary documentation.
All numbers are octal except timings, which are decimal.

BBTable format:
	0	Function (see below)
	1	unused
	2	DBCA	destination base core address
	3	DBMR	destination bit map raster
	4	DLX	destination left X
	5	DTY	destination top Y
	6	DW	destination width (also source width)
	7	DH	destination height (also source height)
	10	SBCA	source base core address
	11	SBMR	source bit map raster
	12	SLX	source left X
	13	STY	source top Y
	14	Gray0
	15	Gray1
	16	Gray2
	17	Gray3
	20	LSBCAlo	long pointer to source bit map
	21	LSBCAhi
	22	LDBCAlo	long pointer to destination bit map
	23	LDBCAhi

Function:
	B0=1	Use the long pointers in words 20-23 and ignore DBCA, SBCA

		Following 2 bits defined only if NOT using long pointers:
	B10=1	Source block is in the alternate bank (XM)
	B11=1	Destination block is in the alternate bank (XM)

	B12:13	SourceType
	B14:15	Operation


The 20 BitBlt Functions (combinations of SourceType and Operation) are
divided into 6 classes:
	A	dest ← gray
	B	dest ← f(gray, dest)
	C	dest ← f(source)
	D	dest ← f(source, gray)
	E	dest ← f(source, dest)
	F	dest ← f(source, gray, dest)

The distribution of functions into classes is:
    Function (class)	SourceType		Operation
	 0 (C)		0 source		0 dest ← source
	 1 (E)		0 source		1 dest ← source OR dest
	 2 (E)		0 source		2 dest ← source XOR dest
	 3 (E)		0 source		3 dest ← NOT source AND dest
	 4 (C)		1 NOT source		0 dest ← source
	 5 (E)		1 NOT source		1 dest ← source OR dest
	 6 (E)		1 NOT source		2 dest ← source XOR dest
	 7 (E)		1 NOT source		3 dest ← NOT source AND dest
	10 (D)		2 source AND gray	0 dest ← source
	11 (F)		2 source AND gray	1 dest ← source OR dest
	12 (F)		2 source AND gray	2 dest ← source XOR dest
	13 (F)		2 source AND gray	3 dest ← NOT source AND dest
	14 (A)		3 gray			0 dest ← source
	15 (B)		3 gray			1 dest ← source OR dest
	16 (B)		3 gray			2 dest ← source XOR dest
	17 (B)		3 gray			3 dest ← NOT source AND dest

If DLX <= SLX, the BitBlt horizontal loop works left-to-right;  if DLX > SLX,
right-to-left.  Similarly, if DTY < STY, the vertical loop works top-to-
bottom;  if DTY >= STY, bottom-to-top.  This is so that the correct thing
will happen if source and destination blocks overlap.  (Note that this test
depends on the assumption that if the blocks overlap, they belong to the
same bit map, i.e., DBCA=SBCA and DBMR=SBMR.  No check is made for this.)
The "=" cases could be handled either way; however, the DTY=STY
case must be handled bottom-to-top because the documented algorithm for
proper phasing of the gray block depends on this.

Terminology: when referring to words in a scan line, "left" and "right"
refer to the words with lower and higher addresses, respectively, independent
of the direction of processing the scan line;  "first" and "last" refer
to the first and last words encountered in the direction of processing.

The destination block is considered to consist of a left partial word,
some number of full (body) words, and a right partial word.  If the block
begins or ends on a word boundary, the left or right word is still
considered to be a partial word.

The destination bits preserved in the left and right partial words
are determined by the SHC register's LMask and RMask, respectively,
where LMask = DLX mod 20, RMask = 17-((DLX+DW-1) mod 20).

The destination width in words, including the first and last partial words,
is computed by DWidth = (DW + (DLX mod 20) + 17)/20.  Similarly,
the source width in words, including the first and last partial words,
is computed by SWidth = (DW + (SLX mod 20) + 17)/20.
DWidth and SWidth differ by at most +/-1.

If DWidth > 2 then the destination consists of a left partial word, DWidth-2
full body words, and a right partial word.
If DWidth = 2 then the block consists of a left partial word immediately
followed by a right partial word.
If DWidth = 1 then the entire destination block lies within a single word,
not crossing a word boundary.  Effectively, the left and right partial words
are one and the same.  This case (called the "thin" case) requires
special handling, as both LMask and RMask must be applied simultaneously.

Processing of the scan line is controlled by two counters: the Cnt register
(loaded from ICnt), which counts the inner loop (20 words or fewer), and
MCount, which counts the outer loop.  Roughly speaking,
Cnt is loaded initially with DWidth mod 20 and MCount with DWidth/20.
The inner loop is executed Cnt times; then, until MCount is exhausted,
Cnt is reloaded with 20 and the inner loop is executed 20 more times.
The reason for this arrangement is to permit a PreFetch to be executed
every 20 words.

More precisely: Before each horizontal loop, the Cnt register is loaded with
  if DWidth < 23 then DWidth-1 else (DWidth-3) mod 20 +2, so:
	Cnt > 1 if the body contains one or more words.
	Cnt = 1 if there are no body words.
	Cnt = 0 in the thin case.
Cnt is decremented twice before the main loop is reached, so upon entry
to the main loop it contains DWidth-3 (assuming DWidth<23), which is precisely
the correct value for going around the loop DWidth-2 times.  (The extra 2
in DWidth are the first and last partial words, which are handled outside
the word loop.)

MCount is loaded with (DWidth-3)/20 -1, so:
	MCount < 0 if DWidth < 23
	MCount = 0 if 23 <= DWidth < 42, etc.
Each time the inner loop terminates, if MCount<0 the scan line is finished,
but if MCount>=0, MCount is decremented, Cnt is loaded with 17, and the
main loop is reentered for another 20 iterations.

When working left-to-right, the source may be thought of as T,,SrcWd --
that is, with T on the left and SrcWd on the right.  In any given operation,
T contains the previous source word and SrcWd contains the current one, and
the shift operation consists of left-shifting T,,SrcWd by 0 to 17 bits
(to align it with the destination) and storing the leftmost 20 bits of the
result.  Since the shifter is actually a cycler and produces the rightmost
20 bits (rather than the leftmost) of the cycled result, the shifter must
be set up to left-cycle an additional 20 bits (by exchanging SHA and SHB).
Hence SHA=R and SHB=T.

When working right-to-left, the source may be thought of as SrcWd,,T,
where T contains the previous source word and SrcWd the current, as before.
This is exactly symmetric with the left-to-right case and requires only
that SHA and SHB be exchanged.  Hence SHA=T and SHB=R.

Regardless of the direction, if there are at least as many data bits in the
first word of the source block as in the first word of the destination block,
then all the destination bits come from SrcWd during the first shift and
T need not be loaded at all.  On the other hand, if there are fewer bits
in the first word of the source block, the destination bits come from
both SrcWd and T.  In this case, T must contain the first source word and
SrcWd the second.  These two situations must be distinguished, since one
requires fetching only a single word whereas the other requires fetching
two words before operating the shifter for the first time.

A similar situation arises at the end of a block; that is, all the bits
in the last partial destination block may come from the leftover source
word in T, or it may be necessary to fetch an additional word into SrcWd.
These situations must be distinguished also.  An earlier version of this
microcode omitted the test and simply fetched the "extra" word always;
unfortunately, this sometimes resulted in touching a word outside the bit map,
which caused problems in a paged environment when the bit map happened
to be page-aligned.

Two bits in BBFlags control the fetching of the "extra" word in the above
two cases.

All the horizontal loops work in both directions by use of a trick:
ALUF[15] is redefined to be A+1 if moving left-to-right but A-1 if
right-to-left.  This ALU function is invoked by the operation "A+/-1",
and is used to advance source and destination pointers along the scan line.
A consequence of this is that ALUF[15] (normally A AND NOT B) cannot be
used by I/O tasks.

BBDst and BBSrc are 16-bit displacements relative to base registers
BBDstBR and BBSrcBR.  If moving top-to-bottom they start near zero
and count up; if bottom-to-top they start at ~2↑15 and count down.
If more than ~2↑15 words of bit map are processed, one of these displacements
becomes negative.  This is detected and causes BitBlt to restart, recomputing
the base registers and displacements.  BR displacement overflow is handled
this way because it takes too much code to recompute the BRs and displacements
in mid-stream.  The cost of restarting BitBlt from scratch in this case is
unimportant, since such a large amount of data is involved.  (Indeed, in
normal programs an interrupt will occur before 2↑15 words are processed,
so BitBlt will be restarted anyway.)

*-----------------------------------------------------------
* BitBlt Calling Sequence
*-----------------------------------------------------------

The BitBlt subroutine is called by:
	SCall[BitBltSub]
with the Top-of-Stack (TOS) containing the following arguments:
	TOS-1:	pointer to BBTable (in current space)
	TOS:	0

Assumes RBase[AEmRegs] when called and leaves it that way upon return.

Assumes that the BBTable is relative to the MemBase in effect at the time
of the call.  The caller shouldn't assume anything about MemBase upon return.

BitBlt exits in one of three ways:

1. Normal completion: BitBlt sets TOS to zero and returns to caller+2.

2. Interrupt request pending: BitBlt sets TOS to the number of scan lines
already processed and returns to caller+1.  The calling emulator processes
the interrupt and then restarts the BitBlt with the value left in TOS.
Note: if the BitBlt source or destination block is larger than 2↑15 words,
BitBlt stops after processing at most 2↑15 words and returns +1, exactly
as if an interrupt had occurred.

3. Page fault: BitBlt is careful to ensure restartability in the face of
page faults.  That is, any time a page fault can occur, TOS contains the
number of scan lines already done, and restarting BitBlt with that argument
will cause the correct thing to happen.  (However, a BitBlt that causes a
write-protect fault is NOT guaranteed to be restartable.)

Local stack usage:
	TOS-1:	BBTable
	TOS:	number of scan lines completed
	TOS+1:	not used (preserved for Alto Emulator's benefit)
	TOS+2:	return Link
	TOS+3:	saved ALUFM[17]
	TOS+4:	saved ALUFM[15]

Approximate timing for initialization and cleanup (excluding main loops):
	 69 cycles (minimum)
	 +4 if working bottom-to-top
	+21 if DTY # 0 or working bottom-to-top
	 +1 if DWidth = 1 or DWidth > 22B
Add the following if a source bit map is required:
	+31
	 +5 if working right-to-left
	 +4 if working bottom-to-top
	+21 if STY # 0 or working bottom-to-top
	 +5 if DWidth=SWidth
	 +1 if SWidth = 1
	 +2 if STY = DTY and SBMR = DBMR

	 95 cycles maximum in destination-only cases
	164 cycles maximum in source-destination cases

See comments above main loops for main loop times.
%

*-----------------------------------------------------------
* R-register assignments:
*-----------------------------------------------------------

SetRMRegion[BBRegs];	* The RMRegion itself is defined in RegisterDefs.mc

RVN[BBDst];		* Address of next word to process
RVN[BBSrc];

RVN[DstInc];		* Address increment between scan lines --
RVN[SrcInc];		*  negative if working bottom-to-top

RVN[DRast];		* Raster length (words) --
RVN[SRast];		*  negative if working bottom-to-top

RVN[PrefDst];		* Address of next munch to PreFetch
RVN[PrefSrc];

RVN[DPrefOffset];	* Offset of leftmost word of next scan line relative
RVN[SPrefOffset];	*  to first word of current scan line

RVN[VCount];		* Vertical line count
RVN[MCount];		* Horizontal munch count
RVN[ICnt];		* Initial value of Cnt register for word loops

RVN[BBDisp];		* Control flags and horizontal loop dispatch
			* B0=1 => work right-to-left
			* B8:15 = dispatch value relative to HorizontalDisp
			* (see below for details)

RVN[BBFlags];		* Control flags:
			* B0=1 => 2 source words required for first dest word
			* B15=1 => 2 source words required for last dest word

RVN[SrcWd];		* Leftover source word -- must be RVREL 17

* Additional registers, overlaid with emulator temps in the AEmRegs block.
RME[Gray0, ETemp0];	* The gray block
RME[Gray1, ETemp1];
RME[Gray2, ETemp2];
RME[Gray3, ETemp3];

* Aliases used during initialization
RME[Width, DPrefOffset]; * Width of block in bits
RME[DWidth, ICnt];	* Width of destination block in words
RME[SWidth, PrefSrc];	* Width of source block in words
RME[Skew, DstInc];	* Destination-source skew, mod 20
RME[BBMasks, PrefDst];	* LMask and RMask values to be loaded into SHC
RME[BBFunc, SrcInc];	* BitBlt function word
RME[DstX, BBDst];	* Destination starting X in bits, later in words
RME[SrcX, BBSrc];	* Source starting X in bits, later in words
RME[DstY, SPrefOffset];	* Destination starting Y in scan lines
RME[SrcY, SRast];	* Source starting Y in scan lines
RME[BBTemp, SrcWd];	* Must be RVRel 17 because it is an arg to MulSub.

*-----------------------------------------------------------
* Other definitions
*-----------------------------------------------------------

* Base-register assignments

% -- Actually defined in ADefs.mc.  First two must be an even-odd pair.
BR[BBDstBR, ?];		* BitBlt destination base
BR[BBSrcBR, ?];		* BitBlt source base
BR[ScratchBR, ?];	* Scratch (emulator use only)
%


* ALU functions defined by BitBlt.
* The ALUF Ram is loaded by BitBlt with the desired operations.
XALUOP[,BBOp,,17,E];	* A BBOp B -- logical operation invoked with shifter

XAOP[,+/-1,15,E];	* A +/-1 -- A+1 or A-1 depending on horizontal
			* direction.  This value of ALUF is normally
			* A AND NOT B and is restored by BitBlt when done.
			* This means, however, that A AND NOT B cannot be used
			* by other tasks.


* Layout of BBDisp register:
* B0 = 0 if working left-to-right, 1 if right-to-left
* B8-15: BigBDispatch value for setup and body dispatches.
* The following addressing constraints apply:
* (1) B9 = 1 and B15 = 1 if a source bit map is used.
* (2) B14 = 0 if the block(s) must be touched before beginning a transfer.
* (3) B12 = 1 and B13 = 0 if gray is used; B12 = 0 and B13 = 1 otherwise.
* (4) B11 = 1 if the destination is an operand.
* (5) Certain targets are tied together by Call constraints.
* These bits are rather carefully selected to permit the same BBDisp to be
* used in three different dispatches.

Set[SrcFlg, 101];
Set[NoTouchFlg, 2];
Set[NoGrayFlg, 4];
Set[GrayFlg, 10];
Set[DstFlg, 20];

* BBDispX defines the first of two pages used for dispatches on BBDisp.
* BBDispX must refer to an even page.
Set[BBDispX, 1200];
M[BBAt, At[BBDispX, Add[#1, #2]]];

* Page-relative entry points for setup routines.
* All must have B11 = 1, to neutralize body dispatch (DstFlag).
Set[GrayDstSetup&TouchLoc, 30];
Set[GrayDstSetupLoc, 32];
Set[SrcDstSetup&TouchLoc, 125];
Set[SrcDstSetupLoc, 127];	* Must be SrcDstSetup&TouchLoc + 2
Set[SrcGrayDstSetup&TouchLoc, 131];
Set[SrcGrayDstSetupLoc, 133];	* Must be SrcGrayDstSetup&TouchLoc + 2

Set[HorizontalDispLoc, 20];	* Target of setup dispatch

* Page-relative entry points for body routines.
* They must neutralize all bits used for the setup dispatch.
Set[GrayBodyLoc, 16];
Set[GrayDstBodyLoc, 36];
Set[SrcGrayBodyLoc, 117];	* These must be xxxx1111 because they contain
Set[SrcGrayDstBodyLoc, 137];	*  conditional Call instructions

*-----------------------------------------------------------
BitBltSub:	* Entry point
* Preliminaries: read BBTable, decode function
*-----------------------------------------------------------
KnowRBase[AEmRegs];
Subroutine;
	T← Link, StkP-1;
TopLevel;

	Fetch← Stack&+3;		* Fetch Function from BBTable
	Stack&+1← T, MemBase← ScratchBR;

* Set up ScratchBR to point to base of BBTable.
	Gray2← MD, T← VAHi;		* Gray2← function word
	BRHi← T;
	T← VALo;
	BRLo← T;

* Read gray block and save in R-registers.
	Gray0← (Fetch← 17S) AND MD;	* Gray0← function dispatch
	Gray3← MD, T← (Fetch← 16S)-1;
	Gray2← MD, T← (Fetch← T)-1, Q← Gray2; * Q← function word
	Gray1← MD, Fetch← T;

* Dispatch on BitBlt function to compute BBOP and BBDisp.
	Gray0← MD, BigBDispatch← Gray0;
	RBase← RBase[BBRegs], Branch[BBFunctionTable];

*-----------------------------------------------------------
BBFunctionTable: DispTable[20];		* SourceType, Operation
	T← 1C, Branch[BBFC];		* 0, 0	NOT A
	T← 5C, Branch[BBFE];		* 0, 1	NOT A OR B
	T← 15C, Branch[BBFE&T];		* 0, 2	A EQV B
	T← 35C, Branch[BBFE];		* 0, 3	A AND B
	T← 37C, Branch[BBFC];		* 1, 0	A
	T← 27C, Branch[BBFE];		* 1, 1	A OR B
	T← 23C, Branch[BBFE&T];		* 1, 2	A XOR B
	T← 21C, Branch[BBFE];		* 1, 3	NOT A AND B
	T← 1C, Branch[BBFD];		* 2, 0	NOT A
	T← 5C, Branch[BBFF];		* 2, 1	NOT A OR B
	T← 15C, Branch[BBFF&T];		* 2, 2	A EQV B
	T← 35C, Branch[BBFF];		* 2, 3	A AND B
	T← 1C, Branch[BBFA];		* 3, 0	NOT A
	T← 5C, Branch[BBFB];		* 3, 1	NOT A OR B
	T← 15C, Branch[BBFB&T];		* 3, 2	A EQV B
	T← 35C, Branch[BBFB];		* 3, 3	A AND B
*-----------------------------------------------------------

* Select the appropriate dispatch word for the function.
BBFA:	BBDisp← Add[GrayFlg, NoTouchFlg]C, Branch[SetBBF];
BBFB:	BBDisp← Add[GrayFlg, DstFlg, NoTouchFlg]C, Branch[SetBBF];
BBFB&T:	BBDisp← Add[GrayFlg, DstFlg]C, Branch[SetBBF];
BBFC:	BBDisp← Add[SrcFlg, NoGrayFlg, NoTouchFlg]C, Branch[SetBBF];
BBFD:	BBDisp← Add[SrcFlg, GrayFlg, NoTouchFlg]C, Branch[SetBBF];
BBFE:	BBDisp← Add[SrcFlg, NoGrayFlg, DstFlg, NoTouchFlg]C, Branch[SetBBF];
BBFE&T:	BBDisp← Add[SrcFlg, NoGrayFlg, DstFlg]C, Branch[SetBBF];
BBFF:	BBDisp← Add[SrcFlg, GrayFlg, DstFlg, NoTouchFlg]C, Branch[SetBBF];
BBFF&T:	BBDisp← Add[SrcFlg, GrayFlg, DstFlg]C, Branch[SetBBF];

SetBBF:	Stack&+1← ALUFMRW← T, ALUF[17];	* Set ALU function and save old

*-----------------------------------------------------------
* X-coordinate setup: widths, margins, skew, masks, etc.
*-----------------------------------------------------------

* Fetch the X-coordinate information.
	MCount← NOT (Fetch← 6S);	* MCount← small negative, fetch DW
	Width← MD, Fetch← 12S;		* Fetch SLX
	SrcX← MD, T← A0, Fetch← 4S;	* Fetch DLX -- must be fetched last

* Compute LMask and RMask.
* LMask = DLX mod 20, RMask = 17-((DLX+DW-1) mod 20) = (-DLX-DW) mod 20.
* For reference, SHC fields are:
*  B2: SHA=T, B3: SHB=T, B4-7: count, B8-11: RMask, B12-15: LMask
	DstX← MD, T← T-(Width);	* T← -DW
	T← T-(DstX);			* T← -DLX-DW
	BBMasks← DPF[T, 4, 4, MD];	* B8-11← ((-DLX-DW) mod 20) lsh 4,
					* B12-15← DLX mod 20

* Compute destination width in words, including first and last partial words.
* DWidth← (Width + (DLX mod 20) +17) / 20.
	T← (DstX) AND (17C);
	T← (Width)+T;
	T← T+(17C);
	DWidth← RSH[T, 4];

	BBFunc← Q, BBDisp, Branch[.+2, R odd]; * Source block required?
					* BBFunc← full BitBlt function word

* Source block not required.  Set shift count to send R straight thru shifter,
* and handle as left-to-right case.
	Skew← A0, Branch[SetupLtoR];	* SHA=R, SHB=R

* Compute source width in words, including first and last partial words.
* SWidth← (Width + (SLX mod 20) +17) / 20.
	T← (SrcX) AND (17C);
	T← (Width)+T;
	T← T+(17C);
	SWidth← T← RSH[T, 4];

* Set flags to control fetching of "extra" first and last words.
* Except in the "thin" case (DWidth=1), the setup/finish routines for the
* horizontal loops nominally fetch 1 word and store 2 words; an extra fetch
* may be required at the beginning, the end, or both, depending on the number
* of words in the source and destination blocks (see introductory comments).
* DWidth and SWidth differ by at most +/-1.
* If SWidth = DWidth+1, an extra source word must be fetched at both ends.
* If SWidth = DWidth-1 or SWidth = 1, no extra source words need be fetched.
* If SWidth = DWidth # 1, an extra source word must be fetched at one end:
*   if SLX mod 20 > DLX mod 20 then left else right.
* Set BBFlags[0]← extraLeft, BBFlags[1:15]← extraRight.
* (these flags are exchanged later if working right-to-left.)
* T still has SWidth; MD still has DLX.
* The following 2 instructions set BBFlags← -1 if SWidth>DWidth, 0 otherwise.
	PD← (DWidth)-T;
	BBFlags← T-T-1, XorSavedCarry, Branch[SetupSkew, ALU#0];

* Set BBFlags← 100000 if SLX mod 20 > DLX mod 20, 77777 otherwise.
	T← (SrcX) AND (17C);
	BBFlags← (17S) AND MD;		* DLX mod 20
	PD← T-(BBFlags)-1;		* Carry iff SLX mod 20 > DLX mod 20
	BBFlags← 100000C;
	BBFlags← (BBFlags)-1, XorSavedCarry;

* X-coordinate setup (cont'd)

* Compute skew = (SLX-DLX) mod 20, and decide on horizontal direction.
SetupSkew:
	T← (SrcX)-MD;
	Skew← T AND (17C), DblBranch[SetupRtoL, SetupLtoR, ALU<0];

* SLX >= DLX: work from left to right.  Set SHA=R, SHB=T, ALUF[15]="A+1".
* Note: if skew = 0, set SHA=R, SHB=R.
SetupLtoR:
	T← 200C, Branch[NoSrcThinChk, ALU=0]; * ALUFM control for "A+1"
	Skew← (Skew) OR (20C), Branch[SetupALU&ShC]; * SHB=T

* SLX < DLX: work from right to left.
* Set SHA=T, SHB=R, ALUF[15]="A-1".
* Advance starting X coordinates to rightmost ends of blocks.
* Note: if skew = 0, set SHA=R, SHB=R, and do not exchange extra-word flags.
SetupRtoL:
	BBDisp← (BBDisp) OR (100000C), Branch[.+3, ALU=0];
	Skew← (Skew) OR (40C);		* SHA=T
	BBFlags← (BBFlags) LCY 1;	* Exchange source extra-word flags
	T← (Width)-1;
	DstX← (DstX)+T;			* Advance to rightmost X-coordinates
	SrcX← (SrcX)+T;
	T← 36C;				* ALUFM control for "A-1"

* Have ALUFM control in T for "A+/-1" operation.
SetupALU&ShC:
	PD← (SWidth)-1;			* Thin source check (see below)
NoSrcThinChk:
	Stack&-4← ALUFMRW← T, A+/-1,	* Set ALU function, save old value.
		Branch[SetupShC, ALU#0]; * Leave StkP -> TOS (scan line count)

* If we would have fetched an extra source word, but there is only one source
* word to fetch, then reset source extra-word flags and set SHA=R, SHB=R.
	BBFlags← A0, Branch[.+2, R>=0];
	Skew← (Skew) AND (17C);
	T← LSH[BBMasks, 10], Branch[.+2]; * Placement

* Merge skew with masks and load SHC.
SetupShC:
	T← LSH[BBMasks, 10];		* Shift masks to B0-7
	T← LCY[T, Skew, 10];		* Concatenate SHA, SHB, count, masks
	PD← (Width)-1, ShC← T;

* Convert DstX and SrcX to X word displacements relative to start of
* first scan line.  Note: BBDst is the same register as DstX, and
* BBSrc is the same register as SrcX.
	BBDst← RSH[DstX, 4], Branch[.+2, ALU>=0]; * BBDst← DstX/20
	Branch[BitBltDone];		* Width<=0 -- nothing to do
	BBSrc← RSH[SrcX, 4];		* BBSrc← SrcX/20

*-----------------------------------------------------------
* Fetch and set up Y-coordinate information
*-----------------------------------------------------------
	Fetch← 7S;			* Fetch DH
	T← MD, Fetch← 5S;		* Fetch DTY
	DstY← MD, Fetch← 13S;		* Fetch STY -- must be fetched last
	Skew← (DstY)-MD;		* Test and remember vertical direction
* VCount← (scan lines left to do)-1.  Stack has scan lines already done.
	VCount← T-(Stack)-1, Branch[.+2, ALU<0]; * DH - (scan lines done) -1

* DTY >= STY, work from bottom to top.  Start with lowest line not yet done.
	T← VCount, Branch[.+2];

* DTY < STY, work from top to bottom.  Start with highest line not yet done.
	T← Stack;

	DstY← (DstY)+T;			* Compute starting Y coordinates
	SrcY← T+MD;			* MD still has STY

*-----------------------------------------------------------
* Set up destination base address and increments
*-----------------------------------------------------------

* Compute starting destination scan line offset relative to base of bit map.
* DstY has scan line number.
	T← DstY, Fetch← 3S;		* Fetch DBMR
	DRast← MD, T← A0, Q← T, Branch[.+2, ALU=0]; * Don't multiply if DstY=0
	T← DRast, Call[MulSub];		* T,,Q ← Q*T, clobbers BBTemp

* Now have 32-bit offset in T,,Q.  Add it to bit map base address.
**** Program around MicroD problem.  Desired statement is:
****	PD← (BBFunc) AND (20C), DblCall[DFetchBBLong, DFetchBBShort, R<0];
**** but we must actually write:
	PD← (BBFunc) AND (20C), BRGO@[0] RETCL@[3] JCN[44] GPW0@[11400];
		Set[DFetchBBShortLoc, 1420];
****

* Now T,,BBTemp = adjusted base value.
* Note: Skew<0 here iff working top-to-bottom.
	MemBase← BBDstBR, Skew, Branch[DstLoadBR, R<0];

* Processing bottom-to-top.
* Decrease base by 2↑15, and increase initial displacement by 2↑15.
	BBTemp← (BBTemp)-(100000C);
	T← T-1, XorSavedCarry;
	BBDst← (BBDst)+(100000C);
	DRast← (0S)-(DRast);		* DRast← -DBMR

* Now finally ready to load the base register!
DstLoadBR:
	T← DWidth, BRHi← T;		* DWidth for code below
	BRLo← BBTemp;

* Compute destination inter-scan-line word address increment.
* DstInc← DRast + (if left-to-right then -DWidth else DWidth).
* Also compute first PreFetch offset (from BBDst).
* = leftmost word on next scan line, even if processing right-to-left.
* DPrefOffset← if L-to-R then DRast else DRast-DWidth.
* Note that DRast already has DBMR or -DBMR as appropriate.  T = DWidth.
	BBDisp, Branch[DstIncRtoL, R<0]; * Test direction

DstIncLtoR:
	DstInc← (DRast)-T;		* L to R: DRast-DWidth
	DPrefOffset← T← DRast, Branch[.+3];

DstIncRtoL:
	DstInc← (DRast)+T;		* R to L: DRast+DWidth
	DPrefOffset← T← (DRast)-T;

	PrefDst← (BBDst)+T;		* Initial PreFetch address

*-----------------------------------------------------------
* Set up source base address and increments
*-----------------------------------------------------------

* Is a source block required?  If not, skip source setup.
	BBDisp, Branch[BBNoSource, R even];

* If a source-destination overlap exists, set the "touch" flag.
* It is too hard to decide precisely whether an overlap exists,
* but if SBMR=DBMR and STY=DTY, an overlap MAY exist, so we set the flag.
* Note that the flag may already be set as a consequence of operation=XOR.
* (Note that the sense of the flag is complemented: "touch" => NoTouchFlag=0.)
	T← SrcY, MemBase← ScratchBR;
	Fetch← 3S;			* Fetch DBMR
	BBTemp← MD, Fetch← 11S;		* Fetch SBMR
	T← (DstY) XOR T;
	BBTemp← (BBTemp) XOR MD;
	PD← T OR (BBTemp);
	T← SrcY, Branch[.+2, ALU#0];	* Skip if SBMR#DBMR or STY#DTY
	BBDisp← (BBDisp) AND (Not[NoTouchFlg]C), Branch[.-1];

* Compute starting source scan line offset relative to base of bit map.
* T = ALU = SrcY = starting source scan line number.
	SRast← MD, T← A0, Q← T, Branch[.+2, ALU=0]; * Don't multiply if SrcY=0
	T← SRast, Call[MulSub];		* T,,Q ← Q*T, clobbers BBTemp

* Now have 32-bit offset in T,,Q.  Add it to bit map base address.
**** Program around MicroD problem.  Desired statement is:
****	PD← (BBFunc) AND (40C), DblCall[SFetchBBLong, SFetchBBShort, R<0];
**** but we must actually write:
	PD← (BBFunc) AND (40C), BRGO@[0] RETCL@[3] JCN[104] GPW0@[11400];
		Set[SFetchBBShortLoc, 1440];
****

* Now T,,BBTemp = adjusted base value.
* Note: DRast>=0 here iff working top-to-bottom.
	MemBase← BBSrcBR, DRast, Branch[SrcLoadBR, R>=0];

* Processing bottom-to-top.
* Decrease base by 2↑15, and increase initial displacement by 2↑15.
	BBTemp← (BBTemp)-(100000C);
	T← T-1, XorSavedCarry;
	BBSrc← (BBSrc)+(100000C);
	SRast← (0S)-(SRast);		* SRast← -SBMR

* Now finally ready to load the base register!
SrcLoadBR:
	T← SWidth, BRHi← T;		* SWidth for code below
	BRLo← BBTemp;

* Compute source inter-scan-line word address increment.
* SrcInc← SRast + (if left-to-right then -SWidth else SWidth).
* Also compute first PreFetch offset (from BBSrc).
* = leftmost word on next scan line, even if processing right-to-left.
* SPrefOffset← if L-to-R then SRast else SRast-SWidth.
* Note that SRast already has SBMR or -SBMR as appropriate.  T = SWidth.
	BBDisp, Branch[SrcIncRtoL, R<0]; * Test direction

SrcIncLtoR:
	SrcInc← (SRast)-T;		* L to R: SRast-SWidth
	SPrefOffset← T← SRast, Branch[.+3];

SrcIncRtoL:
	SrcInc← (SRast)+T;		* R to L: SRast+SWidth
	SPrefOffset← T← (SRast)-T;

	PrefSrc← (BBSrc)+T;		* Initial PreFetch address

*-----------------------------------------------------------
* Final adjustments prior to entering vertical loop
*-----------------------------------------------------------

* Compute ICnt, the initial value of the Cnt register for each loop.
* ICnt← if DWidth <= 22B then DWidth-1 else NOT (DWidth-3) [= 2-DWidth].
* MCount starts out negative (was initialized long ago).
BBNoSource:
	PD← (DWidth)-(23C);
	ICnt← T← (DWidth)-1, Branch[.+2, ALU<0];
	ICnt← (1S)-T;			* = (2S)-(DWidth)

* Never need to "touch" data in the "thin" case.
* ALU=0 here iff the "thin" case applies (DWidth=1).
	Stack← (Stack)-1, Branch[.+2, ALU#0];
	BBDisp← (BBDisp) OR (Add[NoTouchFlg]C);

* Enter vertical loop with (scan lines done)-1 on TOS and ALU>=0.
	T← A0;
	PD← Q← T, Branch[BBVerticalLoop]; * Init gray value to 0


*-----------------------------------------------------------
* xFetchBBShort:
* Fetch BitBlt short pointer for destination or source
* Entry: ALU#0 iff alternate bank bit set
*	T,,Q = 32-bit displacement
* Exit: T,,BBTemp = adjusted base address
*-----------------------------------------------------------
Subroutine;

**** Note: flush absolute placement when MicroD is fixed ****
DFetchBBShort: At[DFetchBBShortLoc],	* Fetch DBMR
	BBTemp← Q, Fetch← 2S, DblBranch[BBNormal, BBAlternate, ALU=0];
SFetchBBShort: At[SFetchBBShortLoc],	* Fetch SBMR
	BBTemp← Q, Fetch← 10S, DblBranch[BBNormal, BBAlternate, ALU=0];

BBNormal:
	BBTemp← (BBTemp)+MD, RBase← RBase[AEmRegs];
	T← T+(EmuBRHiReg), XorSavedCarry, Branch[FetchBBRet];

KnowRBase[BBRegs];
BBAlternate:
	BBTemp← (BBTemp)+MD, RBase← RBase[AEmRegs];
	T← T+(EmuXMBRHiReg), XorSavedCarry;
FetchBBRet:
	RBase← RBase[BBRegs], Return;


*-----------------------------------------------------------
* xFetchBBLong:
* Fetch BitBlt long pointer for destination or source
* Entry: T,,Q = 32-bit displacement
* Exit: T,,BBTemp = adjusted base address
*-----------------------------------------------------------
Subroutine;

**** Note: flush absolute placement when MicroD is fixed ****
DFetchBBLong: At[DFetchBBShortLoc, 1],
	BBTemp← 22C, Branch[.+2];	* Destination long pointer low word
SFetchBBLong: At[SFetchBBShortLoc, 1],
	BBTemp← 20C;			* Source long pointer low word
	BBTemp← (Fetch← BBTemp)+1;
	BBTemp← MD, Fetch← BBTemp;
	BBTemp← (BBTemp)+Q;
	T← T+MD, XorSavedCarry, Return;

TopLevel;

*-----------------------------------------------------------
* BitBlt vertical loop (per-scan-line)
* At top of loop, the following invariants hold:
*	VCount = (number of scan lines remaining)-1
*	Stack = (number of scan lines done)-1
*	MemBase = BBDstBR in destination-only cases, BBSrcBR otherwise
*	ALU<0 iff destination BR displacement has overflowed
* Vertical loop overhead:
*	 6 cycles for loop control and destination pointer update
*	+4 cycles for source pointer update if source block is used
*	+4 cycles if block is greater than 20B words wide
*-----------------------------------------------------------

BBVerticalLoop:
	VCount← (VCount)-1, Branch[BBDoneOrOverflow, ALU<0, R<0];

* If positive, ICnt has the desired initial value of Cnt (DWidth-1).
	T← NOT (Cnt← ICnt), Branch[SmallBlock, R>=0];

* Block greater than one munch wide.  Set up separate munch and word counts.
* T now has DWidth-3, where DWidth is the number of words in the destination
* block, including first and last partial words.
* MCount ← (DWidth-3)/20 -1, Cnt ← (DWidth-3) mod 20 +2.
	MCount← RSH[T, 4];
	T← T AND (17C);
	T← T+(2C);
	MCount← (MCount)-1, Cnt← T;

* This dispatch goes to one of: GrayDstSetup, GrayDstSetup&Touch,
* SrcDstSetup, SrcDstSetup&Touch, SrcGrayDstSetup, or SrcGrayDstSetup&Touch.
SmallBlock:
	BigBDispatch← BBDisp;
	Stack← (Stack)+1, Branch[HorizontalDisp]; * Advance scan lines done


*-----------------------------------------------------------
AdvanceSrcDst:
* Control returns here at the end of individual horizontal loops that
* involve both source and destination blocks.
* BBSrc, BBDst point one beyond last word processed.
*-----------------------------------------------------------
	T← SrcInc;
	BBSrc← T← (BBSrc)+T;
	PrefSrc← T+(SPrefOffset);
	T← BBDst, Branch[SrcBROverflow, ALU<0];

*-----------------------------------------------------------
AdvanceDst:
* Control returns here at the end of individual horizontal loops that
* involve only a destination block.
* BBDst and T point one beyond last word processed.
*-----------------------------------------------------------
	BBDst← T← T+(DstInc);
	PrefDst← T← T+(DPrefOffset),
		DblBranch[BBReschedPending, BBVerticalLoop, Reschedule];

*-----------------------------------------------------------
* Reschedule pending.  See if interrupt is really being requested,
* and if so, terminate processing and return +1 from BitBltSub.
* The calling emulator will process the interrupt and then
* call BitBltSub again.
* T = PrefDst here.
*-----------------------------------------------------------
BBReschedPending:
	VCount, RBase← RBase[NWW], Branch[.+2, R>=0];

* We just processed the last scan line, so return normally.
	Branch[BitBltDone];

* Test NWW to see whether an interrupt is really pending.
	PD← NWW, NoReschedule;
	Branch[.+2, ALU>0];

* No interrupt pending.  Restore RBase and ALU and continue vertical loop.
	PD← T, RBase← RBase[BBRegs], Branch[BBVerticalLoop];

* Interrupt really pending.  Restore clobbered ALUFM locations before
* taking interrupt.  Note: scan line count at TOS is one behind, so fix it.
BBInitiateInterrupt:
	T← Stack&+3← (Stack&+3)+1, RescheduleNow, * Force immediate trap
		Branch[BitBltDone1];


*-----------------------------------------------------------
* Either VCount is exhausted or one of the BR displacements overflowed
* (or conceivably both events occurred at the same time).
* If VCount is exhausted then return normally; otherwise restart BitBlt.
* The easiest way to restart is to pretend an interrupt occurred.
* Note: scan line count at TOS is one behind, and VCount is over-decremented.
*-----------------------------------------------------------
BBDoneOrOverflow:
	VCount← (VCount)+1;
SrcBROverflow:
	VCount, Branch[BitBltDone, R<0];
	RBase← RBase[AEmRegs], Branch[BBInitiateInterrupt];


*-----------------------------------------------------------
* Really done.  Set TOS=0, restore clobbered ALUFM, and return.
*-----------------------------------------------------------
BitBltDone:
	T← Stack&+3← A0, RBase← RBase[AEmRegs];
BitBltDone1:
	ALUFMRW← Stack&+1, ALUF[17];	* Restore ALUF[17]
	ALUFMRW← Stack&-2, A+/-1;	* Restore ALUF borrowed for A+/-1
	PD← T, Link← Stack&-2;
Subroutine;
	Return[ALU=0];			* Return +2 if done, +1 otherwise
TopLevel;

*-----------------------------------------------------------
* Horizontal loops
*-----------------------------------------------------------

%
Organization of horizontal (per-word) loops:

There are a number of variations, depending on the following:
	1. Whether or not a source block is used;
	2. Whether or not a gray block is used;
	3. Whether or not the destination is an operand;
	4. Whether or not the destination is "thin" (one word per scan line);
	5. Whether or not data need be "touched" before doing the transfer.

There are fewer than 32 total cases because some of these combinations
cannot occur (for example, no source block and no gray block).
Each case has a "setup" routine, a "body" routine, and a "finish" routine.
Many of these routines are shared among cases, and the flow of control
is determined by a complicated network of dispatches on BBDisp.
Finish routines exit to the vertical loop by a branch to AdvanceDst or
AdvanceSrcDst.

It is never necessary to "touch" if any of the following is true:
	a. The destination is "thin";
	b. The destination is not an operand of the BitBlt function;
	c. The function can be performed multiple times with the same effect
	   as performing it only once (i.e., anything except XOR,
	   assuming the source and destination blocks don't overlap).
The "touch" and "no touch" cases are distinguished as part of the dispatch.

Q is used exclusively to hold the complement of the gray value for the
current scan line, or zero if gray is not used.

To reduce miss wait and increase performance, while processing each
scan line we fall out of the main loop once per munch and PreFetch
one munch for the next scan line.  This strategy depends on the
assumption that each scan line is less than 100 munches long (for a 100-row
cache, which is what the Dorado has at present).  Note that PreFetches
are done left-to-right even if transfers are done right-to-left.

Note that the current implementation is imperfect in that the last munch
of the next scan line may not be prefetched, or an extra munch prefetched
unnecessarily, because the main loop does not terminate at munch boundaries
but rather at multiples of 20 words from the end of the scan line.
Also, the first scan line is not prefetched, and a line past the end of
the last scan line is prefetched unnecessarily.  These defects should not
affect performance noticeably in normal use.
%

TopLevel;
KnowRBase[BBRegs];

%*-----------------------------------------------------------
Case A:  dest ← gray

ShC = R straight through.
BBOp = NOT A.

Entry point to this code is at GrayDestSetup.
It dispatches here after setup.

Timing, per scan line, including vertical loop overhead:
	20 cycles minimum in the normal case
	+1 cycle per full word (excluding first and last partial words)
	+4 + (2 * # of munches) cycles if block is wider than 20B words

	13 cycles total in the "thin" case
%*-----------------------------------------------------------

* T has first partial word to be stored, and SrcWd has (uncomplemented)
* gray word.
GrayBody: BBAt[GrayBodyLoc],
	Q← SrcWd;
	BBDst← (Store← BBDst)+/-1, DBuf← T, Branch[GrayEnd, Cnt=0&-1];

GrayLoop:
	BBDst← (Store← BBDst)+/-1, DBuf← Q, Branch[GrayLoop, Cnt#0&-1];

GrayEnd:
	MCount← (MCount)-1, Cnt← 17S, Branch[GrayLast, R<0];
	PrefDst← (PreFetch← PrefDst), Carry20, Branch[GrayLoop];

GrayLast:
	Fetch← BBDst, Branch[DstFinish];

%*-----------------------------------------------------------
Case B:  dest ← f(gray, dest)

ShC = R straight through.
BBOp is taken from the following table:

Operation		SourceType	BBOP
source OR dest		gray		NOT A OR B
source XOR dest		gray		A EQV B
NOT source AND dest	gray		A AND B

The gray word is kept in SrcWd and put directly into the shifter.

Case A also dispatches to one of case B's entry points, GrayDstSetup.
The setup code dispatches to the correct body routine.

Timing, per scan line, including vertical loop overhead:
	18 cycles minimum in the normal case
	+3 cycles per full word (excluding first and last partial words)
	+6 + (2 * # of pages) cycles if data needs to be "touched"
	+4 + (2 * # of munches) cycles if block is wider than 20B words

	13 cycles total in the "thin" case
%*-----------------------------------------------------------

GrayDstSetup&Touch: BBAt[GrayDstSetup&TouchLoc],
	T← DRast, Call[TouchDst];
	BDispatch← VCount, Branch[.+2];	* Duplicate for placement
GrayDstSetup: BBAt[GrayDstSetupLoc],
	BDispatch← VCount;
	RBase← RBase[Gray0], Call[GetGray];
KnowRBase[BBRegs];
	SrcWd← NOT T, Fetch← BBDst;

* Label the target for the horizontal dispatch.  It is never actually reached
* by the dispatch, since the branch address is modified by BigBDispatch.
* The instruction so labelled should be on the same page as the actual
* targets, and it should not otherwise be constrained.
HorizontalDisp: BBAt[HorizontalDispLoc],
	PrefDst← (PreFetch← PrefDst), Carry20, Branch[DstThin, Cnt=0&-1];
* The following dispatch may modify GrayBody to GrayDstBody.
	BigBDispatch← BBDisp, Branch[.+2, R<0];
	T← XShMDLMask[SrcWd], B← MD, Branch[GrayBody];
	T← XShMDRMask[SrcWd], B← MD, Branch[GrayBody];

* Body routine for case B only.
GrayDstBody: BBAt[GrayDstBodyLoc],
	BBDst← T← (Store← BBDst)+/-1, DBuf← T;
	BBSrc← (Fetch← T)+/-1, Branch[GrayDstEnd, Cnt=0&-1];

* Inner loop runs with one word fetched ahead (now in MD).
* BBSrc runs one word ahead of BBDst.
GrayDstLoop:
	BBSrc← (Fetch← BBSrc)+/-1, T← MD;
	T← XShiftNoMask[SrcWd], B← T;
	BBDst← (Store← BBDst)+/-1, DBuf← T, Branch[GrayDstLoop, Cnt#0&-1];

GrayDstEnd:
	MCount← (MCount)-1, Cnt← 17S, Branch[DstFinish, R<0];
	PrefDst← (PreFetch← PrefDst), Carry20, Branch[GrayDstLoop];

* Store last partial word.  This is the tail of case A also.
* Have already fetched the last word (at BBDst).
DstFinish:
	BBDisp, Branch[.+2, R<0];
	T← XShMDRMask[SrcWd], B← MD, Branch[StoreLastDst];
	T← XShMDLMask[SrcWd], B← MD, Branch[StoreLastDst];

* Thin destination slice, for cases A and B.
DstThin:
	T← XShMDBothMasks[SrcWd], B← MD;
StoreLastDst:
	BBDst← T← (Store← BBDst)+/-1, DBuf← T, Branch[AdvanceDst];

%*-----------------------------------------------------------
Case C:  dest ← f(source)
Case D:  dest ← f(source, gray)

BBOp is taken from the following table:

Operation		SourceType		BBOP
source			source			NOT A		(Q = 0)
source			NOT source		A		(Q = 0)
source			source AND gray		NOT A		(Q = NOT gray)

Q is ORed with shifter output on the AMux.

Entry points to this code are at SrcDstSetup, SrcDstSetup&Touch,
SrcGrayDstSetup, and SrcGrayDstSetup&Touch.  They dispatch here after setup.

Timing, per scan line, including vertical loop overhead:
	25 cycles minimum in the normal case
	+4 cycles per full word (excluding first and last partial words)
	+12 + (4 * # of pages) cycles if data needs to be "touched"
	+1 cycle if 2 source words are required for the first destination word
	+3 cycles if gray is required
	+4 + (5 * # of munches) cycles if block is wider than 20B words

	17 cycles minimum in the "thin" case
	+1 cycle if 2 source words are required
	+3 cycles if gray is required
%*-----------------------------------------------------------


* Body routine for cases C and D only.
SrcGrayBody: BBAt[SrcGrayBodyLoc],
	T← BBDst← (Store← BBDst)+/-1, DBuf← T, FlipMemBase,
**** Program around MicroD problem.  Desired branch clause is:
****		Call[SrcGrayEnd, Cnt=0&-1];
**** but we must actually write:
		BRGO@[0] RETCL@[2] JCN[43];
	BBSrc← (Fetch← BBSrc)+/-1, FlipMemBase,
		BBAt[SrcGrayBodyLoc, 1]; **** MicroD problem
	SrcWd← MD, T← SrcWd;
	T← XShiftNoMask[SrcWd], A← Q, Branch[SrcGrayBody];

* This is called as a subroutine at the end of each munch.
* It either exits the horizontal loop or returns to do another munch.
SrcGrayEnd:
Subroutine;
	MCount← (MCount)-1, Cnt← 17S,
		DblBranch[SrcDstFinish, SrcDstMore, R<0],
		BBAt[SrcGrayBodyLoc, 2]; **** MicroD problem
TopLevel;

%*-----------------------------------------------------------
Case E:  dest ← f(source, dest)
Case F:  dest ← f(source, gray, dest)

BBOp is taken from the following table:

Operation		SourceType		BBOP
source OR dest		source			NOT A OR B	(Q = 0)
source XOR dest		source			A EQV B		(Q = 0)
NOT source AND dest	source			A AND B		(Q = 0)
source OR dest		NOT source		A OR B		(Q = 0)
source XOR dest		NOT source		A XOR B		(Q = 0)
NOT source AND dest	NOT source		NOT A AND B	(Q = 0)
source OR dest		source AND gray		NOT A OR B	(Q = NOT gray)
source XOR dest		source AND gray		A EQV B		(Q = NOT gray)
NOT source AND dest	source AND gray		A AND B		(Q = NOT gray)

Q is ORed with shifter output on the AMux.

Cases C and D also dispatch these entry points.
The setup code dispatches to the correct body routines.

Timing, per scan line, including vertical loop overhead:
	 25 cycles minimum in the normal case
	+ 5 cycles per full word (excluding first and last partial words)
	+12 + (4 * # of pages) cycles if data needs to be "touched"
	+ 1 cycle if 2 source words required for the first destination word
	+ 3 cycles if gray is required
	+ 4 + (5 * # of munches) cycles if block is wider than 20B words

	 17 cycles minimum in the "thin" case
	+ 1 cycle if 2 source words are required
	+ 3 cycles if gray is required
%*-----------------------------------------------------------

SrcGrayDstSetup&Touch: BBAt[SrcGrayDstSetup&TouchLoc],
	T← DRast, FlipMemBase, Call[TouchDst];
	T← SRast, FlipMemBase, Call[TouchSrc];
SrcGrayDstSetup: BBAt[SrcGrayDstSetupLoc],
	BDispatch← VCount;
	RBase← RBase[Gray0], Call[GetGray];
KnowRBase[BBRegs];
	Q← T, BBFlags, DblBranch[SrcDstSetup2, SrcDstSetup1, R<0];

SrcDstSetup&Touch: BBAt[SrcDstSetup&TouchLoc],
	T← DRast, FlipMemBase, Call[TouchDst];
	T← SRast, FlipMemBase, Call[TouchSrc];
SrcDstSetup: BBAt[SrcDstSetupLoc],
	BBFlags, DblBranch[SrcDstSetup2, SrcDstSetup1, R<0];

SrcDstSetup2:
	BBSrc← (Fetch← BBSrc)+/-1;
SrcDstSetup1:
	PrefSrc← (PreFetch← PrefSrc), Carry20;
	T← MD, BBSrc← (Fetch← BBSrc)+/-1, FlipMemBase;
	SrcWd← MD, Fetch← BBDst;
	PrefDst← (PreFetch← PrefDst), Carry20, Branch[SrcDstThin, Cnt=0&-1];
* The following dispatch may modify SrcGrayBody to SrcGrayDstBody.
	BigBDispatch← BBDisp, Branch[.+2, R<0];
	T← XShMDLMask[SrcWd], A← Q, B← MD, Branch[SrcGrayBody];
	T← XShMDRMask[SrcWd], A← Q, B← MD, Branch[SrcGrayBody];

* Cases E and F (cont'd)

* Body routine for cases E and F only.
SrcGrayDstBody: BBAt[SrcGrayDstBodyLoc],
	T← BBDst← (Store← BBDst)+/-1, DBuf← T, FlipMemBase,
**** Program around MicroD problem.  Desired branch clause is:
****		Call[SrcGrayDstEnd, Cnt=0&-1];
**** but we must actually write:
		BRGO@[0] RETCL@[2] JCN[103];
	BBSrc← (Fetch← BBSrc)+/-1, FlipMemBase,
		BBAt[SrcGrayDstBodyLoc, 1]; **** MicroD problem
	SrcWd← MD, T← SrcWd, Fetch← T;
	T← XShiftNoMask[SrcWd], A← Q, B← MD, Branch[SrcGrayDstBody];

* This is called as a subroutine at the end of each munch.
* It either exits the horizontal loop or returns to do another munch.
* Note: code following SrcGrayDstMore is used by cases C and D also.
SrcGrayDstEnd:
Subroutine;
	MCount← (MCount)-1, Cnt← 17S,
		DblBranch[SrcDstFinish, SrcDstMore, R<0],
		BBAt[SrcGrayDstBodyLoc, 2]; **** MicroD problem
SrcDstMore:
	PrefSrc← (PreFetch← PrefSrc), Carry20;
	FlipMemBase;
	PrefDst← (PreFetch← PrefDst), Carry20;
	FlipMemBase, Return;
TopLevel;

* Store last partial word, for cases C, D, E, and F.
SrcDstFinish:
	BBFlags, Branch[.+2, R even];	* Fetch extra word at end?
	BBSrc← (Fetch← BBSrc)+/-1, FlipMemBase, Branch[.+2];
	FlipMemBase;
	SrcWd← MD, T← SrcWd, Fetch← T;
	BBDisp, Branch[.+2, R<0];
	T← XShMDRMask[SrcWd], A← Q, B← MD, Branch[StoreLastSrcDst];
	T← XShMDLMask[SrcWd], A← Q, B← MD, Branch[StoreLastSrcDst];

* Thin destination slice, for cases C, D, E, and F.
SrcDstThin:
	T← XShMDBothMasks[SrcWd], A← Q, B← MD;
StoreLastSrcDst:
	BBDst← (Store← BBDst)+/-1, DBuf← T, FlipMemBase,
		Branch[AdvanceSrcDst];

*-----------------------------------------------------------
* Other subroutines
*-----------------------------------------------------------

*-----------------------------------------------------------
GetGray:
* Get the gray word for the current scan line.
* Calling sequence:
*	BDispatch← VCount;
*	RBase← RBase[Gray0], Call[GetGray];
* Exit:	T = complement of gray word
*	RBase = BBRegs
* Note: returns Gray(n mod 4), where n is the number of scan lines remaining
* after the current scan line.  Note that VCount = n-1.
*-----------------------------------------------------------
Subroutine;
KnowRBase[Gray0];
	T← NOT (Gray1), RBase← RBase[BBRegs], Return, DispTable[10], Global;
KnowRBase[Gray0];
	T← NOT (Gray2), RBase← RBase[BBRegs], Return;
KnowRBase[Gray0];
	T← NOT (Gray3), RBase← RBase[BBRegs], Return;
KnowRBase[Gray0];
	T← NOT (Gray0), RBase← RBase[BBRegs], Return;
KnowRBase[Gray0];
	T← NOT (Gray1), RBase← RBase[BBRegs], Return;
KnowRBase[Gray0];
	T← NOT (Gray2), RBase← RBase[BBRegs], Return;
KnowRBase[Gray0];
	T← NOT (Gray3), RBase← RBase[BBRegs], Return;
KnowRBase[Gray0];
	T← NOT (Gray0), RBase← RBase[BBRegs], Return;

*-----------------------------------------------------------
TouchSrc:
* Touches every page of the source item for this scan line.
* Entry: BBSrc = address of first word of source item
*	T = SRast = words per source scan line; negative => bottom-to-top
*	SrcInc = distance between last word of one scan line and
*		first word of the next; negative => bottom-to-top
*	MemBase = BBSrcBR
* Exit: T and BBTemp clobbered
* Method: touch first and last words of the item.  If the item is >401B
* words long, also touch intermediate words at multiples of 400B words from
* the last word of the item.
* Note: |SRast-SrcInc| gives the number of words touched in the item;
* the value is positive if moving left-to-right, negative if right-to-left.
* Timing: 5 cycles for first page or less
*	 +1 cycle if right-to-left
*	 +2 cycles per additional page
*-----------------------------------------------------------
Subroutine;

	T← T-(SrcInc)-1;
	BBTemp← (Fetch← BBSrc)+T, DblBranch[TouchRtoL, TouchLtoR, ALU<0];

*-----------------------------------------------------------
TouchDst:
* Touches every page of the destination item for this scan line.
* Entry: BBDst = address of first word of destination item
*	T = DRast = words per destination scan line; negative => bottom-to-top
*	DstInc = distance between last word of one scan line and
*		first word of the next; negative => bottom-to-top
*	MemBase = BBDstBR
* Exit: T and BBTemp clobbered
* Timing: 5 cycles for first page or less
*	 +1 cycle if right-to-left
*	 +2 cycles per additional page
*-----------------------------------------------------------
Subroutine;

	T← T-(DstInc)-1;
	BBTemp← (Fetch← BBDst)+T, DblBranch[TouchRtoL, TouchLtoR, ALU<0];

TouchLtoR:
* T has (# words in source item)-1, BBTemp points to last (rightmost) word.
	T← T-(400C)-1;
	BBTemp← (Fetch← BBTemp)-(400C), Branch[.+2, ALU<0];
	T← T-(400C), Branch[.-1];
	Return;

TouchRtoL:
* T has -(# words in source item)-1, BBTemp points to last (leftmost) word -2.
	BBTemp← (BBTemp)+(2C);		* Fix the off-by-2 address
	T← T+(400C)+1;			* Should be T+402, but no matter
	BBTemp← (Fetch← BBTemp)+(400C), Branch[.+2, ALU>=0];
	T← T+(400C), Branch[.-1];
	Return;