[Openmcl-devel] Re: persistence of xref info in fasl files.

Sat Jan 3 14:53:48 PST 2004

On Sat, 3 Jan 2004, Alan Ruttenberg wrote:

> Here's an implementation. Gary, I don't really know which what the best
> place to put the hooks, so I've used advise where it seemed
> appropriate. Please feel free to fix or tell me how to.
>
> I came across one issue when doing this.  If you have (defun foo ()
> (flet ((bar ())) *baz*))
> then you get a recorded xref from bar to *baz*. I think this should be
> foo to *baz* since bar isn't global. If you disagree with that policy
> the code needs to be reworked a bit.
>

I can't remember whether I sent mail about this before the holidays,
or who I would have sent it to if I did.

It does seem to me that "the function itself" is a good place to keep
information about what things a function references; among other things,
that helps to ensure that the information gets GCed when the function
does.

Depending on what you mean when you ask "what does the function X
reference ?", some or most of that information's already there. A
function in OpenMCL's a "uvector" (the name once meant "uniform"
or "universal vector"; no one remembers); you can find the number
of elements in a uvector with CCL:UVSIZE, and access those elements
with CCL:UVREF.  (UVREF's SETFable; think of the fun to be had
setting the elements of a function.)

? (defun show-uvector (u)
  (dotimes (i (ccl:uvsize u))
    (format t "~&~d : ~s" i (ccl:uvref u i))))
SHOW-UVECTOR
? (show-uvector #'show-uvector)
0 : #<CODE-VECTOR #x5438B86>
1 : "~&~d : ~s"
2 : FORMAT
3 : (FUNCTION-SYMBOL-MAP (#(I #:G130 U) . #(31 48 120 575 44 120 63 20 120)))
4 : SHOW-UVECTOR
5 : 8388864
NIL

If you're interested in this sort of thing, you might find it more
interesting to look at a broader sample of functions.  The general
rules are:

a) element 0 is always a "code-vector": a sequence of machine instructions.
b) the last element is always a fixnum; bits in this fixnum describe
   various attributes of the function.
c) depending on whether one of the bits in (b) is set or not, the next-
   to-last element is the function's "name"; this is mostly for debugging,
   but there are some things (the macroexpansion of DEFUN, for instance)
   that attach greater significance to it.
d) depending on the setting of another bit in (b), the next-to-next-to-last
   element is a plist whose entries contain debugging information.
e) All other elements are things ("constants" or "immediates") that're
   referenced by the function.

Rules a-d apply to #'SHOW-UVECTOR, which means that elements 1 and 2
are just "constants that the function references": the format string
and the symbol FORMAT, in this case.

Looking at larger functions would give you a different impression,
but let's look at another small one:
.
? (defun bind-package-and-return-nil ()
    (let* ((*package* *package*))
      nil))
BIND-PACKAGE-AND-RETURN-NIL
? (show-uvector #'bind-package-and-return-nil)
0 : #<CODE-VECTOR #x5437666>
1 : #<SVAR *PACKAGE* 41 #x5035456>
2 : (FUNCTION-SYMBOL-MAP NIL)
3 : BIND-PACKAGE-AND-RETURN-NIL
4 : 8388608

Again, rules a-d apply, and this function only references a single
constant (element #1); that's an object of type CCL::SVAR that refers
to the special variable *PACKAGE*.  (SVARs are used in the per-thread
special-binding and reference code in recent versions of OpenMCL; in
older versions, this constant would have just been the symbol *PACKAGE*.)

>From this little sample, we can observe:
1)  It's fairly easy to find the constants that a compiled function
    (might) actually need to reference.
2)  There's currently no information retained about things that're
    referenced at compile-time (macros) or about things that're
    inlined, strength-reduced, dead-code-eliminated, source-transformed,
    or otherwise appeared in the source code but aren't referenced in
    the object code.  (UVREF, UVSIZE, DOTIMES in #'SHOW-UVECTOR).
3)  There's no metainformation that tells us in any detail -how- the
    constants are being used: a function like:

(defun cons-em-up ()
  (cons "~&~d : ~s" 'FORMAT))

makes random references to the same constants that #'SHOW-UVECTOR does,
and we wouldn't want an XREF utility to say that CONS-EM-UP calls FORMAT.

Let's assume that problem (2) is easy to solve, and that the compiler
frontend kept track of that fact that DOTIMES, UVREF, and UVSIZE were
referenced in the code and the backend just emitted those symbols as
constants, so #'SHOW-UVECTOR might look like:

0: #<code vector>
1: "~&~d : ~s"
2: FORMAT
3: DOTIMES
4: CCL:UVSIZE
5: CCL:UVREF
6: (FUNCTION-SYMBOL-MAP ...)
7: SHOW-UVECTOR
8: 8388864

(It might take a while to come up with a reasonable policy here:
is DOTIMES interesting ?  It might have macroexpanded into "calls"
to 1+, or =.  Are those interesting ? Etc.)  Once we've decided what's
interesting, it's not too hard to ensure that everything that was
interesting to the frontend shows up in the backend.

If we agree that problem 2's solvable, we can look at problem (3).
There seem to be a relatively finite number of interesting ways
in which a constant can be referenced.

0) as a global function name.  The last time that I checked, slightly
   over 50% of all function constants in MCL fell into this category;
   that was several years ago, but I'd be surprised if the percentage
   changes that often.
1) as an interesting macro name.
2) as a random argument to a function, return value, etc.
3) as a functonal argument to a function known to take functional
   arguments.
4) as a function name that only appeared in the sourcce
5) in a special variable reference operation.
6) in a special variable assignment operation.
7) in a special variable binding operation.

I may be missing something, but it seems pretty likely that
we can encode the interesting cases in an 8 bit byte.  (a given
constant might have more than one use, as in:

(defun foo ()
  (format t "~s is a programming language in its own right." 'format))

.) A special variable (or a CCL::SVAR denoting one) might have 3 or
even more of these attributes; most constants in a given function would
have one or two.

For the five constants (elements 1-5) in the revised #'SHOW-UVECTOR as
they're used there, we'd have:

1: "~&~d : ~s"		; (ash 1 2)
2: FORMAT		; (ash 1 0)
3: DOTIMES		; (ash 1 1)
4: CCL:UVSIZE		; (ash 1 4)
5: CCL:UVREF		; (ash 1 4)

using bit numbers corresponding to the list above. If the next-to-next-
to-last element of the function (the debugging-info-plist, or CCL::%LFUN-INFO)
contained an entry of the form:

(xref-constant-information-map #<5-element vector of UNSIGNED-BYTE 8>)

I -think- that we'd be able to say that the data structure that contains
the XREF information for a function is the function, and I think that
there'd be some advantages to not having to maintain separate data
structures and keep them in synch.  For instance:

(defun x (arg) (foo (1+ arg)))
(setf (symbol-function 'y) #'x)
(setf (symbol-function 'x) #'(lambda ()))
(who-calls 'foo)

There are some possible advantages to using a separate data structure
as Alan's code does as well.  I'm not 100% convinced that the idea I'm
proposing is entirely better, but I think that it's worth thinking about
further.  It has always sort of struck me that (incomplete and imprecise)
XREF info's already there and there's a lot of it, and it seems more
attractive to make it more complete and precise than to duplicate it.)

> The bit about (add-pre-xref-xrefs) is for bootstrapping.
>
> BTW, what does "indirect calls" mean?

Maybe that bit 3 is set in the corresponding xref-constant-information-map ?

>
> -Alan
>
>