[Openmcl-devel] process-run-function and mach ports usage
Gary Byers
gb at clozure.com
Wed Feb 23 03:22:16 PST 2011
I was going to write a longer reply, but I'd already spent a few hours
looking into this today (seeing some of the same things that you saw
and reaching some different conclusions), and this is all starting to
seem too much like the sort of thing where it's said that "if you don't
stop doing it, you'll go blind."
Apple's documentation for mach_thread_self() (see
<http://www.opensource.apple.com/source/xnu/xnu-1456.1.26/osfmk/man/mach_thread_self.html?txt>)
doesn't seem to explicitly state that that function increments the
port's send reference count (though it certainly seems to.) The GNU
Hurd documentation does state this, but warns the user that
mach_thread_self() allows the reference count to silently wrap around
(wrap around 2^32 ? what ?) without error and that the reference count
shouldn't be decremented if this overrun might have happened.
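One way to check the "certainly seems to" part empirically is to read the
send right's user-reference count before and after an extra
mach_thread_self() call. This is a minimal sketch, assuming that
mach_port_get_refs() reports the count that matters here. The function
name urefs_added_by_thread_self is made up for illustration, and the
non-Apple branch is only a stub so the file compiles elsewhere:

```c
#if defined(__APPLE__)
#include <mach/mach.h>
#endif

/* Returns how many send-right user references one extra
   mach_thread_self() call added to the calling thread's kernel port,
   or -1 if that can't be measured (query failed, or not a Mach system). */
int urefs_added_by_thread_self(void)
{
#if defined(__APPLE__)
    mach_port_t task = mach_task_self();
    mach_port_t self = mach_thread_self();      /* get a name to query */
    mach_port_urefs_t before = 0, after = 0;

    if (mach_port_get_refs(task, self, MACH_PORT_RIGHT_SEND, &before)
        != KERN_SUCCESS) {
        mach_port_deallocate(task, self);
        return -1;
    }

    mach_port_t again = mach_thread_self();     /* the call under test */
    kern_return_t kr =
        mach_port_get_refs(task, again, MACH_PORT_RIGHT_SEND, &after);

    /* Balance both mach_thread_self() calls made above. */
    mach_port_deallocate(task, again);
    mach_port_deallocate(task, self);

    return (kr == KERN_SUCCESS) ? (int)(after - before) : -1;
#else
    return -1;                                  /* stub: no Mach ports here */
#endif
}
```

If the count does go up by one per call, the two mach_port_deallocate()
calls at the end are what keep this function itself from leaking.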
This stuff escaped from a lab somewhere; the fact that it's been
lurching around the countryside frightening villagers as long as it
has is pretty sad.
I don't care what a thread's Mach port's send right's reference count
is. I do care that thread creation doesn't leak kernel resources, and
because of this nonsense it does.
(defun foo ()
  (let* ((s (make-semaphore)))
    (dotimes (i 1000000)
      (process-run-function "test" (lambda () (signal-semaphore s)))
      (wait-on-semaphore s)
      (when (eql 0 (mod i 1000)) (print i)))))
This gets noticeably slower (the average and worst-case intervals
between PRINT calls increase) as the number of leaked ports increases.
I found that it wasn't too bad up to ~100K; it was pretty bad by the
time it got to ~500K. (The mechanism that 'top' uses to count ports
in the task - mach_port_names() - refuses to return values > ~128K.)
When this degrades, it seems that a lot of time is spent in Mach,
effectively growing a hash table and rehashing a bunch of dead port
names (and a few live ones) into it. You're right that this needs
to be addressed: I don't care as much as I might about trying to
create millions of trivial threads as fast as they can, but I do
care about long-lived processes being able to create large numbers
of threads over a long period of time without having their performance
degrade.
CCL makes exactly one explicit call to mach_thread_self(); as far as I
can tell, the pthreads library makes another. I looked at the
reference count of the listener thread's kernel port after it had
compiled a small file and done a few other unremarkable things and
found that it was (IIRC) a few thousand. I don't know what things
(exception handling ? IPC ?) cause the reference count to increase
or whether it's at all practical to ensure that they're all balanced
by mach_port_deallocate() calls. (I don't think that examples involving
threads that barely do anything give us a real sense of what the issues
are.)
It -does- seem to be practical to try to ensure that the last thing
that a thread does (this would be in thread_manager.c:shutdown_thread_tcr(),
which does most of the deallocation of thread-private resources) is to
ensure that the kernel port's send right's reference count is exactly 1.
That seems to make it very likely that the eventual mach_port_deallocate()
in the kernel destroys the port.
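A rough sketch of that idea in C, which is not the actual
shutdown_thread_tcr() code: read the current user-reference count on the
thread port's send right and drop all but one reference. The function
name trim_thread_port_urefs is hypothetical, and the sketch assumes that
the one remaining reference is the one the thread exit path deallocates
later. The non-Apple branch is only a stub so the file compiles elsewhere:

```c
#if defined(__APPLE__)
#include <mach/mach.h>
#endif

/* Intended as the last thing an exiting thread does: leave exactly one
   user reference on its own kernel port's send right.
   Returns 1 on success, -1 on failure or on a non-Mach system. */
int trim_thread_port_urefs(void)
{
#if defined(__APPLE__)
    mach_port_t task = mach_task_self();
    mach_port_t self = mach_thread_self();  /* note: adds one more reference */
    mach_port_urefs_t refs = 0;

    if (mach_port_get_refs(task, self, MACH_PORT_RIGHT_SEND, &refs)
        != KERN_SUCCESS)
        return -1;

    /* refs already includes the reference taken just above, so dropping
       (refs - 1) of them leaves a count of exactly 1. */
    if (refs > 1 &&
        mach_port_mod_refs(task, self, MACH_PORT_RIGHT_SEND,
                           -(mach_port_delta_t)(refs - 1)) != KERN_SUCCESS)
        return -1;

    return 1;
#else
    return -1;                              /* stub: no Mach ports here */
#endif
}
```

This sidesteps the question of who took the extra references: it doesn't
try to balance each one, it just discards the surplus at exit.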
I've been running with that change in effect for a while; the test case
above doesn't seem to suffer any visible performance degradation. (Every
so often, the GC runs and destroys some locks/semaphores that've become
garbage; if you're really creating threads as fast as you can do so
sequentially, that means that "some" may be "a few K.")
The idea seems to basically work and will likely continue to do so unless
and until the pthread exit code changes dramatically. Unless some downside
becomes apparent soon, I'll try to check that in to the trunk in the next
day or so.
On Wed, 23 Feb 2011, Willem Rein Oudshoorn wrote:
> Gary Byers <gb at clozure.com> writes:
>
>> On Tue, 22 Feb 2011, Willem Rein Oudshoorn wrote:
>>
>>> Gary Byers <gb at clozure.com> writes:
>>>
>>>> At this point, I'd probably say that it -looks- like there's a net loss
>>>> of ~1 port every time a thread is created and destroyed, but that isn't
>>>> entirely predictable.
>>>
>>> After a bit of experimentation, it seems that it always loses 1 port.
>>> (The first time it might not appear this way because doing a garbage
>>> collect immediately after booting the lisp image it will recover some
>>> ports already.)
>>>
>>> After some debugging, it turns out that the mach_thread port is not
>>> freed. I think I know at least one reason why this is the case, but
>>> that can not be the whole story. (The mach port business is all
>>> completely new for me, so it takes a bit of time figuring this out.)
>>>
>>
>> The port that the lisp kernel generally refers to as "mach_thread"
>> is effectively a "task-wide" (OS process-wide) identifier for the thread.
>> It's not explicitly created by user code; it's created by the OS (I use
>> the term loosely ...) when the thread is created, and I'd naively
>> expect the port to be deallocated at some point after the thread
>> exits.
>
> Yes, indeed, the kernel will deallocate the receive right of the port.
> Normally this will make the port go away. However, if the user program
> has created send rights to the thread port, the send rights will
> transfer into a dead state and will not disappear, but hang around.
>
> One way to create send rights to the port is by calling
> mach_thread_self (). So basically all calls to mach_thread_self
> (and mach_task_self) should be balanced by a reference decreasing
> operation like mach_port_deallocate.
>
> This can easily be observed by running the following attached
> c-program 'tt'. This program creates a thread with:
>
> pthread_attr_init (&attr);
> pthread_attr_setdetachstate (&attr, PTHREAD_CREATE_DETACHED);
>
> int err = pthread_create (&tid, &attr, do_nothing, NULL);
>
> and in the do_nothing call it does:
>
> pp = mach_thread_self ();
>
> This call makes the thread port leak.
> (If later mach_port_deallocate (task_port, pp) is called, the
> port is immediately freed)
>
>> (I'd also find it believable that "at some point after" might
>> not be "immediately" and that the port might linger, sort of like a
>> listening TCP socket does.) I'll check, but I don't think that it's
>> meaningful for a thread to destroy its own self port while exiting,
>> and the basis for my naive belief that it's the kernel's responsibility
>> to recycle these ports is that it just doesn't seem to scale well to
>> do this in some other user thread. ("Remember those 10,000 threads you
>> created earlier ? Some of them are probably dead by now. Go harvest
>> their kernel ports.")
>
> First, I agree, destroying the thread port is not the program's
> responsibility.
> However, correctly balancing the send count for the send rights
> for the thread is the user's responsibility.
>
>>
>> Actually, it looks (at a rough glance) like there's some code in the
>> pthreads library that tries to do something like that; in the version
>> of that code that I'm looking at, it's called _pthread_reap_threads()
>> and it's called (under some circumstances) when a thread exits. That's
>> worth looking at further. If that works as my reading of the code
>> suggests it should, then when a thread exits other threads are examined
>> and if their kernel ports are dead those ports are deallocated. I don't
>> know if that reading's entirely correct, and I don't know why that isn't
>> having the intended effect if that is indeed the intended effect.
>
> I haven't looked at the pthread library. But at the moment, I don't
> understand where the bug is. It could be in the pthread library,
> but running some different test programs written in C does not
> show it yet. As I mentioned before, I am quite convinced
> that every call to mach_thread_self (and mach_task_self) should
> be balanced by reference count decreasing operations.
>
> However, a thread has a send_count of about 5:
>
> * 1 for creating the low-level thread itself (the pthread library?)
> * 1 created by mach_thread_self in CCL code
> * 3 others
> [This is from memory, I have experimented a bit so I could be one off.]
>
> Now after exiting, the first one is dealt with correctly (pthread
> library???). The second is AFAICS a bug in CCL. But fixing that
> leaves still the remaining 3. And at the moment I can't really
> see who creates these send rights. I suspect the swap exception ports
> code, but I haven't checked this.
>
>> It's possible that something that CCL does inhibits port recycling by
>> Mach. It's also possible that one would need to wait longer than we
>> have in order to see that recycling take place. I don't know.
>
> In my testing with C, the kernel will deal with ports immediately.
> I would be a bit skeptical if the kernel runs a regular 'port garbage
> collect'. I think that if user code does not manage the reference
> count correctly, the port (most likely dead) will hang around forever.
>
>>> The reason this matters:
>>>
>>> (time (ccl:process-run-function "test" (lambda ())))
>>>
>>> returns on my machine values around the following:
>>>
>>> During that period, 304 microseconds (0.000304 seconds) were spent in user mode
>>> 325 microseconds (0.000325 seconds) were spent in system mode
>>>
>>>
>>> However after
>>>
>>> (loop :repeat 10000 :do
>>>       (ccl:process-run-function "test" (lambda ())))
>>>
>>> doing
>>>
>>> (time (ccl:process-run-function "test" (lambda ())))
>>>
>>> returns values in the range of:
>>>
>>> During that period, 315 microseconds (0.000315 seconds) were spent in user mode
>>> 528 microseconds (0.000528 seconds) were spent in system mode
>>>
>>>
>>> And it is getting progressively worse. After about 150000 thread
>>> creations, the same function takes about 10ms.
>>
>> I'd certainly agree that it's desirable for thread creation to be as
>> quick as possible and for it not to degrade over time. It's possible
>> that the degradation that you see has something to do with port
>> leakage, but there are so many other things that can cause that sort
>> of thing that I'd be hesitant to conclude that there's a causal
>> relationship there. Whether I need to or not, I also want to point
>> out that this can be hard to measure: all that we know for sure after
>> the loop above runs is that 10000 threads were created and have either
>> exited or are on their way towards exiting. In order to exit, the
>> thread needs to run (get some CPU time), and if the number of runnable
>> threads exceeds the number of CPU cores ... well, it can take a while
>> for even the short life cycle of the threads in your example to
>> complete 10000 times. All that we know for sure is that when the loop
>> above exits, it's been started 10000 times.
>
> Yes, I see your concern. However, I still think it does degrade,
> because:
>
> 1 - after running that loop I wait for quite a while
> 2 - the number of threads indicated by ps or top is back to the normal amount
> (indicating that at least all the mach threads are finished.)
> 3 - I run a few (gc) to try to recycle ports and that has worked
> 4 - The cpu usage is back to normal (low)
>
>> True as that might be, it's definitely the case that the number of Mach
>> ports that a task (Unix process) can reference is large but finite, and
>> my recollection is that there's a lot of performance degradation as this
>> limit is approached.
>
> Well, in my experience the mach port usage goes up with threads
> and never down again.
> Also if the nr of mach ports used is in the region of > 100000
> the performance definitely degrades.
>
>
>> Amit Singh's book and website <http://osxbook.com/> deal with Mach
>> and other parts of OSX that generally aren't dealt with elsewhere.
>
> I will look it up.
>
>>>
>>> Wim Oudshoorn.
>>>
>>> P.S.: sbcl also loses 1 port per thread creation.
>>
>> I'm really skeptical that this has anything to do with user (non-OS-kernel)
>> code, but I don't know that with 100% certainty.
>>
>> If creating threads in C via pthread_create() doesn't seem to have the
>> same problem, it'd be interesting to see whether creating a detached
>> thread (via pthread_attr_setdetachstate(...,PTHREAD_CREATE_DETACHED))
>> affects this.
>
> Creating threads that are joinable will keep a mach port around until
> the threads are actually joined.
> Creating threads in a detached state will free the port immediately
> after finishing the threads execution.
>
>> At this point, I'm most suspicious of the pthread cleanup
>> code that -looks- like it should be deallocating the Mach ports of recently
>> exited threads, and it's plausible that thread creation options could affect
>> that (intentionally or otherwise.)
>
> My guess is that the pthread library needs to do this to get the pthread
> semantics right on top of mach_threads. Most likely in joining etc.
>
>
> Wim Oudshoorn.
>
>