[Openmcl-devel] Speed, compilers and multi-core processors
Paul Krueger
plkrueger at comcast.net
Thu May 21 16:02:37 PDT 2009
Sure, I agree with you that there are lots of interesting things you
can do with GPUs; I hope I didn't leave the impression that I thought
they were useless. Anything that can be decomposed into a large number
of independent computations on small independent data sets is a good
candidate for GPGPU processing. In the HPC world these are called
"embarrassingly parallel applications". As you noted, anything for
which a map/reduce paradigm works, is also a candidate for using lots
of GPU's, as are things like rendering and large matrix computations.
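The map/reduce shape is easy to sketch. The following is purely an
illustration of the computational pattern (the pixel-brightening
function and the values are made up, and this is ordinary Python, not
GPU code): each map step touches only its own element, so every element
could in principle go to a separate compute element, and the reduce
step combines results pairwise, so it can be done as a tree.

```python
from functools import reduce

def brighten(pixel):
    # The "map" step: each element is transformed independently,
    # with no communication between elements -- exactly the shape
    # that spreads well across a GPU's many compute elements.
    return min(pixel + 40, 255)

def checksum(a, b):
    # The "reduce" step: an associative pairwise combination,
    # so it can be evaluated as a tree in parallel.
    return (a + b) % 65536

pixels = [0, 100, 200, 250]
mapped = [brighten(p) for p in pixels]  # independently parallelizable
total = reduce(checksum, mapped)        # tree-reducible

print(mapped)  # [40, 140, 240, 255]
print(total)   # 675
```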
But there are interesting things that you can't do with GPUs too. They
are fast because they are limited in certain ways. They typically
don't have caches, so if you begin to need a lot of off-chip memory
references (assuming such references are even allowed by the chip;
often some external control structure is required to move data in and
out), the GPU idles while waiting for that data. They also typically
have a very limited ability to communicate externally, so if any sort
of global synchronization is required, you need some external service
to do it for you. If you need a bunch of separate processes crawling
around examining dynamically determined, arbitrary locations within the
same large data set (e.g., for certain kinds of searches), then current
GPU configurations are not going to work well for you. [Note: I just
saw Alex R.'s response as I was preparing to send this, and I agree
that alternative algorithms can *sometimes* be found to take better
advantage of GPUs. This is an interesting research area, but I remain
a skeptic that it will always work; in any event it will likely
require yet another computational language/paradigm that will need to
be developed, propagated to the community, and learned by developers.
I.e., GPUs are not an automatic, immediate solution to every problem.]
Even conventional multi-core processors can have problems with that
sort of application, although things like shared caches, large locally
accessible memories, and fast inter-processor communication channels
can help. That's why multi-threading chips (e.g., Sun's Niagara and
its offspring; see also: http://en.wikipedia.org/wiki/Multithreading_(computer_hardware))
also have a place. Typically they can fire off a memory reference
and switch to another thread in a single cycle, so they are rarely
idle. Of course, you need the right application to take advantage of a
large number of threads, as well as a memory subsystem that can keep up
with that load.
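As a back-of-envelope illustration (the cycle counts below are
invented, not taken from any real chip), the number of hardware
threads needed to keep such a core busy falls out of simple
arithmetic: while one thread waits out the memory latency, the other
threads must supply that many cycles of useful work between them.

```python
import math

def threads_to_hide_latency(work_cycles, mem_latency_cycles):
    """Rough count of hardware threads needed so the core never idles:
    one thread is stalled on memory for mem_latency_cycles, and each
    other thread contributes work_cycles of work before it too stalls."""
    return 1 + math.ceil(mem_latency_cycles / work_cycles)

# Illustrative numbers only: 5 cycles of work per memory reference
# and a 100-cycle memory latency -> about 21 threads keep the core busy.
print(threads_to_hide_latency(5, 100))  # 21
```

This is also why the memory subsystem matters: 21 threads all issuing
references mean roughly 21 outstanding memory requests at any time.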
The Connection Machine was a very interesting creature, and it would be
fun to have an updated version of it around, but I wouldn't equate the
Connection Machine to current incarnations of GPUs. The latter simply
don't have the external communication and synchronization capabilities
that Connection Machine processors had.
A few years back I was talking with a developer of FPGA applications.
He was perplexed by the fact that he wasn't seeing the sort of speedup
in his application that he expected (relative to running on a
conventional processor). In fact, it was slower when run on the FPGAs.
As I explained to him then, the cost of moving data into and out of
the FPGAs was completely dominating the execution time. He was doing
data transfers fairly frequently, so the FPGA was idle more than it was
busy while this was going on. The board that he was using coupled
FPGAs to the processor in much the same way that GPUs are coupled to
processors today, so I expect that some GPU developers will see
similar effects with their applications. Just a cautionary note ...
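That effect can be captured in a toy model (all numbers below are
hypothetical, chosen only to show the shape of the problem): the
transfer time adds to the accelerated compute time, and once transfers
dominate, the "speedup" drops below 1 and the offload is a net loss.

```python
def offload_speedup(cpu_time, accel_compute_time, transfer_time):
    """Effective speedup of offloading work to an accelerator when the
    data must be moved in and out: the transfer cost is paid on every
    call and adds directly to the accelerated compute time."""
    return cpu_time / (accel_compute_time + transfer_time)

# Hypothetical numbers: a kernel that is 10x faster on the device
# (10 ms -> 1 ms) still loses if transfers cost 15 ms per call.
print(offload_speedup(10.0, 1.0, 0.0))   # 10.0  -- raw gain, free transfers
print(offload_speedup(10.0, 1.0, 15.0))  # 0.625 -- slower than the CPU
```

The cure, when there is one, is to batch work so each transfer is
amortized over much more computation.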
We live in interesting times. There is a wealth of technology around
and new variations every day it seems. I expect to see new types of
GPUs that are easier to use for more conventional processing and I
expect to see conventional processors that are extended to add more
multi-threading capabilities as Sun and others have done. We may see
streaming processors too, who knows (à la the work done at Stanford;
see http://en.wikipedia.org/wiki/Stream_processing). Each type has
its proper place, but as near as I can tell none of them is
applicable to all problems. Perhaps the only thing I can predict with
some confidence is that there will be an awful lot of software work
required to support them.
Regards, Paul
On May 21, 2009, at 12:48 PM, Rainer Joswig wrote:
>
> Am 21.05.2009 um 16:55 schrieb Paul Krueger:
>
>> You can't emphasize these points enough. GPGPU technology has its
>> place, but it's not perfect for everything. If you have an
>> application where data can be partitioned up neatly and distributed
>> to separate processing elements which tend to do the same limited
>> things over and over (FFTs are a good example), then GPGPUs may
>> be appropriate (as are FPGAs for similar reasons, although there
>> are certainly other factors there). If you have an application
>> where each processing thread may dynamically determine that it
>> needs data from an arbitrary location within a very large block of
>> memory or needs to do frequent updates within large data blocks in
>> arbitrary ways, then GPGPUs are not appropriate because the
>> communication and synchronization costs will typically kill you.
>> That's especially true on any larger distributed memory
>> architecture, but even on smaller systems you might overwhelm the
>> memory subsystem. Many of the sorts of AI, graph, and intelligent
>> applications that I am personally more interested in fall into the
>> second category, so GPGPUs will likely not be of much help.
>
> The appeal of the GPU is that it has lots of computing elements
> (originally designed for 3d rendering tasks). The trend is that
> these computing elements are getting more flexible and numeric
> computations are getting more accurate. Now we are at a point where
> the first programming languages are appearing that allow writing
> portable algorithms that can run on the GPU: CUDA, OpenCL, ...
>
> The typical applications the average user will see are about
> manipulating large amounts of multimedia data. Think using FinalCut
> Pro to render visual effects, to convert video formats etc. Think
> iPhoto/Aperture plugins that manipulate large 'raw' images (dust
> filters, sharpeners, ...). At the same time, scientists working with
> such kinds of data will find uses for it, too.
>
> If you look back at the SIMD Connection Machine, it was thought
> that there were graph applications able to run in parallel on
> such a machine. The graph was mapped to compute elements which could
> communicate efficiently with nearby elements. A somewhat typical
> application domain was also querying data collections. Spread those
> data collections across a huge number of compute elements and run
> MAP/REDUCE operations. An outcome was the WAIS protocol and a
> document query engine running on the Connection Machine
> (http://en.wikipedia.org/wiki/Wide_area_information_server).
>
> For Lisp users it might be interesting to run numeric matrix
> operations in applications like Maxima on the GPU. Image
> understanding applications like Freedius
> (http://www.ai.sri.com/project/FREEDIUS) could benefit from it. But
> there it would probably be written on the C/OpenCL side and used
> from Lisp via FFI. CAD applications, too.
> I could even imagine that a large Triple Store
> (http://en.wikipedia.org/wiki/Triple_Store) has algorithms in the
> query domain that could benefit from GPU /
> SIMD support. Also think about blackboards that have some dimensions
> as matrices mapped to the GPU (example of a blackboard system in
> Lisp: http://gbbopen.org/ ).
>
> As an example see this GBBOpen function:
> map-instances-on-space-instances
>
> http://gbbopen.org/hyperdoc/ref-map-instances-on-space-instances.html
>
> ' Apply a function once to each unit instance on space instances,
> optionally selected by a retrieval pattern. '
>
> Then see FIND-INSTANCES: http://gbbopen.org/hyperdoc/ref-find-instances.html
>
> I could imagine that SOME uses of these functions could be sped up a
> lot by running in parallel on a GPU with, say, 256 compute elements.
>
> But that is just speculation on my side. It really depends on whether
> users actually have such application problems and whether those can
> be mapped to GPUs.
>
> Regards,
>
> Rainer Joswig
>
>
>>
>> Paul
>>
>> On May 20, 2009, at 1:06 PM, Dan Weinreb wrote:
>>
>>> The instruction set is very restricted, and the communication
>>> paths aren't there, as you suggested. GPGPU is especially
>>> good for highly compute-intensive operations over not
>>> all that much data. An FFT is an obvious example but
>>> there are many, many good examples. (Not that I'm an
>>> expert, but I do know that much.)
>>>
>>> There are CUDA-compatible devices that don't even
>>> have a video connection, i.e. for GPGPU only.
>>> The NVidia Tesla, called a "computing processor"
>>> (weird name). 240 cores per board, and you can
>>> chain together four of them.
>>>
>>> (My officemates are getting this info and telling it to
>>> me faster than I can type it in. Thanks, Andrew
>>> and Scott.)
>>>
>>> -- Dan
>>>
>>> Jeremy Jones wrote:
>>>>
>>>> On Wed, May 20, 2009 at 9:13 AM, Raffael Cavallaro
>>>> <raffaelcavallaro at mac.com> wrote:
>>>>
>>>>> tomshardware.com ran this a couple of days ago:
>>>>>
>>>>> <http://www.tomshardware.com/reviews/nvidia-cuda-gpgpu,2299.html>
>>>>>
>>>>> It's a summary of real-world results from apps using Nvidia's
>>>>> CUDA.
>>>>> For certain things, like video encoding, they're seeing a 4x
>>>>> speedup
>>>>> using the GPU over using the CPU. In addition, when they use the
>>>>> GPU,
>>>>> it leaves the CPU free for other tasks.
>>>>>
>>>> Why don't we just throw out the main CPU and fill our computers
>>>> with
>>>> graphics cards? (Once CCL is ported to GPUs of course)
>>>>
>>>> Seriously though, what does a CPU have that a GPU doesn't,
>>>> besides a
>>>> different instruction set? More memory? Better i/o? Is the GPU
>>>> instruction set too specialized? I bet the answer is mainly
>>>> software,
>>>> like OSes and device drivers. I remember in the old days it was
>>>> common to have a separate processor to handle i/o. Maybe that's
>>>> what
>>>> the main CPU should be relegated to. OTOH, if the software is good
>>>> enough, it should just be distributed to whatever computing
>>>> resources
>>>> are appropriate and available. Just thinking out loud.
>>>> _______________________________________________
>>>> Openmcl-devel mailing list
>>>> Openmcl-devel at clozure.com
>>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>>>
>>
>
> Rainer Joswig, Hamburg, Germany
> http://lispm.dyndns.org/
> mailto:joswig at lisp.de
>
>
>