Finding Love with GPU-Accelerated Matchmaking: A Tech Fairy Tale

Thu, Feb 1, 2018 by Mirko Sertic

This entertaining tech fairy tale follows four IT professionals who leverage OpenCL and vector mathematics to create a matchmaking system. By implementing cosine similarity calculations on GPU-accelerated hardware through Bytecoder, a Java-to-OpenCL cross-compiler, they not only solve their own romantic challenges but also demonstrate the practical applications of parallel processing in data-intensive tasks. The story showcases how modern GPU technology can process millions of similarity computations in seconds, while highlighting the accessibility of high-performance computing through familiar programming languages like Java.

9 Minutes reading time

Performance ,Interesting ,OpenCL ,Compiler

An office fairy tale

Once upon a time in an office far far away worked four very nice IT people very hard in their office cubicles. Their names were Alice, Bob, Carol and Dave.

They worked very hard each day. They started work at 9 o’clock, and finished often after sunset. They were very happy in their business, but they were very lonely in their private life due to the amount of work they did every day.

Time passed. And some day, they sat together having a break and drinking some coffee. Suddenly, Alice stand up and said "We need to do something! Look at us, we are lonely, we need to change this!". Bob, Carol and Dave silently agreed. After a few minutes, Carol asked : "Yes, but how can we change something, where should we start?".

Math can really help

And here began a very interesting journey of our four friends. After some discussion, they came to the point where they collected some information about them self and their interests. Maybe they could find a mate matching their interests this way? Everyone in IT likes to make tables, so our friends. They created a table, showing their interests and how much they liked it. A score of 0 meant "I really dislike this", a score of 10 meant "I cannot live without this!". So this is the table our friends created:

Interest	Alice	Bob	Carol	Dave
Hiking	5	0	2	7
Swimming	1	10	6	2
Cinema	0	3	3	1
Dancing	6	0	2	8

After gathering this information, Bob asked "What now? What can we do with this kind of data?" Dave, the data scientist, came up with a simple idea: Why not express every person as a vector, and map the interests to vector elements? If we have vectors for other persons, we could use a cosine similarity to match the vectors, and the person with the highest similarity might be a possible mate!

"What is a cosine similarity" asked Alice? Can you please explain this? And Dave did. A cosine similarity basically measures the delta of the angle of two vectors. If the vectors show in exactly the same direction, they have a similarity of 1. If they show in the opposite direction, the similarity is -1. We can use the cosine similarity to compute the similarity even for n-dimensional data. Look, the cosine similarity of vectors from two persons shows us how much they like the same interests. So it is an indicator of how good they might match!

"Great idea!" said Alice,"but we do not have data from other persons". Silence. Finally, Carol came up with an option : "Well, why not try this similarity matching with ourselfs, before we collect masses of data?". That was a viable option, so our friends started to boot their development machines. Dave came up with another great idea. As they are doing some interesting math here, why not learn cool new tech while trying to find a mate? At this point, our friends met OpenCL.

Behind the scenes of OpenCL

OpenCL is a cross platform language for heavy data processing. Using OpenCL, we can write data processing algorithms and run them on any kind of OpenCL-enabled devices. Interestingly, OpenCL algorithms can run on the CPU utilizing maybe 4 or 8 CPU cores, but it can also run on the GPU(Graphics card) utilizing maybe thousands of threads in parallel. Using the GPU, heavy math can be greatly accelerated!.

In OpenCL, data processing algorithms are called kernels. The amount of data is split into so called work items, and every work item is processed independently. So if we have for instance 300 work items, and our GPU can 300 threads, OpenCL can compute every single work item completely parallel!

So, what kind of input and output generates the kernel for the mate-matching problem of our four friends?

Well, we need a list of all potential mates, which is a list of vectors with four elements. One vector for every mate, and the elements of every vector describe their score for a given interest, such as hiking, swimming, cinema or dancing. This is one input for the kernel.

We want to compute the similarity for a single mate with all the others. This computation can be done independently, so every single mate is a work item. The kernel gets another input, which is the current work item id. Now, what is the output? The kernel needs to calculate all the similarities, and for every work item, it needs to output the id of the most matching mate. So the output for a single work item is the id of its most matching mate based on the cosine similarity of their vectors.

Developer productivity

OpenCL kernels are written in a C99 based language. It is not hard to learn, but it is different from languages such as Java. Our four friends are Java specialists, so what can they do? Well, there is a framework to the rescue. With Bytecoder, we can write OpenCL kernels in Java. The Java bytecode is translated into C99 kernels in the background and executed on the OpenCL device. As our friends wanted to keep their productivity up, they decided to use Bytecoder.

Putting it all together

How, here is the annotated Java kernel code for the matching problem of our four friends:

import static de.mirkosertic.bytecoder.api.opencl.GlobalFunctions.get_global_id;
import static de.mirkosertic.bytecoder.api.opencl.GlobalFunctions.get_global_size;
import static de.mirkosertic.bytecoder.api.opencl.VectorFunctions.dot;
import static de.mirkosertic.bytecoder.api.opencl.VectorFunctions.length;

import de.mirkosertic.bytecoder.backend.opencl.CPUPlatform;
import org.junit.Test;

import de.mirkosertic.bytecoder.unittest.Slf4JLogger;

@Test
public void testSimilarity() throws Exception {
    // The data of our four friends
    Vec4f theAlice = new Vec4f(5f, 1f, 0f, 6f);
    Vec4f theBob = new Vec4f(0f, 10f, 3f, 0f);
    Vec4f theCarol = new Vec4f(2f, 6f, 3f, 2f);
    Vec4f theDave = new Vec4f(7f, 2f, 1f, 8f);

    // We need an input for our kernel, a list of vectors
    Vec4f[] theInputs = new Vec4f[] {theAlice, theCarol, theBob, theDave};

    // This is the computed output
    int[] theMostSimilar = new int[theInputs.length];
    float[] theMostSimilarity = new float[theInputs.length];

    // We obtain a platform
    Platform thePlatform = PlatformFactory.resolve().createPlatform(new Slf4JLogger());
    // All computation is done within a context. A context is
    // used to cache memory buffers and compiled kernels
    try (Context theContext = thePlatform.createContext()) {

        // We fire up the computations
        theContext.compute(theInputs.length, new Kernel() {

            // This method is called for every workitem
            @Override
            public void processWorkItem() {
                // This is the id of the current work item
                int theCurrentWorkItemId = get_global_id(0);
                // This is the total number of work items
                int theMax = get_global_size(0);

                // We obtain the current work item from the list
                Vec4f theCurrent = theInputs[theCurrentWorkItemId];
                float theCurrentLength = length(theCurrent);

                float theMaxSimilarity = -1;
                int theMaxIndex = -1;

                // And compute the similarities with all other work item
                // except itself
                for (int i = 0;i<theMax;i++) {
                    if (i != theCurrentWorkItemId) {
                        Vec4f theOther = theInputs[i];
                        float theOtherLength = length(theOther);

                        float theLength = theCurrentLength * theOtherLength;

                        if (theLength != 0) {
                            float theSimilarity = dot(theCurrent, theOther) / (theLength);

                            if (theSimilarity > theMaxSimilarity) {
                                theMaxSimilarity = theSimilarity;
                                theMaxIndex = i;
                            }
                        }
                    }
                }

                // The highest similarity is written to the output
                theMostSimilar[theCurrentWorkItemId] = theMaxIndex;
                theMostSimilarity[theCurrentWorkItemId] = theMaxSimilarity;
            }
        });
    }

    // Output the results
    for (int i=0;i<theInputs.length;i++) {
        System.out.println("Most similar match for input " + i + " is " + theMostSimilar[i] + " with a similarity of " + theMostSimilarity[i]);
    }
}

Bytecoder makes it easy to write OpenCL kernels. We can use Java as the programming language to keep developer productivity up. Bytecoder compiles the Java bytecode to C99 kernels transparently in the background and sends them to the OpenCL enabled device. If there is no OpenCL device available, Bytecoder uses an emulation layer which executes the kernel in the Java virtual machine.

We have a match

Our friends started their machines, and put the created kernel to work with their own test data. Here is the output:

Item	Name	Best matching item	Similarity
0	Alice	3	0.993
1	Bob	2	0.907
2	Carol	1	0.907
3	Dave	0	0.993

The systems says Alice matches with Dave, and Bob matches with Carol! This was very confusing to our four friends, so they took a look at their data to check if the algorithm is wrong.

Interest	Alice	Bob	Carol	Dave
Hiking	5	0	2	7
Swimming	1	10	6	2
Cinema	0	3	3	1
Dancing	6	0	2	8

Alice has indeed a high similarity with Dave. Both like hiking and dancing. Bob and Carol have also a high similarity, as both like swimming and cinema. The algorithm seems to be correct.

Our four friends also realized what happened. They spent far too much time with working, and didn’t realize where their potential soul mate is sitting. The next day, Alice went dancing with Dave, and Bob went swimming with Carol. No more data was needed.

And they all lived happily ever after.

Up to 11 and beyond

Peggy, an intern of the IT department, realized some day that the office is empty and everyone is happy all the time. She heard about what our four friends have done, and also Peggy didn’t want to be alone anymore.

Unfortunately, there is no one left in the office. But fortunately, Peggy has collected a very large data set from other persons, with over 100'000 data points. Maybe she previously worked for the CIA, but it doesn’t matter in this case. And thanks to OpenCL and GPU acceleration, Peggy can start up the machine and find her potential soul mate. How long might it take to compute the best matches of so many people? Here are the results:

Running on the CPU/JVM on a Lenovo T440P Intel Core i7 with 8 cores : 19 seconds
Running on the embedded Intel GPU of the i7 : 6 seconds
Running on the secondary NVIDIA GT 730M : 3 seconds

Just to be clear. This means she computed the cosine similarity for every single person with the others. She computed 100'000 x 99'999 = 9'999'900'000 cosine similarities!

The OpenCL accelerated computation is a lot faster than running on the CPU. But it could even be faster, as the used kernel is far from being highly optimized.

But Peggy didn’t care. After 3 seconds, she went our of the office to meet her new soul mate.

What we have learned

Of course this is a fairy tale. And of course it is total overkill to compute four cosine similarities on the GPU. What I want to show you is a simple use case that might be extended to eventually become a full blown recommender system for all kind of matching problems. I also want to show you that we can easily use our favorite programming language and cross compiler technology such as Bytecoder to accelerate data processing problems by running them on the GPU. I also want to to note that Bytecoder is not the only available Java OpenCL toolkit. Aparapi is another great tool for this kind of problem, so maybe you want to take a look at this. I really hope that this short story was interesting and a little bit helpful. Thank you for reading, and happy processing!

<<Pevious posting: Mastering Object-Oriented Programming in WebAssembly: From High-Level to Low-LevelNext posting: Supercharging Vue.js with WebAssembly and Java>>