An interesting encounter with parallel programming on GPU

Nagraj
5 min read · Apr 17, 2021
(Picture: wccftech)

Recently I was thinking about the time when I was buying a new laptop. One important component I considered while buying it was the graphics card. I had fantasized about using the laptop for everything from game development to video editing, and had gone so far as to check the minimum as well as recommended GPU (graphics processing unit) specifications for different software like Unreal Engine, Blender, and a few others. What I didn’t realize was that when you start your first job and try to get used to the work culture and the learning curve, free time for personal aspirations is hard to find. Not to mention that spending 9 hours a day working is way more exhausting than spending 8 hours a day studying (or pretending to study 😉 ) in college. So now I was just playing games made by others rather than trying to develop one, and not doing much video editing either.

Last week my friend asked if I wanted to try some parallel programming using the GPU, and I jumped at the opportunity and said yes. GPUs are primarily designed by two companies, NVIDIA and AMD, and both have large catalogs, but coincidentally my friend and I had the same GPU, the NVIDIA GeForce GTX 1650, and I was finally going to use mine for something.

Originally, working with GPUs meant learning specialized graphics APIs and shading languages like OpenGL and GLSL, which were built for rendering rather than general computation. But in 2007 NVIDIA launched the CUDA platform for general-purpose computing on its graphics processing unit (GPU) hardware. CUDA can be used to add parallelism to programs written in many popular languages like C, C++, Python, and MATLAB. This opened the gates to a lot of interesting uses of GPUs, like accelerating video and audio signal processing, scientific computing, computer vision, deep learning, and even cryptography.

We decided to start with a basic program in Python and compare its performance when run on the CPU to that when run on the GPU.

We already had the setup ready, but if you want to try it too, follow these steps:

  1. Download and install Anaconda (with Python 3.8)
  2. Download and install PyCharm (the Community edition will suffice)
  3. Download and install a stable release of MS Visual Studio (currently 2019, version 16.9.3)
  4. Open the Anaconda prompt and run:
conda install numba
conda install cudatoolkit
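
Once everything is installed, it is worth confirming that Numba can actually see the GPU before going further. A quick sanity check (assuming the conda environment from step 4):

from numba import cuda

print(cuda.is_available())   # True if a usable CUDA GPU was found
cuda.detect()                # lists the detected device(s), e.g. the GTX 1650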

For our comparison, we used a simple program that increments each element of a very large array by 1. You can find the code in the answer to this question. In theory, the CPU will execute the increments sequentially while the GPU will process chunks of the array in parallel, thus reducing the execution time. And in practice, the GPU did indeed take less time to execute the program. A rough sketch of that kind of comparison is shown below, and after it comes the interesting part: my results and my friend’s results.
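
To be clear, this sketch is my own illustrative version using Numba, not the exact code from the linked answer; the array size, block size, and timing details are assumptions.

from numba import cuda, njit
import numpy as np
from timeit import default_timer as timer

@njit
def increment_cpu(arr):
    # Plain sequential loop, compiled for the CPU by Numba.
    for i in range(arr.size):
        arr[i] += 1

@cuda.jit
def increment_gpu(arr):
    # Each CUDA thread increments one element of the array.
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1

n = 10_000_000                       # 10^7 elements, the size we tested with
a = np.ones(n, dtype=np.float32)
start = timer()
increment_cpu(a)                     # note: the first call also pays a one-time JIT compile cost
print("CPU:", timer() - start, "s")

b = np.ones(n, dtype=np.float32)
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
start = timer()
d_b = cuda.to_device(b)              # copy the array to GPU memory
increment_gpu[blocks, threads_per_block](d_b)
cuda.synchronize()                   # wait for the kernel to finish
b = d_b.copy_to_host()               # copy the result back
print("GPU:", timer() - start, "s")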

More than disappointed, I was confused. As you can see, we had both kept the array size at 10⁷, both had written the exact same lines of code, and both had the exact same GPU, yet my program took 1.36 seconds to run on the GPU, more than four times the 0.32 seconds her program took. We tried running it multiple times and the results were similar each time. The obvious culprit was the GPU, so we checked its specifications in the NVIDIA Control Panel -> System Information.

To our surprise, this is what we found:

We had completely different graphics cards, at least specification-wise, to which NVIDIA had decided to give the same name. It is like having 6-GB and 8-GB variants of the same smartphone, the only difference being that when buying the phone, we know exactly what specs we are paying for. So I opened the NVIDIA website to see if they had mentioned these two versions of the GTX 1650 card. There are indeed two versions mentioned, but they still have the same number of CUDA cores:

https://www.nvidia.com/en-us/geforce/graphics-cards/gtx-1650/

So I searched some more and found out that many others have had the same confusion regarding the GTX 1650 specifications; you can read some of the threads here and here. Then I found out that NVIDIA indeed has a GTX 1650 version with 1024 CUDA cores, and it is mentioned on a different page of their website:

https://www.nvidia.com/en-eu/geforce/gaming-laptops/gtx-1650/

Whatever NVIDIA’s logic behind this confusing naming might be, I was happy that at least I knew the reason why I was getting different results: the number of CUDA cores. If you are planning to buy a laptop for parallel programming, do pay attention to the number of CUDA cores.
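
If you want to verify this from code rather than from the control panel, Numba can report the streaming multiprocessor (SM) count of the current device. A small sketch, assuming a Turing-generation card like the GTX 1650, where each SM has 64 CUDA cores:

from numba import cuda

device = cuda.get_current_device()
sm_count = device.MULTIPROCESSOR_COUNT           # number of streaming multiprocessors
print(device.name, device.compute_capability)    # e.g. b'GeForce GTX 1650' (7, 5)
print("CUDA cores (64 per SM on Turing):", sm_count * 64)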

But the story doesn’t end here; there’s a twist. When I sat down to write this article today, I thought of running the same program again. See the results of three successive runs of the program for yourself:

And after this it consistently gave an execution time between 0.2 and 0.3 seconds. Now I am both happy and confused. Happy that it finally takes a similar execution time to my friend’s, but confused again as to how the execution time magically came down! And what about the difference in the number of CUDA cores? It remains a mystery. If I find out what is going on I will post an update, but for now I am going with the (probably wrong) assumption that I had too many Chrome tabs, a screen share, and too many other applications open that day, which somehow impacted the GPU performance.

Have a good day! Ciao.

