Différence Groupe-Conseil en Statistique (et Simulation)

Simple Tips To Write Efficient Code & Data Science Scripts

Writing Efficient Data Science Code - Simple Programming Tips

By Vincent Béchard, Analytical Decision Specialist


Published on 2020-11-10

Writing Efficient Data Science Code

Code that works is not necessarily efficient! Are you programming CPU-heavy data processing and analysis tasks? Some of my clients are amazed by the execution speed of my code, even though I am not a professional software developer. Here are some simple tricks I keep in mind when writing code in any language:

  • Do calculations only once: avoid repeating the same arithmetic; store results in memory and reuse them later (yes, that means more to manage!).
  • Watch what goes inside loops: any inefficient code there is executed at every iteration! A typical example is if/else and switch statements that process user options. Process the options outside the loop and write several variants of the loop instead.
  • Avoid writing a ton of one- or two-line functions just to make things “cute”. Each call adds overhead, and a deep call stack can slow execution down.
  • Use the built-in language features, for example the vectorized operations in R, Matlab or Python, or the native functions in VBA and JavaScript (they are almost always faster than your own implementation).
  • With object-oriented languages: write classes! The code is usually cleaner and simpler to debug.
  • Do you have a code profiler? Profile the execution and work on the most time-consuming steps first.
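To make the first two tips concrete, here is a minimal Python sketch. The function names, the "normalize" and "double" options and the data are hypothetical and chosen only for illustration. The slow variant recomputes the same norm and re-tests the user option at every iteration; the fast variant resolves the option once and then runs a dedicated loop.

```python
import math


def scale_slow(values, mode):
    """Inefficient variant: the option is re-tested and the norm recomputed on every pass."""
    out = []
    for v in values:
        # Same result at every iteration, yet recomputed each time.
        norm = math.sqrt(sum(x * x for x in values))
        if mode == "normalize":
            out.append(v / norm)
        elif mode == "double":
            out.append(2 * v)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out


def scale_fast(values, mode):
    """Same results: the option is resolved once, then one loop variant runs."""
    if mode == "normalize":
        norm = math.sqrt(sum(x * x for x in values))  # computed once, reused below
        return [v / norm for v in values]
    if mode == "double":
        return [2 * v for v in values]
    raise ValueError(f"unknown mode: {mode}")


data = [3.0, 4.0]
print(scale_slow(data, "normalize"))
print(scale_fast(data, "normalize"))
```

Both functions return the same values. The difference is that the slow variant does work proportional to the square of the input length, while the fast variant touches each element a fixed number of times.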

Again, these are just some of the things I keep in mind when I code… but they help a lot!
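On the profiling tip, a short sketch of what this can look like in Python with the standard `cProfile` and `pstats` modules. The functions `slow_step`, `fast_step` and `pipeline` are made up for this example; the point is only to show how a profile report singles out the most expensive step.

```python
import cProfile
import io
import pstats


def slow_step(n):
    # Deliberately wasteful: builds a full list only to sum it.
    return sum([i * i for i in range(n)])


def fast_step(n):
    # Cheap step, included for contrast in the report.
    return n + 1


def pipeline():
    slow_step(200_000)
    fast_step(10)


def profile_report(func, top=5):
    """Run func under the profiler and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the costliest call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_report(pipeline)
print(report)
```

The exact timings depend on the machine, so none are shown here. In the printed table, the rows with the largest cumulative time are the ones worth optimizing first.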

Efficient code in the context of data science: statistical programming

Not all data scientists are software development experts… Yet some computer programming skills, such as object-oriented data structures and algorithms, can make a night-and-day difference! This is true even if you write scripts on a data science platform such as R, Python or Matlab.


Here I share some interesting results on code performance. I needed to implement in R an algorithm that I had previously written in pure C#. I ended up with two versions:

  • Version 1: a direct translation of the C# code to R, with explicit loops and only custom functions
  • Version 2: an implementation that leverages native features such as implicit loops and vectorized operations
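The algorithm itself is not shown in this article, so the following Python sketch uses a stand-in task (a cumulative sum) purely to illustrate the two styles. The function names are hypothetical: the first mimics a loop-by-loop translation, the second leans on an implicit loop from the standard library.

```python
from itertools import accumulate


def cumulative_sum_v1(values):
    """Version 1 style: explicit loop with manual bookkeeping."""
    result = []
    total = 0
    for v in values:
        total += v
        result.append(total)
    return result


def cumulative_sum_v2(values):
    """Version 2 style: the loop is implicit, handled by a built-in tool."""
    return list(accumulate(values))


print(cumulative_sum_v1([1, 2, 3, 4]))
print(cumulative_sum_v2([1, 2, 3, 4]))
```

Both versions give the same result; the second is shorter and leaves the iteration to code that is already written and tested.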

Short illustration: if you need to shuffle a vector, an efficient implementation in C# is:

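// Usage: shuffle the integers 0 to 10 (requires System.Linq).
int[] randomNumbers = Enumerable.Range(0, 11).Shuffle(new Random()).ToArray();

// Fisher-Yates shuffle; as an extension method, it must be declared in a static class.
public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random random)
{
    T[] list = source.ToArray();
    int count = list.Length;
    while (count > 1)
    {
        // Pick an index among the first `count` items, then shrink the range by one.
        int index = random.Next(count--);
        T temp = list[index];
        list[index] = list[count];
        list[count] = temp;
    }
    return list;
}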

And in R, it is simply:

v <- sample(v)
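The same contrast can be sketched in Python, shown here only as an illustration alongside the C# and R snippets. `shuffle_manual` is a hypothetical hand-written Fisher-Yates shuffle that mirrors the C# logic, while `shuffle_builtin` relies on `random.sample` from the standard library, much like `sample(v)` in R. The seed value is arbitrary.

```python
import random


def shuffle_manual(items, rng):
    """Hand-written Fisher-Yates shuffle; returns a new list."""
    result = list(items)
    count = len(result)
    while count > 1:
        index = rng.randrange(count)  # random index in [0, count)
        count -= 1
        # Swap the picked item into the last unshuffled position.
        result[index], result[count] = result[count], result[index]
    return result


def shuffle_builtin(items, rng):
    """Same outcome with the standard library: a full-length random sample."""
    pool = list(items)
    return rng.sample(pool, k=len(pool))


rng = random.Random(42)  # arbitrary seed, for repeatable runs
v = list(range(11))
print(shuffle_manual(v, rng))
print(shuffle_builtin(v, rng))
```

Each call returns a permutation of the input and leaves the original list untouched.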

In a statistical programming environment, data manipulations are much easier! Coming back to my to-be-translated algorithm: not only was the Version 2 code much shorter, it was also faster on various problem sizes:

[Figure: speed comparison of Version 1 and Version 2 on various problem sizes]

Bottom line: leveraging the native built-in features of a programming environment can save both the time spent editing the code and the time spent executing it!
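If you want to check this kind of claim on your own code, a simple timing harness is enough. The sketch below uses Python's standard `timeit` module on two hypothetical functions, a hand-written total and the built-in `sum`. The data size and the number of repetitions are arbitrary, and actual timings will vary from one machine to another.

```python
import timeit


def total_loop(values):
    """Hand-written accumulation with an explicit loop."""
    total = 0
    for v in values:
        total += v
    return total


def total_builtin(values):
    """Same computation delegated to the built-in sum."""
    return sum(values)


data = list(range(100_000))

for func in (total_loop, total_builtin):
    # Time 20 calls of each function on the same data.
    seconds = timeit.timeit(lambda: func(data), number=20)
    print(f"{func.__name__}: {seconds:.4f} s for 20 runs")
```

Always confirm that both variants return the same result before comparing their speed.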

Want to learn more tips about writing efficient code?

We offer training sessions on statistical programming. More details here: https://difference-gcs.com/en/training-statistical-programming-r-language/


At Différence, our core expertise is centered on statistics & data science, Lean applications & operational excellence, and… simulation! We can train, coach and help practitioners learn modern multi-paradigm modelling. Don’t hesitate to ask for more information by contacting us at info@difference-gcs.com.