Performance Gap Between Array And Vector In Rust

Updated: January 31, 2022

As mentioned in the previous post, I am writing a package moldybrody for Molecular dynamics and Brownian dynamics simulations. The core of the package is the vector structure and its operations, which are Cartessian<T, N> using Array and CartessianND<T> using Vector. Here, T is a Generic representing the data type of the components of the vector, and N is a constant defined at compile time that refers to the length of the Array. 이 둘을 처음에 나눈 이유는 compile time에서 벡터의 차원이 명확하게 정의된다면 서로 다른 차원의 벡터가 연산되는 식의 에러를 미리 없앨 수 있기 때문이었습니다. However, when checking the performance of vector operations, it seems that Array is an indispensable data type even for performance. Today, I will record the process that led me to this conclusion.

Initial code structure

The initial code had the following format.

use std::fmt::Debug;
use std::convert::TryInto;

impl<T : Scalar, const N : usize> Cartessian<T, N>{
    pub fn map<'a, B, F>(&'a self, mut f : F) -> Cartessian<B, N>
        where F : FnMut(&'a T) -> B,
              T : Clone,
              B : Debug{

        Cartessian::<B, N>{
            coord : self.into_iter().map(|x| f(x)).collect::<Vec<B>>().try_into().unwrap(),
        }
    }
}

impl<T : Scalar, const N : usize> Add<Cartessian<T, N>> for &Cartessian<T, N>{
    type Output = Cartessian<T, N>;

    fn add(self, rhs : Cartessian<T, N>) -> Cartessian<T, N>{
        let mut out = self.clone();
        out.iter_mut().zip(&rhs).for_each(|(x, y)| *x = *x + *y);
        out
    }
}

impl<'a, T : Scalar, const N : usize> Mul<T> for &'a Cartessian<T, N>{
    type Output = Cartessian<T, N>;

    fn mul(self, c : T) -> Cartessian<T, N>{
        self.map(move |elt| *elt * c);
    }
}

fn compute_move(){
    // ...
    let dv = force * (dt / self.mass());
    let dx = self.vel() * dt + force * (dt * dt / (self.mass() + self.mass()));
    return (dx, dv);
}

To be more specific, I used the addition and scalar multiplication of Cartessian<T, N>, and the Trait for these was defined. Here, I implemented the scalar multiplication by multiplying each component by a constant through the map function, and then creating a Vec from the iterator through the collect function, and converting it to an array through the try_into function.

Of course, the performance was very low, and the most time-consuming point was Allocation. There were about 12 million allocations in total. So I modified the code in this part.

First Improvement

The modified code is as follows.

impl<'a, T : Scalar, const N : usize> AddAssign<&'a Cartessian<T, N>> for Cartessian<T, N>{

    fn add_assign(self, rhs : Cartessian<T, N>) -> Cartessian<T, N>{
        for (x, y) in self.iter_mut().zip(rhs){
            x.add_assign(*y);
        }
    }
}

impl<'a, T : Scalar, const N : usize> MulAssign<T> for Cartessian<T, N>{

    fn mul_assign(self, c : T) -> Cartessian<T, N>{
        for (x, y) in self.iter_mut().zip(rhs){
            x.mul_assign(*y);
        }
    }
}

fn compute_move(){
    // ..

    let mut temp = Cartessian::<T, N>::default();
    temp += force;
    temp *= dt / self.mass();
    let dv = temp.clone();

    let mut dx = Cartessian::<T, N>::default();
    dx += self.vel();
    dx *= dt;

    temp *= T::zero();
    temp += force;
    temp *= dt * dt / (self.mass() + self.mass());
    dx += temp;

    return (dx, dv);
}

I tried to reduce allocation by using AddAssign and MulAssign traits. In the actual running process, it was confirmed that Heap allocation almost did not occur. There were about 26 allocations in total. The speed was also three times faster.

Realize a Problem

However, this result was strange. In fact, I was using the clone function, so the allocation of the array must exist. However, it was not counted in allocation, and even in the time profiler, it was not found in the heaviest stack. I wondered if clone does not allocate, but I couldn’t find relevant information.

The difference was in Heap. I was too superficial in understanding Stack memory and Heap memory, so I couldn’t think of the fact that Stack allocation was not counted in Heap allocation. Anyway, Cartessian<T, N> that uses Array has a fixed size at compile time and will not change thereafter, so it will be allocated in Stack, not Heap. So, clone can happen very quickly.

Improve Performance on Trait Implementation

Then, why was the first code so long? The problem is that the structure of the iterator going around the vector and then returning to the array causes Heap allocation twice. This means that unnecessary time consumption is very large. So, I decided to use the map function in the entire code, and use the clone through the map function. I also wrote code that reduces stack allocation by returning only the value to the already allocated variable in the case of an operation that completely transfers ownership from the borrow (the vector addition below).

impl<T : Scalar, const N : usize> Add<Cartessian<T, N>> for Cartessian<T, N>{
    type Output = Cartessian<T, N>;

    fn add(mut self, rhs : Cartessian<T, N>) -> Cartessian<T, N>{
        self.iter_mut().zip(&rhs).for_each(|(x, y)| *x += *y);
        self
    }
}

impl<'a, T : Scalar, const N : usize> Mul<T> for &'a Cartessian<T, N>{
    type Output = Cartessian<T, N>;

    fn mul(self, c : T) -> Cartessian<T, N>{
        let mut out = self.clone();
        out.iter_mut().for_each(|x| *x = *x * c)
        out
    }
}

fn compute_move(){
    // ...
    let dv = force * (dt / self.mass());
    let dx = self.vel() * dt + force * (dt * dt / (self.mass() + self.mass()));
    return (dx, dv);
}

After changing this way, unnecessary Heap allocation also disappeared, and the readability of the code also improved. Of course, when implementing the same code for CartessianND<T>, it can be seen that it consumes a large amount of time in Heap allocation, which is the same as before. Of course, this case is the clone of the variable with a variable size, Vec<T>, so Heap allocation occurs.

Conclusion

I am actually thinking about my career after finishing my doctorate. I think I am not studying enough, so I don’t understand such obvious problems immediately, but if I want to be a good programmer, I think I need a deeper understanding of the basic structure of the computer, especially the memory part.

Share on

Twitter Facebook LinkedIn

Myeongseon Choi

Performance Gap Between Array And Vector In Rust

Initial code structure

First Improvement

Realize a Problem

Improve Performance on Trait Implementation

Conclusion

Share on

You may also enjoy

Matrix Transpose in CUDA

Kummer’s Theorem

Numb3rs S1E3: Spatial SIR Model

Numb3rs S1E1: The Rossmo Formula