Codingame Spring Challenge 2025
My experience from the Codingame Spring Challenge 2025.
I finished 138th out of 16,365 participants with a final score of 634 (lower is better).
Problem
The challenge was based on the Cephalopod board game. In this version of the game, a 3×3 grid was used. Given an initial board state and a maximum number of turns (a problem), the goal was to:
- explore all the possible paths;
- for each path, compute a “hash” of the final state (which is a decimal representation of the board);
- output (to the standard output) the sum of all these hashes modulo 2^30.
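To make the hashing concrete, here is a hedged sketch (with illustrative names, not the challenge's reference code): the nine cells read row by row form a 9-digit decimal number, and the answers are accumulated modulo 2^30.

```rust
// Sketch: the "hash" of a board is its decimal representation, i.e. the
// 9-digit number formed by the nine cells read row by row. Since the largest
// board (all sixes) gives 666666666 < 2^30, each individual hash already
// fits; only the running sum needs to be reduced modulo 2^30.
const MOD_MASK: u32 = (1 << 30) - 1;

fn hash_board(board: &[[u32; 3]; 3]) -> u32 {
    board.iter().flatten().fold(0, |h, &cell| h * 10 + cell)
}

fn add_hash(sum: u32, hash: u32) -> u32 {
    // both operands are < 2^30, so this u32 addition cannot overflow
    (sum + hash) & MOD_MASK
}
```

For example, the board `0 6 0 / 2 2 2 / 1 6 1` hashes to 60222161.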
The aim of the challenge was to write the fastest program: at the end of the challenge, your program is evaluated on a set of hidden problems (which cannot be known in advance) and your score is the sum of the computation times. The lowest score wins.
In order to evaluate yourself during the challenge, you could use:
- 12 problems of increasing complexity that can be used to validate your algorithm;
- another set of problems, used to give you a provisional score and rank you against the other players.
The 12 test problems could be extracted by logging them to standard error. The other two sets of problems were not accessible.
This second set of problems is different from the final set used to compute your final score.
Important note: we apparently only had access to a single core/hyperthread, so multithreading did not seem to be a viable option.
Feedback
I did the challenge in Rust. This was a nice opportunity to practice some Rust for performance and refresh my (rusty) Rust skills a little bit. Optimizing the algorithm was quite interesting. It was the first time I did this kind of algorithmic optimization job in Rust.
Limitations:
- Apparently, the timer only started when you began reading the problem from the standard input. This was presumably done to avoid penalizing programming languages which are slow to start. However, it appears it was possible to exploit this by precomputing things before reading from the standard input.
- It might have been possible to use compile-time computation (in C++ or Rust) to precompute things at compile-time.
- The challenge strongly favored C, C++ and Rust: the top 159 submissions are all written in one of these three languages.
My solutions
I ended up implementing 4 different algorithms:
- a dynamic programming algorithm;
- a depth-first exploration with memoization;
- a hybrid of the two previous algorithms, combining dynamic-programming steps with depth-first exploration and memoization;
- a very clever 🤣 algorithm similar to the exponentiation-by-squaring method.
Dynamic Programming algorithm
I started with a dynamic programming algorithm which, I thought, would give good results. For each horizon, it computes the set of reachable states (and the number of times each is reachable) in a HashMap. In order to avoid useless memory accesses, all the final states are summarized (in .hash) by summing their hashes.
struct StateCounter {
    map: MyHashMap<State, u32>,
}

struct ComputationState {
    current_states: StateCounter,
    next_states: StateCounter,
    hash: u32,
}
This first algorithm directly gave quite good performance.
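As a rough sketch of this approach (illustrative names; `successors`, `is_final` and `hash` stand in for the real game logic, and this is not my exact code):

```rust
use std::collections::HashMap;

// Horizon-by-horizon expansion: `current` maps each reachable state to the
// number of paths reaching it; final states are folded into `final_hash_sum`
// immediately instead of being stored.
type State = u32;
const MOD_MASK: u32 = (1 << 30) - 1;

fn solve(
    initial: State,
    turns: u32,
    successors: impl Fn(State) -> Vec<State>,
    is_final: impl Fn(State) -> bool,
    hash: impl Fn(State) -> u32,
) -> u32 {
    let mut current: HashMap<State, u32> = HashMap::from([(initial, 1)]);
    let mut final_hash_sum: u32 = 0;
    for _ in 0..turns {
        let mut next: HashMap<State, u32> = HashMap::new();
        for (state, count) in current {
            for succ in successors(state) {
                if is_final(succ) {
                    // wrapping arithmetic is safe here: 2^30 divides 2^32,
                    // so masking afterwards gives the same result as exact
                    // arithmetic modulo 2^30
                    final_hash_sum = final_hash_sum
                        .wrapping_add(count.wrapping_mul(hash(succ))) & MOD_MASK;
                } else {
                    *next.entry(succ).or_insert(0) += count;
                }
            }
        }
        current = next;
    }
    // states still alive at the horizon contribute their hash as well
    for (state, count) in current {
        final_hash_sum = final_hash_sum
            .wrapping_add(count.wrapping_mul(hash(state))) & MOD_MASK;
    }
    final_hash_sum
}
```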
Depth-first exploration with memoization
The second algorithm I wrote used a depth-first exploration of the different paths, with memoization to avoid exploring a (depth, board state) pair more than once.
struct DfsSearcher {
    depth: usize,
    // Key combines the depth and the board state:
    memo: MyHashMap<u64, u32>,
}
// ...
It initially gave results which were not as good as my first algorithm.
I could have spent more time optimizing this one. One possible improvement would have been to exploit equivalent board states (after rotations and/or symmetries) in order to reuse more computations. It could have yielded an important speedup and would probably have made its performance comparable to the dynamic programming algorithm. However, I spent my time optimizing the dynamic programming algorithm and implementing the "clever" algorithm instead.
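For reference, the recursion can be sketched like this (again with illustrative stand-ins for the game logic, not my exact code):

```rust
use std::collections::HashMap;

// Memoized depth-first search: the memo key packs the remaining depth into
// the high 32 bits and the board state into the low 32 bits.
type State = u32;
const MOD_MASK: u32 = (1 << 30) - 1;

fn dfs(
    state: State,
    depth: u32,
    memo: &mut HashMap<u64, u32>,
    successors: &impl Fn(State) -> Vec<State>,
    is_final: &impl Fn(State) -> bool,
    hash: &impl Fn(State) -> u32,
) -> u32 {
    if depth == 0 || is_final(state) {
        return hash(state) & MOD_MASK;
    }
    let key = ((depth as u64) << 32) | state as u64;
    if let Some(&cached) = memo.get(&key) {
        return cached;
    }
    let mut sum: u32 = 0;
    for succ in successors(state) {
        sum = sum
            .wrapping_add(dfs(succ, depth - 1, memo, successors, is_final, hash))
            & MOD_MASK;
    }
    memo.insert(key, sum);
    sum
}
```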
Hybrid algorithm
I then implemented a hybrid of the dynamic programming and depth-first exploration algorithms, more because it took 2 minutes and 10 lines to write than because I thought it would give good results.
The idea is to use dynamic programming for some time steps and then finish the game with depth-first exploration. As expected, it did not give good results.
Clever algorithm
I then designed a very clever algorithm (of which I was very proud), similar to the exponentiation-by-squaring method.
Note: exponentiation-by-squaring method
In the exponentiation-by-squaring method (used for example in cryptography), a large exponentiation is decomposed into a product of squares. For example, instead of computing n^8448 with 8447 multiplications, we can use the binary representation of the exponent (8448 = 0b10000100000000 = 256 + 8192 = 2^8 + 2^13), which gives a decomposition with only 14 multiplications:
- n^2 = n × n
- n^4 = n^2 × n^2
- n^8 = n^4 × n^4
- n^16 = n^8 × n^8
- n^32 = n^16 × n^16
- n^64 = n^32 × n^32
- n^128 = n^64 × n^64
- n^256 = n^128 × n^128
- n^512 = n^256 × n^256
- n^1024 = n^512 × n^512
- n^2048 = n^1024 × n^1024
- n^4096 = n^2048 × n^2048
- n^8192 = n^4096 × n^4096
- n^8448 = n^256 × n^8192
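This decomposition is the classical square-and-multiply loop. As an illustration (a generic modular exponentiation, unrelated to my solution code): walk the bits of the exponent, square the base at each step, and multiply into the result whenever a bit is set.

```rust
// Square-and-multiply modular exponentiation. u128 intermediates avoid
// overflow for any u64 modulus.
fn pow_mod(mut base: u64, mut exp: u64, modulus: u64) -> u64 {
    let mut result = 1 % modulus;
    base %= modulus;
    while exp > 0 {
        if exp & 1 == 1 {
            // this bit of the exponent is set: multiply into the result
            result = (result as u128 * base as u128 % modulus as u128) as u64;
        }
        // square the base for the next bit
        base = (base as u128 * base as u128 % modulus as u128) as u64;
        exp >>= 1;
    }
    result
}
```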
In this algorithm, I used the base-two representation of the number of turns (e.g. 40 = 0b101000 turns) and always computed the states reachable after 2^n turns from a given state.
struct ReachableState {
    state: State,
    count: u32,
}

struct Reachability {
    hash: u32,
    reachable_states: MyHashMap<State, u32>,
}

struct Dp2Solver {
    // Key is (log2(depth), state)
    reachability_cache: MyHashMap<u64, Rc<Reachability>>,
}
It ended up having very, very, VERY bad performance for some reason. The horizons we are dealing with here are probably not big enough for this kind of trick to matter anyway. Maybe it was not so clever after all…
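The core doubling step can be sketched as follows (illustrative types, not my exact code; this sketch also omits the hash field that summarizes the final states absorbed along the way): if we know, for each state, the counts of states reachable from it in 2^n turns, the counts for 2^(n+1) turns are obtained by composing that map with itself.

```rust
use std::collections::HashMap;

// `first[s]` and `second[s]` map a start state to the reachability counts
// after some number of turns; the composition gives the counts after the sum
// of the two horizons (doubling when first == second).
type State = u32;
type Counts = HashMap<State, u32>;

fn compose(
    first: &HashMap<State, Counts>,
    second: &HashMap<State, Counts>,
) -> HashMap<State, Counts> {
    let mut result: HashMap<State, Counts> = HashMap::new();
    for (start, mids) in first {
        let entry = result.entry(*start).or_default();
        for (mid, c1) in mids {
            if let Some(ends) = second.get(mid) {
                for (end, c2) in ends {
                    // c1 paths to `mid`, then c2 paths from `mid` to `end`
                    *entry.entry(*end).or_insert(0) += c1 * c2;
                }
            }
        }
    }
    result
}
```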
Algorithms conclusion
The dynamic programming algorithm directly gave quite good performance. I then implemented the depth-first exploration and the hybrid algorithms, but they did not give comparable results.
The depth-first exploration could also have given good results with some optimizations, but I did not spend enough time on this path and instead spent much time optimizing the first one.
Performance tips
See the appendix for some evaluation of the different optimizations.
See The Rust Performance Book for more performance tips.
Packed board state representation
I packed the whole board state in a u32. The board is 3×3 (9 cells) and each cell can have 7 states (0 for unoccupied, and 1 to 6). We can use 3 bits per cell, leading to 27 useful bits.
#[derive(PartialEq, Eq, PartialOrd, Ord, Hash, Clone, Copy)]
struct State(u32);

impl State {
    #[inline(always)]
    fn get(&self, i: usize, j: usize) -> u32 {
        let (shift, mask) = CELL_SHIFT_MASKS.get(i).unwrap().get(j).unwrap();
        (self.0 & mask) >> shift
    }

    #[inline(always)]
    fn set(&mut self, i: usize, j: usize, value: u32) {
        let (shift, mask) = CELL_SHIFT_MASKS.get(i).unwrap().get(j).unwrap();
        self.0 = (self.0 & !mask) | (value << shift);
    }

    // ...
}
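The CELL_SHIFT_MASKS table itself is not shown above; one possible definition (an assumption about its shape, assigning 3 bits to cell (i, j) starting at bit 3 × (3i + j)) is a const lookup table:

```rust
// Hypothetical definition of the (shift, mask) lookup table: cell (i, j)
// occupies the 3 bits starting at 3 * (3 * i + j).
const CELL_SHIFT_MASKS: [[(u32, u32); 3]; 3] = {
    let mut table = [[(0u32, 0u32); 3]; 3];
    let mut i = 0;
    while i < 3 {
        let mut j = 0;
        while j < 3 {
            let shift = (3 * (3 * i + j)) as u32;
            table[i][j] = (shift, 0b111u32 << shift);
            j += 1;
        }
        i += 1;
    }
    table
};
```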
We can, for example, decide whether a state is final with:
const ONES: u32 = 1 | 1 << 3 | 1 << 6 | 1 << 9 | 1 << 12 | 1 << 15 | 1 << 18 | 1 << 21 | 1 << 24;

impl State {
    #[inline(always)]
    fn is_final(&self) -> bool {
        ((self.0 & ONES) | ((self.0 >> 1) & ONES) | ((self.0 >> 2) & ONES)) == ONES
    }
}
Custom hashing method
I used HashMaps quite heavily and it turned out they were a bottleneck. By default, HashMaps in Rust are not so fast, especially for small integer keys, because they provide resistance against malicious denial-of-service attacks. In my case, the keys are small integers and I don't care about DoS.
I started by implementing the FNV64 hash. The nice thing is that it is very simple to implement:
const FNV_PRIME: u64 = 1099511628211;
const FNV_BASIS: u64 = 14695981039346656037;

struct FnvHasher(u64);

impl Default for FnvHasher {
    #[inline]
    fn default() -> FnvHasher {
        FnvHasher(FNV_BASIS)
    }
}

impl Hasher for FnvHasher {
    #[inline(always)]
    fn finish(&self) -> u64 {
        self.0
    }

    #[inline(always)]
    fn write(&mut self, bytes: &[u8]) {
        let mut hash = self.0;
        for b in bytes.iter() {
            hash ^= *b as u64;
            // wrapping multiplication: FNV is defined modulo 2^64
            hash = hash.wrapping_mul(FNV_PRIME);
        }
        self.0 = hash;
    }
}

struct FnvBuildHasher {}

impl Default for FnvBuildHasher {
    fn default() -> FnvBuildHasher {
        FnvBuildHasher {}
    }
}

impl BuildHasher for FnvBuildHasher {
    type Hasher = FnvHasher;

    fn build_hasher(&self) -> Self::Hasher {
        FnvHasher::default()
    }
}

pub type MyHasher = FnvHasher;
pub type MyHashMap<K, V> = HashMap<K, V, FnvBuildHasher>;
This gave a nice performance improvement.
I then switched to the hash function found in the rustc-hash package, which I shamefully pasted into my code. This gave another nice performance improvement over FNV64.
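For reference, the heart of rustc-hash (as I understand the 1.x FxHash; this is a sketch from memory, not the crate's exact code) is a rotate/xor/multiply step per word, much cheaper than FNV's byte-at-a-time loop:

```rust
// Sketch of an FxHash-style per-word step: rotate the running hash, xor the
// input word in, then multiply by a fixed odd constant (the 64-bit seed used
// by rustc-hash 1.x, to the best of my knowledge).
const SEED: u64 = 0x51_7c_c1_b7_27_22_0a_95;

#[derive(Default)]
struct FxHasher64(u64);

impl FxHasher64 {
    #[inline(always)]
    fn add_to_hash(&mut self, word: u64) {
        self.0 = (self.0.rotate_left(5) ^ word).wrapping_mul(SEED);
    }

    fn finish(&self) -> u64 {
        self.0
    }
}
```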
Manual loop unrolling
I obtained a nice speedup by manually unrolling loops (combined with #[inline(always)] for function inlining).
I rewrote:
for i in 0..3 {
    for j in 0..3 {
        if i != 0 {
            if i != 2 {
                // ...
            }
            // ...
        }
        if i != 2 {
            // ...
        }
        // ...
    }
}
into
self.fill_next_states_sub_final(&mut hash, current_state, &cells, *current_count, 0, 0);
self.fill_next_states_sub_final(&mut hash, current_state, &cells, *current_count, 0, 1);
self.fill_next_states_sub_final(&mut hash, current_state, &cells, *current_count, 0, 2);
self.fill_next_states_sub_final(&mut hash, current_state, &cells, *current_count, 1, 0);
// ...
with
#[inline(always)]
fn fill_next_states_sub_final(
    &mut self,
    hash: &mut u32,
    current_state: &State,
    cells: &[[u32; 3]; 3],
    current_count: u32,
    i: usize,
    j: usize,
) {
    if i != 0 {
        if i != 2 {
            // ...
        }
        // ...
    }
    if i != 2 {
        // ...
    }
    // ...
}
Loop unrolling combined with function inlining gives the compiler an opportunity to remove useless branches (for given values of i and j).
This provided a nice speedup (from 127ms to 105ms for the test cases on my machine).
Optimize last iteration
Just after the end of the challenge, I noticed that I had forgotten to optimize the last iteration of the dynamic programming algorithm: instead of computing the count of each state (current_states: StateCounter) on the final turn, the states can be summarized directly by the sum of their hashes. With this optimization, the computation time goes from 93ms to 88ms for the test cases.
Some other tips
Local development
Develop on your local machine and use version control instead of the builtin Codingame editor!
- Using a local editor is a lot more convenient than the builtin editor.
- Having proper source control is invaluable in order to compare different versions.
- You can use profiling tools such as cargo-flamegraph in order to evaluate where the computation time is spent (see my post about Flame Graphs).
Execute the test problems locally
I extracted the 12 test problems (by writing them to the standard error) and copied them locally as text files:
# 2 states
# Expected: 322444322
20
0 6 0
2 2 2
1 6 1
I then used them as a benchmark for local development using this simple test runner:
#!/usr/bin/python3
import sys
import subprocess
import time

tests = sys.argv[1:]
max_len = max(len(test) for test in tests)
total_duration_ns = 0
global_status = "OK"
for filename in tests:
    raw_content = open(filename, "rt").read()
    lines = raw_content.split("\n")
    test_name = lines[0][1:]
    expected = lines[1].split(" ")[-1].strip()
    input = "\n".join(line for line in lines if not line.startswith("#"))
    # sys.stdout.write(input)
    a = time.monotonic_ns()
    run = subprocess.run(
        ["./target/release/spring-challenge-2025"],
        input=input,
        encoding="UTF-8",
        capture_output=True,
    )
    b = time.monotonic_ns()
    duration_ns = b - a
    total_duration_ns += duration_ns
    output = run.stdout.strip()
    ok = expected == output
    if not ok:
        global_status = "KO"
    status = "OK" if ok else "KO"
    sys.stdout.write(
        f"{filename.ljust(max_len)} {status} {str(duration_ns/1000000).rjust(12)}ms\n"
    )
sys.stdout.write(
    f"{'TOTAL'.ljust(max_len)} {global_status} {str(total_duration_ns/1000000).rjust(12)}ms\n"
)
Used as:
python3 run_tests.py tests/* > result.txt && cat result.txt
which would give the following result:
tests/test01.txt OK 0.932269ms
tests/test02.txt OK 0.661858ms
tests/test03.txt OK 0.764412ms
tests/test04.txt OK 0.704308ms
tests/test05.txt OK 0.60948ms
tests/test06.txt OK 1.043959ms
tests/test07.txt OK 4.679256ms
tests/test08.txt OK 8.575778ms
tests/test09.txt OK 8.692376ms
tests/test10.txt OK 14.885249ms
tests/test11.txt OK 24.510455ms
tests/test12.txt OK 23.78096ms
TOTAL OK 89.84036ms
I could then register a baseline with:
cargo build --release &&
python3 run_tests.py tests/* > baseline.txt &&
cat baseline.txt
and compare the current version with this baseline with:
cargo build --release &&
python3 run_tests.py tests/* > result.txt &&
cat baseline.txt &&
echo &&
cat result.txt
Appendix: evolution of the performance
Evolution of the performance based on the 12 test problems (cumulative computation duration on my machine):
Version | Duration |
---|---|
Dynamic programming algorithm, initial version | 181ms |
Dynamic programming algorithm, removed useless branch | 147ms |
Depth first exploration with memoization | 206ms |
Dynamic programming algorithm, use FNV64 hash | 148ms |
Dynamic programming algorithm, use modified FNV64 hash | 127ms |
Dynamic programming algorithm, force inline and manual loop unrolling | 105ms |
Hybrid algorithm | 297ms |
"Clever" 🙄 algorithm | 2332ms |
Dynamic programming algorithm, use FxHash | 93ms |
Depth first exploration with manual loop unrolling and FxHash | 155ms |
"Clever" algorithm, use FxHash | 2015ms |
Dynamic programming algorithm, optimize last iteration | 88ms |
These durations cannot be meaningfully compared with the final score: the test cases (and number thereof) and the machines are different.
References
- Codingame Spring Challenge 2025
- Codingame Spring Challenge 2025 Forum
- Cephalopod
- Cephalopod at BoardGameGeek
- The Rust Performance Book
- The FNV Non-Cryptographic Hash Algorithm
- FNV hash history
- rustc-hash
- Flame Graphs
- Some other post-mortems (much better than mine):
- Frictionless Bananas (1st)
- robostac (2nd)
- Karang (#12)
- skotz (14th)