You have given us the bike (>21 cube), but not a way to ride it.

Anyhow, seriously, does that increasing overhead come only from inserting values at the end of a sequential file? I understand there is a lot more work done by CS, whose computing cost depends on the previous patches. I think many of us would welcome a way to optimise that.
A suggestion: I am trying to understand the advantage of increasing the LUT level, based on machine-based speculation.
My minimum LUT level is 33, so I have based everything on comparing the outcome at 33 with the outcome at 66.
Is there any way I can get a lower minimum level?