Thanks for your reply.
I have tried almost all the points you have suggested above.
- Replaced
arr = op
with arr.swap(op)
.
- I am assuming all elements are 64 bit numbers therefore no need to find max element from the array.
- Have used single line digit extraction function
(x >> shift) & mask
.
- Have declared all data globally. Though not a "good" practice, this definitely gave better performance on my system. Also declared macro constants.
- Minimized the use of functions, therefore reducing stack build-up time.
- Base 2^16 giving me a better result on my system, therefore I am not really sure why any other base should be any better. Am I missing something?
- All data using fixed size data types.
- Have created simple test case generators and on random inputs (randomly generated around 200 testcases sum of N less than 5 x 10^7), and there has been better performance than before.
After all this, I am still facing TLE.