Do some profiling and find out where the bottleneck is. Here is a simple example that can be used for TMUL:
from time import time
size = 1000 # change it to 10000
print "size", size
line = "1234567890"*size+" "+"1234567890"*size
t0 = time()
s, t = line.split()
print "split", time()-t0
t0 = time()
a, b = int(s), int(t)
print "int ", time()-t0
t0 = time()
c = a * b
print "mul ", time()-t0
t0 = time()
out = str(c)
print "str ", time()-t0
print "len ", len(out)
On my (very old) machine the results are as follows:
size 1000
split 0.000113964080811
int 0.102551221848
mul 0.0254321098328
str 0.422478914261
len 19999
size 10000
split 0.0239429473877
int 4.40381789207
mul 0.475347995758
str 46.4313352108
len 199999
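That was my starting point: the conversions int() and str() cost far more than the multiplication itself, and str() is by far the most expensive step.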
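If you want to repeat the measurement for several sizes, the same steps can be wrapped in a function. This is only a sketch and it rests on some assumptions: it is written for Python 3 (the listing above is Python 2), the name profile is my own, and it uses perf_counter instead of time. On Python 3.11 and later, int/str conversions are capped at 4300 digits by default, so the sketch lifts that cap where the function for it exists.

```python
import sys
from time import perf_counter

# Python 3.11+ caps int <-> str conversions at 4300 digits by default;
# lift the cap where the function exists so large sizes can be measured.
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)


def profile(size):
    """Time each stage for two factors of 10*size digits.

    Returns (timings, out): a dict of seconds per stage and the product
    as a decimal string. The name `profile` is hypothetical.
    """
    line = "1234567890" * size + " " + "1234567890" * size
    timings = {}

    t0 = perf_counter()
    s, t = line.split()
    timings["split"] = perf_counter() - t0

    t0 = perf_counter()
    a, b = int(s), int(t)
    timings["int"] = perf_counter() - t0

    t0 = perf_counter()
    c = a * b
    timings["mul"] = perf_counter() - t0

    t0 = perf_counter()
    out = str(c)
    timings["str"] = perf_counter() - t0

    return timings, out


if __name__ == "__main__":
    for size in (1000, 10000):
        timings, out = profile(size)
        print("size", size, "len", len(out))
        for stage in ("split", "int", "mul", "str"):
            print("  %-5s %.6f" % (stage, timings[stage]))
```

The absolute numbers will differ from mine, since they depend on the machine and the interpreter version. What matters is how each stage grows when the size goes up by a factor of 10.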
Be aware that any number you want to output has to be converted to a string first (at least I know of no other technique).
Of course, you do not have to do the conversion explicitly, as I did in the example, but if you don't, it is done implicitly. So avoiding the explicit conversion to a string and using "print c" instead doesn't help.
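A small check of the implicit conversion, again as a Python 3 sketch (an assumption, since the listing above is Python 2). The value of c and the in-memory sink are just illustrative choices. Printing the integer directly yields exactly the text that str() produces, plus a newline, so the conversion work is still done:

```python
import io

c = 1234567890 ** 20       # a moderately large integer (illustrative value)

explicit = str(c)          # explicit conversion

sink = io.StringIO()       # in-memory stand-in for stdout
print(c, file=sink)        # print converts c to a string internally
implicit = sink.getvalue()

# Both paths produce the same digits; print only appends a newline.
assert implicit == explicit + "\n"
```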