Jump to content


Photo

Massive Pre-Computed ALU Tables for "Gaming Mode"


  • This topic is locked This topic is locked
28 replies to this topic

#1 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 03 March 2012 - 08:27 AM

Massive Pre-Computed Tables on the ALU.

Thesis
My idea is to give the ALU a several gigabyte live flash drive containing multiple tables. These tables will be referenced during "Gaming Mode", when mathematical precision is not as important as responsive interaction with the game. The idea is also to include a Gaming Switch, a boolean, which controls wether the ALU computes it's functions on demand (false), or grabs them from the pre-computer table (true). It is expected that the tables are much, much less precise than computing the value, but also a fraction faster.

Memory Addressing
The tables in this flash drive would not be indexed, but rather, each result given 2 bytes. To find a value in the table representing the answer when a and b are given as inputs, you would jump to position a*100000+b, which would be directly outputted with no further checking.

Example Case, Square Root
Square Roots are used extensively in distance checking between objects for purposes of collision detection, pathfinding, culling, and many other cases. This implies that a tiny, tiny speed improvement would have drastic effects on the performance of any program using square roots. Suppose we were presented with the number 340.67, representing the distance between a player and an item. We need to decide whether to send that item to the video card for rendering to the screen, a process known as culling. Currently, the square root must be calculated beforehand to several decimal places. However, in a video game, such precision is not needed. What would happen in "Gaming Mode" is the number 340.67 is rounded to 340.6, multiplied by 10 to 3406, multiplied by the amount of bytes of the values in that table (in this case, 2 bytes per value), added to the offset of the first byte of the relevant table in the flash drive, then the index at that position is returned to the calling process. The value at that address is 58.36, which is what the process receives from the ALU.

Intention
My intention is that, when precision is not required, the ALU can do it's SQRT, SINE, LOGARITHM, and even EXPONENTS calculations simply by relaying the value stored in pre-computer tables. This results in considerably faster all-round performance, not only in Video aspects, but in Sound, Logic, File System, and basic GUI aspects of computing.

Terminology
ALU: Arithmetic Logic Unit
Table: 1D, 2D, or 3D array of values where any 1, 2, or 3 coordinates return a single result

Credit and Citations
Currently, this is the work of Daniel Toye alone. Citations will be added if any edits are made as a result of someone's critiques or observations.

Edited by time4dan, 03 March 2012 - 08:29 AM.

  • 0

#2 kibblesbob

kibblesbob

    "Xarrot Studios"

  • GMC Member
  • 215 posts
  • Version:GM:Studio

Posted 03 March 2012 - 08:41 AM

If this is saved in a text file, this will end up making it slower than the actual function calls for the arithmetic you just mentioned. GM's file I/O is quite slow.
  • 0

#3 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 03 March 2012 - 09:03 AM

This isn't in GM, this is strictly hardware talk, but it would affect all of Game Design as we know it.
  • 0

#4 kibblesbob

kibblesbob

    "Xarrot Studios"

  • GMC Member
  • 215 posts
  • Version:GM:Studio

Posted 03 March 2012 - 09:36 AM

Definitely could work then :P Sounds like a straightforward, simple speed test is the next logical step!
  • 0

#5 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 03 March 2012 - 06:01 PM

I could get a "mock" Game Maker version, in which I write a function that calculates the SQRT to 5 decimal places, then a function that looks it up to 1 decimal place.

Additionally, there could even be a hybrid, if the speed of reading is fast enough, as algorithms for finding the SQRT are just guessing closer and closer, so you could look it up in the table and then continue to refine your guess to however many spaces you need. It gives you a head start.
  • 0

#6 Zeop

Zeop

    GMC Member

  • New Member
  • 192 posts

Posted 03 March 2012 - 08:41 PM

Not to be mean or anything, but you are underestimating how slow a look up in a offloaded table, such as the one you are suggestion, would be.
Especially when loading data from a USB. The latency in doing such a look up will be many times slower than it takes to compute the square root, sine, logarithm or whatever function you want to precompute. Even if the table is located on the harddrive, the lookup time will be much greater than the compute time.

Precomputed tables is not exactly a new thing. In the complex and confusing world of c++, it is though the means of template metaprogramming actually possible to precompute every computable result from of function. A functionality which can greatly increase performance, at the cost of a large memory footprint, and compile time. Also, please note that for each decimal of precision the needed memory to store the computations increases by a factor of 10. So it is done with care, and only in cases where optimization is taken to the extreme.

So, for simple functions it does not make sense to have a precomputed table.
Also, if you want performance, then game maker is most likely not for you.
  • 0

#7 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 04 March 2012 - 01:20 AM

I'm not sure if you read my post or not, Zeop, but regardless:

I never said USB, I said Flash drive, one which is attached directly to the ALU.
The ALU calculates the pointer address and then that address is directly sent from it's drive to the calling CPU.

It is on no way "offloaded".

I also mentioned a couple times the decimal precision problem, which is why it only works for Games which do not need high precision but do need high performance. This is why I suggested it's used to 1 decimal point.

Finally, if I didn't know about pre-computed tabled, how would I come up with this idea?

It's not about Game Maker, it's about all Games, whether in GML or C++. I believe everything where performance is required could benefit from low-precision high-performance tables. The only question is wether constructing the pointer address and routing that value from a tiny drive to the cpu would be even minimally faster than computing the root manually.

A much easier alternative to this idea is to have a mode in which the ALU can drastically lower it's precision for difficult functions like exponents and logarithms.
  • 0

#8 iam1me

iam1me

    GMC Member

  • GMC Member
  • 380 posts
  • Version:GM8

Posted 04 March 2012 - 08:53 AM

Attaching such a memory device to an ALU seems like it would be a horrible idea to me, no offense. It would dramatically increase the minimum latency for the CPU, and since most CPUs today implement a RISC architecture the result would be the slowing down of the entire CPU. Thus, in order to provide the marginal speed up for simple calculations you would be slowing down all functions of the CPU - the result being an overall slower computer/game.
  • 0

#9 Zeop

Zeop

    GMC Member

  • New Member
  • 192 posts

Posted 04 March 2012 - 01:30 PM

A much easier alternative to this idea is to have a mode in which the ALU can drastically lower it's precision for difficult functions like exponents and logarithms.


The ALU does not compute any of these functions. The ALU handles binary arithmetic and logic, nothing else. Regardless, flash drives also have much higher latencies for what you are suggesting. Your suggestion might make sense if you keep your table somewhere where you have fast access. Keeping it in RAM is a solution. Older algorithm exists that make use of precomputed tables, they even gave a much higher precision by making use of the properties of these functions. They are superseded by just computing the value, as the performance difference is negligible, also carrying around huge tables is not really desired by anyone.

Besides most of these functions are already approximating precision by iteration. You can implement your own functions that deliver a much lower precision if you so desire, but you will have to do some lowlevel tinkering, as most of these algorithms are handwritten in ASM to maximize performance. It would probably be the better solution.

Lastly, when are you interested in lower precision, other than in some very specific situations? Lower precision leads to some very hard to track down problems. Take you have a specific method for computing a value, but because the precision in some functions is to low, you are getting the wrong result. How will you track down what is the cause of this problem? Everything works as intended, but you are not getting the right result.

Edited by Zeop, 04 March 2012 - 01:39 PM.

  • 0

#10 chance

chance

    GMC Member

  • Global Moderators
  • 7506 posts
  • Version:GM:Studio

Posted 04 March 2012 - 07:49 PM

Square Roots are used extensively in distance checking between objects for purposes of collision detection, pathfinding, culling, and many other cases. This implies that a tiny, tiny speed improvement would have drastic effects on the performance of any program using square roots.

No, that doesn't follow. If the program was mostly square-root calculations, then you might get some enhancement.

But not for "any program using square roots". That's typically not what determines the speed limit.

My intention is that, when precision is not required, the ALU can do it's SQRT, SINE, LOGARITHM, and even EXPONENTS calculations simply by relaying the value stored in pre-computer tables. This results in considerably faster all-round performance...

Again, no. Not unless the program's rate-limiting step is a math calculation. I suppose if your application calculates Julia sets, or does massive numerical integrations, then a look-up approach might provide some speed enhancement. But for typical games, this isn't what limits the speed.

It's like trying to improve your gas mileage by turning off the radio.

.

Edited by chance, 04 March 2012 - 07:51 PM.

  • 0

#11 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 05 March 2012 - 08:18 PM

@iam1me, I realise that any kind of memory device retrieval may be slower than calculating it by hand, but I'm not going to lie, this idea is in a world where a single pointer lookup is quicker than an exponent calculation.

@Zeop, that's exactly what I proposed as an alternative a few posts back. Instead of re-iterating the root series 8 times, iterate it 4 times.

@Chance:

No, this is for games where performance is less than perfect. For example, I can run Crysis on Low, but not on Medium, because each frame takes more time than was allotted to it to render.

For a typical game we avoid the laggy things by limiting the number of objects. If you could have 20,000 physical bodies of water in GM, this could push it to maybe 25,000. Or, if 20,000 was lagging, this could make it run smoothly.
  • 0

#12 NakedPaulToast

NakedPaulToast

    GM Studio/Mac/Win

  • GMC Member
  • 8298 posts
  • Version:GM:Studio

Posted 06 March 2012 - 12:06 AM

Completely unreasonable.

You can't just attach storage to a CPU or ALU. They need to access the storage through the bus which takes a huge amount of time relative to how quickly storage can be accessed within the core. And you most certainly can't put such large tables within the core, there's a reason processors cores have memory measured in Megabytes or less.

If it were reasonable to have large amounts of fast storage at a CPU or ALU's disposable than what we know as RAM would be displaced by huge memory within the cores.

Edited by NakedPaulToast, 06 March 2012 - 12:11 AM.

  • 0

#13 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 06 March 2012 - 12:11 AM

It would only take 2 gigs or so, similar to a thumbnail size USB drive. Additionally, it wouldn't be going through the bridge, there would be no point as nothing else should be able to access that table.

I think I'll settle for one that can be put into RAM.
  • 0

#14 NakedPaulToast

NakedPaulToast

    GM Studio/Mac/Win

  • GMC Member
  • 8298 posts
  • Version:GM:Studio

Posted 06 March 2012 - 12:13 AM

It would only take 2 gigs or so, similar to a thumbnail size USB drive. Additionally, it wouldn't be going through the bridge, there would be no point as nothing else should be able to access that table.

I think I'll settle for one that can be put into RAM.


Well if you could put two gigs on a core, then why aren't they putting gigabytes on a core?
  • 0

#15 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 06 March 2012 - 12:16 AM

Because it's not necessary, a core doesn't use a lot of un-bridged memory.

It's like the Graphics Card memory, it's not bridged because nothing else accesses it.
  • 0

#16 NakedPaulToast

NakedPaulToast

    GM Studio/Mac/Win

  • GMC Member
  • 8298 posts
  • Version:GM:Studio

Posted 06 March 2012 - 12:22 AM

Because it's not necessary, a core doesn't use a lot of un-bridged memory.

It's like the Graphics Card memory, it's not bridged because nothing else accesses it.


Are you kidding me?

One of the main reasons that CPU's have increased in performance over the years is because of cached memory. Basically program and data are moved from RAM to cached memory and then executed at speeds that are ten-fold faster. Take a historical look at the increase in size of cache memory over the years on CPUs.

If it were reasonable to put 2 GIG of storage on a processor than it would be done, and it would be used in a similar capacity to how existing cached memory could be used. Instead of wasting one of the most expensive types of memory on something that would be rarely if ever used.

This topic is absolutely silly.
  • 0

#17 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 06 March 2012 - 12:36 AM

This topic is just looking toward a time where that 'amazing' memory is cheap enough. The main question was, would it be faster to have a single pointer lookup, or to take several iterations to calculate the answer.
  • 0

#18 icuurd12b42

icuurd12b42

    Self Formed Sentient

  • GMC Elder
  • 16059 posts
  • Version:GM:Studio

Posted 06 March 2012 - 01:44 AM

This reminds me of the old tricks in the late 80's early 90's

instead of t = sin(degtorad(angle));

t = sins[angle mod 360]
or
t = sins[angle mod 256]
when having gone to the extreme of using a char for degree, where the mod 256 would not be needed in c if using a unsigned char
t = sins[angle]

populating the preprocessed array
for(int i = 0; i<255; i++);
{
sins[i] = sin(degtorad((float)i/256.0f * 360.0f));
}
  • 0

#19 NakedPaulToast

NakedPaulToast

    GM Studio/Mac/Win

  • GMC Member
  • 8298 posts
  • Version:GM:Studio

Posted 06 March 2012 - 01:56 AM

This reminds me of the old tricks in the late 80's early 90's

instead of t = sin(degtorad(angle));

t = sins[angle mod 360]
or
t = sins[angle mod 256]
when having gone to the extreme of using a char for degree, where the mod 256 would not be needed in c if using a unsigned char
t = sins[angle]

populating the preprocessed array
for(int i = 0; i<255; i++);
{
sins[i] = sin(degtorad((float)i/256.0f * 360.0f));
}


Now, as it was in the 80's or 90's, if there is performance to be gained by using lookup tables instead of calculating, it belongs in code and some frankensteinian flash drive glued on to the processor.
  • 0

#20 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 06 March 2012 - 01:59 AM

I agree. It should be. But a lot of games don't do pre-computed tables.

I just want there to be a way that I can turn it on, wether it be in ram or on the ALU. I want the speed boost. I don't care if things suddenly start getting culled 3 pixels closer than expected, or if bullets miss the mark by a pixel, the speed is worth it.
  • 0

#21 icuurd12b42

icuurd12b42

    Self Formed Sentient

  • GMC Elder
  • 16059 posts
  • Version:GM:Studio

Posted 06 March 2012 - 02:33 AM

Correct me if I'm wrong but I think some PC architecture (or programming libraries) had LUT for some math features a long time ago but I think this became obsolete when the math co-processor was faster calculating results than looking up the table...

It is certain that in the case of GM7 and higher a LUT could improve performance especially since the loss of performance from using doubles instead of singles for numbers.
  • 0

#22 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 06 March 2012 - 03:34 AM

Exactly what I was thinking, a sub-series of Math functions with "quick_" prepended to them.

quick_sqrt(10) is very VERY slightly faster than sqrt(10) because it uses a lookup table.
  • 0

#23 icuurd12b42

icuurd12b42

    Self Formed Sentient

  • GMC Elder
  • 16059 posts
  • Version:GM:Studio

Posted 06 March 2012 - 04:03 AM

Exactly what I was thinking, a sub-series of Math functions with "quick_" prepended to them.

quick_sqrt(10) is very VERY slightly faster than sqrt(10) because it uses a lookup table.



It would be faster if you did not wrap it around a script, access the LUT array directly

LUTsqrt[10]
  • 0

#24 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 06 March 2012 - 04:08 AM

This is true. I noticed GM has absolutely no global arrays at all.. Could be cool. I only suggested a function because you could internally remove the decimal or linearly tween between the two integers around that decimal. If users have to wrap a floor() around it so that the index has no decimal place, it'll lose most of it's speed benefit.
  • 0

#25 icuurd12b42

icuurd12b42

    Self Formed Sentient

  • GMC Elder
  • 16059 posts
  • Version:GM:Studio

Posted 06 March 2012 - 04:13 AM

globalvar LUTsqrt;

LUTsqrt[0] = 0;
LUTsqrt[1] = 1;
LUTsqrt[2] = 1.412;


But the moment you need to tweak the argument like you mentioned you may in effect cancel all benefits.
  • 0

#26 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 06 March 2012 - 04:18 AM

Maybe use the un-used @ symbol?

sqrt@4.5

Then the parser could know what to expect and how to handle it without the overhead of a function.
  • 0

#27 icuurd12b42

icuurd12b42

    Self Formed Sentient

  • GMC Elder
  • 16059 posts
  • Version:GM:Studio

Posted 06 March 2012 - 04:42 AM

If we are strictly speaking of a GM implementation as proof of concept; like my previous post imply, any tweaking could cancel out the benefits, that would include

-Having the functionality in a script (somewhat)
-Having code to cap the values (mostly)
-Having to parse a string (definitely)


The interpreted nature of GML makes it so that the more code to achieve this, the slower it will be

the fastest means to the data in through an array, and GM is smart enough to accept floating points as value (no floor() needed)


So, minimally, to implement this in a safe way, you would need a script
value = 10;
x = LUTFunction(value);


//LUTFunction
//this for rotary type data like sin/cos angle
if(argument0<0) return LUTarray[argument0 mod numLUTentries + numLUTentries];
return LUTarray[argument0 mod numLUTentries];


an unsafe but fastest way would be
value = 10;
x = LUTarray[value];
but now the user needs to ensure his value is compatible; which is not a big deal when you are cautious and conscious about the danger
  • 0

#28 DanRedux

DanRedux

    GMC Member

  • GMC Member
  • 1403 posts
  • Version:GM:Studio

Posted 06 March 2012 - 04:45 AM

This is why I suggested it be a modification to the actual parser. The @ symbol is currently unused, but could be used to access LUT's quickly and understandable.

sqrt@4.5 literally means "Square Root at 4.5". The parser could detect the @ symbol and find the right-hand-side, parsing it quicker than a function and in fact quicker than an array (due to no internal map lookup).

It's semantic, could be very fast, and would add almost no space or startup time to the GM runner.
  • 0

#29 paul23

paul23

    GMC Member

  • Global Moderators
  • 3864 posts
  • Version:GM:Studio

Posted 06 March 2012 - 03:41 PM

Actually many games do preprocessing (either during map-load or during creation of the game). For example in RTS games the map is generally floodfilled, resulting in each "island" getting it's own key.

This results in the computer being much faster when calculating paths. - It knows already that if the position keys is different no path is possible. (Making the worst case faster).

However precalculating numbers is just silly, you're thinking about micro optimizations those will never give results close to macro optimizations like using a better sorting algorithm etc etc. In GM it's even worse: not the function itself often takes time, but the parser, so the fact functions are there does take time. The gain will simply always be neglectable, and before you even think about these you have to use a profiler.
  • 0




0 user(s) are reading this topic

0 members, 0 guests, 0 anonymous users