Agreed. There's quite a bit of room for optimization if your language design allows for it. Plus you have flexibility to make different tradeoffs as computer architectures and the cost of various operations change over time.
With Linux the stack size is a process limit, set with ulimit (default 8MB?). You can even set it to unlimited if you want, meaning that essentially (but not quite) the stack and heap grow towards each other only limited by the size of the address space.
ulimit only affects the main program stack though. If you're using multi-threading, each thread has its own stack size, which you can configure with pthreads but not portably through std::thread, which exposes no stack-size knob.
> duckdb -s "COPY (SELECT url[20:] as url, date, count(*) as c FROM read_csv('data.csv', columns = { 'url': 'VARCHAR', 'date': 'DATE' }) GROUP BY url, date) TO 'output.json' (ARRAY)"
Takes about 8 seconds on my M1 MacBook. The JSON isn't in quite the right format, but fixing that wouldn't dominate the execution time.
You’re missing the fact that the compiler isn’t forced to fill every register in the first place. If it was less efficient to use more registers, the compiler simply wouldn’t use more registers.
The actual counter proof here would be that in either case, the temporaries have to end up on the stack at some point anyways, so you’d need to look at the total number of loads/stores in the proximity of the call site in general.
> You’re missing the fact that the compiler isn’t forced to fill every register in the first place.
Temporaries start their lives in registers (on RISCs, at least). So if you have 40 live values, you can use the same single register to compute them all and immediately save all 40 on the stack, or, e.g., keep 15 of them in 15 registers and use a 16th register to compute the other 25 and save those on the stack. But if you keep them in call-invariant (callee-saved) registers, those registers have to be saved in the function's prologue, and the call-clobbered (caller-saved) registers have to be saved and restored around inner call sites. That's why academia has been playing with register windows: to get around this manual shuffling.
> The actual counter proof here would be that in either case, the temporaries have to end up on the stack at some point anyways, so you’d need to look at the total number of loads/stores in the proximity of the call site in general.
Would you be willing to work through that proof? There may very well be less total memory traffic for a machine with 31 registers than with 16; but it seems to me that there should be some local optimum for the number of registers (and their clobbered/invariant assignment) that minimizes stack traffic: four registers is way too few, but 192 (there have been CPUs like that!) is way too many.
32 registers have been used in many CPUs from the mid-seventies until today, i.e. for half a century.
During all this time there has been a consensus that 32 registers are better than fewer.
A few CPUs, e.g. SPARC and Itanium, have tried to use more registers than that, and they have been considered unsuccessful.
There have been some inconclusive debates about whether 64 architectural registers might be better than 32, but the fact that 32 are better than 16 has been pretty much undisputed. So there is a good chance that 32 is close to the optimal number of GPRs.
Using 32 registers is new only for Intel and AMD, due to their legacy ISA, but it is something that all of their competitors have used successfully for many decades.
I have written many assembly-language programs for ARM, POWER and x86. Whenever I had 32 registers, it was much easier for me to avoid most memory accesses, whether for spilling or for other purposes. It is true that a compiler is dumber than a human programmer and frequently has to adhere to a rigid ABI, but even so, I expect that on average even a compiler will succeed in reducing register spilling when it has 32 registers.
Started on a dependency-free (including manual object-file creation, but not manual linking) single-pass C compiler with the goal of it being self-hosting. Spawned after my previous project (a single-pass Wasm JIT) started to plateau a bit and I wanted to start something more "full-stack compiler"-y.
> High level languages lack something that low level languages have in great abundance - intent.
Is this line really true? I feel like expressing intent isn't really a factor in the high level / low level spectrum. If anything, more ways of expressing intent in more detail should contribute towards them being higher level.
I agree with you and would go further: the fundamental difference between high-level and low-level languages is that in high-level languages you express intent whereas in low-level languages you are stuck resorting to expressing underlying mechanisms.
I think this isn't referring to intent as in "calculate the tax rate for this purchase" but rather "shift this byte three positions to the left". Less about what you're trying to accomplish, and more about what you're trying to make the machine do.
Something like purchase.calculate_tax().await.map_err(|e| TaxCalculationError { source: e })?; is full of intent, but you have no idea what kind of machine code you're going to end up with.
Maybe, but from the author's description, it seems like the interpretation of intent that they want is to generally give the most information possible to the compiler, so it can do its thing. I don't see why the right high level language couldn't give the compiler plenty of leeway to optimize.
Been working on a single-pass WebAssembly JIT compiler (atm only guaranteed to work on an M1 Air); it's been super interesting learning and thinking through strategies to balance fast compilation and fast generated code. It's the fastest single-pass JIT output-wise that I'm aware of, but the compilation speed could use some work.
For the average developer I think the killer feature is letting you query whatever data you want (CSV, JSON, Parquet, even Google Sheets) as equals, directly in their file form - you can even join across them.
It has great CSV and JSON processing, so I find it's almost better thought of as an Excel-like tool than as a database. Great for quick analysis and exploratory work. Example: I need to do some reporting mock-ups on Jira data; DuckDB sucks it all in (or queries exports in place) and makes it easy to clean, filter, pivot, etc., then export to CSV.
If you're developing in the data space you should consider your "small data" scenarios (ex: the vast majority of our clients have < 1GB of analytical data; Snowflake, etc. is overkill). Building a DW that exists entirely in a local browser session is possible now; that's a big deal.
Distributing the dispatch instructions lets the predictor pick up on specific patterns within the bytecode - for example, a comparison instruction is probably followed by a conditional branch instruction.