That's awesome. It's not easy to get seniors to spend that much time in the gym, so well done if you helped motivate her! Physical exercise is indeed one of the few Parkinson's treatments there's little doubt about.
In terms of reversing the damage, look into photobiomodulation (PBM), aka red light therapy. In "Improvements in clinical signs of Parkinson's disease", Liebert et al. (2021) show improvement in all symptoms including cognition, which is one of precious few such results I've found in my extensive search of the literature[1]. Caveats are that this was a proof-of-concept study with n=6 only, and that the red light helmet is somewhat expensive if you want to try it. There's a Canadian company that makes one for over $2,000 (a modern, very sci-fi thing), and a more hackerish version from an Australian company for ~$700 (pairs of diodes on aluminum bands you have to finish assembling yourself).
[1]: At least among those that are actionable for the public. There are tons of trials, but good luck getting in early unless you can donate a new library to the university! Gene therapy, stem cell therapy, GDNF, and drugs that target alpha-synuclein are all promising but not yet accessible. PBM is something you can do today, and since mitochondrial dysfunction is a leading hypothesis in the pathogenesis of PD, the treatment fits.
We already do tree searches: see beam search and “best of” search. It’s arguable whether that counts as a “clever” tree search, but it’s not entirely unguided either, since you prune the tree based on factors like perplexity, which measures how probable/plausible the model rates a branch as it stands so far.
In beam search you might keep the top n branches at each token generation step. Best-of is in a sense the same, but you take many steps using regular sampling before pruning down to a single winner at the end.
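To make the difference concrete, here’s a toy sketch in Python (not anyone’s production decoder; step_logprobs is a made-up stand-in for a model’s next-token distribution):

    import math
    import random

    def step_logprobs(prefix):
        # placeholder stand-in for a language model's next-token log probs
        return {t: math.log(p) for t, p in {"a": 0.5, "b": 0.3, "c": 0.2}.items()}

    def beam_search(n_beams=3, steps=5):
        beams = [([], 0.0)]  # (tokens, cumulative log prob)
        for _ in range(steps):
            candidates = []
            for tokens, score in beams:
                for tok, lp in step_logprobs(tokens).items():
                    candidates.append((tokens + [tok], score + lp))
            # prune: keep only the n most probable branches at every step
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:n_beams]
        return beams

    def best_of(n_samples=8, steps=5):
        finished = []
        for _ in range(n_samples):
            tokens, score = [], 0.0
            for _ in range(steps):
                dist = step_logprobs(tokens)
                tok = random.choices(list(dist), weights=[math.exp(lp) for lp in dist.values()])[0]
                tokens, score = tokens + [tok], score + dist[tok]
            finished.append((tokens, score))
        # prune only once, at the end: keep the sample the model rates highest
        return max(finished, key=lambda b: b[1])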
To an extent, but memory bandwidth soon becomes a bottleneck there too. The hidden state and the KV cache are large so it becomes a matter of how fast you can move data in and out of your L2 cache. If you don’t have a unified memory pool it gets even worse.
Your triton code is great, nice work. Wouldn’t feel too bad about spending your time that way!
As it happens I was also thinking it might be worthwhile to dive into the Triton sources but for another reason: half2 arithmetic. That’s one thing that the Triton branch lost that the (faster) CUDA kernels had and I think it made a difference. In theory with compatible hardware you can retire twice as many ops per second when processing float16 data which we are in this case.
Can’t see anyone having tried to get half2 to work with Triton though.
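For concreteness, a minimal fp16 Triton kernel looks something like this (a generic vector add, not the kernels in question); the thing to check would be whether the compiler actually emits packed f16x2 instructions in the generated PTX for loads and arithmetic like these:

    import triton
    import triton.language as tl

    @triton.jit
    def add_fp16_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        # x and y keep the dtype of the input tensors (float16 here); whether
        # this add is lowered to packed half2 ops is up to the compiler and
        # is what you'd have to verify by dumping the PTX.
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)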
For some workloads, it’s almost all about the VRAM. In those cases I’ve been wondering if a high-memory M1 or M2 Mac could be a good lab machine thanks to unified memory. It runs more quietly, uses significantly less power, and there’s no worry about overloading your electrical circuit. On a 128 GB Mac Studio you could theoretically run or even train models that would otherwise require multiple $6k A6000 GPUs in custom machine builds taking oodles of power at the plug. It’d be slow, but slow beats not possible. And if you need a new development machine anyhow, you can justify some of that beefy Mac Studio’s cost as part of your required spend. PyTorch has supported “mps” as a target device for some time now.
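For reference, device selection is just this (nothing model-specific assumed here):

    import torch

    # Prefer the Apple-silicon GPU via the "mps" backend when it's available,
    # otherwise fall back to CUDA or CPU.
    if torch.backends.mps.is_available():
        device = torch.device("mps")
    elif torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    model = torch.nn.Linear(4096, 4096).to(device)
    x = torch.randn(8, 4096, device=device)
    print(model(x).shape, device)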
Within a closed system, consistency (and verifying it) is a fail fast mechanism. For example, it’s better to crash on a constraint failure when attaching a doodad to a non-existent user account than to figure out where all these orphan doodads came from next year.
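A minimal sketch of that idea with a hypothetical users/doodads schema (SQLite via Python here, but any database with foreign keys behaves the same): the constraint makes the bad insert blow up immediately instead of quietly creating an orphan.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite ships with FK checks off
    con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
    con.execute("""
        CREATE TABLE doodads (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id)
        )
    """)

    con.execute("INSERT INTO users (id) VALUES (1)")
    con.execute("INSERT INTO doodads (user_id) VALUES (1)")       # fine
    try:
        con.execute("INSERT INTO doodads (user_id) VALUES (42)")  # no such user
    except sqlite3.IntegrityError as e:
        print("failed fast:", e)  # FOREIGN KEY constraint failed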
It seems like you’re making a semantic argument to equate the Ethereum network with its validators. That seems confusing. Here are some examples of how “A runs B” does not imply “A == B”:
“Employees” are part of a company, they run the company, don’t they? Yet the cost of having employees is not “therefore revenue” from the standpoint of the company.
“Drivers” are a part of Uber, they deliver the service, don’t they? Yet the money paid to drivers reduces Uber’s profits.
I think where your argument runs into trouble is “from the standpoint of the network”. If you want to equate the network and its validators, to say they are the same thing, then your sentence becomes, “The money [the validators] get paid [by the validators] is therefore revenue, from the standpoint of [the validators]”. That’s nonsensical: you can’t give yourself money and call it revenue. Either these two things are in fact not the same thing, and we can analyse the cashflow of “Ethereum the network” separately from “the validation service providers”, in which case Ethereum is paying out less than it’s taking in, so it is profitable. Or they are the same thing, in which case the “profit”, to the extent a virtual entity like a network can have such a thing, is even higher.
This is because whatever costs the validators bear are less than the ETH they receive is worth. This is true if we assume validators are rational actors (they wouldn’t validate if they were losing money doing so). And even if we take away the assumption that they are profit motivated (maybe they’re all doing it as charity work for some higher purpose), the cost of running an Ethereum validator is tiny, so we end up in the same place: outgoings are smaller than receipts when considering the whole.
(The fact that Ethereum the network “burns” its receipts and then “mints” its outgoings to the validators does not affect this calculation since it’d work out the same if Ethereum paid validators from fees directly.)
Seriously, this isn't rocket science. Ethereum provides a service, namely it stores data and does some computations in exchange for a fee. Ethereum users pay a fee and in return they have their computations done. By definition, Ethereum is profitable if and only if the fees that are paid by its users exceed the costs that are incurred by whoever is in charge of doing the computations and keeping the network running. (I thought these were the "validators" but I might have got the terminology wrong, apologies if this is the case.) Therefore we need to know, on one hand, the total amount of fees paid by the users and, on the other hand, the total amount of costs incurred by the network (i.e. by all the entities that do the computations and run the network), over a particular time frame.
The disagreement here seems to be "how relevant are the real-world costs and profitability borne by ethereum node operators and validators to the overall 'costs' of running ethereum"?
You seem to think that is the be-all and end-all for measuring the profitability of the network. And you aren't entirely wrong. If it requires too much hardware, too much internet bandwidth, too much in the way of skilled node operators relative to the value the folks running the nodes would gain, the number of participants would dwindle and the physical network would suffer.
So there are boundaries that real-world costs impose on the operation of the network. But at what level do those boundaries kick in? Well, that is why it's been an important value to the developers that a node can be run on 'commodity hardware and internet'. You can run a full node on a standard PC from the last 5 years with 16 GB of RAM (less in some configurations), a 2 TB SSD (or 1 TB if you don't mind some downtime every few months), and a modest internet connection, and it can be done by anyone with some basic command-line skills.
Because of those modest demands, I can and do run a non-validating node on old hardware I already owned, on an internet connection I would be paying for anyway. I make $0 from doing so, but it interests me as a hobby because I want non-intermediated access to the network. In contrast, some large node service providers have immense costs because they host on cloud services and hire expensive SREs to keep things running at a high reliability level. But they wouldn't be spending that money if they didn't see some kind of profit or value in it. Because of that variability and the low baseline to get started, whether it is real-world profitable for any particular participant is irrelevant to "ethereum" as a whole.
So, from my view, it becomes reasonable to look at it as "how much ether is created vs how much is burned" to see if "ethereum" as a whole is profitable.
I don't know if it's relevant, but I think it's interesting, to me at least as an economist, to ask these questions. What you do with this information is up to you.
Whatever that number is, it will be equal to or less than the number already discussed. The network “hires” contractors to provide the services you mentioned, and it pays a known figure for that. Not much else to it, really. Since all we are discussing in this thread is whether the network is profitable or not, we don’t need to dig into a more specific analysis of the service providers’ internal costs (and indeed that would be difficult, since they are globally distributed with different attendant costs and efficiencies). It’s sufficient to note that they are unlikely to be making a loss themselves.
No, the costs incurred in providing a service are exactly what needs to be quantified in order to determine whether the provision of that service is profitable. If you insist that the contractors must be excluded from the analysis (for some reason), then you have to admit the possibility that the network is being subsidised by the contractors (as would occur if they were operating at a loss), at which point the entire concept of profitability of the network becomes rather meaningless. So you can't exclude the contractors. And you can't simply assume that contractors are unlikely to be making a loss either, because that's exactly the question that we're asking.
Yeah, the costs are the tokens it pays to validators. I'm not sure what you don't understand about that. Ethereum the network takes payment for transactions and pays validators for validating. It is profitable because it currently takes in more in transaction payments than it pays out to validators. All of this is on the sites I mentioned.
How could validators possibly be revenue? Maybe it helps if you visualize them as contractors who do a job for the Ethereum network, and get paid for that. How the contractor manages their own budget is irrelevant to Ethereum.
Good heavens. The contractors are the network. If you leave the contractors out there's nothing left. There's no network. The network doesn't take payments. Contractors do. Other than that of contractors, there is no economic activity. In this view, the network is neither profitable nor unprofitable, since the entire concept of 'profitability' refers to an economic activity.
I wanted to provide a clever insight here saying they should bring some more solar panels, but unless my back-of-the-napkin calculations are totally off, that would work poorly. To begin with, there’s very little sunlight at the south pole and panel efficiency is about 10% of normal. Admittedly I’m only glancing through this paper, but an installation generating an average continuous 2.5 kW would have a total cost of about $250k [1].
The good news is that if you did do that, 2.5 kW would let you heat a significant amount of water per hour, even from near-freezing to shower temperature: roughly a US gallon every 4 minutes, from 0 °C to 35 °C.
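Rough check of that figure (assuming all 2.5 kW ends up in the water, and ignoring the extra energy needed to actually melt ice first):

    power_w = 2500.0     # continuous output of the hypothetical array
    gallon_kg = 3.785    # mass of one US gallon of water
    c_water = 4186.0     # specific heat, J/(kg*K)
    delta_t = 35.0       # 0 degC -> 35 degC

    energy_j = gallon_kg * c_water * delta_t       # ~554 kJ
    minutes = energy_j / power_w / 60
    print(f"{minutes:.1f} minutes per US gallon")  # ~3.7, call it 4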
It does say there that they are training it on the Pile dataset. And they have this bit comparing inference with GPT2-XL:
RWKV-3 1.5B on A40 (tf32) = always 0.015 sec/token, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M
GPT2-XL 1.3B on A40 (tf32) = 0.032 sec/token (for ctxlen 1000), tested using HF, GPU utilization 45% too (interesting), VRAM 9655M
So it looks about twice as fast for inference while using only about 80% as much VRAM. Obviously at such a small size, just 1.5B, you can run it even on consumer GPUs, but you could do that with GPT2 as well. If it stays at 80% of the VRAM usage when scaled up, we’re still talking 282 GB once it’s the size of BLOOM with 176B parameters. So yeah, still 8x A100 40 GB cards, I guess. Not going to be the Stable Diffusion of LLMs.
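The arithmetic behind that estimate, for reference:

    # Scale a BLOOM-sized parameter count by the relative VRAM usage
    # observed at 1.5B (7823M vs 9655M from the numbers above).
    bloom_params = 176e9
    bytes_per_param = 2            # fp16/bf16 weights
    relative_vram = 7823 / 9655    # ~0.81

    vram_gb = bloom_params * bytes_per_param * relative_vram / 1e9
    print(f"{vram_gb:.0f} GB")     # ~285 GB, same ballpark as the ~282 GB above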
The RWKV model seems really cool. If you could get transformer-like performance with an RNN, the “hard coded” context length problem might go away. (That said, RNNs famously have infinite context in theory and very short context in practice.)
Is there a primer on what RWKV does differently? According to the GitHub page, the key seems to be multiple channels of state with different decay rates, giving, I assume, a combination of short- and long-term memory. But isn’t that what LSTMs were supposed to do too?
There's already research that tries to fix this problem with transformers in general, like Transformer-XL [1]. I'm a bit puzzled that I don't see much interest in getting a pre-trained model out that uses this architecture; it seems to give good results.
My understanding is that RNNs aren't worse than Transformers per se, they are just slower to train, because Transformers use the GPU much more efficiently, i.e. much more of the work can be run in parallel.
Yes it is. They were developed to fix the vanishing gradient problem.
The 1997 paper where they were introduced puts it like this:
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
I can confirm it from what we’re seeing on a video prediction task. Future frames end up blurry. The first frame is sharp, but by frame 3 it’s only crisp when it’s very certain of its prediction. Any kind of rare movement, it goes “I kinda know what it roughly looks like” and smears fingerpaint all over the canvas.
The overall trajectory looks ok, so I’ll be more rigorously investigating whether it’s possible to squeeze more precise context out of it. For example, since the first frame is sharp, you could discard the other future frames and use that first frame as the last history entry (rolling completion window). If “the first frame is always sharp” is true, then it seems reasonable that you can generate N sharp frames with that technique, which might work better than predicting N all at once.
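In code the rolling-window idea would be roughly this (assuming a hypothetical model(history) that returns a block of predicted future frames):

    def rolling_predict(model, history, n_frames):
        # Generate n_frames one at a time: keep only the sharp first frame of
        # each prediction and slide the context window forward by one step.
        generated = []
        for _ in range(n_frames):
            future = model(history)            # model predicts several frames ahead
            first = future[0]                  # keep only the first (sharp) frame
            generated.append(first)
            history = history[1:] + [first]    # discard the rest, roll the window
        return generated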
You might also mess with your loss function to force it to "make up its mind", since right now the blurry mess likely minimizes the error from the actual frame (which isn't really what you want).
Exactly! That was the exact thing I was trying to think of a way to do.
Got any ideas? There are discriminators, but after reading over prior work, it seems like they help without being a groundbreaking or fully effective solution.
I had two harebrained ideas in mind. One is to add YOLO-style object detection. The difference between a blurry mess and a recognizable object is the fact that it’s a recognizable object, so minimizing an error term derived from YOLO detections might work. (“If there are more recognizable objects in the ground truth image than in the generated image, penalize the network.”)
The other was to try to make some kind of physics-based prediction of the world: if it knows roughly where a street is, or where a wall is relative to an object, then it’ll likely be less confused when generating objects. That idea is very nascent, but right now I’m attacking it by trying to get our RNN to predict an n-body simulation (two or three 2D circles that exert a gravitational pull on each other, with bouncing when they collide). The RNN is surprisingly okay at that, even though it’s only examining pixels, but it gets blurry. I’m going to try to get it to spit out actual predictions of position, velocity, acceleration, and radius, in the hopes that it’ll make the connection of “I know there’s a ball flying along this trajectory, so obviously it should still be there 3 frames from now.”
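For what it’s worth, a toy data generator in that spirit is only a few lines (all constants made up, and only wall bounces here rather than body-to-body collisions, to keep it short):

    import numpy as np

    def simulate(n_bodies=3, n_frames=60, size=64, dt=0.1, g=5.0, radius=3.0, seed=0):
        rng = np.random.default_rng(seed)
        pos = rng.uniform(radius, size - radius, (n_bodies, 2))
        vel = rng.uniform(-1.0, 1.0, (n_bodies, 2))
        frames = np.zeros((n_frames, size, size), dtype=np.float32)
        yy, xx = np.mgrid[0:size, 0:size]
        for t in range(n_frames):
            # pairwise gravitational pull (softened so close passes don't blow up)
            acc = np.zeros_like(pos)
            for i in range(n_bodies):
                d = pos - pos[i]
                dist2 = (d ** 2).sum(axis=1) + 1e-2
                acc[i] = (g * d / dist2[:, None] ** 1.5).sum(axis=0)
            vel += acc * dt
            pos += vel * dt
            # bounce off the walls
            for axis in range(2):
                hit = (pos[:, axis] < radius) | (pos[:, axis] > size - radius)
                vel[hit, axis] *= -1
                pos[:, axis] = np.clip(pos[:, axis], radius, size - radius)
            # rasterize the circles into this frame
            for p in pos:
                frames[t] += ((xx - p[0]) ** 2 + (yy - p[1]) ** 2 <= radius ** 2)
        return np.clip(frames, 0, 1)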
It seems like the more traditional solution is to add a loss term related to the optical flow of the image (displacement from the previous frame to current), or to do foreground/background segmentation masks and have it focus only on the foreground. Both of those feel like partial solutions though, and it feels like there should be some general way to “force it to make up its mind,” as you say. So if you have any oddball ideas (or professional solutions), I’d love to hear!
Have you checked the RSSM approach in DreamerV1, V2, V3, and PlaNet? It uses a deterministic latent state (the GRU hidden state) plus a discrete stochastic one. The deterministic and stochastic (sampled) latent states together are used to predict the next state. I think the stochastic state might help with your problem a bit.