
He explains it quite well: all necessary information is already in the pixel-space, and adding more sensors slows the team down more than it improves system performance. My understanding is that the major blockers are not in the perception area anyway; it would be great if someone with relevant experience could comment on whether this is indeed the case.


I am a principal engineer for a major autonomous vehicle company. You can break this statement down into two components:

Adding more sensors slows his team down more than it improves system performance.

I'll take his word on this. It is a lot of work to incorporate multiple sensors.

All necessary information is already in the pixel-space.

I hate to disagree with someone as distinguished as Karpathy, but this is simply not what I have observed in all of the data we have access to. Given my knowledge of the various stacks deployed today, I would never ever ever get into a vehicle using a vision-only stack and expect it to perform in some of the challenging environments encountered during testing.


I think one should distinguish between 'all necessary information is already in the pixel-space' and 'we already know how to extract all the information needed from pixel-space'.

The fact that (most) humans manage to drive around safely and successfully on current roads proves that the information needed exists in the pixel-space (not just the current image, but say current + history). We don't yet have stacks that can successfully extract everything needed from this information, but I don't think Dr. Karpathy ever claimed that we do.

(I am not a principal engineer but a mere PhD student who argues daily with people about how RGB information is underappreciated and underutilized.)


> The fact that (most) humans manage to drive around safely and successfully on current roads proves that the information needed exists in the pixel-space

But that doesn't mean that it translates to a car.

We constantly move our 576MP resolution eyes in multiple orientations in order to visualise a scene and focus on the most important areas. Cars have fixed, low-quality cameras.

We then interpret this data using the most advanced pattern-recognition system the world has ever seen, trained for 20+ years to fully comprehend the behaviour of everything this planet has to offer. Cars don't have anything close to this.


You seem to want to make it sound like a 576 MP resolution (where did you even get that number, while people still argue about what a fair comparison between the human eye and a camera even looks like?), or having to move your head/eyes to visualize your surroundings rather than having multiple fixed cameras covering the entire surroundings all the time, is a good thing? If resolution mattered that much, every car would have ultra-high-resolution cameras on it.

Humans certainly have a stronger and more general prior for making sense of the information, and that's exactly why I left it as a possibility. Cars don't *yet* have anything close to it, just like they didn't have a way to accurately detect objects a few years ago, and just like they didn't have a way to capture RGB information a few decades ago.

I am an optimistic guy, and I certainly believe in the power of learning at scale.


> 576MP

Actually our eyes are more like 8MP: https://www.picturecorrect.com/what-is-the-resolution-of-the...

Perhaps we get a higher synthetic resolution from moving our eyes about, or perhaps that is meaningless.


It could be reframed as saying we have a peak acuity equivalent to a 576 MP camera of the same FOV, with a theoretical max of 20 samples per second (50 ms to move between targets; realistically probably more like single digits). The 8 MP comparison is only relevant if there are so many targets needing constant full resolution that you can't focus on all of them, or if the targets are larger than the peak-acuity FOV. In practice this is not the case, because we can identify something once and keep tracking it in the periphery without issues, and something that large will likely be extremely easy to identify.


That doesn't make sense: a camera doesn't get more pixels just because it is taking a video tracking something. Nor would it if it had zoom and a controlled gimbal.


If you turn that tracked video into a panorama, it would. Or if you took 10 zoomed photos and stitched them on top of an unzoomed photo. The point is that unless the task demands more focus areas than the eye can attend to in a given window, the visual acuity (for the parts of the scene that matter) is higher than an 8MP shot of the entire scene.
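
To put rough numbers on that argument, here is a back-of-the-envelope sketch in Python. Every figure in it (foveal FOV, foveal pixel-equivalent count, scene FOV, saccade rate) is an illustrative assumption, not a measurement:

    # Back-of-the-envelope sketch of the "foveation vs. fixed camera" point.
    # All numbers below are assumptions for illustration, not measurements.

    FOVEA_FOV_DEG = 2.0      # assumed high-acuity foveal field of view
    FULL_FOV_DEG = 120.0     # assumed full scene field of view
    FOVEA_PIXELS = 1.0e6     # assumed pixel-equivalent samples in the fovea
    SACCADES_PER_SEC = 20    # ~50 ms per refixation, as claimed upthread

    # Pixel density (per square degree) of the fovea vs. spreading 8 MP
    # uniformly over the whole scene.
    fovea_density = FOVEA_PIXELS / FOVEA_FOV_DEG ** 2
    uniform_density = 8.0e6 / FULL_FOV_DEG ** 2
    print(f"foveal density: {fovea_density:,.0f} px/deg^2")
    print(f"uniform 8 MP:   {uniform_density:,.0f} px/deg^2")
    print(f"ratio:          {fovea_density / uniform_density:,.0f}x")

    # The 'stitched panorama' point: each refixation re-samples a small
    # patch at foveal density, so per second you can cover at most:
    covered = SACCADES_PER_SEC * FOVEA_FOV_DEG ** 2
    print(f"high-acuity coverage: {covered:.0f} of {FULL_FOV_DEG ** 2:.0f} deg^2/s")

The takeaway matches the comment above: where you are looking, acuity far exceeds a uniform 8 MP image, but you can only re-point that acuity at a handful of regions per second.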


I'll agree with you that there are still techniques to be discovered.

I also agree that most humans manage to drive in challenging conditions, but their margins for error become slimmer and slimmer. I personally want my autonomous robot vehicle to be way more efficient and safer than the best human operator, and also able to deal with conditions that would make any sane human pull to the side of the road.


Definitely agree with your second point! In theory, the reaction time and complete environment awareness should by themselves make an autonomous system way safer than human drivers.

In some ways, I am against the philosophy of using HD maps + LIDAR data for highly accurate localization, which most companies seem to be using these days. I believe this approach is inherently brittle and is an 'easy way out' of the hard localization problem. I think more resources should be put into developing more natural techniques with no HD-map dependency.

PS: It is my understanding that most of the major players were using HD maps; I'm not sure if that is still true.
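
For readers unfamiliar with the approach being criticized above, here is a toy Python sketch of what HD-map-based localization boils down to: match live sensor points against prior map points and refine the pose estimate. This is a translation-only nearest-neighbor alignment on made-up 2D points, not any company's actual pipeline:

    import numpy as np

    # Toy map-based localization: estimate the vehicle's position drift by
    # aligning a live "scan" against prior map points (translation-only ICP).
    rng = np.random.default_rng(0)
    map_pts = rng.uniform(0, 30, size=(300, 2))       # prior HD-map features
    true_drift = np.array([0.4, -0.3])                # actual position error
    scan_pts = map_pts + true_drift + rng.normal(0, 0.05, size=(300, 2))

    est = np.zeros(2)
    for _ in range(10):
        corrected = scan_pts - est
        # nearest map point for each corrected scan point
        d = np.linalg.norm(corrected[:, None, :] - map_pts[None, :, :], axis=2)
        matches = map_pts[d.argmin(axis=1)]
        est += (corrected - matches).mean(axis=0)     # shift toward the map

    print("estimated drift:", est)  # should land near true_drift

The brittleness argument is about exactly this dependency: the alignment is only as good as the prior map_pts, which can drift out of date.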


> their margins for error become slimmer and slimmer.

Can you elaborate on this? I've always felt like the margins for error are getting wider because automotive tech (particularly safety features) is so vastly improved. I doubt people would be able to text and drive as much, for example, if they were driving a 1950s-era Willys jeep, just because it requires so much more attention to keep on the road compared to modern vehicles.


Bleh. Auto accidents are the #1 preventable cause of death for kids:

https://en.wikipedia.org/wiki/Preventable_causes_of_death#Am...



Full-on agreement. There are literally videos of Teslas smashing into stationary vehicles on the highway at night using only vision cameras for FSD. No rational actor could claim the visible pixel-space is sufficient in that scenario compared to LIDAR, radar, etc.


It's funny you use radar as an example of a 'good sensor' when it is well known that most (or maybe almost all?) of the stationary-vehicle accidents you're talking about happened because of radar's inability to detect a stationary obstacle.

On the other hand, RGB data does have that information; we use it every day to avoid obstacles, even under foggy and rainy conditions. (I'm no LIDAR expert, but I know it sucks in rainy conditions.)

I am not saying I support a vision-only stack; all I am saying is that it is certainly possible to deploy one in the future.


You mean some radars' inability to detect stationary obstacles. Clutter rejection has a lot of more sophisticated algorithms to apply, given greater compute power to throw at the problem.
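
For those wondering why a radar would miss a stopped car at all: automotive radars measure relative radial velocity, and anything fixed to the world (signs, bridges, and a stalled car ahead) returns roughly minus the ego speed, so a naive 'world-stationary' clutter filter throws the stalled car away along with the overpasses. A toy Python sketch, with invented numbers:

    # Toy sketch of naive Doppler-based clutter rejection. All numbers
    # are invented for illustration.
    ego_speed = 30.0  # m/s

    # (range in m, relative radial velocity in m/s, label)
    returns = [
        (120.0, -30.0, "overhead sign gantry"),
        (80.0, -30.1, "stalled car in lane"),   # the dangerous one
        (60.0, -12.0, "slower car ahead"),
        (5.0, -30.2, "manhole cover"),
    ]

    STATIONARY_TOL = 0.5  # m/s band around -ego_speed treated as clutter

    for rng_m, v_rel, label in returns:
        clutter = abs(v_rel + ego_speed) < STATIONARY_TOL
        print(f"{label:22s} @ {rng_m:5.1f} m: "
              f"{'REJECTED as clutter' if clutter else 'kept'}")

The parent's point is that with more compute you can keep those returns and disambiguate them with finer processing instead of a blanket velocity gate.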


These were flat truck/ambulance surfaces encountered at an angle: exactly the conditions the first stealth fighter exploited, its angular surfaces evading some of the best radar in the world because, from most angles, no radio waves were reflected back to the radar receiver. It's hard to get the job done when you get no return at all.


> Therefore, for this simple ADAS algorithm using roof mounted LIDAR, heavy rain does not prove to be a particularly important factor in the system performance.

https://www.mdpi.com/2079-9292/8/1/89/htm


Compare their occupancy map with what you get out of the latest LIDAR Waymo is using, and it is scary (occupancy is harder, since it fills in what is occluded, but Tesla's looks like Minecraft-style 1x1x1 m resolution).
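
For anyone unfamiliar with the term: an occupancy map rasterizes sensor returns into voxels, and the voxel size is what the 'Minecraft-style' jab is about. A toy numpy sketch with a fake point cloud (sizes and counts invented):

    import numpy as np

    # The same fake point cloud rasterized at two voxel sizes, to show
    # how much detail a coarse grid throws away.
    rng = np.random.default_rng(1)
    points = rng.uniform(0, 10, size=(5000, 3))  # fake returns in a 10 m cube

    def occupied_voxels(points, voxel_size):
        idx = np.floor(points / voxel_size).astype(int)
        return len(np.unique(idx, axis=0))  # count distinct occupied cells

    for v in (1.0, 0.2):
        total = int((10 / v) ** 3)
        print(f"{v} m voxels: {occupied_voxels(points, v)} occupied of {total}")

At 1 m resolution nearly every cell lights up and shape detail is gone; at 0.2 m the same returns resolve far more structure.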


Out of curiosity: could you elaborate on what such challenging environments might be?


It’s good enough for people, so all the info is there.

Doesn’t mean it’s better or easier.


I have driven in extreme rain and flash-flood conditions in north Texas, and I consider this a specific natural challenge that would defeat his system.


Any amount of snow would do this too. It severely reduces the color space of road features.


Tesla cameras are not RGB; they're WWWR (white-white-white-red). Essentially they have a Bayer-style array, but with only one red pixel out of four; the other three are black & white. I believe the W pixels aren't homogeneous either; there is some design aspect that enables them to cover a wider range of intensities, so that they can handle both darkness and full sunlight.
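
To make the layout concrete, here is a toy rendering of such a mosaic in Python. The exact position of the red photosite within each 2x2 tile is an assumption on my part (real automotive imagers use RCCC/RCCB-style layouts whose details vary by part):

    import numpy as np

    # Toy WWWR mosaic: three panchromatic ("white"/clear) photosites and
    # one red per 2x2 tile, tiled across a small patch of the sensor.
    # The red pixel's position within the tile is an assumed detail.
    tile = np.array([["W", "W"],
                     ["W", "R"]])
    mosaic = np.tile(tile, (4, 4))  # an 8x8 patch of the sensor

    print(mosaic)
    print(f"red photosites: {(mosaic == 'R').mean():.0%}")  # 25%, 1 in 4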



