
Awesome read! It's great to see someone face a tech challenge and not back down. We need more of that.

For anyone interested, here are some links to the details of error-correcting codes (ECC) used on CD-ROM:

- http://www.usna.edu/Users/math/wdj/_files/documents/reed-sol...
- http://www.multimediadirector.com/help/technology/cd-rom/cdr...
- http://bat8.inria.fr/~lang/hotlist/cdrom/Documents/tech-summ...

But I think CDDA possibly uses different (less) ECC... so the info in the links might not be 100% relevant.

It's a shame this cool tech is not being used anymore. As an information theorist, I used to be able to point to optical disks as an application of my field, and be like "see all that math is useful for something," but now I don't have anything shiny to point to anymore :(



> As an information theorist, I used to be able to point to optical disks as an application of my field, and be like "see all that math is useful for something," but now I don't have anything shiny to point to anymore

I strongly disagree; there are many shiny things you can point to.

Point to a smartphone (and also the relevant sections of the LTE/HSPA/UMTS/WCDMA/GSM / 802.11 standards documents); there's a literal panoply of coding/error-correction maths that's crucial to every single one of those standards. You can actually go to the ETSI website and find the LTE standard and get free PDFs with the exact parameters of all the codes they use, in enough detail to reimplement them yourself.

In those standards, there's all sorts of lovely little state machines for coping with errors, like LTE's "HARQ loop" where your phone will tell the cell base station that it didn't receive a chunk of data correctly; at which point the tower will resend a differently punctured version of the chunk of data, and your phone's modem will try to soft-decode the data with both versions to work with. Oh, and that exchange (including processing on both ends and radio transmission latency) takes under 10 milliseconds to complete -- the standard places a strict deadline on how long your phone has to respond with its acknowledgement/request.
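To make the soft-combining idea concrete, here's a toy sketch (mine, not the real LTE turbo decoder, which works on punctured code blocks): two noisy receptions of the same bits are added as soft values, so a retransmission helps even when neither copy decodes cleanly on its own.

```python
# Toy sketch of HARQ-style soft combining: sum the soft values from two
# noisy receptions before making hard bit decisions. All parameters here
# (noise level, BPSK mapping) are illustrative assumptions.
import random

random.seed(42)

def transmit(bits, noise):
    # BPSK-style mapping: bit 0 -> +1.0, bit 1 -> -1.0, plus Gaussian noise
    return [(1.0 if b == 0 else -1.0) + random.gauss(0, noise) for b in bits]

def hard_decode(soft):
    return [0 if s >= 0 else 1 for s in soft]

bits = [random.randint(0, 1) for _ in range(1000)]
rx1 = transmit(bits, noise=1.2)               # first (failed) reception
rx2 = transmit(bits, noise=1.2)               # retransmission
combined = [a + b for a, b in zip(rx1, rx2)]  # soft combine both copies

errs1 = sum(d != b for d, b in zip(hard_decode(rx1), bits))
errs2 = sum(d != b for d, b in zip(hard_decode(combined), bits))
print(errs1, errs2)  # combining should yield noticeably fewer bit errors
```

The real thing combines differently-punctured encodings (incremental redundancy), which does even better than this repetition-style chase combining.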

Also, hard-drive platters are extremely shiny (more so than optical disks) and those also use error-correcting codes, as do SSDs. Did you know that bit cells in modern SSDs are so small that the number of electrons it takes to effect a measurable voltage difference at the sense amplifiers is fewer than 100 (https://twitter.com/whitequark/status/684018629256605696)? All your data lives in those differences of tens of electrons per bit cell; no wonder there's ECC machinery hard at work in SSDs!


> ...like LTE's "HARQ loop" where your phone will tell the cell base station that it didn't receive a chunk of data correctly; at which point the tower will resend a differently punctured version of the chunk of data, and your phone's modem will try to soft-decode the data with both versions to work with.

Cooool.

I googled "HARQ loop" and because I visit HN so much the first result was https://news.ycombinator.com/item?id=11151232, so I learned that the data-processing portion (where the decode attempt is retried) must complete within 3ms!

I've been wondering for some time about good (bandwidth-efficient) ways to do error correction/recovery for general-purpose high-efficiency byte transports, and just in case you're (or anyone is) interested, I wanted to put this here. How good would it be to throw TCP's "flood of ACKs" out the window and instead checksum every frame (is CRC32 good?) and, every say 32 frames, send a bitmask (in this case 4 bytes) noting which frames were correct and which weren't? This would send ~32 times less data (and I just learned a TCP ACK is approx. 74 bytes! https://stackoverflow.com/questions/5543326/what-is-the-tota...)
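A minimal sketch of the scheme I'm describing, with an invented frame layout (4-byte CRC32 prefix per frame; a real protocol would also need sequence numbers so both ends agree on which 32 frames the mask covers):

```python
# Sketch of CRC-per-frame + periodic bitmask ACK. Frame format is a toy:
# [4-byte little-endian CRC32][payload]. The receiver checks each frame
# and reports one 32-bit mask (4 bytes) per group of 32 frames.
import struct
import zlib

def make_frame(payload: bytes) -> bytes:
    return struct.pack("<I", zlib.crc32(payload)) + payload

def check_frame(frame: bytes) -> bool:
    (crc,), payload = struct.unpack("<I", frame[:4]), frame[4:]
    return zlib.crc32(payload) == crc

def ack_bitmask(frames) -> bytes:
    # bit i set => frame i passed its checksum
    mask = 0
    for i, f in enumerate(frames):
        if check_frame(f):
            mask |= 1 << i
    return struct.pack("<I", mask)

frames = [make_frame(b"data%02d" % i) for i in range(32)]
frames[5] = frames[5][:-1] + b"\x00"   # simulate corruption of frame 5
mask, = struct.unpack("<I", ack_bitmask(frames))
print(bin(mask))  # bit 5 clear, all others set
```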

(My interest/focus is within the domain of good ideas lacking widespread implementation/uptake. I've found that these always seem to be hiding in the woodwork.)


Your TCP ACK approach is confusing. If you mean to checksum every frame but only report every 32 frames, that only makes sense if you somehow have a medium with high bandwidth AND high error rates. If your error rates are low, conventional TCP will just send occasional ACKs saying, e.g., "Yup, got the last 500 packets, keep it up."

But if your error rates are high and your bandwidth is high, you should already know how to fix that at a lower layer, trading very slightly less bandwidth for error-free transmission. So why is this suddenly TCP's problem and not your transmission medium's?


Because I, for one, can't change the DOCSIS parameters my ISP set for the segment I'm on, and I can't improve radio reliability without restricting my movement either. The error detection scheme is rather weird, and a simple erasure code might well work better, but this is a good idea.


Thanks for the suggestion. I've been meaning to take the time to stare at how erasure coding works and keep staring until my eyes stop glazing over and I get it :)
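In case it helps anyone else's eyes un-glaze, here's the simplest possible erasure code, a toy example of my own: one XOR parity packet over a group lets you reconstruct any single lost packet, because XOR-ing everything that did arrive cancels the known packets and leaves the missing one. (Reed-Solomon and fountain codes generalize this to multiple losses.)

```python
# Single-parity XOR erasure code: parity = p0 ^ p1 ^ ... ^ pn.
# Any one lost packet equals the XOR of the parity with all survivors.
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

packets = [b"alpha--!", b"bravo--!", b"charlie!", b"delta--!"]  # equal-length
parity = packets[0]
for p in packets[1:]:
    parity = xor_bytes(parity, p)

# Suppose packet 2 is lost in transit; recover it from survivors + parity.
received = [packets[0], packets[1], packets[3]]
recovered = parity
for p in received:
    recovered = xor_bytes(recovered, p)

print(recovered)  # b'charlie!'
```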

I replied to the parent comment, FWIW.


Thank you for helping me understand that I don't actually understand how TCP works :) I had no idea ACKs apply to the last n bytes as a group - although now that I think about it, it makes a lot of sense that no other approach would be viable (even with 80s/90s-era line rates).

I want to design a protocol capable of withstanding both high and low error rate scenarios; everything from usage on 64kbps satellite to flaky home Wi-Fi to wobbly 4GX cellular performance around tunnels and buildings. The (slightly pathological) application I have in mind (mostly as a theoretical test) involves video and interactivity, so basically requires simultaneous high throughput and low latency (essentially worst case scenario). My goal is to build something that responds well to that, perhaps by being able to have its performance metrics retuned on the fly.

I read somewhere a while ago about how TCP is a poor general-purpose catch-all protocol, and that eschewing it where possible (cough where there aren't any proxies) will generally boost performance (if the replacement protocol is well-designed).

It would be awesome if I could assign significance to, and act on, each incoming network packet. This would mean I wouldn't need to wait for enough packets to come through and be reassembled first. So packets could arrive out of order or get dropped and I'd still be able to recover the data from what did make it through, instead of throwing away what I did receive and just starting over.

Yes, a lot of work; I'm trying to figure out if it's worth it depending on the situation.

My original musing was about TCP ACKs. If I'm following correctly, a single ACK can apparently cover ~728KB of data (1492 * 500 = 746,000 bytes), so if a failure occurs... does 728KB of data get resent?! I presume not where the link is slow or congested or has a high error rate, but if that still happens where a lot of data is being sent, it feels quite high to me.

Thanks for replying; you helped me properly think through my idea and realize it would never actually work: I hadn't realized that I was assuming packets would arrive in order. If I use a 64-byte string to represent a 512-bit bitmask of ok/fail checksum tests, that would only work if both sides of the link are considering the same set (and sequence order) of packets.

Another thought, with far fewer moving parts (and fewer hidden assumptions), was to simply report a running success/failure percentage, with the remote end using that information to get as close as possible to 0% errors. That doesn't narrow down what needs retransmitting, though. Back to the drawing board, then...

I have no delusions of reinventing TCP/IP, and I'm not even interested in reinventing the wheel. I do find networking very fun, but I guess I'm still working on understanding how it works, building an accurate mental model, and (in particular) honing/tuning the (very important) sense of knowing what solution approaches will work and what will fail.


When you have no idea about something, time spent envisioning how to improve it is time entirely wasted. "Hmm, I wonder if elephants are used to pick chocolate bars from tall trees. Maybe we could improve production by creating a type of stilts for elephants that makes it easier to reach upper branches. Ah, but when the elephants fly home to Canada, would the stilts interfere with their wings?".

TCP has Window Scaling, allowing high-bandwidth links to have up to about 1GB on the wire, and Selective Acknowledgement, allowing recipients to acknowledge ranges of correctly received packets when some are lost. I picked 500 entirely out of the air; actual numbers will vary enormously.
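The exact cap comes out of the window-scaling arithmetic in RFC 7323: the 16-bit window field is left-shifted by a negotiated scale factor, which is itself capped at 14.

```python
# Largest advertisable TCP receive window under RFC 7323 window scaling:
# 16-bit window field (max 65535) shifted left by the max scale factor, 14.
max_window = 65535 << 14
print(max_window, max_window / 2**30)  # 1073725440 bytes, just under 1 GiB
```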

A TCP/IP "stack" automatically tunes these settings for each connection; a modern stack will do a pretty good job. Improving this automatic tuning, informed by both theory and real-world practical experience, is an ongoing topic of network engineering.


I not infrequently have links with ~0.1% packet loss. With even that much loss on a long-distance link, throughput _very_ quickly drops below 10 Mbit/s if the other side can't do fancy selective ACK and similar additions. While I might be able to do something about that, as long as I control a node on both sides of the lossy channel I can compensate with a little bit of FEC. I might even be able to operate losslessly, using a fountain code to supply more parity for each data bundle for as long as my receiving node keeps asking the sending node for more.
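A back-of-envelope check of that throughput claim, using the classic Mathis et al. approximation for steady-state TCP without SACK (the MSS and RTT below are my assumptions for a typical long-distance path, not numbers from this thread):

```python
# Mathis approximation: throughput <= MSS / (RTT * sqrt(loss)).
# Assumed values: Ethernet-derived MSS, 100 ms long-distance RTT.
import math

mss = 1460          # bytes
rtt = 0.100         # seconds
loss = 0.001        # 0.1% packet loss

throughput_bps = (mss / (rtt * math.sqrt(loss))) * 8
print(round(throughput_bps / 1e6, 1), "Mbit/s")  # well under 10 Mbit/s
```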

You are right, though: for unicast networks there is little use for L3/L4 FEC. Maybe with the age of IPFS and other content-addressed networks we will come back to broadcast-FEC networks, like the ones once used for 3GPP multicast services such as mobile TV.


> You can actually go to the ETSI website and find the LTE standard and get free PDFs with the exact parameters of all the codes they use, in enough detail to reimplement them yourself.

Does this include things like turbo codes? I've been curious about FEC for a while now.


A CDDA sector has 2352 bytes of main data and 96 bytes of sub-channel data, for a total of 2448 bytes. A sector is made up of 98 sub-frames; each sub-frame has 24 bytes of main data, 4 bytes of C1 error correction, 4 bytes of C2 error correction, and 1 "byte" of sub-channel data. The first 2 sub-frames carry a special pattern in the sub-channel data to mark the start of the sector, leaving 96 bytes of sub-channel data available for other things. The error correction only covers the main data; the sub-channel data just has a CRC.
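The layout arithmetic checks out:

```python
# Sanity-checking the sector layout described above.
FRAMES = 98                           # sub-frames per sector
MAIN, C1, C2, SUBCODE = 24, 4, 4, 1   # bytes per sub-frame

main_bytes = FRAMES * MAIN              # audio/main data per sector
ecc_bytes = FRAMES * (C1 + C2)          # C1 + C2 error-correction bytes
usable_subcode = FRAMES * SUBCODE - 2   # minus the 2 sync-pattern bytes

print(main_bytes, usable_subcode, main_bytes + usable_subcode)  # 2352 96 2448
print(MAIN / (MAIN + C1 + C2))  # code rate of the C1/C2 layer: 0.75
```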

The standard API for talking to a CD-ROM drive has no way to get the C1 and C2 error-correction data itself. You can only get 1 bit for each of the 2352 main-data bytes, after C2 correction, indicating whether that byte had an error. But even that doesn't actually seem to be useful on the drives I tested it with; they all seem to have weird firmware.


I think you need a certain Plextor drive.

https://github.com/saramibreak/DiscImageCreator is kinda the gold standard for this.


> As an information theorist, I used to be able to point to optical disks as an application of my field, and be like "see all that math is useful for something," but now I don't have anything shiny to point to anymore :(

Surely there's some forward error correction built into, well, pretty much any physical media these days? I can't imagine BluRay would need less of it than a CD.


I'm sure pretty much all storage media (SSDs and such) have it, as do internet protocols and digital media (now that having local storage is becoming less cool).


High-bandwidth Ethernet uses forward error correction, but I'm not aware of any higher-than-physical internet protocols that do, unless you count erasure codes a la QUIC


IP, TCP, and UDP all have checksums, though (a) a checksum is less powerful than error correction, and (b) theirs are weaker than Ethernet's: Ethernet uses a 32-bit polynomial CRC, while those three all use the same 16-bit ones'-complement sum rather than a CRC.
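For the curious, that 16-bit Internet checksum (RFC 1071) fits in a few lines, and the example shows one of its well-known weaknesses: swapping two aligned 16-bit words leaves the checksum unchanged, which a CRC would catch.

```python
# RFC 1071 Internet checksum: ones'-complement sum of 16-bit big-endian
# words, with carries folded back in, then complemented.
def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                            # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold carry back in
    return ~total & 0xFFFF

a = internet_checksum(b"\x12\x34\x56\x78")
b = internet_checksum(b"\x56\x78\x12\x34")   # same words, reordered
print(hex(a), a == b)  # identical checksum despite the reordering
```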


> As an information theorist, I used to be able to point to optical disks as an application of my field, and be like "see all that math is useful for something," but now I don't have anything shiny to point to anymore

QR codes, dude. ;)


While it might not be shiny, could you not point to the math involved in encryption used in day to day usage like HTTPS, OTR, cryptocurrencies, etc?

ECC is still used in RAM, and bitwise math is still used in RAID arrays. Again, not shiny, but still valid examples.



