[dupe] What end-to-end encryption should look like (jitsi.org)
245 points by lopespm on April 22, 2020 | hide | past | favorite | 62 comments


This is a dupe of https://news.ycombinator.com/item?id=22855407

The URL is slightly different but it is the same.


In short:

* currently the Jitsi Meet videobridge sees unencrypted conversations

* they're changing that using a new WebRTC API called "Insertable Streams"

* it's currently in alpha with an open RFC

* they plan to use the double ratchet algorithm for key exchange in the future


The mention of double ratchet is confusing and maybe worrying in the sense that it feels as though they don't understand what they're doing here.

The Double Ratchet is designed around synchronous linear messages:

Alice: "Hi Bob"
Bob: "Hey"
Alice: "Check out our new house guest!" <cat photo>
Bob: "Aww :cat-emoji: :heart:"

One of the two ratchets changes the key used for messages from Alice to Bob each time Alice sends a message, ensuring that knowing one of these ephemeral keys only gets you one message and resetting all the assumptions about how much data can safely be encrypted with a single key. And of course likewise from Bob to Alice.
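The per-message chain described above can be sketched as a simple KDF chain. This is a minimal illustration of the idea, not Jitsi's or Signal's actual code; the HMAC labels and the placeholder root secret are assumptions:

```python
import hmac
import hashlib

def ratchet_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """Advance the sending chain: derive a one-time message key and the
    next chain key from the current chain key, then forget the old one."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key

# Alice's sending chain: each message gets a fresh key, and the old chain
# key is discarded, so compromising one message key reveals only that
# message, not earlier or later ones.
chain = hashlib.sha256(b"shared-root-secret").digest()  # placeholder root
keys = []
for _ in range(3):
    mk, chain = ratchet_step(chain)
    keys.append(mk)

assert len(set(keys)) == 3  # three distinct one-time message keys
```

Because the derivation is one-way, knowing the current chain key says nothing about keys already spent, which is the forward-secrecy property the comment describes.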

The other ratchet uses a Diffie-Hellman-style algorithm to agree on entirely new keys each time they exchange messages back and forth, in order to recover from key compromise.

But, while this makes sense for an application that is periodically sending messages, it isn't a reasonable fit for video streaming. It doesn't make sense to change the keys for each frame transmitted, for example.

I guess it can make sense to have the system automatically change the keys when participants leave, somehow, as this means a participant can't secretly eavesdrop on calls they seem to have left. But I don't see how that's the double ratchet.

Their example 'foo' password is obviously a placeholder, and I can see that if they used a scheme like Jitsi's default random VC names ("BearsMasticateSteakImmediately") you could have something that can't reasonably be brute-forced, but exactly what happens with key exchange definitely needs more thought.

The good news is that done correctly that's largely orthogonal to the encryption problem. It's important, but it doesn't need to block the other work.


> The mention of double ratchet is confusing and maybe worrying in the sense that it feels as though they don't understand what they're doing here.

I'll start this by saying that I recognize your username from lots of good informative posts here over the years. But I think you're being unnecessarily harsh here, to the point of taking maybe the most uncharitable interpretation of the limited data that we have.

We've spent the past 20 years telling people not to roll their own crypto. Is it really that bad when somebody takes that advice to heart?

> I guess it can make sense to have the system automatically change the keys when participants leave, somehow, as this means a participant can't secretly eavesdrop calls they seem to have left. But I don't see how that's the double ratchet.

They mentioned the Olm version of double ratchet. This usage scenario sounds a lot like ratcheting the sender keys in a Matrix group chat with Megolm.

If you think that you might want to occasionally ratchet your keys (as speculated here [1]), then IMO it's better to grab something good (but possibly overpowered) off the shelf than to try to cobble together something yourself. Even if maybe you don't wind up using the hash-based ratchet very much.

[1] https://news.ycombinator.com/item?id=22858476


> It doesn't make sense to change the keys for each frame transmitted for example.

If it doesn't hurt the reliability of the communication, I don't see the problem... and it could have "hidden" benefits.


It could also have “hidden” complications, which I find much more likely.


I have a question on the definition of "end-to-end encryption" with regards to groups/rooms. Is it sufficient to have a single key shared by all "ends" for the room, or must each user have its own key? If a single group key is acceptable, must it be rotated as members come and go?

I'm just looking for a common definition of the term wrt webrtc groups for fear of misusing it when describing a product. (For the purposes of this clarification, pretend I have key distribution/derivation figured out, and whether I use asymmetric or symmetric encryption doesn't bear on the question itself.)


I think it's a hot topic and the definition is divisive. Having 1 "secret" that end users share and use is technically end to end encryption, I'd say, but it's kind of like sending messages to yourself.

It's kind of like setting up one private/public key pair, then sharing both with everyone in your group and writing each other messages with them. If one of your group becomes a malicious actor, or if a malicious actor manages to steal the pair, it can compromise the group communications. You also can't cryptographically distinguish between each other, since you share one identity.

Mix in video conferencing, though, and you really just have to trust what you see/hear from your peers' video feeds in addition to trusting that your shared key pair isn't compromised.

Ideally, each party member has their own unique key pair, they share their public keys within the group, then communicate that way. But you get more overhead in starting a Jitsi Meet room than just a link and a shared password.


Separate keys means you need to have separately encrypted streams for each pair of participants. This is probably not scalable, and definitely does not play well with multicast or with a video-routing server.


Is that really true?

I'm not a cryptographer by any means, but it seems like every time you transmit a chunk of video over the network, you could generate a single-use key, encrypt the video with that key, and then encrypt the single-use key with every intended recipient's individual key.

When the recipients get the message, they'd find their encrypted copy of the temporary key in the list, recover the single-use key, then decrypt the video.

That way the only wasted bandwidth is in transmitting keys, which is a lot smaller than video.

The recipients could give the single-use key (that they decrypted with their individual key) to someone who isn't supposed to have it, but that doesn't seem like a problem since they already have the decrypted data and could give that away.

For a small performance improvement, you could re-use the temporary key a few times (for, say, 10 seconds) if the set of intended recipients doesn't change.
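The hybrid scheme sketched in this comment (encrypt once, wrap the content key per recipient) can be illustrated as below. This is a toy: the HMAC-counter stream cipher stands in for a real AEAD like AES-GCM, and the names are invented; it only shows the structure, where bandwidth scales with one stream plus n small wrapped keys:

```python
import hmac, hashlib, os

def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (HMAC-SHA256 in counter mode), for illustration
    only; a real system would use an AEAD such as AES-GCM."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        block = hmac.new(key, nonce + counter.to_bytes(4, "big"),
                         hashlib.sha256).digest()
        out.extend(block)
        counter += 1
    return bytes(x ^ y for x, y in zip(data, out))

# A single-use content key encrypts the video chunk exactly once...
content_key = os.urandom(32)
nonce = os.urandom(12)
chunk = b"compressed video frame bytes"
ciphertext = keystream_xor(content_key, nonce, chunk)

# ...then only the small content key is wrapped per recipient.
recipient_keys = {name: os.urandom(32) for name in ("bob", "carol", "dave")}
wrapped = {name: keystream_xor(rk, nonce, content_key)
           for name, rk in recipient_keys.items()}

# Each recipient unwraps the content key with their own key and decrypts
# the shared stream.
bobs_content_key = keystream_xor(recipient_keys["bob"], nonce, wrapped["bob"])
assert keystream_xor(bobs_content_key, nonce, ciphertext) == chunk
```

The wasted bandwidth is the `wrapped` dict, one small entry per recipient, rather than a full re-encryption of `chunk` per recipient.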


You're both right: encrypting messages to everyone is still workable when there are multiple keys; what is not scalable is distributing the shared secret and rotating it when members are added and removed frequently.

This scalability issue is a non-problem imo:

* I don't expect large group VCs, and I don't expect the member set to be updated frequently

* For large group VCs, the threat model is much different and you can do broadcasting (use the same key)

For group chat I would say that confidentiality goes out of the window once you have a large group, so having one key is just simpler to implement and makes sense.


I like that idea. I think there might be some risk that because all recipients receive the same encrypted text (encrypted by a different key for each recipient), a recipient may be able to discover other recipients' encryption keys. To avoid that, I think I would pad each encrypted text with random data, where each recipient gets different random data.

Then the problem boils down to key management, but Keybase has that figured out IMHO. :-) It would be amazing to see a collaboration between Jitsi and Keybase to build end-to-end encrypted video chat with high quality crypto.


"I think there might be some risk that because all recipients receive the same encrypted text (encrypted by a different key for each recipient), a recipient may be able to discover other recipients' encryption keys."

If that's possible, it means you're using horribly broken encryption. So it shouldn't be a concern if you're doing things right.


Yeah, on second thought, that would mean TLS is broken: if many people connect to a static web site and receive the same content, they would be at risk of revealing key material. However, they are not at risk of any such thing. Never mind. :-)


I am not sure if this relates, but this past year at Black Hat there was a talk on Messaging Layer Security, and they brought up a concept called TreeKEM, where users share keys in a tree hierarchy to reduce the number of shared secrets.

Again, not sure if this is applicable, but the comments made me think of this.

[0] https://www.blackhat.com/us-19/briefings/schedule/index.html... [1] https://i.blackhat.com/USA-19/Wednesday/us-19-Robert-Messagi...
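The tree idea can be sketched in a few lines. This is a toy with made-up member names, not the actual TreeKEM/MLS construction (which uses public-key material at each node); it only shows why a member update touches O(log n) nodes instead of redistributing n secrets:

```python
import hashlib

def h(*parts: bytes) -> bytes:
    """Toy node derivation: hash the children's secrets together."""
    return hashlib.sha256(b"".join(parts)).digest()

# Four leaves (members); each internal node's secret is derived from its
# children, and the root secret serves as the group key.
leaves = [h(name.encode()) for name in ("alice", "bob", "carol", "dave")]
left, right = h(leaves[0], leaves[1]), h(leaves[2], leaves[3])
group_key = h(left, right)

# When dave rotates his leaf secret, only the nodes on his path to the
# root change: log2(n) derivations, and the 'left' subtree is untouched.
leaves[3] = h(b"dave-new")
new_right = h(leaves[2], leaves[3])
new_group_key = h(left, new_right)
assert new_group_key != group_key
```

With thousands of members that logarithmic path is the difference between a key rotation being cheap and being a broadcast to everyone.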


This still uses a single shared secret for the whole group.


Thank you for verifying that


To add to that, the only goal of TreeKEM is to avoid latency/complexity when adding and removing group members frequently/quickly. Which is really not the use-case of most groups (so I'm not sure why they're doing this).


Probably groups with 6 or so members are alright, but the more people your group has, the more joining/parting there is, and the larger the effort to distribute a new set of keys. So if you want your method to be scalable (and some Telegram rooms have tens of thousands of members!), you need strategies like this.


> some telegram rooms have tens of thousands of members

I'd say that at this point you can either:

* accept that there is no confidentiality anymore. It's just not realistic to have a "secret" group with that many members

* have the person who adds new members forward them the group key, and give up on key rotation

btw, I'm wondering how treeKEM manages malicious members when key rotation happens


> accept that there is no confidentiality anymore. It's just not realistic to have a "secret" group with that many members

You do have a point, especially when it's a group people join in their free time. However, if they are present for work they are less likely to leak information. Also, encryption should give a default level of privacy to build on.


right, but how much do we want key rotation as a priority when a member leaves? What are the chances they're going to collude with whoever can see the traffic?


I think it's fine if you don't re-encode (i.e. need to decode) the audio/video packets in any way. IIRC, you can broadcast RTP stream packets without knowing their contents if you settle on common codecs (e.g. opus and h264). But if you need to embed picture-loss-indicator messages or something, it might not work. I think this is the essence of "SFUs" (selective forwarding units) in WebRTC parlance.


Not really. You can have internal per-packet keys which you throw away after your cryptographic setup, and then only convey each key separately, in some encrypted form, to each member, along with a common payload which is amenable to multicast.

I think that's a piece of signal's group chat protocol.


I think a strong cryptographic communications platform would offer the option of using either multi-party-shared-key or p2p key pairs depending on the use-case. Similar to how Slack has both private and public channels within an org.


Signal has published some interesting material addressing this question.

https://signal.org/blog/signal-private-group-system/


I think it's very much a gray area.

True E2EE seems like each endpoint would have to have its own key. But the bandwidth realities of group videoconferencing make it entirely infeasible for each participant to be sending a separately encrypted stream to each other participant. So the only realistic solution is for all members to share the same key, which makes everything dependent on the security of the key distribution. But then... is it really E2EE? I'm not sure there's a generally accepted answer here.


If you already have pairwise E2E encryption working, you can use it to distribute a shared key. It's the first part that's hard.


If your video was encrypted once and multicast to all n other participants, couldn't you do it in such a way that it would be decryptable by n different keys that you handed out, one per participant, without anyone knowing other people's decryption keys? There will be n^2 keys total, but only n streams total.

What would be an attack scenario here that wouldn't exist with the entirely infeasible n^2 streams scenario you mentioned? Nobody would be able to use the keys they possess to make an imposter stream.


What matters is that the server, the software provider, and anyone not in the group can't obtain plaintext even if they can intercept and modify all traffic on the Internet.

Obviously you need to rotate the session keys when someone leaves, since they must no longer be able to obtain plaintext.


> Is it sufficient to have a single key shared by all "ends" for the room, or must each user have its own key?

It is enough that a single key is shared; that being said, some systems do more because they want to provide additional properties on top of end-to-end encryption for groups.


Too early; Jitsi needs a lot of work to get good basic features first. Their "videobridge" is just a piece of software without docs or an architecture description, and there is no way to run it as a pure SFU without other Jitsi parts like the Colibri XMPP component. Etc. "Insertable Streams" is scheduled for Chrome only, in the future version 84, and is still experimental; and if it runs after codec processing, then it's useless.


Janus [1] is a much more capable webrtc gateway that's pluggable to do anything you want. Jitsi is like a Wowza type server. Stay away.

[1] https://github.com/meetecho/janus-gateway


Is it a turnkey solution? Janus seems more like a WebRTC gateway with no directly usable end-user app (eg. video chat).

Jitsi Meet seems to be a turnkey thing where you can just self host an instance and have users create video meetings without too much fuss.


This might be useful: https://www.youtube.com/watch?v=u8ymYTdA0ko

Janus is more of a Swiss Army knife, vs. an all-in-one app for a specific webrtc use case.


We'll see, thx. I'm looking at the "mediasoup" SFU; it looks very nice, but I need to try the code.


> From this key we derive a 128bit key using PBKDF2. We use the room name as a salt in this key generation. This is a bit weak but we need to start with information that is the same for all participants so we can not yet use a proper random salt.

1) You do not need to use a password-based KDF if your key is a random bytearray of >= 16 bytes. I expect that most people would use this feature by copy/pasting the key into email or whatever channels they're using to communicate the room link.

2) I'm not sure about IV generation, maybe somebody with more knowledge on SSRC/etc. can look at that

3) decryption errors do not kill the connection; that's atypical, but I think it should be fine
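The derivation quoted from the post can be sketched with the stdlib. The room name and password values are examples, and the iteration count is an assumption, since the post doesn't state one:

```python
import hashlib

room_name = "BearsMasticateSteakImmediately"  # example room name, used as salt
password = "foo"                              # the post's placeholder password

# Derive a 128-bit key from the password, salted with the room name, which
# is the same for all participants (the weakness noted in the quote).
key = hashlib.pbkdf2_hmac(
    "sha256",
    password.encode(),
    room_name.encode(),
    100_000,   # iteration count (assumed; not specified in the post)
    dklen=16,  # 128 bits
)
assert len(key) == 16
```

This also shows why point 1 above holds: if `password` were already 16+ random bytes, the PBKDF2 stretching step would add nothing.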


> 3) decryption errors do not kill the connection; that's atypical, but I think it should be fine

I assume that this is so you can deal with the fact that some individual frames have been lost and/or corrupted, as long as the next ones decrypt just fine. That is important for video streams; I can imagine that the quality of life of people using E2EE streams would suffer if the stream were highly vulnerable to corruption of even a single bit within any of the frames.


In practice a bunch of layers under the application one have checksums: UDP already has a checksum, IP has one too, the data link layer as well... so maybe you wouldn't even reach a naturally corrupted AES-GCM ciphertext? Not sure what the probabilities are for the UDP checksum, though.


I want to point out that E2E encryption on the Web doesn't make your conversation more secure than regular HTTPS encryption without E2E encryption. The core of the argument here is the threat model. The whole point of using E2E on any conversation is to make sure that even the service provider cannot read your conversation. It's very clear that the threat model is against the service provider. However, the very same service provider of your conversation channel also provides the underlying encryption application on the Web. That means, if the service provider wants to act evil, it's always possible to sneak you an application that steals your conversation by simply not applying E2E encryption, or by eavesdropping before encrypting.

The root of the problem is that Web applications don't have a root of trust. As long as this problem is not addressed, E2E encryption on the Web will always be meaningless.


This kind of thinking is why PGP never took off.

Yes, your ISP could hack the binaries, break the HTTPS trust model somehow, and alter what you download, but practically speaking that isn't going to happen. Trying to build your system to defeat an adversary who has that kind of power ends up making it too cumbersome for the 99.999% use case.

What I don't understand is why the server needs to decrypt the traffic at all. Why not just have it rebroadcast the encrypted streams? The endpoints could exchange crypto keys with public key crypto and to the video server it would just be a bunch of bytes. Clients would turn on and off video streams from different clients based on network conditions and how much screen real-estate they have. Audio could even be encoded on a different channel so it could always be forwarded even if the video is not.


>What I don't understand is why the server needs to decrypt the traffic at all

In short, to make it scale to many (tens to hundreds of) participants. Clever techniques like Simulcast[0] and SVC[1] allow that, but the routing server must support them to meet the different requirements of individual participants.

0. https://webrtchacks.com/sfu-simulcast/

1. https://webrtchacks.com/chrome-vp9-svc/


With media streams the provider typically wants to be able to recode the data to lower quality to fit the bandwidth available to clients with lower bandwidth / congestion etc.

There are ways around this but they are quite complex and place additional CPU overhead on each sender.

Always Tradeoffs.


As if Zoom or any pro company doesn't actually know. Bragging about knowing what it means actually makes them sound amateurish.


Zoom has fundamental problems that make them unable to support E2E easily. They only recently started sending video over WebRTC data channels; before, they were sending over websockets. They're not even using the browser's native media streaming APIs. So E2E is a huge problem for them to implement.


Tangential (and noob) question: Could there be an encryption+compression scheme where:

1. Sender sends an encrypted stream at K bps.

2. Server takes the encrypted stream and compresses it, without decrypting anything. It then sends encrypted+compressed stream at J bps (J<K) to end recipient.

3. The end recipient decrypts the compressed stream using a key provided by the original sender, not the server.

This with reasonably secure encryption and reasonably size-efficient codecs, obviously. So the step 2 would add compression additional to the compression of the original codec.

Is this mathematically possible?


If the data is encrypted well, it should be indistinguishable from random noise.

You can't compress random noise.

Ergo, you can't compress encrypted data. It's a nice idea though.
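The claim is easy to demonstrate: a general-purpose compressor shrinks redundant data dramatically but cannot shrink random bytes, which is what well-encrypted data looks like. A small stdlib check (the sizes and data are illustrative):

```python
import os
import zlib

plaintext = b"AAAA" * 4096                  # highly redundant data, 16 KiB
random_bytes = os.urandom(len(plaintext))   # stand-in for well-encrypted data

compressed_plain = zlib.compress(plaintext, 9)
compressed_random = zlib.compress(random_bytes, 9)

assert len(compressed_plain) < len(plaintext) // 100  # redundancy compresses well
assert len(compressed_random) >= len(random_bytes)    # "compressing" noise only adds overhead
```

Any scheme that let the server meaningfully compress the ciphertext would imply the ciphertext has detectable structure, i.e. the encryption is leaking information.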


Generally it's not possible to compress well encrypted data with any meaningful scheme, but if your encryption is stream cipher based, you can subsample a stream of PCM audio on the server quite easily, by just removing, say every second byte (or every second pair of bytes). Then decrypting entities can insert dummy data at every second byte position, xor with the encryption stream and discard the data again, replacing it with interpolated data instead. Probably the same can be done for uncompressed image data, but any kind of serious compression wouldn't be possible and this simple compression can already be done on the clients without much overhead. So I guess it's mostly of theoretical use.
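The subsampling trick described above works because with a stream cipher each ciphertext byte depends only on the keystream byte at the same position, so the server can drop bytes blindly and the client can still decrypt the survivors. A toy sketch (the HMAC-counter keystream and PCM stand-in are illustrative, not a real codec or cipher):

```python
import hmac
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream (HMAC-SHA256 in counter mode), for illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hmac.new(key, counter.to_bytes(4, "big"),
                            hashlib.sha256).digest())
        counter += 1
    return bytes(out[:length])

key = b"k" * 32                      # fixed toy key
pcm = bytes(range(256)) * 4          # stand-in for raw PCM audio samples
ciphertext = bytes(p ^ k for p, k in zip(pcm, keystream(key, len(pcm))))

# The server, which never sees the key, drops every second ciphertext byte.
thinned = ciphertext[::2]

# The client knows byte i of the thinned stream sat at position 2*i in the
# original, so it XORs against the keystream byte at that position.
ks = keystream(key, len(ciphertext))
recovered = bytes(c ^ ks[2 * i] for i, c in enumerate(thinned))
assert recovered == pcm[::2]         # every second sample, recovered intact
```

As the comment notes, this only yields crude decimation of uncompressed media; nothing like a modern codec's rate reduction is possible without the plaintext.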

You would rather have each client send streams in multiple resolutions, say one in 1080p and one in 480p, and then make the central server decide which stream to send to which client. Taking one step further, the clients can be asked to adjust the quality of the stream they are sending depending on which streams they need. There are obvious latency concerns, but there are also bandwidth savings in having less data being uploaded by clients. Actually the bandwidth use is highest if the client uploads an uncompressed or barely compressed unencrypted stream for the server to compress. It's much better if the client's hardware did the compressing part.


Usually you compress first, then encrypt. Unencrypted data has a better compression ratio.

Also modern video data is already compressed where most frames are difference data.


Generally speaking, video compression algorithms are 'lossy' (they cause a reduction in quality) and need to be able to 'see' the video in order to compress it. The compression typically removes details that are not perceivable by humans (this is also the case for image and audio compression), and sends just the changes from frame to frame, rather than each entire frame. For both of these, the compression algorithm needs knowledge of the video stream, which would be impossible if it's encrypted.

You could try to use a lossless compression algorithm (such as used by zip), but those are effectively useless on video in general, and even more so on encrypted data, which appears to be random.


Thanks for your answer.

> the compression algorithm needs knowledge of the video stream, which would be impossible if it's compressed.

That makes perfect sense, and I guess my theory of secure encryption + post-re-compression fails because of this. But what if we didn't need perfectly secure encryption, just per-block encryption? So the server knows that you've sent 60 frames, but doesn't know what is in those frames.

What if the codec was made in such a way that the server knew that certain blocks in the stream could be discarded to reduce size while the stream without those blocks still makes sense to the recipient.

For example: the stream is composed of 64-byte blocks, but the codec says that after every 2 blocks there's a discardable block that adds image quality but is not essential. So, with this knowledge, the server discards those blocks when sending the data to people with low bandwidth, and sends the original stream with all its blocks to those with high bandwidth.

It's an extremely naive scheme, but maybe this principle could be applied to more complicated codecs, so the server only needs to know metadata about the stream (where each block is and whether it's essential), but not the content of the block itself (framebuffer and audio sample values).

I'm sorry if this idea is too dumb (and my English skills are not the best).
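The naive scheme described above can be sketched as follows. This is a toy illustration under the commenter's own assumptions: the XOR "encryption" with a reused key is deliberately not secure, and the essential/discardable flag is plaintext metadata the server is allowed to see:

```python
import os

BLOCK = 64

def xor(data: bytes, key: bytes) -> bytes:
    """Toy per-block 'encryption' for illustration only (not secure)."""
    return bytes(d ^ k for d, k in zip(data, key))

key = os.urandom(BLOCK)

# Sender marks every third block as a discardable enhancement block; the
# flag travels in the clear, the payload is encrypted.
blocks = []
for i in range(9):
    payload = bytes([i]) * BLOCK
    essential = (i % 3 != 2)
    blocks.append((essential, xor(payload, key)))

# The server reads only the flags, never the payloads, and thins the
# stream for a low-bandwidth recipient.
thinned = [(e, c) for e, c in blocks if e]

# The recipient decrypts whatever arrives; the essential layer is intact.
decoded = [xor(c, key) for _, c in thinned]
assert len(thinned) == 6
assert all(d == bytes([i]) * BLOCK
           for d, i in zip(decoded, [0, 1, 3, 4, 6, 7]))
```

This is essentially the layered idea behind SVC mentioned elsewhere in the thread, with the layer boundaries exposed as metadata instead of requiring the server to decode the media.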


Interesting; some time ago the Ogg audio transport was working on bitrate peeling https://en.wikipedia.org/wiki/Bitrate_peeling with the basic idea that a stream can be encoded at one bitrate but can be served at that or any lower bitrate. A simpler example is FM stereo radio - the main frequency provides a mono audio signal, which works just fine, but if the receiver can also pick up the stereo sub-frequency (containing just the diff of left and right channels), it gets stereo.

Anyways, the wikipedia page linked above links to this about the same concept, for video: https://en.wikipedia.org/wiki/Scalable_Video_Coding so it might be feasible. Not sure how feasible/secure encrypting this would be.


Naively no, but I wonder if there is some variant of homomorphic encryption that does this? I would be surprised though if this was efficiently possible.


Yes, I think homomorphic encryption is what I was thinking about, although I didn't know its name until now.


You can't compress properly encrypted data.


Shouldn't they take inspiration for the end-to-end encryption from OpenSSH? As far as I know it's used by many worldwide and it seems fairly secure. Maybe I am missing something, though.


It looks like what they have now is OpenSSH-like: client 1 sends data to the server encrypted, the server decrypts, then the server re-encrypts and sends to client 2.

The article describes how clients 1 and 2 can communicate with the server in the data path while the server is unable to see the contents of the messages.

The hard part is key management between clients in a secure way.


You said "server decrypts" but then also "server being unable to see contents". How? If I decrypt, I see contents.


I think he's wrong. The server never receives the key. The bridge passes along the encrypted stream and encryption/decryption only happens client side.

In the demos, you can see that the key is passed as a fragment (#) parameter and not a query parameter.


I was unclear in my comment. I was describing a before and after set of solutions. The former allows for server side decryption. The latter approach would prevent server side decryption.


Now I get it, thx


1. OpenSSH is point-to-point, there's no one-to-many messages.

2. OpenSSH doesn't stop sessions for a user whose authorization info has changed, which is something Jitsi tries to prevent.



