Hacker News | geocar's comments

> Do you think Chinese LLMs acquired training data legitimately?

I think they probably acquire it in accordance with Chinese law.

> but I don't think the US "started it" to be fair.

Who are you quoting with those marks? Started what? To be fair to whom?


> I think they probably acquire it in accordance with Chinese law.

You can easily look up[1] how China struggles with effective enforcement of IP laws.

And specifically for LLMs, Anthropic recently claimed that Chinese models trained on it without permission.[2]

> Who are you quoting with those marks?

Double quote marks have other uses besides direct quotes, such as signaling unusual usage.[3] In this case, talking about countries like they're squabbling kids.

> Started what?

Fishy use of others' IP, packaging others' work without attribution.

> To be fair to whom?

To US companies using Chinese LLMs without attribution.

---

[1]: https://en.wikipedia.org/wiki/Allegations_of_intellectual_pr...

[2]: https://www.reuters.com/world/china/chinese-companies-used-c...

[3]: https://en.wikipedia.org/wiki/Quotation_marks_in_English#Sig...


They said Chinese law, which is not the same as American law, and presumably using IP the way they have is legal there, if indeed they actually did, as allegations of IP theft are just that, allegations, and even if they weren't, all nations in the history of mankind have been "stealing" "intellectual property" since forever, including the US from Britain, literally with the good graces of the fledgling US government [0].

As to what Anthropic said, it's quite specious as this analysis shows [1], i.e. the number of "exchanges" amounts to only a day or two of prompting, not nearly enough to actually get good RL training data from. Regardless, it's not as if other American LLM companies obtained training data legitimately, whatever that means in today's world.

[0] https://theworld.org/stories/2014/02/18/us-complains-other-n...

[1] https://youtu.be/_k22WAEAfpE


The linked wikipedia article specifically talks about China struggling to enforce Chinese law. Here's a quote:

> Despite making efforts in intellectual property protection in China, a major obstacle in prosecution is corruption in courts; local protectionism and political influence prohibits effective enforcement of intellectual property laws. To help overcome local corruption, China established specialized IP courts and sharply increased financial penalties.

> all nations in the history of mankind have been "stealing" "intellectual property" since forever

You can't use 100-400 years ago as the counterexample to what happens today. It's like justifying Russian invasion of Ukraine with colonists invading Native American territories. We're in a different world order, things that were normalized that far back shouldn't be normalized today.


> The linked wikipedia article specifically talks about China struggling to enforce Chinese law. Here's a quote:

> > Despite making efforts in intellectual property protection in China, a major obstacle in prosecution is corruption in courts; local protectionism and political influence prohibits effective enforcement of intellectual property laws. To help overcome local corruption, China established specialized IP courts and sharply increased financial penalties.

That doesn't sound like struggling to me.

https://www.matec-conferences.org/articles/matecconf/pdf/201...

Compare with the growth in cases in the US:

https://www.uscourts.gov/data-news/judiciary-news/2020/02/13...

Why is it China increasing cases is evidence of struggling to you? Do you think the US is also struggling? What exactly are you talking about?

> You can't use 100-400 years ago as the counterexample to what happens today.

The US joined the Berne convention in 1988. I do not think we are talking about 400 years ago; we're talking about the majority of US history, during which US law said it was okay to ignore the copyrights of the rest of the world.

> It's like justifying Russian invasion of Ukraine with colonists invading Native American territories

I don't agree: One can equally hold that there is no justification for the invasion of Ukraine, just like there was no justification for invading Native American territories.


They are struggling to enforce domestic IP law because it directly affects their own businesses, they don't care about international IP law.

Human nature is the same in any time period, there is no "normalization" at all, it's just how humans have always and will always continue to act, even today, with the world order currently breaking down.


Human nature may be the same, but it differs based on context. Humans act differently in a threatening high risk, low order world than they do in a more stable, lawful world. There is normalization, because in a pre-nuclear, pre-military alliance, pre-diplomacy, pre-world-police world you had to be much more ruthless and cunning as a state. The norms for people were completely different.

I see no evidence that they do act substantially differently post nukes given everything going on in the world in the news today. Regardless, this thread is going off topic, have a good day.

> You can easily look up[1] how China struggles with effective enforcement of IP laws.

I didn't see anything in there about Chinese companies violating Chinese law.

Can you so easily look up how American companies struggle with effective enforcement of Chinese IP laws? I think it should be pretty easy to see how American companies struggle with effective enforcement of European IP laws, and I can tell you it is similar.

From here, it is not so clear that the US can even enforce its own laws at the moment.

> signaling unusual usage

Thank you!

> In this case, talking about countries like they're squabbling kids.

> > Started what?

> Fishy use of others' IP, packaging others' work without attribution.

I see. I guess if China is 3000 years old then maybe obviously, because the US is such a young country by comparison.

So you think it is "fair"[1] to violate Chinese Law because there were people in China who violated US law first?

If so, I think that is pretty childish.

[1]: I am trying it out!


> So you think it is "fair" […]

Maybe fair in a tit-for-tat sort of way, but not okay. That's why I called the whole situation funny. The rest of your post is answered in the sibling comment.


Initial thoughts are it's a meh protocol that does not look well thought-out, has fewer features than SSH, to the point I'm not sure it deserves to be called SSH3 and not telnet-over-websockets. Also, there's already an SSH3 https://marc.info/?l=openssh-unix-dev&m=99840513407690&w=2 so I _really_ think the thing you're thinking of is just some namesquatter assuming it has any connection to openssh or ssh.

I also know how to use SRV records so this is a non-issue for me and everyone I work with.


Specifically to use a different key for each host.

This has history: https://egopoly.com/2008/02/ssh-slow-on-leopard.html

I also know of https://github.com/Crosse/sshsrv and other tricks

I agree more SRV records would have helped with a tremendous number of unnecessary proxies and wasted heat energy from unnecessary computing, but in this day and age, I think ECH/ESNI-type functions should be considered for _every_ new protocol.


SRV is essentially a simple layer of abstraction that provides (via one approach) the required end result (reachability + UX) that is easy to add to any $PROTO client without. Supporting ESNI would complicate the actual lib/protocol, increase the amount of dev and maintenance work required all around, significantly increase complexity, and require more infrastructure and invasive integration than any DNS-enabled service already uses.
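To make the SRV idea concrete, here is a minimal sketch of what a client would do with the records once it has them. The `_ssh._tcp` label, the `example.com` hosts, and the ports are assumptions made up for illustration, and the records are hardcoded rather than fetched from DNS. RFC 2782 asks for a weighted random choice among records of equal priority; this sketch simplifies that to "highest weight wins".

```python
from typing import List, NamedTuple


class Srv(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(line: str) -> Srv:
    """Parse a zone-file style line: <name> <ttl> IN SRV <prio> <weight> <port> <target>."""
    priority, weight, port, target = line.split()[-4:]
    return Srv(int(priority), int(weight), int(port), target.rstrip("."))


def pick(records: List[Srv]) -> Srv:
    """Lowest priority first; ties broken by highest weight (a simplification of RFC 2782)."""
    return min(records, key=lambda r: (r.priority, -r.weight))


# Hypothetical records, as a resolver might return them for _ssh._tcp.example.com
ZONE = [
    "_ssh._tcp.example.com. 3600 IN SRV 10 5 2222 vm1.example.com.",
    "_ssh._tcp.example.com. 3600 IN SRV 10 1 2223 vm2.example.com.",
    "_ssh._tcp.example.com. 3600 IN SRV 20 0 22 backup.example.com.",
]

best = pick([parse_srv(line) for line in ZONE])
print(f"ssh -p {best.port} {best.target}")
```

The point of the sketch is that the host and port come out of DNS, so the client needs no extra protocol machinery beyond one lookup and a sort.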

> The HTTP traffic goes to a server (a reverse proxy, say nginx) on the host, which then reads it and proxies it to the correct VM.

That's one implementation. Another implementation is the proxy looks at the SNI information in the ClientHello and can choose the correct backend using that information _without_ decrypting anything.

Encrypted SNI and ECH require some coordination, but still don't require decryption/trust by the proxy/jumpbox which might be really important if you have a large number of otherwise independent services behind the single address.
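As a rough sketch of the plain-SNI case described above: the server name sits unencrypted in the first TLS record, so a proxy can read it and choose a backend without holding any keys. The `BACKENDS` table, the hostnames, and the `build_client_hello` helper are hypothetical and exist only to make the example self-contained; a real proxy would also have to handle ClientHellos split across records, and with ECH the name visible here would only be the outer public name.

```python
import struct
from typing import Optional

# Hypothetical routing table: SNI hostname -> backend address
BACKENDS = {"vm1.example.com": ("10.0.0.11", 443), "vm2.example.com": ("10.0.0.12", 443)}


def extract_sni(record: bytes) -> Optional[str]:
    """Return the SNI hostname from a single-record TLS ClientHello, or None."""
    # Record header: type (1) = handshake, version (2), length (2)
    if len(record) < 9 or record[0] != 0x16 or record[5] != 0x01:
        return None
    pos = 5 + 4        # skip record header and handshake header (type + 3-byte length)
    pos += 2 + 32      # skip client_version and random
    try:
        pos += 1 + record[pos]                                   # session_id
        pos += 2 + struct.unpack_from("!H", record, pos)[0]      # cipher_suites
        pos += 1 + record[pos]                                   # compression_methods
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = min(pos + ext_total, len(record))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:  # server_name extension
                # list length (2), name_type (1), name length (2), name
                if record[pos + 2] != 0:
                    return None
                (name_len,) = struct.unpack_from("!H", record, pos + 3)
                return record[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None
    return None


def build_client_hello(hostname: str) -> bytes:
    """Build a minimal ClientHello carrying only an SNI extension (demo helper)."""
    name = hostname.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name
    ext_data = struct.pack("!H", len(entry)) + entry
    extensions = struct.pack("!HH", 0, len(ext_data)) + ext_data
    body = (b"\x03\x03" + bytes(32) + b"\x00" + struct.pack("!H", 2) + b"\x13\x01"
            + b"\x01\x00" + struct.pack("!H", len(extensions)) + extensions)
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


hello = build_client_hello("vm1.example.com")
print(extract_sni(hello), "->", BACKENDS.get(extract_sni(hello)))
```

After picking the backend, the proxy would simply forward the raw bytes (including the ClientHello it peeked at) and never terminate TLS itself.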


I don't see any reason that GitHub should use rel="nofollow"

Github only has authority because people put their shit there; if people want to point that back at the "right" website, Github should be helping facilitate that, instead of trying to help Google make their dogshit search index any better.

I mean, seriously, doesn't Bing own Github anyway?


Perverse incentives strike again! Websites that allow links in user-generated content are spammed with user-generated spam links to improve SEO of spam sites, which hurts the site's own reputation because most of the links on it are spam. To avoid this, all sites use nofollow.


As this example shows, by all sites using nofollow, Github is improving the SEO of spam sites.

What the fuck are you talking about?


GitHub doesn't care if spam sites have SEO, as long as GitHub isn't being penalized for linking to them.


Why exactly do you think GitHub should be penalized?

Talk about perverse.


I completely agree: If it is ugly-as-sin-but-useful I will learn it.

The aesthetic of mathematics as it appears in journals is I think questionable, but undeniably convenient for communication, and so it is with every language making the case that you (dear reader) can say something very complicated and useful in the ideal amount of space.

"Hello world" isn't that: That's the one program everyone should be able to write correctly, 100% of the time. That's how we can talk about brainfuck as exercise, but APL is serious.

Or put another way, even if seeing a new kind of "hello world" excites dear reader, it's probably not going to excite me, unless it's objectively disgusting.

What Om does here is exactly right for me: It tells me what it is, and makes it easy for me to drill down to each of those things to figure out what the author means by that, and decide if I am convinced.

I mean, that's the point right? I'm here trying to learn something new and that requires I allow myself to be convinced, and since "hello world" is table-stakes, seeing it can only slow my ability to be convinced.


These "types" are Hindley-Milner types and have almost nothing to do with what C calls a type.

Your "feelings" may help you make snap judgements that can keep you alive, but they cannot help you code and they will conspire against you when you make an effort to learn new things. Nobody wants to feel wrong, and you will feel wrong many times when you learn something new, but it is the only way to actually learn the thing. Remember this the next time you have "feelings" about knowledge.


So what problem is this solving? No need to be a dick.


> No need to be a dick.

But there was a need for you to characterise me so?

> So what problem is this solving?

What makes you ask me that instead of reading the website and papers for yourself? Do you think I could possibly know enough about the kinds of other problems you have from the example one that makes you call me names?

I mean, did you read even the first page of the paper I suggested? Were you confused by anything in the first paragraph? Do you know what System-F means in that context? Did you do an Internet search? Anything? Anything at all you could say you got stuck on that you didn't understand? Or did you somehow get the impression I should spoon-feed you?

Why do you waste anyone's time with this?


You don't know me.


> If you're gonna make a website for your programming language, you NEED to put an example of the language front and center on the landing page.

Did you consider the possibility that this sort of thing was done to avoid wasting time with non-experts who think an "example" of a language they don't know is enough to make comments about?

> I still have not seen a line of TAL

My suggestion: Start with the "Papers" and then look at the paper that introduces TAL. It has an example program with analysis


I’m an expert and I find it very frustrating when I don’t find some example code front and centre. It might not reveal the detail, but it sets the scene quickly and lets me know what sort of a thing I’m dealing with.


> I’m an expert and I find it very frustrating

So you say, but I think _I'm_ an expert too, and I wasn't frustrated in the slightest. Maybe you're just not an expert in this space. Did you consider that?

Of course it would be nice if everyone communicated to us in our preferred way, but I think making the reader work a little bit before they have a conversation is a good way to figure out if you're dealing with an expert or not, because an expert actually worth talking to about your ideas will not find it to be too much work to understand them

Students can especially benefit from this advice, because they are still too new to be able to recognise experts from the substance of their words


"I don't like having my time wasted" does not imply anything about one's skill in a field.

It's not 1995. Most of the internet is noise, and if you're showcasing something it's good form to immediately show your readers what it actually is, and why they may or may not care about it.

Not showing the syntax of a programming language on the homepage of a programming language is poor communication. If you're OK with that - great, but not valuing your time and willingness to have it wasted in no way implies that you're an "expert".


> "I don't like having my time wasted" does not imply anything about one's skill in a field.

I have no idea what you think you just said, but I did not say anything like that.

> Most of the internet is noise, and if you're showcasing something it's good form to immediately show your readers what it actually is

So you say, but without responding to either of my suggestions for not doing this, and after saying something that doesn't sound relevant at all.

What exactly are you trying to convince me to do? I'm not the author of this page, I'm not confused by what TAL is, and I'm not going to agree that you don't deserve to have your time wasted when you're here wasting mine, so what is it?


> We can't say that a function equals a set

Why not?

Can we not just as easily speak of the set of all inputs and the set of all outputs? Why exactly, then, is a function not a set of morphisms/arrows?

To me, x->x+1 and {(x,x+1)|x∈R} seem the same[1] but maybe it just seems useful to be able to make statements of the cardinality of that set: If there are a lot of rules, then that set is big, but if there are few rules (like x->x+1), that set is small. This is enough to permit some analysis.

It also preserves "plus" for sets, because a function plus a function is the sum of those rules being considered.

What is it you think I am missing?

[1]: I understand I don't really mean big-R here because computers have limited precision for fadd/add circuits, so if you'd prefer I said something slightly differently there please imagine I did so.
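A small sketch of the "function as a set of pairs" view, restricted to a finite domain as the footnote allows. The names `DOMAIN`, `graph`, and `is_functional` are made up for illustration, and reading "plus" as set union is only one possible interpretation of the comment.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

Pairs = FrozenSet[Tuple[int, int]]

DOMAIN = range(-3, 4)  # a small finite stand-in for R


def graph(f: Callable[[int], int], domain: Iterable[int] = DOMAIN) -> Pairs:
    """The set {(x, f(x)) | x in domain}."""
    return frozenset((x, f(x)) for x in domain)


def is_functional(pairs: Pairs) -> bool:
    """True if every input appears with exactly one output."""
    inputs = [x for x, _ in pairs]
    return len(inputs) == len(set(inputs))


succ = graph(lambda x: x + 1)
also_succ = graph(lambda x: 1 + x)   # a different rule, the same set
double = graph(lambda x: 2 * x)

print(succ == also_succ)             # equality of functions is equality of sets
print(len(succ))                     # cardinality is well defined
print(is_functional(succ | double))  # the union is a set of pairs, but not a function
```

The last line is where the two views part ways: sets of pairs are closed under union, but the result need not map each input to a single output.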


You miss the point here. Just because functions happen to be sets in ZF does not mean sets of functions are functions. O(...) denotes a set of functions.
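For reference, the usual textbook definition behind "O(...) denotes a set of functions" can be written out as follows. The witnesses c and n_0 are the standard ones and do not appear in the thread.

```latex
O(g) = \{\, f : \exists c > 0,\ \exists n_0,\ \forall n \ge n_0,\ |f(n)| \le c \cdot g(n) \,\}
\qquad\text{so}\qquad
f(n) = O(g(n)) \ \text{is shorthand for}\ f \in O(g).
```

Under this reading the "=" is membership, not equality, which is why it cannot be turned around into O(g(n)) = f(n).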


> Just because functions happen to be sets in ZF does not mean sets of functions are functions. O(...) denotes a set of functions.

We can enumerate all programs up to a given length, so up to that limit all sets of functions are functions.

f(x)=O(g(x)) still makes sense in exactly this way: if g(x) is 1, then f(x) is a function that is O(1) right? How do we know g(x) is 1? Because all programs of some length that compute f(x) have that property. Of course there are longer programs that do it, and shorter programs that don't, and other programs still, but we're talking about these ones.

f(x)<O(g(x)) then says that f(x) must be shorter than that; it isn't a member of the set.

What do you think I am missing?

