Is copying huge blocks of data *free* in 2024? My benchmarks suggest otherwise, ...

tsimionescu · 2024-08-12T05:58:48 1723442328

The way almost all programming languages work is that they explicitly pass a copy of a pointer to a function. That is, in almost all languages used today, whether GC or not, assigning to a function parameter doesn't modify the original variable in the calling function. Assigning to a field of that parameter will often modify the field of the caller's local variable, though.

That is, in code like this:

  ReferenceType a = {myField: 1}
  foo(a)
  print(a.myField)
  
  void foo(ReferenceType a) {
    a.myField = 9
    a = null
  }

Whether you translate this pseudocode to Python, Java, C# (with `class RefType`), C (`RefType = *StructType`), Go (same as C), C++ (same as C), Rust, Zig etc - the result is the same: the print will work and it will say 9.

The only exceptions I know of, where the print would fail with a null pointer error, are C++'s references and C#'s ref parameters. Are there any others?
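For what it's worth, here is a minimal C++ sketch of both cases. `StructType` and all the function names are hypothetical stand-ins for the pseudocode above, not anything from a real codebase. The pointer version leaves the caller's pointer intact, while the reference-to-pointer version nulls it out.

```cpp
#include <cassert>

// Hypothetical type standing in for the pseudocode's ReferenceType.
struct StructType {
    int myField;
};

// Pointer passed by value: `p` is a local copy of the caller's pointer.
void foo_by_pointer(StructType* p) {
    p->myField = 9;   // writes through the pointer; visible to the caller
    p = nullptr;      // only changes the local copy
}

// Pointer passed by reference: `p` is an alias for the caller's variable.
void foo_by_reference(StructType*& p) {
    p->myField = 9;   // still visible to the caller
    p = nullptr;      // the caller's pointer becomes null too
}

// Returns the field value the caller sees after foo_by_pointer.
int field_after_pointer_call() {
    StructType s{1};
    StructType* a = &s;
    foo_by_pointer(a);
    assert(a != nullptr);  // the caller's pointer was not modified
    return a->myField;
}

// Returns true if foo_by_reference nulled the caller's pointer.
bool caller_nulled_after_reference_call() {
    StructType s{1};
    StructType* a = &s;
    foo_by_reference(a);
    return a == nullptr;
}
```

In the second case, printing `a->myField` after the call would dereference a null pointer, which is the failure described above.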

klyrs · 2024-08-12T16:03:17 1723478597

Right. Passing pointers is much cheaper than passing values of large structures. And then references are an abstraction over pointers that allow further compile-time optimization in languages that support it. Pass-by-value, pass-by-pointer, and pass-by-reference are three distinct operational concepts that should be taught to programmers.

tsimionescu · 2024-08-12T16:31:07 1723480267

I think the right mental model is pass-by-value for the first two. There is nothing different in the calling convention between sending a parameter of type int* vs a parameter of type int. They are both pass-by-value. The value of a pointer happens to be a reference to an object, while the value of an int is an int. In both cases, the semantics is that the value of the expression passed to the function is copied to a local variable in the function.

Depending on the language, that is very likely the whole picture of how function calls work. In a rare few modern languages, this is not true: in C# and C++, when you have a reference parameter, things get somewhat more complicated. When you pass an expression to a reference parameter, instead of copying the value of evaluating that expression into the parameter of the function, the parameter is that value itself. It's probably easier to explain this as passing a pointer to the result of the expression + some extra syntax to auto-dereference the pointer.
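A rough C++ sketch of the three cases, using hypothetical function names: the by-value version cannot affect the caller, while the pointer and reference versions both can. The reference version behaves like the pointer version with the `&` at the call site and the `*` in the body supplied for you.

```cpp
#include <cassert>

// By value: the function gets its own copy of the int.
void set_by_value(int x) { x = 42; }

// By pointer: the pointer itself is copied; the caller's int is reached via *p.
void set_by_pointer(int* p) {
    assert(p != nullptr);
    *p = 42;
}

// By reference: x names the caller's int; no explicit dereference is written.
void set_by_reference(int& x) { x = 42; }

// Each helper returns what the caller observes after the call.
int after_value()     { int n = 1; set_by_value(n);     return n; }
int after_pointer()   { int n = 1; set_by_pointer(&n);  return n; }
int after_reference() { int n = 1; set_by_reference(n); return n; }
```

Only `after_value()` still sees the original 1; the other two see 42.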

klyrs · 2024-08-12T17:13:11 1723482791

> I think the right mental model is pass-by-value for the first two. There is nothing different in the calling convention between sending a parameter of type int* vs a parameter of type int.

You're talking about parameters of type int; I'm talking about structs that are strictly larger than pointers. Structs which may be nested; for which deep copies are necessary to avoid memory leaks / corruption. And here, the distinction between these "mental models" exhibits a massive gap in real performance.

Here's a deliberately pathological case in C++; I've seen this error countless times from programmers in languages that make a distinction between references/pointers and values:

    bool vector_compare(vector<int> vec, size_t i, size_t j) {
        return vec[i] < vec[j];
    }

    int vector_argmin(vector<int> vec) {
        if (vec.size()) {
            size_t arg = 0;
            for(size_t i = 1; i < vec.size(); i++) {
                if (vector_compare(vec, i, arg))
                    arg = i;
            }
            return arg;
        } else return -1;
    }

The vector_compare function makes a copy of the full vector before doing its thing; this ends up turning my linear-looking runtime into accidentally-quadratic. From the perspective of this solitary example, it would make sense to collapse reference/pointer into the same category and leave "value" on its own.
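To make the cost visible, here is a sketch that counts copies. `CountedVec`, `g_copies`, and the helper names are hypothetical instrumentation, not part of the code above. It assumes the fix is simply to take the vector by const reference.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Global counter of CountedVec copy constructions (illustration only).
std::size_t g_copies = 0;

// Hypothetical wrapper around vector<int> that counts its own copies.
struct CountedVec {
    std::vector<int> data;
    explicit CountedVec(std::vector<int> d) : data(std::move(d)) {}
    CountedVec(const CountedVec& other) : data(other.data) { ++g_copies; }
};

// Same shape as vector_compare: takes the whole container by value.
bool compare_by_value(CountedVec vec, std::size_t i, std::size_t j) {
    return vec.data[i] < vec.data[j];
}

// The fix: take a const reference, so nothing is copied.
bool compare_by_ref(const CountedVec& vec, std::size_t i, std::size_t j) {
    return vec.data[i] < vec.data[j];
}

// Index of the smallest element, or -1 if the container is empty.
template <typename Compare>
long argmin(const CountedVec& vec, Compare less) {
    if (vec.data.empty()) return -1;
    std::size_t arg = 0;
    for (std::size_t i = 1; i < vec.data.size(); i++) {
        if (less(vec, i, arg)) arg = i;
    }
    return static_cast<long>(arg);
}

// Runs argmin over five elements and reports how many copies were made.
std::size_t copies_during_argmin(bool by_value) {
    CountedVec v(std::vector<int>{5, 3, 8, 1, 9});
    g_copies = 0;
    long idx = by_value ? argmin(v, compare_by_value)
                        : argmin(v, compare_by_ref);
    assert(idx == 3);  // the smallest element, 1, sits at index 3
    return g_copies;
}
```

With n elements the by-value comparator is called n - 1 times and copies all n elements each time, which is where the quadratic behavior comes from. The by-reference comparator makes no copies.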

But actually these are three distinct concepts, with nuance and overlap, that should be taught to anybody with more than a passing interest in languages and compilers. I'm not here to weigh in on what constitutes a modern language, but the notion that we should just throw this crucial distinction away because some half-rate programmers don't understand it is patently offensive.

tsimionescu · 2024-08-12T19:51:59 1723492319

My point is the same for int as for vector<int>. There is 0 difference in the C++ calling convention between passing a vector<int> and a vector<int>: they both copy an object of the parameter type. Of course, copying a 1000 element vector is much slower than copying a single pointer, but the difference is strictly the size of the type. The copying occurs the same way regardless. This is also the reason foo(char) is less overhead than a foo(char).

Everything (except reference types) is pass-by-value, but of course values can have wildly different sizes.

Also, the problem of accidentally copying large structs is not limited to arguments; the same considerations apply to assignments. That is another reason why "pass-by-pointer" shouldn't be presented as some special thing: it's just passing a copy of a pointer.
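A small sketch of the assignment case, with hypothetical helper names: assigning a vector copies every element, while assigning a pointer to that vector copies only the pointer.

```cpp
#include <cassert>
#include <vector>

// Returns true if plain assignment produced an independent copy.
bool assignment_copies() {
    std::vector<int> a(1000, 7);
    std::vector<int> b = a;       // copies all 1000 elements
    assert(b.size() == a.size());
    b[0] = 0;                     // modifies only the copy
    return a[0] == 7 && a.data() != b.data();
}

// Returns true if copying a pointer leaves both pointing at the same vector.
bool pointer_copy_shares() {
    std::vector<int> a(1000, 7);
    std::vector<int>* p = &a;
    std::vector<int>* q = p;      // copies one pointer, not the elements
    (*q)[0] = 0;                  // modifies the one shared vector
    return a[0] == 0 && p == q;
}
```

Both are value copies in the sense used above; what differs is the thing being copied.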

klyrs · 2024-08-12T20:31:17 1723494677

Your point rather misses the mark.

Your vector<int*> is a red herring. The distinction I'm making is between passing a (vector<int>)* and a vector<int>, because those two objects have radically different sizes, and the distinction can and does create severe performance issues. And yet, pointers are still different from references: with a reference, you don't even need your object to have a memory address.

tsimionescu · 2024-08-13T03:09:02 1723518542

HN markup ate my *... Yes, I'm also talking about vector<int> and vector<int>*. They are indeed of radically different sizes, and the consequences of copying one are very different from the consequences of copying the other.

But this doesn't change the fact that they are both passed-by-value when you call a function of that parameter type.

kaba0 · 2024-08-12T08:44:09 1723452249

It’s semantics only. The compiler is free to optimize it in any way, e.g. if a function call gets inlined, there is nothing “returning” to begin with, it’s all just local values.

jerf · 2024-08-12T01:10:25 1723425025

See cousin posts. That's not what the terms mean.