• CriticalResist8@lemmygrad.ml

    The article mentions “Packing multiple models per GPU”, but also “using a token-level autoscaler to dynamically allocate compute as output is generated, rather than reserving resources at the request level”. I’m not sure what that means, but it may hint that there are ways to scale this down.
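
    My best guess at what that phrase could mean, as a toy sketch (the function names, the numbers, and the continuous-batching-style scheduling are my own assumptions, not anything from the article): instead of holding GPU capacity for a whole request, the scheduler re-decides what runs at every decoding step, so finished requests free capacity immediately.

    ```python
    import random

    random.seed(0)

    # Each request needs some number of output tokens.
    requests = [random.randint(5, 50) for _ in range(20)]
    SLOTS = 4  # toy "GPU capacity": how many sequences can decode per step


    def request_level(reqs, slots):
        """Reserve a slot per request until it finishes.

        A slot stays occupied for the request's whole lifetime, so short
        requests in a batch wait for the longest one before capacity frees up.
        """
        steps = 0
        queue = list(reqs)
        while queue:
            batch, queue = queue[:slots], queue[slots:]
            steps += max(batch)  # whole batch is held until the longest finishes
        return steps


    def token_level(reqs, slots):
        """Re-allocate slots at every decoding step.

        Finished requests release capacity immediately and queued requests
        join the batch mid-flight, so fewer total steps are spent idle.
        """
        pending = list(reqs)
        running = []
        steps = 0
        while pending or running:
            # Fill any freed slots from the queue before each step.
            while pending and len(running) < slots:
                running.append(pending.pop(0))
            steps += 1
            running = [r - 1 for r in running if r - 1 > 0]
        return steps


    print("request-level reservation:", request_level(requests, SLOTS), "steps")
    print("token-level allocation:   ", token_level(requests, SLOTS), "steps")
    ```

    In that toy model the token-level version finishes in far fewer steps for the same workload, which is why I read it as a possible path to running this on less hardware.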

    If not Alibaba, then other researchers will eventually get to it.