Community Inference - For Good or Bad

There is a surplus of great models that can be hosted locally, but few members of the community have unlimited VRAM; most top out somewhere around 128 GB. What if we could pool our resources and host models for each other?

Two people exchange a gift
Photo by Annika Gordon / Unsplash

Community Hosted Inference as a Resource

A History of Sharing

Without getting too boring, we can take a quick sweep through today's tech outlook. From sharing code (open source) and distributing compute tasks (Folding@home, even Bitcoin), to Mastodon nodes, BitTorrent, Linux mirrors, and good old Minecraft servers, the community has shown a long-standing willingness to "donate" unused compute to projects that can make use of community-accessible resources.

For some, it's a matter of wanting to build the ultimate homelab. For others, it's a desire to know that, as a community, we can provide services roughly at a level equivalent to (sometimes better than) expensive platforms.

On multiple occasions after a colo build, I have personally spent time opening up the network to public compute from folding@home to classic Quake servers. It's a good way to do some network load testing, it's fun, and there's rarely any risk. Friends of mine have lit up "weekend services" from their University labs. There's a whole lot of hardware sitting idle at any moment and the popularity of the homelab has put that in the living room.

If you have a fancy homelab at home that sits idle most days, why not put it to good use?

Clients and Servers

At this point, anyone can go grab llama.cpp, Ollama, vLLM, and/or Lemonade Server. Lemonade Server is one of my favorites for its ability to offer CPU, GPU, and NPU in one package, exposed through the generic OpenAI-compatible API. Much like the other services in the list, this gives a roughly consistent interface across the board.

With hosting the models made this easy, using them is a matter of grabbing your favorite client, from LM Studio to Goose.

In just a few moments you can get started hosting and using your own local models.
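To make that concrete, here is a minimal sketch of calling a locally hosted server through the OpenAI-compatible chat completions API that llama.cpp, Ollama, vLLM, and Lemonade Server all expose. The base URL, port, and model name are assumptions; substitute whatever your own server reports.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> tuple[str, bytes]:
    """Build an OpenAI-compatible /v1/chat/completions request (URL + JSON body)."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, json.dumps(payload).encode("utf-8")


def ask(base_url: str, model: str, prompt: str) -> str:
    """POST the prompt to the server and return the assistant's reply text."""
    url, body = build_chat_request(base_url, model, prompt)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Because the interface is consistent across these servers, pointing the same client at a different community host is usually just a change of `base_url`, e.g. `ask("http://localhost:8080", "my-local-model", "Say hello.")`.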

Your only limits are a little imagination and a lot of hardware. The hardware is the tricky part, and one possible solution is finding people with resources to spare.

The Risks

We can't expect everyone to be quite as campfire-Kumbaya, hippie, free-(inference)-love as I'm making it out to be. We will inevitably have bad actors.

Anthropic has reported on AI-orchestrated cyber espionage campaigns using its service. Social media has no shortage of actors attempting to harass others and publish harmful material. The average 3D-printer enthusiast has to deal with 3D-printed firearms. Tor exit node operators have to deal with bad actors using the service for DDoS and intrusion activity.

The question becomes accountability. If we light up Qwen3.5 on llama.cpp and open ports, does that invite the feds to come knocking on our doors?

police car at street
Photo by Matt Popovich / Unsplash

Service Anatomy of Community Inference and Risk

Model Hosting

So far, in spite of some peculiar choices and behavior by certain models, no actual model has ever been banned or made illegal. Despite concerns about copyrighted training content, potential CSAM risks, and a model's ability to generate harmful code or discuss harmful subjects, no action has been taken under the law against a specific model.

Hosting a model itself for download currently appears to carry no risk.

Service providers like Hugging Face have taken this to an extreme with a massive amount of models being made available to anyone.

At this time there seems to be no risk in hosting a model itself.

Providing Inference

Data goes in, gets converted to tokens, some LLM magic happens, a decoder runs, and you have an artifact in the form of a produced document. That document could be text, code, tool calls, audio, images, or even video.

The actual inference process is, in a nutshell, a massive transform. Nothing is permanently stored by any of the services outlined previously.
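As a sketch of that transform, here is a toy greedy-decoding loop. The lookup-table "model" stands in for billions of parameters and is invented for illustration; the point is that generation is a pure function of the input tokens, with nothing persisted between requests.

```python
# Toy sketch of the inference loop: tokens in, tokens out, nothing retained.
# The "model" here is a lookup table standing in for billions of parameters.
TOY_MODEL = {
    ("<s>",): "hello",
    ("<s>", "hello"): "world",
    ("<s>", "hello", "world"): "</s>",
}


def generate(prompt_tokens: list[str], max_steps: int = 16) -> list[str]:
    """Greedy decode: repeatedly predict the next token until end-of-sequence."""
    tokens = list(prompt_tokens)
    for _ in range(max_steps):
        next_token = TOY_MODEL.get(tuple(tokens), "</s>")
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens
```

Once the loop exits, the only thing that leaves the server is the token list itself; no state survives the request.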

As of today, there have been no attempts to hold an inference provider liable for providing access to or transmitting LLM artifacts. We haven't seen a case where Google was held responsible for someone hosting Stable Diffusion, the service providers behind Grok never got pulled into that fiasco, and no current court case has gone after generic compute for the production of responses.

To mention Hugging Face again, the safetensors format should make it relatively safe to host and load the actual model weights, since loading safetensors files does not execute arbitrary code the way pickle-based formats can. If we provide inference, we should responsibly source models and protect our supply chain.
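One simple supply-chain measure is verifying the checksum of a downloaded model file against the value published by the source. A minimal sketch, assuming the publisher provides a SHA-256 digest (Hugging Face exposes per-file hashes):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large model files never need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: Path, expected_sha256: str) -> bool:
    """Compare against the checksum published by the model source."""
    return sha256_of(path) == expected_sha256.lower()
```

Refusing to serve any file whose hash does not match the published value is a cheap guard against tampered or corrupted weights.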

At this time there seems to be no risk in operating a model.

Artifact Generation

When the user submits a prompt, anything could come out of the other side. If you've used LLMs in the past, you'll know it could be complete trash, a refusal to answer the question, or potentially harmful material.

Harmful material in this case could be the description of building a harmful device, code that is clearly meant to defeat security controls, or harmful information about a specific person.

When it comes to large platforms like Anthropic and Google (Claude and Gemini, respectively), the services include layers of protection and moderation, typically in the form of guard models such as Llama Guard or ShieldGemma. If we host the model ourselves, we don't get these additional layers of control by default, and users may indeed generate harmful content.
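If we did want a guard layer in a self-hosted stack, it would sit between the user and the model, checking both the prompt and the completion. The sketch below uses a trivial keyword filter purely as a placeholder for a real classifier such as Llama Guard or ShieldGemma; the blocked topics and function names are invented for illustration.

```python
# Minimal sketch of where a guard layer sits in a serving pipeline.
# Real deployments would call a classifier model here (e.g. Llama Guard,
# ShieldGemma); this keyword filter only marks the same control point.
BLOCKED_TOPICS = {"explosives", "credential theft"}  # illustrative, not a real policy


def guard(text: str) -> bool:
    """Return True if the text passes the (placeholder) safety check."""
    lowered = text.lower()
    return not any(topic in lowered for topic in BLOCKED_TOPICS)


def serve(prompt: str, run_inference) -> str:
    """Check the prompt, run the model, then check the completion."""
    if not guard(prompt):
        return "Request refused by input guard."
    completion = run_inference(prompt)
    if not guard(completion):
        return "Response withheld by output guard."
    return completion
```

The design point is that the guard wraps inference rather than modifying the model: the same wrapper works in front of any OpenAI-compatible backend.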

If we store and serve the content there may be an obvious liability in distributing this material. The current outlook seems to be that Elon Musk and Twitter are not responsible for the massive distribution of harmful content through Grok, but we of course would not want to rely on getting the same treatment as Elon Musk.

Generated artifacts should reasonably be expected not to be stored anyway, as storing them would violate the user's privacy. This is something large providers like OpenAI simply do not care about, but again, we do not expect the same treatment as OpenAI.

Our best approach is to not store anything the user requests.

This leaves us at a point where the generated artifact becomes entirely the user's responsibility. We are not executing any generated code, we do not have agents taking actions on instructions, and we are not distributing harmful content outside of the initial user response.

Power Demand

Power is the significant question, especially given current events at the time of writing.

I mentioned that Lemonade Server can serve models from an NPU. An NPU workload draws significantly less power than a GPU workload, but with a clear performance trade-off.
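A back-of-envelope way to compare the two is energy per generated token: power draw divided by throughput. The wattages and token rates below are invented for illustration, not benchmarks; whether the NPU actually wins on energy per token depends on measurements of your own hardware.

```python
# Back-of-envelope energy cost per token, with made-up but plausible numbers.
# Replace the wattage and throughput with measurements from your own hardware.
def joules_per_token(watts: float, tokens_per_second: float) -> float:
    """Energy cost of one generated token: power draw divided by throughput."""
    return watts / tokens_per_second


gpu = joules_per_token(watts=300.0, tokens_per_second=60.0)  # hypothetical GPU
npu = joules_per_token(watts=10.0, tokens_per_second=5.0)    # hypothetical NPU

print(f"GPU: {gpu:.1f} J/token, NPU: {npu:.1f} J/token")
```

With these illustrative numbers the NPU is slower but cheaper per token; with different ratios the GPU could win, which is why measuring both axes matters before donating hardware to a pool.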

There is no good answer to this outside of begging people to build massive solar arrays to power LLMs ... but that is an obviously hypothetical, SolarPunk-esque fiction.

Quick Comparison to Other Services

A Tor Exit Node

Users may be familiar with the idea of sharing network resources through The Tor Project. In this scenario, you loan out excess bandwidth to route other users' traffic, with your node acting as the visible endpoint on the open internet.

In a hypothetical scenario, if someone was to launch a DDoS or other harmful interaction from the exit node you are hosting, the traffic would lead directly back to you.

An attack would originate directly from you.

This has an obvious implication.

BitTorrent

Hosting torrents involves loaning out your network resources to provide a higher level of services to other users. Data is distributed across torrent "seeders" and traffic is distributed across nodes and seeds.

This means you are storing potentially harmful and/or unlawful material and are directly facilitating the distribution of it.

This has an obvious implication.

Bitcoin and Crypto

Hosting crypto blockchains and doing mining involves hosting part of the blockchain and performing transformations to verify blocks.

Although there are cursed blocks in the Bitcoin blockchain that we aren't going to discuss here, no one so far has been held accountable for storing the blockchain on their PC or functioning as a client, miner, or stakeholder.

Game Servers including Minecraft

Game servers are a very risky proposition.

Anywhere that users can directly interact with each other presents the possibility of harassment, distribution of materials, doxxing, etc. In the case of Minecraft, chat could be used to facilitate these actions. In other games with voice, there is an obvious opportunity to be a bad actor.

Moderation is key to hosting a gaming server but the actual game server itself has little to no risk.

Mastodon Instances

Distributing and storing material on behalf of users for the Mastodon Social Network can have obvious implications.

Moderation is key to hosting a node.

Service Comparison Conclusion

Compared to other common services, hosting an LLM has few parallels.

If our service executed instructions on a user's behalf, it could end up distributing material and/or following instructions in a harmful program. Providing only inference carries neither of these implications.

If we were offering a complete service it would be critical to offer guard models to prevent harmful content from being published by our service. This would be a service level interaction similar to game servers. Hosting the model with a common API should not have this implication.

Hosting any service has, as expected, some risk. Game servers can have remote code exploits, and you can inadvertently deny yourself legitimate service if a Tor node or BitTorrent seed consumes your available network bandwidth.

Security implications for remote code exploits and network bandwidth constraints could be mitigated by something as simple as using containers for your inference servers.

Conclusion

Outside of your power bill, there doesn't seem to be any downside to hosting models. A University or other campus-like facility could (and most probably already do) serve models. Opening those endpoints up to the public could allow users without extensive hardware or the desire to subscribe to services like Anthropic to participate in using LLMs for coding and research.

My alignment to open source ethos says that a technology the many can benefit from should be available to the many rather than benefiting only the few.

In a later article we will discuss what it means to be a bad actor hosting the inference provider.