Astera Institute

Senior System Administrator / Site Reliability Engineer (SRE)

Astera Institute Berkeley, CA

About Astera

Jed McCaleb founded the Astera Institute as a non-profit dedicated to developing high-leverage technologies that can lead to massive returns for humanity.

About Obelisk

Obelisk is the Artificial General Intelligence (AGI) lab at Astera. Obelisk’s mission is to produce AGI in a safe, socially beneficial way. We are focusing on different problems, and taking different approaches, than some other AGI efforts. In particular, we are focused on the following problems:

  • How does an agent continuously adapt to a changing environment and incorporate new information?
  • In a complicated stochastic environment with sparse rewards, how does an agent associate rewards with the correct set of actions that led to those rewards?
  • How does higher level planning arise?

What we're looking for

We’re looking for a system administrator / site reliability engineer (SRE) who will be in charge of the low-level systems that we use to do our machine learning research. We use a large number of GPUs to run experiments of various sizes. We need someone to make that infrastructure performant, reliable, efficient, and secure.

We’re currently using the following technologies, but as our first and only SRE, you would be free to change most of this:

  • Bare-metal servers running Ubuntu, configured via Ansible.
  • Some of our servers are on-prem, some are rented from a specialty provider of GPU servers.
  • Clusters running Kubernetes, deployed via Ansible (Kubespray).
  • We run various services including self-hosted GitHub runners.
  • Our machine learning training uses Ray for multi-node jobs.
  • Tailscale for VPN / secure access.
  • Google Workspace for SSO.

Your Responsibilities:

  • Network administration: make it fast, easy, and secure for us to connect to our clusters.
  • Kubernetes cluster management: make sure our clusters and all the workloads we run on them are reliable and easy to use.
  • Information security: make sure everything we do is secure.

Basic Qualifications:

  • 5 years of relevant experience in domains such as Linux server administration, networking, information security, or Kubernetes administration.

Preferred Qualifications:

  • Experience running a bare-metal Kubernetes cluster
  • Deep knowledge of networking (TCP/IP, NAT, firewalls, VLANs)
  • Familiarity with Tailscale

Location

You will be required to be in the office in Berkeley, California at least once per week because we have our own hardware on premises. Beyond that, most work can be done remotely, but you must be available during normal Pacific business hours.

Why work here?

  • Plenty of funding and computers.
  • Trying to advance the state of the art in AI, which requires facing fascinating technical problems.
  • Small focus. Other places (e.g., DeepMind) are doing research into lots of problems simultaneously, or are doing research and building products (e.g., Anthropic). We are completely focused on a small set of problems.
  • Small. This has benefits and disadvantages, but a huge advantage is less communication overhead and bureaucracy. This makes work faster and more fun.
  • No outside funding means there’s no pressure to chase trends or make products.

Compensation Range: $150K - $300K
  • Seniority level: Mid-Senior level
  • Employment type: Full-time
  • Job function: Information Technology
  • Industries: Research Services
