Senior sys admin / site reliability engineer
Astera Institute
Berkeley, CA
About Astera
Jed McCaleb founded the Astera Institute as a non-profit dedicated to developing high-leverage technologies that can lead to massive returns for humanity.
About Obelisk
Obelisk is the Artificial General Intelligence (AGI) lab at Astera. Obelisk’s mission is to produce AGI in a safe, socially beneficial way. We are focusing on different problems and taking different approaches than some other AGI efforts. In particular, we are focused on the following problems:
How does an agent continuously adapt to a changing environment and incorporate new information?
In a complicated stochastic environment with sparse rewards, how does an agent associate rewards with the correct set of actions that led to those rewards?
How does higher-level planning arise?
What we're looking for
We’re looking for a system administrator / site reliability engineer (SRE) who will be in charge of the low-level systems that we use to do our machine learning research. We use a large number of GPUs to run experiments of various sizes. We need someone to make that infrastructure performant, reliable, efficient, and secure.
We’re currently using the following technologies, but as our first and only SRE, you would be free to change most of this:
- Bare-metal servers running Ubuntu, configured via Ansible.
- Some of our servers are on-prem; others are rented from a specialty provider of GPU servers.
- Clusters running Kubernetes, deployed via Ansible (Kubespray).
- We run various services including self-hosted GitHub runners.
- Our machine learning training uses Ray for multi-node jobs.
- Tailscale for VPN / secure access.
- Google Workspace for SSO.
Responsibilities
- Network administration: make it fast, easy, and secure for us to connect to our clusters.
- Kubernetes cluster management: make sure our clusters and all the workloads we run on them are reliable and easy to use.
- Information security: make sure everything we do is secure.
Qualifications
- 5 years of relevant experience in domains such as Linux server administration, networking, information security, or Kubernetes administration.
- Experience running a bare-metal Kubernetes cluster
- Deep knowledge of networking (TCP/IP, NAT, firewalls, VLANs)
- Familiarity with Tailscale
You will be required to be in the office in Berkeley, California at least once per week because we have our own hardware on premises. Beyond that, most work can be done remotely, but you must be available during normal Pacific business hours.
Why work here?
- Plenty of funding and computers.
- Trying to advance the state of the art in AI, which requires facing fascinating technical problems.
- Small focus. Other places are researching many problems simultaneously (e.g., DeepMind) or are doing research and building products (e.g., Anthropic). We are completely focused on a small set of problems.
- Small. This has benefits and disadvantages, but a huge advantage is less communication overhead and bureaucracy. This makes work faster and more fun.
- No outside funding means there’s no pressure to chase trends or make products.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology
Industries: Research Services