DeepScale, Inc. is a fast-growing start-up in the Advanced Driving space, providing perception systems for Advanced Driver Assistance Systems and Autonomous Vehicles. DeepScale uses deep learning to build accurate and efficient perception systems that enable cars to “see”. Our software takes input from sensors and produces an environmental model of the real world. Our prior work has produced neural nets that maintain state-of-the-art accuracy but are up to 500x smaller than other nets designed for the same task. Our team includes thought leaders and experienced practitioners in computer vision, AI-powered 3D reconstruction, and deploying small neural nets in embedded applications.
As a Senior Systems Engineer for DeepScale, your role will be to architect, build, and maintain the core infrastructure that enables DeepScale to:
- Collect, process, and annotate data,
- Design and train models for perception-on-vehicle tasks, and
- Deploy those models to embedded platforms.
High Level Responsibilities
- Work with Data, Models, and Deployment Engineers to understand their needs.
- Plan for growth/scaling in all areas: compute, storage, networking, monitoring, services.
- Find, evaluate, and deploy new technologies as appropriate.
- Use monitoring and automation to improve robustness, efficiency, and incident recovery time.
- Balance fixing existing problems, servicing requests for new features, and exploring new tech.
Concrete Core Responsibilities
- Manage the local GPU cluster and related services
- Maintain engineering network infrastructure, including VPN and internet access
- Data management and backups
- Manage both on-prem/datacenter hardware and cloud resources
Key Infra Components (in parentheses: technologies that are in use now, of interest, or decoys -- can you guess which ones we actually use?)
- Configuration Management (Ansible, PXE, Chef, Puppet, ssh)
- GPU Cluster (Slurm, Bright)
- Storage/Backup (ZFS, NFS, Lustre, Weka.io, B2)
- Monitoring (Python, Prometheus, influxdb, dogratian, grafana, nagios, ganglia)
- Networking (40GbE, RoCE)
- Cloud/Containers (AWS, Docker, LXD, schroot, K8s)
- Internal Services (LDAP, pypi, gitlab, jenkins, nginx, sshd)
- Security/Access (WPA2, hashicorp vault, VLANs, VPN)
Qualifications
- Reasonable proficiency in at least one programming language. Ability and willingness to work (or learn to work) in Python for automation/IaC/monitoring tasks.
- Familiarity with a solid variety of core/critical technologies, particularly from among those listed above, in the areas of storage, GPU servers, core infra services, automation / configuration management, and monitoring.
- Proficiency with source control (git), continuous integration, and testing methods
- Strong knowledge of Linux systems and internals (Debian preferred) with a good understanding of networking and related protocols, OS customization, and package management (APT)
- Hands-on datacenter hardware spec/buy/build experience
- Deep experience with cloud infra and/or containers
- Deep experience with configuration management and/or automated cluster bring-up
- Experience with cluster management software
- Specific expertise with GPU servers
- Have used or developed metrics/analytics tools for cluster/storage/network usage
- Experience with Slurm or similar job systems
- Minimum BS in CS, Engineering, or similar.
- 4 years of work experience in a related field