CompileCrew

Clue Digital and CompileCrew:
Highly available and scalable data science infrastructure

Clue Digital, an ad-tech platform that helps marketers interpret their ad data, asked CompileCrew to build a highly available, scalable JupyterHub setup so that its users could run, scale, and share their data science workloads efficiently. The setup is built on Dask Hub, which combines JupyterHub with Dask to provide a scalable environment for parallel computing, and it gives Clue's users shared notebook access for collaborative data science. We also integrated HashiCorp Vault to manage secrets, encryption keys, and other sensitive data, and used Jenkins and SonarQube for continuous integration and code assessment.

Tech Stack

  • Cloud Platform: AWS
  • Infrastructure as Code: Terraform
  • Container Orchestration: Kubernetes
  • Parallel Computing: Dask Hub
  • Interactive Computing: JupyterLab
  • Secret Management: HashiCorp Vault
  • Continuous Integration: Jenkins
  • Code Quality and Security Analysis: SonarQube
  • Container Security: Trivy
  • Containerization: Docker

Infrastructure

  1. Automation with Terraform:
    • The entire ClueDev setup is provisioned with Terraform, which makes it highly configurable and easy to deploy.
  2. High-availability Architecture:
    • The setup spans multiple AWS Availability Zones (AZs) for high availability, and users reach it securely through a VPN.
    • The Dask Hub setup is deployed on AWS EKS, ensuring a highly scalable architecture.
      • Core Nodes: Host the JupyterLab environment.
      • User-Spot Nodes: Host Jupyter Notebook workspaces for users.
      • Worker-Spot Nodes: Deploy Dask workers to handle high-load computational tasks.
    • The user and worker nodes run on spot instances and autoscale to zero to keep costs down. Spot instances use spare AWS capacity at a lower price than on-demand instances, in exchange for the risk of interruption, while scaling to zero terminates idle nodes so that unused capacity is not paid for.
  3. Secure Access with Vault:
    • A high-availability Vault cluster runs on AWS EC2 instances to securely manage secrets and encryption keys. The cluster scales through AWS autoscaling, sits in private subnets, and is accessible through a VPN connection.
  4. Code Management and Deployment:
    • Notebooks are managed through a GitHub repository, with code assessments conducted using Jenkins and SonarQube, ensuring code quality and security.
    • Docker images deployed to EKS are scanned for vulnerabilities using Trivy.
  5. JupyterLab Customization:
    • The JupyterLab base image, a custom Docker image forked from the Pangeo project, is built and deployed to DockerHub via Jenkins. Trivy scans the images for vulnerabilities to ensure security.
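
The scale-to-zero behaviour described for the user and worker spot nodes can be illustrated with a small decision function. This is a simplified sketch and not the logic of the actual Kubernetes autoscaler: the NodeGroup fields, the idle timeout, and the pods-per-node figure are all hypothetical values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeGroup:
    """Hypothetical description of one autoscaled node group."""
    name: str
    pods_per_node: int    # assumed number of pods that fit on one node
    max_nodes: int        # assumed upper bound for the group
    idle_timeout_s: int   # assumed time an empty group may stay up


def desired_nodes(group: NodeGroup, pending_pods: int, running_pods: int,
                  current_nodes: int, idle_seconds: int) -> int:
    """Return the node count a simple scale-to-zero policy would request."""
    total_pods = pending_pods + running_pods
    if total_pods == 0:
        # No work: keep the nodes until the idle timeout expires, then drop to zero.
        return current_nodes if idle_seconds < group.idle_timeout_s else 0
    # Otherwise size the group to fit every pod, capped at the group maximum.
    needed = math.ceil(total_pods / group.pods_per_node)
    return min(needed, group.max_nodes)


workers = NodeGroup("worker-spot", pods_per_node=4, max_nodes=10, idle_timeout_s=600)

# Nine Dask workers waiting on an empty group: scale up from zero.
print(desired_nodes(workers, pending_pods=9, running_pods=0,
                    current_nodes=0, idle_seconds=0))    # prints 3

# Group idle for 15 minutes with no pods: scale back to zero.
print(desired_nodes(workers, pending_pods=0, running_pods=0,
                    current_nodes=3, idle_seconds=900))  # prints 0
```

A real autoscaler also accounts for pod resource requests and for whether running pods can be moved before a node is removed, which this sketch leaves out.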
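
The Trivy scanning step for Docker images could be turned into a pipeline gate along the following lines. This is a minimal sketch, assuming the scan was run with JSON output (for example `trivy image --format json --output report.json <image>`) and that the build should fail on HIGH or CRITICAL findings. That blocking policy, the sample report, and its placeholder vulnerability IDs are assumptions for illustration, not details from the project.

```python
import json

# Assumed policy: these severities block the build.
BLOCKING = {"HIGH", "CRITICAL"}


def blocking_findings(report: dict, blocking=BLOCKING) -> list:
    """Collect (target, vulnerability id, severity) for blocking findings."""
    findings = []
    # "Results" and "Vulnerabilities" may be missing or null in a clean report.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                findings.append((result.get("Target"),
                                 vuln.get("VulnerabilityID"),
                                 vuln.get("Severity")))
    return findings


# Illustrative report fragment; the IDs are placeholders, not real CVEs.
sample_report = json.loads("""
{
  "Results": [
    {
      "Target": "example-image (ubuntu)",
      "Vulnerabilities": [
        {"VulnerabilityID": "CVE-0000-0001", "PkgName": "libfoo", "Severity": "LOW"},
        {"VulnerabilityID": "CVE-0000-0002", "PkgName": "libbar", "Severity": "CRITICAL"}
      ]
    },
    {"Target": "requirements.txt", "Vulnerabilities": null}
  ]
}
""")

found = blocking_findings(sample_report)
for target, vuln_id, severity in found:
    print(f"{severity}: {vuln_id} in {target}")

# A CI job would typically exit non-zero here so the pipeline stops.
exit_code = 1 if found else 0
print("exit code:", exit_code)
```

In a Jenkins stage, the same check would read the report file produced by the scan and end the process with this exit code.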

Conclusion

The ClueDev project offers a robust solution for creating and managing multi-user Jupyter notebook environments. By utilizing ClueDev, teams, classes, or organizations can establish shared computing environments to collaborate on data analysis, visualization, and machine learning projects. This setup not only enhances productivity and collaboration but also ensures scalability, security, and cost-efficiency.
