Carnegie Mellon University

Julian Uran

Machine Learning Research Engineer

Address
500 South Craig St
Pittsburgh, PA 15213

Research Engineer in the HPC AI and Big Data group, supporting the user community for PSC’s Bridges, Bridges-2, and Neocortex systems, and leading the development of COSMO, a REST API for exploring the BlueTides simulation data from the McWilliams Center for Cosmology.

Julian joined the Pittsburgh Supercomputing Center in 2019 after working as a Senior Technical Support and Customer Operations Engineer at Hortonworks and Cloudera Inc. Before that, he contributed to the University of Delaware Global Computing Lab, the University of Los Andes COMIT Research Group, and the Technological University of Pereira Sirius HPC Research Group, and worked as a software developer for top-performing companies in the Latin America and Caribbean region (VeriTran, BeMovil) after earning his Master's degree in Systems and Computer Engineering from Universidad de los Andes (University of the Andes) in 2012.

Interests: Virtualization, Web Development, Entrepreneurship, User Experience Optimization.

News

2023
  • Spearheaded the "Software Stack, Operations, and Security" slide at the Neocortex Panel Review, presented to the NSF. The concise presentation conveyed the technical strength and seamless service delivery of Neocortex, contributing to the project's successful review.
  • As part of the PSC Software Task Force and the Bridges-2 Installers team, participated in developing regression tests for software maintained by the AI and Big Data team. To ensure comprehensive coverage, the tests were designed to run both as Slurm batch jobs and within the ReFrame framework, combining Slurm's efficient job scheduling for large-scale testing with ReFrame's modular, reusable test components (a minimal sketch follows this year's list).
  • Started contributing to the PSC Instructional Videos initiative, producing video tutorials on topics that researchers commonly run into problems with.
  • Troubleshot a critical issue on the TRACE cluster at CMU, where half of the GPU nodes malfunctioned during job execution. Identified several contributing factors and, by addressing the following issues, restored normal functionality to the cluster for GPU-heavy tasks (see the launch sketch after this year's list):
    • Missing NVIDIA module: the kernel was not loading a crucial NVIDIA module, preventing full GPU utilization.
    • Singularity container issues: Singularity containers, used for job isolation, exhibited unexpected behavior.
    • Slurm job submission: launching jobs with "--gpus" instead of "--gres=gpu" caused incompatibility with MPI tasks.
    • Network interface conflicts: unused interfaces with identical IP addresses created communication problems, resolved by specifying the preferred interfaces (--mca btl_tcp_if_include ens3f0,trace-ib0).
    • InfiniBand card mismatch: different InfiniBand card types across nodes led to "Operation not supported" errors from ibv_modify_qp in NCCL code; this mismatch caused NCCL jobs to launch on the wrong network interfaces, resulting in failures.
  • Presented the Neocortex hands-on section of the Ohio Supercomputing Center's "AI Bootcamp and AI Accelerators", alongside other Category 2 systems, at a bootcamp created specifically for technically savvy attendees who typically support researchers in their technical work.
  • Attended the "Supercomputing 2023" conference in Denver, CO. Participated in the tutorials "Using Containers to Accelerate HPC", "Better Software for Reproducible Science", and "Efficient Distributed GPU Programming for Exascale".
  • Contributed to the publication "Deep Learning Benchmark Studies on an Advanced AI Engineering Testbed from the Open Compass Project", presented at the Practice and Experience in Advanced Research Computing (PEARC) conference.
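A minimal sketch of the dual Slurm/ReFrame regression-test approach mentioned above. The module name (tensorflow) and check script (tf_smoke.py) are hypothetical placeholders rather than the actual Bridges-2 tests; when ReFrame is configured with a Slurm scheduler it generates and submits the batch job itself, and the same check command can also be wrapped in a standalone sbatch script.

```python
# Hypothetical ReFrame regression test for a GPU software module.
# Module name, script, and resource amounts are placeholders.
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class TensorFlowSmokeTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']              # in practice, restrict to GPU partitions
    valid_prog_environs = ['*']
    modules = ['tensorflow']           # assumed module name
    executable = 'python'
    executable_opts = ['tf_smoke.py']  # assumed script printing "OK: <n> GPUs"
    num_tasks = 1
    num_gpus_per_node = 1
    time_limit = '10m'

    @sanity_function
    def assert_gpus_detected(self):
        # The run passes only if the script reports at least one GPU.
        return sn.assert_found(r'OK: \d+ GPUs', self.stdout)
```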
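And a hedged sketch of a corrected GPU job launch reflecting the TRACE fixes above: requesting GPUs with --gres=gpu instead of --gpus, and pinning Open MPI's TCP transport to the intended interfaces. Only those two flags come from the item above; the script name, node count, and GPU count are placeholders.

```python
# Sketch: submit an MPI/NCCL job with the corrected Slurm and Open MPI options.
import subprocess


def launch_nccl_job(script="train.py", nodes=2, gpus_per_node=8):
    cmd = [
        "sbatch",
        f"--nodes={nodes}",
        f"--ntasks-per-node={gpus_per_node}",  # one MPI rank per GPU
        f"--gres=gpu:{gpus_per_node}",         # GRES-style request works with MPI tasks
        "--wrap",
        # Restrict Open MPI's TCP transport to the intended interfaces so unused,
        # identically addressed interfaces do not break communication.
        "mpirun --mca btl_tcp_if_include ens3f0,trace-ib0 "
        f"python {script}",
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    launch_nccl_job()
```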
2022
2021
  • Implemented the Neocortex Documentation page using MkDocs, as an independent component of the Neocortex Portal.

  • Attended the "Neocortex Admin Training" by Cerebras, covering contents for System Administration and Operation on February 10, 2021.

  • Generated software modules for the most popular AI frameworks and libraries to be used on the Bridges-2 cluster.

  • Worked with the Bridges-2 Acceptance Testing team to make sure the cluster performed as expected when using the MXNet and TensorRT frameworks.

  • Led the Neocortex Superdome Flex system configuration: a multi-chassis machine (8 chassis total) configured as logical partitions, each partition with 16 CPUs, 100TB of multi-drive NVMe flash RAID storage, 12TB of RAM, 100GbE network interfaces, and 8 InfiniBand network interfaces, all expected to deliver their full aggregate performance when running jobs on the Neocortex system.

  • Started working as a member of the Bridges-2 Continuous Improvement Committee (CIC), focused on improving the experience researchers have when using the Bridges-2 cluster.

  • Co-authored the ICDAR 2021 conference paper "MiikeMineStamps: A Long-Tailed Dataset of Japanese Stamps via Active Learning", in the Document Analysis and Recognition category.

  • Co-authored the Practice and Experience in Advanced Research Computing 2021 (PEARC21) conference paper "System Integration of Neocortex, a Unique, Scalable AI Platform".

  • Part of the team that won the 2021 Mellon College of Science Outstanding Team Recognition Award, given to the Pittsburgh Supercomputing Center for deploying two world-class supercomputing resources, Bridges-2 and Neocortex, in the midst of the ongoing pandemic. Despite several challenges, delays, and hardships, the Bridges-2/Neocortex team persevered and, through extreme dedication and heroic efforts, fielded these two machines for the scientific research community.

  • Participated as a presenter in the Annual Neocortex NSF Operations Review Panel: Operations Performance for Neocortex Testbed Operations, Year 1.

2020
  • Implemented a navigation tool for the image labeling software "LabelMe", a critical piece of software that enabled research for an XSEDE ECSS project, which has already produced a paper at the International Conference on Document Analysis and Recognition 2021 (ICDAR 2021).

  • Helped the HuBMAP Consortium launch containers on demand via Slurm on the Bridges supercomputer, paving the way for the project to run its research workflows on multiple clusters (a rough sketch follows at the end of this year's list).

  • Performed a refresh of the first CALIMA software implementation for mining data from the Bridges supercomputer's Slurm job history, and started a whitepaper on the subject. The goal is to predict when jobs will start executing and what changes could get them running with shorter queue times (a minimal sketch follows at the end of this year's list).

  • Supported PSC’s critical role in the COVID-19 High Performance Computing Consortium by helping multiple groups with their COVID-19 research efforts: going above and beyond in providing technical support, suggesting optimizations for their job executions on a daily basis, and regularly spending late-night hours with the different teams to solve any issues encountered in these research projects of worldwide interest.

  • Awarded the Staff Recognition Rookie Award from the Mellon College of Science, given to staff members who have contributed greatly to the Pittsburgh Supercomputing Center, for outstanding contributions to the center and, more specifically, to the AI and Big Data group’s mission.

  • Won the Staff Recognition PSC COVID-19 Outstanding Team Achievement Award from the Mellon College of Science for the effort and results delivered while collaborating with research groups on crucial COVID-19 projects.

  • Implemented the first version of the Data Sharing Portal for the McWilliams Center for Cosmology, making a limited set of BlueTides Simulation files publicly available via a web interface; previously this had been done only to a limited degree, in a decentralized way, with a plain web server.

  • Migrated and revamped the AIBD website to the CMU Content Management System, including a Neocortex System section. This centralized the content on one manageable platform kept up to date by the CMU Technology team; the design effort applied to the old website had to be replaced because of the effort required to develop and adapt it to new requirements.

  • Applied to the McWilliams Center/PSC Seed Funding 2020 Program with COSMO, a REST API for Cosmology Data.
  • Earned the McWilliams/PSC Seed Grant 2020, with ~$20K in funds used to hire and lead two graduate interns in implementing a platform for sharing petabyte-scale simulations and datasets via multiple endpoints (a web portal, a RESTful API, and a Globus endpoint), so that users can explore the BlueTides Simulation data in the way that best suits their needs.

  • Won the Editors' Choice Award from HPCwire, a leading publication in the high-performance computing field, for "Best Use of High-Performance Data Analytics & Artificial Intelligence" during the virtual 2020 International Conference for High Performance Computing, Networking, Storage and Analysis (SC20), as part of the team led by Dr. Olexandr Isayev from CMU.

  • Nominated for an Andy Award in the Teamwork and Collaboration category as part of the PSC COVID-19 Rapid Response Team.

  • Implemented the Neocortex Portal webpage so that Neocortex users have access to news and documentation, providing a fluid and structured way to share information with them while keeping track of their project statuses and information. This involved integrating the MkDocs documentation system into the Portal so that content can be edited by the multiple Neocortex team members while remaining under version control.

  • Coordinated efforts for deploying and configuring the Neocortex system, a $5M project sponsored by the NSF, communicating constantly with teams across three companies (PSC, Cerebras Inc., HPE) to successfully deploy on site the equipment required for the specialized hardware to start operating.

  • Aided the Bridges to Bridges-2 migration by thoroughly documenting and running performance evaluation benchmarks in preparation for acceptance testing, so that the Bridges-2 supercomputer could be accepted and performance reviews and evaluator questions could proceed without problems.

  • Attended the tutorials section of the Hot Chips 2020 conference for general information on how to scale out machine learning using NVIDIA, Google, and Cerebras Inc. systems.

  • Attended NVIDIA's GPU Technology Conference, GTC 2020, for the latest developments in NVIDIA technology.

  • Attended the XSEDE HPC Workshop: MPI, on September 1-2, 2020, a basic MPI-focused training for developing code for HPC environments.

  • Attended the MIT Professional Education - Short Programs course "Designing Efficient Deep Learning Systems", a two-day course on how deep learning works and which systems are well suited for running DL workflows.

  • Assisted with a section of the “Hands-on Virtual Training - Getting Ready to Use the Neocortex System” training on December 8 and 9, 2020, an overview of how to run compilation and training workflows on the Neocortex system with the Cerebras CS-1 boxes.
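A rough sketch of the container-on-demand pattern from the HuBMAP item above: running a single workflow step inside a Singularity image on a compute node through Slurm. The image path, bind mount, and step command are hypothetical placeholders, not the actual HuBMAP pipeline.

```python
# Sketch: run one containerized workflow step on a compute node, on demand.
# Image path, bind mount, and step command are hypothetical placeholders.
import subprocess


def run_container_step(image="/containers/workflow.sif",
                       step_cmd=("run_step", "--input", "/data/sample")):
    cmd = [
        "srun", "--ntasks=1", "--time=01:00:00",
        "singularity", "exec",
        "--bind", "/data:/data",  # expose the dataset inside the container
        image, *step_cmd,
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_container_step()
```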
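And a minimal sketch of the kind of Slurm accounting query the CALIMA refresh builds on: pulling submit and start times from sacct and summarizing queue waits per partition. The date range, field list, and aggregation are illustrative only; the actual CALIMA implementation may differ.

```python
# Sketch: mine Slurm accounting data for per-partition queue-wait statistics.
# Date range and aggregation are illustrative only.
import subprocess
from collections import defaultdict
from datetime import datetime

TIME_FMT = "%Y-%m-%dT%H:%M:%S"


def queue_waits(start="2020-01-01", end="2020-12-31"):
    out = subprocess.run(
        ["sacct", "-a", "-X", "--noheader", "--parsable2",
         f"--starttime={start}", f"--endtime={end}",
         "--format=JobID,Partition,Submit,Start"],
        check=True, capture_output=True, text=True,
    ).stdout

    waits = defaultdict(list)
    for line in out.splitlines():
        _jobid, partition, submitted, started = line.split("|")
        if started in ("", "None", "Unknown"):
            continue  # job never started within this window
        delta = datetime.strptime(started, TIME_FMT) - datetime.strptime(submitted, TIME_FMT)
        waits[partition].append(delta.total_seconds())

    for partition, secs in sorted(waits.items()):
        print(f"{partition}: mean wait {sum(secs) / len(secs) / 3600:.2f} h "
              f"over {len(secs)} jobs")


if __name__ == "__main__":
    queue_waits()
```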