Job Description
Detecting attackers in real time requires robust data pipelines that enable machine learning and statistical techniques. As an intern on the Data Engineering team, you will help transform rich network traffic and cloud log data into meaningful features and develop data systems for collecting algorithm telemetry. You will build pipelines and tools for both on-prem and cloud deployments, collaborating with Data Scientists and Software Engineers throughout.
Responsibilities
Work with the Data Engineers on the team to improve existing features and develop new ones, enabling Data Scientists to access data in ways previously unavailable
Possible projects include:
Building a converter to the Parquet format, with the resulting tables cataloged using AWS Glue (see the sketch after this list)
Performing ETL on existing time-series data to restructure it into a more accessible format
Automating the piping of network captures through a process that converts them into metadata and loads it into Spark
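To give a flavor of the Parquet-conversion project, here is a minimal sketch, assuming PySpark is available; the S3 paths and the "timestamp" column are hypothetical placeholders, not part of any actual pipeline.

# A minimal sketch of converting capture metadata to Parquet, assuming
# PySpark. Paths and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("capture-to-parquet").getOrCreate()

# Read metadata already extracted from network captures (schema inferred
# here for brevity; a production job would declare it explicitly).
df = spark.read.csv("s3://example-bucket/capture-metadata/",
                    header=True, inferSchema=True)

# Partition by event date so downstream time-series queries can prune
# partitions instead of scanning everything.
(df.withColumn("event_date", F.to_date("timestamp"))
   .write.mode("append")
   .partitionBy("event_date")
   .parquet("s3://example-bucket/capture-metadata-parquet/"))

spark.stop()

An AWS Glue crawler pointed at the output path could then register the Parquet tables in the catalog, making them queryable from downstream tools.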
Qualifications
Required
Working towards a BS or MS in Computer Science or a related field
Strong programming skills with experience in Python, C++, or Java
Linux proficiency and shell scripting
Desirable
Experience with Docker, Kubernetes, or other containerization and orchestration tools
Experience working with AWS or GCP offerings
Experience with a source control system, preferably Git
Familiarity with Hadoop, MapReduce, Spark, and distributed computing
Understanding of data pipeline architectures (e.g. Lambda, Kappa)
Hands-on database experience (MySQL, MongoDB, CouchDB, Elasticsearch, etc.)
Knowledge of real-time data pipelines (e.g. Kafka and Spark Streaming)
Experience with continuous integration and deployment workflows