Challenging Big Data Internships
Big Industries is looking for great interns to run a challenging project that will test their capabilities across a range of compute disciplines.
We are seeking interns with a solid capacity for and interest in development and engineering. Example roles our consultants undertake include big data administrator, big data integration developer and application integration engineer.
Administrator: responsible for planning, deploying, securing, maintaining and troubleshooting compute clusters.
Data integration developer: responsible for data ingestion, extraction, transformation and enrichment pipelines, data wrangling, data warehouse design, data protection, data distribution and reporting.
Application integration engineer: responsible for big data applications development. Web API design and development, mobile device data integration, legacy system integration, realtime/online interactions, data visualization, front-end.
All these roles require great skills in Linux, bash shell, scripting (eg. Python), Java, SQL etc. Exposure to Scala, Spark and other grid compute programming frameworks (eg. Ignite, Flink) beneficial.
Consultants working for Big Industries gain a solid understanding of concepts like Messaging/Message queues, Datawarehousing, RDBMS, Information Security, distributed computing and computing architectures (eg. Cloud, N-Tier, Grid, BASE, CAP theory).
Our consultants are expected to build up skills on technologies like Apache Hadoop, NOSQL databases (eg. Cassandra, Hbase), ETL systems (eg. pentaho, informatica), and data visualization tools and libraries like Tableau, Cliqview, Zeppelin and D3.js.
Interns will gain exposure to Big Data technologies such as Hadoop, HDFS, YARN, Cassandra and Mesos.
We would like to build a demonstrator for our customers, showcasing a full stack web based service running on a cluster. The web based service will be a youtube-like service and will allow users to upload, play, remove, and search videos.
The web based service will use the following different components:
* A load balancer (nginx): this will dynamically load balance requests to instances of the wordpress site and the custom application based on the configuration held by the Zookeeper configuration service
* Web servers running a custom application (API): this will allow users to upload their videos to their account
* Web servers running a WordPress site: this will be the primary interface for the service
* A Mysql database: to support wordpress
* A Cassandra NOSQL database: to support the custom application
* A Ceph storage cluster: to store the videos
* A SolrCloud search engine: this will enable users to search for videos based on video metadata
* A Hive data warehouse: this will store search query logs and web access logs for later analysis
* A Zookeeper configuration service: this will act as a central configuration store and expose an API (Apache Curator) to enable components to discover and update their configuration.
The web based service will be deployed on a multi-node Cloudera Hadoop cluster, and the components will be running under the Mesos cluster management framework and Docker containers.
Interns will be expected to undertake the full assignment scope as a self-organizing team:
* Web development (custom application API)
* Cassandra NOSQL database development to support custom application API
* Deploy and configure eg. Ceph, Cassandra, WordPress, MySQL, custom application API as docker containers under Mesos
* Build custom tooling to support configuration discovery using Zookeeper and Curator
* Create automated deployment scripts using a tool like Ansible Playbooks
* Setup hive to ingest log data into the Hive data warehouse for querying
* Configure SolrCloud for the video search feature
* Customise the wordpress engine to integrate with the custom application API
Rob is a Hadoop and large-scale distributed computing evangelist. Solution Architect by trade, Rob is a managing partner at Big Industries - the premiere Hadoop & Big Data systems integrator for Belgium and Luxembourg.