Excellence in Big Data
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the stack itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
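The idea of handling failures in software rather than hardware can be illustrated with a minimal sketch. It is not Hadoop code: the class TinyCluster, the node names and the replication factor of 3 are all assumptions made for illustration. Each block is copied to several nodes, and a read falls back to any replica that is still alive.

```python
import itertools


class TinyCluster:
    """Toy stand-in for a cluster that replicates blocks across nodes."""

    def __init__(self, node_names, replication=3):
        if replication > len(node_names):
            raise ValueError("replication factor exceeds node count")
        # node name -> {block_id: data}
        self.nodes = {name: {} for name in node_names}
        self.alive = set(node_names)
        self.replication = replication
        self._next_node = itertools.cycle(node_names)

    def put(self, block_id, data):
        """Store one block on `replication` distinct nodes, round-robin."""
        targets = []
        while len(targets) < self.replication:
            node = next(self._next_node)
            if node not in targets:
                targets.append(node)
        for node in targets:
            self.nodes[node][block_id] = data
        return targets

    def fail(self, node):
        """Simulate a machine failure."""
        self.alive.discard(node)

    def get(self, block_id):
        """Read from the first live node that holds the block."""
        for node, blocks in self.nodes.items():
            if node in self.alive and block_id in blocks:
                return blocks[block_id]
        raise IOError(f"all replicas of {block_id!r} are unavailable")


cluster = TinyCluster(["node1", "node2", "node3", "node4"], replication=3)
holders = cluster.put("block-0", b"some bytes")
cluster.fail(holders[0])
cluster.fail(holders[1])
recovered = cluster.get("block-0")  # served by the remaining replica
```

Two of the three machines holding the block are lost here, yet the read still succeeds. Only when every replica is gone does the sketch raise an error.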
Apache Spark is a fast, general-purpose cluster computing system that integrates with Apache Hadoop and the Hadoop YARN cluster resource management system. It supports a rich set of high-level tools, including Spark SQL for SQL-based structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for near-real-time data stream processing.
Solr is a blazing-fast, open-source enterprise search platform. Its features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.
Apache Solr can be set up as a cluster of Solr servers that combines fault tolerance and high availability, integrated with the Apache Hadoop cluster compute platform. This mode, called SolrCloud, provides distributed indexing and search.
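Faceted search, one of the Solr features listed above, means returning the matching documents together with a count of matches per category. The sketch below shows the concept in plain Python. It does not use Solr or its API, and the documents, field names and the function search_with_facets are hypothetical.

```python
from collections import Counter

# Hypothetical documents; in Solr these would live in an index.
docs = [
    {"title": "Hadoop overview", "format": "pdf", "year": 2014},
    {"title": "Hadoop tuning notes", "format": "word", "year": 2015},
    {"title": "Cassandra overview", "format": "pdf", "year": 2015},
]


def search_with_facets(docs, term, facet_field):
    """Return the matching docs plus a count of matches per facet value."""
    hits = [d for d in docs if term.lower() in d["title"].lower()]
    facets = Counter(d[facet_field] for d in hits)
    return hits, dict(facets)


hits, facets = search_with_facets(docs, "hadoop", "format")
print(len(hits), facets)
```

A search interface would typically show the facet counts as clickable filters next to the result list.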
The Apache-licensed Impala project brings interactive querying and massively parallel data warehouse technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software.
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python’s simple, easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.
R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and macOS.
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.
Apache Kafka is a free and open source, parallel message queuing infrastructure which integrates with Hadoop.
Kafka enables organisations to deploy a Kafka cluster as their central messaging backbone, a flexible alternative to traditional enterprise messaging solutions.
Kafka offers elastic and transparent scalability without downtime: message queues are partitioned and spread across a cluster of servers, so they can scale beyond the capacity of any single computer.
Messages are reliably persisted on disk and replicated within the cluster to prevent data loss, and each message broker can handle terabytes of messages with consistent performance.
Kafka boasts a modern, cluster-centric design that offers strong durability and fault-tolerance guarantees.
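The partitioning described above can be sketched in a few lines. This is a conceptual illustration, not the Kafka client API. The names TinyLog and partition_for, the partition count of 4 and the use of CRC32 are assumptions, and Kafka's own partitioner uses a different hash function. A stable hash of the message key selects a partition, so messages with the same key always land in the same partition and keep their order there.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a message key to a partition using a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class TinyLog:
    """Toy partitioned log: each partition is an append-only list."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)

    def append(self, key, value):
        """Append a message and return its (partition, offset)."""
        partition = partition_for(key, self.num_partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset


log = TinyLog()
first = log.append("user-42", "login")
second = log.append("user-42", "logout")
# Same key, same partition; the offset grows by one per message.
print(first, second)
```

In a real deployment each partition would live on a different broker and be replicated to others. The sketch keeps everything in one process to show only the routing step.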
HBase is a NoSQL database. “NoSQL” is a general term meaning the database doesn’t use a SQL query engine as its primary access mechanism.
HBase is a highly distributed database storage engine founded on Apache Hadoop. While it lacks many of the features you may find in an RDBMS, HBase has many features that support linear, modular scaling. HBase capacity is expanded by adding commodity-class servers. If an HBase cluster is expanded from 10 to 20 servers, for example, both its storage capacity and its processing capacity double.
A traditional RDBMS can scale well, but only up to the size of a single server, and often requires vendor-certified, specialized hardware and storage devices.
HBase has delivered horizontal database scalability in deployments worldwide on clusters of up to hundreds of servers, with no requirement for specialised server architecture, storage arrays, or Fibre Channel.
Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.
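As a sketch of what "schema-free JSON documents over HTTP" looks like in practice, the snippet below builds (but does not send) a request that would index one document. The host http://localhost:9200, the index name articles, the document fields and the helper build_index_request are assumptions. The /_doc/ path follows the document API of recent Elasticsearch versions, and older versions used a different URL layout.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_id, document):
    """Build a PUT request that would store `document` under index/doc_id."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# No schema is declared up front; the document is just a JSON object.
request = build_index_request(
    "http://localhost:9200",  # assumed local node
    "articles",               # hypothetical index name
    "1",
    {"title": "Excellence in Big Data", "tags": ["hadoop", "search"]},
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it, which requires a running Elasticsearch node at the assumed address.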
Apache Hive™ is a batch-oriented data warehousing runtime which simplifies ETL (extract, transform, load) processes on large datasets stored in a Hadoop cluster. Hive offers facilities to apply structure to this data and query it using a SQL dialect called HiveQL.
Titan is a graph database optimized for storing and querying graph datasets containing up to hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.
Titan is designed to support the processing of graphs so large that they require storage and computational capacities beyond what a single machine can provide. To this end, Titan integrates with the Apache HBase and Apache Cassandra NoSQL database storage engines for data persistence.
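To give a feel for what a graph traversal is, the following sketch runs a breadth-first traversal over a tiny in-memory graph. It is plain Python rather than Titan's own query interface, and the graph data and the function within_hops are hypothetical. A graph database answers the same kind of question over graphs far too large for one machine.

```python
from collections import deque

# Hypothetical "knows" graph stored as an adjacency list.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: hop distance of every vertex reachable
    from `start` in at most `max_hops` edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(vertex, []):
            if neighbour not in distance:
                distance[neighbour] = distance[vertex] + 1
                queue.append(neighbour)
    return distance


reachable = within_hops(graph, "alice", 2)
print(reachable)
```

With a limit of two hops, erin is not reached, because she is three edges away from alice.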