Sunday, February 23, 2014

Brief notes on popular data processing solutions

We are lucky that we are no longer confined to a handful of data management and processing solutions. More and more open source and closed source systems have proven dependable enough that companies like Facebook and Yahoo use them to maintain their leading market share. I've put together some short notes on a few of the popular systems.


Hadoop, a software framework, is designed to process large volumes of data (also known as Big Data) in a distributed computing environment. Fundamentally, Hadoop separates data storage from data processing: the data is stored in a distributed file system (the Hadoop Distributed File System, or HDFS), and the processing happens on clusters of commodity hardware that form the distributed computing environment. Hadoop is designed with scalability in mind, so it can scale from a single machine to thousands of machines. Each node in a Hadoop cluster has its own local storage and computing capacity, which greatly increases processing speed by keeping the data next to the processor. HDFS counters the risk of losing data when a node fails, and high availability of the processing nodes is handled in the software itself rather than by adding extra hardware. The other core modules of Hadoop are Hadoop YARN and Hadoop MapReduce. Hadoop YARN is the framework for job scheduling and cluster resource management. Hadoop MapReduce is a batch-based distributed computing framework that parallelizes work over a large volume of raw data; a job that could take days or longer using conventional serial computing techniques can often be reduced to minutes on a Hadoop cluster. It simplifies parallel computing by abstracting away the complexities of working with distributed systems, such as computational parallelization, work distribution, and dealing with unreliable hardware and software.
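
To make the MapReduce idea above more concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop or any of its APIs; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the grouping step only imitates what the framework does between the map and reduce stages on a real cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, imitating the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a cluster each line (or block of lines) could be mapped on a
    # different node; here everything runs serially in one process.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = ["big data needs big clusters", "data lives next to the processor"]
    print(word_count(splits))
```

Because each call to map_phase depends only on its own input line, and each call to reduce_phase depends only on one key, both stages can be spread across many machines, which is the property Hadoop exploits.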


Apache Cassandra is a NoSQL distributed database management system designed to handle large volumes of data with no Single Point Of Failure (SPOF). The heart of Cassandra is its data model, which is fundamentally different from that of a Relational Database Management System (RDBMS). In Cassandra, data is structured in a column family, which is analogous to an RDBMS table, but unlike an RDBMS table it doesn’t need to have the same number and type of columns in every row. Each column in a column family has three attributes: column name, column value and timestamp. A multi-dimensional approach to designing a Cassandra data model is to use what is called a Super Column within a column family. The main advantage of Cassandra is that it replicates data transparently across nodes, and even across data centers, without compromising fault tolerance. Because Cassandra doesn’t use a master-slave model to achieve location transparency and a distributed data architecture, it avoids the bottleneck of a single master node. To achieve that level of distributed data availability, some data access features, such as joining rows, weren’t included. There’s also a tradeoff to be made between consistency and availability, but it is configurable (Cassandra calls this “Tunable Consistency”).
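
The column family model and the consistency tradeoff described above can be sketched in a few lines of plain Python. This is only an illustration, not Cassandra's API or storage format: the Column class, the write helper and reads_see_latest_write are hypothetical names, the column family is modeled as a nested dictionary, and the overlap rule shown (read replicas + write replicas > replication factor) is the usual rule of thumb for getting consistent reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # write time; the newest write wins on conflict


def write(column_family, row_key, column):
    """Store a column under a row key, keeping the newest timestamp."""
    row = column_family.setdefault(row_key, {})
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


# Rows in the same column family may carry different sets of columns.
users = {}
write(users, "alice", Column("email", "alice@example.com", 1))
write(users, "alice", Column("city", "Boston", 2))
write(users, "bob", Column("email", "bob@example.com", 3))
write(users, "alice", Column("email", "alice@new.example.com", 5))
write(users, "alice", Column("email", "stale@example.com", 4))  # older, ignored

if __name__ == "__main__":
    print(sorted(users["alice"]), sorted(users["bob"]))
    print(users["alice"]["email"].value)
    print(reads_see_latest_write(3, 2, 2), reads_see_latest_write(3, 1, 1))
```

With three replicas, writing to two and reading from two guarantees an overlap, while writing to one and reading from one trades that guarantee for lower latency and higher availability.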


Essbase, whose name comes from Extended Spreadsheet Database, is an object oriented, multidimensional database management system designed to give users OLAP (Online Analytical Processing) capabilities to model, analyze, and interpret complex business scenarios. As an OLAP tool, Essbase supports slicing and dicing of very large volumes of data organized into cross-sectional groups of hierarchical dimensions. It is optimized for OLAP workloads, as opposed to OLTP (Online Transactional Processing) workloads like those served by Oracle RDBMS.
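
As a rough illustration of what slicing a multidimensional cube means, the sketch below models a tiny cube as a Python dictionary keyed by one member from each dimension. It has nothing to do with Essbase's actual storage or query interface; the dimension names, the sample figures and the helpers slice_cube and roll_up are all invented for the example.

```python
from collections import defaultdict

DIMENSIONS = ("year", "region", "product")

# Each cell of the cube is addressed by one member from every dimension.
cube = {
    ("2013", "East", "Laptop"): 120,
    ("2013", "East", "Phone"): 80,
    ("2013", "West", "Laptop"): 100,
    ("2014", "East", "Laptop"): 150,
    ("2014", "West", "Phone"): 90,
}


def slice_cube(cells, **fixed):
    """Keep only the cells whose members match the fixed dimension values."""
    positions = {DIMENSIONS.index(dim): member for dim, member in fixed.items()}
    return {
        key: value
        for key, value in cells.items()
        if all(key[pos] == member for pos, member in positions.items())
    }


def roll_up(cells, dimension):
    """Sum cell values per member of one dimension, collapsing the others."""
    pos = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[key[pos]] += value
    return dict(totals)


if __name__ == "__main__":
    print(roll_up(cube, "region"))
    print(roll_up(slice_cube(cube, year="2013", product="Laptop"), "region"))
```

Fixing one or more dimensions narrows the cube to a sub-cube, and rolling up along a remaining dimension gives the kind of aggregate view an analyst would read off a report.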


Microsoft SQL Server is the second most popular Relational Database Management System (RDBMS) after Oracle RDBMS. It provides all the features of a relational database, such as tables, views, table joins, query processing, and transactional updates. SQL Server supports both T-SQL (Transact-SQL) and ANSI SQL as query languages. At the time of writing, one of the drawbacks of SQL Server is that it can only be deployed on the Microsoft Windows operating system. In addition to typical RDBMS features, it also provides a replication service, a notification service, a reporting service, and so on. As a Microsoft product, it integrates tightly with the .NET programming environment through SQL CLR (Common Language Runtime) and is easily accessible from the Microsoft Visual Studio development environment.
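
The relational features listed above (tables, views, joins, transactional updates) look much the same in any RDBMS. The sketch below shows them using Python's built-in sqlite3 module with an in-memory SQLite database, purely as a stand-in: it is not SQL Server and does not use T-SQL, and the table, view and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    -- A view that joins the two tables and aggregates per customer.
    CREATE VIEW customer_totals AS
        SELECT c.name AS name, SUM(o.amount) AS total
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name;
""")

# Used as a context manager, the connection commits on success
# and rolls back if an exception is raised inside the block.
with conn:
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Lin')")
    conn.execute(
        "INSERT INTO orders (customer_id, amount) "
        "VALUES (1, 40.0), (1, 60.0), (2, 25.0)"
    )

try:
    with conn:
        conn.execute("UPDATE orders SET amount = amount * 2 WHERE customer_id = 1")
        # This statement violates NOT NULL and aborts the transaction.
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, NULL)")
except sqlite3.IntegrityError:
    pass  # the whole transaction, including the UPDATE, was rolled back

totals = dict(conn.execute("SELECT name, total FROM customer_totals ORDER BY name"))

if __name__ == "__main__":
    print(totals)
```

The failed second transaction leaves the order amounts untouched, which is the all-or-nothing behavior a transactional update is meant to provide.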


MySQL is one of the most popular RDBMSs. It is open source and managed by Oracle. It has most of the features of other RDBMSs such as Oracle RDBMS and MS SQL Server, but it’s mostly used in smaller-scale environments (with a few exceptions) and is very popular in the open source community. Some of the characteristics of MySQL are:

- It’s a relational database and provides both transactional and non-transactional storage engines

- It supports the ANSI SQL query language as well as procedural programming using stored procedures and functions

- It’s open source

- It’s very fast, reliable and scalable

- It can work in both client/server and embedded modes


Oracle is an Object-Relational Database Management System, which adds a flavor of object orientation to the relational database realm. It is the most popular and widely used RDBMS. Internally, Oracle RDBMS is divided into the Oracle instance and the database. The Oracle instance, identified by a System Identifier (SID), consists of memory structures in RAM and server processes. The database consists of the data files stored on the operating system. A number of background processes work in parallel to provide very fast access to the data and to modify it. Oracle Database comes with many added features that make it the number one RDBMS. Some of them are:

- It provides Real Application Clusters (RAC) to maintain availability when an Oracle instance fails

- It uses Automatic Storage Management (ASM) and Automatic Memory Management (AMM) to simplify the management of data and performance

- It has a proprietary backup and recovery tool (RMAN) for simplified backup and restoration of data in case of data loss or corruption

- It supports ANSI SQL (also dubbed Oracle Query) and PL/SQL for procedural programming
