
Sunday, February 23, 2014

Brief notes on popular data processing solutions

We're lucky that we no longer have to confine ourselves to a handful of data management and processing solutions: more and more open source and closed source software and systems have proven so dependable that companies like Facebook and Yahoo use them to maintain their leading market share. I've put together short notes on some of the popular systems.


Hadoop

Hadoop, a software library framework, is designed to process large volumes of data (also known as Big Data) in a distributed computing environment. Fundamentally, Hadoop decouples the data from its processing: the data is stored in a distributed file system (the Hadoop Distributed File System, or HDFS), and the processing happens on clusters of commodity hardware that form the distributed computing environment. Hadoop is designed with scalability in mind, so it can scale from a single machine to thousands of machines. Each processing node in a Hadoop cluster has its own local storage and computing capacity, which greatly increases processing speed by keeping data next to the processor. The risk of losing data when a processing node in the cluster fails is countered by HDFS, while high availability of the processing nodes is handled in the software itself rather than by adding an extra stack of hardware. The other core modules of Hadoop are Hadoop YARN and Hadoop MapReduce. Hadoop YARN is the framework for job scheduling and cluster resource management, whereas Hadoop MapReduce is a batch-oriented distributed computing framework that parallelizes work over large volumes of raw data: jobs that could take days or longer using conventional serial computing techniques can be reduced to minutes on a Hadoop cluster. It simplifies parallel computing by abstracting away the complexities involved in working with distributed systems, such as computational parallelization, work distribution, and dealing with unreliable hardware and software.
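To make the MapReduce model concrete, here is a minimal, purely local word-count sketch in Python. It only simulates the map, shuffle, and reduce phases in a single process; on a real Hadoop cluster these would run as distributed tasks (for example via Hadoop Streaming), so the data and function names below are illustrative only.

```python
# A minimal, local simulation of the MapReduce word-count pattern.
# On a real Hadoop cluster the map and reduce functions would run as
# distributed tasks; here they run in-process to illustrate the model.
from collections import defaultdict

def map_phase(lines):
    # map: emit (word, 1) for every word
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    # reduce: sum the counts for each word
    for word, counts in grouped:
        yield (word, sum(counts))

if __name__ == "__main__":
    data = ["the quick brown fox", "the lazy dog", "the fox"]
    for word, total in sorted(reduce_phase(shuffle(map_phase(data)))):
        print(word, total)
```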

Cassandra

Apache Cassandra is a NoSQL distributed database management system designed to handle large volumes of data with no Single Point Of Failure (SPOF). The heart of Cassandra is its data model, which is fundamentally different from that of a Relational Database Management System. In Cassandra, data is structured in a column family, which is analogous to an RDBMS table, but unlike an RDBMS table it doesn't need to have the same number and type of columns for every row. Each column in a column family has three attributes: column name, column value, and timestamp. A multi-dimensional approach to designing a Cassandra data model is to use what is called a Super Column within a column family. The main advantage of Cassandra is that it can replicate data transparently across nodes, and even across data centers, without compromising fault tolerance. Because Cassandra doesn't use a master-slave model to achieve location transparency and a distributed data architecture, it is free from that bottleneck. To achieve this level of distributed data availability, some data access features such as joining rows were left out. There are also tradeoffs to be made between consistency and availability, but those are all configurable ("tunable consistency", as Cassandra coined the term).
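As a rough sketch of how tunable consistency surfaces in practice, here is an example using the DataStax Python driver (cassandra-driver). The contact point, keyspace, and table are placeholders assumed purely for illustration.

```python
# Sketch using the DataStax Python driver (cassandra-driver); the contact
# point, keyspace, and table names are assumptions for illustration only.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])            # one or more contact points
session = cluster.connect("demo_keyspace")

# "Tunable consistency": choose how many replicas must acknowledge the write
insert = SimpleStatement(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, ("u42", "Alice"))

# Reads can use a different (weaker or stronger) consistency level
select = SimpleStatement(
    "SELECT name FROM users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
print(session.execute(select, ("u42",)).one())
```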

Essbase

Essbase, whose name originated from Extended Spreadsheet Database, is an object-oriented, multidimensional database management system specially designed to provide OLAP (Online Analytical Processing) capabilities so that users can model, analyze, and interpret the most complex business scenarios. As an OLAP tool, Essbase supports slicing and dicing of very large volumes of data organized into cross-sectional groups of hierarchical dimensions. It is optimized to support OLAP capabilities, as opposed to OLTP (Online Transaction Processing) systems like Oracle RDBMS.
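Essbase itself is accessed through its own tools and APIs, but the slicing and dicing idea can be sketched generically. The snippet below uses pandas on made-up sales data purely to illustrate cutting a measure across hierarchical dimensions; it is not Essbase code.

```python
# Conceptual sketch of OLAP-style slicing and dicing with pandas;
# the data and dimension names are made up and this is not the Essbase API.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2013, 2013, 2013, 2014, 2014, 2014],
    "region":  ["East", "West", "East", "East", "West", "West"],
    "product": ["Cola", "Cola", "Juice", "Cola", "Juice", "Juice"],
    "amount":  [100, 150, 80, 120, 90, 110],
})

# "Dice": aggregate the measure across two dimensions at once
cube = sales.pivot_table(values="amount", index="region",
                         columns="year", aggfunc="sum")
print(cube)

# "Slice": fix one dimension (year = 2014) and look at the rest
print(sales[sales["year"] == 2014]
      .groupby(["region", "product"])["amount"].sum())
```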

SQLServer

Microsoft SQL Server is the second most popular Relational Database Management System (RDBMS) after Oracle RDBMS. It provides all the features of a relational database, such as tables, views, table joins, query processing, transactional updates, and so on. SQL Server supports both T-SQL (Transact-SQL) and ANSI SQL as query languages. One of the drawbacks of MS SQL Server is that it can only be deployed in a Microsoft Windows operating system environment. In addition to typical RDBMS features, it also provides replication, notification, and reporting services. As a Microsoft product, it bonds strongly with the .NET programming environment through SQL CLR (Common Language Runtime) and is easily accessible from the Microsoft Visual Studio development environment.
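A minimal sketch of querying SQL Server from Python through ODBC follows; the server name, database, and table are assumptions made only for illustration. The TOP clause shows the T-SQL flavor, as opposed to the ANSI SQL FETCH FIRST syntax.

```python
# Minimal sketch of querying SQL Server via pyodbc; the server, database,
# and table names are assumptions for illustration only.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=DemoDB;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# TOP is a T-SQL extension; the ANSI SQL equivalent would use FETCH FIRST
cursor.execute(
    "SELECT TOP 5 OrderID, OrderDate FROM dbo.Orders ORDER BY OrderDate DESC"
)
for row in cursor.fetchall():
    print(row.OrderID, row.OrderDate)

conn.close()
```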

MySQL

MySQL is one of the most popular RDBMSs; it is open source and managed by Oracle. It has most of the features of other RDBMSs such as Oracle RDBMS and MS SQL Server, but it's mostly used in smaller-scale environments (with a few exceptions) and is very popular in the open source community. Some of the characteristics of MySQL are listed below, with a brief sketch after the list:

-     It’s a Relational database and provides both transactional and non-transactional storage engines

-          It supports ANSI SQL query language and as well as procedural programming using Stored Procedure and Functions

-          It’s open source

-          It’s very fast, reliable and scalable

-          It can work in both client/server and embedded model
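As a small sketch of the transactional side, here is an example using mysql-connector-python; the credentials, database, and table are placeholders. InnoDB is a transactional storage engine, while MyISAM is a non-transactional one.

```python
# Rough sketch with mysql-connector-python; credentials, database, and table
# names are placeholders, not a real deployment.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="demo",
                               password="secret", database="demo_db")
cur = conn.cursor()

# InnoDB is a transactional storage engine; MyISAM is a non-transactional one
cur.execute("""
    CREATE TABLE IF NOT EXISTS accounts (
        id INT PRIMARY KEY,
        balance DECIMAL(10,2)
    ) ENGINE=InnoDB
""")

# With a transactional engine, the change only persists on commit
cur.execute("INSERT INTO accounts (id, balance) VALUES (1, 100.00)")
conn.commit()

cur.close()
conn.close()
```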

Oracle

Oracle is an Object Relational Database Management System, which adds a flavor of object orientation to the relational database realm. It is the most popular and widely used RDBMS. Internally, Oracle RDBMS is divided into the Oracle instance and the database. The Oracle instance consists of memory structures in RAM and server processes, and is identified by a System Identifier (SID). The database part consists of the data files on the operating system. A number of background processes work in parallel to provide very fast access to the data and to modify it. Oracle Database comes with many added features that make it the number one RDBMS; some of them are:

-          It provides Real Application Clusters (RAC) to enable maximum availability in the event of an Oracle instance failure

-          It uses Automatic Storage Management (ASM) and Automatic Memory Management (AMM) to simplify the management of storage, memory, and performance

-          It has a proprietary backup and recovery tool (RMAN) for simplified backup and restoration of data in case of instance failure or data corruption

-          It supports ANSI SQL as well as PL/SQL for procedural programming
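A minimal connection sketch with the python-oracledb driver follows; the credentials and service name are placeholders, and querying V$INSTANCE requires the appropriate privileges. It simply shows the instance, identified by its instance name/SID, from the data dictionary.

```python
# Sketch with python-oracledb; credentials and DSN are placeholders, and
# reading V$INSTANCE requires appropriate privileges.
import oracledb

conn = oracledb.connect(user="demo", password="secret",
                        dsn="localhost/orclpdb1")
cur = conn.cursor()

# The instance (memory structures + processes) is identified by its name/SID
cur.execute("SELECT instance_name, status, version FROM v$instance")
print(cur.fetchone())

conn.close()
```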

Critical memory components in the System Global Area (SGA)

The System Global Area, or SGA, is the shared memory area that gives all Oracle server processes access to the memory structures used to manage database access. The SGA together with the background server processes effectively constitutes the Oracle database instance. The SGA holds several memory components, such as the database buffer cache, shared pool, large pool, redo log buffer, Java pool, and Streams pool, which are used by specific background processes to perform specific tasks. All of these SGA memory components are allocated automatically at instance startup and freed at instance shutdown. The primary function of the SGA is to achieve faster access to data from Random Access Memory (RAM), avoiding expensive disk I/O, as well as to optimize data processing.





Source: Oracle® Database Concepts 12c Release 1 (12.1)
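On a running instance, the individual SGA components and their sizes can be listed from V$SGAINFO. A minimal sketch follows; the connection details are placeholders and python-oracledb is assumed.

```python
# List SGA components and their sizes from V$SGAINFO; connection details are
# placeholders and the query needs access to the V$ views.
import oracledb

conn = oracledb.connect(user="demo", password="secret", dsn="localhost/orclpdb1")
cur = conn.cursor()

cur.execute("SELECT name, bytes FROM v$sgainfo ORDER BY bytes DESC")
for name, size_bytes in cur:
    print("%-35s %12d" % (name, size_bytes))

conn.close()
```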



Data Buffer Cache

The Data Buffer Cache, also known as the Database Buffer Cache, is the memory area into which data blocks are read from the data segments and where frequently accessed data blocks are temporarily cached. Technically, every data access from the database, except in a few specific scenarios, goes through this cache area. Oracle first searches for the data in this buffer cache and, only on a miss, performs the expensive disk I/O. The buffer manager uses a cache retention algorithm (e.g. Least Recently Used, or LRU) and probability-based heuristics (learned from the data access pattern) to make efficient use of this precious buffer cache memory. The primary function of the Database Buffer Cache is to minimize physical disk I/O during reads and updates:

-          When a user updates a record, the database updates the data in the cache first and stores metadata about the change in the redo log buffer. Later, the Database Writer (DBWR) process performs lazy writes in the background to maximize database performance

-          When a user reads records, they are served from the cache first, avoiding expensive disk I/O. The cache keeps frequently accessed blocks in the buffer and ages out infrequently accessed blocks (using the LRU algorithm)

The DBWR process is directly involved in reading and writing the Database Buffer Cache area, while LGWR is indirectly associated with it. PMON and SMON also access the Database Buffer Cache when a process fails or terminates abruptly, to perform cleanup and recovery. LGWR doesn't access the cache directly, but DBWR's writes to disk may have to wait for LGWR to finish writing the redo log buffer to disk, because the Oracle Database mandates the write-ahead protocol.
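One traditional way to see how well the buffer cache is avoiding disk I/O is the classic hit ratio computed from V$SYSSTAT. The sketch below assumes placeholder connection details and access to the V$ views; the ratio is a rough indicator, not a definitive tuning metric.

```python
# Compute the classic buffer cache hit ratio from V$SYSSTAT using
# python-oracledb; connection details are placeholders.
import oracledb

conn = oracledb.connect(user="demo", password="secret", dsn="localhost/orclpdb1")
cur = conn.cursor()

cur.execute("""
    SELECT name, value FROM v$sysstat
    WHERE name IN ('db block gets', 'consistent gets', 'physical reads')
""")
stats = dict(cur.fetchall())

logical_reads = stats['db block gets'] + stats['consistent gets']
hit_ratio = 1 - stats['physical reads'] / logical_reads
print("buffer cache hit ratio: %.2f%%" % (hit_ratio * 100))

conn.close()
```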

Redo Buffer Cache

The Redo Buffer Cache, or Redo Log Buffer, is a circular buffer in the SGA that stores redo entries, data structures containing the information necessary to reconstruct changes made to the database. Server processes write the redo entries to the Redo Log Buffer. The use of the redo buffer dramatically improves database performance: users don't need to wait for the changes to be persisted to disk (a very expensive I/O), because once the redo entries are written to the redo log buffer by the database processes, the changes are considered done. To avoid data loss due to a database instance crash or malfunction, a background process (LGWR) sequentially writes the records from the Redo Log Buffer to the online redo log group on disk. This is a very fast operation, as the writes go sequentially from the redo log buffer to the online redo log. When recovery is necessary, PMON and SMON use these redo records to reconstruct the lost changes.
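The size of the redo log buffer and the redo activity flowing through it can be observed from V$SGAINFO and V$SYSSTAT. A rough sketch follows, with placeholder connection details.

```python
# Inspect the redo log buffer size and redo activity; connection details
# are placeholders and the V$ views require appropriate privileges.
import oracledb

conn = oracledb.connect(user="demo", password="secret", dsn="localhost/orclpdb1")
cur = conn.cursor()

# Size of the redo log buffer allocated in the SGA
cur.execute("SELECT bytes FROM v$sgainfo WHERE name = 'Redo Buffers'")
print("redo log buffer bytes:", cur.fetchone()[0])

# How much redo has been generated and written so far
cur.execute("""
    SELECT name, value FROM v$sysstat
    WHERE name IN ('redo entries', 'redo size', 'redo writes')
""")
for name, value in cur:
    print(name, value)

conn.close()
```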

The Shared Pool

The Shared Pool is a container memory area with various subcomponents such as the Library Cache, Data Dictionary Cache, Server Result Cache, and Reserved Pool. It caches various types of program data, like parsed SQL, PL/SQL code, system parameters, and data dictionary information. Almost every operation that occurs in the database accesses data from the shared pool.

Source: Oracle® Database Concepts 12c Release 1 (12.1)
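The breakdown of the shared pool into its subcomponents can be seen in V$SGASTAT. A short sketch follows, again with placeholder connection details.

```python
# Show the largest shared pool subcomponents from V$SGASTAT; connection
# details are placeholders.
import oracledb

conn = oracledb.connect(user="demo", password="secret", dsn="localhost/orclpdb1")
cur = conn.cursor()

cur.execute("""
    SELECT name, bytes FROM v$sgastat
    WHERE pool = 'shared pool'
    ORDER BY bytes DESC
    FETCH FIRST 10 ROWS ONLY
""")
for name, size_bytes in cur:
    print("%-30s %12d" % (name, size_bytes))

conn.close()
```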

The Data Dictionary Cache

The Data Dictionary Cache, also known as the Row Cache, caches information from the data dictionary, the collection of database tables and views containing information about the database, its structure, and its users. This information is used by all server processes to perform their specific functions; for example, the data dictionary is consulted to parse a SQL statement and create its execution plan. Since this cache is used so often by the server processes, holding the data dictionary information in the SGA memory area improves the performance of the Oracle Database.
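Data dictionary (row) cache efficiency can be checked from V$ROWCACHE by comparing gets against misses. A minimal sketch with placeholder connection details follows.

```python
# Overall data dictionary cache hit ratio from V$ROWCACHE; connection
# details are placeholders.
import oracledb

conn = oracledb.connect(user="demo", password="secret", dsn="localhost/orclpdb1")
cur = conn.cursor()

cur.execute("SELECT SUM(gets), SUM(getmisses) FROM v$rowcache")
gets, misses = cur.fetchone()
print("dictionary cache hit ratio: %.2f%%" % ((1 - misses / gets) * 100))

conn.close()
```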

The Library Cache

The Library Cache is an SGA memory component within the Shared Pool that holds executable SQL and PL/SQL code and control structures such as locks and library cache handles. Its contents vary with the server architecture: in dedicated server mode it contains only shared SQL areas, whereas in shared server mode it also holds private SQL areas. A shared SQL area holds the first occurrence of a SQL statement, while a private SQL area is kept for each session; multiple private SQL areas can be associated with the same shared SQL area.

Storing executable SQL and PL/SQL statements helps to tremendously improve the performance of database processing. Every time a user or an application requests the execution of a statement, the server process first looks it up in this cache area; if a syntactically and semantically identical statement is found, it simply reuses the already parsed statement. If the server process doesn't find an exact match, it parses the statement and then loads it into the Library Cache for future use.
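One practical way to increase the chance of such a match is to use bind variables, so that repeated executions differ only in their bound values and can share the same parsed statement. A minimal sketch with python-oracledb follows; the connection details and table are placeholders.

```python
# Repeated executions of the same statement text with bind variables can
# reuse the version already parsed into the library cache; names are
# placeholders for illustration.
import oracledb

conn = oracledb.connect(user="demo", password="secret", dsn="localhost/orclpdb1")
cur = conn.cursor()

sql = "SELECT ename FROM emp WHERE deptno = :dept"   # same text every time
for dept in (10, 20, 30):
    cur.execute(sql, dept=dept)                       # only the bind value changes
    print(dept, cur.fetchall())

conn.close()
```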

Like the Database Buffer Cache, this cache also uses the LRU algorithm to optimize the use of memory and to increase the probability of a Library Cache hit. Apart from LRU aging, it also invalidates a cached statement if the underlying objects used by the statement change, for example when the database gathers statistics for a table, table cluster, or index, when a database object is changed through a DDL statement, or when the global database name changes.

All server processes use the Library Cache to improve the efficiency of SQL and PL/SQL statement execution.

How are the LGWR and ARCH processes inter-related?



Both LGWR and ARCH are background processes used by the Oracle instance to recover transactions and lost data. The LGWR process enables recovery from an Oracle instance failure, whereas the ARCH process enables recovery from a media failure in which the online redo log files are corrupted or unavailable.

The LGWR process writes the redo entries from the redo log buffer to the online redo log file in a sequential manner. There are at least two online redo log files in the online redo log, but Oracle uses only one of them at a time to store the redo records. When that file fills up, or when a switch is configured to occur after a pre-specified period of time, a log switch happens and the LGWR process starts writing to the next available online redo log file.

This scenario holds true regardless of whether the database is in archiving or no-archiving mode. But when there are no more unused log files in the online redo log, how LGWR proceeds depends on the archiving mode. If the database is running in ARCHIVELOG mode, a filled online redo log file becomes available to the log writer (LGWR) process only after the changes in that online redo log file have been written to the data files and the file has been archived to an archived redo log file by the ARCH process(es).


So the number of ARCH background processes (which can be set to a maximum of 10) plays an important role in Oracle database performance. The LGWR process spawns a new ARCH process when it sees that the load is high, and continues to do so until it reaches the maximum allowed number of ARCn processes. If the ARCH processes are slowed down for some reason, this may impact LGWR's activities and eventually the overall performance and recoverability of a database running in ARCHIVELOG mode.
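Whether the database is in ARCHIVELOG mode, how many archiver processes are allowed, and what the ARCn processes are currently doing can all be checked from the dynamic performance views. A short sketch with placeholder connection details is below.

```python
# Check archiving mode and archiver (ARCn) processes; connection details
# are placeholders and the queries need appropriate privileges.
import oracledb

conn = oracledb.connect(user="demo", password="secret", dsn="localhost/orclpdb1")
cur = conn.cursor()

cur.execute("SELECT log_mode FROM v$database")
print("log mode:", cur.fetchone()[0])            # ARCHIVELOG or NOARCHIVELOG

cur.execute("SELECT value FROM v$parameter WHERE name = 'log_archive_max_processes'")
print("max ARCn processes:", cur.fetchone()[0])

cur.execute("SELECT process, status FROM v$archive_processes WHERE status != 'STOPPED'")
for process, status in cur:
    print("ARC%s: %s" % (process, status))

conn.close()
```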