Sameera K M, Asst. Prof., Department of Computer Science and Engineering, ASIET
Abstract – Technological advances in many domains (e.g., the internet, financial companies, health care, user-generated data, and supply chain systems) have led to an explosion of data from these domains. This data-outburst trend gave real meaning to the buzzword 'Big Data'. Compared with traditional data, Big Data exhibits some unique characteristics: it is an enormous, largely unstructured type of data that cannot be handled using traditional databases. In this paper we analyze Big Data system architecture and analytic approaches; the Hadoop framework, Hive, and NoSQL, as approaches for addressing Big Data, are also compared.
Keywords – Big Data, Big Data System Architecture, Big Data Analytics, data dimension, data density
I. INTRODUCTION
Big Data and cloud computing are the buzzwords for the future of the IT industry. According to a recent IDC survey, social sites create a huge amount of data per day: almost 500 million tweets are sent on Twitter, 4 million hours of content are uploaded to YouTube, 6 billion searches are made on Google, 3.6 billion likes come from Instagram, 5.75 billion likes are made on Facebook, and 4.3 billion messages are sent on Facebook. Big Data is a derivative of the trend that followed this data explosion.
Big Data is very diverse in nature. Fundamentally, Big Data is characterized not only by a huge volume of data; it basically consists of sets of unstructured, semi-structured, and structured data which cannot be stored in simple table formats. Below we present definitions of Big Data from three sources:
Definition: Big Data has been defined by several IT companies such as EMC, IBM, and others. Big Data is characterized by the 4 V's, i.e., volume, variety, velocity, and value. EMC supported IDC's definition of big data in a report published in 2011 [2]
that “Big data technologies describe a new generation of
technologies and architectures, designed to economically extract value from
very large volumes of a wide variety of data, by enabling high-velocity capture,
discovery, and analysis.”
Definition: In 2011,
McKinsey's report [3] defined big
data as “Datasets whose size is beyond the ability of typical
database software tools to capture, store, manage, and analyze.”
Definition: The National Institute of Standards and Technology (NIST)
[4] suggests that,
"Big Data is where the data volume, acquisition velocity, or
data representation limits the ability to perform effective analysis using
traditional relational approaches or requires the use of significant horizontal
scaling for efficient processing.”
II. BIGDATA SYSTEM ARCHITECTURE
Big Data system architecture provides many functions to deal with the different phases of today's data life cycle. The architecture of a Big Data system is decomposed into four sequential modules, as shown in Fig. 1: Data Generation, Data Acquisition, Data Storage, and Data Analytics.
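As a rough sketch, the four modules above can be read as a pipeline. The following Python fragment illustrates the flow only; the function names, record fields, and processing rules are illustrative assumptions, not part of any real Big Data platform.

```python
# Four sequential modules as simple callables (illustrative only).

def generate():
    # Data generation: raw records from hypothetical sources.
    return [{"source": "sensor", "value": 21.5},
            {"source": "social", "value": None}]

def acquire(records):
    # Data acquisition: collect, transmit, and pre-process
    # (here, pre-processing just drops incomplete records).
    return [r for r in records if r["value"] is not None]

def store(records, storage):
    # Data storage: organize collected records for later exploration.
    storage.extend(records)
    return storage

def analyze(storage):
    # Data analytics: extract a meaningful value (here, a simple mean).
    values = [r["value"] for r in storage]
    return sum(values) / len(values)

result = analyze(store(acquire(generate()), []))  # -> 21.5
```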
A. Data Generation
Data generation is the first main phase of Big Data. Data sources such as sensors, social sites, health care centers, satellites, airplanes, media, business applications, and machine log data generate large, diverse, and complex datasets. Fig. 1 depicts the data generation phase, showing that data source attribute values come mainly from the scientific, business, and networking fields: the scientific field produces very low attribute values, whereas the business field produces very high attribute values and the networking field produces data at a very high rate.
B. Data Acquisition
The data acquisition phase is divided into a data collection phase, where data is obtained from various data sources, a data transmission phase, and then a data pre-processing phase, from which useful information is extracted.
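The pre-processing step of the acquisition phase can be sketched as follows; the record fields and the cleaning rules (dropping duplicates and empty payloads) are assumed for illustration only.

```python
# Pre-processing sketch: keep only useful, non-duplicate records.

def preprocess(collected):
    seen, useful = set(), []
    for record in collected:
        key = (record["id"], record["payload"])
        if key in seen or record["payload"] == "":
            continue  # drop duplicate transmissions and empty payloads
        seen.add(key)
        useful.append(record)
    return useful

raw = [{"id": 1, "payload": "temp=20"},
       {"id": 1, "payload": "temp=20"},   # duplicate transmission
       {"id": 2, "payload": ""}]          # empty, nothing useful
preprocess(raw)  # -> [{"id": 1, "payload": "temp=20"}]
```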
C. Data Storage
Data storage is always required to keep the data needed for future use; hence, a data storage subsystem in a Big Data platform organizes the collected information in a format that can be used for exploration and value abstraction. Data storage consists of two parts mainly: the hardware arrangement, and a data management system for managing the data.
D. Data Analytics
Analytical methods or tools are required to inspect, transform, and model data to extract meaningful value. The purpose is to understand the meaningful information in the data and what value-added functions can be applied to it.
Fig. 1. Bigdata System Architecture Modules
III. BIGDATA ANALYSIS APPROACHES
Following are some approaches for handling Big Data.
A. Hadoop
Hadoop is a framework that allows the distributed processing of large datasets across clusters of commodity computers using a simple programming model. HDFS (the Hadoop Distributed File System) and MapReduce (a programming model) are its core components. Below are some of the features and usage of Hadoop.
Features of Hadoop:
Scalability: With Hadoop we can scale the hardware infrastructure according to our needs, i.e., the hardware infrastructure can be scaled both up and down according to requirements without any change in the data formats. Data and computation jobs are automatically redistributed to accommodate hardware changes.
Efficiency: Parallel computation is made affordable for the ever-growing volume of Big Data with the help of Hadoop. It brings massively parallel computation to commodity servers, leading to a sizeable decrease in cost per terabyte of storage.
Flexibility: Hadoop absorbs any type of data from any number of sources. Moreover, different types of data from multiple sources can be aggregated in Hadoop for further processing.
Fault tolerance: Data lost due to computation failures caused by node breakdown or network congestion can be recovered by Hadoop.
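The MapReduce programming model at the core of Hadoop can be sketched with the classic word-count example. This is an in-process Python illustration of the map, shuffle, and reduce phases under the assumption of a single machine, not actual Hadoop code.

```python
from itertools import groupby

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    # Shuffle/sort pairs by key, then reduce: sum the counts per word.
    pairs.sort(key=lambda kv: kv[0])
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

reduce_phase(map_phase(["big data", "big clusters"]))
# -> {'big': 2, 'clusters': 1, 'data': 1}
```

In a real Hadoop job the map and reduce functions run on many nodes in parallel, with HDFS holding the input splits and the framework performing the shuffle between phases.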
B. Hive
Hive was started at Facebook in 2006 because of the difficulty of managing a large amount of data that was growing from a few gigabytes to terabytes. Hive acts as a data warehouse system built on top of the Hadoop file system. It is used to analyse large datasets which cannot be handled by traditional RDBMSs.
Usage of Hive:
Hive can be used for log processing; logs get partitioned and bucketed in the form of tables and can then be easily analyzed.
Indexing of huge documents.
Customer-facing business intelligence, where customers demand arbitrary ad hoc SQL queries.
Hypothesis testing, which gets results based on a considered hypothesis, or predictive modelling.
Data is stored inside Hadoop in the form of Hive tables, from which it is accessed. It is stored inside the Hadoop file system because of properties like scalability on various commodity hardware.
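The partitioning and bucketing of logs that Hive performs can be sketched conceptually in plain Python. The choice of date as the partition column and user as the bucketing column is an assumption for illustration; real Hive declares these in the table DDL and stores each partition as a separate HDFS directory.

```python
from collections import defaultdict

def partition_and_bucket(logs, num_buckets=2):
    # partition -> bucket -> list of log records
    tables = defaultdict(lambda: defaultdict(list))
    for log in logs:
        partition = log["date"]                    # partition column
        bucket = hash(log["user"]) % num_buckets   # bucketing column
        tables[partition][bucket].append(log)
    return tables

logs = [{"date": "2024-01-01", "user": "a", "msg": "login"},
        {"date": "2024-01-02", "user": "b", "msg": "click"}]
tables = partition_and_bucket(logs)
sorted(tables)  # -> ['2024-01-01', '2024-01-02']
```

A query restricted to one date then only has to scan that partition's records, which is what makes Hive log analysis efficient at scale.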
C. NoSQL
This term basically means non-relational database systems, which differ from traditional RDBMSs. NoSQL is designed for distributed data stores in which large-scale scaling of data is needed. There is no need for any fixed schema or join operations, and data is scaled horizontally.
Some characteristics of NoSQL are listed below:
• Simple and flexible non-relational database model. As discussed above, it does not require any fixed schema; it offers a flexible schema or no schema, which can be easily handled by various data structures.
• Horizontal scaling over many commodity servers.
• High availability and partition tolerance.
• BASE system (Basically Available, Soft state, Eventually consistent). Basically Available indicates the availability of data at all times, as discussed in the CAP theorem. Soft state indicates that the system may change over time even without any change in the inputs. Eventual consistency indicates that the system will become consistent over time, provided no inputs are given in that period.
• No complicated relationships are there in the data model.
• It is inexpensive.
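The schema-free flavour of NoSQL described above can be sketched with a tiny in-memory key-value store; the class and method names are illustrative assumptions, not a real NoSQL API. Note that each stored document may have a completely different shape, with no schema declared up front.

```python
class TinyKeyValueStore:
    """Minimal key-value store sketch in the NoSQL spirit."""

    def __init__(self):
        self._data = {}  # no fixed schema: any key -> any document shape

    def put(self, key, document):
        self._data[key] = document

    def get(self, key, default=None):
        return self._data.get(key, default)

store = TinyKeyValueStore()
store.put("user:1", {"name": "Ann", "likes": 42})                 # one shape
store.put("post:7", {"text": "hello", "tags": ["big", "data"]})   # another shape
store.get("user:1")["likes"]  # -> 42
```

Production NoSQL systems add what this sketch omits: replication across commodity servers for availability, and the eventual-consistency behaviour of the BASE model when replicas synchronize.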