Skip to main content

What Is Big Data

In order to understand 'Big Data', we first need to know what 'data' is. Oxford dictionary defines 'data' as -
"The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media. "
So, 'Big Data' is also a data but with a huge size. 'Big Data' is a term used to describe collection of data that is huge in size and yet growing exponentially with time.In short, such a data is so large and complex that none of the traditional data management tools are able to store it or process it efficiently.

Examples Of 'Big Data'

Following are some the examples of 'Big Data'-
The New York Stock Exchange generates about one terabyte of new trade data per day.
• Social Media Impact
Statistic shows that 500+terabytes of new data gets ingested into the databases of social media site Facebook, every day. This data is mainly generated in terms of photo and video uploads, message exchanges, putting comments etc.
Single Jet engine can generate 10+terabytes of data in 30 minutes of a flight time. With many thousand flights per day, generation of data reaches up to many Petabytes.

Categories Of 'Big Data'

Big data' could be found in three forms:
  1. Structured
  2. Un-structured
  3. Semi-structured

Structured

Any data that can be stored, accessed and processed in the form of fixed format is termed as a 'structured' data. Over the period of time, talent in computer science have achieved greater success in developing techniques for working with such kind of data (where the format is well known in advance) and also deriving value out of it. However, now days, we are foreseeing issues when size of such data grows to a huge extent, typical sizes are being in the rage of multiple zettabyte.
Do you know? 1021 bytes equals to 1 zettabyte or one billion terabytes forms a zettabyte.
Looking at these figures one can easily understand why the name 'Big Data' is given and imagine the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of a 'structured' data.
Examples Of Structured Data
An 'Employee' table in a database is an example of Structured Data
Employee_ID 
Employee_Name 
Gender 
Department 
Salary_In_lacs
2365 
Rajesh Kulkarni 
Male 
Finance
650000
3398 
Pratibha Joshi 
Female 
Admin 
650000
7465 
Shushil Roy 
Male 
Admin 
500000
7500 
Shubhojit Das 
Male 
Finance 
500000
7699 
Priya Sane 
Female 
Finance 
550000

Un-structured

Any data with unknown form or the structure is classified as unstructured data. In addition to the size being huge, un-structured data poses multiple challenges in terms of its processing for deriving value out of it. Typical example of unstructured data is, a heterogeneous data source containing a combination of simple text files, images, videos etc. Now a day organizations have wealth of data available with them but unfortunately they don't know how to derive value out of it since this data is in its raw form or unstructured format.
Examples Of Un-structured Data
Output returned by 'Google Search'

Semi-structured

Semi-structured data can contain both the forms of data. We can see semi-structured data as a strcutured in form but it is actually not defined with e.g. a table definition in relational DBMS. Example of semi-structured data is a data represented in XML file.
Examples Of Semi-structured Data
Personal data stored in a XML file-
?
1
2
3
4
5
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Data Growth over years
Please note that web application data, which is unstructured, consists of log files, transaction history files etc. OLTP systems are built to work with structured data wherein data is stored in relations (tables).

Characteristics Of 'Big Data'

(i)Volume – The name 'Big Data' itself is related to a size which is enormous. Size of data plays very crucial role in determining value out of data. Also, whether a particular data can actually be considered as a Big Data or not, is dependent upon volume of data. Hence, 'Volume' is one characteristic which needs to be considered while dealing with 'Big Data'.
(ii)Variety – The next aspect of 'Big Data' is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days, spreadsheets and databases were the only sources of data considered by most of the applications. Now days, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in the analysis applications. This variety of unstructured data poses certain issues for storage, mining and analysing data.
(iii)Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands, determines real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs, networks and social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv)Variability – This refers to the inconsistency which can be shown by the data at times, thus hampering the process of being able to handle and manage the data effectively.

Advantages Of Big Data Processing

Ability to process 'Big Data' brings in multiple benefits, such as-
• Businesses can utilize outside intelligence while taking decisions
Access to social data from search engines and sites like facebook, twitter are enabling organizations to fine tune their business strategies.
• Improved customer service
Traditional customer feedback systems are getting replaced by new systems designed with 'Big Data' technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.
• Early identification of risk to the product/services, if any
• Better operational efficiency
'Big Data' technologies can be used for creating staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of 'Big Data' technologies and data warehouse helps organization to offload infrequently accessed data.

Comments

Popular posts from this blog

JMeter Exceeded Maximum Number of Redirects Error Solution

While running performance test, JMeter allows maximum 5 redirects by default. However, if your system demands more than 5 redirects, it may result in JMeter exceeded maximum number of redirects error. In this post, we have listed down steps to overcome this error. Actual error in JMeter: Response code: “Non HTTP response code: java.io.IOException” Response message: “Non HTTP response message: Exceeded maximum number of redirects: 5” This error is noticed because  JMeter  allows maximum 5 redirects by default and your system may be using more than 5 redirects. You need to increase this count to more than 5 in jmeter.properties file. Follow below steps to achieve this. Navigate to /bin directory of your JMeter installation. Locate jmeter.properties file and open it in any editor. Search for “httpsampler.max_redirects” property in opened file. Uncomment the above property by removing # before it. Change to value to more than 5 Eg. 20. Save the file and restart JMet...

SSO with SAML login scenario in JMeter

SAML(Security Assertion Markup Language) is increasingly being used to perform single sign-on(SSO) operations. As WikiPedia puts it, SAML is an XML-based open standard data format for exchanging authentication and authorization data between parties, in particular, between an identity provider and a service provider. With the rise in use of SAML in web applications, we may need to handle this in JMeter. This step-by-step tutorial shows SAML JMeter scenario to perform login operation. First request from JMeter is a GET request to fetch Login page. We need to fetch two values ‘SAMLRequest’ and ‘RelayState’ from the Login page response data. We can do this by using  Regular Expression Extractor . These two values need to be sent in POST request to service provider. Refer below image to see how to do this. We will get an HTML login page as a response to the request sent in 1st step. We need to fetch values of some hidden elements to pass it in the next request. We...

VBScript Code - Function to convert CSV file into excel and viceversa in QTP using VBScript

We at times are required to convert excel files into csv to read as flat files and sometime require to convert a csv file into excel file to use excel features on the data.   Below function shows how to convert an csv file into excel file and vice versa. We can also convert to other formats based on constants Here constant value 23 is used to create a csv file and constant -4143 to save a file as xls file. Once the destination file is created, we can delete the source file as shown below.  In case of any issue in understanding the code, please add in comment section Call func_ConversionCSVExcel("E:\Test.csv", "E:\Test_converted.xls", "csvtoexcel") Public Function func_ConversionCSVExcel(strSrcFile, strDestFile, Conversion) on error resume next Set objExcel = CreateObject("Excel.application") set objExcelBook = objExcel.Workbooks.Open(strSrcFile) objExcel.application.visible=false objExcel.application.displayalerts=...