How Big is Big Data?

There has been a lot of buzz around big data lately. The volume of data we're handling is growing exponentially with the popularity of social media, digital pictures, videos, and data from sources such as sensors, legacy documents, and weather and traffic systems, to name a few. Every day we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone, according to a report from IBM. Analyst firm IDG reports that 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year, driven by the increase in the amount of data they need to manage.

Let me ask you: what is the largest database you have seen so far? A database professional may be able to answer this question, but if you know nothing about databases, you probably cannot.

Let us take the example of the Aadhaar card. The UIDAI (Unique Identification Authority of India) has issued about 25 crore (250 million) Aadhaar cards in India so far. Each card's record is around 5 MB (it includes a photo, fingerprints, and scans), so the existing database size would be roughly

5 MB × 25,00,00,000 = 1,250 TB = 1.25 PB (1,000 TB = 1 petabyte)

On average, the UIDAI issues about 1 million Aadhaar cards each day, so the database grows by roughly 5 TB per day (1 million × 5 MB). With India's population in 2014 at 1,270,272,105 (1.27 billion), the database needed to store Aadhaar data for everyone would be around 6.35 PB. Is this database big enough? Probably not.
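The arithmetic above is easy to check. A quick sketch, using the same estimates this article uses (5 MB per card, 1 million cards issued per day):

```python
# Back-of-the-envelope sizing for the Aadhaar database,
# using this article's estimates (5 MB per card).
MB_PER_CARD = 5
CARDS_ISSUED = 25_00_00_000        # 25 crore = 250 million cards issued so far
CARDS_PER_DAY = 1_000_000          # ~1 million new cards per day
POPULATION = 1_270_272_105         # India's population in 2014

existing_tb = MB_PER_CARD * CARDS_ISSUED / 1_000_000   # MB -> TB
daily_tb = MB_PER_CARD * CARDS_PER_DAY / 1_000_000     # MB -> TB
full_pb = MB_PER_CARD * POPULATION / 1_000_000_000     # MB -> PB

print(f"Existing data: {existing_tb:,.0f} TB (~{existing_tb / 1000:.2f} PB)")
print(f"Daily growth:  {daily_tb:,.0f} TB per day")
print(f"Full rollout:  {full_pb:.2f} PB")
```

This prints 1,250 TB for the existing data, 5 TB per day of growth, and about 6.35 PB for a full rollout.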

Facebook has 680 million monthly active users. Is this big?

Google receives 6,000 million (6 billion) searches per day. Is this big? Maybe.

Storing 6,000 million records is not a big deal; you can use a conventional database like Oracle to store that many records. But it gets more interesting than that. What if I asked you to store every search phrase entered into Google each day for two years, and at the end of it produce a report of the 25 most searched keywords related to “cricket”? That might sound insane: 6,000 million × 365 × 2 = 4,380 billion records! Even if you could store that many records, how would you analyse this data and produce reports?

That is where big data technologies help. Big data platforms do not rely on RDBMSs, SQL queries, or conventional databases; instead, they use tools such as Hadoop, Hive, and MapReduce. MapReduce is a programming paradigm that allows massive job-execution scalability across thousands of servers or clusters of servers. Hadoop is by far the most popular implementation of MapReduce: it aggregates multiple sources of data for large-scale processing, and it can also read data from a database to run processor-intensive machine learning jobs. Hive is a SQL-like bridge that lets conventional BI applications run queries against a Hadoop cluster; it has extended Hadoop's reach by making it feel more familiar to BI users.
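To make the paradigm concrete, here is a minimal in-process sketch of the map and reduce steps for the "most searched keywords" problem above. In a real Hadoop job the map and reduce functions run distributed across the cluster; the tiny search log and function names here are purely illustrative:

```python
from collections import Counter
from itertools import chain

# Toy stand-in for billions of logged search phrases.
search_log = [
    "cricket world cup", "cricket score", "weather today",
    "cricket score", "live cricket", "cricket world cup",
]

def map_phrase(phrase):
    """Map step: emit a (keyword, 1) pair for each word in the phrase."""
    return [(word, 1) for word in phrase.split()]

def reduce_counts(pairs):
    """Reduce step: sum the counts for each keyword."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

mapped = chain.from_iterable(map_phrase(p) for p in search_log)
totals = reduce_counts(mapped)
print(totals.most_common(3))  # the top keywords across the whole log
```

The point of the split is that `map_phrase` can run on each of thousands of machines over its local slice of the log, and the framework only has to shuffle the small `(keyword, count)` pairs to the reducers; that is what makes the 4,380-billion-record version of this problem tractable.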

While big data represents all kinds of opportunities for businesses, collecting, cleaning, and storing it can be a nightmare. Not only is it difficult to know whether the data is being transmitted properly, but also whether the best possible data is being used. Here are some key points to keep in mind while testing big data:

  • Test every entry point in the system (feeds, databases, internal messaging, and front-end transactions) to enable rapid localization of data issues between entry points.
  • Compare the source data with the data landed on the Hadoop system to ensure they match.
  • Verify that the right data is extracted and loaded into the correct HDFS location.
  • Verify the output data: validate that processed data remains the same even when executed in a distributed environment.
  • Verify the batch processes designed for data transformation.
  • Verify more data, faster.
  • Automate testing efforts.
  • Test across different platforms.
  • Manage test data well; it is the key to effective testing.
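As a concrete illustration of the first two points, a reconciliation check between a source feed and the data landed on Hadoop can be as simple as comparing a record count and a checksum per batch. This is a hypothetical sketch, not a real pipeline; the batch contents and helper name are invented for illustration:

```python
import hashlib

def fingerprint(records):
    """Order-insensitive fingerprint of a batch: the record count plus
    an XOR of per-record hashes, so matching batches compare equal
    regardless of the order in which Hadoop wrote them."""
    digest = 0
    for record in records:
        digest ^= int.from_bytes(
            hashlib.sha256(record.encode()).digest()[:8], "big")
    return len(records), digest

# Hypothetical batches: what the source system sent vs. what
# landed in HDFS (same records, different order).
source_batch = ["id=1,amount=100", "id=2,amount=250", "id=3,amount=75"]
landed_batch = ["id=3,amount=75", "id=1,amount=100", "id=2,amount=250"]

assert fingerprint(source_batch) == fingerprint(landed_batch), \
    "Mismatch between source feed and landed data"
print("Batch reconciled: counts and checksums match")
```

Because the fingerprint ignores record order, the check survives the reshuffling that distributed writes introduce, while still catching a dropped, duplicated, or corrupted record.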

The list above is not static and will keep growing as big data keeps getting bigger by the day. Big data amplifies testing concerns intrinsic to databases of any size as well as poses some new challenges. Therefore, a testing strategy is critical for success with big data. Companies that get this right will be able to realize the power of big data for business expansion and growth.

Manoj Pandey | Zen Test Labs
