Big Data

Big Data Vocab:

  • Big data
  • Sentiment analysis
  • Natural language processing
  • Predictive analytics
  • Distributed computing
  • Scalability
  • Redundancy
  • Cluster
  • Node
    • NameNode
    • DataNode
  • Hadoop
  • Hadoop Distributed File System

Introduction:

Big Data: It’s one of the hottest buzzwords in the tech world today, and the social media platforms we use and love generate it constantly. Let me give you an example to put things in perspective: in 2012, Jay Parikh, engineering VP at Facebook, revealed that Facebook handles over 500 terabytes of data every day, including 300 million photos, 2.6 billion ‘likes’, and 2.5 billion pieces of uploaded content. Clearly, this type of data has rightfully earned the prefix “big”. Now, although many have heard the phrase ‘big data’ before, few actually know what it means, because most existing information about it online is either too general (such as, “it’s data that’s big”) or too specific (the type of jargon that makes you dizzy). As a teen who was frustrated while trying to find information about this hot new topic somewhere between those two extremes, I thought: why don’t I create a curriculum myself? So let’s get into it.

What is it?

According to Dictionary.com, big data is “extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.” In today’s world, the amount of data being generated is growing at an exponential rate. In 2015, Forbes reported that more data had been created in the previous two years than in the entire earlier history of the human race. Mind-blowing, right? Yet studies reveal that less than 1% of existing data is ever actually analyzed. Fortunately, the tech industry recently began to realize what it was missing out on. Data is power. Well, at least analyzed data is. And thus began the era of analyzing huge amounts of data to uncover patterns and trends, both to push innovation forward and to increase our understanding of the world around us.

Its impact

Analyzing big data has a great impact on many different realms around us.

Personalized Marketing

Analyzing data enables personalized marketing, a type of marketing in which companies deliver personalized messages and product offerings to individuals based on the data those individuals generate online. Customers generate vast amounts of data through their searches and activity on the Internet and social media. Companies then have far more information about each customer to work with, and they can tailor their ads and products to individuals and their interests. For example, when you are shopping online on Amazon, the products recommended to you are often based on your recent searches. This type of marketing is much more effective because it was created for you, based on what you are looking for, rather than just randomly assigned to your screen. Not only does it leave you a happier customer, but it also increases your chances of making a purchase with Amazon, which increases Amazon’s success as a company. Using BD analytics for personalized marketing is evidently a win-win! So how do companies analyze BD to make this happen?

  • Sentiment analysis (also known as opinion mining): when a company analyzes customer reviews to gauge customer opinions or sentiments around a certain product. For example, a sentiment analysis program might recognize keywords like “very good!” or “horrible” and associate an opinion with each, which in this case would be “good” and “bad” respectively. After analyzing the sentiments, companies can take action by marketing to you more of a product that you felt was good and less of a product that you expressed dislike for (a toy example follows this list).
  • Natural Language Processing (NLP): a branch of artificial intelligence that helps computers and machines understand, interpret and manipulate human language.
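
To make sentiment analysis concrete, here is a minimal sketch in Python. The keyword lists and sample reviews are invented for illustration; real systems use far richer NLP models than simple keyword matching.

```python
# A minimal keyword-based sentiment scorer. The keyword lists and
# sample reviews below are made up for illustration.
POSITIVE = {"very good", "great", "love", "excellent"}
NEGATIVE = {"horrible", "bad", "hate", "terrible"}

def score_review(review: str) -> str:
    text = review.lower()
    pos = sum(text.count(word) for word in POSITIVE)
    neg = sum(text.count(word) for word in NEGATIVE)
    if pos > neg:
        return "good"
    if neg > pos:
        return "bad"
    return "neutral"

reviews = [
    "This blender is very good! I love it.",
    "Horrible battery life, terrible purchase.",
]
for review in reviews:
    print(score_review(review), "-", review)
```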

Improving medicine and healthcare

Big data is revolutionizing healthcare by improving medical treatments. Now that there are tools to collect and analyze data from each individual patient, new treatments can be developed with precision going as deep as the molecular and tissue level. Additionally, BD can be used to increase the accuracy and specificity of existing treatments. For example, medical imaging can be revolutionized with BD: by feeding imaging technology large sets of data to learn from, machines become better equipped to detect abnormalities in images, because they have had a lot of data to practice with. Imaging then becomes more precise, leading to disease detection at even earlier stages and, in turn, earlier treatment.

Helping with Natural Disasters

Big data can help monitor and even predict natural disasters. During a natural disaster, many people give updates about it on social media and the internet. What they may not realize is that their latest Twitter post with a photo of how intense the local earthquake has gotten is actually contributing to a huge stream of data that can help relief centers and disaster-prevention agencies take effective action. One of the ways BD helps monitor natural disasters is by collecting all the data that people share on social media about a current disaster and analyzing it to see how bad the situation is and what steps to take. BD can also help with predictive analytics, a field which uses data mining, statistics, and modeling to make predictions about future events – in this case, about future disasters. An example would be predicting when a volcano is going to erupt by feeding a machine loads of data about the signs that a volcano exhibited before previous eruptions, then feeding in its current state and analyzing the similarities and differences (a toy sketch of this idea follows). Using BD to improve predictive analytics is incredibly important, because if you know when a disaster will occur, you can better prepare for it and minimize damage.
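
To make predictive analytics concrete, here is a toy sketch in Python using scikit-learn. The volcano readings, the choice of features, and the labels are entirely made up for illustration; real monitoring systems rely on vastly more data and far richer models.

```python
# A toy predictive analytics sketch: fit a model on (made-up) historical
# volcano readings labeled by whether an eruption followed, then score
# the volcano's current state. All numbers here are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: daily earthquakes, ground deformation (mm), gas emission index
history = np.array([
    [ 2, 0.1, 1.0],   # quiet period
    [ 3, 0.2, 1.1],   # quiet period
    [40, 5.0, 9.5],   # month before a past eruption
    [55, 7.5, 8.8],   # month before a past eruption
])
erupted = np.array([0, 0, 1, 1])  # 1 = an eruption followed

model = LogisticRegression().fit(history, erupted)

current_state = np.array([[35, 4.2, 7.9]])  # today's readings
print("Estimated eruption risk:", model.predict_proba(current_state)[0, 1])
```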

So how is big data analyzed?

Before I use technical terms to explain how BD is analyzed, I am going to start with an everyday example/analogy. One thing to consider when working with BD is that there is a huge amount of it: it would be impractical to feed it all to a single computer for analysis. A better and more efficient way to process the data is to employ the classic technique of divide and rule, in which the work is divided up, completed separately, and then combined at the end so that every part is informed of the others. For example, suppose you have a really big book to read, and your number one objective is to read it as fast as you can. How would you go about doing it?

  • Method #1 – One solution would be to just sit down and plow through it, which is inefficient: you would probably get more and more exhausted as time went on, and reading it alone is not very fast at all.
  • Method #2 – Now suppose you assemble a team of ten readers who read at around the same speed, and you assign them each a tenth of the book to read. At the end, everyone collaborates by giving a brief summary of what they read. If everyone starts at the same time, this technique essentially cuts the overall reading time to a tenth of the original! Each reader is also not as worn out, since each only has to do a tenth of the overall work. So which method would you choose for speed and efficiency?

Obviously, the second scenario is more efficient, so the technological equivalent of that is precisely what big data analysts choose to employ!
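
Here is what Method #2 can look like in code: a minimal Python sketch that splits a “book” into ten chunks, lets a pool of worker processes count the words in each chunk in parallel, and then merges the partial results. The book text is just a stand-in, but the divide-and-rule pattern is the real thing.

```python
# Divide and rule in code: split a "book" into chunks, let a pool of
# worker processes count words in each chunk in parallel, then merge
# the partial results, just like the ten readers and their summaries.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk: str) -> Counter:
    """One 'reader' summarizes its share of the book as word counts."""
    return Counter(chunk.split())

if __name__ == "__main__":
    book = "to be or not to be " * 1000  # stand-in for a really big book
    words = book.split()
    n_readers = 10
    chunks = [" ".join(words[i::n_readers]) for i in range(n_readers)]

    with Pool(processes=n_readers) as pool:
        partial_counts = pool.map(count_words, chunks)  # read in parallel

    total = sum(partial_counts, Counter())  # pool the "summaries"
    print(total.most_common(3))
```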

Now, in technical terms…

Let’s start with distributed computing:

A distributed system is a network of independent computers that communicate and work with each other in order to achieve a goal. There are two main advantages of using distributed computing:

  • Scalability – the system can be expanded easily by adding machines where needed
  • Redundancy – several machines carry out the same task, so there is no single point of failure
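
As a toy illustration of redundancy, the sketch below runs the same task on several “machines” (threads standing in for computers) and accepts the first copy that succeeds, so a single slow or crashed worker cannot take down the whole job. The timings and failure rate are invented.

```python
# A toy picture of redundancy: give the same task to three "machines"
# (threads here) and accept the first copy that succeeds, so one slow
# or crashed worker does not stall the job. Timings/failures are faked.
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_task(worker_id: int) -> str:
    time.sleep(random.uniform(0.1, 1.0))  # simulate variable speed
    if random.random() < 0.3:             # simulate an occasional crash
        raise RuntimeError(f"worker {worker_id} failed")
    return f"answer from worker {worker_id}"

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(run_task, i) for i in range(3)]
    for future in as_completed(futures):
        try:
            print(future.result())        # first successful copy wins
            break
        except RuntimeError:
            continue                      # a replica covers the failure
```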

Let’s dive into something more technologically specific to big data analytics: Hadoop.

Hadoop is software that provides a basic framework for storing and analyzing very large amounts of data on inexpensive machines. What’s special about Hadoop is that it pairs massive data storage with the ability to run analytical algorithms directly on those huge amounts of data (BD).

Let’s go back to our book-reading analogy and translate Method #2 (divide and rule) into the technical terms of how big data is analyzed. Method #2 entails assembling a group of readers and assigning them each an equal portion of the book to read, cutting the total reading time to a tenth of the original. In tech terms, the group of people that you assemble is analogous to a group of computers, known as a cluster. Each individual reader within the group would be an individual computer within the cluster, known as a node.
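
Hadoop’s classic processing model, MapReduce, implements exactly this: a “map” step runs on each node against that node’s slice of the data (each reader reading their tenth), and a “reduce” step combines the partial results (the readers pooling their summaries). Below is a minimal word-count sketch in the style of Hadoop Streaming, which pipes data through ordinary scripts via standard input and output; the file names are just illustrative.

```python
# mapper.py - runs on each node against its own slice of the input,
# like one reader working through one tenth of the book.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")  # emit a (word, 1) pair for every word seen
```

```python
# reducer.py - combines the partial results from all the mappers,
# like the readers pooling their summaries at the end.
import sys

counts = {}
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    counts[word] = counts.get(word, 0) + int(n)

for word, total in counts.items():
    print(f"{word}\t{total}")
```

To try the pair locally, you could pipe a text file through them, something like: cat book.txt | python mapper.py | python reducer.py. In a real Hadoop job, the framework handles the splitting, shuffling, and collecting across the cluster.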

So Hadoop employs distributed computing by using clusters and nodes (nodes connected in a network form a cluster). The distributed file system that is specific to Hadoop is known as HDFS (Hadoop Distributed File System). Let’s explore the architecture of HDFS:


[Diagram: HDFS architecture, with a NameNode managing several DataNodes. Source: https://codemphasis.wordpress.com/2012/09/27/big-data-hadoop-hdfs-and-mapreduce/]

Each cluster has a NameNode and corresponding DataNodes. NameNodes are nodes that exist specifically for the purpose of opening, closing, and renaming files; they also map files to blocks on the DataNodes and give the DataNodes instructions on what to do with data. What a NameNode keeps is essentially metadata, or data whose function is to describe other data. DataNodes are nodes that store the data from the files they are given; they also receive and execute instructions from the NameNodes, such as creating, replicating, and deleting blocks of data.
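
As a toy model of this division of labor, the Python sketch below keeps only metadata in the NameNode (which blocks make up a file, and which DataNode holds each block), while the DataNodes hold the actual bytes. The class names mirror the HDFS roles; everything else is heavily simplified for illustration.

```python
# A toy model of the HDFS division of labor: the NameNode keeps only
# metadata (which blocks make up a file, and where each block lives),
# while the DataNodes hold the actual bytes. Heavily simplified.

class DataNode:
    def __init__(self, name: str):
        self.name = name
        self.blocks = {}  # block_id -> raw data

    def store(self, block_id: str, data: str):
        self.blocks[block_id] = data

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.file_map = {}  # filename -> list of (block_id, DataNode)

    def write(self, filename: str, data: str, block_size: int = 8):
        placements = []
        for i in range(0, len(data), block_size):
            block_id = f"{filename}-blk{i // block_size}"
            node = self.datanodes[len(placements) % len(self.datanodes)]
            node.store(block_id, data[i:i + block_size])  # bytes -> DataNode
            placements.append((block_id, node))           # metadata stays here
        self.file_map[filename] = placements

nodes = [DataNode(f"dn{i}") for i in range(3)]
namenode = NameNode(nodes)
namenode.write("report.txt", "big data is stored in blocks")
for block_id, node in namenode.file_map["report.txt"]:
    print(block_id, "->", node.name)
```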

Replication

A key aspect of HDFS is its replication of data, which greatly increases fault tolerance (the ability to continue operating properly even when some components fail). The NameNode specifies how many times data should be replicated, and the replicas are spread across several DataNodes, so that there are multiple copies in several locations. This makes fault tolerance very high, which is one of Hadoop’s greatest strengths.
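
Here is a small standalone sketch of replica placement. HDFS defaults to three replicas per block; the round-robin placement below is a simplification of the real, rack-aware placement policy.

```python
# A standalone sketch of replica placement: choose `replication`
# distinct DataNodes to hold copies of each block. HDFS defaults to
# 3 replicas; real placement is rack-aware, this one is round-robin.

def place_replicas(block_index: int, datanodes: list, replication: int = 3) -> list:
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return [datanodes[(block_index + k) % len(datanodes)]
            for k in range(replication)]

datanodes = ["dn0", "dn1", "dn2", "dn3"]
for block in range(4):
    print(f"blk{block} ->", place_replicas(block, datanodes))
```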

Wrap up:

Outline of the topics touched upon in this article:

  • What big data is
  • Impact
    • Personalized marketing
    • Improve medicine
    • Monitoring/preventing natural disasters
  • How big data is analyzed
    • Distributed computing
    • Hadoop
      • Hadoop Distributed File System

Fascinated with the topic of big data? (I am too). Read more below if you are interested!

Big Data Vocab Defined:

  • Big data – extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions
  • Sentiment analysis – when a company analyzes customer reviews to gauge customer opinions or sentiments around a certain product
  • Natural language processing – a branch of artificial intelligence that helps computers and machines understand, interpret and manipulate human language
  • Predictive analytics – a field which uses data mining, statistics, and modeling to make predictions about future events
  • Distributed computing – a field of computer science that studies distributed systems; distributed system – a network of independent computers that communicate and work with each other in order to achieve a goal
  • Scalability – the system can be expanded easily by adding machines where needed
  • Redundancy – several machines carry out the same task, so there is no single point of failure
  • Cluster – a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system
  • Node – individual computer in a cluster
    • NameNode – a node that exists specifically for the purpose of opening, closing, and renaming files, and that keeps the metadata mapping files to blocks on DataNodes
    • DataNode – a node that stores the data from the files it is given
  • Hadoop – software that provides a basic framework for storing and analyzing very large amounts of data on inexpensive machines
  • Fault tolerance – the ability of a system to continue operating properly even when some of its components fail
  • Hadoop Distributed File System – the distributed file system specific to the Hadoop environment; provides massive data storage and fault tolerance through replication of the data stored within its DataNodes

Quiz: