1/20 – 20 Days of Data

Started Week 3 of ‘Intro to Big Data’ (Coursera)

2/20 – 20 Days of Data

Week 3 (Coursera)

– Distributed File System (DFS) : A file system whose Files (Long-Term Information, Data) are spread across many Nodes of a network, integrated so you can access, organize and analyze the data on all the Nodes as if it were one system.

– Scalable Computing: Done with ‘Parallel Computers’ (Data-Parallel Scalability); the most common method, because of its cost efficiency, is Commodity Clustering , a form of Distributed Computing over a network (creating Data Parallelism)

Fault Tolerance is implemented in preparation for the failure of a set of Nodes in a rack of computers, which would otherwise cause problems in the overall processing of the integrated Network of Data (Nodes).

Fault Tolerance recovery solutions:

Redundant Data Storage (recover Data after a failure)
Data-Parallel Job Restart (complete Restart of the failed job)

– Programming Models: Abstractions of existing Infrastructures

Ideal Requirements of a good Programming Model :

Supports Big Data Operations

Handle Fault Tolerance

Easily Scalable

Optimized for Specific Data Type (Graph/PDF/Tables/Streams/Multimedia) …

The Programming Model should be optimized to integrate and work over the Distributed File System (DFS) that the Data Nodes are stored on

Ex. of a good Programming Model : MapReduce – used to batch together and sort multiple matches within an uploaded Network of Big Data Nodes ( originally created by Google to index web pages and match them to a search query )

3/20 – Week 3 (Coursera)

Hadoop : Four Core ” W’s ” to ask [ Goals of the Project Framework ]

Who (Used It)

What (Is in the Ecosystem – General Culture of the subject you are investigating Ex. Twttr BTC Culture)

Where ( Is it Used)

Why (Is it Beneficial)

3 Main Components of Hadoop :


HDFS ( Scalable, Fault-Tolerant Storage )


YARN ( Resource Manager )


MapReduce ( Matching Pairs – First developed and used by Google for Search-Engine indexing )


Many more tools build on these ; the Majority of Programs run off the HDFS ‘File System’ ( Ex. Facebook Social Graphs run off the ‘Giraph’ Program which leverages HiveIO to access Hive tables on-top of the HDFS System , the Program can also work with YARN or MapReduce to access Hive tables from HDFS )

4/20 – 20 Days of Data

Layers of Hadoop Ecosystem :

Understanding the Layer [or] ‘Stack’ Diagram related to the Program you are using makes it easier for you to fully understand what capabilities are available for you to leverage.

Top Tools in the Hadoop HDFS Stack :

– MapReduce (Supporting Hive / Pig on top of it)

– YARN (Supporting Giraph on top of it)

– ZooKeeper

– HDFS (Supporting all of the Programs listed above)

[ Many Programs rely on the HDFS Architecture but some are Independent ]

– Many More … (Over 100 open-source Projects available today for Big Data)

You can Download any of these Programs individually and leverage a select few however you like best ; Or you can Download Pre-Stacked packages of multiple tools built together

5/20 – 20 Days of Data

HDFS : Foundation for most Tools in Big Data Ecosystem

[Reason HDFS is so Popular] – Scalable and Reliable Storage that makes it possible to store and access large Datasets

HDFS is able to do this by splitting one Large File into many smaller chunks spread across different Nodes, so the file can be digested in smaller pieces

( This is called ‘Partitioning’ )

A typical input File ranges from [ MB ] up to [ TB ]

By default, a File is Partitioned into 64 MB chunks

HDFS Fault Tolerance Protocols are in place to protect you from losing access to and function of any Partitioned chunk whose Node is hindered or crashes

This is done with a ‘Default Replication’ Factor (3-Copy Default) , meaning that by default the HDFS system is set up to immediately make a minimum of 3 Copies of each chunk during the Partitioning Process…

The 3-copy default is what you always start with , but you can customize it afterwards and make extra copies of any chunk you so choose
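The partitioning and 3-copy replication described above can be sketched in a few lines ( a toy model – the node names and round-robin placement are my own illustrative assumptions ; real HDFS places replicas with rack awareness ):

```python
# Sketch: split a "file" into fixed-size blocks and replicate each block
# across 3 distinct nodes (round-robin placement chosen for illustration).

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default chunk size
REPLICATION = 3                # 3-copy default replication factor

def partition(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) blocks covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (round-robin)."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

# A 200 MB file becomes four blocks: 64 + 64 + 64 + 8 MB.
blocks = partition(200 * 1024 * 1024)
placement = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
# Losing any single node still leaves 2 copies of every block.
```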

………… [ Finish HDFS Chapter ]

6/20 – 20 Days of Data

. . . (Finish HDFS Section)

HDFS offers scalable Big Data Storage with Fault Tolerance by Partitioning Files over multiple Nodes ; breaking a large data file down into smaller chunks makes it easier to digest and allows you to replicate multiple copies of each smaller chunk to ensure the information is safe in the case of a crash

YARN : Resource Manager for Hadoop

YARN extends Hadoop by enabling multiple frameworks to integrate with the Hadoop Program.

Programs working on top of the YARN Addition ; MapReduce , Giraph , Spark , Flink

Essential Gears of YARN Framework –

  • Central Resource Manager (Ultimate Decision Maker)
  • Node Manager (In charge of the single Node it is attached to)
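These two gears can be sketched as a toy scheduler ( the class names, memory numbers and first-fit policy are my own illustrative assumptions ; real YARN schedules containers far more intelligently ):

```python
# Toy sketch: a central ResourceManager hands out work, and each
# NodeManager tracks the resources of the single node it is attached to.

class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb  # resources left on this one node

    def launch(self, needed_mb):
        """Accept the task if this node has enough free memory."""
        if self.free_mb >= needed_mb:
            self.free_mb -= needed_mb
            return True
        return False

class ResourceManager:
    """Ultimate decision maker: picks which node runs each task."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit(self, needed_mb):
        for nm in self.node_managers:      # first-fit policy, for illustration
            if nm.launch(needed_mb):
                return nm.name             # node that accepted the task
        return None                        # cluster is full

rm = ResourceManager([NodeManager("node1", 1024), NodeManager("node2", 2048)])
# node1 takes the first 1024 MB task; the next one spills over to node2.
```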

7/20 – 20 Days of Data

. . . (Finish YARN Resource Manager)

By working with YARN over HDFS you are able to run many different new programs because of the Central Resource Manager and Node Manager Framework that YARN provides.

[ Before this, HDFS was only compatible with MapReduce ]

MapReduce : Simplifies Parallel Programming

With MapReduce the only 2 Tasks that you must focus on are Map & Reduce

You provide the ‘MAP’ Task that is applied to your input Data , and ‘REDUCE’ analyzes / summarizes the mapped elements to produce the desired Output

“Hello-World” is the common Standard / Phrase for the first Program you should write when starting with a new system you’ve never used before

The “Hello-World” Program for MapReduce is WordCount

WordCount reads one or more Text Files and counts the number of occurrences of each word, pairing words with counts into lists from the File

Ex. Map [ Files ] → Shuffle & Sort [ WordCount ] → Reduce [ Result File ]

File 1 ──┐
File 2 ──┼── WordCount ──→ Result File (Paired List)
File 3 ──┘
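The Map → Shuffle & Sort → Reduce flow above can be sketched in plain, single-process Python ( a real MapReduce job spreads these phases across many Nodes ):

```python
from collections import defaultdict

def map_phase(files):
    """Map: emit a (word, 1) pair for every word in every input file."""
    for text in files:
        for word in text.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    """Shuffle & Sort: group all emitted counts by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reduce: sum the grouped counts to get each word's total."""
    return {word: sum(counts) for word, counts in groups.items()}

files = ["my apple is red", "my grape is green", "red is my favorite"]
result = reduce_phase(shuffle_and_sort(map_phase(files)))
# result["my"] == 3, result["is"] == 3, result["red"] == 2
```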

WordCount can also be Programmed to link ‘KEY’ Words to the best-matching URL after indexing a Web Crawl from the Internet – optimizing WordCount to produce the best options for someone from the Search Query that they entered

( This is what MapReduce was first used for when it was originally created by Google )
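The keyword-to-URL idea above is what is usually called an inverted index ; a minimal sketch under the same map/group shape ( the page texts and URLs are made up for illustration ):

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to the sorted list of URLs whose page text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():   # map: emit (word, url) pairs
            index[word].add(url)            # group: collect URLs per word
    return {word: sorted(urls) for word, urls in index.items()}

pages = {
    "example.com/a": "big data ecosystem",
    "example.com/b": "big graphs",
}
index = build_inverted_index(pages)
# A query for "big" now returns both URLs; "graphs" returns only page b.
```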

WordCount on MapReduce is best used for Batch Tasks you need done

Because MapReduce must analyze every Node to pair up Matches before producing its Output, it is not fast enough for Real-Time Data that comes in at a much faster Velocity and in a more diverse Variety

8/20 – 20 Days of Data

When to Reconsider Hadoop (HDFS) :

Determine the type of Data you will be collecting and how it will be used to solve the problem you are trying to answer / get insight on

[ Reconsider HDFS ] ; If you are working with small DataSets , or Very complicated Algorithms

[ When to consider HDFS ] ;

  • Advanced Analytical Queries ( Search-Engine Indexing – Google and MapReduce )
  • Latency-Sensitive Tasks ( YARN Managers )
  • Security of Sensitive Data ( Fault Tolerance )

Cloud Computing – Business Value / Minimal Cost :

Cloud Computing = IT Infrastructure (Storage) & Applications (SaaS) ‘Rented-Out’ to a business or person who pays a company (Salesforce , Oracle) for access to Cloud Computing Services

Cloud Computing is ON-DEMAND Computing over the Internet – [ Cloud Computing is the more popular method used by businesses today because of its convenience and the freedom of no commitment required to use the technology ]

9/20 – 20 Days of Data

Cloud Computing – used as a service it is much more popular since it requires little to no equipment and none of the large initial capital you would need to buy and host all the Servers yourself in-house

There are many different Cloud Computing companies available to choose from and the number is only growing as time goes on and more businesses adopt and integrate the Cloud Computing model into their plan.

( The most obvious benefit of working with a Cloud Computing company is the convenience of leaving the more tedious required jobs to the Cloud Provider , such as keeping software and security updated to ensure your data is safe … [ It’s Never Safe ] )

Cloud Computing Models : 3 Main Models

How technical your Data Team is will determine which Model of Cloud Computing is best for you ; some come as full Packages supplying everything you may need for a project , while others offer just the Infrastructure or Platform alone if that is more what you are looking for

(Security is Most important when dealing with data)

  • IaaS – Infrastructure as a Service ( Bare-Minimum Hardware )
  • PaaS – Platform as a Service ( Entire Computing Platform ; includes the OS )
  • SaaS – Software as a Service ( Provides Software and Hardware to give you everything you need )

10/20 – 20 Days of Data

Hadoop Pre-Built Images :

Allows you to quickly generate Value from Data by using Hadoop Pipeline

– Assembling your own Software Stack can be Messy and Challenging to do ; Using Pre-Built Software Images can help solve this problem by providing you with the full stack of tools you need

‘Virtualization Software’ enables you to run Pre-Built Images – ( VMware / VirtualBox )

Companies that provide Pre-Built Images – Cloudera / Hortonworks Sandbox


[ Reading ] – Downloading & Installing Cloudera VM Hadoop

Read over Hadoop Download Page ;

– Cloudera VM onto VirtualBox …. ( Must buy a switch adapter to download 4 GB and larger files to be input into Cloudera [HDFS] )

Copy Data into Cloudera and Run WordCount(?)

. . . Intro to Big Data Course . . . Finished . . .

[12/20] [13/20] – 20 Days of Data

[12/20] – After Work ; Download a version of Cloudera for VirtualBox . . .

[13/20] – Start Coursera Course #2 ; Data Specialization

  • In this course you will learn how to Retrieve and Process Big Data to begin Data Analytics

Welcome to Modeling and Management :

Data Management (What type of Services / Platforms are needed to achieve the objective)


Data Modeling (Understand Type of Data)


Data Streaming (Real-Time Ingestion / Processing)

Must Ask Questions for Data Management :

– {Refer to NTS} –

Data Ingestion :

Data Ingestion = The process of inputting collected Data into Data Systems (Cloudera)

^ Goal of Ingestion is to completely automate the Process – instead of taking your data and manually inputting it into a platform to be processed … Big Data is usually coming in at such a large Volume/Velocity that you want to automate this step from the point of retrieval

[14/20] [15/20] – 20 Days of Data

[14/20] –

Ingestion Policy ; (Ex.) What to do if Data is Bad (Quality) & Not Used (Store if Sensitive , or Discard)

Data Storage :

How much Data is Needed to be Stored ?

Should storage be connected to the Host Computer , or attached to a Network (Cloud) that connects computers in a cluster ?

  • Memory Hierarchy ‘SSD’ – – – ‘Cache’ [ Bottom – – – Top]

Data Quality :

Better Quality ( Who, What, How, Why ) + Better Analytics

Guide – Industry Report on Big Data Qualities (Gartner)

Data Operation :

Operations that work on a single Data Item at a time (MapReduce) – Slow


Operations that work on a collection of Data Items at the same time (Spark, Flink, Storm)

Scalability & Security :

Scaling Up ( Vertical ) – vs. – Scaling Out ( Horizontal )

Up = Add more Processors and RAM ( Difficult / Expensive )

Out = Add more, less-powerful machines over a (Cloud) Network ( Slower / Easier to Implement & Maintain )

– Most people use Parallel Computing . . .

[15/20] –

Security – – – Is a must when dealing with Sensitive Data .

[ Encryption to Decryption is safest but is most costly ]

More Machines (from Scaling) makes securing everything more challenging

16/20 – 20 Days of Data

* Management Challenges – – – – > Predicting how your data might change / Scale and deciding what methods you will use to accommodate these changes

Data Models ( Variety) :

– Selection , Projection , Union , Join

– The characteristics of the data help determine how to properly Analyze it ; [i.e.] What features does the data I’m working with have ? – who / what are you working with

Data Models ( Structured ) :

Structure = shows the pattern of organization in Data Files

ex. of Unstructured – Most media files ; Jpg / Mp3 / Avi

17/20 – 20 Days of Data

* Hands on Cloudera NTS :

[ Cloudera is a Distribution of Hadoop ; other distribution examples are Hortonworks / MapR . . .

( An easy way to understand this in context is the comparison to Linux Systems …. GNU/Linux is the Base version of Linux , but as time went on many features and changes have been built on top of this GNU/Linux Framework. Each time new Features are built on top of it, a new Distribution comes out i.e. : Kali, Ubuntu, Whonix, .. Each Version has its own purpose and use case . ) – – – ]

Data Models :

Operations that can be performed on a DataSet [ Methods to manipulate the Data; to find Insight ]

Different DataSets have their own Structures ; so operations will vary depending on the type of Data you are working with . . . * With that being said there are a few operations that are Universally used on ALL DataSets —>


“Subsetting” – Extracting part of a collection in the DataSet … Other Names for “Subsetting” – Filtering / Selection

18/20 – 20 Days of Data

….. [ Data Models (Operations) – Finished ]

  • “Substructure Extraction” —-> Extracting a particular part of a structure that is identified in the DataSet (Other Names: Projection)
  • “Union”—-> (Combination) Given TWO DataSets, create a single new Set with the elements of both DataSets, eliminating Duplicates [ Both DataSets you are trying to bring together must have the same matching elements / schema ]
  • “Join”—-> Where both DataSets have different Data Contents, but similar elements you want to bring together (Matching Records)
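The four operations above can be sketched on small lists of records ( the employee records and field names are made up for illustration ):

```python
# Records as dicts; each operation returns a new collection.

emps_a = [{"id": 1, "name": "Ana", "dept": "Sales"},
          {"id": 2, "name": "Bo",  "dept": "IT"}]
emps_b = [{"id": 2, "name": "Bo",  "dept": "IT"},
          {"id": 3, "name": "Cy",  "dept": "HR"}]
salaries = [{"id": 1, "salary": 50000}, {"id": 2, "salary": 60000}]

# Subsetting (Selection / Filtering): keep only records matching a condition.
it_only = [e for e in emps_a if e["dept"] == "IT"]

# Substructure Extraction (Projection): keep only certain fields.
names = [{"name": e["name"]} for e in emps_a]

# Union with Duplicate Elimination: both inputs share the same fields.
union = [dict(t) for t in {tuple(e.items()) for e in emps_a + emps_b}]

# Join: match records from two different datasets on a shared element (id).
joined = [{**e, **s} for e in emps_a for s in salaries if e["id"] == s["id"]]
```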

Data Models (Contraints) :

– Constraints help determine what is True or False in the computations —-> This specifies the semantics (Meaning) of the Data – – Ex. : The week only has 7 Days no matter what, and this will not be known by the computer system unless you pass that knowledge onto it

– Different Models you work with have their own type of constraints you must work with

19/20 – 20 Days of Data

Another Constraint Ex. —> Telling the system the number of titles for a movie is limited to One

Types of Constraints :

  • Value Constraint ; “Age is never Negative”
  • Uniqueness Constraint ; “Movie Titles = 1”
  • Cardinality Constraint ; restricts how many values are allowed (checks the count of values)
  • + You can add or take away your own Value Constraint
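A minimal sketch of checking the first two constraint types above ( the record layout is an illustrative assumption ):

```python
def check_value_constraint(records):
    """Value constraint: age is never negative."""
    return all(r["age"] >= 0 for r in records)

def check_uniqueness_constraint(records, field="title"):
    """Uniqueness constraint: no two records share the same title."""
    values = [r[field] for r in records]
    return len(values) == len(set(values))

movies = [{"title": "Up", "age": 13}, {"title": "Alien", "age": 18}]
# Both constraints hold for this data; adding a second "Up" record
# or a negative age would make the corresponding check return False.
```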

Relational Data Model :

  • Relational Data is Simple / Most Popular (Ex. Used in SQL / Teradata / Oracle / … )

Primary Data Structure for Relational Data is TABLES ( Pic. )

  • Header is the “Schema” of the Table . (The Schema of the Table shows the Constraints)
  • You can “Join” Relational Data if the Elements match (Ex. Both Tables have employee I.D. with different info. attached ; Salary / Personal Info. – D.O.B.)

[ Most People start with CSV Files ]
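The employee-I.D. Join above can be sketched with Python’s built-in csv module, starting from CSV text as the note suggests ( the two small tables and their columns are made up for illustration ):

```python
import csv
import io

# Two "tables" as CSV text; the header row is the Schema.
people_csv = "emp_id,dob\n1,1990-01-01\n2,1985-05-20\n"
salary_csv = "emp_id,salary\n1,50000\n2,60000\n"

people = list(csv.DictReader(io.StringIO(people_csv)))
salary = list(csv.DictReader(io.StringIO(salary_csv)))

# Join the two tables on the shared emp_id element.
by_id = {row["emp_id"]: row for row in salary}
joined = [{**p, **by_id[p["emp_id"]]} for p in people if p["emp_id"] in by_id]
# joined[0] == {"emp_id": "1", "dob": "1990-01-01", "salary": "50000"}
```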

20/20 – 20 Days of Data . . .

SemiStructured Data :

SemiStructured Data usually = ‘ Tree Structured Data ‘

Tree Structure Navigation is important for formats like XML / JSON

– Blocks in the code are nested in larger blocks [ Header Begin and End Blocks / Body Begin and End Blocks ]

XML ( Extensible Mark-Up Language )

  • Elements can have ‘Attributes’ attached to them (Ex. a ‘sample’ attribute)
  • XML allows querying of both Header (Schema) and Data

JSON = JavaScript Object Notation (FB / TWTR use this)

  • Has a similar nested structure that holds lists inside of lists , finally ending at the Atomic Property Value

– A way to Organize / Model these SemiStructured DataSets is as Trees

  • The top Root Element of the Data will also be the Top / Root of the Tree
  • Text Values are Always Atomic so they will Always be leaves on the Tree
  • Modeling as Tree allows better Navigational Access to Data (getChildren / getParent / getSibling / Text Querying)

Queries Need Tree Navigation
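The tree navigation above can be sketched over a nested JSON document ( the document contents are made up, and getChildren / getParent here are toy helpers mirroring the note’s terms, not a standard API ):

```python
import json

doc = json.loads('{"book": {"title": "Big Data", "authors": ["Ana", "Bo"]}}')

def get_children(node):
    """Children of a dict are its values; of a list, its items; text is a leaf."""
    if isinstance(node, dict):
        return list(node.values())
    if isinstance(node, list):
        return list(node)
    return []  # atomic text values are always leaves

def find_parent(root, target):
    """Walk the tree from the root looking for the node whose child is target."""
    for child in get_children(root):
        if child is target:
            return root
        found = find_parent(child, target)
        if found is not None:
            return found
    return None

authors = doc["book"]["authors"]
# get_children(authors) -> ["Ana", "Bo"]; find_parent(doc, authors) -> doc["book"]
```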