# NPTEL BIG DATA COMPUTING Assignment 2021

One of the main factors behind the rise of technology as it is today has been the rapid growth of Information and Communication Technology. With most companies' recent ventures centered on software development, teams are needed that can handle the business side (customer service, communication, marketing) while also working directly on actual code.

NPTEL Big Data Computing is a MOOC offered by IIT Patna on the NPTEL platform. The course aims to cover the essential topics of big data computing so that participants can improve their skills to meet the current demands of the IT industry and solve problems in their own fields of study. The course was developed by Rajiv Misra, who works in the Department of Computer Science and Engineering at the Indian Institute of Technology Patna, India. He obtained his Ph.D. from IIT Kharagpur.

2. Requirements/Prerequisites: NIL
3. INDUSTRY SUPPORT: All companies or industries.

CRITERIA TO GET A CERTIFICATE

Average assignment score = 25% of the average of the best 6 assignments out of the total 8 assignments given in the course.

Exam score = 75% of the proctored certification exam score out of 100.

Final score = Average assignment score + Exam score.

Students will be eligible for a CERTIFICATE ONLY IF AVERAGE ASSIGNMENT SCORE >= 10/25 AND EXAM SCORE >= 30/75. If either of the two criteria is not met, the student will not get the certificate even if the Final score >= 40/100.
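The scoring rules above can be sketched as a small calculation. This is only an illustration, assuming each assignment and the exam are scored out of 100; the function name `final_score` is hypothetical and not part of any NPTEL tool.

```python
def final_score(assignment_scores, exam_score):
    """Return (assignment_component, exam_component, final, eligible).

    Assumes every assignment score and the exam score are out of 100.
    """
    # Keep only the best 6 of the assignment scores.
    best_six = sorted(assignment_scores, reverse=True)[:6]
    # 25% of the average of the best six assignments (max 25).
    assignment_component = 0.25 * sum(best_six) / len(best_six)
    # 75% of the proctored exam score (max 75).
    exam_component = 0.75 * exam_score
    final = assignment_component + exam_component
    # Both thresholds must be met; the final score alone is not enough.
    eligible = assignment_component >= 10 and exam_component >= 30
    return assignment_component, exam_component, final, eligible


# Perfect assignments but a 30/100 exam: final is above 40, yet not eligible.
print(final_score([100] * 8, 30))  # (25.0, 22.5, 47.5, False)
```

The second threshold is what makes the last case fail: 75% of 30 is 22.5, which is below the required 30/75.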

## NPTEL BIG DATA COMPUTING ASSIGNMENT WEEK 7 ANSWERS:-

Q1. Suppose you are using a bagging-based algorithm, say a Random Forest, in model building. Which of the following can be true?

Q2. To apply bagging to regression trees, which of the following is/are true?

1. We build N regression trees with N bootstrap samples
2. We take the average of the N regression trees
3. Each tree has a high variance with low bias
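As a rough illustration of the procedure described in these options, the sketch below bags fully grown regression trees on a single numeric feature using only the standard library. The helper names (`fit_tree`, `bagged_trees`, `predict_bagged`) and the toy data are assumptions made for this example, not course material.

```python
import random
from statistics import mean


def sse(points):
    """Sum of squared errors of the y values around their mean."""
    m = mean(y for _, y in points)
    return sum((y - m) ** 2 for _, y in points)


def fit_tree(points):
    """Fit a fully grown 1-D regression tree on a list of (x, y) pairs."""
    xs = sorted({x for x, _ in points})
    if len(xs) == 1:
        return mean(y for _, y in points)  # leaf: predict the mean target
    best = None
    for lo in xs[:-1]:  # candidate split: x <= lo goes left, x > lo goes right
        left = [p for p in points if p[0] <= lo]
        right = [p for p in points if p[0] > lo]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, lo, left, right)
    _, threshold, left, right = best
    return (threshold, fit_tree(left), fit_tree(right))


def predict_tree(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def bagged_trees(points, n_trees, seed=0):
    """Train one tree per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    return [fit_tree(rng.choices(points, k=len(points))) for _ in range(n_trees)]


def predict_bagged(trees, x):
    """Bagging for regression: average the predictions of all trees."""
    return mean(predict_tree(t, x) for t in trees)


data = [(x, 2 * x) for x in range(10)]
trees = bagged_trees(data, n_trees=25)
print(predict_bagged(trees, 4))
```

Each unpruned tree fits its own bootstrap sample closely (low bias, high variance); averaging many of them is what reduces the variance.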

Q3. In which of the following scenarios is gain ratio preferred over Information Gain?
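To make the two measures concrete, here is a minimal sketch that computes both for a categorical attribute. It uses the usual definition of gain ratio as information gain divided by split information (the entropy of the attribute's own value distribution); the function names and the four-row toy data are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(v, []).append(c)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalised by the split information."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info else 0.0


labels = ["yes", "yes", "no", "no"]
row_id = [1, 2, 3, 4]            # unique per row, like an identifier
colour = ["a", "a", "b", "b"]    # a genuine two-valued attribute

print(information_gain(row_id, labels), gain_ratio(row_id, labels))  # 1.0 0.5
print(information_gain(colour, labels), gain_ratio(colour, labels))  # 1.0 1.0
```

Both attributes have the same information gain here, but the identifier-like attribute with many distinct values is penalised by the gain ratio.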

Q4. Which of the following is/are true about Random Forest and Gradient Boosting ensemble methods?

1. Both methods can be used for classification tasks
2. Random Forest is used for classification whereas Gradient Boosting is used for regression tasks
3. Random Forest is used for regression whereas Gradient Boosting is used for classification tasks
4. Both methods can be used for regression tasks

Q5. Given the attribute table shown below, which stores the basic information of attribute a, including the row identifier of each instance (row_id), the values of the attribute (a), and the class labels of the instances (c).

Q6. Bagging provides an averaging over a set of possible datasets, removing noisy and non-stable parts of models.

Q7. Hundreds of trees can be aggregated to form a Random forest model. Which of the following is true about any individual tree in Random Forest?

Q8. Boosting any algorithm takes into consideration the weak learners. Which of the following is the main reason behind using weak learners?

Reason I: To prevent overfitting

Reason II: To prevent underfitting

## NPTEL BIG DATA COMPUTING ASSIGNMENT WEEK 6 ANSWERS:-

Q1. Which of the following is required by K-means clustering?

Q2. Identify the correct statement in the context of the regression model of machine learning.

Q3. Which of the following tasks can be best solved using clustering?

Q4. Identify the correct method for choosing the value of 'k' in the k-means algorithm.

Q5. Identify the correct statement(s) in context of overfitting in decision trees:

Statement I: The idea of pre-pruning is to stop tree induction before a fully grown tree that perfectly fits the training data is built.

Statement II: The idea of Post-pruning is to grow a tree to its maximum size and then remove the nodes using a top-bottom approach.

Q6. Which of the following options is/are true for K-fold cross-validation?

1. An increase in K will result in more time being required to cross-validate the result.
2. Higher values of K will result in higher confidence in the cross-validation result as compared to lower values of K.
3. If K=N, then it is called leave-one-out cross-validation, where N is the number of observations.
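The mechanics behind these options can be sketched with a small index splitter. This is a simplified illustration that splits without shuffling; `k_fold_indices` is a hypothetical helper, not a library function.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n))
        yield train, validation
        start = stop


# K = 3 on 6 observations: three rounds, each validating on 2 observations.
for train, validation in k_fold_indices(6, 3):
    print(train, validation)

# K = N: every round validates on exactly one observation (leave-one-out).
print([validation for _, validation in k_fold_indices(4, 4)])  # [[0], [1], [2], [3]]
```

The model is trained once per fold, so the number of training runs (and the time required) grows with K.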

Q7. Imagine you are working on a project which is a binary classification problem. You trained a model on the training dataset and got the confusion matrix below on the validation dataset.

Q8. Identify the correct statement(s) in context of machine learning approaches:

Statement I: In supervised approaches, the target that the model is predicting is unknown or unavailable. This means that you have unlabeled data.

Statement II: In unsupervised approaches the target, which is what the model is predicting, is provided. This is referred to as having labeled data because the target is labeled for every sample that you have in your data set.

## NPTEL BIG DATA COMPUTING ASSIGNMENT WEEK 5 ANSWERS:-

Q1. Columns in HBase are organized into ________

Q2. HBase is a distributed ________ database built on top of the Hadoop file system.

Q3. A small chunk of data residing in one machine, which is part of a cluster of machines holding one HBase table, is known as ________

Q4. In HBase, ________ is a combination of row, column family, and column qualifier, and contains a value and a timestamp.
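The combination described in Q4 can be pictured with a toy in-memory model. This is not the HBase client API; `CellKey` and `ToyTable` are hypothetical names used only to show how a value is addressed by row, column family, and column qualifier, with versions distinguished by timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellKey:
    """Coordinates that address one cell (hypothetical model, not HBase code)."""
    row: str
    family: str
    qualifier: str


class ToyTable:
    def __init__(self):
        # Maps CellKey -> {timestamp: value}, so one cell can hold versions.
        self._cells = {}

    def put(self, row, family, qualifier, value, timestamp):
        key = CellKey(row, family, qualifier)
        self._cells.setdefault(key, {})[timestamp] = value

    def get(self, row, family, qualifier):
        """Return (timestamp, value) of the newest version, or None."""
        versions = self._cells.get(CellKey(row, family, qualifier))
        if not versions:
            return None
        latest = max(versions)
        return latest, versions[latest]


table = ToyTable()
table.put("row1", "info", "name", "Ada", timestamp=1)
table.put("row1", "info", "name", "Ada L.", timestamp=2)
print(table.get("row1", "info", "name"))  # (2, 'Ada L.')
```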

Q5. HBase architecture has 3 main components:

Q6. HBase stores data in ________

Q7. Kafka is run as a cluster comprising one or more servers, each of which is called ________

Q8. Statement 1: Batch processing provides the ability to process and analyze data at rest (stored data).

Statement 2: Stream processing provides the ability to ingest, process, and analyze data in motion in real or near-real time.

Q9. ________ is a central hub to transport and store event streams in real time.

Q10. What are the parameters defined to specify a window operation?
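In Spark Streaming, a window operation is specified by a window length and a sliding interval. The sketch below mimics that idea in plain Python over a list of per-batch counts; `windowed_counts` is a hypothetical helper, and both parameters are expressed in numbers of batches rather than in seconds.

```python
def windowed_counts(batches, window_length, slide_interval):
    """Sum per-batch counts over sliding windows.

    window_length: how many of the most recent batches each window covers.
    slide_interval: how many batches pass between consecutive windows.
    """
    results = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        results.append(sum(batches[end - window_length:end]))
    return results


# Windows of 3 batches, evaluated every 2 batches.
print(windowed_counts([1, 2, 3, 4, 5, 6], window_length=3, slide_interval=2))  # [6, 12]
```

Because the slide is shorter than the window, consecutive windows overlap: batch 3 is counted in both results.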

Q11. Consider the following dataset Customers:

Q12. ________ is a Java library to process event streams live as they occur.

## NPTEL BIG DATA COMPUTING ASSIGNMENT WEEK 4 ANSWERS:-

Q1. Identify the correct choices for the given scenarios:

P: The system allows operations all the time, and operations return quickly

Q: All nodes see same data at any time, or reads return latest written value by any client

Q2. Cassandra uses a protocol called ________ to discover location and state information about the other nodes participating in a Cassandra cluster.

Q3. In Cassandra, ________ is used to specify data centers and the number of replicas to place within each data center. It attempts to place replicas on distinct racks to guard against node failure and to ensure data availability.

Q4. A snitch determines which data centers and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests are routed efficiently, and allow Cassandra to distribute replicas by grouping machines into data centers and racks.

Q5. Statement 1: In Cassandra, during a write operation, when hinted handoff is enabled and any replica is down, the coordinator writes to all other replicas and keeps the write locally until the down replica comes back up.

Statement 2: In Cassandra, Ec2Snitch is an important snitch for deployments; it is a simple snitch for Amazon EC2 deployments where all nodes are in a single region. In Ec2Snitch, the region name refers to the data center and the availability zone refers to the rack in a cluster.

Q6. What is eventual consistency?

Q7. Statement 1: When two processes are competing with each other causing data corruption, it is called deadlock.

Statement 2: When two processes are waiting for each other directly or indirectly, it is called race condition.

Q8. ZooKeeper allows distributed processes to coordinate with each other through registers, known as ________

Q9. In ZooKeeper, when a ________ is triggered, the client receives a packet saying that the znode has changed.

Q10. Consider the Table temperature_details in Keyspace “day3” with schema as follows:

## NPTEL BIG DATA COMPUTING ASSIGNMENT WEEK 3 ANSWERS:-

Q1. In Spark, a ________ is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

Q2. Given the following definition about the join transformation in Apache Spark:

Q3. Consider the following statements in the context of Spark:

Statement 1: Spark improves efficiency through in-memory computing primitives and general computation graphs.

Statement 2: Spark improves usability through high-level APIs in Java, Scala, and Python, and also provides an interactive shell.

Q4. Resilient Distributed Datasets (RDDs) are fault-tolerant and immutable.

Q5. Which of the following is not a NoSQL database?

Q6. Apache Spark can potentially run batch-processing programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.

Q7. ______________ leverages Spark Core fast scheduling capability to perform streaming analytics.

Q8. ____________________ is a distributed graph processing framework on top of Spark.

Q9. Point out the incorrect statement in the context of Cassandra:

Q10. Consider the following statements:

Statement 1: Scale out means grow your cluster capacity by replacing with more powerful machines.

Statement 2: Scale up means incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf).

## NPTEL BIG DATA COMPUTING ASSIGNMENT WEEK 1 ANSWERS:-

Q1. ________________ is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

Q2. Which of the following tools is designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases?

Q3. ________ is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Q4. ________ refers to the connectedness of big data.

Q5. Consider the following statements:

Statement 1: Volatility refers to the data velocity relative to the timescale of the event being studied.

Statement 2: Viscosity refers to the rate of data loss and the stable lifetime of data.

Q6. ________ refers to the biases, noise, and abnormality in data, and the trustworthiness of data.

Note: We never promote copying, and we do not claim that these answers are 100% correct. They are based solely on our own knowledge and are posted only as a reference for students, so we urge you to do your assignments on your own.

Q7. ________ brings scalable parallel database technology to Hadoop and allows users to submit low-latency queries to the data stored within HDFS or HBase without requiring a ton of data movement and manipulation.

Q8. NoSQL databases store unstructured data with no particular schema.