Here are some of the most frequently asked data science interview questions, with answers to help you prepare for and crack your next data science interview.

**1) What is Data Science?**

Data Science is the process of collecting, cleaning, visualizing, and analyzing raw data to make future predictions with machine learning models. It is a combination of statistics, data analysis, and machine learning, bringing together algorithms, tools, and machine learning techniques to find patterns in raw data.

**2) Which Python libraries are used for Data Science?**

Here are five essential Python libraries for Data Science:

- SciPy
- pandas
- Matplotlib
- NumPy
- Seaborn

**3) Which do you prefer between Python and R for text analysis?**

Python is better suited for text analysis, as it has pandas and other libraries that provide rich data structures and high-performance data analysis tools.

R is not as suitable for text analysis; it is stronger in statistical modeling and classical machine learning.

**4) In which cases is resampling done?**

Resampling is done in cases such as:

- Estimating the accuracy of a sample statistic from subsets of the data (e.g. bootstrapping).
- Exchanging labels on data points when performing significance tests (e.g. permutation tests).
- Validating models on random subsets of the data (e.g. cross-validation).
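The first case can be sketched with a minimal bootstrap in NumPy; the sample below is synthetic and purely illustrative:

```python
import numpy as np

# Hypothetical sample; bootstrapping estimates the accuracy (standard
# error) of a statistic computed from this single sample.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=100)

boot_means = []
for _ in range(1000):
    # Resample with replacement from the original sample
    resampled = rng.choice(sample, size=sample.size, replace=True)
    boot_means.append(resampled.mean())

# Spread of the bootstrap means approximates the standard error of the mean
standard_error = np.std(boot_means)
```

The same resample-and-recompute loop underlies permutation tests and cross-validation as well.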

**5) What is Bias?**

Bias is an underfitting problem, where error occurs in the model because the machine learning algorithm is too simplified.

**6) What type of bias can occur during sampling?**

- Selection Bias
- Undercoverage Bias
- Survivorship Bias

**7) Why is Data Cleaning so important for analysis?**

Cleaning data from multiple sources transforms it into a consistent format that analysts and data scientists can actually work with.

Data cleaning also increases model accuracy.

It greatly reduces the time needed for analysis, since data collected from different sources is rarely clean to begin with.

**8) What is Power Analysis?**

Power Analysis is a major part of experimental design. It is used to relate effect size, sample size, significance level, and statistical power: given any three of these parameters, power analysis estimates the fourth. It is therefore a core tool for the design and analysis of experiments.
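A minimal simulation-based sketch of this idea, using only NumPy: given an assumed effect size, sample size, and significance level, we estimate the fourth quantity (power) by simulation. The effect size and sample size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
effect_size, n, z_crit = 0.5, 64, 1.96  # z critical value for alpha = 0.05

rejections, trials = 0, 2000
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)            # control group
    b = rng.normal(effect_size, 1.0, n)    # treatment group with assumed effect
    # Two-sample z statistic
    z = (b.mean() - a.mean()) / np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    if abs(z) > z_crit:
        rejections += 1

# Fraction of simulated experiments that detect the effect
power = rejections / trials
```

With these settings the estimated power comes out near the conventional 0.8 target; libraries such as statsmodels provide analytical versions of this calculation.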

**9) What is Collaborative Filtering?**

Collaborative Filtering is the process of finding patterns by combining collaborative viewpoints, multiple data sources, and different agents; most commonly, it means recommending items to a user based on the preferences of similar users.
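A tiny user-based sketch of the idea, with a made-up rating matrix (rows are users, columns are items; 0 means unrated):

```python
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 2],
    [1, 0, 5, 4],
])

def cosine(u, v):
    # Cosine similarity between two users' rating vectors
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# User 0 is far more similar to user 1 than to user 2, so user 1's
# ratings would inform recommendations for items user 0 has not rated.
sims = [cosine(ratings[0], ratings[i]) for i in (1, 2)]
```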

**10) What is the difference between the expected value and the mean value?**

Mean value and expected value are similar concepts, but they are used in different contexts: the mean is used when describing a probability distribution, while the expected value is used in the context of a random variable.
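The relationship can be illustrated with a fair die: the expected value is computed from the distribution by definition, and the sample mean of simulated rolls converges to it.

```python
import numpy as np

# Expected value of a discrete random variable (a fair six-sided die)
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
expected_value = np.sum(values * probs)  # sum of value * probability = 3.5

# Sample mean of simulated rolls approaches the expected value
rng = np.random.default_rng(2)
rolls = rng.choice(values, size=10_000, p=probs)
sample_mean = rolls.mean()
```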

**11) What is Normal Distribution?**

Normal Distribution is a distribution of data with no bias toward the left or the right: the data is distributed symmetrically around the center point, producing the familiar bell-shaped curve. In normally distributed data, the mean, median, and mode all lie at the center of the data.
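This symmetry is easy to check empirically; on a large synthetic normal sample (the location and scale below are arbitrary), the mean and median land at essentially the same central value:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=100, scale=15, size=100_000)

# For a symmetric bell curve, both estimates sit at the center (100 here)
mean, median = data.mean(), np.median(data)
```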

**12) What are Correlation and Covariance?**

Correlation:- Correlation is the standard technique for measuring and estimating the quantitative relationship between two variables. It measures how strongly two variables are related, on a scale from -1 to 1.

Covariance:- Covariance is a measure of the extent to which two random variables vary together. It describes the systematic relationship between a pair of random variables, wherein a change in one variable is accompanied by a corresponding change in the other.
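Both quantities can be computed directly in NumPy; the two toy variables below are constructed to move together (roughly y = 2x):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

covariance = np.cov(x, y)[0, 1]        # scale-dependent joint variability
correlation = np.corrcoef(x, y)[0, 1]  # unitless strength, always in [-1, 1]
```

Note that rescaling x or y changes the covariance but leaves the correlation untouched, which is why correlation is the preferred measure of strength.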

**13) What is Point Estimates and Confidence Interval?**

A Point Estimate is a particular single value used as an estimate of a population parameter. The maximum likelihood estimator is a common method for deriving point estimates of a population parameter.

A Confidence Interval gives a range of values likely to contain the population parameter. The probability that the interval contains the true parameter is called the confidence level, or confidence coefficient.
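A minimal sketch of both ideas, using the sample mean as the point estimate and the normal approximation (z = 1.96 for 95% confidence) for the interval; the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.normal(loc=70, scale=8, size=200)

point_estimate = sample.mean()                             # single value
standard_error = sample.std(ddof=1) / np.sqrt(sample.size)

# 95% confidence interval around the point estimate
ci_low = point_estimate - 1.96 * standard_error
ci_high = point_estimate + 1.96 * standard_error
```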

**14) What is the difference between the Validation set and Test set?**

The Validation Set is held out from the training data and is used during training to tune the model and check for overfitting.

The Test Set is used only at the end, to evaluate and check the performance of the final model.

**15) What is the Confusion Matrix?**

A Confusion Matrix is the 2×2 matrix produced by a binary classifier. It is used to derive error rate, accuracy, specificity, sensitivity, precision, and recall. A confusion matrix has 4 outcomes:

- a) True Positive (TP) – correct positive prediction
- b) False Positive (FP) – incorrect positive prediction
- c) True Negative (TN) – correct negative prediction
- d) False Negative (FN) – incorrect negative prediction
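The four cells and the derived metrics can be computed by hand on toy labels and predictions:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # correct positive predictions
fp = np.sum((y_true == 0) & (y_pred == 1))  # incorrect positive predictions
tn = np.sum((y_true == 0) & (y_pred == 0))  # correct negative predictions
fn = np.sum((y_true == 1) & (y_pred == 0))  # incorrect negative predictions

accuracy = (tp + tn) / y_true.size
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
```

In practice scikit-learn's `confusion_matrix` does the counting, but the arithmetic is exactly this.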

**16) What are OverFitting and UnderFitting?**

Overfitting is an indication that the model is fitting noise or random error instead of the underlying relationship. Overfitting occurs when a model is too complex, for example when it has too many parameters relative to the amount of data. An overfitted model has poor predictive performance because it overreacts to minor fluctuations in the data.

An underfitted model is one that has failed to capture the underlying trend of the data during training. For example, underfitting may occur when a linear model is trained on non-linear data. Such a model also has poor predictive performance.
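Both failure modes can be seen by fitting polynomials of different degrees to noisy non-linear data; the sine curve and noise level below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(-3, 3, 30)
y = np.sin(x) + rng.normal(0, 0.2, x.size)  # non-linear signal plus noise

def train_mse(degree):
    # Fit a polynomial of the given degree and score it on the training data
    coefs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coefs, x) - y) ** 2)

underfit_error = train_mse(1)  # a straight line misses the curve entirely
overfit_error = train_mse(9)   # a high-degree fit chases the noise too
```

The high-degree model's training error is far lower, but that advantage is exactly what fails to transfer to unseen data.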

**17) How to Solve Overfitting and Underfitting problem in a model?**

One way to address overfitting and underfitting is to resample the dataset and estimate model accuracy with validation techniques such as K-fold cross-validation, keeping a separate validation set to evaluate the model.
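K-fold cross-validation can be sketched in a few lines; the "model" here is deliberately trivial (predict the training-fold mean), and the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(10.0, 2.0, 100)

k = 5
# Shuffle, then split into k folds; each fold takes a turn as validation set
folds = np.array_split(rng.permutation(y), k)

fold_errors = []
for i in range(k):
    val = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    prediction = train.mean()  # "fit" the trivial model on the training folds
    fold_errors.append(np.mean((val - prediction) ** 2))

cv_mse = np.mean(fold_errors)  # validation error averaged across all folds
```

Libraries such as scikit-learn provide `KFold` and `cross_val_score` to do this splitting and scoring for real models.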

**18) What are Regularization and its use?**

Regularization adds a penalty term to the loss function to tune the model and prevent overfitting. It is usually done by adding the L1 or L2 norm of the model's weights, multiplied by a tuning constant, to the loss. Minimizing this regularized loss on the training set discourages overly large weights.
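A sketch of the L2 (ridge) case in closed form: adding lambda times the identity to X^T X shrinks the fitted weights relative to ordinary least squares. The data and lambda value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, 50)

lam = 1.0  # regularization strength (the tuning constant)

# Ridge: minimize ||y - Xw||^2 + lam * ||w||^2, solved in closed form
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Unregularized least squares, for comparison
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

The ridge weights always have a norm no larger than the OLS weights; that shrinkage is the regularizing effect.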

**19) How does ROC curve work?**

The ROC curve is a visual representation of the relationship between the true positive rate and the false positive rate at various classification thresholds. It is used to examine the trade-off between sensitivity and the false positive rate.
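The mechanics can be sketched manually: sweep a decision threshold over predicted scores and record both rates at each step (toy labels and scores below):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.7, 0.5])

thresholds = np.linspace(0, 1, 11)
tpr, fpr = [], []
for t in thresholds:
    pred = scores >= t  # classify as positive above the threshold
    tpr.append(np.sum(pred & (y_true == 1)) / np.sum(y_true == 1))
    fpr.append(np.sum(pred & (y_true == 0)) / np.sum(y_true == 0))
# The (fpr, tpr) pairs trace the ROC curve from (1, 1) down to (0, 0)
```

In practice scikit-learn's `roc_curve` computes these points; plotting tpr against fpr gives the familiar curve.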

**20) What is Cross-Validation?**

Cross-Validation is a model validation technique used to evaluate how well a model generalizes and to compare different parameter settings for the model in order to achieve the best accuracy.
