Top Data Science Interview Questions & Answers


 

Data Science is an interdisciplinary field concerned with extracting information and insights from data through scientific methods, processes, and systems. The data can be in various forms, either structured or unstructured. As businesses increasingly use analytics to gain a competitive edge, the demand for good data science professionals has risen in the job market.

With more and more jobs in the data science arena, it has become a lucrative area to be in at the moment.

If you are looking to succeed in a data science role, here are some common Data Science interview questions and answers to help you crack the interview.

 

Q1. Which would you prefer – R or Python?

 

Ans. Both R and Python have their own pros and cons. R is mainly used when the data analysis task requires standalone computing or analysis on individual servers. Python is preferred when your data analysis tasks need to be integrated with web apps or when statistics code needs to be incorporated into a production database.

 

Q2. Explain what resampling methods are.

 

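Ans. Resampling methods repeatedly draw samples from the available data. They are used to estimate the precision of sample statistics, to perform significance tests by exchanging labels on data points, and to validate models on random subsets of the data.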

 

Q3. What are Recommender Systems?

 

Ans. Recommender systems are a subclass of information filtering systems that seek to predict the “rating” or “preference” that a user would give to an item.

 

Q4. What is an Eigenvalue and Eigenvector?

 

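Ans. Eigenvectors are used for understanding linear transformations. An eigenvector of a transformation is a direction that the transformation only stretches, compresses, or flips, without rotating it.

The eigenvalue can be referred to as the strength of the transformation in the direction of the eigenvector, that is, the factor by which the stretching or compression occurs.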
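
As a small illustration, here is a sketch in plain Python for a 2x2 matrix with real eigenvalues. The function names eigen_2x2 and transform, and the example matrix, are made up for this example. The matrix stretches the x direction by a factor of 2 and compresses the y direction by a factor of 0.5, so those two factors are its eigenvalues.

```python
import math

def eigen_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from the characteristic polynomial.

    Illustrative helper: handles only the real-eigenvalue case.
    """
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues are not handled in this sketch")
    root = math.sqrt(disc)
    return (trace + root) / 2, (trace - root) / 2

def transform(matrix, vec):
    """Apply a 2x2 matrix to a 2-element vector."""
    (a, b), (c, d) = matrix
    x, y = vec
    return (a * x + b * y, c * x + d * y)

# Example matrix (an assumption for illustration).
matrix = ((2.0, 0.0), (0.0, 0.5))
big, small = eigen_2x2(2.0, 0.0, 0.0, 0.5)

# Each eigenvector keeps its direction and is only scaled by its eigenvalue.
stretched = transform(matrix, (1.0, 0.0))   # scaled by the larger eigenvalue
compressed = transform(matrix, (0.0, 1.0))  # scaled by the smaller eigenvalue
print(big, small, stretched, compressed)
```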

 

Q5. What is selection bias, and how can you avoid it?

 

Ans. Selection bias is an experimental error that occurs when the participant pool, or the subsequent data, is not representative of the target population.

Selection biases cannot be overcome with statistical analysis of existing data alone, though Heckman correction may be used in special cases.

 

Q6. Which package is used to do data import in R and Python? How do you do data import in SAS?

 

Ans. In R, RODBC is used for RDBMS data and data.table (its fread function) for fast import.

In SAS, the DATA step is used to import data, and datasets are stored in the sas7bdat format.

In Python, the Pandas package and its functions read_csv and read_sql are used for reading data.

 

 

Q7. Which technique is used to predict categorical responses?

 

Ans. Classification techniques are used to predict categorical responses.

 

Q8. What is the difference between data science and big data?

 

Ans. Data science is a field applicable to data of any size. Big data refers to data sets so large that they cannot be analysed by traditional methods.

 

Q9. Name some of the prominent resampling methods in data science.

 

Ans. The Bootstrap, Permutation Tests, Cross-validation and Jackknife
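
To make one of these concrete, here is a minimal sketch of the bootstrap using only the Python standard library. It estimates the standard error of a sample mean by resampling the data with replacement. The function name, the sample values, the number of resamples, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

def bootstrap_standard_error(data, n_resamples=2000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        # Draw a resample of the same size as the original data.
        resample = [rng.choice(data) for _ in data]
        means.append(statistics.mean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)

sample = [2.1, 2.5, 3.0, 3.4, 3.9, 4.2, 4.8, 5.1, 5.5, 6.0]
se = bootstrap_standard_error(sample)
print(round(se, 3))
```

The result should be close to the textbook estimate, the sample standard deviation divided by the square root of the sample size.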

 

Q10. What is a Gaussian distribution and how is it used in data science?

 

Ans. The Gaussian distribution, commonly known as the bell curve, is a common probability distribution. In the interview, describe in detail how it can be used in data science.

 

Q11. What is an RDBMS? Name some examples of RDBMS.

 

Ans. Relational database management system (RDBMS) is a database management system that is based on a relational model.

Some examples of RDBMS are MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access.

 

Q12. What is a Z test, Chi Square test, F test and T test?

 

Ans. Z test is applied for large samples. Z = (Estimated Mean – Real Mean) / square root (Real Variance / n).

Chi Square test is a statistical method assessing the goodness of fit between a set of observed values and those expected theoretically.

F-test is used to compare the variances of two populations, using the ratio of the two sample variances. In ANOVA and regression, the statistic takes the form F = explained variance / unexplained variance.

T test is applied for small samples. T = (Estimated Mean – Real Mean) / square root (Estimated Variance / n).
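
As a worked example of the Z and T statistics, here is a short sketch using the Python standard library. Both divide the gap between the sample mean and the hypothesised mean by the square root of (variance / n). The Z version uses a known population variance, while the T version estimates the variance from the sample. The function names and all numbers are assumptions made up for illustration.

```python
import math
import statistics

def z_statistic(sample, population_mean, population_variance):
    """Z = (sample mean - population mean) / sqrt(population variance / n)."""
    n = len(sample)
    return (statistics.mean(sample) - population_mean) / math.sqrt(population_variance / n)

def t_statistic(sample, population_mean):
    """T = (sample mean - population mean) / sqrt(sample variance / n)."""
    n = len(sample)
    # statistics.variance uses the n - 1 denominator (sample variance).
    return (statistics.mean(sample) - population_mean) / math.sqrt(statistics.variance(sample) / n)

sample = [52, 48, 55, 51, 49, 53, 50, 54]
z = z_statistic(sample, population_mean=50, population_variance=4)
t = t_statistic(sample, population_mean=50)
print(round(z, 4), round(t, 4))
```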

 

Q13. What does P-value signify about the statistical data?

 

Ans. The p-value is the probability for a given statistical model that, when the null hypothesis is true, the statistical summary would be the same as or more extreme than the actual observed results.

When,

P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.

P-value < 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.

P-value = 0.05 is the marginal value, indicating it is possible to go either way.

 

Q14. Differentiate between univariate, bivariate and multivariate analysis.

 

Ans. Univariate analysis is the simplest form of statistical analysis where only one variable is involved.

Bivariate analysis is where two variables are analysed and in multivariate analysis, multiple variables are examined.

 

Q15. What is association analysis? Where is it used?

 

Ans. Association analysis is the task of uncovering relationships among data. It is used to understand how the data items are associated with each other.

 


 

Q16. What is power analysis?

 

Ans. Power analysis allows the determination of the sample size required to detect an effect of a given size with a given degree of confidence.

 

Q17. What packages are used for data mining in Python and R?

 

Ans. There are various packages in Python and R:

Python – Orange, Pandas, NLTK, Matplotlib, and Scikit-learn are some of them

R – arules, tm, forecast and ggplot2 are some of the packages

 

Q18. How do you check for data quality?

 

Ans. Some of the dimensions used to check for data quality are:

  • Completeness
  • Consistency
  • Uniqueness
  • Integrity
  • Conformity
  • Accuracy

 

Q19. What is the difference between squared error and absolute error?

 

Ans. Squared error is the square of the difference between the estimator and what is estimated. Averaged over all observations, it gives the mean squared error, which penalizes large errors much more heavily than small ones.

Absolute error is the magnitude of the difference between the measured or inferred value of a quantity and its actual value.
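
A small sketch in plain Python shows how differently the two measures react to one large error. The function names and the numbers are made up for illustration. Three of the four predictions are nearly right and one is off by 8.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 12, 14, 16]
predicted = [11, 12, 13, 24]  # the last prediction is off by 8

mse = mean_squared_error(actual, predicted)   # squares: 1, 0, 1, 64
mae = mean_absolute_error(actual, predicted)  # absolutes: 1, 0, 1, 8
print(mse, mae)
```

The single large error dominates the squared measure far more than the absolute one.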

 

Q20. Write a program in Python which takes input as weight of the coins and produces output as the money value of the coins.

 

Ans. Here is an example of the code. You can change the values.

Python-Program
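
One possible approach is sketched below. It assumes the coins are sorted by denomination and each pile is weighed separately, and that the weight of a single coin of each denomination is known. The table of US coin weights and values, and the names COINS and money_value, are assumptions for this sketch that you can change.

```python
# Assumed per-coin weight in grams and value in cents (US coins, for illustration).
COINS = {
    "penny": (2.5, 1),
    "nickel": (5.0, 5),
    "dime": (2.268, 10),
    "quarter": (5.67, 25),
}

def money_value(weights_in_grams):
    """Convert {denomination: total weight in grams} into a money value in dollars."""
    total_cents = 0
    for name, total_weight in weights_in_grams.items():
        unit_weight, cents = COINS[name]
        # Round to the nearest whole coin to absorb small weighing errors.
        count = round(total_weight / unit_weight)
        total_cents += count * cents
    return total_cents / 100

# 25 g of pennies is 10 coins, 56.7 g of quarters is 10 coins.
value = money_value({"penny": 25.0, "quarter": 56.7})
print(value)
```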

 

Q21. What is an API? What are APIs used for?

 

Ans. API stands for Application Programming Interface and is a set of routines, protocols, and tools for building software applications.

APIs make it easier to develop software applications, because one program can use the functionality of another through a defined interface.

 

Q22. What is Collaborative filtering?

 

Ans. Collaborative filtering is a method of making automatic predictions about a user's interests by using the preferences or ratings of many other users.

 

Q23. Why do Ans. It is used because it is useful in studying any predictive model.

 

Q24. Differentiate between wide and long data formats?

 

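Ans. In wide format, each subject has a single row, and its repeated measurements or categories are spread across separate columns.

In long format, each row holds one observation of one variable for one subject, so a subject appears in several rows, identified by a subject variable.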
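
To make the two layouts concrete, here is a sketch in plain Python that converts a small wide table (one row per subject, one column per measurement) into long form (one row per subject and measurement). The data, the column names, and the helper wide_to_long are made up for illustration.

```python
# Wide format: one row per subject, one column per measurement.
wide = [
    {"subject": "A", "week1": 5.1, "week2": 5.4},
    {"subject": "B", "week1": 4.8, "week2": 4.9},
]

def wide_to_long(rows, id_key):
    """Turn every non-id column of each row into its own (id, variable, value) row."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                long_rows.append({id_key: row[id_key], "variable": key, "value": value})
    return long_rows

# Long format: each subject now spans several rows.
long_table = wide_to_long(wide, id_key="subject")
for row in long_table:
    print(row)
```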

 


 

Q25. Is it possible to perform logistic regression with Microsoft Excel?

 

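Ans. Yes, it is possible, for example by using the Solver add-in to find the coefficients that maximize the log-likelihood. Be prepared to explain the steps.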

 

Q26. What do you understand by Recall and Precision?

 

Ans. Precision is the fraction of retrieved instances that are relevant, while Recall is the fraction of relevant instances that are retrieved.
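
Both fractions can be computed directly from the set of retrieved items and the set of relevant items. The sketch below uses plain Python. The document ids and the function name are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # items that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7"],
)
print(precision, recall)
```

Here 2 of the 4 retrieved items are relevant, and 2 of the 5 relevant items were retrieved.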

 

Q27. What is Regularization and what kind of problems does regularization solve?

 

Ans. Regularization is a technique used to address the overfitting problem in statistical and machine learning models.

It works by adding a penalty on model complexity, such as large coefficient values, so that the model generalizes better to new data.

 

Q28. What is market basket analysis?

 

Ans. Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.

 

Q29. What is the central limit theorem?

 

Ans. The central limit theorem states that the distribution of a sample average tends to be Normal as the sample size increases, regardless of the distribution from which the average is taken, except when the moments of the parent distribution (in particular its variance) do not exist.
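
A quick simulation can illustrate this. The sketch below uses only the Python standard library and draws averages from an exponential distribution, which is strongly skewed and has mean 1 and standard deviation 1. The sample size, the number of repetitions, and the seed are assumptions chosen for illustration.

```python
import random
import statistics

def sample_means(n, n_samples=5000, seed=1):
    """Draw n_samples averages, each over n exponential(1) values."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        statistics.mean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]

means = sample_means(30)
centre = statistics.mean(means)   # should be near the parent mean of 1
spread = statistics.stdev(means)  # should be near 1 / sqrt(30)
print(round(centre, 3), round(spread, 3))
```

The averages cluster around the parent mean, and their spread shrinks roughly as one over the square root of the sample size, as the theorem predicts.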

 

Q30. Is it better to have too many false negatives or too many false positives?

 

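Ans. The answer depends on the context and on the relative cost of each kind of error, so explain your viewpoint with examples. In medical screening, a false negative (a missed disease) is usually costlier, while in spam filtering, a false positive (a legitimate email blocked) is usually costlier.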

These are some of the popular questions that are asked in a Data Science interview. Always be prepared to answer all types of questions — technical skills, interpersonal, leadership or methodologies. If you are someone who has recently started your career in Data Science, you can always get certified to improve your skills and boost your career opportunities.

 

Q31. Explain the difference between type I and type II error.

 

Ans. Type I error is the rejection of a true null hypothesis or false positive finding, while the Type II error is the non-rejection of a false null hypothesis or false negative finding.
 

Q32. What is Linear Regression?

 

Ans. Linear regression is one of the most widely used types of predictive analysis. It is used to model the relationship between a scalar response and one or more explanatory variables.
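
For the simplest case of one explanatory variable, the least-squares line can be computed by hand. The sketch below uses plain Python. The function name fit_line and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```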
 

Q33. What is the goal of A/B Testing?

 

Ans. A/B testing is a comparative study in which two or more variants of a page are shown to randomly chosen users and their responses are statistically analyzed. The goal is to identify which variation performs better.
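
One common way to do the statistical comparison is a two-proportion z-test on conversion rates. This particular test is an assumption of the sketch below, not the only option. The function name and the visitor and conversion counts are made up for illustration.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Z statistic and two-sided p-value for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Variant A: 100 conversions out of 1000; variant B: 130 out of 1000.
z, p_value = two_proportion_z(100, 1000, 130, 1000)
print(round(z, 2), round(p_value, 3))
```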
 

Q34. What are Recommender Systems?

 

Ans. Recommender systems are information filtering systems that predict which products will attract customers, although they are not ideal for every business situation. They are used to recommend movies, news, research articles, products, etc. They are typically based on content-based filtering, collaborative filtering, or both.
 

Q35. What is the main difference between overfitting and underfitting?

 

Ans. Overfitting – In overfitting, a statistical model describes random error or noise instead of the underlying relationship. It occurs when a model is excessively complex. An overfit model has poor predictive performance because it overreacts to minor fluctuations in the training data.
Underfitting – In underfitting, a statistical model is too simple to capture the underlying data trend. This type of model also shows poor predictive performance.
 

Q36. What is Interpolation and Extrapolation?

 

Ans. Interpolation – This is the method of estimating a value between known data points. It is a prediction within the range of the given data.
Extrapolation – This is the method of estimating a value beyond known data points. It is a prediction outside the range of the given data.
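
The difference can be shown with a straight line through two known points. The same formula interpolates when x lies between the points and extrapolates when x lies outside them. The function name and the numbers are made up for illustration.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Known data points: (2, 20) and (3, 30).
inside = linear_estimate(2.5, 2, 20, 3, 30)  # interpolation: x is between 2 and 3
outside = linear_estimate(5, 2, 20, 3, 30)   # extrapolation: x is beyond 3
print(inside, outside)
```

Extrapolated values are generally less reliable, because nothing in the data confirms that the trend continues outside the observed range.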
 

Q37. Why should you perform dimensionality reduction before fitting an SVM?

 

Ans. SVMs tend to perform better in a reduced space. If the number of features is large compared to the number of observations, then we should perform dimensionality reduction before fitting an SVM.
 

Q38. Explain the purpose of group functions in SQL. Cite certain examples of group functions.

 

Ans. Group functions provide summary statistics of a data set. Some examples of group functions are –
a) COUNT
b) MAX
c) MIN
d) AVG
e) SUM
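f) DISTINCT (strictly a keyword rather than a function, often combined with the above, e.g. COUNT(DISTINCT column))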
 

Q39. What are the various types of classification algorithms?

 

Ans. Commonly cited types of classification algorithms include –
a) Linear Classifiers: Logistic Regression, Naive Bayes Classifier
b) Nearest Neighbor
c) Support Vector Machines
d) Decision Trees
e) Boosted Trees
f) Random Forest
g) Neural Networks
 

Q40. What is Gradient Descent?

 

Ans. Gradient Descent is a popular algorithm used for training Machine Learning models. It iteratively adjusts the parameters of a function (f) in the direction of the negative gradient in order to minimize a cost function.
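
A minimal sketch of the idea for a single parameter is shown below, in plain Python. The cost function (x - 3)^2, the learning rate, the step count, and the function name are assumptions chosen for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# Cost function: (x - 3)^2, whose gradient is 2 * (x - 3) and whose minimum is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 6))
```

Each step moves the parameter a fraction of the way towards the minimum. A learning rate that is too large can overshoot and diverge, while one that is too small converges slowly.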
 

Q41. Name different Deep Learning Frameworks.

 

Ans.
a) Caffe
b) Chainer
c) PyTorch
d) TensorFlow
e) Microsoft Cognitive Toolkit
f) Keras
 

Q42. What is an Autoencoder?

 

Ans. Autoencoders are feedforward networks trained to reproduce their input at the output. They reduce the number of dimensions in the data to encode it, and then reconstruct the output from this compressed representation while keeping the reconstruction error minimal.
 

Q43. What is a Boltzmann Machine?

 

Ans. Boltzmann Machines have a simple learning algorithm that helps them discover interesting features in the training data, features that represent complex regularities. They are also used to optimize the weights and quantities for a given problem.
 

Q44. What is Root Cause Analysis?

 

Ans. Root Cause is defined as a fundamental failure of a process. To analyze such issues, a systematic approach has been devised that is known as Root Cause Analysis (RCA). This method addresses a problem or an accident and gets to its “root cause”.
 

Q45. What is the difference between a Validation Set and a Test Set?

 

Ans. Validation Set is used to minimize overfitting. It is used in parameter selection, which means that it helps to verify whether an accuracy improvement on the training data set also holds on data the model has not been trained on.
Test Set is used to test and evaluate the performance of a trained Machine Learning model.
 

Q46. What is Confusion Matrix?

 

Ans. A Confusion Matrix describes the performance of a classification model. For a binary classifier, it is presented in the form of a table with 4 different combinations of predicted and actual values: true positives, false positives, false negatives, and true negatives.
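
For a binary classifier, the four combinations are true positives, false positives, false negatives, and true negatives. The sketch below counts them in plain Python. The labels, the function name, and the choice of 1 as the positive class are assumptions for illustration.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count the four combinations of predicted and actual labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
matrix = confusion_matrix(actual, predicted)
print(matrix)
```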
 

Q47. What are the limitations of a Linear Model/Regression?

 

Ans.
• Linear models are limited to linear relationships between the dependent and independent variables
• Linear regression looks at a relationship between the mean of the dependent variable and the independent variables, and not the extremes of the dependent variable
• Linear regression is sensitive to univariate or multivariate outliers
• Linear regression assumes that the observations are independent
 

Q48. What is p-value?

 

Ans. A p-value helps to determine the strength of results in a hypothesis test. It is a number between 0 and 1, and the smaller it is, the stronger the evidence against the null hypothesis.
 

Q49. What is hypothesis testing?

 

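Ans. Hypothesis testing is a statistical procedure for deciding whether sample data provide enough evidence to reject a null hypothesis. It is an important part of any testing procedure in Machine Learning or Data Science, used to analyze factors that may have an impact on the outcome of an experiment.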
 

Q50. What is the difference between Causation and Correlation?

 

Ans. Causation denotes a causal relationship between two events, where one event is the cause and the other is its effect.
Correlation measures the statistical relationship between two or more variables.
Causation usually implies the presence of correlation, but correlation doesn’t necessarily denote causation.

About the Author

Hasibuddin Ahmed


Hasib is a professional writer associated with learning.naukri.com. He has written a number of articles related to technology, marketing, and career on various blogs and websites. As an amateur career guru, he often imparts nuggets of knowledge related to leadership and motivation. He is also an avid reader and passionate about the beautiful game of football.