Top Data Science Interview Questions & Answers



Data science is an interdisciplinary field concerned with extracting information and insights from data through scientific methods, processes, and systems. The data can be structured or unstructured. As businesses rely more heavily on analytics to gain a competitive edge, demand for skilled data science professionals has risen in the job market.

With more and more jobs in the data science arena, it has become a lucrative area to be in at the moment.

If you want to succeed in a data science role, here are some common data science interview questions and answers to help you crack the interview.


Q1. Which would you prefer – R or Python?


Ans. Both R and Python have their own pros and cons. R is mainly used when the data analysis task requires standalone computing or analysis on individual servers. Python is the better choice when data analysis tasks need to be integrated with web apps or when statistics code needs to be incorporated into a production database.


Q2. Explain what resampling methods are.


Ans. Resampling methods are used to estimate the precision of sample statistics, to perform significance tests by exchanging labels on data points, and to validate models.


Q3. What are Recommender Systems?


Ans. A recommender system is a subclass of information filtering system that seeks to predict the “rating” or “preference” that a user would give to an item.


Q4. What is an Eigenvalue and Eigenvector?


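Ans. Eigenvectors are used for understanding linear transformations: an eigenvector is a direction that the transformation only stretches or compresses, without rotating it.

The eigenvalue is the strength of the transformation in the direction of the eigenvector, that is, the factor by which the stretching or compression occurs.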
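
As a small illustration, the defining property (transforming an eigenvector only rescales it by its eigenvalue) can be checked by hand. This is a minimal sketch; the 2x2 matrix and the helper name mat_vec are chosen purely for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Illustrative symmetric matrix; its eigenvalues are 3 and 1.
A = [[2.0, 1.0],
     [1.0, 2.0]]

# (eigenvector, eigenvalue) pairs for this particular matrix.
pairs = [([1.0, 1.0], 3.0), ([1.0, -1.0], 1.0)]

for v, lam in pairs:
    transformed = mat_vec(A, v)          # A applied to v
    scaled = [lam * x for x in v]        # v rescaled by its eigenvalue
    print(v, "->", transformed, "| matches eigenvalue * v:", transformed == scaled)
```

In practice a numerical library would compute the eigenvalues and eigenvectors; the point here is only that A times v equals the eigenvalue times v.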


Q5. What is selection bias, and how can you avoid it?


Ans. Selection bias is an experimental error that occurs when the participant pool, or the subsequent data, is not representative of the target population.

Selection biases cannot be overcome with statistical analysis of existing data alone, though Heckman correction may be used in special cases.


Q6. Which package is used to do data import in R and Python? How do you do data import in SAS?


Ans. In R, RODBC is used for RDBMS data and data.table (its fread function) for fast import.

In SAS, data is imported with the DATA step or PROC IMPORT; sas7bdat is the native SAS dataset file format.

In Python, the Pandas package and its functions read_csv and read_sql are used for reading data.



Q7. Which technique is used to predict categorical responses?


Ans. Classification techniques are used to predict categorical responses.


Q8. What is the difference between data science and big data?


Ans. Data science is a field applicable to data of any size. Big data refers to data sets so large that they cannot be analysed by traditional methods.


Q9. Name some of the prominent resampling methods in data science


Ans. The Bootstrap, Permutation Tests, Cross-validation and Jackknife
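
To make the first of these concrete, here is a minimal bootstrap sketch that estimates the standard error of a sample mean by resampling with replacement. The data values, the function name bootstrap_se, and the number of resamples are illustrative assumptions.

```python
import random
import statistics


def bootstrap_se(sample, n_resamples=2000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=len(sample))
        means.append(statistics.fmean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)


data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]  # made-up measurements
se = bootstrap_se(data)
print("bootstrap standard error of the mean:", round(se, 3))
```

The same loop works for other statistics (median, correlation, and so on) by swapping the function applied to each resample.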


Q10. What is a Gaussian distribution and how is it used in data science?


Ans. The Gaussian distribution, commonly known as the bell curve, is a common probability distribution. In the interview, describe in detail how it can be used in data science.
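
One use worth mentioning is judging how unusual a value is. The sketch below, which relies on NormalDist from Python's standard statistics module, computes the share of a Gaussian that lies within 1, 2 and 3 standard deviations of the mean (roughly 68%, 95% and 99.7%). The helper name share_within is an illustrative assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Probability that a Gaussian value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, "standard deviation(s):", round(share_within(k), 4))
```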


Q11. What is an RDBMS? Name some examples of RDBMSs.


Ans. A relational database management system (RDBMS) is a database management system that is based on the relational model.

Some examples of RDBMSs are MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access.


Q12. What is a Z test, Chi Square test, F test and T test?


Ans. Z test is applied for large samples. Z = (Estimated Mean - Real Mean) / square root (Real Variance / n).

Chi Square test is a statistical method for assessing the goodness of fit between a set of observed values and those expected theoretically.

F-test is used to compare the variances of two populations. F = explained variance / unexplained variance.

T test is applied for small samples. T = (Estimated Mean - Real Mean) / square root (Estimated Variance / n).
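
The Z and T formulas can be sketched directly in code. This is a minimal illustration, not a full hypothesis test: the sample values, the hypothesised mean of 5.0 and the assumed population variance of 0.09 are all made up.

```python
import math
import statistics


def z_statistic(sample, real_mean, real_variance):
    """Z = (estimated mean - real mean) / sqrt(real variance / n)."""
    n = len(sample)
    return (statistics.fmean(sample) - real_mean) / math.sqrt(real_variance / n)


def t_statistic(sample, real_mean):
    """T = (estimated mean - real mean) / sqrt(estimated variance / n)."""
    n = len(sample)
    estimated_variance = statistics.variance(sample)  # sample variance (n - 1)
    return (statistics.fmean(sample) - real_mean) / math.sqrt(estimated_variance / n)


sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]  # illustrative measurements
z = z_statistic(sample, real_mean=5.0, real_variance=0.09)
t = t_statistic(sample, real_mean=5.0)
print("z =", round(z, 3), "t =", round(t, 3))
```

The only difference between the two is whether the variance is known in advance (Z) or estimated from the sample (T).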


Q13. What does P-value signify about the statistical data?


Ans. The p-value is the probability, under a given statistical model and assuming the null hypothesis is true, that the statistical summary would be the same as or more extreme than the actual observed results.

A p-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.

A p-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.

A p-value very close to 0.05 is marginal, indicating it is possible to go either way.


Q14. Differentiate between univariate, bivariate and multivariate analysis.


Ans. Univariate analysis is the simplest form of statistical analysis where only one variable is involved.

Bivariate analysis is where two variables are analysed, and in multivariate analysis, more than two variables are examined.


Q15. What is association analysis? Where is it used?


Ans. Association analysis is the task of uncovering relationships among data. It is used to understand how the data items are associated with each other.


Q16. What is power analysis?


Ans. Power analysis allows the determination of the sample size required to detect an effect of a given size with a given degree of confidence.
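
As a rough sketch of the idea, the sample size per group for comparing two means can be approximated from the standard normal quantiles. The formula used here is the common normal approximation for a two-sided, two-sample comparison; the function name and the default alpha and power values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect a standardised effect size."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


print("medium effect (0.5):", sample_size_per_group(0.5))
print("small effect (0.2):", sample_size_per_group(0.2))
```

Smaller effects need far larger samples, which is the main practical message of a power analysis.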


Q17. What packages are used for data mining in Python and R?


Ans. There are various packages in Python and R:

Python – Orange, pandas, NLTK, Matplotlib, and scikit-learn are some of them

R – arules, tm, forecast, and ggplot2 are some of the packages


Q18. How do you check for data quality?


Ans. Some of the dimensions used to check for data quality are:

  • Completeness
  • Consistency
  • Uniqueness
  • Integrity
  • Conformity
  • Accuracy


Q19. What is the difference between squared error and absolute error?


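Ans. Squared error, usually reported as mean squared error, measures the average of the squares of the errors or deviations, that is, of the differences between the estimator and what is estimated.

Absolute error is the difference between the measured or inferred value of a quantity and its actual value. Because the errors are squared, squared error penalises large errors more heavily than absolute error does.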
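
A short sketch makes the contrast visible. The actual and predicted values below are made up, and the averaged forms (mean squared error and mean absolute error) are used so the two numbers are comparable.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 3.0]  # the last prediction is off by 4

print("MSE:", mean_squared_error(actual, predicted))
print("MAE:", mean_absolute_error(actual, predicted))
```

The single large miss dominates the squared error far more than it does the absolute error.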


Q20. Write a program in Python which takes input as weight of the coins and produces output as the money value of the coins.


Ans. Here is an example of the code. You can change the values.
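
A minimal sketch, assuming US coins, their standard weights in grams, and input given as the total weight of each coin type. The coin table, the function name value_in_cents and the sample weights are illustrative assumptions.

```python
# Assumed unit weight in grams and value in cents for each US coin type.
COINS = {
    "penny": (2.5, 1),
    "nickel": (5.0, 5),
    "dime": (2.268, 10),
    "quarter": (5.67, 25),
}


def value_in_cents(weights_in_grams):
    """Convert total weight per coin type into a total money value in cents."""
    total = 0
    for name, weight in weights_in_grams.items():
        unit_weight, unit_value = COINS[name]
        count = round(weight / unit_weight)  # number of coins of this type
        total += count * unit_value
    return total


weights = {"penny": 25.0, "nickel": 50.0, "dime": 22.68, "quarter": 56.7}
cents = value_in_cents(weights)
print("Total value: $%.2f" % (cents / 100))
```

Working in whole cents avoids floating-point rounding problems when adding up money.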



Q21. What is an API? What are APIs used for?


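Ans. API stands for Application Programming Interface and is a set of routines, protocols, and tools for building software applications.

With an API, it is easier to develop software applications, because developers can build on existing functionality instead of implementing it themselves.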


Q22. What is Collaborative filtering?


Ans. Collaborative filtering is a method of making automatic predictions about a user's interests by using the preferences or recommendations of many other people.
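
One simple variant, user-based collaborative filtering, can be sketched as follows: a missing rating is predicted as the similarity-weighted average of other users' ratings for that item. The ratings, user names and the choice of cosine similarity are illustrative assumptions, not the only way to do it.

```python
import math

# Made-up ratings: user -> {item: rating}.
ratings = {
    "ana": {"A": 5, "B": 4, "C": 1},
    "ben": {"A": 4, "B": 5, "C": 2, "D": 5},
    "cai": {"A": 1, "B": 2, "C": 5, "D": 1},
}


def cosine(u, v):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = cosine(ratings[user], their_ratings)
        weighted_sum += weight * their_ratings[item]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


print("Predicted rating of item D for ana:", round(predict("ana", "D"), 2))
```

Because ana's tastes are closer to ben's than to cai's, the prediction leans towards ben's high rating.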


Q23. Why do data scientists use combinatorics or discrete probability?


Ans. They are useful in studying any predictive model, since such models rest on counting possible outcomes and assigning probabilities to them.


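Q24. Differentiate between wide and long data formats.


Ans. In wide format, each subject has a single row, and its repeated responses or categories are grouped into separate columns.

In long format, each row is one observation, so there are a number of rows per subject, with one column identifying the subject and another identifying the variable.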
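
The difference is easiest to see by converting one format into the other. Below is a minimal sketch using plain Python dictionaries; the column names and values are made up, and in practice a library such as Pandas would usually do this reshaping.

```python
# Wide format: one row per subject, one column per variable.
wide = [
    {"subject": "s1", "height": 170, "weight": 65},
    {"subject": "s2", "height": 182, "weight": 80},
]


def wide_to_long(rows, id_column):
    """Turn each (subject, variable) pair into its own row."""
    long_rows = []
    for row in rows:
        for variable, value in row.items():
            if variable != id_column:
                long_rows.append(
                    {id_column: row[id_column], "variable": variable, "value": value}
                )
    return long_rows


long_format = wide_to_long(wide, "subject")
for row in long_format:
    print(row)
```

Two wide rows with two variables each become four long rows.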


Q25. Is it possible to perform logistic regression with Microsoft Excel?


Ans. Yes, it is possible. Try to explain it.


Q26. What do you understand by Recall and Precision?


Ans. Precision is the fraction of retrieved instances that are relevant, while Recall is the fraction of relevant instances that are retrieved.
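
For a binary classifier, both measures follow directly from counts of true positives, false positives and false negatives. A minimal sketch with made-up labels (1 = relevant, 0 = not relevant):

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(actual, predicted)
print("precision:", round(precision, 3), "recall:", round(recall, 3))
```

Here three items are retrieved and two of them are relevant (precision 2/3), while only two of the four relevant items are found (recall 1/2).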


Q27. What is Regularization and what kind of problems does regularization solve?


Ans. Regularization is a technique used to address the overfitting problem in statistical and machine learning models, typically by adding a penalty on model complexity.
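
As one concrete case, ridge (L2) regularization adds a penalty on the size of the coefficients. The sketch below fits a one-variable line through the origin, where the penalised least-squares slope has a simple closed form; the data and the function name ridge_slope are illustrative assumptions.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w minimising sum((y - w*x)**2) + penalty * w**2 (no intercept)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # made-up data, roughly y = 2x

for penalty in (0.0, 1.0, 10.0):
    print("penalty", penalty, "-> slope", round(ridge_slope(xs, ys, penalty), 4))
```

A larger penalty shrinks the slope towards zero, trading a little bias for a model that reacts less to noise in the training data.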


Q28. What is market basket analysis?


Ans. Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.
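
The usual measures behind this technique are support, confidence and lift. A minimal sketch over a handful of made-up transactions:

```python
# Made-up shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent):
    """How often the consequent appears in baskets that contain the antecedent."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent) / support(consequent)


print("support(bread, milk):", support({"bread", "milk"}))
print("confidence(bread -> milk):", confidence({"bread"}, {"milk"}))
print("lift(bread -> milk):", lift({"bread"}, {"milk"}))
```

A lift above 1 suggests buying the first group makes the second more likely, and a lift below 1 suggests it makes the second less likely.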


Q29. What is the central limit theorem?


Ans. The central limit theorem states that the distribution of a sample average tends to be Normal as the sample size increases, regardless of the distribution from which the samples are drawn, except when the required moments of the parent distribution do not exist.
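
A quick simulation illustrates the effect. The sketch below averages draws from a strongly skewed exponential distribution and measures the skewness of those averages; the sample sizes, seed and helper names are illustrative assumptions.

```python
import random
import statistics


def sample_means(sample_size, n_samples=2000, seed=1):
    """Means of repeated samples drawn from an exponential distribution."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        draws = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(draws))
    return means


def skewness(values):
    """Sample skewness; 0 for a perfectly symmetric distribution."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / spread) ** 3 for v in values])


for size in (2, 30):
    means = sample_means(size)
    print("sample size", size,
          "| spread of means:", round(statistics.stdev(means), 3),
          "| skewness:", round(skewness(means), 3))
```

With larger samples the averages cluster more tightly around the true mean and their distribution becomes more symmetric, as the theorem predicts.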


Q30. Is it better to have too many false negatives or too many false positives?


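Ans. The answer depends on the context, that is, on which kind of error is more costly in the problem at hand. Give examples to support your viewpoint: in medical screening a false negative (a missed disease) is usually worse, while in spam filtering a false positive (a legitimate email marked as spam) is usually worse.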

These are some of the popular questions that are asked in a Data Science interview. Always be prepared to answer all types of questions — technical skills, interpersonal, leadership or methodologies. If you are someone who has recently started your career in Data Science, you can always get certified to improve your skills and boost your career opportunities.

About the Author

Hasibuddin Ahmed

Hasib is a professional writer associated with He has written a number of articles related to technology, marketing, and career on various blogs and websites. As an amateur career guru, he often imparts nuggets of knowledge related to leadership and motivation. He is also an avid reader and passionate about the beautiful game of football.