There is a booming market for data scientists, data architects, data analysts, and data engineers. According to Google Trends, demand has jumped by approximately 400% over the last 5 years, paving the way for more employment opportunities. Even with the market growing by leaps and bounds, there is a significant dearth of skilled data scientists who can help businesses sift through an overabundance of data and come up with meaningful insights.
So if you are planning to become a data scientist, you need to prepare well and impress your prospective employers with your knowledge. This write-up brings you some important data science interview questions and answers to help you crack your data science interview.
Q1. Which would you prefer – R or Python?
Ans. Both R and Python have their own pros and cons. R is mainly used when the data analysis task requires standalone computing or analysis on individual servers. Python is preferred when your data analysis tasks need to be integrated with web apps or when statistics code needs to be incorporated into a production database.
Q2. What are the Resampling methods?
Ans. Resampling methods are used to estimate the precision of sample statistics, to perform significance tests by exchanging labels on data points, and to validate models.
Q3. What are Recommender Systems?
Ans. A recommender system is a subclass of information filtering systems that seeks to predict the “rating” or “preference” that a user would give to an item.
Q4. What is an Eigenvalue and Eigenvector?
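Ans. Eigenvectors are used for understanding linear transformations. An eigenvector is a nonzero vector whose direction is left unchanged by the transformation.
The eigenvalue is the strength of the transformation in the direction of the eigenvector, that is, the factor by which the eigenvector is stretched or compressed.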
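As a small illustration of the definition, the sketch below checks that applying a matrix to one of its eigenvectors only rescales it by the eigenvalue. The matrix A and the helper matvec are made-up examples chosen for easy arithmetic, not part of any library.

```python
def matvec(a, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

# Hypothetical 2x2 matrix chosen so the eigenpairs are easy to verify by hand.
A = [[2, 1],
     [1, 2]]

v1, lam1 = [1, 1], 3    # A stretches v1 by a factor of 3
v2, lam2 = [1, -1], 1   # A leaves v2 unchanged (factor of 1)

Av1 = matvec(A, v1)     # same direction as v1, three times as long
Av2 = matvec(A, v2)     # identical to v2
print(Av1, Av2)
```

In both cases the output vector points in the same direction as the input; only its length changes, by the eigenvalue.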
Q5. What is selection bias, and how can you avoid it?
Ans. Selection bias is an experimental error that occurs when the participant pool, or the subsequent data, is not representative of the target population.
Selection biases cannot be overcome with statistical analysis of existing data alone, though Heckman correction may be used in special cases.
Q6. Which package is used to do data import in R and Python? How do you do data import in SAS?
Ans. In R, RODBC is used for RDBMS data and data.table for fast import.
In SAS, the DATA step is used to import data, and sas7bdat is the native SAS dataset file format.
In Python, the pandas package and its read_csv and read_sql functions are used for reading data.
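For the Python side, a minimal sketch of both functions might look like this. The CSV text, the table name scores, and the in-memory SQLite database are assumptions made up for the example so that it runs without external files.

```python
import io
import sqlite3

import pandas as pd

# read_csv accepts a path or any file-like object; here a small in-memory CSV.
csv_text = "name,score\nana,90\nben,85\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# read_sql runs a query against a database connection and returns a DataFrame.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [("ana", 90), ("ben", 85)])
df_sql = pd.read_sql("SELECT name, score FROM scores ORDER BY name", conn)
conn.close()

print(df_csv)
print(df_sql)
```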
Q7. Which technique is used to predict categorical responses?
Ans. The classification techniques are used to predict categorical responses.
Q8. What is the difference between data science and big data?
Ans. Data science is a field applicable to any data size. Big data refers to the large amount of data that cannot be analyzed by traditional methods.
Q9. Name some of the prominent resampling methods in data science.
Ans. The Bootstrap, Permutation Tests, Cross-validation, and Jackknife.
Q10. What is a Gaussian distribution and how is it used in data science?
Ans. The Gaussian distribution, commonly known as the bell curve, is a common probability distribution. In your answer, describe in detail how it can be used in data science.
Q11. What is an RDBMS? Name some examples for RDBMS?
Ans. A relational database management system (RDBMS) is a database management system that is based on a relational model.
Some examples of RDBMS are MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access.
Q12. What is a Z test, Chi-Square test, F test, and T-test?
Ans. The Z test is applied for large samples, when the population variance is known. Z = (estimated mean - real mean) / sqrt(real variance / n).
The Chi-Square test is a statistical method for assessing the goodness of fit between a set of observed values and those expected theoretically.
The F-test is used to compare two populations' variances. F = explained variance / unexplained variance.
The T-test is applied for small samples. T = (estimated mean - real mean) / sqrt(estimated variance / n).
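The Z and T formulas above can be worked through with a few lines of Python. This is only an arithmetic sketch: the sample, the hypothesised mean mu0, and the "known" population standard deviation sigma are invented values, and the sample is deliberately tiny so the numbers can be checked by hand.

```python
import math
import statistics

sample = [5.1, 4.9, 5.3, 5.2, 5.0]   # made-up measurements
mu0 = 5.0                            # hypothesised ("real") mean
sigma = 0.2                          # population std dev, assumed known for the Z test

n = len(sample)
mean = statistics.mean(sample)

# Z statistic: uses the known population standard deviation.
z_stat = (mean - mu0) / (sigma / math.sqrt(n))

# T statistic: uses the standard deviation estimated from the sample (n - 1 divisor).
s = statistics.stdev(sample)
t_stat = (mean - mu0) / (s / math.sqrt(n))

print(round(z_stat, 3), round(t_stat, 3))
```

The only difference between the two statistics is which standard deviation goes into the denominator.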
Q13. What does P-value signify about the statistical data?
Ans. The p-value is the probability for a given statistical model that, when the null hypothesis is true, the statistical summary would be the same as or more extreme than the actual observed results.
A p-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
A p-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
A p-value very close to 0.05 is marginal, indicating it is possible to go either way.
Q14. Differentiate between univariate, bivariate and multivariate analysis.
Ans. Univariate analysis is the simplest form of statistical analysis where only one variable is involved.
Bivariate analysis is where two variables are analyzed and in multivariate analysis, multiple variables are examined.
Q15. What is association analysis? Where is it used?
Ans. Association analysis is the task of uncovering relationships among data. It is used to understand how the data items are associated with each other.
Q16. What is power analysis?
Ans. Power analysis allows the determination of the sample size required to detect an effect of a given size with a given degree of confidence.
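As a rough sketch of the idea, the function below estimates the sample size per group for a two-sided comparison of two means using the normal approximation n = 2 * ((z_alpha + z_power) / d)^2. The function name, the default alpha and power, and the choice of this particular approximation are assumptions for illustration; dedicated statistics packages use more exact methods.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group to detect a standardised effect size (Cohen's d)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n_medium = sample_size_per_group(0.5)   # medium effect
n_small = sample_size_per_group(0.2)    # small effect needs far more data
print(n_medium, n_small)
```

Smaller effects require larger samples to be detected with the same confidence.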
Q17. What packages are used for data mining in Python and R?
Ans. There are various packages in Python and R:
Python – Orange, pandas, NLTK, Matplotlib, and scikit-learn are some of them.
R – arules, tm, forecast, and ggplot2 are some of the packages.
Q18. How do you check for data quality?
Ans. Some of the definitions used to check for data quality are:
Q19. What is the difference between squared error and absolute error?
Ans. Squared error measures the average of the squares of the errors or deviations—that is, the difference between the estimator and what is estimated.
Absolute error is the difference between the measured or inferred value of a quantity and its actual value.
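A short sketch comparing the two on the same made-up predictions shows how squaring penalises large errors more heavily. The values in actual and predicted are invented for the example.

```python
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 3.0]

errors = [p - a for p, a in zip(predicted, actual)]   # -1, 0, 2, -4

# Mean squared error: average of squared differences.
mse = sum(e ** 2 for e in errors) / len(errors)

# Mean absolute error: average of absolute differences.
mae = sum(abs(e) for e in errors) / len(errors)

print(mse, mae)
```

The single error of -4 contributes 16 of the 21 total squared units but only 4 of the 7 absolute units.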
Q20. Write a program in Python which takes input as weight of the coins and produces output as the money value of the coins.
Ans. Here is an example of the code. You can change the values.
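One possible sketch is shown below. It assumes US coins sorted by denomination, with the total weight of each pile given in grams; the per-coin weights, the COINS table, and the coin_value function are assumptions for illustration and should be replaced with the values for your own currency.

```python
# (weight of one coin in grams, value of one coin in dollars) -- assumed US coins
COINS = {
    "penny": (2.500, 0.01),
    "nickel": (5.000, 0.05),
    "dime": (2.268, 0.10),
    "quarter": (5.670, 0.25),
}

def coin_value(total_weights_g):
    """Return the money value of coin piles given each pile's total weight in grams."""
    total = 0.0
    for name, weight in total_weights_g.items():
        unit_weight, unit_value = COINS[name]
        count = round(weight / unit_weight)   # number of coins in the pile
        total += count * unit_value
    return round(total, 2)

example = coin_value({"penny": 25.0, "nickel": 50.0, "dime": 22.68, "quarter": 56.7})
print(example)
```

Here each pile weighs exactly ten coins' worth, so the total is ten of each denomination.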
Q21. What is an API? What are APIs used for?
Ans. API stands for Application Programming Interface and is a set of routines, protocols, and tools for building software applications.
With API, it is easier to develop software applications.
Q22. What is Collaborative filtering?
Ans. Collaborative filtering is a method of making automatic predictions by using the recommendations of other people.
Q23. Why do data scientists use combinatorics or discrete probability?
Ans. They are used because they are useful in studying any predictive model.
Q24. Differentiate between wide and long data formats?
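Ans. In the wide format, each subject occupies a single row and each variable (or repeated measurement) has its own column.
In the long format, each row holds one observation of one variable, so a subject appears in several rows, identified by a subject column and a variable column.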
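The difference is easiest to see by converting one format to the other. The sketch below uses pandas' melt on a made-up table; the column names subject, height, and weight are assumptions for the example.

```python
import pandas as pd

# Wide format: one row per subject, one column per variable.
wide_df = pd.DataFrame({
    "subject": ["s1", "s2"],
    "height": [170, 160],
    "weight": [65, 55],
})

# Long format: one row per (subject, variable) pair.
long_df = wide_df.melt(id_vars="subject", var_name="variable", value_name="value")

print(wide_df)
print(long_df)
```

The two subjects with two variables each become four rows in the long table.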
Q25. What does NLP stand for?
Ans. NLP stands for Natural Language Processing. It is a branch of artificial intelligence that gives machines the ability to read and understand human languages.
Q26. What do you understand by Recall and Precision?
Ans. Precision is the fraction of retrieved instances that are relevant, while Recall is the fraction of relevant instances that are retrieved.
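A tiny worked example, using made-up document IDs, may help keep the two apart:

```python
relevant = {"d1", "d2", "d3", "d4"}    # documents that should have been found
retrieved = {"d2", "d3", "d5"}         # documents the system actually returned

hits = relevant & retrieved            # relevant documents that were retrieved

precision = len(hits) / len(retrieved)   # 2 of the 3 retrieved are relevant
recall = len(hits) / len(relevant)       # 2 of the 4 relevant were retrieved

print(precision, recall)
```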
Q27. What is Regularization and what kind of problems does regularization solve?
Ans. Regularization is a technique used to address the overfitting problem in statistical and machine learning models, typically by penalizing model complexity.
Q28. What is market basket analysis?
Ans. Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.
Q29. What is the central limit theorem?
Ans. The central limit theorem states that the distribution of an average will tend to be Normal as the sample size increases, regardless of the distribution from which the average is taken except when the moments of the parent distribution do not exist.
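A quick simulation can make this concrete. The sketch below draws from a uniform distribution (which is not bell-shaped), averages n draws at a time, and looks at how the averages behave. The helper sample_means, the seed, and the sample sizes are arbitrary choices for the example.

```python
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable

def sample_means(n, repeats=2000):
    """Return `repeats` averages, each taken over n uniform(0, 1) draws."""
    return [statistics.mean([random.random() for _ in range(n)])
            for _ in range(repeats)]

means_small = sample_means(2)
means_large = sample_means(30)

spread_small = statistics.stdev(means_small)
spread_large = statistics.stdev(means_large)

# For a Normal distribution, roughly 68% of values lie within one std dev of the mean.
center = statistics.mean(means_large)
share_within_one_sd = (
    sum(abs(m - center) <= spread_large for m in means_large) / len(means_large)
)

print(spread_small, spread_large, share_within_one_sd)
```

The averages cluster more tightly around 0.5 as n grows, and the share within one standard deviation lands close to the Normal benchmark.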
Q30. Is it better to have too many false negatives or too many false positives?
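Ans. The answer depends on the context, so state your viewpoint and support it with examples. In medical screening, for instance, a false negative (a missed illness) is usually costlier, whereas in spam filtering a false positive (a legitimate email marked as spam) is usually worse.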
These are some of the popular questions that are asked in a Data Science interview. Always be prepared to answer all types of questions — technical skills, interpersonal, leadership or methodologies. If you are someone who has recently started your career in Data Science, you can always get certified to improve your skills and boost your career opportunities.
Q31. Explain the difference between type I and type II error.
Ans. Type I error is the rejection of a true null hypothesis or false-positive finding, while the Type II error is the non-rejection of a false null hypothesis or false-negative finding.
Q32. What is Linear Regression?
Ans. Linear regression is one of the most popular types of predictive analysis. It is used to model the relationship between a scalar response and one or more explanatory variables.
Q33. What is the goal of A/B Testing?
Ans. A/B testing is a comparative study in which two or more variants of a page are shown to randomly assigned users and their responses are statistically analyzed to determine which variant performs better.
Q34. What are Recommender Systems?
Ans. Recommender systems are information filtering systems that predict which products will attract customers, but they are not ideal for every business situation. They are used for movies, news, research articles, products, etc. They are typically based on content-based filtering, collaborative filtering, or both.
Q35. What is the main difference between overfitting and underfitting?
Ans. Overfitting – In overfitting, a statistical model describes random error or noise instead of the underlying relationship. It occurs when a model is overly complex. An overfit model has poor predictive performance as it overreacts to minor fluctuations in training data.
Underfitting – In underfitting, a statistical model is unable to capture the underlying data trend. This type of model also shows poor predictive performance.
Q36. What are Interpolation and Extrapolation?
Ans. Interpolation – This is the method of estimating a value that lies between known data points. It is a prediction within the range of the given data.
Extrapolation – This is the method of estimating a value that lies beyond the known data points. It is a prediction outside the range of the given data.
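A minimal sketch with a straight line through two made-up points shows both cases; the helper linear_predict is a hypothetical name for this example.

```python
def linear_predict(x0, y0, x1, y1, x):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Known points: (1, 10) and (3, 30)
inside = linear_predict(1, 10, 3, 30, 2)    # x=2 lies between the points: interpolation
outside = linear_predict(1, 10, 3, 30, 5)   # x=5 lies beyond the points: extrapolation

print(inside, outside)
```

The arithmetic is identical in both cases; extrapolation is riskier only because nothing in the data confirms that the trend continues outside the observed range.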
Q37. Why should you perform dimensionality reduction before fitting an SVM?
Ans. SVMs tend to perform better in a reduced space. If the number of features is large compared to the number of observations, then we should perform dimensionality reduction before fitting an SVM.
Q38. Explain the purpose of group functions in SQL. Cite certain examples of group functions.
Ans. Group functions provide summary statistics of a data set. Some examples of group functions are –
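Commonly used group (aggregate) functions in standard SQL include COUNT, SUM, AVG, MIN, and MAX. The sketch below runs them through Python's built-in sqlite3 module; the orders table and its contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 10.0), ("ana", 30.0), ("ben", 20.0)],
)

# Each group function collapses the rows of a group into a single summary value.
rows = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount), AVG(amount), MIN(amount), MAX(amount) "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

for row in rows:
    print(row)
```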
Q39. What are the various types of classification algorithms?
Ans. There are 7 types of classification algorithms, including –
a) Linear Classifiers: Logistic Regression, Naive Bayes Classifier
b) Nearest Neighbor
c) Support Vector Machines
d) Decision Trees
e) Boosted Trees
f) Random Forest
g) Neural Networks
Q40. What is Gradient Descent?
Ans. Gradient Descent is a popular algorithm used for training Machine Learning models. It finds the values of the parameters of a function (f) that minimize a cost function.
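The core loop is short enough to sketch in a few lines. The example below minimises the one-parameter cost f(x) = (x - 3)^2; the cost function, learning rate, and step count are arbitrary choices for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)   # move downhill
    return x

# Cost f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

Training a real model works the same way, except that x is a vector of model parameters and the gradient is computed from the training data.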
Q41. Name different Deep Learning Frameworks.
e) Microsoft Cognitive Toolkit
Q42. What is an Autoencoder?
Ans. These are feedforward learning networks where the input is the same as the output. Autoencoders reduce the number of dimensions in the data to encode it while ensuring minimal error and then reconstruct the output from this representation.
Q43. What is a Boltzmann Machine?
Ans. Boltzmann Machines have a simple learning algorithm that helps to discover interesting features in training data. These machines can represent complex regularities and are used to optimize the weights and quantities for a given problem.
Q44. What is Root Cause Analysis?
Ans. Root Cause is defined as a fundamental failure of a process. To analyze such issues, a systematic approach has been devised that is known as Root Cause Analysis (RCA). This method addresses a problem or an accident and gets to its “root cause”.
Q45. What is the difference between a Validation Set and a Test Set?
Ans. The validation set is used to minimize overfitting. It is used in parameter selection, which means that it helps to verify any accuracy improvement over the training data set. The test set is used to test and evaluate the performance of a trained Machine Learning model.
Q46. What is the Confusion Matrix?
Ans. Confusion Matrix describes the performance of any classification model. It is presented in the form of a table with 4 different combinations of predicted and actual values.
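For a binary classifier, the four cells can be counted directly from the label pairs. The labels below are made up for the example, with 1 as the positive class.

```python
from collections import Counter

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (actual, predicted) combination.
counts = Counter(zip(actual, predicted))

tp = counts[(1, 1)]   # true positives
fn = counts[(1, 0)]   # false negatives
fp = counts[(0, 1)]   # false positives
tn = counts[(0, 0)]   # true negatives

print("            pred=1  pred=0")
print(f"actual=1    {tp:6d}  {fn:6d}")
print(f"actual=0    {fp:6d}  {tn:6d}")
```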
Q47. What are the limitations of a Linear Model/Regression?
• Linear models are limited to linear relationships between the dependent and independent variables
• Linear regression looks at the relationship between the mean of the dependent variable and the independent variables, and not at the extremes of the dependent variable
• Linear regression is sensitive to univariate or multivariate outliers
• Linear regression assumes that the data are independent
Q48. What is the p-value?
Ans. A p-value helps to determine the strength of results in a hypothesis test. It is a number between 0 and 1, and the smaller it is, the stronger the evidence against the null hypothesis.
Q49. What is hypothesis testing?
Ans. Hypothesis testing is an important aspect of any testing procedure in Machine Learning or Data Science to analyze various factors that may have any impact on the outcome of the experiment.
Q50. What is the difference between Causation and Correlation?
Ans. Causation denotes any causal relationship between two events and represents its cause and effects.
Correlation determines the relationship between two or more variables.
Causation necessarily denotes the presence of correlation, but correlation doesn’t necessarily denote causation.
Q51. What is cross-validation?
Ans. Cross-validation is a technique to assess the performance of a model on a new independent dataset. One example of cross-validation could be – splitting the data into two groups – training and testing data, where you use the testing data to test the model and training data to build the model.
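The two-group split described above is the simplest case. A common extension is k-fold cross-validation, where every observation is used for testing exactly once. The sketch below only generates the index splits; the function k_fold_indices is a hypothetical helper written for this example, and no model is fitted.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k (train, test) pairs."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `remainder` folds get one extra sample.
        stop = start + fold_size + (1 if i < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        folds.append((train, test))
        start = stop
    return folds

folds = k_fold_indices(10, 3)
for train, test in folds:
    print("train:", train, "test:", test)
```

In practice the data are usually shuffled first, and the model is trained and scored once per fold, with the scores averaged.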
Q52. What is Deep Learning?
Ans. Deep Learning is an artificial intelligence function used in decision making. Deep Learning imitates the human brain functioning to process the data and create the patterns used in decision-making. Deep learning is a key technology behind automated driving, automated machine translation, automated game playing, object classification in photographs, and automated handwriting generation, among others.
Q53. What is Pattern Recognition?
Ans. Pattern recognition is the process of classifying data by identifying patterns and regularities in it. This methodology involves the extensive use of machine learning algorithms.
Q54. Where can you use Pattern Recognition?
Ans. Pattern Recognition has many applications, including:
- Computer Vision
- Data Mining
- Information Retrieval
- Speech Recognition
Q55. What are some of the most commonly used Machine Learning algorithms?
Ans. Some of the popular Machine Learning algorithms are –
- Linear Regression
- Logistic Regression
- Decision Tree
- Neural Networks
- Support vector machines
Q56. What is the main difference between supervised and unsupervised machine learning?
Ans. Supervised learning involves training a model on labeled data for tasks such as data classification, while unsupervised learning does not require explicitly labeled data.
Q57. What is Big Data?
Ans. Big Data is a massive, exponentially growing collection of data that cannot be managed, stored, and processed by traditional data management tools.
Q58. What are some of the important tools used in Big Data analytics?
Ans. The important Big Data analytics tools are –
• Rattle GUI
Q59. What do you mean by logistic regression?
Ans. Also known as the logit model, logistic regression is a technique to predict a binary outcome from a linear combination of predictor variables.
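The prediction step can be sketched as a weighted sum passed through the logistic (sigmoid) function. The weights, bias, and feature values below are invented for the example; in practice they would be learned from data.

```python
import math

def predict_probability(features, weights, bias):
    """Return the predicted probability of the positive class."""
    z = bias + sum(w * x for w, x in zip(weights, features))   # linear combination
    return 1.0 / (1.0 + math.exp(-z))                          # squash into (0, 1)

p_even = predict_probability([2.0, 1.0], weights=[0.5, -1.0], bias=0.0)   # z = 0
p_high = predict_probability([4.0, 1.0], weights=[0.5, -1.0], bias=0.0)   # z = 1

print(p_even, p_high)
```

A probability above a chosen threshold (often 0.5) is then mapped to one class and anything below it to the other.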
Q60. How much data is enough to get a valid outcome?
Ans. All the businesses are different and measured in different ways. Thus, you never have enough data and there will be no right answer. The amount of data required depends on the methods you use to have an excellent chance of obtaining vital results.