A career in data analytics can make you an in-demand professional in one of the highest-paying industries. If you are preparing for a data analytics interview, this article will help you understand the types of questions that are frequently asked.

Data analytics is the process of extracting and examining data to draw conclusions by identifying and analysing behavioural patterns. In an information-driven age, data plays an integral role in the functioning of any organisation, so organisations look to hire skilled data analysts who can turn their data into valuable information. This helps them achieve better business growth through a deeper understanding of the market, their consumers, and their products or services. If you love data and want to be part of a high-potential industry, consider a certification in data analytics.

Here are some of the top data analytics interview questions and answers:

Q1. What are the best practices for data cleaning?

Ans. There are 5 basic best practices for data cleaning:

  • Make a data cleaning plan by understanding where the common errors take place and keep communications open.
  • Standardise the data at the point of entry. This way it is less chaotic and you will be able to ensure that all information is standardised, leading to fewer errors on entry.
  • Focus on the accuracy of the data. Maintain the value types of data, provide mandatory constraints and set cross-field validation.
  • Identify and remove duplicates before working with the data. This will lead to an effective data analysis process.
  • Create a set of utility tools/functions/scripts to handle common data cleaning tasks.
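Building on the last point above, here is a minimal sketch of what such a utility function might look like, using pandas. The column names and cleaning rules are hypothetical examples, not a prescribed standard.

```python
import pandas as pd

def clean_customer_data(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical reusable cleaning routine illustrating the practices above."""
    df = df.copy()
    # Standardise at the point of entry: consistent casing, stripped whitespace
    df["email"] = df["email"].str.strip().str.lower()
    # Maintain value types and mandatory constraints
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    df = df.dropna(subset=["customer_id"])
    # Remove duplicates before analysis
    return df.drop_duplicates(subset=["customer_id"], keep="first")

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, None],
    "email": [" A@X.COM ", "a@x.com", "b@y.com", "c@z.com"],
    "signup_date": ["2021-01-05", "2021-01-05", "not a date", "2021-02-01"],
})
print(clean_customer_data(raw))
```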

Q2. What are the challenges that you face as a data analyst?

Ans. There are various ways you can answer this question. Common challenges include very badly formatted data, data that is not sufficient to work with, data that clients have supposedly cleaned but have actually made worse, not receiving updated data, and factual or data-entry errors.


Q3. What are the data validation methods used in data analytics?

Ans. The various types of data validation methods used are:

  • Field Level Validation – validation is done in each field as the user enters the data to avoid errors caused by human interaction.
  • Form Level Validation – In this method, validation is performed once the user completes the form, before the information is saved.
  • Data Saving Validation – This type of validation is performed during the saving process of the actual file or database record. This is usually done when there are multiple data entry forms.
  • Search Criteria Validation – This type of validation checks that the search criteria entered by the user match what the user is looking for closely enough, so that relevant results are actually returned.
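As a rough illustration, field-level validation can be as simple as checking each input as the user enters it. The field names and rules below are hypothetical.

```python
import re

def validate_field(name: str, value: str) -> list[str]:
    """Return a list of error messages for a single form field (empty list = valid)."""
    errors = []
    if name == "email" and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        errors.append("email is not in a valid format")
    if name == "age" and (not value.isdigit() or not 0 < int(value) < 120):
        errors.append("age must be a whole number between 1 and 119")
    return errors

# Field-level validation runs as each value is entered,
# rather than waiting for the whole form to be submitted
print(validate_field("email", "not-an-email"))  # ['email is not in a valid format']
print(validate_field("age", "34"))              # []
```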

Q4. What is an outlier?

Ans. Any observation that lies at an abnormal distance from other observations is known as an outlier. It indicates either a variability in the measurement or an experimental error.

Q5. What is the difference between data mining and data profiling?

Ans. Data profiling is usually done to assess a dataset for its uniqueness, consistency and logic. It cannot identify incorrect or inaccurate data values.

Data mining is the process of finding relevant information which has not been found before. It is the way in which raw data is turned into valuable information.

Q6. How often should a data model be retrained?

Ans. There is no fixed rule. A good data analyst understands changing market and business dynamics and retrains the model whenever the environment or the underlying data changes enough that the existing model no longer reflects it.

Q7. What is the KNN imputation method?

Ans. KNN (k-nearest neighbours) imputation matches a point with its k closest neighbours in a multi-dimensional space and uses the values of those neighbours to fill in the point's missing values.

Q8. Why is KNN used to determine missing numbers?

Ans. KNN is used for missing values under the assumption that a point value can be approximated by the values of the points that are closest to it, based on other variables.
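For illustration, scikit-learn provides a KNNImputer that fills each missing entry using the values of the k nearest rows, measured on the non-missing features. The data below is made up.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with one missing value in the first row
X = np.array([
    [1.0, np.nan, 3.0],
    [2.0, 4.0, 3.5],
    [1.5, 3.8, 2.9],
    [8.0, 9.0, 7.5],
])

# Each NaN is replaced by the average of that feature over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```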

Q9. What would you do with suspicious or missing data?

Ans. When data is doubtful or missing:

  • Make a validation report to provide information on the suspected data.
  • Have experienced personnel look at it so that its acceptability can be determined.
  • Invalid data should be updated with a validation code.
  • Use the most suitable analysis strategy to work on the missing data, such as simple imputation, the deletion method, or casewise imputation.
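A brief pandas sketch of the deletion and simple-imputation strategies mentioned above; the DataFrame is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 35],
    "income": [50_000, 62_000, None, 58_000],
})

# Deletion method: drop rows that contain any missing value
dropped = df.dropna()

# Simple imputation: replace missing values with a column statistic (here, the mean)
imputed = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(imputed)
```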

Q10. What is the k-means algorithm?

Ans. The k-means algorithm partitions a data set into k clusters such that each cluster is homogeneous and the points within a cluster are close to each other, while the clusters themselves are kept well separated. Because the algorithm is unsupervised, the clusters have no labels.
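A minimal k-means example with scikit-learn, run on synthetic data with two obvious clusters; the parameter choices are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Partition the points into k=2 clusters; labels_ are unsupervised cluster ids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)
print(kmeans.labels_[:10])
```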

Q11. What is the difference between true positive rate and recall?

Ans. There is no difference; they are the same metric, calculated with the formula:

(true positive)/(true positive + false negative)
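To confirm numerically, both definitions give the same value on a set of made-up labels.

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

# Recall as reported by scikit-learn
print(recall_score(y_true, y_pred))  # 0.6

# True positive rate computed directly: TP / (TP + FN)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
print(tp / (tp + fn))  # 0.6
```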

Q12. What is the difference between linear regression and logistic regression?

Ans.

  • Linear regression requires the dependent variable to be continuous, whereas logistic regression is used when the dependent variable is categorical (binary, or more than two categories in the multinomial case).
  • Linear regression is based on least-squares estimation, whereas logistic regression is based on maximum likelihood estimation.
  • As a rule of thumb, linear regression needs about 5 cases per independent variable, whereas logistic regression needs at least 10 events per independent variable.
  • Linear regression aims to find the best-fitting straight line, where the distances between the points and the regression line are the errors; logistic regression predicts the probability of a binary outcome, so the resulting curve is S-shaped (a sigmoid).
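The contrast can also be seen in code. The sketch below uses scikit-learn on synthetic data; it is an illustration of the two model types, not a prescribed workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))

# Linear regression: continuous target, fitted by (ordinary) least squares
y_continuous = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
linear = LinearRegression().fit(X, y_continuous)
print(linear.coef_, linear.intercept_)

# Logistic regression: binary target, fitted by maximum likelihood;
# predicted probabilities follow an S-shaped (sigmoid) curve
y_binary = (X[:, 0] > 0).astype(int)
logistic = LogisticRegression().fit(X, y_binary)
print(logistic.predict_proba([[0.0], [2.0]]))
```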

Q13. What is a good data model?

Ans. The criteria that define a good data model are:

  • It is intuitive.
  • Its data can be easily consumed.
  • The data changes in it are scalable.
  • It can evolve and support new business cases.


Q14. Estimate the number of weddings that take place in a year in India?

Ans. To answer this type of guesstimate question, one should always follow four steps:

Step 1: Start with the right proxy – here, the right proxy is the total population. You know that India has a population of more than 1 billion; to be a bit more precise, it is around 1.2 billion.

Step 2: Segment and filter – the next step is to find the right segments and filter out the ones that are not relevant. You will have a tree-like structure, with branches for each segment and sub-branches that filter each segment further. In this question, we will filter out the population above 35 years of age, and below 15 for rural areas and below 20 for urban areas.

Step 3: Always round off the proxy to one or zero decimal points so that your calculation is easy. Instead of doing a calculation like 1488/5, go for 1500/5.

Step 4: Validate each number using common sense to check whether it is reasonable. Add up all the numbers you arrive at after filtering to get the required guesstimate. For example, at the end we will adjust the figures to count first-time marriages only.

Let’s do it:

Total population – 1.2 billion

Two main population segments – Rural (70%) and Urban (30%)

Now, filtering age group and sex ratio:

Average marriage age in rural – 15 to 35 years

Average marriage age in urban – 20 to 35 years

Assuming 65% of the total population is within 0-35 years,

Percentage of population which has the probability of getting married in rural area ≈ (35-15)/35*65 ≈ 40%

Percentage of population which has the probability of getting married in urban area ≈ (35-20)/35*65 ≈ 30%

Assuming sex ratio to be 50% male and 50% female,

Total number of marriages in rural area ≈ .70*.40*1.2 billion/2 ≈ 170 million

Considering only first-time marriages in rural area ≈ 170 million/20 ≈ 8.5 million

Total number of marriages in urban area ≈ .30*.30*1.2 billion/2 ≈ 50 million

Considering only first-time marriages in urban area ≈ 50 million/15 ≈ 3 million

Thus, the total number of marriages in India in a year ≈ 11–12 million
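The same arithmetic can be laid out as a short script so that every assumption is explicit; the figures below are simply the rough assumptions used above.

```python
population = 1.2e9
rural_share, urban_share = 0.70, 0.30

# Share of each segment in the marriageable age band, assuming 65% are aged 0-35
rural_marriageable = 0.40   # (35 - 15) / 35 * 0.65, rounded up
urban_marriageable = 0.30   # (35 - 20) / 35 * 0.65, rounded up

# Divide by 2 because each marriage involves two people
rural_marriages = population * rural_share * rural_marriageable / 2   # ~168 million
urban_marriages = population * urban_share * urban_marriageable / 2   # ~54 million

# Spread the stock of married people over the width of the marriage window
rural_per_year = rural_marriages / 20   # roughly 8.5 million first-time marriages
urban_per_year = urban_marriages / 15   # roughly 3.5 million first-time marriages

print(round((rural_per_year + urban_per_year) / 1e6, 1), "million weddings per year")
```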

Q15. What is the condition for using a t-test or a z-test?

Ans. A t-test is usually used when the sample size is less than 30, and a z-test when the sample size is greater than 30.
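For example, with SciPy the t-test is available directly, and the z-statistic can be computed by hand for a larger sample; the samples below are simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Small sample (n < 30, population variance unknown): use a one-sample t-test
small_sample = rng.normal(loc=100, scale=15, size=20)
t_stat, p_val_t = stats.ttest_1samp(small_sample, popmean=100)
print(t_stat, p_val_t)

# Large sample (n > 30): the z-statistic is appropriate; computed manually here
large_sample = rng.normal(loc=100, scale=15, size=200)
z_stat = (large_sample.mean() - 100) / (large_sample.std(ddof=1) / np.sqrt(len(large_sample)))
p_val_z = 2 * stats.norm.sf(abs(z_stat))
print(z_stat, p_val_z)
```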

Q16. What are the two main methods to detect outliers?

Ans. Box plot (IQR) method: a value is considered an outlier if it lies more than 1.5*IQR (interquartile range) above the upper quartile (Q3) or more than 1.5*IQR below the lower quartile (Q1).

Standard deviation method: a value is considered an outlier if it lies outside the range mean ± (3 * standard deviation).
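Both rules are easy to apply with NumPy; the data below is synthetic, with one obvious outlier added.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=12, scale=1, size=50), 95.0)  # 95 is the planted outlier

# Box plot (IQR) method
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# Standard deviation method
mean, std = data.mean(), data.std()
sd_outliers = data[np.abs(data - mean) > 3 * std]

print(iqr_outliers)  # both methods flag the value 95 here
print(sd_outliers)
```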

Q17. Why is ‘naïve Bayes’ naïve?

Ans. It is called naïve because it assumes that all features in a dataset are independent of one another and equally important, which is rarely the case in a real-world scenario.
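A short Gaussian naive Bayes sketch with scikit-learn: the model treats each feature as conditionally independent given the class, which is exactly the "naïve" assumption. The data is synthetic and chosen so that the assumption actually holds.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two features drawn independently for each class, so the naive assumption is satisfied
X_class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X_class1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))
X = np.vstack([X_class0, X_class1])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)
print(model.predict([[0.2, -0.1], [2.8, 3.1]]))  # expected: [0 1]
```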

Q18. What is the difference between standardized and unstandardized coefficients?

Ans. A standardized coefficient is interpreted in terms of standard deviations, while an unstandardized coefficient is measured in the actual units of the variables.

Q19. What is the difference between R-squared and adjusted R-squared?

Ans. R-squared measures the proportion of variation in the dependent variables explained by the independent variables.

Adjusted R-squared gives the percentage of variation explained by only those independent variables that actually affect the dependent variable; it penalises the addition of predictors that do not improve the model.
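The relationship can be checked numerically: adjusted R² applies the penalty 1 − (1 − R²)(n − 1)/(n − p − 1). The example below uses scikit-learn on synthetic data where only some predictors are informative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# 100 samples, 5 predictors, of which only 2 actually drive the target
X, y = make_regression(n_samples=100, n_features=5, n_informative=2,
                       noise=10.0, random_state=0)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)

# Adjusted R-squared penalises predictors that add no real explanatory power
n, p = X.shape
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, adjusted_r2)
```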

Q20. What is the difference between factor analysis and principal component analysis?

Ans. Principal component analysis aims to explain as much of the total variance in the observed variables as possible, while factor analysis aims to explain the covariance (shared variance) among the variables using a smaller number of latent factors.
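The distinction can be illustrated with scikit-learn on synthetic data generated from a single hidden factor; the loadings used below are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
# Three observed variables driven by one latent factor plus independent noise
factor = rng.normal(size=(200, 1))
X = factor @ np.array([[2.0, 1.5, 1.0]]) + rng.normal(scale=0.3, size=(200, 3))

# PCA: finds the directions that capture the most total variance
pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)

# Factor analysis: models the shared (co)variance through a latent factor
fa = FactorAnalysis(n_components=1).fit(X)
print(fa.components_)
```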

Q21. What are the steps involved in a data analytics project?

Ans. The fundamental steps involved in a data analysis project are –

  • Understand the Business
  • Get the data
  • Explore and clean the data
  • Validate the data
  • Implement and track the data sets
  • Make predictions
  • Iterate

Q22. What do you do for data preparation?

Ans. Since data preparation is a critical step in data analytics, the interviewer might be interested in knowing what approach you will take to clean and transform raw data before processing and analysis. As an answer to this data analytics interview question, you should discuss the model or steps you will use, along with the logical reasoning behind them. You should also discuss how your steps would help ensure superior scalability and accelerated data usage.

Q23. What are some of the most popular tools used in data analytics?

Ans. The most popular tools used in data analytics are:

  • Tableau
  • Google Fusion Tables
  • Google Search Operators
  • Konstanz Information Miner (KNIME)
  • RapidMiner
  • Solver
  • OpenRefine
  • NodeXL
  • Import.io
  • Pentaho
  • SQL Server Reporting Services (SSRS)
  • Microsoft data management stack

Q24. What are the most popular statistical methods used when analyzing data?

Ans. The most popular statistical methods used in data analytics are –

  • Linear Regression
  • Classification
  • Resampling Methods
  • Subset Selection
  • Shrinkage
  • Dimension Reduction
  • Nonlinear Models
  • Tree-Based Methods
  • Support Vector Machines
  • Unsupervised Learning

Q25. What are the benefits of using version control?

Ans. The primary benefits of version control are –

  • Enables comparing files, identifying differences, and merging the changes
  • Allows keeping track of application builds by identifying which version is under development, QA, and production
  • Helps to improve the collaborative work culture
  • Keeps different versions and variants of code files secure
  • Allows seeing the changes made in the file’s content
  • Keeps a complete history of the project files in case of central server breakdown

Q26. What is Collaborative Filtering?

Ans. Collaborative filtering is a technique used by recommender systems to make automatic predictions (filtering) about a user's interests by collecting preference information from many users.
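A toy user-based collaborative filtering sketch is shown below; the rating matrix and the "liked" threshold are made up for illustration.

```python
import numpy as np

# Hypothetical user-item rating matrix (rows: users, columns: items); 0 = not rated
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 5, 1],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Find the user most similar to user 0 (user-based collaborative filtering)
target = 0
similarities = [cosine_sim(ratings[target], ratings[u]) for u in range(1, len(ratings))]
most_similar = int(np.argmax(similarities)) + 1  # +1 because user 0 was skipped

# Recommend items the similar user rated highly that the target has not rated yet
unrated = np.where(ratings[target] == 0)[0]
recommended = [int(i) for i in unrated if ratings[most_similar, i] >= 4]
print("Most similar user:", most_similar)
print("Recommended items:", recommended)
```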

Q27. Do you have any idea about the job profile of a data analyst?

Ans. Yes, I have a fair idea of the job responsibilities of a data analyst. Their primary responsibilities are –

  • To work in collaboration with IT, management and/or data scientist teams to determine organizational goals
  • To extract data from primary and secondary sources
  • To clean the data and discard irrelevant information
  • To perform data analysis and interpret results using standard statistical methodologies
  • To highlight changing trends, correlations and patterns in complicated data sets
  • To strategize process improvement
  • To ensure clear data visualizations for management

Q28. What is a Pivot Table?

Ans. A Pivot Table is a Microsoft Excel feature used to summarize huge datasets quickly. It sorts, reorganizes, counts, or groups data stored in a database. This data summarization includes sums, averages, or other statistics.

Q29. Name different sections of a Pivot Table.

Ans. A Pivot table has four different sections, which include –

  • Values Area
  • Rows Area
  • Column Area
  • Filter Area
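Outside Excel, the same idea is available in pandas; the small sales table below is made up, and the comments map each argument to the Pivot Table areas listed above.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "year": [2022, 2022, 2022, 2023, 2023],
    "revenue": [100, 150, 120, 130, 170],
})

# Filter area -> a boolean filter applied first; Rows area -> index;
# Column area -> columns; Values area -> values/aggfunc
pivot = pd.pivot_table(
    sales[sales["year"] == 2022],
    index="region",
    columns="product",
    values="revenue",
    aggfunc="sum",
)
print(pivot)
```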

Q30. What is Standard Deviation?

Ans. Standard deviation is a widely used measure of the degree of variation in a data set. It quantifies how far, on average, the data points are spread around the mean.

Q31. What is a data collection plan?

Ans. A data collection plan is used to collect all the critical data in a system. It covers –

  • Type of data that needs to be collected or gathered
  • Different data sources for analyzing a data set

Q32. What is an Affinity Diagram?

Ans. An Affinity Diagram is an analytical tool used to cluster or organise data into subgroups based on their relationships. These data or ideas are mostly generated during discussions or brainstorming sessions and are used in analysing complex issues.

Q33. What is imputation?

Ans. Missing data may lead to some critical issues; hence, imputation is the methodology that can help to avoid pitfalls. It is the process of replacing missing data with substituted values. Imputation helps in preventing list-wise deletion of cases with missing values.

Q34. Name some of the essential tools useful for Big Data analytics.

Ans. The important Big Data analytics tools are –

  • NodeXL
  • KNIME
  • Tableau
  • Solver
  • OpenRefine
  • Rattle GUI
  • Qlikview

Q35. What is a Fact Table?

Ans. A fact table is the central table of a dimensional (star or snowflake) schema. It stores the quantitative measurements, or facts, of a business process, along with keys to the related dimension tables. The three common types are –

  • Cumulative (transactional) fact table
  • Snapshot fact table
  • Factless fact table

Q36. What is data visualization?

Ans. In simpler terms, data visualization is the graphical representation of information and data. It enables users to view and analyse data more effectively by using technology to present it in diagrams and charts.

Q37. Why should you choose data visualization?

Ans. Since it is easier to view and understand complex data in the form of charts or graphs, the trend of data visualization has picked up rapidly.

Q38. What is metadata?

Ans. Metadata refers to the detailed information about the data system and its contents. It helps to define the type of data or information that will be stored.

Q39. What is the main difference between overfitting and underfitting?

Ans. Overfitting – In overfitting, a statistical model describes random error or noise instead of the underlying relationship; it occurs when a model is excessively complex. An overfit model has poor predictive performance because it overreacts to minor fluctuations in the training data.

Underfitting – In underfitting, a statistical model is unable to capture the underlying data trend. This type of model also shows poor predictive performance.
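The contrast is easy to reproduce by varying model complexity; the sketch below fits polynomials of different degrees to noisy synthetic data with scikit-learn, and the degree choices are arbitrary.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A degree-1 line typically underfits, a moderate degree fits well,
# and a very high degree tends to score much better on train than on test
for degree, label in [(1, "underfit"), (4, "reasonable"), (15, "overfit")]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(label,
          "train R2:", round(model.score(X_train, y_train), 2),
          "test R2:", round(model.score(X_test, y_test), 2))
```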

Q40. What are some Python libraries used in Data Analysis?

Ans. Some of the vital Python libraries used in data analysis include –

  • Bokeh
  • Matplotlib
  • NumPy
  • Pandas
  • scikit-learn
  • SciPy
  • Seaborn
  • TensorFlow
  • Keras

Looking to improve your skills or get a certificate in data analytics? Naukri Learning offers a variety of professional training courses in Data Science, data analytics and big data, which can help you to start a promising career filled with opportunities.