Do you know which job has been called the “Sexiest Job of the 21st Century”? It is none other than the data scientist. Data science is an emerging field, and the number of jobs in it has increased enormously. If you want to become a data scientist, you need to be able to crack the data science interview. I have compiled a list of the most frequently asked data science interview questions.
Q-1. How do you explain the term Data Science?
Data science encompasses preparing data for analysis, including cleansing, aggregating, and manipulating the data to perform advanced data analysis. Analytic applications and data scientists can then review the results to uncover patterns and enable business leaders to draw informed insights.
Q-2. List the differences between unsupervised learning and supervised learning.
|Supervised Learning|Unsupervised Learning|
|---|---|
|Input is known and labelled data.|Input is unlabelled data.|
|There is a feedback mechanism.|There is no feedback mechanism.|
Q-3. What are feature vectors?
A feature vector is a vector of numerical features that represents an object. Feature vectors express the numeric or symbolic characteristics (features) of an object in a mathematical form that is easy to analyse.
Q-4. What is a random forest?
Random forest is a machine learning method that builds many decision trees and combines them. It can be used for both classification and regression tasks.
Q-5. How does a random forest work?
The main idea of this technique is that many weak learners combine to form one strong learner.
- Predictions are made by majority rule across the trees.
- Every time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors.
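The two ideas in this answer, voting across trees and sampling m of the p predictors at each split, can be sketched in plain Python. This is only an illustration, not a full random forest: the helper names `majority_vote` and `sample_split_candidates`, the example predictors, and the choice of m as the square root of p (a common convention) are all assumptions made for this sketch.

```python
import math
import random
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def sample_split_candidates(predictors, rng):
    """Pick a random subset of m predictors to consider at one split.

    Assumption: m = floor(sqrt(p)), a common default for classification.
    """
    m = max(1, int(math.sqrt(len(predictors))))
    return rng.sample(predictors, m)

# Hypothetical predictions from five trees for a single observation
votes = ["spam", "ham", "spam", "spam", "ham"]
forest_prediction = majority_vote(votes)

# Hypothetical predictor names (p = 9, so m = 3)
predictors = ["age", "income", "city", "clicks", "tenure",
              "plan", "visits", "device", "region"]
candidates = sample_split_candidates(predictors, random.Random(0))

print(forest_prediction)
print(candidates)
```

Because each split only sees a random subset of predictors, the trees end up less correlated with one another, which is what makes the vote more reliable than any single tree.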
Q-6. Explain the difference between a Data Analyst and a Data Scientist.
Data scientists extract valuable insights from data, which analysts then apply to real-world business situations. The main difference between them is that a data scientist needs a lot more technical knowledge than a data analyst.
Q-7. In Data Science, what is logistic regression?
Logistic regression is a method for predicting a binary outcome from a linear combination of predictor variables. It is also known as the logit model.
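Here is a minimal sketch of how a linear combination is turned into a binary prediction. The weights, bias, and the 0.5 cut-off are made-up values for illustration; in practice they would be learned from training data.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class from a linear combination."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical, hand-picked parameters (not fitted to any data)
weights = [0.8, -0.4]
bias = -0.2

probability = predict_probability([2.0, 1.0], weights, bias)
label = 1 if probability >= 0.5 else 0  # assumed decision threshold of 0.5

print(round(probability, 4), label)
```

The linear part works exactly as in linear regression. The sigmoid is what converts it into a probability that can be thresholded into two classes.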
Q-8. What is linear regression?
Linear regression is a statistical technique in which the score of one variable, Y, is predicted from the score of a second variable, X. X is referred to as the predictor variable and Y is called the criterion variable.
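As a small illustration, the line of best fit for one predictor can be computed with the ordinary least squares formulas. The function name `fit_line` and the toy data are assumptions for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data that lies exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept  # predict Y for a new X = 6

print(slope, intercept, predicted)
```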
Q-9. What is Selection bias?
Selection bias is a type of error that occurs when the researcher decides who or what is going to be studied. It is also known as the selection effect. If selection bias is not taken into account, some conclusions of the study may not be accurate.
There are different types of selection bias, including:
- Sampling bias
- Time interval
Q-10. Write a basic SQL query that lists all orders with the information of the customer.
Assume the following two tables:
- Order table: Order ID, Customer ID, Order number, Total amount
- Customer table: First name, Last name
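One possible answer is a simple join between the two tables. The sketch below runs the query against an in-memory SQLite database from Python so that it can be tried directly. The table and column names, the sample rows, and in particular the `id` key on the customer table are assumptions, since the question does not spell them out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,      -- assumed key column, not listed in the question
    first_name TEXT,
    last_name TEXT
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(id),
    order_number TEXT,
    total_amount REAL
);
-- Made-up sample rows
INSERT INTO customer VALUES (1, 'Asha', 'Rao'), (2, 'Ben', 'Lee');
INSERT INTO orders VALUES
    (10, 1, 'A-100', 250.0),
    (11, 2, 'A-101', 80.5),
    (12, 1, 'A-102', 40.0);
""")

# List every order together with the customer who placed it
query = """
SELECT o.order_number, o.total_amount, c.first_name, c.last_name
FROM orders AS o
JOIN customer AS c ON o.customer_id = c.id
ORDER BY o.order_id;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)

conn.close()
```

An inner join is used here, so an order without a matching customer would be left out. A `LEFT JOIN` from the order table would keep such orders.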
Q-11. Explain the steps involved in making a decision tree.
- Take the entire data set as input.
- Calculate the entropy of the target variable and of the predictor attributes.
- Calculate the information gain of every attribute and choose the attribute with the highest information gain as the root node.
- Repeat the same procedure on each branch until the decision node of every branch is finalised.
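The entropy and information gain calculations behind these steps can be sketched as follows. The tiny data set and the attribute names are invented for illustration, and only the choice of the root node is shown, not the full recursive tree building.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Made-up data set: should we play outside?
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
root = max(gains, key=gains.get)  # attribute with the highest gain

print(gains, root)
```

In this toy data, `outlook` separates the classes perfectly while `windy` tells us nothing, so `outlook` becomes the root node.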
Q-12. What do you mean by true positive rate and false-positive rate?
The probability that an actual positive will test as positive is known as the true positive rate (TPR). It is the ratio of true positives to all actual positives: TPR = TP / (TP + FN).
The false positive rate (FPR) is the probability of a false alarm, which means that a positive result is given when the actual result is negative. It is the ratio of false positives to all actual negatives: FPR = FP / (FP + TN).
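A quick way to make both rates concrete is to compute them from a list of actual and predicted labels, using TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The labels below are made up, and `rates` is a hypothetical helper name.

```python
def rates(actual, predicted):
    """Return (TPR, FPR) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    tpr = tp / (tp + fn)  # share of actual positives that were caught
    fpr = fp / (fp + tn)  # share of actual negatives that raised a false alarm
    return tpr, fpr

# Made-up labels: 4 actual positives, 6 actual negatives
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tpr, fpr = rates(actual, predicted)
print(tpr, fpr)
```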
Q-13. Why do we use R in Data Visualisation?
We use R in data visualisation because it has many built-in functions and packages that help with visualisation, including ggplot2, leaflet, and lattice.
R is also useful in exploratory data analysis and feature engineering. Almost any type of graph can be created. Many people find customising graphics easier in R than in Python.
Q-14. What is the ROC curve?
The ROC curve is the graph of the true positive rate against the false positive rate at different classification thresholds. The false positive rate is plotted on the X-axis and the true positive rate on the Y-axis. As we know, the true positive rate is the ratio of true positives to the total number of positive samples. The area under the ROC curve lies between 0 and 1. The further the ROC curve rises above the diagonal line of a random classifier, the better the model is.
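To see where the points on the curve come from, the sketch below sweeps a decision threshold over some model scores, records a (false positive rate, true positive rate) pair at each step, and estimates the area under the curve with the trapezoidal rule. The scores, labels, and function names are invented for illustration.

```python
def roc_points(scores, labels):
    """Sweep the threshold from high to low and collect (FPR, TPR) points."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up model scores and true labels (1 = positive)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
area = auc(curve)
print(curve)
print(round(area, 4))
```

A random classifier follows the diagonal and has an area of about 0.5, so an area close to 1 indicates a model that ranks positives above negatives most of the time.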
Q-15. What is dimensionality reduction and what are its benefits?
Reducing the number of features for a given dataset is referred to as dimensionality reduction.
The methods used to reduce dimensionality are:
- Linear Discriminant Analysis (LDA)
- Principal Component Analysis (PCA)
- Matrix Factorisation
- Feature Selection Methods
As the number of features increases, the model becomes more complex. If the number of data points is small compared with the number of features, the model will start to overfit the data and will not generalise. This is known as the curse of dimensionality.
The benefits of dimensionality reduction are:
- Computation time is reduced.
- Storage space, and hence space complexity, is reduced.
- Data becomes easier to visualise in 2D or 3D.
Q-16. How can we find Root Mean Squared Error (RMSE) and Mean Square Error (MSE)?
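Root Mean Squared Error (RMSE) is used to measure the performance of a linear regression model. It is the square root of the MSE and tells us how widely the data is spread around the line of best fit.
Mean Squared Error (MSE) is the average of the squared differences between the actual values and the values predicted by the line, so it shows how close the line is to the actual data. The lower the MSE, the better the model.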
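Both metrics take only a few lines to compute. In this sketch MSE is the mean of the squared differences between actual and predicted values, and RMSE is its square root. The numbers are made up.

```python
import math

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mean_squared_error(actual, predicted))

# Made-up actual values and model predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

mse = mean_squared_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mse, round(rmse, 4))
```

RMSE is often easier to interpret because it is in the same units as the value being predicted.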
Q-17. What is unbalanced binary classification? How do you deal with it?
In binary classification, if the data set is not balanced, accuracy alone does not reliably measure the performance of the model, because a model can score high accuracy simply by predicting the majority class.
The following methods help to deal with unbalanced binary classification:
- Using other evaluation metrics such as precision, recall, and F1 score.
- Resampling the data: undersampling (reducing the size of the majority class), oversampling (increasing the size of the minority class), and similar techniques.
- Using ensemble learning so that every decision tree considers the entire sample of the minority class and only a subset of the majority class.
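The sketch below shows why accuracy can be misleading on an imbalanced data set and how precision, recall, and F1 score expose the problem. The data is invented: 5 positives among 100 samples, with a model that finds only one of them.

```python
def precision_recall_f1(actual, predicted):
    """Precision, recall, and F1 for binary labels where 1 is positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up imbalanced data: 5 positives, 95 negatives
actual = [1] * 5 + [0] * 95
# Hypothetical model that catches only the first positive
predicted = [1, 0, 0, 0, 0] + [0] * 95

accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)
precision, recall, f1 = precision_recall_f1(actual, predicted)

print(accuracy, precision, recall, round(f1, 4))
```

Accuracy looks excellent here even though the model misses four out of five positives, which is exactly what recall and F1 score reveal.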
Q-18. Mention the difference between a box plot and a histogram.
Both box plots and histograms are used to represent the distribution of data visually.
A histogram is used to understand the underlying probability distribution of the data.
A box plot is used to compare several data sets.
The major difference is that box plots show less detail and take up less space than histograms.
Q-19. Expand NLP and define it.
NLP stands for Natural Language Processing. It is the study of programming computers to process and analyse large amounts of textual data.
Q-20. Which type of maintenance is required for a deployed model?
When a model is deployed, it needs to be maintained. The data may change over time. In the case of a model that is predicting house prices, the prices of the houses may rise or fluctuate. The accuracy of the model on new data has to be recorded.
Let us discuss some ways to ensure the accuracy of the model:
First, the model should be checked often by feeding it negative test data. If the model gives low accuracy on negative test data, it is behaving as expected.
Second, an autoencoder should be built to detect anomalies: a high reconstruction error means the new data no longer follows the pattern the model learned.
If the model has good prediction accuracy even with new data, it means the new data follows the generalisation learned by the model. If the accuracy on new data is not good, the model has to be retrained, possibly from the beginning.
Q-21. What do you mean by Eigenvector and Eigenvalue?
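Eigenvectors: Eigenvectors help us understand linear transformations. They are the directions along which a linear transformation acts by flipping, stretching, or compressing. In data analysis, we usually calculate the eigenvectors of a covariance or correlation matrix.
Eigenvalues: An eigenvalue is the strength of the transformation in the direction of its eigenvector, that is, the factor by which the stretching or compression occurs.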
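The defining property is that multiplying a matrix by one of its eigenvectors only scales the vector by the eigenvalue. The small check below uses a hand-picked 2×2 matrix whose eigenvectors and eigenvalues are known in advance. The matrix and the helper `mat_vec` are assumptions for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

# Hand-picked symmetric matrix (similar in shape to a small covariance matrix)
A = [[2, 1],
     [1, 2]]

# Known eigenpairs of A
v1, lambda1 = [1, 1], 3    # stretched by a factor of 3
v2, lambda2 = [1, -1], 1   # left unchanged (factor of 1)

stretched = mat_vec(A, v1)
unchanged = mat_vec(A, v2)

print(stretched, [lambda1 * x for x in v1])
print(unchanged, [lambda2 * x for x in v2])
```

In each printed pair the two lists are equal, which is exactly the relationship A·v = λ·v.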
Q-22. Discuss the steps involved in a data analytics project.
- Understand the business problem.
- Explore the data and study it carefully.
- Prepare the data for modelling by finding missing values and transforming variables.
- Run the model and examine the results.
- Implement the model and track the results to analyse its performance over time.
Q-23. What are Artificial Neural Networks?
Artificial Neural Networks (ANN) are a set of algorithms that have transformed machine learning. They adapt to changing input, so the network generates the best possible result without the output criteria having to be redesigned.
Q-24. What do you understand about Back Propagation?
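Back propagation is the essence of neural net training. It is the method of tuning the weights of a neural network based on the error rate obtained in the previous epoch. With proper tuning, we can reduce error rates and make the model more reliable.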
Q-25. What is the importance of selection bias?
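Selection bias happens when there is no proper randomisation while picking the individuals, groups, or data to be analysed. It matters because the sample obtained is then not representative of the population that was meant to be analysed.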
Q-26. What do you mean by p-value?
In a statistical hypothesis test, the p-value helps us to determine the strength of our results. The p-value is a number between 0 and 1, and the smaller it is, the stronger the evidence against the null hypothesis.
Q-27. When should you update the Data science algorithm?
You must update an algorithm in the following situations:
First, when you want your model to evolve as data streams through the infrastructure.
Second, when the underlying data source is changing.
Q-28. List the different types of Deep Learning Frameworks.
- Microsoft Cognitive Toolkit
Q-29. Why do you think Data Cleansing is important and which method do you prefer to maintain clean data?
Dirty data frequently leads to incorrect insights, which can damage the organisation. For example, if a marketing campaign is built on incorrect data, the campaign is likely to fail.
Q-30. How do you think Data Science and Machine Learning are related to each other?
Data Science and Machine Learning are two closely related terms that are often misunderstood. Both of them deal with data.
Data Science is a field that deals with large volumes of data and helps us to draw insights from this data. The process involves many important steps, such as data gathering, data visualisation, and data manipulation.
Machine Learning can be called a sub-field of Data Science. It also deals with data, but its main focus is on learning how to convert data into functional models, which are used to map inputs to outputs.
In short, Data Science covers gathering data, processing it, and drawing insights from it, while Machine Learning is the part of Data Science that deals with building models using algorithms.
Q-31. When do you think underfitting occurs in a statistical model?
Underfitting occurs when the statistical model or machine learning algorithm is unable to capture the underlying trend of the data, for example when a linear model is fitted to non-linear data.
Q-32. Define reinforcement learning.
Reinforcement learning is about learning how to map situations to actions so as to maximise a numerical reward signal. The learner is not told which action to take and must discover which action offers the maximum reward.
Q-33. What do you mean by the term confusion matrix?
A confusion matrix is a table that describes the performance of a supervised learning algorithm. It provides a detailed summary of the prediction results on a classification problem. With the help of the confusion matrix, we can find not only the errors made by the classifier but also the types of error, such as false positives and false negatives.
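For a binary problem, the confusion matrix is just four counts. Here is a minimal sketch that builds it from made-up labels. The dictionary layout is an assumption chosen for readability.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) combination for binary labels."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(1, 1)],  # positive, predicted positive
        "FN": counts[(1, 0)],  # positive, predicted negative
        "FP": counts[(0, 1)],  # negative, predicted positive
        "TN": counts[(0, 0)],  # negative, predicted negative
    }

# Made-up labels (1 = positive class)
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

matrix = confusion_matrix(actual, predicted)
print(matrix)
```

The two off-diagonal counts, FN and FP, are the two types of error, and metrics such as precision, recall, and the true and false positive rates are all derived from these four numbers.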
Q-34. List the difference between supervised learning and unsupervised learning.
- Supervised learning: This is a type of Machine Learning where a function is inferred from labelled training data.
- Unsupervised learning: This is a type of Machine Learning where inferences are drawn from datasets that contain input data without labelled responses.
Q-35. What is a validation set and how is it different from a test set?
A validation set is a part of the training data that is held out for hyperparameter selection and for avoiding overfitting of the model. A test set, on the other hand, is used to evaluate the performance of the trained machine learning model on data it has not seen before.
Q-36. What are outlier values?
Outlier values are data points in statistics that do not belong to a certain population. An outlier value is an uncommon observation that is different from other values in the same set.
Q-37. Tell us about Gradient Descent.
A gradient is the degree of change in the output of a function in response to changes in its inputs. Gradient descent is an iterative algorithm for minimising a given cost function: at each step it moves the parameters a small distance in the direction opposite to the gradient.
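Here is a minimal sketch of gradient descent on a one-variable function. The function f(x) = (x - 3)², the learning rate, and the number of steps are arbitrary choices for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# Example function f(x) = (x - 3)^2 with derivative f'(x) = 2 * (x - 3).
# Its minimum is at x = 3.
def derivative(x):
    return 2 * (x - 3)

minimum = gradient_descent(derivative, start=0.0)
print(round(minimum, 6))
```

If the learning rate is too large, the steps overshoot and can diverge. If it is too small, convergence is very slow.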
Q-38. Explain the term Autoencoders.
- Autoencoders are learning networks that transform inputs into outputs with as little error as possible, which means the outputs should be as close to the inputs as possible.
- A few layers are added between the input and the output, and each of these layers is smaller than the input layer.
Q-39. Expand GAN and explain it.
GAN stands for Generative Adversarial Network. A GAN takes a noise vector as input and forwards it to the generator, which produces fake samples. These are then passed to the discriminator, which tries to distinguish the real inputs from the fake ones.
Q-40. What is Root Cause Analysis?
Root Cause Analysis was originally developed to examine industrial accidents but is now widely used in other areas. It is a problem-solving technique for isolating the root causes of problems.
Q-41. What are recommender systems?
Recommender systems are a subclass of information filtering systems that predict the preference or rating a user would give to a product.
Q-42. What do you mean by collaborative filtering?
Collaborative filtering is the process, used by most recommender systems, of finding patterns and information by combining the perspectives of many users, data sources, and agents.
Q-43. What are the disadvantages of the linear model?
- It cannot be used for count outcomes or binary outcomes.
- It cannot solve overfitting problems.
Q-44. What are the different types of biases that can occur while sampling?
- Survivorship bias
- Selection bias
- Undercoverage bias
Q-45. What is survivorship bias?
Survivorship bias is the logical error of focusing on the aspects that survived some process while overlooking those that did not. This leads to wrong conclusions.
Q-46. What do you understand by Ensemble Learning?
Ensemble learning is the process of combining a diverse set of learners (individual models). It helps to improve the stability and predictive power of the model.
Q-47. What are the different types of Ensemble Learning?
- Boosting: Boosting is an iterative technique that adjusts the weight of an observation based on the previous classification.
- Bagging: Bagging trains similar learners on small sample populations and then takes the mean of all their predictions.
Q-48. Name the different design schemas in Data Modelling?
Snowflake Schema and Star Schema are the different design schemas in Data Modelling.
Q-49. What is Principal Component Analysis?
Principal Component Analysis is a technique that finds patterns in data for dimensionality reduction.
Q-50. How do you handle the missing values in your data?
Missing values can be handled by deleting the rows or columns that contain them.
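As an illustration, the sketch below first counts the missing values per column, then drops columns that are mostly empty and finally drops any remaining rows with gaps. The data, the helper names, and the 50% threshold are assumptions made for this example.

```python
def missing_counts(rows):
    """Number of missing (None) values in each column."""
    return {c: sum(1 for r in rows if r[c] is None) for c in rows[0]}

def drop_columns(rows, threshold):
    """Drop columns whose share of missing values exceeds the threshold."""
    counts = missing_counts(rows)
    keep = [c for c, n in counts.items() if n / len(rows) <= threshold]
    return [{c: r[c] for c in keep} for r in rows]

def drop_rows(rows):
    """Drop every row that still contains a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

# Made-up records with gaps
rows = [
    {"age": 34,   "income": 52000, "notes": None},
    {"age": None, "income": 61000, "notes": None},
    {"age": 29,   "income": 48000, "notes": None},
]

counts = missing_counts(rows)
cleaned = drop_rows(drop_columns(rows, threshold=0.5))

print(counts)
print(cleaned)
```

Dropping the mostly empty column first avoids throwing away every row just because one column is never filled in.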
Henry Harvin Analytics Academy
Henry Harvin Analytics Academy offers a Data Science course online. The course consists of thirty-two hours of live, interactive online classes. All the trainers have more than ten years of experience. The institute helps you to gain hands-on experience through an internship. You will also receive a recognised certificate after finishing the course. The course is offered in the following formats:
Self-paced: INR 13000
Live online class: INR 15000
I strongly recommend investing in this course to build your career and make your CV stand out from those of your peers.
The list provided above is for both beginners and experienced candidates. We all know the importance of data science, and I hope this list of questions helps you land your dream job as a data scientist.