In this post I’m going to analyze the factors affecting students’ performance in their final exams. At the end of the post, I’ll use logistic regression to predict whether a student passes the exam, using demographic, social and school-related attributes. If you are unfamiliar with logistic regression or need to brush up on the concept, it’s worth checking my last article on the topic.
Data Set Information:
This data covers student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). I will work with the Portuguese dataset since it contains more students than the maths dataset. You can get the data and the code from here. The dataset originally comes from the UCI repository.
Our goal is to analyze how these factors affect students’ performance and to predict whether a student will pass the exam.
Note : This dataset contains over 30 features and I’ll only cover some of them in this post. You can check the github repository for the complete analysis.
To start with our problem, we first need to import all the dependencies that we will use later.
Importing Data :
To work with the data, we first need to import it.
import pandas as pd
df_por = pd.read_csv('student-por.csv', sep=';')  # the file is semicolon-separated
Once the data is imported, our first step is to check the dataset and gather as much information as possible, to gain intuition about the data.
df_por.head()  # displays the first 5 rows of the dataset
Each row contains the information of one student. Now let’s look at all the attributes and their types.
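One way to list the attributes and their types is sketched below. The helper name `summarize_columns` and the tiny sample frame are my own illustrations, not part of the original notebook; on the real data you would pass `df_por` instead.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, unique count and missing count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        "n_missing": df.isna().sum(),
    })

# Tiny hypothetical sample that reuses a few of the dataset's column names
sample = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [15, 17, 18],
    "G3": [11, 13, 9],
})
print(summarize_columns(sample))
```

Calling `df_por.info()` gives a similar overview in a single line if you don’t need the result as a table.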
G1, G2 and G3 are the first-period, second-period and final grades of the students. We’ll see how all the other attributes are related to G3 (the final grade).
Analyzing the relation of Attributes with G3 :
This data is taken from two schools, GP and MS. The school can play an important role in a student’s marks, since teaching, faculty and so on differ between schools. To visualize the performance of students at these schools, I will use box plots:
In this plot, it’s clear that students at GP perform better on average than students at MS.
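To put numbers on what a box plot shows, here is a minimal sketch that computes the quartiles of G3 per category. The function name `grade_summary_by` and the sample values are hypothetical, chosen only to illustrate the idea.

```python
import pandas as pd

def grade_summary_by(df: pd.DataFrame, column: str, target: str = "G3") -> pd.DataFrame:
    """Quartiles of `target` per category of `column` (the numbers a box plot draws)."""
    summary = df.groupby(column)[target].describe()
    return summary[["count", "25%", "50%", "75%"]].rename(columns={"50%": "median"})

# Hypothetical grades, not taken from the real dataset
sample = pd.DataFrame({
    "school": ["GP", "GP", "GP", "MS", "MS", "MS"],
    "G3": [10, 12, 14, 8, 10, 12],
})
print(grade_summary_by(sample, "school"))
```

The same helper works for any categorical column (`sex`, `guardian`, `higher` and so on). If matplotlib is installed, `df_por.boxplot(column="G3", by="school")` draws the corresponding plot.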
To start with gender, it’s useful to see how many of the students are girls and how many are boys:
To evaluate the difference in performance between boys and girls, we’ll again use a box plot.
On average, female students perform better than male students: the median score of girls is higher than that of boys, and so are the high scores (with a few exceptions).
We’ll see if age affects the score, but first we need to know the range of ages present in our data:
Since we have a wide range of ages, I’ll group them into 3 groups:
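A grouping like this can be sketched with `pd.cut`. The exact bin edges below (15-16, 17-18, 19+) are my assumption; the post only tells us that there are three groups and that one of them is 19+.

```python
import pandas as pd

def add_age_group(df: pd.DataFrame,
                  bins=(14, 16, 18, 120),
                  labels=("15-16", "17-18", "19+")) -> pd.DataFrame:
    """Return a copy of df with an 'age_group' column (bin edges are assumed)."""
    out = df.copy()
    # Intervals are right-inclusive: (14, 16], (16, 18], (18, 120]
    out["age_group"] = pd.cut(out["age"], bins=list(bins), labels=list(labels))
    return out

# Hypothetical ages for illustration
sample = pd.DataFrame({"age": [15, 16, 17, 18, 19, 22]})
print(add_age_group(sample))
```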
Now this attribute is easy to evaluate.
From this we can say that students aged 19+ score between roughly 7 and 14.
Our data has 3 types of guardian: father, mother and other. Looking at the box plot, we can see that students whose guardian is someone other than their parents score average marks, but have very little chance of failing.
Study Time :
We might assume that study time affects a student’s performance, but that is not always true: the quality of study matters more than the quantity. The values in this column give the weekly study time (numeric: 1 : <2 hours, 2 : 2 to 5 hours, 3 : 5 to 10 hours, or 4 : >10 hours).
Most of the students study less than 5 hours a week.
We can see that students who study less than 2 hours tend to perform worse than the others. We can also notice a small dip for students who study more than 10 hours, maybe because they are not getting enough rest or are studying ineffectively.
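The coded study-time values are easier to read once they are mapped to their labels. Here is a small sketch: the mapping comes from the column description above, while the helper name and the sample values are hypothetical.

```python
import pandas as pd

# Code-to-label mapping from the column description
STUDYTIME_LABELS = {1: "<2 hours", 2: "2 to 5 hours", 3: "5 to 10 hours", 4: ">10 hours"}

def studytime_share(df: pd.DataFrame) -> pd.Series:
    """Share of students in each weekly study-time band."""
    labelled = df["studytime"].map(STUDYTIME_LABELS)
    return labelled.value_counts(normalize=True)

# Hypothetical codes for eight students
sample = pd.DataFrame({"studytime": [1, 1, 2, 2, 2, 3, 4, 2]})
print(studytime_share(sample))
```

Adding the shares of the first two bands gives the fraction of students who study less than 5 hours a week.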
A student’s record of past failures is a key feature that can help us predict their performance. It contains the number of times the student has failed.
Students with no failure records perform better than the rest.
Higher Education :
We have a feature that records whether the student wants to pursue higher education or not. The analysis shows that students who don’t want to pursue higher education perform worse than the others.
We also have a record of student absences, which the dataset description lists as numeric values from 0 to 93. To see how the number of absences affects a student’s performance, I’ll visualize the data using a scatter plot.
The slope of the fitted line shows that as the number of absences increases, the grades decrease. For a better analysis, I’ll group the absences and draw a box plot.
It is clear that as absences increase, the performance of the students decreases.
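Both steps can be sketched in a few lines: a least-squares slope for the trend, and `pd.cut` for the grouping. The bin edges and the sample values below are assumptions for illustration, since the post does not state the groups that were actually used.

```python
import numpy as np
import pandas as pd

def absence_trend(df: pd.DataFrame) -> float:
    """Slope of a least-squares line G3 ~ absences (grade points per absence)."""
    x = df["absences"].to_numpy(dtype=float)
    y = df["G3"].to_numpy(dtype=float)
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)

def add_absence_group(df: pd.DataFrame,
                      bins=(-1, 5, 10, 100),
                      labels=("0-5", "6-10", "11+")) -> pd.DataFrame:
    """Return a copy of df with an 'absence_group' column (bin edges are assumed)."""
    out = df.copy()
    out["absence_group"] = pd.cut(out["absences"], bins=list(bins), labels=list(labels))
    return out

# Hypothetical data where each absence costs half a grade point
sample = pd.DataFrame({"absences": [0, 2, 4, 6, 8, 12],
                       "G3": [14, 13, 12, 11, 10, 8]})
print(absence_trend(sample))
print(add_absence_group(sample))
```

A negative slope means grades fall as absences rise. The grouped column can then be used for a box plot in the same way as the other categorical attributes.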
- address : Urban students are distributed widely in terms of scores, whereas rural students are clustered around 11.
- famsize : Family size didn’t affect the scores, so we dropped this column.
- Pstatus : The performance of students is not much affected by whether their parents are together or apart.
- Medu & Fedu (Parents’ Education) : Students whose parents have completed higher education have a slightly better chance of performing well in exams.
- Mjob & Fjob (Parents’ Job) : Students whose parents are teachers perform better in school.
- Reason (to choose the school) : Students who chose their school for its reputation have better scores.
- Traveltime : Travel time to school doesn’t affect the scores much.
- Schoolsup (School support) : Students with school support have a much higher chance of passing the exam.
- Famsup : Surprisingly, there’s almost no difference in students’ grades whether their family supports them in their education or not.
- Paid : Very few students attend paid classes, so keeping this feature risks overfitting our model. We’ll drop it.
- Activities : Does not affect the scores much.
- nursery : There’s not much difference in scores.
- internet : Students who do not have internet access at home tend to score average marks, while students with internet access are spread widely across the scoring range.
- romantic relation : Not much difference.
- famrel : Surprisingly, family relations do not affect the students’ performance much.
- freetime : Not useful.
- goout : Going out with friends doesn’t affect performance much, but students who go out very frequently tend to score lower marks.
- Dalc and Walc : Students with high alcohol consumption score lower.
- health : There’s not much difference in the scores with respect to students’ health in the given data.
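A quick way to screen many attributes like the ones listed above is to compare the median G3 across each attribute’s categories. This is only a rough sketch of that idea, not the method used for the plots in this post; the helper name `median_gap` and the sample values are hypothetical.

```python
import pandas as pd

def median_gap(df: pd.DataFrame, columns, target: str = "G3") -> pd.Series:
    """Gap between the highest and lowest per-category median of `target`, per column."""
    gaps = {}
    for col in columns:
        medians = df.groupby(col)[target].median()
        gaps[col] = medians.max() - medians.min()
    # Largest gaps first: these columns separate the grades the most
    return pd.Series(gaps).sort_values(ascending=False)

# Hypothetical values, not taken from the real dataset
sample = pd.DataFrame({
    "schoolsup": ["yes", "yes", "no", "no"],
    "famsup": ["yes", "no", "yes", "no"],
    "G3": [14, 12, 10, 8],
})
print(median_gap(sample, ["schoolsup", "famsup"]))
```

A gap close to zero suggests the attribute barely separates the grades, although it is still worth looking at the box plot before dropping a column.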
To apply the logistic regression model to predict whether a student passes or fails, we need to convert the categorical attributes into dummy variables and also set a minimum score for the student to pass the exam.
I’ve set the minimum score to 8 (40% of the total score).
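Here is a minimal sketch of this preparation step using `pd.get_dummies`. Treating a grade of exactly 8 as a pass and dropping the first dummy level of each attribute are my assumptions; the sample frame is hypothetical.

```python
import pandas as pd

PASS_MARK = 8  # 40% of the maximum grade of 20, as set in the post

def prepare_features(df: pd.DataFrame, pass_mark: int = PASS_MARK):
    """Return (X, y): one-hot encoded features and a 0/1 pass label."""
    # Assumption: a final grade equal to the pass mark counts as a pass
    y = (df["G3"] >= pass_mark).astype(int)
    # Assumption: drop_first=True avoids one redundant dummy per attribute
    X = pd.get_dummies(df.drop(columns=["G3"]), drop_first=True, dtype=int)
    return X, y

# Hypothetical rows that reuse a few of the dataset's column names
sample = pd.DataFrame({
    "school": ["GP", "MS", "GP", "MS"],
    "higher": ["yes", "yes", "no", "yes"],
    "failures": [0, 1, 3, 0],
    "G3": [12, 8, 5, 15],
})
X, y = prepare_features(sample)
print(X)
print(y.tolist())
```

Numeric columns such as `failures` pass through unchanged, while each text column becomes one or more 0/1 columns such as `school_MS`.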
Now our data is ready for training and testing, so I’ll split the dataset into training and testing sets.
Note : I won’t use ‘G1’ and ‘G2’ for training or testing.
Now we’ll train the model and compute its score on the test set to see how well it performs.
Our model is doing pretty well: it predicts the correct output more than 95% of the time.
Note : There are more detailed methods to evaluate the model’s performance, but I won’t discuss them in this blog post.
If you liked this post, please share and subscribe. I’ve done this analysis without any guidance, so if you have any feedback on it, please let me know.