Comparative SVM and Decision Tree Algorithm in Identifying the Eligibility of KIP Scholarship Awardee

The scholarship selection process follows specific rules, but when the number of applicants exceeds the quota, a further selection process is needed. Based on observation at a university in Sukabumi, the selection for the KIP scholarship does not yet have a standard method. Several methods can assist the selection process, such as classification based on historical applicant data. The algorithms used for classification in this study are Decision Tree (DT) and Support Vector Machine (SVM). The research follows the SEMMA (Sample, Explore, Modify, Model, Assess) method. The dataset of KIP scholarship awardees from 2021-2022 consists of 519 samples with 16 attributes. From the exploration results, the most important features for modeling are DTKS status, P3KE status, father's income, mother's income, combined income, and performance. These attributes are converted into numerical data to facilitate model fitting. K-Fold Cross-Validation of the Decision Tree model for KIP scholarship classification yields an accuracy of 78.44% over the entire test dataset, a precision of 0.73107 (73.11%), a recall (sensitivity) of 78.45%, and an F1 score of 73.20%. The SVM model yields an accuracy of 80.17%, a precision of 84.44%, and a recall of 80.17%.


Introduction
Kartu Indonesia Pintar (KIP) Kuliah is one of the Smart Indonesia Programs stipulated in Ministry of Education and Culture Regulation No. 10 of 2020 and is intended for students who are admitted to higher education institutions [1]. KIP Kuliah Merdeka aims to increase the economic potential and social mobility of students from poor or vulnerable families by enabling them to attend college [2].
Each university has a quota for KIP Kuliah students. The number of KIP scholarship awardees is calculated based on the university's accreditation rank. If the number of applicants exceeds the quota, not all applicants can be accommodated, so a re-selection process is needed to ensure that the awardees are truly deserving students [3]. Each university therefore uses its own methods for the final selection of KIP Kuliah recipients, including one of the universities in Sukabumi. Currently, the selection and verification process there is conducted through several methods, including interviews, home visits to prospective recipients, and selection of specializations relevant to the intended study program. However, the final determination of eligibility is still not made in a clear and transparent manner [4].

One way to address this is machine learning classification using data on applicants and recipients of KIP Kuliah from previous years [5], which can be done with algorithms such as Naive Bayes, Decision Tree, K-Nearest Neighbor, and Support Vector Machine [6]. Previous research includes the classification of KIP Kuliah recipients using logistic regression [7]. In that research, conducted by Ronny Susetyoko et al., the regression-based classification method could only take numerical input variables and produce numerical results. Gagan Suganda, on the other hand, classified KIP Kuliah recipients using Naive Bayes, a simple formula based on conditional probability [8]. However, this method has a drawback: if a conditional probability is zero, the predicted probability is also zero, and the prediction result will not be optimal [9]. Decision tree was also used in another case study, by Sathiyanarayanan, to identify breast cancer. The results showed that the decision tree algorithm is simpler, can compare each attribute by assigning a value to each node, and achieved an accuracy of 99% [10]. In addition, there is the SVM algorithm, which has high accuracy and can find subtle patterns in complex data sets [11].
Previous studies used only one algorithm, so this study compares two algorithms, namely decision tree and support vector machine [12]. The purpose of this research is to compare the accuracy of the two algorithms in classifying KIP Kuliah recipients [13]. The research stages follow the SEMMA (Sample, Explore, Modify, Model, and Assess) method, which starts with collecting the dataset and continues with exploring and understanding the data, preprocessing the data, modeling, and testing the accuracy and precision of the model [14].

Research Method
This research uses the SEMMA (Sample, Explore, Modify, Model, and Assess) approach [15]. It starts with determining the dataset, exploring and visualizing the dataset, and modifying the dataset so that it is ready to be modeled [16]. The research also includes modeling with machine learning algorithms and evaluating the accuracy of the model [17]. To obtain a more stable measure of model performance and reduce the risk of overfitting to the training data, the model was validated with the K-fold cross-validation method [18] [19].
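As an illustration of the validation step, the sketch below splits sample indices into K folds in plain Python. The function name `k_fold_indices`, the choice of K = 5, and the random seed are assumptions made for the example; the source does not state the number of folds or the library used.

```python
import random


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for k folds of nearly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly

    # The first (n_samples % k) folds receive one extra sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, test_idx


# 519 is the dataset size reported in the paper; k = 5 is an assumption.
folds = list(k_fold_indices(n_samples=519, k=5))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)}")
```

Each sample appears in exactly one test fold, so every record is used for both training and testing across the K rounds, and the reported metric is the average over the rounds.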
The confusion matrix is used to measure the performance of the model. From this table, evaluation metrics such as accuracy, precision, recall, and F1 score can be calculated [20]. Accuracy is the proportion of correct classification results among all classification results [21]. Recall indicates how successfully the algorithm recognizes a class, while precision indicates how many of the results assigned to a class actually belong to it [22]. The F1 score, which combines recall and precision, shows the overall performance of the method.

Figure 1 shows the stages of the research. The research proceeds as follows: 1) collecting datasets of KIP Kuliah recipients for the last 5 years for modeling; 2) dataset exploration by visualizing and describing the dataset, which is the process of understanding the dataset in order to sort out the data that suit the modeling needs; 3) modification, consisting of variable (feature) selection, data cleaning, and data transformation, which is carried out to ensure that the dataset to be modeled has been verified; 4) modeling the dataset with the decision tree and SVM machine learning algorithms.

In classifying data with the SVM method, the kernel function K(xi, xd) is used. The kernel function used is given in formula (1) as follows [23]:

A decision tree learns a problem from a set of independent data, depicted as a tree diagram, using a "divide and conquer" approach [24]. Formulas (2) and (3) are the equations for the data in the tuples D.

The accuracy of the machine learning model is measured with the confusion matrix. The confusion matrix [25] is used to calculate accuracy, precision, and recall by comparing the system's classification results with the actual observations, grouped into True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). The equations for each metric are as follows:
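In their standard binary-classification form (assumed here, since the paper's own equations and their numbering are not available in this text), the metrics can be written in terms of TP, TN, FP, and FN as:

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```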

Data Exploration
The data exploration step describes and visualizes the dataset of KIP scholarship recipients. The raw dataset consists of 16 attributes containing non-numerical data. Because the decision tree and SVM implementations require numeric data, data transformation is needed. The attributes that form the modeling dataset are DTKS status, P3KE status, and the combined income of father and mother, with Scholarship Status as the label.

Data Modification
The original dataset, obtained from the KIP scholarship application process, still contains raw data. It therefore needs to be adjusted before it can be processed with the decision tree and SVM algorithms. The attributes that need to be modified include Status DTKS, Status P3KE, and the combined income of father and mother; the label is the Scholarship Status.
1. For the Status DTKS and Status P3KE attributes, the value 'Not Registered' is changed to '0' and the value 'Registered' is changed to '1'. The modification process for these attributes is shown in Figure 3.
2. The attributes "Penghasilan Ayah" (Father's Income) and "Penghasilan Ibu" (Mother's Income), which contain values such as 'Tidak Berpenghasilan' (no income), '-', and a range of salary options, are combined into a new attribute called 'Family Income'. The process of modifying these attributes is shown in Figure 4.
3. The "Penghasilan Orangtua" (Parent Income) attribute mentioned in point two is categorized into three groups: 1) "Low" for incomes below Rp 2,000,000, 2) "Medium" for incomes between Rp 2,000,000 and Rp 4,000,000, and 3) "High" for incomes above Rp 4,000,000. All values are monthly income. The process of modifying this attribute is illustrated in Figure 5.
4. The "Prestasi" (Achievement) attribute is categorized into two groups, "Berprestasi" (Achieved) and "Tidak Berprestasi" (Not Achieved), which are changed to "1" and "0", respectively. The process of modifying this attribute is illustrated in Figure 6.
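The modification steps above can be sketched in plain Python. All function names and dictionary keys are hypothetical, incomes are assumed to be already available as monthly rupiah amounts (with 'Tidak Berpenghasilan' and '-' treated as 0), and the boundary values Rp 2,000,000 and Rp 4,000,000 are assumed to fall in "Medium"; the source does not specify how salary ranges are parsed or how boundaries are handled.

```python
NO_INCOME_VALUES = {"Tidak Berpenghasilan", "-"}
STATUS_MAP = {"Not Registered": 0, "Registered": 1}
ACHIEVEMENT_MAP = {"Tidak Berprestasi": 0, "Berprestasi": 1}


def income_to_rupiah(value) -> int:
    """Return a monthly income in rupiah; no-income markers become 0 (assumption)."""
    if isinstance(value, str) and value.strip() in NO_INCOME_VALUES:
        return 0
    return int(value)


def income_category(monthly_income: int) -> str:
    """Bin the combined income; boundaries are assigned to 'Medium' (assumption)."""
    if monthly_income < 2_000_000:
        return "Low"
    if monthly_income <= 4_000_000:
        return "Medium"
    return "High"


def encode_applicant(record: dict) -> dict:
    """Apply steps 1 to 4 to one applicant record (keys are hypothetical)."""
    family_income = (income_to_rupiah(record["penghasilan_ayah"])
                     + income_to_rupiah(record["penghasilan_ibu"]))
    return {
        "status_dtks": STATUS_MAP[record["status_dtks"]],
        "status_p3ke": STATUS_MAP[record["status_p3ke"]],
        "family_income": family_income,
        "income_category": income_category(family_income),
        "prestasi": ACHIEVEMENT_MAP[record["prestasi"]],
    }


applicant = {
    "status_dtks": "Registered",
    "status_p3ke": "Not Registered",
    "penghasilan_ayah": 1_500_000,
    "penghasilan_ibu": "Tidak Berpenghasilan",
    "prestasi": "Tidak Berprestasi",
}
encoded = encode_applicant(applicant)
print(encoded)
```

Using explicit lookup tables means an unexpected value raises a `KeyError` instead of being silently encoded, which helps catch dirty records during data cleaning.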

Data Modelling, Evaluation and Validation
In this study, two data modeling techniques, Decision Tree and Support Vector Machine (SVM), are used to classify recipients of the KIP Kuliah scholarship program. The Decision Tree method constructs a tree-shaped structure that systematically divides the dataset based on attributes such as DTKS status, P3KE status, academic achievement, and parental income, ultimately predicting a student's eligibility for the scholarship. SVM seeks the hyperplane that best separates KIP Kuliah recipients from non-recipients by maximizing the margin between the two classes. To evaluate how effectively the decision tree and SVM models predict or categorize KIP Kuliah awardees, model assessment and validation are also carried out. The confusion matrix is used for model evaluation, and the K-Fold cross-validation technique is used for validation.

For the decision tree model, of all samples that actually belong to category 1 (Actual 1), 10 are correctly predicted as category 1 (True Positive), while 93 are mistakenly predicted as category 0 (False Negative). Conversely, the SVM model correctly predicts all 416 samples that actually belong to category 0 (Actual 0) as category 0 (True Negative). Of the samples that actually belong to category 1 (Actual 1), however, it incorrectly predicts all 103 as category 0 (False Negative) and predicts none as category 1 (True Positive).
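The confusion matrix counts reported in this paper (decision tree: TN = 397, FP = 19, FN = 93, TP = 10; SVM: TN = 416, FP = 0, FN = 103, TP = 0) can be turned into metrics with a short sketch. The helper `weighted_metrics` is hypothetical. It averages per-class precision, recall, and F1 with weights equal to each class's share of the samples, and it lets an undefined precision (no predicted positives) default to a chosen value. Both choices are assumptions; the source does not state its averaging scheme.

```python
def weighted_metrics(tn: int, fp: int, fn: int, tp: int,
                     zero_division: float = 0.0) -> dict:
    """Accuracy plus support-weighted precision, recall, and F1 for two classes."""
    total = tn + fp + fn + tp

    def safe_div(num: float, den: float) -> float:
        # Fall back to `zero_division` when a ratio is undefined (0 / 0).
        return num / den if den else zero_division

    # (precision, recall, support) per class; class 0 treats label 0 as "positive".
    per_class = [
        (safe_div(tn, tn + fn), safe_div(tn, tn + fp), tn + fp),  # class 0
        (safe_div(tp, tp + fp), safe_div(tp, tp + fn), tp + fn),  # class 1
    ]

    precision = recall = f1 = 0.0
    for p, r, support in per_class:
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * (2 * p * r / (p + r) if p + r else 0.0)

    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


decision_tree = weighted_metrics(tn=397, fp=19, fn=93, tp=10)
# zero_division=1.0 is an assumption for the class with no predicted positives.
svm = weighted_metrics(tn=416, fp=0, fn=103, tp=0, zero_division=1.0)

for name, scores in (("Decision Tree", decision_tree), ("SVM", svm)):
    print(name, {metric: round(value, 4) for metric, value in scores.items()})
```

Under these assumptions the weighted recall equals the accuracy for both models, and the weighted F1 comes out near 0.73 for the decision tree and near 0.71 for the SVM, in line with the reported 73.20% and 71.46%. Small differences are expected because the reported figures are averages over cross-validation folds rather than values computed from one pooled matrix.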

Conclusion
Overall, the Support Vector Machine (SVM) model performs slightly better than the decision tree model in classifying KIP Kuliah scholarship recipients. As seen in Table 2, the SVM model achieved an accuracy of 80.17%, compared with 78.44% for the decision tree model. SVM is also superior in precision, with a score of 84.44% against 73.11% for the decision tree model. In addition, both models showed recall values similar to their accuracy values, indicating their ability to correctly identify KIP Kuliah recipients from the dataset. On the F1 score, which combines precision and recall, the SVM model achieved 71.46%, while the decision tree model achieved 73.20%. Thus, although the SVM model performed slightly better in accuracy and precision, both models showed competitive performance in classification for this case study.

Figure 3. Modification of DTKS and P3KE Status Attributes

Figure 4. Parent Income Attribute Modification

Figure 5. Merging Parent Income Attributes

Figure 6. Modification of Student Achievement Attributes

5. From the results of the dataset modification above, a clearer dataset distribution can be obtained, as shown in Figure 7.

Figure 7. Dataset Distribution of Each KIP Kuliah Attribute

The attributes of the KIP Kuliah dataset used are therefore DTKS Status, P3KE Status, Achievement, Scholarship Status, Parent Income, Low, Medium, and High.

Figure 8.

Figure 9. Model Evaluation Results with Decision Tree

Figure 9 shows the results of modeling and validation of the decision tree model using the K-Fold cross-validation method. The K-Fold Cross-Validation results for the Decision Tree model in classifying KIP Kuliah recipients can be explained as follows:
1. Accuracy: The accuracy value of 0.78439 indicates that the Decision Tree model correctly predicts about 78.44% of all cases in the test dataset. Accuracy measures the extent to which the model correctly classifies data on KIP Kuliah recipients.
2. Precision: The precision of 0.73107 reflects the model's ability to correctly identify KIP Kuliah recipients among all predicted positives (True Positives + False Positives). About 73.11% of the model's positive predictions are correct.

Figure 10. SVM Model Evaluation Result

Figure 10 shows the results of modeling and validation of the SVM model using the K-Fold cross-validation method. The K-Fold Cross-Validation results for the SVM model in classifying KIP Kuliah recipients can be explained as follows:
1. Accuracy: The accuracy value of 0.80174 indicates that the SVM model correctly predicts about 80.17% of all cases in the test dataset. Accuracy measures the extent to which the model correctly classifies data on KIP Kuliah recipients.
2. Precision: The precision of 0.84443 reflects the model's ability to correctly identify KIP Kuliah recipients among all predicted positives (True Positives + False Positives). About 84.44% of the model's positive predictions are correct.
3. Recall (sensitivity): The recall value of 0.80174, or about 80.17%, indicates the model's ability to identify KIP Kuliah recipients among all actual positives (True Positives + False Negatives).
4. F1 Score: The F1 score, the harmonic mean of precision and recall, is 0.71465. This gives an overall picture of the model's performance of 71.46%.

Figure 11. Comparison of Confusion Matrix Results between Decision Tree and SVM

The outcomes of the classification evaluation, that is, the predictions made by the SVM and decision tree models, are shown visually in Figure 11 as confusion matrices. The "Actual" label (horizontal) indicates the real class of the examined data, while the "Predicted" label (vertical) indicates the class that the model predicted from the provided data. Of all samples that truly belong to category 0 (Actual 0), the decision tree model correctly predicts 397 as category 0 (True Negative) and incorrectly predicts 19 as category 1 (False Positive). The model also correctly predicts 10 samples as category 1 (True Positive).

Table 1. Attributes of KIP Scholarship Dataset

Table 2. Comparison of Decision Tree and SVM Performance