Sentiment Analysis of Bjorka Hacker Using the Naive Bayes and C4.5 Algorithms

In 2023, Indonesia was again shaken by a hacker known as Bjorka. Bjorka did not act just once or twice, and each incident caused a nationwide stir. Most recently, Bjorka leaked 19 million BPJS Employment records belonging to Indonesian citizens. Since the Bjorka story broke, a growing number of people have criticized or commented on it on social media, particularly Facebook, and these opinions can be used for sentiment analysis. A method that automatically classifies opinions into positive and negative categories is therefore needed. The sentiment analysis process begins with data preprocessing, followed by term weighting using the TF-IDF method, model building, and analysis of the classification results. The data are processed with text mining and classified using the Naive Bayes and C4.5 algorithms. In testing, Naive Bayes gave the best classification, with an accuracy of 70 percent compared with 68 percent for the C4.5 algorithm. The Naive Bayes algorithm can therefore classify positive and negative sentiment with an accuracy of 70%.


Introduction
Data security has become a critical component in the evolution of information technology. The use of information technology has affected many areas, including cloud-based data collection and storage [1]. Data breaches occur frequently, driven by advances in technology and by individuals skilled at finding security gaps, commonly referred to as hackers [2].
One hacker widely discussed in Indonesia is Bjorka [2]. Bjorka is reported to have leaked 11 GB of Tokopedia marketplace data, 26 million Indihome customer records, 1.3 billion SIM card registration records, and 105 million KPU records [3] [4], and has recently leaked BPJS Employment data [4]. These leaks have prompted a wide range of opinions on social media, especially Facebook [5].
Public opinion is very diverse: some people are amazed by Bjorka, while others dislike the hacker or regard the affair as a diversion from other issues [6]. The large and growing volume of opinions makes public sentiment difficult to describe, so an analytical approach is needed. Text mining is one approach that can address this problem [7]. Previous research on Bjorka used a Support Vector Machine and obtained an accuracy of 62.33 percent [8]. A similar analysis using the Naive Bayes algorithm produced an accuracy of 83.06% [9]. Another study compared the accuracy of Naive Bayes, C4.5, and Random Forest for classifying online motorcycle taxi services [10]. The comparison showed that Naive Bayes was more accurate than C4.5 and Random Forest, with average accuracies of 69.18% for Naive Bayes, 66.34% for Random Forest, and 65% for C4.5 [11][12].

IICS SEMNASTIK
This research compares the accuracy of the Naive Bayes and C4.5 algorithms for sentiment analysis [13][14]. It adds TF-IDF feature weighting, which is expected to help the system classify sentiment better and achieve higher accuracy [15] [16].

Research Method

2.1. Research Design
This research applies the Knowledge Discovery in Databases (KDD) method [17]. KDD is the process of extracting potentially useful, implicit, and previously unknown information from a data set [18]. The research workflow is shown in Figure 1.

Dataset
This study used 1,187 data points collected from Facebook. The collected data were then manually labeled as positive or negative sentiment with the assistance of a linguist [19].

Term Frequency Inverse Document Frequency (TF-IDF) Weighting
TF-IDF is a feature extraction method that assigns a weight to each word in the training dataset. The weight reflects how often a word appears in a document, offset by how many documents in the collection contain that word [20] [21]. TF-IDF weights are calculated using equation (1).
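The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact formula of equation (1): it assumes raw term counts as the term frequency and log10(N / df) as the inverse document frequency, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = tf * log10(N / df), where tf is the raw
    count of the term in the document, N is the number of documents,
    and df is the number of documents containing the term.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    weights = []
    for tokens in documents:
        tf = Counter(tokens)
        weights.append(
            {term: count * math.log10(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical, already-tokenized toy documents.
docs = [["data", "leak", "data"], ["tax", "issue"], ["data", "issue"]]
weights = tf_idf(docs)
```

A term that occurs in every document receives a weight of zero under this variant, which is why common words contribute little to the classification.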

Naive Bayes
Naive Bayes is a technique for classifying data. The Naive Bayes classification approach is a statistical method that estimates the probability of membership in a class [22]. Naive Bayes calculations are carried out using equation (2).
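As a rough illustration of the classification rule, the sketch below scores each class as the prior probability multiplied by the per-term likelihoods and picks the highest score. It assumes a multinomial model with add-one (Laplace) smoothing, which may differ from the estimates used in this study; the function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: list of (tokens, label). Returns priors, term counts, vocabulary."""
    class_counts = Counter(label for _, label in samples)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    # Prior probability: share of training documents in each class.
    priors = {c: n / len(samples) for c, n in class_counts.items()}
    return priors, term_counts, vocab


def classify(tokens, priors, term_counts, vocab):
    """Return the most probable class and the log-score of every class."""
    scores = {}
    for c, prior in priors.items():
        total_terms = sum(term_counts[c].values())
        log_score = math.log(prior)
        for t in tokens:
            # Likelihood with add-one smoothing (an assumption of this sketch).
            likelihood = (term_counts[c][t] + 1) / (total_terms + len(vocab))
            log_score += math.log(likelihood)
        scores[c] = log_score
    return max(scores, key=scores.get), scores


# Hypothetical toy training data, not taken from the study's dataset.
train = [
    (["data", "leak", "bad"], "negative"),
    (["tax", "issue", "diversion"], "negative"),
    (["hacker", "great"], "positive"),
]
priors, term_counts, vocab = train_naive_bayes(train)
label, scores = classify(["tax", "issue"], priors, term_counts, vocab)
```

Summing logarithms rather than multiplying raw probabilities avoids numerical underflow when a comment contains many terms; the ranking of the classes is unchanged.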

C4.5

The C4.5 algorithm was developed from the ID3 algorithm with various improvements and enhancements [23]. These improvements include handling numeric attributes, managing missing values, and reducing noise in data sets [24]. The C4.5 calculation uses equation (5). After the entropy is calculated, attribute selection is done using Information Gain [25], which is calculated using equation (6).
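The entropy and Information Gain steps can be sketched as follows. The sketch assumes the standard textbook definitions, entropy as the negative sum of p * log2(p) over classes and gain as the parent entropy minus the size-weighted entropy of each partition, which are taken here to correspond to equations (5) and (6). The helper names and the toy rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the partitions created by the split.
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: 1 means the term occurs in the comment, 0 means it does not.
rows = [
    {"tax": 1, "issue": 1},
    {"tax": 1, "issue": 0},
    {"tax": 0, "issue": 1},
    {"tax": 0, "issue": 0},
]
labels = ["negative", "negative", "positive", "positive"]

# The attribute with the highest gain becomes the node.
best = max(["tax", "issue"], key=lambda a: information_gain(rows, labels, a))
```

When a branch has entropy 0, all of its rows share one class, so the branch becomes a leaf and no further gain calculation is needed for it.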

Evaluation
The final stage is evaluating classification performance [26]. Accuracy, which measures how often the model predicts correctly, is derived from a confusion matrix [27]. Classification performance is also measured using precision and recall [28]. Accuracy, precision, and recall are calculated using equations (7), (8), and (9), respectively. Besides the confusion matrix, the quality of a classification model's predictions can also be assessed using the Receiver Operating Characteristic (ROC) [29] and the Area Under the Curve (AUC) [30].

Accuracy = (TP + TN) / (TP + TN + FP + FN) (7)
Precision = TP / (TP + FP) (8)
Recall = TP / (TP + FN) (9)

where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative.
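Equations (7) to (9) translate directly into code. The sketch below is illustrative only: the function name `evaluate` and the confusion-matrix counts are hypothetical and are not results from this study.

```python
def evaluate(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts for illustration.
metrics = evaluate(tp=5, tn=3, fp=1, fn=1)
```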

Results and Analysis

3.1. Text Preprocessing
At this stage, the labeled data undergoes a data cleaning process. The text preprocessing stages are listed in Table 1.
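A cleaning step of this kind might look like the sketch below. The specific stages are assumptions chosen because they are common in text preprocessing (case folding, removal of URLs and non-letter characters, tokenizing, and stop-word removal) and may not match the stages in Table 1. The stop-word list and the function name `preprocess` are hypothetical.

```python
import re

# Hypothetical stop-word list for illustration only.
STOPWORDS = {"the", "is", "a", "of"}


def preprocess(text):
    """Clean one comment and return its list of tokens."""
    text = text.lower()                         # case folding
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters and whitespace only
    tokens = text.split()                       # tokenizing
    return [t for t in tokens if t not in STOPWORDS]  # stop-word removal


tokens = preprocess("The DATA leak is REAL!!! https://example.com")
```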

Transformation
At this stage, a weight is calculated for each word in the labeled data. The Term Frequency Inverse Document Frequency (TF-IDF) weights are calculated using equation (1), and the results are shown in Table 2.

D | Comment | Category
D1 | Transferring Tax Issues | ?

Naive Bayes
The first stage in the Naive Bayes calculation is to compute the prior probability. The prior probability is calculated using equation (3), and the results are shown in Table 5. From these calculations, it can be concluded that the test data 'tax issue experts' falls into the negative category.

C4.5
The next stage is the C4.5 calculation. The training data are shown in Table 6. The value of each term in the training data is then calculated; the results are shown in Table 7.

Table 7. Training Data Values

D | Hurry | Sell | People | Privacy | Order | Guaranteed | Switch | Issue | Case | Ministry of Finance | Tax | Answer | Amount

After calculating all terms, the results are shown in Table 8. The resulting root node is shown in Figure 2. Because the entropy of the "command" node branch is not zero, the calculation is repeated to determine the next node. The second information gain calculation is shown in Table 9.
From Table 9, it can be concluded that the next node obtained is as shown in Figure 3. Because the entropy of the "command" node branch is still not equal to 0, the calculation is repeated to determine the next node. The third information gain calculation is shown in Table 10.
From Table 10, it can be concluded that the node obtained is as shown in Figure 4.

Evaluation
At the evaluation stage, the models were tested and evaluated to determine the performance of the Naive Bayes and C4.5 algorithms. The classification results are presented as a confusion matrix. The evaluation results for each algorithm are shown in Table 11.

Conclusion
Comparing the accuracy of the Naive Bayes and C4.5 algorithms, the highest accuracy, 70%, was obtained by Naive Bayes, while C4.5 obtained an accuracy of 68%. This shows that the Naive Bayes algorithm can classify positive and negative sentiment with an accuracy of 70%.

Figure 1. Research Design

during training, where the equation can be seen in equations (3) and (

After obtaining the prior probabilities, the next stage is to calculate the likelihood, using equation (4). After obtaining the likelihoods, the next stage is to classify the test data, which can be done manually using equation (2). For the test data, the product of the probability terms is

0.66667 × 0.1 × 0.1 × 0.5 = 0.00333335

Figure 4. Node 3 Decision Tree

Because the entropy value on the 'tanggung' node branch has reached 0, the 'lose' attribute branch is a leaf node, and no gain calculation is carried out to determine the next node.

Table 2. TF-IDF Calculation

After the TF-IDF weights are obtained, the next stage is implementing the Naive Bayes and C4.5 algorithms. Samples of the training and test data are shown in Table 3 and Table 4.

Table 3. Sample Testing Data

Table 4. Sample and Testing