Sentiment Analysis of Indonesian Salt Policy Using Naïve Bayes and Information Gain Methods

Salt production is one of the concerns of the Indonesian government. Several government policies, such as salt imports, have had a major impact on local salt farmers. Other contributing factors are the increased demand for salt, decreased domestic salt production due to unfavorable weather, and the lower price of imported salt compared to local salt. Many people express their opinions regarding the salt import policy via the social media platform Twitter. Sentiment analysis can be applied to analyze and classify tweets written by the public about salt import policies. This study uses the naïve Bayes classifier as the sentiment classification algorithm for Twitter posts. The TF-IDF method is used for feature extraction and weighting. Not all features resulting from the TF-IDF process are used, so feature selection is carried out using the information gain method. Model testing was carried out 5 times with 500 data points, both with and without feature selection. With feature selection, the highest accuracy is 84% at K=4, while without feature selection the accuracy is 71% at K=3, an improvement of 13 percentage points.


Introduction
Indonesia is a country with a long coastline, totaling 99,093 kilometers [1]. This is used by the Indonesian population as a source of livelihood, for example using the sea and the coast to produce local Indonesian salt. Salt is a basic need and one of Indonesia's strategic commodities, needed by all Indonesian people both individually and as an industrial raw material. Without salt, the production processes of the food/beverage and chemical industries would be hampered, which would also hamper national economic growth [1]. Domestic salt production is one of the concerns of the Indonesian government. On the other hand, policies issued by the government, such as salt imports, have had a major impact on the salt farmers who produce Indonesia's local salt. According to the news source liputan6.com, total salt imports in 2021 reached 3.07 tons. This is influenced by several factors, including the quality of local salt, which is still inferior to that of imported salt. Other factors are the increasing demand for salt, poor domestic salt production due to weather factors, and the lower price of imported salt compared to local salt. This certainly decreases the motivation and income of local salt farmers [1]. The salt import policy elicited many responses from Indonesians concerned about local salt farmers.
Corresponding author: ykustiyahningsih@trunojoyo.ac.id

Many people express their opinions regarding the salt import policy, including through the social media platform Twitter. For conveying an opinion, Twitter is one of the most popular text-based social media platforms [2]. The ease of expressing opinions on Twitter is a factor that leads its users to write responses to government policies. Public posts on Twitter can be used by the government to learn the public's opinions of, and responses to, the policies it makes. The government can treat the positive and negative responses submitted by Twitter users as material for consideration in making policies, including salt import policies. Tweet data from Twitter can be classified into positive or negative opinions using sentiment analysis. Sentiment analysis is the process of analyzing text data originating from the public's writings or opinions in response to an issue or problem [3]. Recently, sentiment analysis has become a type of research that has received wide attention because it can produce important information, such as public opinion, for politics and decision-making processes [4]. The main purpose of sentiment analysis is to analyze a document, review, or comment and categorize it as positive, negative, or neutral sentiment [2]. Sentiment analysis can therefore be applied to analyze and classify tweets submitted via Twitter about salt import policies.
Twitter data can be classified using a data classification model. Classification algorithms that are often used include naïve Bayes, K-Nearest Neighbors (KNN), and the C4.5 decision tree [5]. This study uses the naïve Bayes classifier as the sentiment classification algorithm for Twitter posts. Naïve Bayes is a simple classification algorithm that calculates the probability of each data point against an existing data set to determine its class [5]. Data to be used in the classification process must first be cleaned to improve its structure. Text data must also go through a feature extraction process to determine the features or attributes to be used and the weight of each feature. The feature extraction and weighting method used in this study is TF-IDF. Extraction and feature weighting with TF-IDF determines the features produced from each sentence and the weight of each word relative to the sentences or documents in a document (data) set [6]. Not all features resulting from the TF-IDF process are used, so feature selection is carried out to determine the features considered most important. In this study, the information gain method is used for feature selection; it can determine the features considered most relevant for use in classification with the naïve Bayes algorithm. Naïve Bayes has several advantages over other classification algorithms: it is suitable for classifying large amounts of data and produces a fairly high level of classification accuracy [3]. Therefore, this study uses the naïve Bayes algorithm to classify the fairly large amount of Twitter data. The accuracy of the naïve Bayes
classification model in previous studies reached 84% [7] and 75.42% [8]. This sentiment analysis research is expected to show the accuracy of the classification model used and to provide useful information to the public and the government about Twitter users' opinions on salt import policies.

Sentiment Analysis
Sentiment analysis is a technique for analyzing emotions and opinions from social media, websites, or documents, studied since the 2000s [9]. Sentiment analysis can also be defined as a process of processing data and classifying it into several categories. In general, sentiment analysis is divided into 3 levels: document level, sentence level, and aspect level [10].

Crawling Data
Data crawling is the process of automatically retrieving data from websites according to the needs of the user [11,12,13,14,15]. Data can come from various websites according to the needs of data seekers, including from the Twitter website. The crawling process can use several assistive tools, including programming languages, in order to obtain data according to those needs.
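Crawling tools differ, but the keyword-filtering part of the process can be illustrated with a minimal sketch. The records and field names below are hypothetical stand-ins for whatever a real crawler or Twitter API client returns; they are not from the paper's dataset.

```python
# Minimal sketch: filtering crawled records by keyword.
# The records and the "text" field name are hypothetical examples,
# standing in for whatever a real crawler returns.

def filter_by_keyword(records, keyword):
    """Keep only records whose text mentions the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [r for r in records if keyword in r["text"].lower()]

crawled = [
    {"user": "a", "text": "Kebijakan garam impor merugikan petani"},
    {"user": "b", "text": "Cuaca hari ini cerah"},
    {"user": "c", "text": "Harga garam lokal turun karena impor"},
]

salt_tweets = filter_by_keyword(crawled, "garam")
print(len(salt_tweets))  # 2
```

In practice the keyword would be the study's search phrase and the records would come paginated from the crawler, but the filtering logic is the same.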

TF-IDF Extraction and Weighting
Feature extraction and weighting is the process of determining the features obtained from the words in each sentence and giving a weight to each word. One method of feature extraction and weighting is TF-IDF (Term Frequency - Inverse Document Frequency). The concept of the TF-IDF method is to extract sentences to determine features and assign weights to each feature by finding the value of a relationship in a set of documents [11]. There are 3 calculation steps performed in the TF-IDF feature extraction and weighting process, namely:
1. Term Frequency (TF) is used to calculate the weight of a term/word in a sentence. The TF value of a word is greater the more often the word appears in a sentence [12]. The equation for calculating the TF value is as follows [13]:

tf = 1 + log10(f(t,d)) if f(t,d) > 0, and tf = 0 if f(t,d) = 0   (1)

with:
tf = term frequency value
f(t,d) = frequency of term t in document d

2. Inverse Document Frequency (IDF) is used to reduce the weight of words that appear frequently across many sentences, so that such words receive a lower weight [12]. The equations for calculating IDF and the resulting TF-IDF weight are as follows [13]:

idf(t) = log10(D / df(t)),   tf-idf(t,d) = tf(t,d) × idf(t)   (2)

with:
D = total number of documents
df(t) = number of documents containing the term t
idf = inverse document frequency

3. Information gain feature selection. Information gain is a feature selection method that measures the importance of each feature or attribute in determining the class of a data point [14].
The concept of the feature selection process is to select the features considered most important and most influential in determining the class of a data point, and to remove features considered less important. The application of feature selection aims to speed up the system or model being built and to improve accuracy [15].
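The log-scaled term frequency and inverse document frequency described above can be sketched in Python. The toy token lists below are invented for illustration; they are not the study's data.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Log-scaled term frequency: 1 + log10(f) if f > 0, else 0."""
    f = Counter(doc_tokens)[term]
    return 1 + math.log10(f) if f > 0 else 0.0

def idf(term, docs):
    """Inverse document frequency: log10(D / df) over the document set."""
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df else 0.0

def tf_idf(term, doc_tokens, docs):
    """TF-IDF weight of a term in one document of the collection."""
    return tf(term, doc_tokens) * idf(term, docs)

# "garam" appears twice in the first document and in 2 of 3 documents,
# so tf = 1 + log10(2) and idf = log10(3/2).
docs = [["impor", "garam", "garam"], ["garam", "lokal"], ["cuaca", "buruk"]]
w = tf_idf("garam", docs[0], docs)
```

A word that occurs in every document gets idf = log10(1) = 0, so its TF-IDF weight vanishes, which is exactly the down-weighting of common words the method is designed for.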
The steps to calculate the information gain are as follows:

Results and Discussion
The results of the stages carried out in this research are as follows:
1. Crawling data. The crawling process obtained 500 data points taken in 2021 using the keywords "Indonesian salt policy". An example of the crawled data can be seen in Table 2.

Table 2. Crawling Sentence Result

Example of sentences
In this case, @kkpgoid is among the most responsible. I suspect, is it possible that there is a salt mafia that benefits from this salt import activity? Just like the export of lobster seeds, which turned out to be bribery committed by Edhi Prabowo Cs.
2. Manual labelling. The result of manual labelling is the crawled data with manually determined classes. The labelled data is used as learning data for the classification model and can be seen in Table 2 (Labelling Data).

Example of sentences | Labelling
In this case, @kkpgoid is among the most responsible. I suspect, is it possible that there is a salt mafia that benefits from this salt import activity? Just like the export of lobster seeds, which turned out to be bribery committed by Edhi Prabowo Cs. | Positive

3. Data preprocessing is a step taken to clean and improve the data structure. The data preprocessing stage consists of 6 steps: cleansing, tokenizing, case folding, slang word conversion, stemming, and stop word removal. The result of preprocessing is clean data with an improved structure. Preprocessing also affects the accuracy of the classification model to be built. The following are examples of the data after the various preprocessing stages.
Cleansing: The cleansing stage produces data from which URLs, numbers, hashtags, and usernames have been removed. An example of data after the cleansing process can be seen in Table 3.

Table 3. Data After Cleansing

Example of sentences
In this case, kkpgoid is among the most responsible. I suspect that there should be no salt mafia that benefits from this salt import activity. Just like the export of lobster seeds, it turns out that there was an act of bribery committed by Edhy Prabowo Cs.
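The cleansing stage above can be sketched with regular expressions. The exact patterns are an assumption, since the paper does not list the rules it used, but they cover the four removals it names (URLs, numbers, hashtags, usernames).

```python
import re

def cleanse(text):
    """Remove URLs, @usernames, #hashtags, and digits from a tweet,
    then collapse the leftover whitespace. Patterns are illustrative."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # usernames
    text = re.sub(r"#\w+", " ", text)           # hashtags
    text = re.sub(r"\d+", " ", text)            # numbers
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

print(cleanse("@kkpgoid impor garam 2021 #garam https://t.co/xyz"))
# -> "impor garam"
```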
Tokenizing: The tokenizing stage produces data that is cut into words, removing punctuation without regard to the relationship of each word in the sentence. The data after tokenizing can be seen in Table 4.

Table 4. Data After Tokenizing

Example of sentences
In this case kkpgoid is among the most responsible. I suspect that there should be no salt mafia that benefits from this salt import activity. Just like the export of lobster seeds, it turned out that there was an act of bribery committed by Edhy Prabowo Cs.
Case Folding: The case folding stage produces data in which all letters are converted to lowercase. An example of data after the case folding process can be seen in Table 5.

Table 5. Data After Case Folding

Example of sentences
in this case kkpgoid is among the most responsible. i suspect that there should be no salt mafia that benefits from this salt import activity, just like the export of lobster seeds, which turned out to be an act of bribery committed by edhy prabowo cs

Convert Slang Words: The slang word conversion stage checks each word against a slang word dictionary. If a non-standard word is found in the slang word dictionary, it is replaced with the corresponding standard word. The data after the slang word conversion process can be seen in Table 6.

Table 6. Data after conversion of slang words

Example of sentences
this case kkpgoid is among the most responsible. i suspect that there should be no salt mafia that benefits from this salt import activity, just like the export of lobster seeds, which turned out to be an act of bribery committed by edhy prabowo cs
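The slang-word lookup can be sketched as a simple word-for-word replacement. The dictionary entries below are a small made-up sample of Indonesian slang, not the paper's actual dictionary.

```python
# Minimal sketch of slang-word conversion: each token is looked up in a
# slang dictionary and replaced with its standard form when present.
# These entries are an illustrative sample, not the study's real dictionary.
SLANG = {"gak": "tidak", "bgt": "banget", "yg": "yang"}

def convert_slang(tokens, slang=SLANG):
    """Replace non-standard words with standard ones; leave the rest alone."""
    return [slang.get(t, t) for t in tokens]

tokens = ["harga", "garam", "gak", "stabil"]
print(convert_slang(tokens))  # ['harga', 'garam', 'tidak', 'stabil']
```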

Stemming
The stemming stage changes all the words in a sentence into base words by removing all affixes. The stemming process uses the Sastrawi library, which implements the Nazief and Adriani algorithm. The data after the stemming process can be seen in Table 7.

Table 7. Data after stemming

Example of sentences
In this case, kkpgoid is the most responsible. I suspect that there should be no salt mafia that can profit from this salt import activity, the same as with the export of lobster seeds, which turned out to involve bribery carried out by Edhy Prabowo cs

Stop words
The stop word stage is the last preprocessing stage. This stage removes words considered unimportant or meaningless in a sentence. The stop word process uses previously created stop word data. The data after the stop word process can be seen in Table 8.
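Stop-word removal can be sketched as filtering tokens against a stop-word set. The list below is a tiny illustrative sample of Indonesian stop words, not the full list used in the study.

```python
# Illustrative sample of Indonesian stop words; the study used its own list.
STOP_WORDS = {"yang", "di", "ke", "dari", "ini", "itu"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

tokens = ["impor", "garam", "ini", "rugi", "petani", "di", "madura"]
print(remove_stop_words(tokens))
# ['impor', 'garam', 'rugi', 'petani', 'madura']
```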

4. Calculating entropy values, a measure of class uncertainty:

Entropy(S) = Σi -Pi log2 Pi   (3)

with:
c = the number of values in the classification class
Pi = the proportion of samples belonging to class i

5. Calculating the information gain:

Gain(S, A) = Entropy(S) - Σ v∈Values(A) (|Sv| / |S|) Entropy(Sv)   (4)

with:
v = a possible value of attribute A
Values(A) = the set of possible values for A
|Sv| = the number of samples with the value v
|S| = the total number of data samples
Entropy(Sv) = the entropy of the samples that have the value v

6. Naïve Bayes Classifier. The naïve Bayes classifier is a data classification algorithm based on the theorem proposed by Thomas Bayes. The concept of the naïve Bayes classifier is to calculate the probability of a particular class for future data by referring to previous data, so that the data can be classified [16]. The learning data used as a reference for classification is often referred to as training data. All training data have their respective classes, determined manually, as learning material for the classification model. Classifying data with the naïve Bayes algorithm is done by calculating the probability of each feature or attribute for each class, so that the probability values determine the class of each data point [17]. The basic formula of the naïve Bayes algorithm is the following [18]:

P(C|X) = P(X|C) P(C) / P(X)   (5)

with:
X = data sample whose class is unknown
C = class that is used as a hypothesis
P(C|X) = probability that a sample belongs to class C (posterior)
P(C) = probability of a certain class (prior)
P(X|C) = probability of the appearance of the features/characteristics in class C (likelihood)
P(X) = probability of X

The steps in the naïve Bayes algorithm are as follows: calculating the likelihood or conditional probability

P(x|C) = P(x1, x2, …, xn | C)   (7)

with:
C = class
x = vector of n attribute values
P(x|C) = probability that a document from class C contains the attribute values x
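The entropy and information-gain calculations in steps 4 and 5 can be sketched as follows. The tiny labelled dataset is invented for illustration; the binary feature stands in for "does the tweet contain a given word".

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, feature):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(samples)
    gain = entropy(labels)
    for v in set(s[feature] for s in samples):
        subset = [l for s, l in zip(samples, labels) if s[feature] == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy data: the feature is whether the tweet contains the word "impor".
samples = [{"impor": 1}, {"impor": 1}, {"impor": 0}, {"impor": 0}]
labels = ["negative", "negative", "positive", "positive"]
g = information_gain(samples, labels, "impor")
# Here the feature fully separates the classes, so the gain equals
# the full entropy of the labels (1.0): a maximally informative feature.
```

Ranking features by this gain and keeping the top ones is the feature-selection step the study performs before classification.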

5. Determining feature selection. The feature selection process produces the features considered most important and discards the others, with the aim of increasing the accuracy of the classification model. This study uses the information gain method for feature selection and reduces the number of features to 500. The classification model is then tested in the model testing phase using the k-fold cross validation method with k = 5. The model test shows the accuracy, precision, and recall of each test (fold) and their averages; the results of the model testing conducted with 5-fold cross validation are shown in Table 9.
6. Classification process and model testing. The classification process was carried out by dividing the 500 data points into training data and test data. The result of the classification process is a new label for each data point, produced by the classification model. The new label column shows the labelling results given by the naïve Bayes classification model.
7. Result of testing. The accuracy of the model is tested using the 500 data points. The k-fold cross validation method is used to test the accuracy of the model with a fold value of 5. The 500 data points are divided into two parts, training data and test data. The distribution of the data can be seen in Table 1 (K-fold cross validation data), based on the predetermined folds; for example, at K=3 the folds are assigned Training, Training, Testing, Training, Training.
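The naïve Bayes classification described above can be sketched with a small hand-rolled multinomial model. The toy training tweets are invented, and Laplace (add-one) smoothing is an assumption on my part; the paper does not state which smoothing, if any, it used.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate class priors P(C) and per-class word counts for P(w|C)."""
    priors = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab, len(labels)

def classify_nb(tokens, priors, word_counts, vocab, n):
    """Pick the class maximizing log P(C) + sum_w log P(w|C),
    with Laplace smoothing on the word likelihoods (an assumption)."""
    best, best_score = None, -math.inf
    for c in priors:
        total = sum(word_counts[c].values())
        score = math.log(priors[c] / n)
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy labelled tweets (invented, already preprocessed into tokens).
docs = [["garam", "impor", "rugi"], ["petani", "rugi"], ["garam", "lokal", "bagus"]]
labels = ["negative", "negative", "positive"]
model = train_nb(docs, labels)
print(classify_nb(["impor", "rugi"], *model))  # 'negative'
```

For k-fold testing as in the study, the labelled data would be split into k parts, the model retrained k times with one part held out as test data each time, and the accuracies averaged.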