motorconcer
New Coder
I have been trying to play around with some datasets I found on GitHub to see how well I can conduct sentiment analysis on different datasets and to learn how the code works. I have a dataset that I wanted to incorporate into some code I found; the only issue is that it is a highly imbalanced dataset. For example, the negative sentiment class has around 5,000 tweets, whereas the positive class has roughly 15,000 tweets. I found different ways I could handle this situation. The first was to upsample the minority class using sklearn's resample:
import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

df_majority = my_df[my_df.target == 1]   # ~15,000 positive tweets
df_minority = my_df[my_df.target == 0]   # ~5,000 negative tweets

# upsample the minority class with replacement to match the majority size
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=15025,
                                 random_state=123)
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

x = df_upsampled.Tweet
y = df_upsampled.target

SEED = 2000
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.02, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED)

However, using this code I felt the results weren't quite right. I then kept reading quite a bit about SMOTE, which has reportedly worked extremely well with imbalanced datasets. The only issue is that I have no idea how I can incorporate it into the code I found online. I am honestly a real amateur at coding, so some help would be appreciated.
My idea was to incorporate SMOTE into the following setup:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from time import time

cvec = CountVectorizer()
lr = LogisticRegression()
n_features = np.arange(1000, 20000, 1000)
def nfeature_accuracy_checker(vectorizer=cvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=lr):
    result = []
    print(classifier)
    print("\n")
    for n in n_features:
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range)
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', classifier)
        ])
        print("Validation result for {} features".format(n))
        # accuracy_summary is a helper defined earlier in the code I found;
        # it fits the pipeline and returns (accuracy, elapsed time)
        nfeature_accuracy, tt_time = accuracy_summary(checker_pipeline, x_train, y_train, x_validation, y_validation)
        result.append((n, nfeature_accuracy, tt_time))
    return result
My plan was to change the pipeline above to (where tvec is my TfidfVectorizer):

SMOTE_pipeline = make_pipeline(tvec, SMOTE(random_state=777), lr)

and then change the function signature so I can pass the pipeline in:

def nfeature_accuracy_checker(pipeline, vectorizer=cvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=lr):
and finally call the results with:

print("RESULT FOR UNIGRAM WITH STOP WORDS (Tfidf)\n")
feature_result_ugt = nfeature_accuracy_checker(SMOTE_pipeline, vectorizer=tvec)

Am I thinking right, or am I completely butchering the whole thing? I'm happy to explain further if anyone doesn't fully understand what I'm trying to do. Thank you!