
Python help!! Incorporating SMOTE using Python. Highly imbalanced dataset

motorconcer

New Coder
I have been trying to play around with some datasets I found on GitHub to see how well I can run sentiment analysis on different datasets and how the code works. I have a dataset I wanted to plug into some code I found; the only issue is that it is highly imbalanced. For example, the negative sentiment has around 5,000 tweets whereas the positive has roughly 15,000 tweets. I found different ways to handle this situation. The first was to upsample the minority class using sklearn's resample:
import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

# Split the frame by class
df_majority = my_df[my_df.target==1]
df_minority = my_df[my_df.target==0]

# Upsample the minority class with replacement to match the majority count
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=15025,
                                 random_state=123)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])
x = df_upsampled.Tweet
y = df_upsampled.target

SEED = 2000

# 98% train, then split the remaining 2% evenly into validation and test
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.02, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED)
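For reference, the upsampling step above can be reproduced end-to-end on a tiny toy DataFrame (the `Tweet`/`target` column names are taken from the post; the data below is made up purely for illustration):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 4 positive tweets, 2 negative
my_df = pd.DataFrame({
    "Tweet": ["good", "great", "nice", "fine", "bad", "awful"],
    "target": [1, 1, 1, 1, 0, 0],
})

df_majority = my_df[my_df.target == 1]
df_minority = my_df[my_df.target == 0]

# Sample the minority class with replacement up to the majority count
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=len(df_majority),
                                 random_state=123)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])
print(df_upsampled.target.value_counts())  # both classes now equal in size
```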
However, using that code I felt the results weren't quite right. I then kept reading about SMOTE, which has reportedly worked extremely well with imbalanced datasets. The only issue is I have no idea how to incorporate it into the code I found online. I am honestly a real amateur at coding, so some help would be appreciated. This is the code I'm using:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from time import time

cvec = CountVectorizer()
lr = LogisticRegression()
n_features = np.arange(1000, 20000, 1000)

def nfeature_accuracy_checker(vectorizer=cvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=lr):
    result = []
    print(classifier)
    print("\n")
    for n in n_features:
        # Re-fit with a growing vocabulary cap and score on the validation set
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range)
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', classifier)
        ])
        print("Validation result for {} features".format(n))
        nfeature_accuracy, tt_time = accuracy_summary(checker_pipeline, x_train, y_train, x_validation, y_validation)
        result.append((n, nfeature_accuracy, tt_time))
    return result
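(The post calls an `accuracy_summary` helper that isn't shown. A hypothetical stand-in, assuming it fits the pipeline, times the run, and scores on the validation set, might look like this:)

```python
from time import time

def accuracy_summary(pipeline, x_train, y_train, x_test, y_test):
    # Hypothetical reconstruction: fit, time, and score the given pipeline.
    t0 = time()
    pipeline.fit(x_train, y_train)
    accuracy = pipeline.score(x_test, y_test)
    train_test_time = time() - t0
    return accuracy, train_test_time
```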
My idea was to incorporate:

SMOTE_pipeline = make_pipeline(tvec, SMOTE(random_state=777), lr)

change the function signature above to:

def nfeature_accuracy_checker(pipeline, vectorizer=cvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=lr):

and then get the results with:

print("RESULT FOR UNIGRAM WITH STOP WORDS (Tfidf)\n")
feature_result_ugt = nfeature_accuracy_checker(SMOTE_pipeline, vectorizer=tvec)

Am I thinking right, or am I completely butchering the whole thing? I'm happy to explain further if anyone doesn't fully understand what I'm trying to do. Thank you!