Java - Is punctuation kept in a bag of words?


I'm creating a bag-of-words module from scratch. I'm not sure whether it is best practice in this approach to remove punctuation. Consider the sentence:

i've been "dmx world center" long time ago.are u? 

Question: for the bag of words, should I consider

  • the token dmx (no quotation mark) or "dmx (which includes left quotation mark)
  • u (without question mark) or u? (with question mark)

In short, should I remove punctuation marks when getting the distinct words?

Thanks in advance.

Update: the code I have implemented

Sample text: ham , im .. on snowboarding trip. wondering if planning befor go..a meet , greet kind of affair? cheers,

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashSet;

    HashSet<String> bagOfWords = new HashSet<String>();
    BufferedReader reader = new BufferedReader(new FileReader(path));
    while (reader.ready()) {
        // 2nd part; the 1st part indicates whether the message is spam or ham
        String msg = reader.readLine().split("\t", 2)[1].toLowerCase();
        // regex I've used to split words
        String[] words = msg.split("[\\s+\n.\t!?+,]");
        for (String word : words) {
            bagOfWords.add(word);
        }
    }

Try replacing the loop in your code with:

    while (reader.ready()) {
        // 2nd part; the 1st part indicates whether the message is spam or ham
        String msg = reader.readLine().split("\t", 2)[1].toLowerCase();
        // regex I've used to split words
        String[] words = msg.split("[\\s+\n.\t!?+,]");
        for (String word : words) {
            // removes the special characters mentioned; note that !-+ inside the
            // character class is a range from ! to +, so it also matches # $ % & ' ( ) *
            bagOfWords.add(word.replaceAll("[!-+.^:,\"?]", " ").trim());
        }
    }
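
Note that splitting on a single-character class leaves empty strings in the array whenever two delimiters are adjacent (as in "go..a"), and those end up in the set too. A minimal sketch of one alternative is shown here: replace every run of characters that is not a letter, digit, or apostrophe with a space, then split on whitespace and skip empty tokens. The class name BagOfWordsSketch and the method tokenize are hypothetical names used only for illustration, and keeping the apostrophe (so that "i've" stays one token) is an assumption that may not suit every corpus.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class, not part of the original code.
final class BagOfWordsSketch {

    // Returns the distinct lowercase tokens of the text, in first-seen order.
    static Set<String> tokenize(String text) {
        Set<String> bag = new LinkedHashSet<>();
        // Assumption: anything that is not a letter, digit, or apostrophe is a separator.
        String cleaned = text.toLowerCase().replaceAll("[^\\p{L}\\p{Nd}']+", " ");
        for (String token : cleaned.trim().split("\\s+")) {
            // split() returns a single empty string when the input is empty
            if (!token.isEmpty()) {
                bag.add(token);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I've been \"dmx world center\" long time ago.are u?"));
        // prints [i've, been, dmx, world, center, long, time, ago, are, u]
    }
}
```

With this approach both `"dmx` and `u?` from the question reduce to `dmx` and `u`. Whether that is desirable depends on the task: for spam filtering, for example, punctuation such as repeated exclamation marks may itself carry signal, in which case it could be kept as separate tokens instead of being discarded.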
