Java - Is punctuation kept in a bag of words?
I'm creating a bag-of-words module from scratch. I'm not sure whether it is best practice in this approach to remove punctuation. Consider the sentence:
i've been "dmx world center" long time ago.are u?
Question: for the bag of words, should I consider
- the token dmx (no quotation mark) or "dmx (which includes the left quotation mark)
- u (without question mark) or u? (with question mark)
In short, should I remove punctuation marks when getting the distinct words?
Thanks in advance.
Update: the code I have implemented
sample text : ham  , im .. on snowboarding trip. wondering if planning befor go..a meet , greet kind of affair? cheers,
    HashSet<String> bagofwords = new HashSet<String>();
    BufferedReader reader = new BufferedReader(new FileReader(path));
    while (reader.ready()) {
        // Second part of the line; the first part indicates whether the message is spam or ham.
        String msg = reader.readLine().split("\t", 2)[1].toLowerCase();
        String[] words = msg.split("[\\s+\n.\t!?+,]"); // regex I've used to split words
        for (String word : words) {
            bagofwords.add(word);
        }
    }
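As a quick check of what this split pattern produces, the following sketch runs it on the sentence from the question and on a short fragment of the sample text. The class and method names (`SplitCheck`, `split`) are made up for illustration; only the pattern itself comes from the code above.

```java
import java.util.Arrays;
import java.util.List;

public class SplitCheck {
    // Same character class as in the question's code.
    static final String QUESTION_PATTERN = "[\\s+\n.\t!?+,]";

    // Lowercase the message and split it with the question's pattern.
    public static List<String> split(String msg) {
        return Arrays.asList(msg.toLowerCase().split(QUESTION_PATTERN));
    }

    public static void main(String[] args) {
        // Quotation marks are not in the character class, so they stay on the tokens.
        System.out.println(split("i've been \"dmx world center\" long time ago.are u?"));
        // Adjacent delimiters (" .. ") produce empty strings between them.
        System.out.println(split("im .. on"));
    }
}
```

With this pattern the quotation marks stay attached (`"dmx`, `center"`), and runs of delimiters such as ` .. ` yield empty strings, because the character class matches one character at a time. The `+` inside the brackets is a literal plus sign, not a quantifier.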
Try replacing the loop in your code with this:
    while (reader.ready()) {
        // Second part of the line; the first part indicates whether the message is spam or ham.
        String msg = reader.readLine().split("\t", 2)[1].toLowerCase();
        String[] words = msg.split("[\\s+\n.\t!?+,]"); // regex I've used to split words
        for (String word : words) {
            // Replaces the listed special characters with a space, then trims.
            // Note: inside the brackets, !-+ is a range covering ! " # $ % & ' ( ) * +
            bagofwords.add(word.replaceAll("[!-+.^:,\"?]", " ").trim());
        }
    }
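If the goal is for `dmx` and `"dmx` to count as the same word, another option is to treat every run of non-word characters as a single separator and skip empty tokens. This is a minimal sketch of that variant, not the code above: the class name `BagOfWordsSketch`, the method `distinctWords`, and the decision to keep apostrophes (so `i've` stays one token) are all assumptions made for illustration. It uses a `LinkedHashSet` only so the printed order is predictable.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class BagOfWordsSketch {
    // Assumption: anything that is not a letter, a digit, or an apostrophe is a separator.
    private static final String SEPARATORS = "[^\\p{L}\\p{N}']+";

    // Returns the distinct lowercase words of a message, without punctuation.
    public static Set<String> distinctWords(String msg) {
        Set<String> bag = new LinkedHashSet<>();
        for (String word : msg.toLowerCase().split(SEPARATORS)) {
            // A message that starts with punctuation yields a leading empty string; skip it.
            if (!word.isEmpty()) {
                bag.add(word);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(distinctWords("i've been \"dmx world center\" long time ago.are u?"));
        // prints [i've, been, dmx, world, center, long, time, ago, are, u]
    }
}
```

Whether apostrophes, hyphens, or digits should count as part of a word depends on the data, so the separator pattern is the part to adjust.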