java - Is punctuation kept in a bag of words?
I'm creating a bag-of-words module from scratch. I'm not sure whether it's best practice in this approach to remove punctuation. Consider the sentence:
i've been "dmx world center" long time ago.are u?
Question: for the bag of words, should I consider

- the token dmx (no quotation mark) or "dmx (which includes the left quotation mark)?
- the token u (without the question mark) or u? (with the question mark)?
In short, should I remove punctuation marks when getting the distinct words?

Thanks in advance.

Update: here is the code I have implemented.
sample text : ham , im .. on snowboarding trip. wondering if planning befor go..a meet , greet kind of affair? cheers,
    HashSet<String> bagOfWords = new HashSet<String>();
    BufferedReader reader = new BufferedReader(new FileReader(path));
    while (reader.ready()) {
        // Second part of the line; the first part indicates whether the message is spam or ham
        String msg = reader.readLine().split("\t", 2)[1].toLowerCase();
        // Regex I've used to split the words
        String[] words = msg.split("[\\s+\n.\t!?+,]");
        for (String word : words) {
            bagOfWords.add(word);
        }
    }
Try replacing your loop with this code:
    while (reader.ready()) {
        // Second part of the line; the first part indicates whether the message is spam or ham
        String msg = reader.readLine().split("\t", 2)[1].toLowerCase();
        // Regex used to split the words
        String[] words = msg.split("[\\s+\n.\t!?+,]");
        for (String word : words) {
            // Replaces the special characters listed with a space, then trims.
            // Note: inside the character class, !-+ is a range covering '!' through '+'.
            bagOfWords.add(word.replaceAll("[!-+.^:,\"?]", " ").trim());
        }
    }
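One thing to watch with the loop above: the split pattern matches a single delimiter character at a time, so runs such as ".." or ", " produce empty strings that also end up in the set. As a minimal sketch of an alternative (not the only way to do it), the snippet below splits on runs of anything that is not a letter, digit, or apostrophe and skips empty tokens. The class name BagOfWordsSketch and the method tokenize are hypothetical names used only for illustration, and keeping apostrophes so that contractions like "i've" stay whole is an assumption you may want to change.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class BagOfWordsSketch {

    // Hypothetical helper: returns the distinct lowercase tokens of a text,
    // in first-seen order, with punctuation removed.
    static Set<String> tokenize(String text) {
        Set<String> bag = new LinkedHashSet<>();
        // Split on one or more characters that are not letters, digits, or apostrophes
        // (assumption: apostrophes are kept so contractions survive as one token).
        String[] tokens = text.toLowerCase().split("[^\\p{L}\\p{Nd}']+");
        for (String token : tokens) {
            // split() can yield a leading empty string when the text starts with a delimiter
            if (!token.isEmpty()) {
                bag.add(token);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        String sentence = "i've been \"dmx world center\" long time ago.are u?";
        // prints [i've, been, dmx, world, center, long, time, ago, are, u]
        System.out.println(tokenize(sentence));
    }
}
```

With this version both dmx and u are stored without the surrounding quotation mark or question mark, so "dmx and dmx count as the same word.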