Java - Is punctuation kept in a bag of words?
I'm creating a bag of words module from scratch. I'm not sure whether it is best practice in this approach to remove punctuation. Consider the sentence:
i've been "dmx world center" long time ago.are u?
Question: for the bag of words, should I consider
- the token dmx (no quotation mark) or "dmx (which includes the left quotation mark)?
- the token u (without the question mark) or u? (with the question mark)?
In short, should I remove punctuation marks when getting the distinct words?
Thanks in advance.
Update: here is the code I have implemented.
Sample text: ham , im .. on snowboarding trip. wondering if planning befor go..a meet , greet kind of affair? cheers,
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;

HashSet<String> bagOfWords = new HashSet<String>();
BufferedReader reader = new BufferedReader(new FileReader(path));
while (reader.ready()) {
    // Take the 2nd part of the line; the 1st part indicates whether the message is spam or ham
    String msg = reader.readLine().split("\t", 2)[1].toLowerCase();
    // Regex I've used to split the message into words
    String[] words = msg.split("[\\s+\n.\t!?+,]");
    for (String word : words) {
        bagOfWords.add(word);
    }
}
Try replacing your loop with this:
while (reader.ready()) {
    // Take the 2nd part of the line; the 1st part indicates whether the message is spam or ham
    String msg = reader.readLine().split("\t", 2)[1].toLowerCase();
    // Regex used to split the message into words
    String[] words = msg.split("[\\s+\n.\t!?+,]");
    for (String word : words) {
        // Replaces the special characters listed with a space, then trims.
        // Note: inside [...] the sequence !-+ is a range covering ! " # $ % & ' ( ) * +
        bagOfWords.add(word.replaceAll("[!-+.^:,\"?]", " ").trim());
    }
}
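As a rough alternative to the two-step split-then-replace above, here is a minimal sketch that strips punctuation during the split itself. It is not the code from the question or the answer: the class name `BagOfWordsSketch` and the helper `distinctWords` are made up for illustration, and it assumes that apostrophes inside words (as in "i've") should be kept while every other non-letter, non-digit character acts as a separator. It also skips empty tokens, which the single-character split in the question can produce for runs such as ".." or " , ".

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class BagOfWordsSketch {

    // Hypothetical helper: returns the distinct lowercase tokens of a text,
    // with punctuation removed (apostrophes are kept by assumption).
    static Set<String> distinctWords(String text) {
        Set<String> bag = new LinkedHashSet<>(); // keeps first-seen order
        // Split on runs of characters that are not letters, digits, or apostrophes
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}']+");
        for (String token : tokens) {
            // A leading separator yields an empty first token, so skip empties
            if (!token.isEmpty()) {
                bag.add(token);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        String sentence = "i've been \"dmx world center\" long time ago.are u?";
        System.out.println(distinctWords(sentence));
    }
}
```

With the sentence from the question, this yields dmx and u without the quotation mark or question mark, and "ago.are" is separated into ago and are. One limitation of the assumption: a word wrapped in single quotes keeps those quotes, so you may want to trim leading and trailing apostrophes if your data uses them as quotation marks.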