pandas - Split series containing lists of strings into multiple columns
I'm using pandas to perform string matching on a Twitter dataset.
I've imported a CSV of tweets and indexed it using the date. I've then created a new column containing the text matches:
In [1]: import pandas as pd
        indata = pd.read_csv('tweets.csv')
        indata.index = pd.to_datetime(indata["date"])
        indata["matches"] = indata.tweet.str.findall("rudd|abbott")
        only_results = pd.Series(indata["matches"])
        only_results.head(10)
Out[1]:
date
2013-08-06 16:03:17          []
2013-08-06 16:03:12          []
2013-08-06 16:03:10          []
2013-08-06 16:03:09          []
2013-08-06 16:03:08          []
2013-08-06 16:03:07          []
2013-08-06 16:03:07    [abbott]
2013-08-06 16:03:06          []
2013-08-06 16:03:02          []
2013-08-06 16:03:00      [rudd]
Name: matches, dtype: object
What I want to end up with is a dataframe, grouped by day/month, in which the different search terms are columns that I can then plot.
I came across what looks like a perfect solution in another answer (https://stackoverflow.com/a/16637607/2034487), but when trying to apply it to my series, I'm getting an exception:
In [2]: only_results.apply(lambda x: pd.Series(1, index=x)).fillna(0)
Out[2]:
Exception - Traceback (most recent call last)
...
Exception: Reindexing only valid with uniquely valued Index objects
I want to be able to apply the changes within the dataframe, so that I can apply and reapply groupby conditions and perform the plots efficiently. I'd also love to learn more about how the .apply() method works.
Thanks in advance.
Update after successful answer
The issue was duplicates in the "matches" column that I hadn't seen. I iterated through that column to remove the duplicates and then used the original solution from @Jeff linked above. This was successful, and I can now .groupby() on the resultant series to see daily, hourly, etc., trends. Here's an example of the resultant plot:
In [3]: successful_run = only_results.apply(lambda x: pd.Series(1, index=x)).fillna(0)
In [4]: successful_run.groupby([successful_run.index.day, successful_run.index.hour]).sum().plot()
Out[4]: <matplotlib.axes.AxesSubplot at 0x110b51650>
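The de-duplication step described in the update is not shown, so here is a minimal sketch of one way it could be done. The sample data is hypothetical, and it swaps in a different technique from the linked answer: `dict.fromkeys` collapses repeated terms within each list, and the `DataFrame` constructor builds the indicator columns instead of `.apply(lambda x: pd.Series(...))`.

```python
import pandas as pd

# Hypothetical stand-in for the "matches" column: lists as returned by str.findall.
only_results = pd.Series(
    [[], ["abbott"], ["rudd", "rudd"], ["rudd", "abbott"]],
    index=pd.to_datetime([
        "2013-08-06 16:03:17",
        "2013-08-06 16:03:07",
        "2013-08-06 16:03:00",
        "2013-08-06 17:10:00",
    ]),
    name="matches",
)

# dict.fromkeys(terms, 1) keeps one key per distinct term, so a tweet that
# mentions "rudd" twice contributes a single {"rudd": 1} entry.
records = [dict.fromkeys(terms, 1) for terms in only_results]

# One indicator column per term; terms absent from a tweet become 0.
indicators = pd.DataFrame(records, index=only_results.index).fillna(0)

# Same grouping as in the update: number of matching tweets per day and hour.
hourly = indicators.groupby([indicators.index.day, indicators.index.hour]).sum()
print(hourly)
```

Chaining `.plot()` onto `hourly` should give a plot along the lines of the one above, assuming matplotlib is installed.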
You've got a duplicate result (e.g. rudd appears more than once in a single tweet), hence the exception (see below).
I think it's going to be preferable to count occurrences rather than list them with findall (pandas data structures aren't designed to contain lists, although str.findall uses them).
I recommend using something like this:
In [1]: s = pd.Series(['aa', 'aba', 'b'])
In [2]: pd.DataFrame({key: s.str.count(key) for key in ['a', 'b']})
Out[2]:
   a  b
0  2  0
1  2  1
2  0  1
Note (the exception is raised because of the duplicate 'a's found in the first 2 rows):
In [3]: s.str.findall('a').apply(lambda x: pd.Series(1, index=x)).fillna(0)
#InvalidIndexError: Reindexing only valid with uniquely valued Index objects
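To connect the counting approach back to the tweets in the question, here is a rough sketch under a few assumptions: the small frame below is made-up sample data standing in for tweets.csv, with the same "date" and "tweet" columns, and the search terms are treated as plain literals (str.count interprets its argument as a regular expression).

```python
import pandas as pd

# Hypothetical sample standing in for tweets.csv (columns "date" and "tweet").
indata = pd.DataFrame({
    "date": ["2013-08-06 16:03:07", "2013-08-06 16:03:00", "2013-08-07 09:15:00"],
    "tweet": ["abbott speaks", "rudd vs rudd", "rudd and abbott debate"],
})
indata.index = pd.to_datetime(indata["date"])

terms = ["rudd", "abbott"]

# One column per search term, holding the number of occurrences in each tweet.
# Repeated mentions are simply counted, so no duplicate-index problem arises.
counts = pd.DataFrame({term: indata["tweet"].str.count(term) for term in terms})

# Total mentions per calendar day (normalize() drops the time of day).
daily = counts.groupby(counts.index.normalize()).sum()
print(daily)
```

Because every row yields plain integers rather than a list, the result can be grouped by day, hour, or month and plotted directly.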