pandas - Split series containing lists of strings into multiple columns

I'm using pandas to perform string matching on a Twitter dataset.

I've imported a CSV of tweets and indexed it using the date. I've then created a new column containing the text matches:

In [1]: import pandas as pd
        indata = pd.read_csv('tweets.csv')
        indata.index = pd.to_datetime(indata["date"])
        indata["matches"] = indata.tweet.str.findall("rudd|abbott")
        only_results = pd.Series(indata["matches"])
        only_results.head(10)

Out[1]:
date
2013-08-06 16:03:17          []
2013-08-06 16:03:12          []
2013-08-06 16:03:10          []
2013-08-06 16:03:09          []
2013-08-06 16:03:08          []
2013-08-06 16:03:07          []
2013-08-06 16:03:07    [abbott]
2013-08-06 16:03:06          []
2013-08-06 16:03:02          []
2013-08-06 16:03:00      [rudd]
Name: matches, dtype: object

What I want to end up with is a dataframe, grouped by day/month, in which the different search terms are columns that I can then plot.

I came across what looks like the perfect solution in another answer (https://stackoverflow.com/a/16637607/2034487), but when I try to apply it to this series, I get an exception:

In [2]: only_results.apply(lambda x: pd.Series(1,index=x)).fillna(0)
Out [2]: Exception - Traceback (most recent call last)
...
Exception: Reindexing only valid with uniquely valued Index objects

I want to be able to apply the changes within the dataframe, so that I can apply and reapply groupby conditions and produce the plots efficiently. I'd also love to learn more about how the .apply() method works.

Thanks in advance.

Update after successful answer

The issue was duplicates in the "matches" column that I hadn't seen. I iterated through that column to remove the duplicates and then used the original solution from @Jeff linked above. This was successful, and I can now .groupby() on the resultant series to see daily, hourly, etc. trends. Here's an example of the resultant plot:

In [3]: successful_run = only_results.apply(lambda x: pd.Series(1,index=x)).fillna(0)

In [4]: successful_run.groupby([successful_run.index.day,successful_run.index.hour]).sum().plot()

Out [4]: <matplotlib.axes.AxesSubplot at 0x110b51650>

(Figure: plot grouped by day and hour.)
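The de-duplication step mentioned in the update isn't shown. As a rough sketch of one way it could be done (not necessarily what was actually run), the lists can be turned into dicts, which drops repeated terms on the way to 0/1 indicator columns. The sample data and the names `matches`, `terms` and `indicators` are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the "matches" column produced by str.findall.
matches = pd.Series(
    [[], ["abbott"], ["rudd", "rudd"], ["rudd", "abbott"]],
    index=pd.to_datetime(
        [
            "2013-08-06 16:03:00",
            "2013-08-06 16:03:02",
            "2013-08-06 16:03:06",
            "2013-08-06 16:03:07",
        ]
    ),
    name="matches",
)

# Assumed list of search terms; fixes the column set and order.
terms = ["rudd", "abbott"]

# dict.fromkeys keeps each term once, so repeated matches in one
# tweet no longer produce duplicate labels.
rows = [dict.fromkeys(found, 1) for found in matches]

indicators = (
    pd.DataFrame(rows, index=matches.index)
    .reindex(columns=terms)  # make sure every term has a column
    .fillna(0)               # tweets without a match get 0
    .astype(int)
)
print(indicators)
```

Each column is 1 where the tweet mentioned the term at least once and 0 otherwise, so it can be grouped and summed in the same way as `successful_run` above.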

You've got a duplicate result (e.g. rudd appears more than once in a single tweet), hence the exception (see below).

I think it's going to be preferable to count occurrences rather than build a list with findall (pandas data structures aren't designed to contain lists, although str.findall returns them).
I'd recommend using something like this:

In [1]: s = pd.Series(['aa', 'aba', 'b'])

In [2]: pd.DataFrame({key: s.str.count(key) for key in ['a', 'b']})
Out[2]:
   a  b
0  2  0
1  2  1
2  0  1

Note (the exception is raised because of the duplicate 'a's found in the first two rows):

In [3]: s.str.findall('a').apply(lambda x: pd.Series(1,index=x)).fillna(0)
#InvalidIndexError: Reindexing only valid with uniquely valued Index objects
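To connect this back to the tweets in the question, here is a minimal sketch of the counting approach on a datetime-indexed series, followed by the day/hour grouping. The sample tweets and the names `tweets`, `terms`, `counts` and `by_day_hour` are invented for illustration:

```python
import pandas as pd

# Hypothetical sample of tweet text indexed by timestamp.
tweets = pd.Series(
    ["rudd said rudd", "nothing here", "abbott vs rudd", "abbott"],
    index=pd.to_datetime(
        [
            "2013-08-06 16:03:00",
            "2013-08-06 16:03:07",
            "2013-08-06 17:10:00",
            "2013-08-07 09:00:00",
        ]
    ),
)

# Assumed search terms. str.count treats each one as a regular expression.
terms = ["rudd", "abbott"]

# One integer column per term; repeated mentions are simply counted.
counts = pd.DataFrame({term: tweets.str.count(term) for term in terms})

# Total mentions per (day of month, hour), ready for .plot().
by_day_hour = counts.groupby([counts.index.day, counts.index.hour]).sum()
print(by_day_hour)
```

Because the result holds plain integer counts rather than lists, no duplicate-index problem can occur, and the groupby keys can be swapped (for example to `counts.index.month`) without rebuilding the columns.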
