Python tweet archiver
Public health and social science increasingly use Twitter for behavioral and marketing surveillance, so a tweet archive is a rich data set to practice on. This post works through a tidytext-style analysis of two Twitter archives in R: tokenizing the tweets, modeling how word use changes over time, and finding which words are associated with retweets.

First we clean and tokenize the tweets (here tweets is the combined archive data frame): drop retweets, strip residual HTML entities, tokenize with the "tweets" tokenizer, and remove stop words both with and without apostrophes, since this tokenizer keeps them:

```r
library(dplyr)
library(stringr)
library(tidytext)

remove_reg <- "&amp;|&lt;|&gt;"

tidy_tweets <- tweets %>%
  filter(!str_detect(text, "^RT")) %>%                  # drop retweets
  mutate(text = str_remove_all(text, remove_reg)) %>%   # strip HTML entities
  unnest_tokens(word, text, token = "tweets", strip_url = TRUE) %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"))

tidy_tweets
#> # A tibble: 11,468 × 7
#>    id      created_at source             retweets favorites person word
#>  1 8.04e17 16:44:03   Twitter Web Client        0         0 Julia  score
#>  2 8.04e17 16:44:03   Twitter Web Client        0         0 Julia  50
#>  3 8.04e17 16:42:03   Twitter Web Client        0         9 Julia  snow…
#>  4 8.04e17 16:42:03   Twitter Web Client        0         9 Julia  🌨
#>  5 8.04e17 16:42:03   Twitter Web Client        0         9 Julia  drin…
#>  6 8.04e17 16:42:03   Twitter Web Client        0         9 Julia  tea
#>  7 8.04e17 16:42:03   Twitter Web Client        0         9 Julia  🍵
#>  8 8.04e17 16:42:03   Twitter Web Client        0         9 Julia  #rst…
#>  9 8.04e17 16:42:03   Twitter Web Client        0         9 Julia  😍
#> 10 8.04e17 02:56:10   Twitter Web Client        0        11 Julia  julie
#> # … with 11,458 more rows
```

Notice that the word column contains tokenized emoji.
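The next question is how word use changes over time, which needs word counts per person, word, and time bin. Here is a minimal sketch of how such a words_by_time frame might be built, assuming monthly bins via lubridate's floor_date(); the bin width, the @-mention filter, and the word_total > 30 cutoff are all assumptions of this sketch:

```r
library(lubridate)

words_by_time <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%   # ignore @-mentions
  mutate(time_floor = floor_date(created_at, unit = "1 month")) %>%
  count(time_floor, person, word) %>%
  group_by(person, time_floor) %>%
  mutate(time_total = sum(n)) %>%       # words used by that person in that bin
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%       # uses of that word over the whole year
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 30)               # keep only reasonably common words
```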
The count column tells us how many times that person used that word in that time bin, the time_total column tells us how many words that person used during that time bin, and the word_total column tells us how many times that person used that word over the whole year. This is the data set we can use for modeling.

We can use nest() from tidyr to make a data frame with a list column that contains little miniature data frames for each word. Let's do that now and take a look at the resulting structure, as sketched below.
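A minimal sketch of that nesting step, assuming the words_by_time frame from the sketch above:

```r
library(tidyr)

# One row per person/word pair; the per-bin observations for that
# pair are collected into a little data frame in the "data" column.
nested_data <- words_by_time %>%
  nest(data = c(time_floor, count, time_total, word_total))
```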
Now we can fit a model to each of these little data frames: a binomial model asking, of the words this person used in a given time bin, what share were this word, and does that share change with time?

```r
library(purrr)

nested_models <- nested_data %>%
  mutate(models = map(data, ~ glm(cbind(count, time_total) ~ time_floor, .,
                                  family = "binomial")))

nested_models
#> # A tibble: 32 × 4
#>    person word    data     models
#>  1 David  #rstats <tibble> <glm>
#>  2 David  broom   <tibble> <glm>
#>  3 David  data    <tibble> <glm>
#>  4 David  ggplot2 <tibble> <glm>
#>  5 David  tidy    <tibble> <glm>
#>  6 David  time    <tibble> <glm>
#>  7 David  tweets  <tibble> <glm>
#>  8 Julia  #rstats <tibble> <glm>
#>  9 Julia  blog    <tibble> <glm>
#> 10 Julia  data    <tibble> <glm>
#> # … with 22 more rows
```

Notice that we have a new column for the modeling results; it is another list column and contains glm objects. The next step is to use map() and tidy() from the broom package to pull out the slopes for each of these models and find the important ones. We are comparing many slopes here and some of them are not statistically significant, so let's apply an adjustment to the p-values for multiple comparisons.
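A minimal sketch of that step, using broom's tidy() and base R's p.adjust(); the time_floor term name follows the model formula above, and the 0.05 cutoff is an assumption:

```r
library(broom)

slopes <- nested_models %>%
  mutate(models = map(models, tidy)) %>%   # tidy each glm into a data frame
  unnest(cols = c(models)) %>%             # one row per model term
  filter(term == "time_floor") %>%         # keep only the slope on time
  mutate(adjusted.p.value = p.adjust(p.value))

# Words whose frequency changed significantly over the year.
top_slopes <- slopes %>%
  filter(adjusted.p.value < 0.05)
```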
Which words lead to retweets? To start with, let's look at the number of times each of our tweets was retweeted. Let's find the total number of retweets for each person; one way to compute those totals is sketched below.
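A minimal sketch, assuming each tweet's retweet count repeats on every one of its word rows in tidy_tweets, so we take first(retweets) per tweet id to count each tweet once:

```r
totals <- tidy_tweets %>%
  group_by(person, id) %>%
  summarise(rts = first(retweets)) %>%   # one retweet count per tweet
  group_by(person) %>%
  summarise(total_rts = sum(rts))        # total retweets per account
```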
Now we can find the median number of retweets for each word, joining in those totals:

```r
word_by_rts <- tidy_tweets %>%
  group_by(id, word, person) %>%
  summarise(rts = first(retweets)) %>%
  group_by(person, word) %>%
  summarise(retweets = median(rts), uses = n()) %>%
  left_join(totals) %>%
  filter(retweets != 0) %>%
  ungroup()

word_by_rts %>%
  filter(uses >= 5) %>%
  arrange(desc(retweets))
#> # A tibble: 180 × 5
#>    person word          retweets  uses total_rts
#>  1 David  animation         85       5     13014
#>  2 David  gganimate         65       7     13014
#>  3 David  download          52       5     13014
#>  4 David  start             51       7     13014
#>  5 Julia  tidytext          50       7      1750
#>  6 David  introducing       45       6     13014
#>  7 David  understanding     37       6     13014
#>  8 David  error             34.5     8     13014
#>  9 David  bayesian          34       7     13014
#> 10 David  modeling          34       5     13014
#> # … with 170 more rows
```

At the top of this sorted data frame, we see tweets from Julia and David about packages that they work on, like gganimate and tidytext. Let's plot the words that have the highest median retweets for each of our accounts (Figure 7.6).
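The figure itself isn't reproduced here; a ggplot2 sketch of that plot, where the top-10-per-person cut is an assumption:

```r
library(ggplot2)

word_by_rts %>%
  filter(uses >= 5) %>%
  group_by(person) %>%
  slice_max(retweets, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, retweets)) %>%
  ggplot(aes(retweets, word, fill = person)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ person, scales = "free") +
  labs(x = "Median # of retweets for tweets containing each word",
       y = NULL)
```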