I am working on a project that aims to retrieve a large dataset (tweet data that is a couple of days old) from Twitter using the twitteR library in R. I am having difficulty storing the tweets because my machine has only 8 GB of memory: it ran out of memory before I had even retrieved a full day's worth. Is there a way to write the tweets straight to disk without holding them all in RAM? I am not using the streaming API because I need to get old tweets.
2 Answers
Find a way to make your program write to disk periodically. Keep count of the tweets you grab and flush them to a file once the count reaches a threshold. I don't write R, but pseudocode might look like:
$tweets = get_tweets();
$count = 0;
$tweet_array = array();
foreach ($tweets as $tweet) {
    $tweet_array[] = $tweet;
    $count++;
    if ($count >= 10000) {
        append_to_file($tweet_array, 'file_name.txt');
        $tweet_array = array(); // drop the batch so its memory is freed
        $count = 0;             // reset the counter for the next batch
    }
}
// flush whatever is left after the last full batch
if ($count > 0) {
    append_to_file($tweet_array, 'file_name.txt');
}
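Since the question uses R's twitteR package, here is a minimal R sketch of the same batching idea. searchTwitter() and twListToDF() are real twitteR functions; the keyword, date range, batch size, and file name below are illustrative assumptions, and note that Twitter's REST search API only reaches back roughly a week.

library(twitteR)

batch_size <- 10000
out_file   <- "tweets.csv"

# Fetch one day at a time so only a single batch is ever held in RAM.
# "keyword" and the dates are placeholders for your actual query.
days <- seq(as.Date("2014-06-01"), as.Date("2014-06-03"), by = "day")
for (d in days) {
  batch <- searchTwitter("keyword", n = batch_size,
                         since = format(d), until = format(d + 1))
  if (length(batch) > 0) {
    df <- twListToDF(batch)   # list of status objects -> data frame
    write.table(df, out_file, sep = ",", row.names = FALSE,
                append    = file.exists(out_file),
                col.names = !file.exists(out_file))
    rm(df)
  }
  rm(batch)
  gc()  # ask R to release the batch before fetching the next one
}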
I worked on a Twitter data project last fall in which we used Java libraries to pull tweet data from both the streaming and REST APIs; we used Twitter4J (an unofficial Java library) for the Twitter API.
The tweet data was fetched and written directly to text files on our hard drives. Yes, we did increase the memory and heap size, and I believe RStudio has a similar option. An alternative is to pull smaller amounts of tweet data over a larger number of repetitions.
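On the memory-limit side, a hedged sketch: Windows builds of R (prior to R 4.2, which removed this limit) expose memory.limit() for inspecting and raising the session's memory cap; the size value below is an illustrative assumption.

# Windows builds of R only (removed in R 4.2+):
memory.limit()              # report the current limit in MB
memory.limit(size = 16000)  # raise the limit, up to available RAM plus page file

gc()  # on any platform, report usage and trigger a garbage collection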