What is the Twitter Scraper?
The Twitter Scraper is a tool developed by Intersect for collecting data directly from Twitter based on a variety of parameters, including hashtags and phrases, users, location and language. It outputs matching tweets to a comma-separated values (.csv) spreadsheet for analysis.
Who can access the Twitter Scraper?
The Twitter Scraper is offered through Launchpod, a utility for deploying NeCTAR Research Cloud virtual machines (VMs), and can be accessed by anyone with access to the Australian Access Federation (AAF). All Australian universities are subscribers to AAF, meaning anyone with a username and password at an Australian university can configure and deploy a Twitter Scraper.
How can I access the Twitter Scraper?
To use the Twitter Scraper, you must use the Launchpod service to deploy a NeCTAR virtual machine with the Twitter Scraper application. For information on how to do this, please read the Intersect guide to Launchpod, which will take you through the first steps of deploying a Twitter Scraper VM through Launchpod. This guide will assist you in the specific configurations you will need for the Twitter Scraper application in order to create a working scraper.
What else do I need to know?
Before deploying a Twitter Scraper, you must first obtain the necessary access credentials from your Twitter account by creating an ‘application’ from within Twitter, and you must also specify all scrape parameters during the deployment stage.
Obtaining Twitter access credentials
Important: The steps to obtain your Twitter access tokens and consumer API keys only need to be done once. If you intend to deploy more Twitter Scrapers, you should use the same access tokens and credentials. You can always log into the Twitter Applications Development site to get access tokens you’ve already generated, or you can keep them in a safe place for later use.
The Twitter Scraper uses your Twitter account to monitor the constant stream of tweets and find those matching your specified search terms. Before you can use the Twitter Scraper, you need to create an ‘application’ within Twitter, and then obtain the Consumer Key and the Consumer Secret from this application, as well as your Access Token and Access Token Secret. The following sections will describe how to obtain these.
First, navigate to the Twitter Applications Development site and log in with your Twitter credentials. From here you will create the ‘application’. This page is typically used by developers to register apps that they have created and which people can download. Most of the information is actually irrelevant for the Twitter Scraper, but it needs to be entered in order to create the access credentials that the Twitter Scraper needs. As of mid-2015, you need to have a mobile phone number associated with your Twitter account to register an application. If you don’t have a mobile number, go to your Twitter profile and enter one.
Click Create New App and provide a name (such as ‘Twitter scraper’) and a brief description (such as ‘this is an app to harvest tweets’). You also need to provide a URL. It does not have to be a real website, but it does have to begin with ‘http://’. A dummy URL like http://www.example.com is sufficient. Scroll to the end of the page and select Yes, I agree to the Developer Agreement. Finally, click Create your Twitter Application.
Once your app has been created, navigate to the ‘Keys and Access Tokens’ tab of the app settings. This page will display two of the four settings you need, the Consumer Key (API Key) and the Consumer Secret (API Secret). You can copy these straight out of Twitter and into a note or document for use later, or put them directly into the relevant fields in Launchpod.
To generate the other two credentials, scroll to the bottom of the same page and, under Your Access Token, click the button Create my access token. The page will reload, and the Access Token and the Access Token Secret will be displayed at the end. When copying the Access Token, make sure that you copy the whole line, including the 9-digit number at the front, which is the same as your owner ID. The Launchpod Twitter Scraper will not function without this.
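Since a truncated Access Token is an easy mistake to make, a quick sanity check before pasting it into Launchpod may help. The following is a minimal sketch, not part of the Twitter Scraper itself: the function name is hypothetical, and it assumes the numeric owner ID is joined to the rest of the token by a hyphen. It checks only for leading digits rather than exactly nine, because the length of the owner ID may vary between accounts.

```python
import re


def looks_like_full_access_token(token):
    """Return True if the token starts with a numeric owner ID and a hyphen.

    Assumption: the Access Token has the shape <owner id>-<random string>.
    """
    return re.fullmatch(r"\d+-\S+", token.strip()) is not None


if __name__ == "__main__":
    # Both tokens below are made-up placeholders, not real credentials.
    print(looks_like_full_access_token("123456789-AbCdEfGhIjKl"))
    print(looks_like_full_access_token("AbCdEfGhIjKl"))
```

This only checks the shape of the token. It cannot tell you whether Twitter will accept it.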
Now that you have all of the access credentials needed to deploy a Twitter Scraper, you can begin configuring the tweet harvesting parameters.
Specifying Scrape Parameters
The Twitter Scraper search parameters need to be configured prior to the deployment of the server. Once it is deployed, you cannot change the search settings, which ensures that the parameters stay the same throughout a harvest period. If you need to modify your search settings, you will need to deploy a new VM.
This is where you will enter the search terms and hashtags that the Twitter Scraper will monitor the Twitter stream for. You can set search phrases in a number of ways. To search for an individual word, enter a single word and hit return. If you want to search for tweets that contain multiple words in any order, enter the words separated by spaces and hit return. If you want to search for an exact phrase, enclose the phrase in double quotes before hitting return. For example, entering the three search terms Launchpod, Intersect Australia and "Twitter Scraper" will return any tweet that contains the word ‘Launchpod’, in addition to any tweet that contains both the words ‘Intersect’ and ‘Australia’, in addition to any tweet that contains the exact phrase ‘Twitter Scraper’.
In other words, each search term is separated by an implicit OR operation, and every word inside a search term (such as ‘Intersect Australia’ above) is separated by an implicit AND operation. Lastly, any phrase in double quotes will match only those tweets that contain that exact phrase. You can also mix multiple words and exact phrases in a single search term, such as the search term "Twitter Scraper" Intersect Australia, which will return only tweets that contain the exact phrase ‘Twitter Scraper’ as well as both words ‘Intersect’ and ‘Australia’.
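The matching rules described above can be illustrated with a short sketch. This is not the Twitter Scraper's own code. It is a minimal Python illustration that assumes matching is case-insensitive and that words are compared as whole tokens. The function names are hypothetical.

```python
import re


def parse_term(term):
    """Split one search term into parts; double-quoted phrases stay whole."""
    parts = []
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', term):
        parts.append((phrase or word).lower())
    return parts


def term_matches(term, tweet_text):
    """Implicit AND: every word and every quoted phrase must be present."""
    text = tweet_text.lower()
    # Assumption: hashtags and mentions are kept as single tokens.
    tokens = set(re.findall(r"[#@]?\w+", text))
    for part in parse_term(term):
        if " " in part:
            # Exact phrase: the words must appear together, in order.
            pattern = r"(?<!\w)" + re.escape(part) + r"(?!\w)"
            if re.search(pattern, text) is None:
                return False
        elif part not in tokens:
            return False
    return True


def tweet_matches(terms, tweet_text):
    """Implicit OR: one matching search term is enough."""
    return any(term_matches(term, tweet_text) for term in terms)


if __name__ == "__main__":
    terms = ["Launchpod", "Intersect Australia", '"Twitter Scraper"']
    for tweet in ["Australia needs Intersect", "A scraper for Twitter"]:
        print(tweet, "->", tweet_matches(terms, tweet))
```

Under these assumptions, the first tweet matches because both words of the second term are present in any order, while the second tweet fails because the quoted phrase does not appear contiguously.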
- Exclude Phrases
This field is used to prevent tweets that match the pattern from being saved. Exclusion overrules inclusion, meaning that even if a tweet matches every inclusion parameter, if it matches a single exclusion parameter, it will not be included in the output. The excluded phrases field works exactly like the phrases field in the interpretation of terms.
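The precedence of exclusion over inclusion can be sketched as follows. This is an illustrative Python sketch only: the function names are hypothetical, and it uses a deliberately simplified matcher (all words present, no quoted phrases) to keep the example short.

```python
def contains_all_words(term, tweet_text):
    """Simplified matcher: every word of the term occurs in the tweet."""
    words = tweet_text.lower().split()
    return all(word in words for word in term.lower().split())


def should_keep(tweet_text, include_terms, exclude_terms,
                matcher=contains_all_words):
    """Decide whether a tweet would be written to the output."""
    # Exclusion is checked first and always wins.
    if any(matcher(term, tweet_text) for term in exclude_terms):
        return False
    # With no exclusion hit, one matching inclusion term is enough.
    return any(matcher(term, tweet_text) for term in include_terms)


if __name__ == "__main__":
    print(should_keep("launchpod is handy", ["launchpod"], ["spam"]))
    print(should_keep("launchpod spam offer", ["launchpod"], ["spam"]))
```

The second tweet is dropped even though it matches the inclusion term, because a single exclusion match is enough to remove it.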
You may also specify particular users you want to include in your search using the users field. If you leave this blank, the effect will be that all users are included in the harvest. If you have just one username entered, only tweets from that user will be included in the output. If you need to include a large list of usernames, you can paste a list of usernames separated by commas.
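The behaviour of the users field can be sketched in a few lines. This is a hypothetical illustration rather than the scraper's implementation. It assumes usernames are compared case-insensitively and that a leading ‘@’ is ignored.

```python
def parse_usernames(field):
    """Turn a comma-separated users field into a set of usernames."""
    names = set()
    for name in field.split(","):
        cleaned = name.strip().lstrip("@").lower()
        if cleaned:
            names.add(cleaned)
    return names


def user_allowed(author, usernames):
    """A blank users field means that every author is included."""
    return not usernames or author.lower() in usernames


if __name__ == "__main__":
    wanted = parse_usernames("alice, @Bob ,carol")
    print(sorted(wanted))
    print(user_allowed("BOB", wanted), user_allowed("dave", wanted))
```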
Checking the ‘Set Location Bounds’ box will open a map window in which you can define a bounding box. The result is that only tweets that are geocoded as originating from inside that box will be included in the output. Tweets that originated from outside the bounding box will not be harvested, nor will tweets that do not have any geocoding. Remember that users can switch off geocoding, and many do for privacy reasons. It is not currently possible to include tweets that are not geocoded unless you leave the location boundary unspecified.
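The effect of a location boundary can be sketched as a simple containment test. The sketch below is an assumption-laden illustration, not the scraper's code: the BoundingBox type and function name are hypothetical, coordinates are taken as (longitude, latitude) pairs, and the box is assumed not to cross the 180° meridian.

```python
from typing import NamedTuple, Optional, Tuple


class BoundingBox(NamedTuple):
    west: float   # minimum longitude
    south: float  # minimum latitude
    east: float   # maximum longitude
    north: float  # maximum latitude


def inside_bounds(coords: Optional[Tuple[float, float]],
                  box: Optional[BoundingBox]) -> bool:
    """Return True if a tweet passes the location filter."""
    if box is None:
        # No location boundary set: geocoding does not matter.
        return True
    if coords is None:
        # Boundary set but the tweet is not geocoded: it is dropped.
        return False
    lon, lat = coords
    return box.west <= lon <= box.east and box.south <= lat <= box.north


if __name__ == "__main__":
    # Rough, illustrative box around Sydney.
    sydney = BoundingBox(west=150.5, south=-34.2, east=151.5, north=-33.5)
    print(inside_bounds((151.2, -33.87), sydney))
    print(inside_bounds(None, sydney))
```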
- Output Fields
The output fields allow you to customise the spreadsheet output containing the tweets. By default, all elements are included, but you can uncheck any of the checkboxes if you do not want to include a particular tweet attribute in your output.
You can use the Languages field to restrict your search to tweets in a particular language. You can also select multiple languages. The effect of selecting no language here is analogous to selecting no users; any tweet will match irrespective of its language value. Tweets are identified as being in a particular language automatically by machine detection, and this may not work perfectly.
- Treat Hashtags Independently
This setting modifies the behaviour of the Twitter Scraper when tweets contain multiple hashtags. Hashtags are parsed out of the content of the tweet body and sent to a separate column of the spreadsheet output. If this box is left unchecked, a tweet containing more than one hashtag will occur once in the spreadsheet, and all of its hashtags will be written to the hashtag column. If this box is checked, then the tweet will be repeated in the output spreadsheet, once for each hashtag, and the hashtag column will contain only one hashtag. This may be useful if you intend to process the output of the scraper using hashtags. If you only intend to analyse the tweet body, you may want to leave this box unchecked so as not to have repeats of the tweet.
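The difference between the two settings can be seen in a small sketch that builds the output rows for one tweet. This is an illustration only, with hypothetical function names. It assumes that hashtags are stored without the leading ‘#’ and that multiple hashtags in one cell are separated by a space, neither of which is confirmed by this guide.

```python
import re


def extract_hashtags(body):
    """Pull hashtags out of the tweet body, without the leading '#'."""
    return re.findall(r"#(\w+)", body)


def rows_for_tweet(body, treat_independently):
    """Return (tweet body, hashtag column) rows for one tweet."""
    hashtags = extract_hashtags(body)
    if treat_independently and hashtags:
        # Box checked: the tweet is repeated, once per hashtag.
        return [(body, tag) for tag in hashtags]
    # Box unchecked: one row, with every hashtag in the same cell.
    return [(body, " ".join(hashtags))]


if __name__ == "__main__":
    tweet = "Loving #research on the #cloud"
    print(rows_for_tweet(tweet, treat_independently=False))
    print(rows_for_tweet(tweet, treat_independently=True))
```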
This checkbox allows you to de-identify the tweets captured by the scraper, which can be useful for research ethics and privacy concerns. When this box is checked, the username column of the output will be replaced by a string so that you cannot identify the user. However, since the same username will be replaced by the same string each time, you are still able to track a particular user’s tweets. Note that selecting the de-identify checkbox does not obscure usernames within the body text of tweets.
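One common way to get this kind of consistent replacement is to hash the username. The guide does not say how the Twitter Scraper generates its replacement strings, so the sketch below is an assumption: it uses a salted SHA-256 hash, and the function names and salt value are hypothetical.

```python
import hashlib


def pseudonym(username, salt="example-salt"):
    """Map a username to a stable, non-identifying string."""
    # Lower-casing first means 'Alice' and 'alice' get the same pseudonym.
    data = (salt + username.lower()).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:12]


def deidentify_row(username, body):
    """Replace only the username column; the tweet body is left untouched."""
    return (pseudonym(username), body)


if __name__ == "__main__":
    first = deidentify_row("alice", "Great talk by @bob today")
    second = deidentify_row("alice", "Another tweet")
    # Same user, same pseudonym, so the user's tweets can still be grouped.
    print(first[0] == second[0])
    # The mention of @bob in the body is not obscured.
    print(first[1])
```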
The period refers to the time interval between each time the Twitter Scraper writes the data out to a spreadsheet. The output works by filling a cached buffer of tweets as they emerge in the Twitter stream and are captured by the scraper. The buffer can fit a few dozen tweets. When the buffer fills, the content is appended to a spreadsheet. When the period ends and the next period begins (that is, the next hour if you select ‘hourly’), then a new spreadsheet will be commenced. If you select ‘none’ as the period, then every matching tweet will be written to the same spreadsheet. If your scrape settings are broad and capture a lot of tweets, the spreadsheet can blow out in size very quickly. Depending on how many tweets you expect to match your search settings, you may want to set the period to daily, or even hourly, to manage the size of the output spreadsheet.
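The interaction between the buffer and the period can be sketched as follows. This is a hypothetical illustration, not the scraper's code: the file naming scheme, the class name and the buffer capacity of 36 are all assumptions, and the sketch writes a full buffer to the spreadsheet for the period that is current at the moment the buffer fills.

```python
import csv
import os
from datetime import datetime

# Assumed naming scheme: one timestamp format per period setting.
PERIOD_FORMATS = {"hourly": "%Y-%m-%d_%H", "daily": "%Y-%m-%d", "none": None}


def spreadsheet_name(period, now):
    """Return the spreadsheet file name for the period containing 'now'."""
    fmt = PERIOD_FORMATS[period]
    if fmt is None:
        return "tweets.csv"  # 'none': everything goes to one spreadsheet
    return "tweets_" + now.strftime(fmt) + ".csv"


class BufferedWriter:
    """Collect rows in memory and append them to a CSV when the buffer fills."""

    def __init__(self, directory, period, capacity=36):
        self.directory = directory
        self.period = period
        self.capacity = capacity
        self.buffer = []

    def add(self, row, now):
        self.buffer.append(row)
        if len(self.buffer) >= self.capacity:
            self.flush(now)

    def flush(self, now):
        path = os.path.join(self.directory, spreadsheet_name(self.period, now))
        # Append mode: a new period simply starts a new file.
        with open(path, "a", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(self.buffer)
        self.buffer = []


if __name__ == "__main__":
    print(spreadsheet_name("hourly", datetime(2015, 4, 15, 0, 5)))
    print(spreadsheet_name("daily", datetime(2015, 4, 15, 0, 5)))
```

Nothing appears on disk until the buffer fills, which is why a narrow search can leave the spreadsheet empty for a long time.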
After you have configured the Twitter Scraper, you can deploy it by clicking ‘Deploy’. The process should only take around four or five minutes, after which you will be sent an email letting you know that your Twitter Scraper instance has been deployed. This email will also let you know how you can access the data.
While the Twitter Scraper is operational, it will continue to extract tweets from the Twitter Stream and, every time its buffer fills, write those tweets out to the spreadsheet. A link to the location of the spreadsheet will be included in the email that Launchpod sends you. It will simply be the IP address of the machine followed by /1. If you selected a time period, then you will see one spreadsheet corresponding to each period that has begun since the VM was deployed. If you log in immediately, then you will see one spreadsheet corresponding to the current time period. This is the spreadsheet that the Twitter Scraper will write the tweets to when the buffer fills. If you selected ‘hourly’ as the period, and log in several hours after the VM was deployed, you will see several spreadsheets.
Due to the way the Twitter Scraper is deployed, there is no easy way to see if it is working until tweets begin to appear in the spreadsheet, which, depending on the harvest settings, could take a long time. That is, if you specify very specific search terms, you might be waiting a long time for the buffer to fill and for the tweets to be written to the spreadsheet, and in the meantime you have no way of knowing whether the lack of output is due to an error or a lack of matching tweets. There is a log file which can be accessed by navigating to the IP of the machine (sent to your email) followed by ‘/logs’. If the scraper is working, this log file will display lines like:
2015-04-15 00:05:41,925 INFO (LoggerTweetProcessor.java:42) -  Processed 10 tweets
If the log file shows an error or Java exception instead, then you may have configured the machine incorrectly. You should delete the instance and try again, paying particular attention to the Twitter API Access Tokens and Consumer Keys.
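If you save a copy of the log, a short script can separate the healthy ‘Processed’ lines from anything that looks like a problem. The sketch below is a hypothetical helper, not part of the Twitter Scraper. The pattern for healthy lines is based on the sample line above, while the pattern for problems (any line containing ‘ERROR’ or ‘Exception’) is an assumption about what failures look like.

```python
import re

# Matches lines such as the sample 'Processed 10 tweets' line above.
PROCESSED = re.compile(r"INFO \(LoggerTweetProcessor\.java:\d+\) -\s+Processed (\d+) tweets")
# Assumption: problems show up as lines mentioning ERROR or an exception.
PROBLEM = re.compile(r"\bERROR\b|Exception")


def scan_log(lines):
    """Return the tweet counts reported and any suspicious lines."""
    counts = []
    problems = []
    for line in lines:
        match = PROCESSED.search(line)
        if match:
            counts.append(int(match.group(1)))
        elif PROBLEM.search(line):
            problems.append(line.rstrip())
    return counts, problems


if __name__ == "__main__":
    sample = [
        "2015-04-15 00:05:41,925 INFO (LoggerTweetProcessor.java:42) -  Processed 10 tweets",
    ]
    counts, problems = scan_log(sample)
    print("working" if counts and not problems else "check configuration")
```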
Are there any costs involved?
The Twitter Scraper application was built by Intersect and is open source. Launchpod is a service built by Intersect to deploy virtual machines using the NeCTAR Research Cloud. The Twitter Scraper, Launchpod and the NeCTAR Research Cloud are free to use for anyone with a username and password at an Australian organisation that subscribes to AAF. All Australian universities are subscribers to AAF.