Skip to content

pmarcis/latvian-tweet-corpus

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

latvian-tweet-corpus

This repository is supposed to contain two things:

  • The Latvian Tweet Corpus - a collection of tweets that I am continuously collecting for various sentiment analysis and computational social science application purposes (mostly in Latvian). Due to the size of the corpus, I cannot share it openly, but if you will contact me directly with describing what you intend to do with it, we will definitely find a way how I can share the data with you.
  • Tools for tweet corpus collection. I have factored the Twitter Monitor, the solution that collects the tweets, out of my experimental projects that can be used to collect tweets from Twitter. So ... if you are interested in the Twitter Monitor, read further.

If you are using the Latvian Tweet Corpus or the Twitter Monitor and you happen to publish anything, please be so kind and reference the following paper:

@inproceedings{Pinnis2018,
 address = {Tartu, Estonia},
 author = {Pinnis, Mārcis},
 booktitle = {Human Language Technologies – The Baltic Perspective - Proceedings of the Seventh International Conference Baltic HLT 2018},
 doi = {10.3233/978-1-61499-912-6-112},
 keywords = {latvian,sentiment analysis,social networks,tweet corpus},
 pages = {112--119},
 publisher = {IOS Press},
 title = {{Latvian Tweet Corpus and Investigation of Sentiment Analysis for Latvian}},
 year = {2018}
}

Twitter Monitor

The Twitter Monitor is a console application that provides functionality for continuous monitoring of tweets from a pre-defined list of Twitter users and queries (both can be specified).

The Twitter Monitor adapts to the frequency of how users tweet, meaning that users/queries who/that produce tweets more frequently will be monitored more frequently than those who/that tweet less frequently.

Deployment

The GitHub repository contains a compiled version of the Twitter Monitor.

Just:

git clone https://github.com/pmarcis/latvian-tweet-corpus.git

The compiled version is in the CompiledVersion folder.

On Windows you will need the .NET Framework 4.5.2.

On Linux you will need mono (sudo apt-get install mono-complete).

Building from Source

For those who know what .NET and C# is: skip this part!

On Windows

Install Visual Studio (the Community Eddition is free...) and then:

git clone https://github.com/pmarcis/latvian-tweet-corpus.git -> open TwitterMonitor.sln -> select Build and Build Solution

The executable files will be in the TwitterMonitorConsole\bin\Release folder.

On Linux

You will need mono. To install mono, execute:

sudo apt-get install mono-complete

Then:

git clone https://github.com/pmarcis/latvian-tweet-corpus.git
cd latvian-tweet-corpus
msbuild /p:Configuration=Release TwitterMonitor.sln
cd TwitterMonitorConsole/bin/Release

Usage

To use the Twitter Monitor, you have to:

  • Acquire access details in order to be able to use the Twitter API. There are four strings you need to acquire - an access token and access token secret (for your application), and a consumer key and a consumer secret (so that your application can impersonate you) . For more details, head over to apps.twitter.com.

  • Create a monitoring object JSON file (a file that lists the Twitter users that you will monitor). There is an example here.

  • Create a query JSON file (a file that lists the queries that you will monitor). There is an example here. Note that it is advisable to list queries that are not too ambiguous, otherwise you will end up with a lot of irrelevant garbage in your data!

To launch the Twitter Monitor on Windows, execute:

.\TwitterMonitorConsole.exe -q query_example.json -mo monitoring_object_example.json -o tweets -si 50000 -st 6 -at [AccessToken] -ats [AccesstokenSecret] -ck [ConsumerKey] -cs [ConsumerSecret]

The following arguments are needed:

  • -q [QueryFile] - path of the query JSON file.
  • -mo [MonitoringObjectFile] - path of the monitoring object JSON file.
  • -o [OutPrefix] - prefix of the output files (Twitter Monitor will output 3 JSON files (tweets, queries, and monitoring objects) after X (specified by -si) collected tweets and the -o prefix specifies where to save the JSON files (the ending fill have a timestamp - so ... no worries about possibly overwriting existing data).
  • -si [SavingInterval] - the tweet interval that will be used to create JSON files (default: 50000) (optional). If this number is higher than 1000, there will be an initial set of JSONs created after the first 1000 collected tweets.
  • -st [SleepTime] - to obey Twitter's rules, a sleep time is used (the polite interval (and the default) is 6 seconds per request (Twitter may change its policies though ... so this is not carved in stone); make sure not to set this lower or you will risk getting banned from Twitter) (optional).
  • -at [AccessToken] - the access token for Twitter API.
  • -ats [AccessTokenSecret] - the access token secret for Twitter API.
  • -ck [ConsumerKey] - the consumer key for Twitter API.
  • -cs [ConsumerSecret] - the consumer secret for Twitter API.

To launch the Twitter Monitor on Linux, execute:

mono TwitterMonitorConsole.exe -q query_example.json -mo monitoring_object_example.json -o tweets -si 50000 -st 6 -at [AccessToken] -ats [AccesstokenSecret] -ck [ConsumerKey] -cs [ConsumerSecret]

The Twitter Monitor can be gracefully stopped by creating a file named stop in the folder where the executable file is located. This will trigger the creation of the last set of JSON files.

Integrated Libraries

The Tweet Monitor uses the following third party libraries:

About

The Latvian Tweet Corpus and Twitter Monitor

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages