WIP: Twint #69

Merged
lmeyerov merged 5 commits into master from twint on Jun 29, 2020

@bmorphism
Collaborator

This begins integrating Twint data into the dataframe format required for storing tweets in Neo4j.

A few key differences from Twarc and other details:

  • search proceeds by terms or over a user's timeline (with options for geofencing) via twint.run.Search()
  • user details must be fetched separately through calls to twint.run.Followers() and twint.run.Lookup()
  • a multi-term search is an AND and is not performed asynchronously, so multiple Prefect jobs will be required for multiple terms (plus a separate job to enrich profile details)
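As a rough sketch of the "one job per term" point above: the planning step below is plain Python, and `SearchJob`, `plan_jobs`, and `run_job` are hypothetical names, not part of twint or this PR. `run_job` assumes twint is installed and uses its `Config` / `run.Search` / pandas storage interface; how each job gets wrapped in a Prefect task is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SearchJob:
    """One unit of work: a single search term (hypothetical helper)."""
    term: str
    since: Optional[str] = None  # e.g. "2020-04-01 00:00:00"
    until: Optional[str] = None
    geo: Optional[str] = None    # e.g. "48.880048,2.385939,1km"


def plan_jobs(terms: List[str],
              since: Optional[str] = None,
              until: Optional[str] = None) -> List[SearchJob]:
    """Split terms into one job each, skipping blanks and case-insensitive duplicates."""
    seen = set()
    jobs = []
    for raw in terms:
        term = raw.strip()
        if term and term.lower() not in seen:
            seen.add(term.lower())
            jobs.append(SearchJob(term=term, since=since, until=until))
    return jobs


def run_job(job: SearchJob):
    """Run one search and return twint's tweets dataframe (needs twint installed)."""
    import twint  # imported lazily so planning works without twint

    c = twint.Config()
    c.Search = job.term
    if job.since:
        c.Since = job.since
    if job.until:
        c.Until = job.until
    if job.geo:
        c.Geo = job.geo
    c.Pandas = True       # keep results in memory as a dataframe
    c.Hide_output = True  # do not print every tweet to stdout
    twint.run.Search(c)
    return twint.storage.panda.Tweets_df
```

Each `SearchJob` could then be handed to its own scheduled task, with the profile-enrichment job kept separate as described above.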

@lmeyerov and I spent some time getting the fields from the Twint df to line up with what we were getting from Twarc, but work remains on integrating several additional fields (see https://github.com/TheDataRideAlongs/ProjectDomino/blob/twint/modules/Twint.py#L59).

Of note are: user_mentions, retweet_id, in_reply_to_status_id.
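To illustrate the kind of field alignment involved, here is a minimal sketch that works on plain dict rows. The rename map `TWINT_TO_TWARC` and the target names in it are assumptions for illustration only; the actual mapping lives in modules/Twint.py and may differ. The three pending fields are filled with None until they are integrated.

```python
from typing import Any, Dict

# Assumed rename map (Twint column -> Twarc-style name); illustrative only.
TWINT_TO_TWARC = {
    "id": "status_id",
    "tweet": "full_text",
    "username": "user_screen_name",
    "user_id": "user_id",
    "created_at": "created_at",
    "conversation_id": "conversation_id",
    "geo": "geo",
}

# Fields called out in this PR as not yet integrated.
PENDING_FIELDS = ("user_mentions", "retweet_id", "in_reply_to_status_id")


def normalize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Rename known columns, drop unknown ones, and null out pending fields."""
    out = {target: row.get(source) for source, target in TWINT_TO_TWARC.items()}
    for field in PENDING_FIELDS:
        out.setdefault(field, None)
    return out
```

Columns that are absent from a row come back as None rather than raising, which matches the need for nullable properties on the Neo4j side.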

Twint is designed to fetch tweets based on Since and Until timestamps (with granularity down to one second) and can operate as a streaming mechanism, whereas twarc can be kept for historic pulls by id.
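One way to use Since and Until for streaming-style polling is to walk a time range in fixed steps. The helper below is a sketch using only the standard library; `time_windows` is a hypothetical name, and the "YYYY-MM-DD HH:MM:SS" string format is an assumption about what the Since and Until options accept. Adjacent windows share a boundary timestamp, so de-duplicating by tweet id downstream is probably wise.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format for Since/Until


def time_windows(start: datetime, end: datetime,
                 step: timedelta) -> Iterator[Tuple[str, str]]:
    """Yield consecutive (since, until) string pairs covering [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        nxt = min(cursor + step, end)  # clamp the final window to `end`
        yield cursor.strftime(FMT), nxt.strftime(FMT)
        cursor = nxt
```

Each pair can be passed to a search job as its Since and Until values.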

bmorphism requested review from bechbd and lmeyerov on April 22, 2020 06:40
bmorphism self-assigned this on Apr 22, 2020
@lmeyerov
Contributor

lmeyerov commented Apr 22, 2020

Let's keep working on the branch till we're ready to merge

@bechbd We need some deltas to neo4j:

  • Tweet: Add props conversation_id, geo
  • Tweet: Allow null props retweet_id, maybe others

In addition, twint largely fails at grabbing user profile data. We're probably better off with a separate Prefect job that independently hydrates the user ids we found with a recent recorded_created_at and empty or partial profiles. We'll take a look at that in our next session.
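A sketch of the selection step for such a hydration job, assuming user records arrive as dicts with an `id`, a `recorded_created_at` datetime, and optional profile fields. The field list in `REQUIRED_PROFILE_FIELDS` and both function names are hypothetical; the actual fetch and the Neo4j query are out of scope here.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Iterable, List

# Hypothetical set of fields that make a profile "complete".
REQUIRED_PROFILE_FIELDS = ("name", "followers_count", "friends_count")


def needs_hydration(user: Dict[str, Any]) -> bool:
    """True when any required profile field is missing or None."""
    return any(user.get(field) is None for field in REQUIRED_PROFILE_FIELDS)


def select_users_to_hydrate(users: Iterable[Dict[str, Any]],
                            now: datetime,
                            max_age: timedelta) -> List[Any]:
    """Return ids of recently recorded users whose profiles are empty or partial."""
    cutoff = now - max_age
    return [
        user["id"]
        for user in users
        if user["recorded_created_at"] >= cutoff and needs_hydration(user)
    ]
```

The returned ids would then feed whichever profile lookup ends up being reliable enough.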

@bechbd
Collaborator

bechbd commented Apr 23, 2020

Added conversation_id and geo properties. Other properties already allow nulls

lmeyerov merged commit 4a84f63 into master on Jun 29, 2020
lmeyerov deleted the twint branch on June 29, 2020 19:15
@lmeyerov
Contributor

Merging, as this works for local (non-Neo4j push) use of twint.

4 participants