ePluribus: Ethnicity on Social Networks

is the name of the paper I wrote with Lars, Itamar, and Cameron. It will appear at this year’s ICWSM. You may commence bating those breaths.



14 responses to “ePluribus: Ethnicity on Social Networks”

  1. Delip Rao

    I’m really excited about the ethnicity data you collected from MySpace for this paper and would like to use it for reproducing your results. Can you please share the data?

    • Hi, sorry, but I can’t share the data. It shouldn’t be too hard to write some code to get some data yourself, though =).

      • Delip

        Thanks, but variability in the crawling process and the possibility of profiles being deleted mean that a new crawl will not yield the same dataset, so the results will not be exactly comparable. While there could be legal issues in hosting this data and providing a download link, I don’t see any reason not to provide back-channel access to the evaluation data for the sake of science.

  2. Tommy Nguyen

    Do you mind sharing the code for the probabilistic/Bayesian approach to estimating the distribution of ethnicities of a population given only their first and last names?


  3. Tommy Nguyen

    We are a team of social scientists, and we are studying a similar problem.

    I’m generating a list of questions that I thought you could address (via email). I’m having a hard time decoding the notation for the algorithm that the paper uses for the generative process.

    In the meantime, could you provide some pointers?

  4. Tommy Nguyen

    To be honest, I have no background in variational inference and LDA, but I finally got a chance to install the lda package. It is installed, but it is missing some additional packages that the demos need to run.

    I have spent a few hours searching the literature but couldn’t find anything, so I thought you could address the issue…

    Are there any algorithms/programs/papers on detecting educational level based on sentence analysis? If not, do you think making simple changes to LDA could make it work?

    Any help is really appreciated.

    • If you’re interested, Latent Dirichlet allocation by Blei et al. (2003) in JMLR is a good paper to read to get up to speed on LDA and variational inference.

      As far as I know, no one is using LDA for analyzing education level. You could always use the standard reading-level metrics (GNU diction implements these). I think Thomas Landauer has also done some stuff with pLSI for education, but I could be wrong.
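
      For anyone who wants to see what one of those standard metrics looks like in code, here is a minimal sketch of the Automated Readability Index. The function name, the tokenizing regexes, and the sentence-splitting rule are assumptions made for illustration; this is not how GNU diction does it, and real tools tokenize more carefully.

```python
import re


def automated_readability_index(text: str) -> float:
    """Rough ARI: 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43.

    Tokenization here is a simplifying assumption: sentences end at runs of
    '.', '!' or '?', and words are runs of letters, digits, and apostrophes.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    # Count only letters and digits as characters, not apostrophes.
    chars = sum(len(w.replace("'", "")) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


if __name__ == "__main__":
    print(automated_readability_index("The cat sat. The dog ran."))
```

      The score depends only on average word length and average sentence length, so very short or unconventionally punctuated text gives it little to work with.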

  5. Tommy Nguyen

    Thanks for the great tips!

    I’m still trying to digest the LDA paper by Blei et al., but I have a better understanding of it now.

    Could you provide the changes you made to your implementation of LDA in R?
    And maybe the commands that you used to run it?

    • Cool, sounds like you’re on track! Another good paper to read to get a better sense of more modern approaches to inference is “On Smoothing and Inference for Topic Models” by Asuncion et al.

      I’m in the process of migrating all my topic modeling code here: https://r-forge.r-project.org/projects/rtm/ It’s faster and much cleaner. I can make you a contributor if you want and we can start hacking at what you want. I think the only major thing that we need to implement is a way to initialize the topic counts and then you should be able to put together a few lines of R to execute what you want.

  6. Tommy Nguyen

    That sounds great! I would like to contribute to the project, since I’m probably going to need to use the results from ePluribus and your implementation. Just keep in mind that I have absolutely no background in NLP or data mining, so it will probably take some time for me to catch up.

    I’m going to put together a list of questions relating to the ePluribus paper and your implementation soon.

    If you don’t mind, I’d like to pick your brain on a different topic. I’m currently having trouble predicting educational level from tweets using the conventional methods (Kincaid, ARI, Coleman, FOG, SMOG, etc.). There are two main challenges: tweets are around 140 characters long, and Tweeters don’t use conventional spelling or grammar.

    How would you get around this? I thought about adjusting the parameters of the conventional methods to fit the nature of the tweets, but I’m not sure if my advisor would approve of this idea.

  7. Aykut Firat

    Jonathan, we replicated your ethnicity study using Mallet, but we also want to do the whole thing in R. Could you please let us know whether you have the modifications that would allow us to provide the beta parameter externally (and thus turn off training)?
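
    To make the "fixed beta" idea concrete, here is a minimal sketch in Python rather than R. It is not the paper's model and not the API of the lda package or Mallet. It assumes a plain mixture model in which each group's distribution over names (beta) is supplied externally and held fixed, and only the population-level mixing proportions are estimated by EM. The function name `estimate_proportions` and its arguments are hypothetical.

```python
import numpy as np


def estimate_proportions(counts, beta, n_iter=200, tol=1e-10):
    """Estimate mixing proportions theta with the name distributions held fixed.

    counts: length-V array, how often each name was observed in the population.
    beta:   K x V array, beta[k, v] = P(name v | group k); supplied, never updated.
    """
    counts = np.asarray(counts, dtype=float)
    beta = np.asarray(beta, dtype=float)
    if counts.sum() <= 0:
        raise ValueError("counts must contain at least one observation")
    n_groups = beta.shape[0]
    theta = np.full(n_groups, 1.0 / n_groups)  # start from uniform proportions
    for _ in range(n_iter):
        # E-step: posterior probability of each group for each name.
        joint = theta[:, None] * beta
        denom = joint.sum(axis=0)
        resp = np.divide(joint, denom, out=np.zeros_like(joint), where=denom > 0)
        # M-step: re-estimate the proportions only; beta is left untouched.
        new_theta = (resp * counts).sum(axis=1)
        new_theta /= new_theta.sum()
        converged = np.abs(new_theta - theta).max() < tol
        theta = new_theta
        if converged:
            break
    return theta


if __name__ == "__main__":
    # Toy example: two groups, two names, made-up numbers.
    toy_beta = [[0.8, 0.2], [0.2, 0.8]]
    print(estimate_proportions([50, 50], toy_beta))
```

    Because beta never changes, the loop has no training step for the name distributions; it only infers the proportions for the population at hand.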
