March 22, 2010 · 2:06 am

ePluribus: Ethnicity on Social Networks

is the name of the paper I wrote with Lars, Itamar, and Cameron. It will appear at this year’s ICWSM. You may commence bating those breaths.

14 responses to “ePluribus: Ethnicity on Social Networks”

Delip Rao

February 9, 2011 at 1:54 pm

I’m really excited about the ethnicity data you collected from MySpace for this paper and would like to use it for reproducing your results. Can you please share the data?

- slycoder
  
  February 9, 2011 at 6:17 pm
  
  Hi, sorry, I can’t share the data, but it shouldn’t be too hard to write some code to collect some yourself =).
  
  - Delip
    
    February 21, 2011 at 8:52 pm
    
    Thanks, but variability in the crawling process and the possibility of deleted profiles mean that a fresh crawl will not yield the same dataset, so the results will not be exactly comparable. While there could be legal issues in hosting this data and providing a download link, I don’t see any reason not to provide back-channel access to the evaluation data for the sake of science.

Tommy Nguyen

February 13, 2011 at 10:44 pm

Do you mind sharing the code for the probabilistic/Bayesian approach to estimating the distribution of ethnicities of a population given only their first and last names?

Thanks

- slycoder
  
  February 14, 2011 at 1:27 am
  
  It’s not really in distributable shape right now. Fortunately, it’s a simple modification of LDA. You can download an implementation of LDA (such as my R package @ http://cran.r-project.org/web/packages/lda/ or a more experimental version @ https://r-forge.r-project.org/projects/rtm/) and make some simple modifications. I can give you some pointers if you’re interested.
  
Tommy Nguyen

February 14, 2011 at 1:34 am

We are a team of social scientists, and we are studying a similar problem.

I’m generating a list of questions that I thought you could address (via email). I’m having a hard time decoding the notation for the algorithm that the paper uses for the generative process.

In the meantime, could you provide some pointers?

- slycoder
  
  February 14, 2011 at 1:36 am
  
  Sure thing. How familiar are you with variational inference and LDA?
  
Tommy Nguyen

February 21, 2011 at 8:43 pm

To be honest, I have no background in variational inference or LDA, but I finally got a chance to install the lda package. It installed, but some additional packages that the demos need are missing.

I have spent a few hours searching the literature but couldn’t find anything, so I thought you might be able to help…

Are there any algorithms/programs/papers on detecting educational level based on sentence analysis? If not, do you think making simple changes to LDA could make it work?

Any help is really appreciated.

- slycoder
  
  February 21, 2011 at 10:13 pm
  
  If you’re interested, “Latent Dirichlet Allocation” by Blei et al. (2003) in JMLR is a good paper to read to get up to speed on LDA and variational inference.
  
  As far as I know, no one is using LDA to analyze education level. You could always use the standard reading-level metrics (GNU diction implements these). I think Thomas Landauer has also done some stuff with pLSI for education but I could be wrong.
  
Tommy Nguyen

March 1, 2011 at 6:00 pm

Thanks for the great tips!

I’m still trying to digest the LDA paper by Blei et al., but I have a better understanding of it now.

Could you provide the changes you made to your implementation of LDA in R?
And maybe the commands that you used to run it?

- slycoder
  
  March 2, 2011 at 1:54 am
  
  Cool, sounds like you’re on track! Another good paper to read to get a better sense of more modern approaches to inference is “On Smoothing and Inference for Topic Models” by Asuncion et al.
  
  I’m in the process of migrating all my topic modeling code here: https://r-forge.r-project.org/projects/rtm/ It’s faster and much cleaner. I can make you a contributor if you want and we can start hacking at what you want. I think the only major thing that we need to implement is a way to initialize the topic counts and then you should be able to put together a few lines of R to execute what you want.
  
Tommy Nguyen

March 3, 2011 at 10:34 pm

That sounds great! I would like to contribute to the project, since I’m probably going to need to use the results from ePluribus and your implementation. Just keep in mind that I have absolutely no background in NLP or data mining, so it will probably take some time for me to catch up.

I’m going to put together a list of questions relating to the ePluribus paper and your implementation soon.

If you don’t mind, I’d like to pick your brain on a different topic. I’m currently having trouble predicting educational level from tweets using the conventional methods (Kincaid, ARI, Coleman, FOG, SMOG, etc.). There are two main challenges: tweets are around 140 characters long, and Tweeters don’t use conventional spelling or grammar.

How would you go about this? I thought about adjusting the parameters of the conventional methods to fit the nature of tweets, but I’m not sure my advisor would approve of this idea.
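
For reference, one of the conventional metrics listed above, the Automated Readability Index (ARI), can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper or from GNU diction: the helper names and the regex-based word and sentence splitting are assumptions, and real tweets (hashtags, URLs, missing punctuation) would need a more careful tokenizer.

```python
import re


def text_counts(text):
    """Return (characters, words, sentences) using simple regex heuristics.

    Hypothetical helper: words are runs of letters, digits, and apostrophes;
    sentences are counted as runs of '.', '!' or '?', with a minimum of one.
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    chars = sum(ch.isalnum() for word in words for ch in word)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return chars, len(words), sentences


def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    chars, words, sentences = text_counts(text)
    if words == 0:
        raise ValueError("text contains no words")
    return 4.71 * chars / words + 0.5 * words / sentences - 21.43


if __name__ == "__main__":
    # A made-up, tweet-like string with unconventional spelling.
    tweet = "omg cant wait 4 the game 2nite!!! gonna be gr8"
    print(round(automated_readability_index(tweet), 2))
```

Because a tweet usually holds only one or two sentences and a dozen or so words, both ratios in the formula are computed from very small counts, which is one way to see why scores on single tweets tend to be noisy.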

Aykut Firat

October 10, 2012 at 5:22 pm

Jonathan, we replicated your ethnicity study using Mallet, but we also want to do the whole thing in R. Could you please let us know whether you have the modifications that would allow us to provide the beta parameter externally (and thus turn off training)?
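
The idea of supplying beta externally and skipping training can be sketched as follows. This is a generic Python illustration of Gibbs sampling with the topic-word distributions held fixed, not the API of the R lda package or of Mallet; the function name `infer_doc_topics`, its parameters, and the toy numbers are all hypothetical.

```python
import random


def infer_doc_topics(doc, topic_word, alpha=0.1, iterations=200, seed=0):
    """Sample topic assignments for one document while topics stay fixed.

    doc        -- list of word ids
    topic_word -- topic_word[k][w] = P(word w | topic k), supplied externally
                  and never updated (this is the "training off" part)
    Returns smoothed topic proportions computed from the final sample.
    """
    rng = random.Random(seed)
    num_topics = len(topic_word)

    # Random initial assignment of each token to a topic.
    assignments = [rng.randrange(num_topics) for _ in doc]
    counts = [0] * num_topics
    for z in assignments:
        counts[z] += 1

    for _ in range(iterations):
        for i, w in enumerate(doc):
            # Remove the token, then resample its topic given the others.
            counts[assignments[i]] -= 1
            weights = [
                (counts[k] + alpha) * topic_word[k][w]
                for k in range(num_topics)
            ]
            z = rng.choices(range(num_topics), weights=weights)[0]
            assignments[i] = z
            counts[z] += 1

    total = len(doc) + num_topics * alpha
    return [(counts[k] + alpha) / total for k in range(num_topics)]


if __name__ == "__main__":
    # Toy beta: topic 0 favors words 0 and 1, topic 1 favors words 2 and 3.
    beta = [
        [0.45, 0.45, 0.05, 0.05],
        [0.05, 0.05, 0.45, 0.45],
    ]
    print(infer_doc_topics([0, 1, 0, 1, 0, 0, 1, 1], beta))
```

Only the per-document topic counts change during sampling; the externally supplied beta is read but never written, so the same fixed topics can be applied to any number of new documents.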

- slycoder
  
  October 15, 2012 at 2:06 am
  
  Hi, the current version of the LDA package has a parameter, freeze.topics, which might do what you want.
  