Monthly Archives: May 2009

I shall be telling this with a sigh somewhere ages and ages hence

Warning: If you have a vested interest in what I do after I graduate (you know who you are; if you’re not sure if this applies to you, then it doesn’t) or you don’t have the stomach for the inherently impolitic nature of job-search aporia, you should probably stop reading.

As some of you know, I’ve been wrestling with some decisions over what I want to do once I graduate (in the near future, knock on wood). And I’ve had a really hard time coming to any sort of definitive conclusion. I know a lot of you also had/have the same decisions to make.

The first-order decision is whether to go the route of industry or academia. I’m really interested in having the freedom and resources to work on interesting problems; I’ve fielded arguments from both sides on whether industry or academia is better able to realize this. On the one hand, industry generally has a lot of resources (be that data or computers), and some places are engaging with problems which I think are of real research interest. But in the end you have to sing for your supper, so your eye is always towards some product (vague though it may be).

Academia, on the other hand, is relatively unfettered. It is true, however, that you are subject to the whims of inconstant moons: grants, committees, etc. You’ve got to hustle, and sometimes that hustle may entail compromise on the research front. Academics also might have fewer resources on average, but they’re no slouches either. On balance, academia provides an environment in which the talismans of research, publications, are given priority. Whether these count as progress is up for debate, I suppose, but there’s no denying that there’s lots of interesting stuff in those pages.

The second-order decisions are about where one should go within academia or industry. What should one look for in an academic position? In an industry position? A smart boss? A famous boss? Smart peers? Smart underlings? Other interactions with academia/industry? A secure role? A flexible role? Other? All of the above? None of the above?

Finally, while it may seem from the above discussion that options can only cause headache, I still think options are good. So which track provides the most opportunities for change in the future? I’ve heard it argued that one should just go to academia; you can always leave. On the other hand, it’s also been argued that Silicon Valley changes much more quickly than the lumbering giant that is academia; you should take those opportunities when they appear. If I go to academia, will I wall myself off in an ivory tower? If I go to industry, will academia then shun me?

So, do you have advice/musings for me? Feel free to comment below, or email me if it’s personal. And contact me if you want to take a more specific survey about these questions =).


Update:

I did an informal poll of some people I know, and they came out on the side of going into academia. Out of curiosity, I broke down the results according to whether the respondent was currently in academia or industry (there’s some fuzziness in how I categorized people: e.g., grad students were counted as academia). The result of this is here. As you can see, if I were to query only industry people, then industry would win out on my poll. People currently in academia, on the other hand, are overwhelmingly tilted towards academia. Go figure.

Update 2:

Edo pointed me to this helpful link.


1 Comment

Filed under Uncategorized

Fun with names

In line with a previous post, I decided to have some more fun with names.

I used the census names data to generate 200 names by taking a random first name and a random last name and combining them. The first names were chosen 50/50 from the male and female lists. All the random choices were done uniformly so that I’d get “interesting” names instead of a bunch of Dave Joneses.
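For the curious, here’s roughly how that sampling might look in Python. The name lists below are tiny stand-ins for the full census files (which also carry frequency columns, deliberately ignored here so that every name is equally likely); the function and variable names are my own, not from the original script.

```python
import random

# Tiny stand-ins for the census name lists; the real files are much
# longer and include frequency data, which is ignored to keep draws uniform.
male_first = ["AARON", "KURT", "RAMON", "FREDDY", "RICH"]
female_first = ["JEANIE", "WHITNEY", "CECILLE", "BONNIE", "LAUREEN"]
last_names = ["ROLLIN", "OKANO", "LEGAT", "WURZ", "HODGEN"]

def random_name(rng):
    """Pick a first name 50/50 from the male/female lists, then a
    uniformly random last name, and title-case the result."""
    first_list = male_first if rng.random() < 0.5 else female_first
    return f"{rng.choice(first_list).title()} {rng.choice(last_names).title()}"

rng = random.Random(0)
names = [random_name(rng) for _ in range(200)]
```

Because every list entry is equally likely, rare names show up as often as common ones, which is exactly what keeps the output “interesting.”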

For each of the names I asked five mechanical turkers to judge the person to whom the name belongs on six axes:

  • Hero or villain?
  • Trustworthy or untrustworthy?
  • Attractive or unattractive?
  • Sophisticated or naive?
  • Powerful or weak?
  • Outgoing or reserved?
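With five judgments per name on each axis, aggregation can be as simple as a vote fraction. Here’s a sketch with made-up judgment data; the axis keys and vote labels are illustrative, not the actual survey’s encoding.

```python
from collections import Counter

# Made-up raw data: five workers' votes per (name, axis) pair.
judgments = {
    ("Kurt Rollin", "hero_or_villain"): ["hero", "hero", "hero", "villain", "hero"],
    ("Toney Prus", "hero_or_villain"): ["villain", "villain", "hero", "villain", "villain"],
}

def score(votes, positive):
    """Fraction of workers choosing the 'positive' pole of an axis."""
    return Counter(votes)[positive] / len(votes)

hero_scores = {
    name: score(votes, "hero")
    for (name, axis), votes in judgments.items()
    if axis == "hero_or_villain"
}
most_heroic = max(hero_scores, key=hero_scores.get)
```

The “most X” winners below would then just be the argmax (or argmin) of these per-name scores on each axis.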

There were a few methodological issues that I need to sort out at some point. And I’m going to keep the full results secret for a little longer, but I thought I’d share a few quick aggregate results:

  • Most heroic – Kurt Rollin
  • Most villainous – Toney Prus
  • Most outgoing – Jeanie Bainard
  • Most reserved – Jesusita Fondaw
  • Most trustworthy – Laureen Hodgen
  • Most untrustworthy – Robbie Holck
  • Most sophisticated – Cecille Legat
  • Most naive – Ramon Smialowski
  • Most powerful – Rich Keaty
  • Weakest – Bonnie Wurz
  • Most attractive – Whitney Okano
  • Least attractive – Freddy Doughtery

How do you guys and gals read these names (and other ones)? Let me know in the comments!


Filed under Uncategorized

Word counts vs. word presence for LDA

Every now and then someone (you know who you are) asks whether the feature vectors one passes into LDA should be vectors of word counts (i.e., vectors of non-negative integers) or vectors of word presence/absence (i.e., vectors of binary values). The former carries strictly more information, so the short answer is that you should always use word counts when available. But in my experience the difference is smaller than you might think.
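In case the distinction is unclear, here’s a minimal numpy sketch of the two representations (the toy document-term matrix is made up):

```python
import numpy as np

# Toy document-term matrix: rows are documents, columns are vocabulary words.
counts = np.array([[3, 0, 1, 0],
                   [0, 2, 2, 5]])

# The presence/absence representation simply clips counts to {0, 1}.
presence = (counts > 0).astype(int)
```

Binarization throws away how often a word occurs, keeping only whether it occurs at all.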

To that end, I decided to put together a little side-by-side comparison. Today’s corpus is Cora abstracts (with stopword and infrequent-word filtering). I’m running everything through stock LDA-C with alpha set to 1.0 and K set to 10. Let me preface what follows with the warning that your mileage may vary (and if it does, let me know!).
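The experiment itself used LDA-C, which isn’t reproduced here; as a hedged stand-in, the setup can be sketched with scikit-learn’s `LatentDirichletAllocation` (also a variational implementation) on toy random data, with `doc_topic_prior` playing the role of LDA-C’s alpha. The matrix below is synthetic, not the Cora corpus.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic document-term matrix standing in for the filtered Cora corpus.
rng = np.random.default_rng(0)
X_counts = rng.poisson(0.2, size=(100, 50))
X_binary = (X_counts > 0).astype(int)

def fit_lda(X, k=10, alpha=1.0):
    # doc_topic_prior plays the role of LDA-C's alpha; K is n_components.
    lda = LatentDirichletAllocation(
        n_components=k, doc_topic_prior=alpha, random_state=0)
    lda.fit(X)
    return lda

# Fit the same model to both feature representations.
lda_counts = fit_lda(X_counts)
lda_binary = fit_lda(X_binary)
```

Both fits yield a topics-by-words matrix (`components_`), which is what the topic tables and entropy plots below are computed from.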

First up, let’s look at the topics (specifically, the top 10 words in each topic) produced for the two feature representations. No earth-shattering differences here; absent an objective way to measure the quality of a topic, I’d be hard-pressed to say that one representation produced better results than the other.

  • Word counts

    Topic 1          Topic 2          Topic 3          Topic 4          Topic 5
    learning         problem          performance      model            research
    control          problems         results          models           report
    state            search           classification   data             part
    reinforcement    method           paper            bayesian         technical
    paper            selection        methods          analysis         paper
    dynamic          algorithms       data             probability      grant
    system           solution         method           markov           university
    systems          space            classifier       time             supported
    simulation       optimization     parallel         distribution     science
    robot            test             application      methods          artificial

    Topic 6          Topic 7          Topic 8          Topic 9          Topic 10
    network          learning         algorithm        genetic          knowledge
    neural           decision         error            model            system
    networks         paper            number           evolutionary     design
    input            examples         show             fitness          reasoning
    training         features         algorithms       visual           case
    weights          algorithms       class            results          theory
    recurrent        algorithm        learning         population       paper
    hidden           rules            results          evolution        systems
    trained          approach         function         crossover        cases
    output           tree             model            strategies       approach
  • Word presence

    Topic 1          Topic 2          Topic 3          Topic 4          Topic 5
    system           research         performance      learning         algorithms
    behavior         report           paper            data             genetic
    systems          abstract         approach         present          search
    complex          technical        parallel         classification   algorithm
    design           part             proposed         algorithm        problem
    model            paper            implementation   show             results
    development      university       results          training         optimal
    paper            science          level            accuracy         show
    computational    supported        memory           results          problems
    environment      computer         multiple         decision         find

    Topic 6          Topic 7          Topic 8          Topic 9          Topic 10
    method           bayesian         neural           function         learning
    methods          reasoning        network          distribution     paper
    problem          models           networks         algorithm        learn
    applied          model            input            general          problem
    problems         cases            learning         class            system
    time             case             model            model            reinforcement
    demonstrate      paper            hidden           linear           approach
    technique        framework        training         functions        knowledge
    large            markov           trained          show             learned
    learning         knowledge        weights          results          programming

Next, let’s look at the entropy of the topic multinomials to get an idea of the general shape of these topic distributions. Here I’ve computed the topic entropies for both sides of the comparison and then sorted them by entropy (silly exchangeability!). Finally, I’m showing the scatter plot of the entropy of the word-counts topics vs. the entropy of the word-presence topics; the blue line runs along the diagonal. The differences are not too phenomenal: as expected, binary features mean higher entropy, since binarization amounts to a smaller number of observations with which to overcome smoothing. But in the end these are really just 3–5% differences.
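For reference, the entropy computation is straightforward. A small numpy sketch (the toy topic matrix below is invented, not the fitted Cora topics):

```python
import numpy as np

def topic_entropy(topic_word):
    """Entropy (in nats) of each row of a topics-by-words matrix."""
    p = topic_word / topic_word.sum(axis=1, keepdims=True)
    # np.where guards against log(0) for words with zero probability.
    return -(p * np.log(np.where(p > 0, p, 1.0))).sum(axis=1)

# Toy topics: one peaked, one uniform. A uniform topic attains the
# maximum possible entropy, log(vocabulary size).
topics = np.array([[0.90, 0.05, 0.03, 0.02],
                   [0.25, 0.25, 0.25, 0.25]])
ent = np.sort(topic_entropy(topics))  # sort to sidestep exchangeability
```

Sorting before comparing the two runs is what makes the scatter plot meaningful despite topic indices being arbitrary.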

Next up, let’s look at convergence of the likelihood bound (these aren’t strictly comparable in any meaningful way since of course the likelihood is computed over two different data representations). Here I’m showing the variational bound on the log likelihood as a function of iteration. In black, word counts are used as features; in red, strictly binary features are used. Because the likelihood scales are different, I’ve rescaled the two curves so that they’re approximately equal (the two axes reflect the two original scales). As with the other evaluations, the shape of the two curves is very similar.

Finally, let’s look at the entropy of the per-document topic proportions to get an idea of how topics are assigned in each case. As with the previous scatter plot, this plot shows the entropy of each document’s topic proportions under word counts vs. word presence. As before, less data in the binary-feature case generally means higher entropy. But the differences are more notable here: while for many documents the differences are within 10%, for some they are as large as 100%. This is most likely because in some cases (e.g., when the number of distinct words in a document is small), binarization changes the character and amount of data in the document by quite a lot.

I think it is in these corner cases that using full word-count data is likely to be most useful. But overall the differences are not that great and not worth expending too many grey cells over.
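To make the corner-case point concrete, here’s a toy sketch of the per-document entropy comparison; the proportions below are invented for illustration, not taken from the fitted models.

```python
import numpy as np

def entropy(p):
    """Entropy (in nats) of a probability vector, normalizing defensively."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -(p * np.log(np.where(p > 0, p, 1.0))).sum()

# Invented topic proportions for one short document under the two
# representations: with fewer effective observations, the binary
# model's proportions stay flatter (closer to the prior).
theta_counts = [0.70, 0.10, 0.10, 0.10]
theta_binary = [0.40, 0.20, 0.20, 0.20]

rel_diff = (entropy(theta_binary) - entropy(theta_counts)) / entropy(theta_counts)
```

For a document like this, the relative entropy gap is already tens of percent, which is the kind of outlier that shows up off the diagonal in the scatter plot.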


Filed under Uncategorized