Monday, April 19, 2010

One graph to conquer it all

I received a link from a friend of mine to a very interesting challenge:

Graph Theory II: CONTEST for Geekgold

It's an interesting contest in which board games were reduced to graphs, and you have to identify each board game just by looking at its graph.

This reminded me of an old discussion I had at work. But before I get there I have to remind people that my Ph.D. research was on graph-structured databases, so I have used a lot of graphs in my life. I don't even know how many different graph "frameworks" I have implemented, including a very restricted but highly efficient graph database. I like graphs, but I also learned their limitations.

Back to my story: my old manager once had a vision of how to solve every problem. He thought that if we could build a giant graph recording everything and how everything related to everything else, we could solve any problem. We had many meetings discussing this vision, which was never implemented, because nobody really believed in it besides that manager.

This vision had two different problems:

The first one, which is the easiest to explain, is scale. It's very hard to build something that allows arbitrary connectivity and queries of arbitrary length. Add to this the need to run on a large fleet of small, "unreliable" machines, with the redundancy and robustness to failure that implies, and keeping any reasonable level of performance becomes very hard. I have personal experience with this: the graph-structured database I implemented could only achieve meaningful performance by (1) pre-calculating all the "joins" and (2) making them read-only.
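A minimal sketch of those two tricks, using made-up data and hypothetical names: the "joins" (here, one- and two-hop neighbor sets) are materialized once at build time, so queries become plain dictionary lookups against a structure that is never mutated afterwards.

```python
from collections import defaultdict

# Hypothetical read-only graph: the edge list is fixed at build time.
EDGES = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "e")]

# (1) Pre-calculate the "joins": materialize one- and two-hop
# neighbor sets up front, instead of traversing at query time.
one_hop = defaultdict(set)
for src, dst in EDGES:
    one_hop[src].add(dst)

two_hop = defaultdict(set)
for src, mids in list(one_hop.items()):
    for mid in mids:
        two_hop[src] |= one_hop.get(mid, set())

# (2) Read-only at query time: no locks, no cache invalidation,
# just set lookups and unions over precomputed data.
def neighbors_within_two(node):
    return one_hop.get(node, set()) | two_hop.get(node, set())

print(neighbors_within_two("a"))  # nodes reachable in <= 2 hops: b, c, d
```

The obvious cost is that any update to the edge list forces a rebuild of the precomputed tables, which is exactly the read-only restriction described above.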

The second one is harder to explain without concrete examples, but it's related to the curse of dimensionality: if you add too much to your graph, soon you can't conclude anything from it, because a little noise in many dimensions will overwhelm all your signal. Asserted in the abstract like that, it convinces nobody, so consider the current state of the art. New techniques for large, high-dimensional datasets are increasingly successful at identifying "low-hanging fruit" at scale, i.e. whatever has a very large signal compared to the rest of the noise. With enough data, these algorithms scale well enough that you can run them over everything and find many different "high signal" patterns. It's not that the filters are getting better; we have simply gotten better at looking at a lot of data at once.

Anyway, going back to the board game graph puzzle, it's an interesting challenge. I don't know board games well enough to recognize almost any of them, but I had fun just trying to decipher the graphs and relate them to possible games. Enjoy!

Sunday, April 04, 2010

The danger of Twitter sentiment analysis

So I was reading an article from TechCrunch entitled "Sentiment Is Split On The iPad: People Either Love It, Or Hate Others For Not Shutting Up About It". The subject was funny so I decided to read it. But, when I started to look at their sources I realized they were pretty much just using TweetFeel which is a service that tries to do sentiment analysis on the tiny twitter messages flowing around with the given keyword.

Sentiment analysis is a very hot topic lately and there are lots of interesting results coming out of it. However, it doesn't work that well, because most methods are based on keywords around the concept you are looking for, and language is not very good at being locally unambiguous. Twitter changes the picture in two ways:

- On the positive side, people can't write much, so they put their sentiment right next to the topic rather than in a faraway phrase.
- On the negative side, there is only so much context that can be extracted from 140 characters.
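A toy version of the keyword approach makes the failure mode easy to see. The word lists and scoring below are made up for illustration; I have no idea what TweetFeel actually uses internally.

```python
# Hypothetical keyword lists; a real system would be far larger.
NEGATIVE = {"wtf", "hate", "broken", "sucks"}
POSITIVE = {"love", "great", "awesome"}

def keyword_sentiment(tweet):
    """Classify a tweet by counting positive vs. negative keywords,
    ignoring all context -- the core weakness discussed above."""
    words = {w.strip("#!?.,").lower() for w in tweet.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# A spam tweet gets counted as a negative opinion about the iPad,
# purely because it happens to contain "WTF":
print(keyword_sentiment("Free ipad WTF #thefeelingyouget 5lko"))
```

Nothing in the classifier knows that the tweet is spam, or that "WTF" here is not an opinion about the product at all.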

So I decided to use TweetFeel to see what data they were using. They made some references in the article, but I wasn't sold. TweetFeel is quite interesting: it keeps streaming the references to the keyword you enter (in this case "ipad") and highlights it in green if it's considered good and red if it's bad. It also keeps count of good and bad references.

After letting it run for a minute or so I was seeing about the same thing that the TechCrunch article mentioned:

Negative: 37 (52%)
Positive: 34 (48%)

But when I started looking at what was considered positive and negative, I started seeing some very interesting tweets (I don't recommend people clicking on the links):

Free ipad WTF Risen #thefeelingyouget #TLS 5lko
Free ipad WTF #OMGThatsSoTrue Feliz Páscoa #TheFeelingYouGet #HappyBdayKoba j59y

In other words, lots of spam sources were using common keywords to get people to click on their links, hoping the links had something to do with the keywords being spammed. Moreover, because those tweets had the word "WTF", I'm guessing TweetFeel considered them negative. My 1-minute sample is not significant, but if I remove all those spam tweets (11 of them, all counted as negative), here is the new count:

Negative: 26 (43%)
Positive: 34 (57%)

(as I said, this is not statistically significant, so don't take these numbers too seriously)
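The recount is just the original totals with the spam tweets (all of which had been tagged negative) taken out; a quick check of the arithmetic:

```python
negative, positive = 37, 34  # TweetFeel's raw 1-minute counts
spam = 11                    # spam tweets, all tagged negative

negative -= spam
total = negative + positive
print(negative, round(100 * negative / total))  # 26 43
print(positive, round(100 * positive / total))  # 34 57
```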

Anyway, now that I'm talking about the iPad, one might wonder whether I'm planning on buying one. The answer right now is "no". Looking at how I access and interact with the information I want, I don't really see a gap there worth $500. There are some apps I really wish I could try without owning one, even just to play around with for a while (like the Marvel app for reading comics). I just hope the trend is not for everything to migrate exclusively to the iPad framework, without also having a web- or other computer-based way of accessing it.

Saturday, April 03, 2010

Amazon still invaded by medical doctors?

A long time ago, Amazon suddenly started recommending a lot of medical products to me: anatomy books, drug dictionaries, etc. Looking at why I was recommended them (which is one of the best features for curious people like me), it was all because I had bought a Palm Pilot (yes, this was a long time ago). Apparently Palm Pilots were very popular among doctors because they had some very useful apps for keeping track of patients, doing quick calculations, and looking up drug references.

The interesting thing is that, if you think through the statistics of it (and I ask people who claim a statistics background to work through this in interviews all the time), it makes sense that you will see something like this with a biased population. Say that 20% of the customers for product A are medical doctors and the rest are a random scattering of other types of people. If, just to keep it simple, 50% of all medical doctors buy product B, then 10% of everybody who bought product A will also have bought product B.
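The number in that example is just the product of the two rates, 20% × 50% = 10%, which a short simulation (with the example's rates, and assuming non-doctors essentially never buy product B) also confirms:

```python
import random

random.seed(1)

# Assumed rates from the example: 20% of product-A buyers are
# doctors, and 50% of doctors also buy product B.
N = 100_000
bought_b = 0
for _ in range(N):
    is_doctor = random.random() < 0.20
    if is_doctor and random.random() < 0.50:
        bought_b += 1

print(bought_b / N)  # close to 0.10, i.e. 20% of 50%
```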

The reason medical doctors are such an interesting category is that I'm not aware of any other category of people whose purchases cluster so strongly around specific products. Engineers don't all buy similar books. Neither do graphic designers. Maybe lawyers do, but I haven't seen any evidence of it.

Anyway, why was this brought to my attention? Well, it's because I added something to my Amazon wedding registry and I received the following recommendation:

A stethoscope? Unfortunately, for this type of recommendation I can't see why it was recommended to me. It would have been interesting to know.