Moving downstream

Tuesday, June 08, 2010

Time to move on...

So I finally decided to leave Blogger and build something that provides me with a little bit more control over the types of content that I can provide and manage. This blog will not be updated any more. If you want to continue following me, please move to my new spot:

www.movingdownstream.com/blog

Yes, my own domain! Isn't that exciting? What isn't exciting is that I've had it for quite some time now and am only now putting it to some use.

Enough about my last post. So long!

BTW: I've imported all posts from this blog into my new blog, so you won't have to ever refer back to this blog.

Tuesday, May 18, 2010

It's hard to blog when having too much fun

Yes, I know, it's been another hiatus on my blogging activity. It's not that I don't have anything to say, it's just that the things that I want to talk about would require more time to blog than I have free to think about them. So I decided to just throw things out there and at least I'll get some things out. I apologize in advance for not providing much commentary on any of them. The order is not really important.

1) Average per household food cost for the largest cities in the US

Original article from Bundle

The diagram above is quite interesting, but, at the same time, not necessarily that helpful. Looking at New York numbers is probably the best way to see how misleading the numbers could be. It's all a matter of how big the city is and how diverse the population is. Cheap food can be very cheap and it becomes expensive very rapidly. For example, you can go to McDonald's and spend $4 per person on a reasonably sized meal. At the same time, I can go to Whole Foods and spend $15 on salad and a little bit of starch and protein.

The same thing can be said about groceries. I remember when I was living in Stillwater, OK that I could go to Wal-Mart and spend $30 on my weekly groceries. Today I generally don't spend less than $70. And I'm probably even buying less food than I used to buy (I used to have more time to cook than I have today).

My point is that when prices vary this much and a city mixes people of very different income levels, low-income people (who are usually the majority) will pull the average cost down very rapidly. A better metric would be to look at this number as a fraction of income. Do New Yorkers spend more of their salary on food, or on buying HDTVs?

2) Arts

Lots of things going on in the arts department here. A day before my birthday Amy and I went to see the world premiere of Amelia. It was a very different opera. The story line was mostly non-linear and there were very few things that I could call a "big aria".

The production was divided into two acts of roughly an hour each, each with three scenes. Between the scenes the curtains would go down but the orchestra kept on playing. No pause for clapping until the end of the acts. The music was quite modern, unusual for opera, but nothing that I can say I had never heard before. But, as I said, very different for an opera.

Later this month we are going to watch Candide. This piece has a little bit more of a personal connection, as I rehearsed most of it for a performance that never actually happened before I left the choir. It's a much lighter operetta, but with the normal arias, duets, choirs, etc. Let's see how well they do it. It's always a danger to listen to a piece that I know very well, as any small mistakes will drive me crazy.

Finally, the choir I sing with (Seattle Jewish Chorale) is getting ready for our last concert of the season coming up on June 13th. Tickets are available on Brown Paper Tickets and with me if anybody is interested. It's going to be a great concert. I'm very excited about it.

3) Work

Lots of things happening at work. So many that writing them here will bore some people. I'll just say that I've been working late most days (not today, though, as I'm in one of those uninspired days - which gave me time to write this post!) but starting to see some light at the end of the tunnel of my most important project of this year.

Oh, and today I was awarded my first patent! It was one I filed with two other co-workers over 4 years ago. I have another 5 or 6 out there being reviewed. The patent process is very interesting, to put it mildly. I believe that patents are important, but they can be easily misused and that makes me sad. The question that people with way more understanding of the subject than I have been asking is whether the danger of misuse outweighs the benefits that patents provide. I will not even try to answer the question.

And I think that's all I'm going to write about today. There are many topics to cover, like my robot building project, the wedding, Facebook, Twitter, working in South Lake Union, my new gadgets, winemaking, books read, just to name a few, but I'll leave those to my readers' imagination until I decide to be uninspired to work again and write another long-ish post.

Monday, April 19, 2010

One graph to conquer it all

I received a link from a friend of mine to a very interesting challenge:

Graph Theory II: CONTEST for Geekgold

It's an interesting contest in which people decomposed a board game into a graph and then you need to identify the board game looking at the graph.

This reminded me of an old discussion I had at work. But before I get there I have to remind people that my Ph.D. research was on graph-structured databases, so I have used a lot of graphs in my life. I don't even know how many different graph "frameworks" I have implemented, including a very restricted but highly efficient graph database. I like graphs, but I also learned their limitations.

Back to my story: my old manager once had a vision of how to solve every problem. He thought that if we could build a giant graph that recorded everything and how everything related to everything else, we could solve all problems. We had many meetings discussing this vision, which was never implemented, because nobody really believed in it beyond that manager.

This vision had two different problems:

The first one, which is the easiest to explain, is scale. It's very hard to build something that has an arbitrary level of connectivity and allows for queries that could be of any length. If you add to this the need to build it on a large fleet of small and "unreliable" hardware, which requires redundancy and robustness to failure, you would have a very hard time keeping it at any reasonable level of performance. That I have personal experience with: I did implement a graph-structured database and, in order to achieve any meaningful performance, I had to (1) pre-calculate all the "joins" and (2) make them read-only.
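To make the "pre-calculate the joins and make them read-only" idea a bit more concrete, here is a minimal sketch. It is not the database described above; the edge list, the `precompute_two_hop` name, and the choice of a two-hop join are all assumptions made up for illustration. The point is only that the expensive join happens once at build time and queries become plain lookups on a frozen structure.

```python
from types import MappingProxyType


def precompute_two_hop(edges):
    """Join the edge list with itself once, up front, and freeze the result.

    Hypothetical helper: maps each source node to the nodes reachable
    in exactly two hops.
    """
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)

    two_hop = {}
    for src, mids in neighbors.items():
        reached = set()
        for mid in mids:
            reached |= neighbors.get(mid, set())
        two_hop[src] = frozenset(reached)

    # A read-only view: lookups work, item assignment raises TypeError.
    return MappingProxyType(two_hop)


# Made-up toy graph.
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "e")]
index = precompute_two_hop(edges)
print(sorted(index["a"]))
```

The trade-off is the one described above: reads are cheap, but any change to the graph means rebuilding the index.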

The second one is harder to explain without explicit examples, but it's related to the curse of dimensionality: if you add too much to your graph, soon you can't conclude anything from it, because a little noise in many dimensions will overwhelm all your signal. Just asserting that without an example rarely convinces anybody. New techniques to deal with large datasets with a large number of dimensions are more and more successful at identifying "low-hanging fruit" at scale, i.e. whatever has a very large signal compared to the rest of the noise. Those algorithms have been able to scale so well that, if you have a lot of data, it's possible to apply them to all of it and find a lot of different "high signal" patterns. It's not that the filters are getting better, it's just that we have been doing a better job at looking at a lot of data at once.
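Here is one rough way to put numbers on the "little noise in many dimensions" point. This is a back-of-the-envelope sketch under assumptions that are mine, not part of the original argument: two items differ by `delta` in a single meaningful dimension, and every other dimension adds independent noise with standard deviation `sigma` to each item. The expected squared distance contributed by each noise dimension is then 2 * sigma^2, so the share of the total distance that comes from the real signal shrinks as dimensions are added.

```python
def signal_fraction(delta, sigma, noise_dims):
    """Share of the expected squared distance that is due to the signal.

    Assumes one signal dimension with gap `delta` and `noise_dims`
    dimensions of independent noise (std `sigma`) on both items.
    """
    signal = delta ** 2
    noise = 2 * noise_dims * sigma ** 2
    return signal / (signal + noise)


# Hypothetical values: a clear gap of 3 and fairly small noise of 0.5.
for dims in (0, 10, 100, 1000):
    print(dims, round(signal_fraction(3.0, 0.5, dims), 3))
```

Even with noise much smaller than the gap, the noise dominates once there are enough dimensions.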

Anyway, going back to the board game graph puzzle, it's an interesting challenge. I don't know enough board games to be able to recognize almost any of them, but I had fun just trying to decipher the graphs and relate them to a possible game. Enjoy!

Sunday, April 04, 2010

The danger of Twitter sentiment analysis

So I was reading an article from TechCrunch entitled "Sentiment Is Split On The iPad: People Either Love It, Or Hate Others For Not Shutting Up About It". The subject was funny so I decided to read it. But when I started to look at their sources, I realized they were pretty much just using TweetFeel, which is a service that tries to do sentiment analysis on the tiny Twitter messages flowing around with the given keyword.

Sentiment analysis is a very hot topic lately and there are lots of interesting results from it. However, it doesn't work that well, because most of the methods are based on keywords around the concept you are looking for and language is not very good at being locally unambiguous. Twitter makes it different:

- On the positive side, people can't write much, so they will put their sentiments right there and not just refer to them in a faraway phrase
- On the negative side, there is only so much context that can be obtained from 140 characters.

So I decided to use TweetFeel to see what data they were using. They made some references in the article, but I wasn't sold. TweetFeel is quite interesting: it keeps streaming the references to the keyword you enter (in this case "ipad") and highlights it in green if it's considered good and red if it's bad. It also keeps count of good and bad references.

After letting it run for a minute or so I was seeing about the same thing that the TechCrunch article mentioned:

Negative: 37 (52%)
Positive: 34 (48%)

But when I started looking at what was considered positive and negative, I started seeing some very interesting tweets (I don't recommend people clicking on the links):

Free ipad WTF http://j.mp/cQdSbk Risen #thefeelingyouget #TLS 5lko
Free ipad WTF http://sn.im/v8uda #OMGThatsSoTrue Feliz Páscoa #TheFeelingYouGet #HappyBdayKoba j59y

In other words, lots of spam sources that were stuffing common keywords into their tweets, hoping people would click on their links thinking they had something to do with the keyword. Moreover, because they had the word "WTF", I'm guessing TweetFeel considered them negative. My 1-minute sample is not significant, but if I remove all those spam tweets, here is the new count:

Negative: 26 (43%)
Positive: 34 (57%)

(as I said, this is not statistically significant, so don't take these numbers too seriously)
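The recount above is simple arithmetic, and a tiny sketch makes it easy to check. The number of spam tweets (11) is inferred from the two negative counts (37 before, 26 after), and the `share` helper is just a hypothetical name for this example.

```python
def share(negative, positive):
    """Return (negative %, positive %) rounded to whole percentages."""
    total = negative + positive
    return round(100 * negative / total), round(100 * positive / total)


raw_negative, raw_positive = 37, 34
spam_counted_as_negative = 11  # inferred: 37 before cleanup, 26 after

before = share(raw_negative, raw_positive)
after = share(raw_negative - spam_counted_as_negative, raw_positive)
print(before, after)
```

With a sample this small, 11 mislabeled tweets are enough to flip which side is in the majority.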

Anyway, now that I'm talking about the iPad, one might be wondering if I'm planning on buying one. The answer right now is "no". If I look at how I access and interact with the information that I want, I don't really think that there is a gap that is worth $500. There are some apps on it that I really wish I could access without having one, though, even if it's just to play around with them for some time (like the Marvel app for reading comics). I just hope that the trend is not for things to migrate entirely to the iPad framework without also having a web or other computer-based way of accessing them.

Saturday, April 03, 2010

Amazon still invaded by medical doctors?

A long time ago, Amazon suddenly started recommending me a lot of medical stuff, like anatomy books, drug dictionaries, etc. Looking at why I was recommended it (which is one of the best features for curious people like me), it was all because I had bought a Palm Pilot (yes, this was a long time ago). Apparently Palm Pilots were very popular among doctors because they had some very useful apps for them to keep track of patients, do quick calculations and get references to drugs.

The interesting thing is that if you think of the statistics of it (and I ask people who claim a statistics background to go through this during interviews all the time), it makes sense that you will see something like this if you have a biased population. Let's say that out of customers for product A, 20% are medical doctors and the rest are a random scattering of other types of people. If, just to make it simple, 50% of all medical doctors buy product B, suddenly you will see that 10% of everybody that bought product A also bought product B.
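The arithmetic in that example can be spelled out in a few lines. This is only a sketch: the `co_purchase_rate` name is made up, and the rate at which non-doctors buy product B is assumed to be zero to match the simplified 10% figure (any positive rate would only push the number higher).

```python
def co_purchase_rate(doctor_share, doctor_buy_rate, other_buy_rate=0.0):
    """Fraction of product A buyers who also bought product B.

    other_buy_rate is an assumption: how often non-doctors buy B.
    """
    from_doctors = doctor_share * doctor_buy_rate
    from_others = (1 - doctor_share) * other_buy_rate
    return from_doctors + from_others


# 20% of A's buyers are doctors, and 50% of doctors buy B.
rate = co_purchase_rate(0.20, 0.50)
print(f"{rate:.0%} of product A buyers also bought product B")
```

If B is something almost nobody outside medicine buys, a 10% co-purchase rate stands far above its baseline, which is why it surfaces as a recommendation.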

The reason that medical doctors are such an interesting category is that I'm not aware of any other category of people whose purchases concentrate so strongly on the same specific products. Engineers don't all buy similar books. Graphic designers also don't. Maybe lawyers might, but I haven't seen any evidence that this is happening.

Anyway, why was this brought to my attention? Well, it's because I added something to my Amazon wedding registry and I received the following recommendation:

A stethoscope? Unfortunately for this type of recommendation I can't see why I was recommended it. It would have been interesting.

Tuesday, March 16, 2010

Is being recognized always good?

Yesterday I received an email from a former research colleague announcing that a paper I co-wrote with him in the past has surpassed 100 citations. The paper, Problems with fitting to the power-law distribution, was quite an interesting paper to write. The idea to write it came from the aforementioned colleague, but most of the experiments and statistical background were provided by me (that's why I was given the first author position on the paper).

I'm proud of having written it, but every time I read about it I remember one sad thing: this wasn't what my research was about. None of the papers I wrote for my research received any recognition. It's quite an interesting conflict that probably I'll have to live with for the rest of my life. Unless I decide to get back to research and continue my work on feature extraction on graph-structured databases until I find ways to draw better parallels to other people's research results and people can use my proposed ideas.

Anyway, I should be happy for the achievement. 100 citations in 5.5 years is something that very few can claim. Actually Google Scholar claims that the paper has something between 174 and 175 citations. I don't trust the results of Google Scholar, but I have to live with it, as it's the only "free" system that provides some sort of comprehensive view of articles and citations.

Sometimes people need to be careful with statements

The written language is a powerful thing, but is also a dangerous weapon that can backfire if you don't know your audience very well and don't measure your words correctly. Here is a statement that I received from somebody through email:

"[...] the development is either algorithmic or done in C# [...]"

In other words, this person claimed that no algorithms can be done in C#? Quite a strong statement!

Thursday, February 25, 2010

The fall of FriendFeed

A long time ago, I mentioned in this blog that I really liked the idea of FriendFeed as a hub for people's activities online that was both open and semi-extensible. Unfortunately it was probably a little too complex for most people, never really caught on with the masses, and was eventually acquired by Facebook.

After the acquisition there was a mass exodus of people who used to add content directly to FriendFeed, but I could still follow most of the people that I liked to follow there, because of the "hub" effect. I couldn't follow the discussions anymore, but there was still something out there to see.

Well, tonight a new nail was added to the FriendFeed coffin: it was down and has been down for at least 45 minutes (I tried to access it 45 minutes ago and it was down and, as far as I can tell, it's still down). I'm getting a great message:

500 Internal Server Error
nginx/0.6.34

Not good at all...

Actually I'm really tired of all these "news push" technologies. I have even stopped regularly reading my RSS readers. I don't access Twitter or Facebook. I at least read my emails, but haven't been very good at replying to them.

So you might ask what I've been doing with all this extra free time on my hands? My blog hasn't been the one receiving all of it, so what is it? To tell you the truth, I'm not really sure. I've been working until reasonably late, dealing with wedding stuff, sometimes playing some video game at night (as "my" PS3 is going to stop being mine on Tuesday).

Another thing I've been doing is struggling with OpenEmbedded... I don't even want to start on this one. It has been a very painful process to just get to build a distribution with the developer libraries of OpenCV for my BeagleBoard. It's one of the hard things about working with a reasonably fast-moving open source project: the main documentation is the user discussions, and they all seem to refer to previous versions, because their suggestions don't seem to work for me. And I'm really learning to despise Windows 7. If they call this the best Windows yet, those Microsoft people are keeping their bar quite low to allow for even better OSs in the future.