<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://replica.wiki.extremist.software/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Dkritz</id>
	<title>Noisebridge - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://replica.wiki.extremist.software/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Dkritz"/>
	<link rel="alternate" type="text/html" href="https://replica.wiki.extremist.software/wiki/Special:Contributions/Dkritz"/>
	<updated>2026-04-05T13:22:51Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.13</generator>
	<entry>
		<id>https://replica.wiki.extremist.software/index.php?title=Machine_Learning_Meetup_Notes:2011-4-13&amp;diff=17746</id>
		<title>Machine Learning Meetup Notes:2011-4-13</title>
		<link rel="alternate" type="text/html" href="https://replica.wiki.extremist.software/index.php?title=Machine_Learning_Meetup_Notes:2011-4-13&amp;diff=17746"/>
		<updated>2011-04-19T18:38:42Z</updated>

		<summary type="html">&lt;p&gt;Dkritz: link to uploaded ppt&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Anthony Goldbloom from Kaggle Visits&lt;br /&gt;
&lt;br /&gt;
*Link to his talk: [https://www.noisebridge.net/images/e/ed/Goldbloom_-_Predictive_modeling_competitions_-_April_2011.ppt PPT presentation]&lt;br /&gt;
*Guy used random forests to win the HIV competition. The term &amp;quot;random forests&amp;quot; is trademarked. Dude taught himself machine learning by watching YouTube videos. Random forests are pretty robust to new data.&lt;br /&gt;
**Used the [http://cran.r-project.org/web/packages/caret/ caret] package in R to work with random forests.&lt;br /&gt;
*Kaggle splits the test dataset in two and uses one half for the leaderboard.&lt;br /&gt;
*Often the score difference between the winning model and second place is not statistically significant, so they award prizes to the top few. They might impose restrictions on the execution time of a model.&lt;br /&gt;
*In general, performance plateaus within a few weeks of a competition starting. This seems to be because all the information has been &amp;quot;squeezed&amp;quot; out of the dataset by that point.&lt;br /&gt;
*Chess rating competition: build a new rating system that more accurately predicts game results. Performance still plateaued, but it took longer.&lt;br /&gt;
*Most users of kaggle are from computer science and statistics, followed by economics, math, biostats.&lt;br /&gt;
*Tools people use:&lt;br /&gt;
**R: lots of American users&lt;br /&gt;
**Matlab&lt;br /&gt;
**SAS&lt;br /&gt;
**Weka&lt;br /&gt;
**SPSS&lt;br /&gt;
**Python: although it&#039;s lower on the list, people are successful with it&lt;br /&gt;
*R packages used: Caret, RFE, GLM, NNET, Forecast&lt;br /&gt;
*Heritage Prize&lt;br /&gt;
**Real shit is going down May 4th, with the release of all datasets.&lt;br /&gt;
**Ends in 2 years. No rush.&lt;br /&gt;
**Four prizes in total, given out throughout the next two years.&lt;/div&gt;</summary>
		<author><name>Dkritz</name></author>
	</entry>
	<entry>
		<id>https://replica.wiki.extremist.software/index.php?title=File:Goldbloom_-_Predictive_modeling_competitions_-_April_2011.ppt&amp;diff=17745</id>
		<title>File:Goldbloom - Predictive modeling competitions - April 2011.ppt</title>
		<link rel="alternate" type="text/html" href="https://replica.wiki.extremist.software/index.php?title=File:Goldbloom_-_Predictive_modeling_competitions_-_April_2011.ppt&amp;diff=17745"/>
		<updated>2011-04-19T18:36:42Z</updated>

		<summary type="html">&lt;p&gt;Dkritz: Anthony Goldbloom from Kaggle Health Prize PPT talk to the ML group Wednesday April 13th 2011&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Anthony Goldbloom from Kaggle Health Prize PPT talk to the ML group Wednesday April 13th 2011&lt;/div&gt;</summary>
		<author><name>Dkritz</name></author>
	</entry>
	<entry>
		<id>https://replica.wiki.extremist.software/index.php?title=Machine_Learning/Datasets&amp;diff=17164</id>
		<title>Machine Learning/Datasets</title>
		<link rel="alternate" type="text/html" href="https://replica.wiki.extremist.software/index.php?title=Machine_Learning/Datasets&amp;diff=17164"/>
		<updated>2011-03-16T06:07:10Z</updated>

		<summary type="html">&lt;p&gt;Dkritz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Machine learning is a vast field and there are many different types of problems to be solved. If you find a dataset interesting, try to categorize it (or add a new category) and add it to the links below.&lt;br /&gt;
&lt;br /&gt;
===Classification===&lt;br /&gt;
*[http://yann.lecun.com/exdb/mnist/ MNIST Handwritten Digits]&lt;br /&gt;
**Classify handwritten digits using this dataset, a very popular one with lots of training examples.&lt;br /&gt;
*[http://archive.ics.uci.edu/ml/datasets/Heart+Disease Heart Disease]&lt;br /&gt;
**Predict whether a person will have heart disease based on a subset of 76 factors.&lt;br /&gt;
*[http://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29 Census Income]&lt;br /&gt;
**Try to predict whether a person has an income greater than or less than 50k.&lt;br /&gt;
&lt;br /&gt;
===Regression===&lt;br /&gt;
*[http://www.sci.usq.edu.au/staff/dunn/Datasets/Books/Hand/Hand-R/alps-R.html Boiling point in the Alps]&lt;br /&gt;
**The boiling point of water at different barometric pressures. &lt;br /&gt;
*[http://www.sci.usq.edu.au/staff/dunn/Datasets/Books/Hand/Hand-R/shocking-R.html Shocking Rats]&lt;br /&gt;
**How does shocking a rat affect its ability to complete a maze?&lt;br /&gt;
*[http://www.sci.usq.edu.au/staff/dunn/Datasets/Books/Hand/Hand-R/icecream-R.html Ice Cream Sales]&lt;br /&gt;
**Predict the quantity of ice cream consumed based on some other variables.&lt;br /&gt;
*[http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/health/fev.html Smoking and Respiratory Function]&lt;br /&gt;
**How does smoking affect lung capacity?&lt;br /&gt;
&lt;br /&gt;
===Time Series===&lt;br /&gt;
*[http://robjhyndman.com/tsdldata/data/ausgundeaths.dat Gun-related Deaths in Australia]&lt;br /&gt;
**&amp;quot;Deaths from gun-related homicides and suicides and non-gun-related homicides and suicides. Australia: 1915-2004. Source: Neill and Leigh (2007).&amp;quot;&lt;br /&gt;
*[http://robjhyndman.com/tsdldata/data/immig.dat Immigration Rates]&lt;br /&gt;
**&amp;quot;Annual immigration into the United States: thousands. 1820 – 1962. From Kendall &amp;amp; Ord (1990), p.13.&amp;quot;&lt;br /&gt;
*[http://robjhyndman.com/tsdldata/roberts/beards.dat Percent of Men with Beards 1866-1911]&lt;br /&gt;
**&amp;quot;Percent of Men with full beards, 1866 – 1911. Source: Hipel and Mcleod (1994).&amp;quot;&lt;br /&gt;
*[http://robjhyndman.com/tsdldata/roberts/velmon.dat Velocity of Money in America 1869-1960]&lt;br /&gt;
**The [http://en.wikipedia.org/wiki/Velocity_of_money velocity of money] is basically the number of times a single unit of money changes hands over a period of time. The theory goes MV = PY, i.e. Velocity = (Prices * Economic Output) / Quantity of Money.&lt;br /&gt;
*[http://robjhyndman.com/tsdldata/annual/globtp.dat Changes in Global Air Temperature 1880-1985]&lt;br /&gt;
**&amp;quot;Surface air temperature change for the globe, 1880-1985, Temperature change actually means temperature against an arbitrary zero point. From James Hansen and Sergej Lebedeff, &amp;quot;Global Trends of Measured Surface Air Temperature&amp;quot;, `Journal of Geophysical Research`, Vol. 92, No. D11, pages 13,345-13,372, November 20, 1987.&amp;quot;&lt;br /&gt;
*[http://robjhyndman.com/tsdldata/data/earthq.dat Number of Earthquakes per Year 1900-1988 (&amp;gt;= 7.0)]&lt;br /&gt;
**&amp;quot;Source: National Earthquake Information Center. Different lists will give different numbers depending on the formula used for calculating the magnitude.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
===Clustering===&lt;br /&gt;
*[http://archive.ics.uci.edu/ml/datasets/Plants USDA Plants Data]&lt;br /&gt;
**Automatically cluster plants based on 70 attributes.&lt;br /&gt;
*[http://www.uni-koeln.de/themen/statistik/data/cluster/ Nutrients in Meat, Fish and Fowl]&lt;br /&gt;
**Can you cluster the foods by animal type given the nutrient data?&lt;br /&gt;
&lt;br /&gt;
===Text Data===&lt;br /&gt;
*[http://www.cs.cmu.edu/~enron/ Enron Emails]&lt;br /&gt;
**Search through Enron&#039;s publicly accessible emails.&lt;br /&gt;
*[http://archive.ics.uci.edu/ml/datasets/Bag+of+Words Bag of Words]&lt;br /&gt;
**Collection of word counts for various types of documents, including Enron emails, scientific papers, and New York Times articles.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Reinforcement Learning===&lt;/div&gt;</summary>
		<author><name>Dkritz</name></author>
	</entry>
	<entry>
		<id>https://replica.wiki.extremist.software/index.php?title=CS229&amp;diff=12569</id>
		<title>CS229</title>
		<link rel="alternate" type="text/html" href="https://replica.wiki.extremist.software/index.php?title=CS229&amp;diff=12569"/>
		<updated>2010-09-07T20:50:37Z</updated>

		<summary type="html">&lt;p&gt;Dkritz: /* Progress: Watching Lectures */  adding Dave (myself) to the lecture track list&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
CS229 is the undergraduate machine learning course at Stanford. You can watch the lectures on [http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048#ls=1 iTunesU] and [http://www.youtube.com/results?search_query=stanford%20cs%20229&amp;amp;search=Search&amp;amp;sa=X&amp;amp;oi=spell&amp;amp;resnum=0&amp;amp;spell=1 YouTube]. We are going to work through the course at one lecture a week, starting 1 September 2010 and finishing in January 2011. There are four problem sets, which we&#039;ll be doing at a rate of one every five weeks.&lt;br /&gt;
&lt;br /&gt;
[http://www.stanford.edu/class/cs229/ http://www.stanford.edu/class/cs229/] &lt;br /&gt;
&lt;br /&gt;
=== Course Description ===&lt;br /&gt;
&lt;br /&gt;
This course provides a broad introduction to machine learning and&lt;br /&gt;
statistical pattern recognition. Topics include: supervised learning&lt;br /&gt;
(generative/discriminative learning, parametric/non-parametric&lt;br /&gt;
learning, neural networks, support vector machines); unsupervised&lt;br /&gt;
learning (clustering, dimensionality reduction, kernel methods);&lt;br /&gt;
learning theory (bias/variance tradeoffs; VC theory; large margins);&lt;br /&gt;
reinforcement learning and adaptive control. The course will also&lt;br /&gt;
discuss recent applications of machine learning, such as to robotic&lt;br /&gt;
control, data mining, autonomous navigation, bioinformatics, speech&lt;br /&gt;
recognition, and text and web data processing.&lt;br /&gt;
&lt;br /&gt;
== Schedule ==&lt;br /&gt;
* one lecture a week&lt;br /&gt;
* one problem set every five weeks&lt;br /&gt;
&lt;br /&gt;
[http://www.google.com/calendar/embed?src=cWE3bGFpNnZxazdpamNjbmc4bXJsY2hyNGdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ  Google Calendar of schedule]&lt;br /&gt;
&lt;br /&gt;
==Progress: Watching Lectures ==&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
| Name&lt;br /&gt;
| Lecture 1&lt;br /&gt;
| Lecture 2&lt;br /&gt;
| Lecture 3&lt;br /&gt;
| Lecture 4&lt;br /&gt;
| Lecture 5&amp;lt;br /&amp;gt; 9/29&lt;br /&gt;
| Lecture 6&lt;br /&gt;
| Lecture 7&lt;br /&gt;
| Lecture 8&lt;br /&gt;
| Lecture 9&lt;br /&gt;
| Lecture 10&amp;lt;br /&amp;gt; 11/3&lt;br /&gt;
| Lecture 11&lt;br /&gt;
| Lecture 12&lt;br /&gt;
| Lecture 13&lt;br /&gt;
| Lecture 14&lt;br /&gt;
| Lecture 15&amp;lt;br /&amp;gt; 12/8&lt;br /&gt;
| Lecture 16&lt;br /&gt;
| Lecture 17&lt;br /&gt;
| Lecture 18&lt;br /&gt;
| Lecture 19&lt;br /&gt;
| Lecture 20&amp;lt;br /&amp;gt; 1/12&lt;br /&gt;
|-&lt;br /&gt;
| Thomas&lt;br /&gt;
| First! ;)&amp;lt;br/&amp;gt;[[Image:Gold-star.jpg|center|30px|Gold-Star]]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
| Joe&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
| Glen&lt;br /&gt;
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
| Jared&lt;br /&gt;
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
| Dave&lt;br /&gt;
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
| You!&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==Progress: Assignments ==&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
| Name&lt;br /&gt;
| Problem set 1&amp;lt;br /&amp;gt; due 9/29&lt;br /&gt;
| Problem set 2&amp;lt;br /&amp;gt; due 11/3&lt;br /&gt;
| Problem set 3&amp;lt;br /&amp;gt; due 12/8&lt;br /&gt;
| Problem set 4&amp;lt;br /&amp;gt; due 1/20&lt;br /&gt;
|-&lt;br /&gt;
| Thomas&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
| Joe&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
| Glen&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
| Jared&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
| You!&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Dkritz</name></author>
	</entry>
</feed>