MESSAGE
DATE  20161027 
FROM  Christopher League

SUBJECT  Re: [Learn] Phylogenetics educational links

From learnbouncesatnylxs.com Thu Oct 27 16:09:40 2016 ReturnPath: XOriginalTo: archiveatmrbrklyn.com DeliveredTo: archiveatmrbrklyn.com Received: from www.mrbrklyn.com (www.mrbrklyn.com [96.57.23.82]) by mrbrklyn.com (Postfix) with ESMTP id 6EBF4161312; Thu, 27 Oct 2016 16:09:40 0400 (EDT) XOriginalTo: learnatnylxs.com DeliveredTo: learnatnylxs.com Received: from liucs.net (contrapunctus.net [174.136.110.10]) by mrbrklyn.com (Postfix) with ESMTP id CA8F9160E77 for ; Thu, 27 Oct 2016 16:09:37 0400 (EDT) Received: from localhost (unknown [148.4.40.11]) by liucs.net (Postfix) with ESMTPSA id AF1C5E08E for ; Thu, 27 Oct 2016 16:09:35 0400 (EDT) From: Christopher League To: learnatnylxs.com InReplyTo: <8a528f2862982fe80a018899000d0244atmrbrklyn.com> References: <8a528f2862982fe80a018899000d0244atmrbrklyn.com> UserAgent: Notmuch/0.21 (http://notmuchmail.org) Emacs/25.1.1 (x86_64unknownlinuxgnu) Date: Thu, 27 Oct 2016 16:09:30 0400 MessageID: <8760odd05h.fsfatcontrapunctus.net> MIMEVersion: 1.0 Subject: Re: [Learn] Phylogenetics educational links XBeenThere: learnatnylxs.com XMailmanVersion: 2.1.17 Precedence: list ListId: ListUnsubscribe: , ListArchive: ListPost: ListHelp: ListSubscribe: , ContentType: multipart/mixed; boundary="===============0074659448==" ErrorsTo: learnbouncesatnylxs.com Sender: "Learn"
===============0074659448== ContentType: multipart/signed; boundary="====="; micalg=pgpsha256; protocol="application/pgpsignature"
===== ContentType: multipart/mixed; boundary="==="
=== ContentType: multipart/alternative; boundary="===="
==== ContentType: text/plain ContentTransferEncoding: quotedprintable
Here's a set of introductory slides on inference of phylogenetic trees.
Based on what I've learned today, "small" parsimony algorithms like Fitch and Sankoff rely on labeling a *given* tree shape (aka topology), so we already have to know (or hypothesize) the ancestral relationships. The algorithm just determines which labels (mutations) to assign to interior nodes. That's really unsatisfying to me.
But the "big" parsimony techniques have to search the entire tree space, which is ENORMOUS. The problem is formally NPhard. Now  people do manage to solve (or approximately solve) NPhard problems every day by using piles of dirty tricks. When it comes to searching gigantic spaces, those dirty tricks are the classic techniques of artificial intelligence.
So the lecture slides get into techniques like greedy algorithms, hillclimbing, simulated annealing, genetic algorithms, etc. Anyway, there could be a lot of meat here that fits under the heading of AI + phylogenetics. It's much more accessible stuff than trying to automatically interpret 3D model data to take measurements of maxillary bones.
CL
Ruben Safir writes:
> http://telliott99.blogspot.com/2010/03/fitchandsankoffalgorithmsfor.h= tml > > "The Fitch algorithm considers the sites (or characters) one at a time. A= t each tip in the tree, we create a set containing those nucleotides (state= s) that are observed or are compatible with the observation. Thus, if we se= e an A, we create the set {A}. If we see an ambiguity such as R, we create = the set {AG}. Now we move down the tree [away from the tips]. In algorithmi= c terms, we do a postorder tree traversal. At each interior node we create = a set that is the intersection of sets at the two descendant nodes. However= , if that set is empty, we instead create the set that is the union of the = two sets at the descendant nodes. Every time we create such a union, we als= o count one change of state." > > > https://en.wikipedia.org/wiki/Nonparametric > Nonparametric statistics are statistics not based on parameterized famili= es of probability distributions. They include both descriptive and inferent= ial statistics. The typical parameters are the mean, variance, etc. Unlike = parametric statistics, nonparametric statistics make no assumptions about t= he probability distributions of the variables being assessed. The differenc= e between parametric models and nonparametric models is that the former ha= s a fixed number of parameters, while the latter grows the number of parame= ters with the amount of training data.[1] Note that the nonparametric mode= l does, counterintuitively, contain parameters: the distinction is that par= ameters are determined by the training data in the case of nonparametric s= tatistics, not the model. > > https://en.wikipedia.org/wiki/FitchMargoliash_algorithm > Distancematrix methods > > Distancematrix methods of phylogenetic analysis explicitly rely on a mea= sure of "genetic distance" between the sequences being classified, and ther= efore they require an MSA (multiple sequence alignment) as an input. Distan= ce is often defined as the fraction of mismatches at aligned positions, wit= h gaps either ignored or counted as mismatches.[1] Distance methods attempt= to construct an alltoall matrix from the sequence query set describing t= he distance between each sequence pair. From this is constructed a phylogen= etic tree that places closely related sequences under the same interior nod= e and whose branch lengths closely reproduce the observed distances between= sequences. Distancematrix methods may produce either rooted or unrooted t= rees, depending on the algorithm used to calculate them. They are frequentl= y used as the basis for progressive and iterative types of multiple sequenc= e alignment. The main disadvantage of distancematrix methods is their inab= ility to efficiently use=20 > information about local highvariation regions that appear across multip= le subtrees.[2] > > > =20 > So many immigrant groups have swept through our town > that Brooklyn, like Atlantis, reaches mythological > proportions in the mind of the world  RI Safir 1998 > http://www.mrbrklyn.com=20 > > DRM is THEFT  We are the STAKEHOLDERS  RI Safir 2002 > http://www.nylxs.com  Leadership Development in Free Software > http://www2.mrbrklyn.com/resources  Unpublished Archive > http://www.coinhangout.com  coins! > http://www.brooklynliving.com > > Being so tracked is for FARM ANIMALS and and extermination camps, > but incompatible with living as a free human being. RI Safir 2013 > _______________________________________________ > Learn mailing list > Learnatnylxs.com > http://lists.mrbrklyn.com/mailman/listinfo/learn
==== ContentType: text/html; charset=utf8 ContentTransferEncoding: quotedprintable
1.0, userscalable=3Dyes">
Here=E2=80=99s a set of introductory slides on inference of phylogenetic= trees.
pdf" class=3D"uri">http://www.cs.otago.ac.nz/cosc348/phylo/Lecture14_PhyloO= ptim.pdf
Based on what I=E2=80=99ve learned today, =E2=80=9Csmall=E2=80=9D parsim= ony algorithms like Fitch and Sankoff rely on labeling a given tre= e shape (aka topology), so we already have to know (or hypothesize) the anc= estral relationships. The algorithm just determines which labels (mutations= ) to assign to interior nodes. That=E2=80=99s really unsatisfying to me.
But the =E2=80=9Cbig=E2=80=9D parsimony techniques have to search the en= tire tree space, which is ENORMOUS. The problem is formally NPhard. Now = =E2=80=93 people do manage to solve (or approximately solve) NPhard proble= ms every day by using piles of dirty tricks. When it comes to searching gig= antic spaces, those dirty tricks are the classic techniques of artificial i= ntelligence.
So the lecture slides get into techniques like greedy algorithms, hillc= limbing, simulated annealing, genetic algorithms, etc. Anyway, there could = be a lot of meat here that fits under the heading of AI + phylogenetics. It= =E2=80=99s much more accessible stuff than trying to automatically interpre= t 3D model data to take measurements of maxillary bones.
CL
Ruben Safir rubenatmrbrklyn.com= writes:
http://telliott99.blogspot.com/2010/03/fitchandsankoffalgorithmsfor.= html
=E2=80=9CThe Fitch algorithm considers the sites (or characters) one at = a time. At each tip in the tree, we create a set containing those nucleotid= es (states) that are observed or are compatible with the observation. Thus,= if we see an A, we create the set {A}. If we see an ambiguity such as R, w= e create the set {AG}. Now we move down the tree [away from the tips]. In a= lgorithmic terms, we do a postorder tree traversal. At each interior node w= e create a set that is the intersection of sets at the two descendant nodes= . However, if that set is empty, we instead create the set that is the unio= n of the two sets at the descendant nodes. Every time we create such a unio= n, we also count one change of state.=E2=80=9D
https://en.wikipedia.org/wiki/Nonparametric Nonparametric statistics ar= e statistics not based on parameterized families of probability distributio= ns. They include both descriptive and inferential statistics. The typical p= arameters are the mean, variance, etc. Unlike parametric statistics, nonpar= ametric statistics make no assumptions about the probability distributions = of the variables being assessed. The difference between parametric models a= nd nonparametric models is that the former has a fixed number of parameter= s, while the latter grows the number of parameters with the amount of train= ing data.[1] Note that the nonparametric model does, counterintuitively, c= ontain parameters: the distinction is that parameters are determined by the= training data in the case of nonparametric statistics, not the model.
https://en.wikipedia.org/wiki/FitchMargoliash_algorithm Distancematrix= methods
Distancematrix methods of phylogenetic analysis explicitly rely on a me= asure of =E2=80=9Cgenetic distance=E2=80=9D between the sequences being cla= ssified, and therefore they require an MSA (multiple sequence alignment) as= an input. Distance is often defined as the fraction of mismatches at align= ed positions, with gaps either ignored or counted as mismatches.[1] Distanc= e methods attempt to construct an alltoall matrix from the sequence query= set describing the distance between each sequence pair. From this is const= ructed a phylogenetic tree that places closely related sequences under the = same interior node and whose branch lengths closely reproduce the observed = distances between sequences. Distancematrix methods may produce either roo= ted or unrooted trees, depending on the algorithm used to calculate them. T= hey are frequently used as the basis for progressive and iterative types of= multiple sequence alignment. The main disadvantage of distancematrix meth= ods is their inability to efficiently use information about local highvari= ation regions that appear across multiple subtrees.[2]
=E2=80=93 So many immigrant groups have swept through our town that Broo= klyn, like Atlantis, reaches mythological proportions in the mind of the wo= rld  RI Safir 1998 http://www.mrbrklyn.com
DRM is THEFT  We are the STAKEHOLDERS  RI Safir 2002 http://www.nylxs.= com  Leadership Development in Free Software http://www2.mrbrklyn.com/reso= urces  Unpublished Archive http://www.coinhangout.com  coins! http://www.= brooklynliving.com
Being so tracked is for FARM ANIMALS and and extermination camps, but in= compatible with living as a free human being. RI Safir 2013 ______________= _________________________________ Learn mailing list Learnatnylxs.com http:/= /lists.mrbrklyn.com/mailman/listinfo/learn
====
===
===== ContentType: application/pgpsignature; name="signature.asc"
BEGIN PGP SIGNATURE
iQEcBAEBCAAGBQJYEl76AAoJEGuLsz1PMbCLTawIAJwrk9T8w1h5LZFyWh8W+bYC L0PdcTaqztW5afgz4wyBNE+Cyf60cCYGXkje6WFSQs+hIZ2UcnU5rGKKIPzyK74G HiXsMYMyaxpzBUdYmk9qRRks/no4zsDW5Va/f6g8SE36l1+cvHZv8Jj2eKIfXFkc hZZ7csdz9IzbOW5jkVBlvCidaHOxZRk4gdbLaY8l4iI8Bc/XQv2p/MLepsI6NkMV WcqbzUMIi8UsF+mtwj/gAsB0weYiGAbmXoIc7gCyFDQZXdqqaKgX7kOOTRv8OPJl VXPZWEVIDAe2BwOlnJ/T0WfwUUNIboyI6d8tfn7sOo94bxCiNkC+IjZjzIjML4Q= =9OrT END PGP SIGNATURE =====
===============0074659448== ContentType: text/plain; charset="usascii" MIMEVersion: 1.0 ContentTransferEncoding: 7bit ContentDisposition: inline
_______________________________________________ Learn mailing list Learnatnylxs.com http://lists.mrbrklyn.com/mailman/listinfo/learn
===============0074659448==

