  • Toolbending:There are secret messages everywhere

    To present

    Everyone prepares one file (file.mp3, file.wav, file.ps, file.png, file.txt, file.pdf ...) for the next session:

    • Try to vary the ingredients: texts, fonts, options, colours ...
    • Use "man", "-h" or search online to find options, examples and help
    • Combine several recipes using <, | and >
    • Document your recipes here: http://ustensile.be/index.php/Recettes !


    Workshop: There are secret messages everywhere

    Every day we look at internet content with its diverse design styles, advertising and images. But if we look beyond the result that we see in the browser, we can see that it is a product of structured data. Let's do a few experiments to see if we can extract specific bits of information from web sources.

    cURL is a command-line Swiss Army knife for grabbing data from the internet: http://curl.haxx.se/

    Try downloading some URLs with cURL and examine the HTML. Remember that you can redirect the output to a file with > or pipe it to another command-line program with | to do more with it. Try using "grep" to find the parts of the web page that interest you.

    $ curl http://apple.stackexchange.com/feeds/question/35852 | grep author
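
    The same filter can be sketched in Python, which helps when the matching lines need further processing. This is only an illustration: the helper names fetch and grep_lines and the sample text are made up for this example, and the live download is left commented out because it needs a network connection.

```python
from urllib.request import urlopen


def fetch(url):
    """Download url and return its body as text (needs a network connection)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def grep_lines(text, keyword):
    """Return the lines of text that contain keyword, like a basic grep."""
    return [line for line in text.splitlines() if keyword in line]


# Made-up sample standing in for a downloaded feed.
sample = "<entry>\n<author><name>alice</name></author>\n<title>Hello</title>\n</entry>"
for line in grep_lines(sample, "author"):
    print(line)

# To try it on the feed from the command above:
# for line in grep_lines(fetch("http://apple.stackexchange.com/feeds/question/35852"), "author"):
#     print(line)
```

    Like grep without options, this matches plain substrings, not regular expressions.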

    RSS and Atom feeds are special XML files that describe the recent updates to a web site. We can use them to make our own newspapers and curated collections. https://fr.wikipedia.org/wiki/RSS

    We can write a short Python program that extracts common details from RSS feeds by using the minidom module (xml.dom.minidom). http://docs.python.org/2/library/xml.dom.minidom.html
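
    A minimal sketch of such a program, assuming an RSS 2.0 feed whose entries are <item> elements with <title> and <link> children. The sample feed and the helper names child_text and rss_items are made up for this example; a real feed would first be downloaded, for instance with curl.

```python
from xml.dom import minidom

# Made-up RSS 2.0 document standing in for a downloaded feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.org/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/2</link>
    </item>
  </channel>
</rss>"""


def child_text(node, tag):
    """Return the text of the first <tag> element under node, or ""."""
    elements = node.getElementsByTagName(tag)
    if not elements or elements[0].firstChild is None:
        return ""
    return elements[0].firstChild.data.strip()


def rss_items(xml_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    dom = minidom.parseString(xml_text)
    return [(child_text(item, "title"), child_text(item, "link"))
            for item in dom.getElementsByTagName("item")]


for title, link in rss_items(SAMPLE_RSS):
    print(title, link)
```

    Atom feeds use <entry> elements instead of <item>, so the tag names would need adjusting for those.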

    Using the DOM can be rather confusing, and sometimes we need to "scrape" content from an HTML page. The Python library Beautiful Soup makes this much easier. http://www.crummy.com/software/BeautifulSoup/

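    Installing Beautiful Soup on Linux (Python 3 package):

    $ sudo apt-get install python3-bs4

    Save an IMDb page to a file (the URL is quoted so that the shell does not interpret the "?"):

    $ curl "http://www.imdb.com/title/tt1740707/?ref_=fn_al_tt_1" > movie.html

    In a text editor, save the following as "movie.py":

    from bs4 import BeautifulSoup

    # Parse the page saved by curl; naming the parser explicitly avoids a warning
    with open("movie.html", encoding="utf-8") as doc:
        soup = BeautifulSoup(doc, "html.parser")

    # Find every element whose itemprop attribute is "name" and print its text
    actors = soup.find_all(itemprop="name")
    for actor in actors:
        print(actor.text)

    In the terminal:

    $ python3 movie.py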

    Brendan Howell was born in Manchester, CT, USA in 1976. He is an artist and a reluctant engineer who has created various software works and interactive electronic inventions. Currently, he lives in Berlin, Germany. He has done research and led courses at the Muthesius Kunsthochschule, Merz Akademie, Fachhochschule Potsdam and the Kunsthochschule Berlin, Weißensee.