Parsing Tweets

A method using regular expressions in PHP: it parses tweets and reformats the text as HTML with links. A tutorial is here:

http://saturnboy.com/2010/02/parsing-twitter-with-regexp/
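As a rough idea of what this kind of reformatting involves, here is a minimal sketch. It is not the code from the tutorial: the function name linkifyTweet, the patterns, and the twitter.com link targets are my own assumptions, and the patterns only handle ASCII usernames and hashtags.

```php
<?php
// Hypothetical helper (not from the tutorial): wrap URLs, @mentions and
// #hashtags in HTML links.
function linkifyTweet(string $text): string
{
    // Escape first so markup inside the tweet cannot leak into the page.
    $html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // URLs: anything starting with http:// or https:// up to the next whitespace.
    $html = preg_replace('~(https?://[^\s<]+)~i', '<a href="$1">$1</a>', $html);

    // Mentions: @name at the start of the text or after a non-word character.
    $html = preg_replace(
        '~(^|[^\w/])@(\w{1,15})~',
        '$1<a href="https://twitter.com/$2">@$2</a>',
        $html
    );

    // Hashtags: the & and / exclusions skip HTML entities such as &#039;
    // and fragments inside URLs that were linked above.
    $html = preg_replace(
        '~(^|[^\w&/])#(\w+)~',
        '$1<a href="https://twitter.com/hashtag/$2">#$2</a>',
        $html
    );

    return $html;
}

echo linkifyTweet('RT @tkzic: my #cat is purple http://example.com/cat'), "\n";
```

The order matters: URLs are linked first, and the later patterns refuse a # or @ that directly follows a word character or a slash, so they leave the already-linked URLs alone.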

This PHP library breaks out hashtags, usernames, etc., but doesn’t really provide a way to isolate the remaining text. I have put it in tkzic/API; an example PHP program is provided.

https://github.com/mzsanford/twitter-text-php
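Isolating the remaining text can be approximated without the library. The sketch below uses plain regular expressions, not the twitter-text API; the function name remainingText and the patterns are assumptions of mine, and it keeps hashtag words so that a sentence like "my #cat is purple" still reads correctly.

```php
<?php
// Hypothetical helper: strip URLs, the RT marker and @mentions, drop the '#'
// from hashtags, and return whatever plain text is left.
function remainingText(string $text): string
{
    $patterns = [
        '~https?://\S+~i',      // URLs
        '~\bRT\b:?~',           // retweet marker (note: also removes a literal word "RT")
        '~(?<![\w/])@\w+:?~',   // @mentions, with an optional trailing colon
    ];
    $text = preg_replace($patterns, '', $text);

    // Keep the hashtag word, remove only the symbol.
    $text = preg_replace('~(?<!\w)#(\w+)~', '$1', $text);

    // Collapse the gaps left behind by the removals.
    return trim(preg_replace('~\s+~', ' ', $text));
}

echo remainingText('RT @tkzic: my #cat is purple http://t.co/abc'), "\n";
```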

hashtags – using regular expressions (the examples in this thread are in R)

http://stackoverflow.com/questions/11551065/parsing-tweets-to-extract-hashtags-in-r
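The same idea carries over to PHP. This is a small sketch rather than a port of any answer in that thread; extractHashtags is a made-up name and the pattern is deliberately simple.

```php
<?php
// Hypothetical helper: return the hashtags in a tweet, without the leading '#'.
function extractHashtags(string $text): array
{
    // The lookbehind rejects a '#' glued to a preceding word, an HTML entity
    // such as "&#39;", or a URL path.
    preg_match_all('/(?<![\w&\/])#(\w+)/', $text, $matches);

    return $matches[1] ?? [];
}

print_r(extractHashtags('Feeding the #cats before #Caturday starts'));
```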

twitter-text-rb : ruby gem which parses out usernames and hashtags

https://github.com/twitter/twitter-text-rb

 

Twitter streaming from php to Max

update 6/2014 – This project is part of the Internet sensors projects: https://reactivemusic.net/?p=5859. Check the link for current versions.

original post

notes

Got a test patch running today which breaks out tweets (in PHP and curl) and sends them to Max via OSC.
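For reference, this is roughly what an OSC message looks like on the wire. The sketch below is not the OSCClient class from udp.php; the function names are hypothetical, and it only covers a message with a single string argument, following the OSC 1.0 rule that every string is NUL-terminated and padded to a multiple of 4 bytes. The port 7400 matches the send port used in the listing further down and assumes something like a [udpreceive 7400] object on the Max side.

```php
<?php
// Pad a string the OSC way: at least one NUL, then more until the total
// length is a multiple of 4.
function oscString(string $s): string
{
    $len = strlen($s) + 1;                  // room for the mandatory terminator
    $padded = $len + (4 - $len % 4) % 4;    // round up to a multiple of 4
    return str_pad($s, $padded, "\0");
}

// Build an OSC message carrying one string argument (type tag ",s").
function oscMessage(string $address, string $arg): string
{
    return oscString($address) . oscString(',s') . oscString($arg);
}

// Hypothetical sender: write the packet to a UDP port.
function oscSend(string $packet, string $host = '127.0.0.1', int $port = 7400): bool
{
    $sock = @fsockopen("udp://$host", $port, $errno, $errstr, 1);
    if ($sock === false) {
        return false;
    }
    $ok = @fwrite($sock, $packet) !== false;
    fclose($sock);
    return $ok;
}

oscSend(oscMessage('/tweet', 'my cat is purple'));
```

UDP is connectionless, so the send goes through even when nothing is listening on the port.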

(update) The data is now parsed to remove hyperlinks and Twitter symbols.

It took some tweaking of global variables in PHP, and it would probably be better written using classes, as in this example (see the post from GZipp): http://stackoverflow.com/questions/1397234/php-curl-read-incrementally
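A class-based version might look something like the sketch below. It is my own outline, not GZipp's code: the class name TweetStreamHandler and its methods are hypothetical. It also buffers the incoming data, on the assumption that messages are newline-delimited and that a curl chunk can contain part of a message or several messages at once.

```php
<?php
// Hypothetical class: keeps the stream state in properties instead of globals.
class TweetStreamHandler
{
    private string $buffer = '';
    private int $count = 0;
    /** @var callable */
    private $onTweet;

    public function __construct(callable $onTweet)
    {
        $this->onTweet = $onTweet;
    }

    // Same shape as a CURLOPT_WRITEFUNCTION callback: (handle, data) -> bytes consumed.
    public function write($handle, string $data): int
    {
        $this->buffer .= $data;

        // Pull out every complete line; whatever is left stays in the buffer.
        while (($pos = strpos($this->buffer, "\n")) !== false) {
            $line = trim(substr($this->buffer, 0, $pos));
            $this->buffer = (string) substr($this->buffer, $pos + 1);
            if ($line === '') {
                continue;   // blank keep-alive line
            }
            $json = json_decode($line);
            if (isset($json->user, $json->text)) {
                $this->count++;
                ($this->onTweet)($json, $this->count);
            }
        }

        return strlen($data);
    }

    public function count(): int
    {
        return $this->count;
    }
}

// Feed two chunks by hand: the tweet is split across them but decoded once.
$handler = new TweetStreamHandler(function ($tweet, $n) {
    echo "$n @{$tweet->user->screen_name}: {$tweet->text}\n";
});
$handler->write(null, '{"user":{"screen_name":"tkzic"},"te');
$handler->write(null, "xt\":\"my cat is purple\"}\r\n");
```

With curl, the handler would be registered as curl_setopt($ch, CURLOPT_WRITEFUNCTION, [$handler, 'write']); and the OSC sending would go inside the callback passed to the constructor.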

Max patch: tkzic/max teaching examples/twitter-php-streamer1.maxpat

php code: twitterStreamMax.php

<?php

// max-osc-play.php
//
//	collection of php OSC code from Max stock-market thing
//

include 'udp.php';		// udp data sending stuff

$DESTINATION = 'localhost';
$SENDPORT = '7400';
$RECVPORT = '7401';

//////////////////////////////////////////////////////////////////////////////////////////

	$USERNAME = 'username';
	$PASSWORD = 'password';
	$QUERY    = 'cats';		// the hashtag # is optional

	// these variables are defined as global so they can be used inside the write callback function
	global $osc;
	global $kount;

	// initialize OSC
	$osc = new OSCClient();  // OSC object
	$osc->set_destination($DESTINATION, $SENDPORT);

	// This amazing program uses curl to access the Twitter streaming API and breaks the data
	// into individual tweets which can be saved in a database, sent out via OSC, or whatever
	//

	/**
	 * Called every time a chunk of data is read, this will be a json encoded message
	 * 
	 * @param resource $handle The curl handle
	 * @param string   $data   The data chunk (json message)
	 */
	function writeCallback($handle, $data)
	{
	    /*
	    echo "-----------------------------------------------------------\n";
	    echo $data;
	    echo "-----------------------------------------------------------\n";
	    */

		$maxdata = "/tweet" ;				// header - begin   
		global $kount;					// test counter
		global $osc;						// osc object

	    $json = json_decode($data);
	    if (isset($json->user) && isset($json->text)) {

			// here we have a single tweet
	        echo "@{$json->user->screen_name}: {$json->text}\n\n";

			// do some cleaning up...
			// remove URLs
			$s = $json->text;		// raw tweet text

			// ok, now need to do the same thing below for URLs, RTs, @s, etc.,
			// and then remove redundant spaces
			/* example
			Depending on how greedy you'd like to be, you could do something like:

			$pg_url = preg_replace("/[^a-zA-Z 0-9]+/", " ", $pg_url);

			This will replace anything that isn't a letter, number or space

			*/		

			// display all hashtags and their indices
			foreach( $json->entities->hashtags as $obj )
			{
			  echo "#:{$obj->text}\n";		// display hashtag
			  // get rid of the hashtag
			 	// note: this gets rid of all hashtags, which could obscure the meaning of the tweet, if
				// the hashtag is used inside a sentence like: "my #cat is purple" - would be changed to: "my is purple"
				// so we could use some intelligent parsing here...

			//  $s = str_replace("#{$obj->text}", "", $s );

			// this is a more benign approach, which leaves the word but removes the #

			$s = str_replace("#{$obj->text}", "{$obj->text}", $s );

			}

			foreach( $json->entities->urls as $obj )
			{
			  echo "U:{$obj->url}\n";		// display url			
			  $s = str_replace("{$obj->url}", "", $s );   // get rid of the url		
			}

			foreach( $json->entities->user_mentions as $obj )
			{
				echo "@:{$obj->screen_name}\n";		// display 			
				$s = str_replace("RT @{$obj->screen_name}:", "", $s );   // get rid of re-tweets
				$s = str_replace("@{$obj->screen_name}:", "", $s );   // get rid of other user mentions
				$s = str_replace("@{$obj->screen_name}", "", $s );   // get rid of other user mentions		
			}

			// $s = str_replace("RT ", "", $s );   // get rid of RT's (re-tweet indicators)

			// $s = preg_replace( '/[^[:print:]]/', '',$s); // remove non printable characters

			$s = htmlspecialchars_decode($s);		// decode stuff like &gt;

			$s = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/u', '', $s); // get rid of unicode junk

			$s = preg_replace('/[^\x20-\x7E\s]+/', '', $s);		// drop everything except printable ASCII and whitespace

			$s = preg_replace('/\s+/', ' ', $s);	// collapse runs of whitespace (including newlines) to one space

			echo "revised tweet: {$s}\n";

			$maxdata = "/tweet " . $s;		// send the cleaned-up text rather than the raw tweet
			// $maxdata = $maxdata . " " . $kount++;
		   	$osc->send(new OSCMessage($maxdata));

	    }

	    return strlen($data);
	}

// initialize curl

	$ch = curl_init();

	curl_setopt($ch, CURLOPT_URL, 'https://stream.twitter.com/1/statuses/filter.json?track=' . urlencode($QUERY));
	curl_setopt($ch, CURLOPT_USERPWD, "$USERNAME:$PASSWORD");
	curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'writeCallback');
	curl_setopt($ch, CURLOPT_TIMEOUT, 20); // disconnect after 20 seconds for testing
	curl_setopt($ch, CURLOPT_VERBOSE, 1);  // debugging
	curl_setopt($ch, CURLOPT_ENCODING,  'gzip, deflate'); // req'd to get gzip
	curl_setopt($ch, CURLOPT_USERAGENT, 'tstreamer/1.0'); // req'd to get gzip

	curl_exec($ch); // commence streaming

	$info = curl_getinfo($ch);

	var_dump($info);

?>

Twitter streaming php decoder breaks out individual tweets

This code was adapted (i.e. stolen verbatim) from a Stack Overflow post by drew010:

http://stackoverflow.com/questions/10337984/using-the-curl-output

Here’s the code. It solves a huge problem for the class of projects which need to grab a large number of tweets in real time, either to save them in a database or to trigger some action.

My version of the code is in tkzic/api/twitterStream1.php

<?php

$USERNAME = 'youruser';
$PASSWORD = 'yourpass';
$QUERY    = 'nike';

/**
 * Called every time a chunk of data is read, this will be a json encoded message
 * 
 * @param resource $handle The curl handle
 * @param string   $data   The data chunk (json message)
 */
function writeCallback($handle, $data)
{
    /*
    echo "-----------------------------------------------------------\n";
    echo $data;
    echo "-----------------------------------------------------------\n";
    */

    $json = json_decode($data);
    if (isset($json->user) && isset($json->text)) {
        echo "@{$json->user->screen_name}: {$json->text}\n\n";
    }

    return strlen($data);
}

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'https://stream.twitter.com/1/statuses/filter.json?track=' . urlencode($QUERY));
curl_setopt($ch, CURLOPT_USERPWD, "$USERNAME:$PASSWORD");
curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'writeCallback');
curl_setopt($ch, CURLOPT_TIMEOUT, 20); // disconnect after 20 seconds for testing
curl_setopt($ch, CURLOPT_VERBOSE, 1);  // debugging
curl_setopt($ch, CURLOPT_ENCODING,  'gzip, deflate'); // req'd to get gzip
curl_setopt($ch, CURLOPT_USERAGENT, 'tstreamer/1.0'); // req'd to get gzip

curl_exec($ch); // commence streaming

$info = curl_getinfo($ch);

var_dump($info);

Sending Tweets from Arduino through Pachube.com

http://www.tigoe.com/pcomp/code/arduinowiring/1135/#more-1135

from Tom Igoe

(update) I have got this working, exactly as described in the Igoe post. The code is in EthernetPachubeTweeter_tz1.

Essentially, anything that originates from the Arduino is sent to a feed in Pachube. That feed has a datastream with a trigger that tweets any new data that arrives.

The next thing to try is figuring out whether this can be done as a single-line HTTP request in curl and, therefore, from Max or any other source.

(update) – This is slightly broken. Check out the post about converting Cosm to Xively: https://reactivemusic.net/?p=6843

 

Twitter streaming API examples

Update 5/2014 – All of these examples are broken because a Twitter API upgrade now requires OAuth instead of a username and password. I have left this post up as an example of what you can do. For alternatives, see the internet-sensors projects: https://github.com/tkzic/internet-sensors
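To give a sense of what the OAuth requirement adds, here is a hedged sketch of the OAuth 1.0a request-signing step (HMAC-SHA1, as described in RFC 5849). It is not taken from the internet-sensors code. The function names are hypothetical, the keys and secrets are placeholders, and the 1.1 endpoint URL is an assumption that may no longer be available. Building the Authorization header from the signed parameters is left out.

```php
<?php
// Build the OAuth 1.0a signature base string: METHOD & encoded URL & encoded,
// sorted parameter list.
function oauthBaseString(string $method, string $url, array $params): string
{
    $pairs = [];
    foreach ($params as $key => $value) {
        $pairs[rawurlencode((string) $key)] = rawurlencode((string) $value);
    }
    ksort($pairs, SORT_STRING);     // parameters are sorted by encoded name

    $joined = [];
    foreach ($pairs as $key => $value) {
        $joined[] = $key . '=' . $value;
    }

    return strtoupper($method) . '&' . rawurlencode($url) . '&'
        . rawurlencode(implode('&', $joined));
}

// Sign the base string with HMAC-SHA1, keyed by "consumer secret & token secret".
function oauthSignature(string $baseString, string $consumerSecret, string $tokenSecret): string
{
    $key = rawurlencode($consumerSecret) . '&' . rawurlencode($tokenSecret);
    return base64_encode(hash_hmac('sha1', $baseString, $key, true));
}

$params = [
    'oauth_consumer_key'     => 'CONSUMER_KEY',     // placeholder
    'oauth_nonce'            => bin2hex(random_bytes(16)),
    'oauth_signature_method' => 'HMAC-SHA1',
    'oauth_timestamp'        => (string) time(),
    'oauth_token'            => 'ACCESS_TOKEN',     // placeholder
    'oauth_version'          => '1.0',
    'track'                  => 'cats',
];
// Assumed endpoint, shown only to illustrate the call.
$base = oauthBaseString('POST', 'https://stream.twitter.com/1.1/statuses/filter.json', $params);
$params['oauth_signature'] = oauthSignature($base, 'CONSUMER_SECRET', 'TOKEN_SECRET');

echo $params['oauth_signature'], "\n";
```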

— original post —

Here is an example that I actually got working to track mentions of dogs. You need to replace USER:PASSWORD with your Twitter login and password. The JSON search results will be written to the file tweets.json.

While this is running, curl's progress meter displays a running tally of the data received in the console.

curl https://stream.twitter.com/1/statuses/filter.json -u USER:PASSWORD -d "track=dog" > tweets.json

This one searches for #cats (hashtag)

curl https://stream.twitter.com/1/statuses/filter.json -u USERNAME:PASSWORD -d "track=#cats" > tweets.json
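Once either command has written some data, tweets.json can be read back in PHP. The sketch below assumes the file holds one JSON message per line, possibly with blank keep-alive lines, non-tweet messages, and a truncated last line if the stream was cut off. The function name readTweets is hypothetical.

```php
<?php
// Hypothetical helper: read newline-delimited JSON and return only the tweets.
function readTweets(string $path): array
{
    $tweets = [];
    if (!is_readable($path)) {
        return $tweets;
    }

    $fh = fopen($path, 'r');
    while (($line = fgets($fh)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue;                       // blank keep-alive line
        }
        $json = json_decode($line);         // null for a truncated or invalid line
        if (isset($json->user, $json->text)) {
            $tweets[] = $json;
        }
    }
    fclose($fh);

    return $tweets;
}

foreach (readTweets('tweets.json') as $tweet) {
    echo "@{$tweet->user->screen_name}: {$tweet->text}\n";
}
```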

https://dev.twitter.com/discussions/2403

Haven’t tried this one:

Do you have a specific example which doesn’t appear to work? Following Taylor’s advice, I was able to find several streaming entries tracking the “photo” keyword:

curl https://stream.twitter.com/1/statuses/filter.json -d 'track=photo' -u [username]:[password] -# | grep "\"media_url\""

Here is a technique (in the answer) which tracks when the stream gets a hit

http://stackoverflow.com/questions/4786786/using-curl-to-update-mysql-when-curl-spits-out-json

Here are other useful links and examples:

basic curl statuses example, no filters

https://dev.twitter.com/discussions/9911

Look at the post from Matt Seward to do locations…

https://dev.twitter.com/discussions/3779

followers:

https://dev.twitter.com/discussions/4067

tracking:

https://dev.twitter.com/discussions/5520