Want to guess what the most popular pandas-related question on StackOverflow is about? It's this one, on workflows for large data: http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas

The real strength of pandas is in medium-data analytics, which can roughly be described as "datasets that fit comfortably in memory". Depending on your data needs (and your ability to rent time on a big EC2 instance with, say, 244 GiB of RAM), this section may not apply to you.

Pandas is not meant for "Big Data", but then again you probably don't have big data.
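A quick way to tell whether your data is "medium data" for your machine is to check a DataFrame's actual footprint with `memory_usage`. A minimal sketch (the frame here is made up to stand in for your real data):

```python
import numpy as np
import pandas as pd

# A small frame standing in for your real data
df = pd.DataFrame({
    "a": np.random.randn(100000),
    "b": np.random.randint(0, 10, size=100000),
})

# deep=True also counts the memory behind object (string) columns
mb = df.memory_usage(deep=True).sum() / 1e6
print("{:.1f} MB".format(mb))
```

Compare that number against your available RAM (leaving headroom for intermediate copies, which pandas operations often make).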

Chunking and Iteration

The first technique for handling larger-than-memory data is chunking: split the data into batches and iterate over each batch. This immediately rules out algorithms that require the full dataset to be in memory at once, but with a bit of cleverness you can work around that limitation for many problems.
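As a sketch of the pattern, `read_csv` accepts a `chunksize` and returns an iterator of DataFrames; here a small in-memory CSV stands in for a file too large to read at once:

```python
import io
import pandas as pd

# Stand-in for a file too large to read at once
csv = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
# chunksize makes read_csv yield DataFrames of (at most) 4 rows each
for chunk in pd.read_csv(csv, chunksize=4):
    total += chunk["value"].sum()  # reduce each batch, keep only the running total

print(total)
```

The key is that each reduction keeps only a small summary (here a running sum), so peak memory is bounded by the chunk size, not the file size.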

In [4]:
import pandas as pd

pd.Timestamp('2014-01-01').strftime("%Y%m")
Out[4]:
'201401'
In [18]:
from distributed import Executor

executor = Executor('127.0.0.1:8786')
In [24]:
from distributed.diagnostics import progress
In [25]:
progress?
In [8]:
import os
import requests
In [12]:
def download_month(month):
    # Download one month of Eurostat Comext trade data as a 7z archive.
    os.makedirs('comext', exist_ok=True)
    base = ("http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/"
            "BulkDownloadListing?sort=1&"
            "downfile=comext%2F2015S1%2Fdata%2Fnc{:%Y%m}.7z")
    # Stream the response to disk so the whole file is never held in memory.
    r = requests.get(base.format(month), stream=True)
    filename = 'comext/{:%Y-%m}.tsv.7z'.format(month)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)
    return filename

In [26]:
dates = pd.date_range(start='2012-01-01', end='2014-12-01', freq='M')  # month-end frequency
futures = executor.map(download_month, dates)
progress(*futures)
In [61]:
!rename -S .gz .7z comext/*.gz
In [73]:
rm nc201401.dat
In [83]:
!7z x -ocomext comext/*.7z
7-Zip [64] 15.09 beta : Copyright (c) 1999-2015 Igor Pavlov : 2015-10-16
p7zip Version 15.09 beta (locale=utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)

Scanning the drive for archives:
  0M Scan         1 file, 38021410 bytes (37 MiB)

Extracting archive: comext/2014-01.tsv.7z
--
Path = comext/2014-01.tsv.7z
Type = 7z
Physical Size = 38021410
Headers Size = 126
Method = LZMA:26
Solid = -
Blocks = 1

  Extracting nc201401.dat ... Everything is Ok

Size:       244210771
Compressed: 38021410
In [89]:
df = pd.read_csv('comext/nc201401.dat', dtype={'DECLARANT': 'object'})
In [90]:
df.head()
Out[90]:
DECLARANT PARTNER PRODUCT_NC FLOW STAT_REGIME PERIOD VALUE_1000ECU QUANTITY_TON SUP_QUANTITY
0 001 3 01 1 4 201401 2910.83 521.5 NaN
1 001 3 01 2 4 201401 1234.51 250.7 NaN
2 001 3 01012100 1 4 201401 92.11 2.0 4.0
3 001 3 01012100 2 4 201401 10.55 0.5 1.0
4 001 3 01012990 1 4 201401 32.97 1.1 2.0
In [1]:
import dask.dataframe as dd
In [2]:
import zipfile
In [14]:
zf = zipfile.ZipFile('ml-latest.zip')

zf.extractall()
In [3]:
ls ml-latest/
README.txt   links.csv    movies.csv   ratings.csv  tags.csv
In [4]:
df = pd.read_csv('ml-latest/ratings.csv')
df['timestamp'] = pd.to_datetime(df.timestamp, unit='s')
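The `unit='s'` conversion above interprets the raw integers as Unix seconds. A quick sanity check of that interpretation:

```python
import pandas as pd

# 0 seconds after the epoch is midnight, January 1, 1970 (UTC)
ts = pd.to_datetime(0, unit="s")
print(ts)
```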
In [5]:
ratings = dd.from_pandas(df, npartitions=100)
In [6]:
ratings.head()
Out[6]:
userId movieId rating timestamp
0 1 169 2.5 2008-03-07 22:08:14
1 1 2471 3.0 2008-03-07 22:03:58
2 1 48516 5.0 2008-03-07 22:03:55
3 2 2571 3.5 2015-07-06 06:50:33
4 2 109487 4.0 2015-07-06 06:51:36
In [43]:
s = df.head(1000000)
In [44]:
s2 = dd.from_pandas(s, npartitions=20)
In [45]:
def sessionize(ts):
    return (ts.sort_values().diff() >= pd.Timedelta(1, unit='h')).fillna(True).cumsum()
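To see what `sessionize` does, apply it to a small hand-built series: any gap of an hour or more starts a new session, and the `cumsum` numbers the sessions within a user. A minimal check (timestamps are made up):

```python
import pandas as pd

def sessionize(ts):
    # A gap of >= 1 hour starts a new session; cumsum labels the sessions
    return (ts.sort_values().diff() >= pd.Timedelta(1, unit='h')).fillna(True).cumsum()

ts = pd.Series(pd.to_datetime([
    '2020-01-01 00:00', '2020-01-01 00:30',  # session 0
    '2020-01-01 02:00', '2020-01-01 02:15',  # session 1 (90-minute gap before)
]))
print(sessionize(ts).tolist())
```

Note that `diff()` leaves a `NaT` at the first position, which compares as False, so the first rating always lands in session 0.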
In [48]:
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler
In [56]:
with Profiler() as prof, ResourceProfiler() as rprof:
    out = ratings.groupby('userId').timestamp.apply(sessionize, columns='timestamp').compute()
In [57]:
prof.visualize()
Out[57]:
<bokeh.plotting.figure.Figure at 0x139a92ba8>
In [58]:
rprof.visualize()
Out[58]:
<bokeh.plotting.figure.Figure at 0x113cb8860>
In [47]:
%%time
s.groupby('userId').timestamp.apply(sessionize)
CPU times: user 7.8 s, sys: 143 ms, total: 7.94 s
Wall time: 7.93 s
Out[47]:
userId        
1       2           0
        1           0
        0           0
2       3           0
        4           0
                 ... 
10790   999616    111
        999595    111
        999596    111
        999758    112
        999475    113
dtype: int64
In [31]:
%%time
s2.groupby(level=0).timestamp.apply(sessionize)
CPU times: user 1.35 s, sys: 24 ms, total: 1.37 s
Wall time: 1.38 s
Out[31]:
userId  userId  movieId
1       1       48516      0
                2471       0
                169        0
2       2       2571       0
                109487     0
                          ..
1052    1052    50872      1
                59315      1
                47099      2
                1246       2
                356        2
dtype: int64
In [67]:
df.groupby(['userId']).timestamp.apply(sessionize)
Out[67]:
userId          
1       2           0
        1           0
        0           0
2       3           0
        4           0
                   ..
247753  22884374    0
        22884369    0
        22884373    0
        22884368    1
        22884365    1
dtype: int64
In [43]:
ratings.groupby('userId').rating.apply(np.mean)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-43-4bc4de3fdded> in <module>()
----> 1 ratings.groupby('userId').rating.apply(np.mean)

/Users/tom.augspurger/Envs/blog/lib/python3.5/site-packages/dask/dataframe/groupby.py in apply(self, func, columns)
    205         """
    206         # df = set_index(self.df, self.index, **self.kwargs)
--> 207         if self.index._name == self.df.index._name:
    208             return map_partitions(_groupby_level0_getitem_apply,
    209                                   self.df, self.key, func,

AttributeError: 'str' object has no attribute '_name'
In [33]:
df
Out[33]:
userId movieId rating timestamp
0 1 169 2.5 2008-03-07 22:08:14
1 1 2471 3.0 2008-03-07 22:03:58
2 1 48516 5.0 2008-03-07 22:03:55
3 2 2571 3.5 2015-07-06 06:50:33
4 2 109487 4.0 2015-07-06 06:51:36
... ... ... ... ...
95 4 1966 3.0 2002-11-19 21:02:20
96 4 2132 5.0 2002-11-19 20:26:41
97 4 2174 4.0 2004-06-29 21:35:21
98 4 2248 4.0 2002-11-19 20:45:14
99 4 2289 4.0 2002-11-19 20:38:55

100 rows × 4 columns

In [32]:
# %load ml-latest/README.txt
Summary
=======

This dataset (ml-latest) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 22884377 ratings and 586994 tag applications across 34208 movies. These data were created by 247753 users between January 09, 1995 and January 29, 2016. This dataset was generated on January 29, 2016.

Users were selected at random for inclusion. All selected users had rated at least 1 movie. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in four files, `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.


Usage License
=============

Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:

* The user may not state or imply any endorsement from the University of Minnesota or the GroupLens Research Group.
* The user must acknowledge the use of the data set in publications resulting from the use of the data set (see below for citation information).
* The user may not redistribute the data without separate permission.
* The user may not use this information for any commercial or revenue-bearing purposes without first obtaining permission from a faculty member of the GroupLens Research Project at the University of Minnesota.
* The executable software scripts are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of them is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.

In no event shall the University of Minnesota, its affiliates or employees be liable to you for any damages arising out of the use or inability to use these programs (including but not limited to loss of data or data being rendered inaccurate).

If you have any further questions or comments, please email <grouplens-info@cs.umn.edu>


Citation
========

To acknowledge use of the dataset in publications, please cite the following paper:

> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=<http://dx.doi.org/10.1145/2827872>


Further Information About GroupLens
===================================

GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens's research projects have explored a variety of fields including:

* recommender systems
* online communities
* mobile and ubiquitous technologies
* digital libraries
* local geographic information systems

GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens, which is the source of these data. We encourage you to visit <http://movielens.org> to try it out! If you have exciting ideas for experimental work to conduct on MovieLens, send us an email at <grouplens-info@cs.umn.edu> - we are always interested in working with external collaborators.


Content and Use of Files
========================

Formatting and Encoding
-----------------------

The dataset files are written as [comma-separated values](http://en.wikipedia.org/wiki/Comma-separated_values) files with a single header row. Columns that contain commas (`,`) are escaped using double-quotes (`"`). These files are encoded as UTF-8. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.

User Ids
--------

MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between `ratings.csv` and `tags.csv` (i.e., the same id refers to the same user across the two files).

Movie Ids
---------

Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id `1` corresponds to the URL <https://movielens.org/movies/1>). Movie ids are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same id refers to the same movie across these four data files).


Ratings Data File Structure (ratings.csv)
-----------------------------------------

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

Tags Data File Structure (tags.csv)
-----------------------------------

All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

    userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

Movies Data File Structure (movies.csv)
---------------------------------------

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)

Links Data File Structure (links.csv)
---------------------------------------

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,imdbId,tmdbId

movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.

imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.

tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.

Use of the resources listed above is subject to the terms of each provider.

Cross-Validation
----------------

Prior versions of the MovieLens dataset included either pre-computed cross-folds or scripts to perform this computation. We no longer bundle either of these features with the dataset, since most modern toolkits provide this as a built-in feature. If you wish to learn about standard approaches to cross-fold computation in the context of recommender systems evaluation, see [LensKit](http://lenskit.org) for tools, documentation, and open-source code examples.