Questions tagged [aggregation]

47 questions
7
votes
4 answers

How to group by multiple columns in a dataframe using R and apply an aggregate function

I have a dataframe with the columns defined below. I have provided one example row; similarly, I have many countries with loan amount and gender variables: country loan_amount gender 1 Austia 175 …
SRS
5
votes
1 answer

Pandas dataframe resample aggregation by mills too slow

Given this test data: import pandas as pd import numpy as np data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02…
hinzsa
4
votes
2 answers

Approach to creating a user profile in music web application

I am working on a use case, and I'm unsure of the best way to proceed: in order to analyze the behavior of users of a web-based music application, we retain all songs each has played since 2009. We store this information in flat files, each…
4
votes
2 answers

How can I calculate mean and variance incrementally?

Say I have a set S of values, and want to store in a database some summary information about that set, so that later when I acquire a new value v I can make a reasonable estimate of what the summary information would be about the set S ∪ {v} ---…
dubiousjim
3
votes
0 answers

Difference between bagging and bootstrap aggregating

The bootstrap is due to Efron. Tibshirani wrote a book about it with reference to Efron. The bootstrap process for estimating the standard error of a statistic s(x): B bootstrap samples are generated from the original data. Finally the standard deviation of the…
martin
3
votes
0 answers

User defined aggregations on data of around 200GB where row order matters

I am working with "medium large" data of around 200GB. The data are long form log files, where there are several thousand logs for each "entity". The entities are actually flights and each log entry occurs at a different time stamp. Temporal order…
Placidia
3
votes
1 answer

Can I apply survival analysis to predict if a user will revisit the website?

I have a business problem at hand: predicting whether a user will revisit the website within 6 months. I mainly need to understand which factors make the user return, and I also need to give business recommendations on what…
3
votes
1 answer

Pandas DataFrame: Aggregating multi-level groups by matching keys

I have some data that looks like this; data.head() stock date binNum volume 0 stock0 d120 2 249500.0 1 stock0 d120 3 81500.0 2 stock0 d120 4 79000.0 3 stock0 d120 5 244000.0 4 stock0 d120 6 …
Steztric
3
votes
1 answer

Aggregation of Discount

I am trying to predict the sales quantity of an item based on its attributes. Discount is one of those attributes. The problem is that I have different discounts in the same period for the same item. I need to consider sales period-wise, i.e. by week. While…
2
votes
1 answer

Labeling and aggregating features issue

I am trying to build a simple binary classifier (some tree-based algorithm for now), and my training data will have features aggregated at the user level, so I'll have a unique record for each user. These aggregated features are like "number of logged…
2
votes
1 answer

Feature Selection on Aggregated Targetdata

I have a question about feature selection on a dataset where the target variable is aggregated by the sum of different data points. I want to predict the number of sales depending on a variety of features like: week price per unit store…
2
votes
1 answer

How to deal with a potentially multiple categorical variable

I'm building a model that has some categorical variables as inputs. I have dealt with this sort of data before and applied different techniques, such as creating dummy variables and factor scoring. However, I now have a different type of problem…
2
votes
1 answer

What are the approaches to aggregate categorical variables?

I am working on a clickstream dataset. I have come up with the following example dataset to explain my problem: ClickTimeStamp | SessionID | ART_weekOfYear | PagenameClicked | TimeSpentPerSession | CustID | ContractID | ... | TARGET…
2
votes
2 answers

Privacy through moving averages?

I am considering the following hypothetical situation: I have a time series of data. In general, 'the public' should have access to features of this data. However, making the time series available would constitute a privacy leak. I am considering…
Elle Najt
2
votes
0 answers

Aggregating over sparse data

I am not sure if the title accurately reflects my problem, but I essentially would like to aggregate a set of metrics of a similar nature that come from different data sources into a single metric. Say we are measuring MetricA from SourceA, MetricB…
chi