Questions tagged [efficiency]

Efficiency, in algorithmic processing, is usually associated with resource usage. The metrics used to evaluate the efficiency of a process commonly account for execution time, memory/disk or storage requirements, network usage, and power consumption.

42 questions
94
votes
12 answers

How big is big data?

Lots of people use the term big data in a rather commercial way, as a means of indicating that large datasets are involved in the computation, and therefore potential solutions must have good performance. Of course, big data always carry associated…
Rubens
  • 4,117
65
votes
6 answers

When is a Model Underfitted?

Logic often states that by underfitting a model, its capacity to generalize is increased. That said, clearly at some point underfitting causes models to become worse regardless of the complexity of the data. How do you know when your model has…
blunders
  • 1,932
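A minimal sketch of one practical way to check this, assuming scikit-learn and a synthetic dataset (all names and thresholds below are illustrative): sweep model capacity and flag settings where training and validation scores are both low and close together, which is the usual signature of underfitting.

```python
# Minimal sketch (assumes scikit-learn; synthetic data): sweep model capacity
# and flag settings where training and validation R^2 are both low and close
# together -- the usual signature of an underfitted model.
from sklearn.datasets import make_regression
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

depths = [1, 2, 4, 8, 16]  # capacity knob: shallow trees underfit, deep ones don't
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Illustrative thresholds only: low score AND small train/val gap.
    flag = "likely underfitted" if tr < 0.7 and abs(tr - va) < 0.1 else ""
    print(f"max_depth={d:2d}  train R^2={tr:.2f}  val R^2={va:.2f}  {flag}")
```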
22
votes
1 answer

XGBRegressor vs. xgboost.train huge speed difference?

If I train my model using the following code: import xgboost as xg params = {'max_depth':3, 'min_child_weight':10, 'learning_rate':0.3, 'subsample':0.5, 'colsample_bytree':0.6, 'obj':'reg:linear', 'n_estimators':1000, 'eta':0.3} features =…
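For reference, a minimal sketch (with made-up data) contrasting the two interfaces mentioned in the question. Note that the native API sets the number of boosting rounds through num_boost_round rather than an 'n_estimators' entry in params, which by itself can account for a large apparent speed gap.

```python
# Minimal sketch (hypothetical data) contrasting xgboost's scikit-learn wrapper
# (XGBRegressor) with the native xgboost.train API.
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 20)
y = np.random.rand(1000)

params = {'max_depth': 3, 'min_child_weight': 10, 'learning_rate': 0.3,
          'subsample': 0.5, 'colsample_bytree': 0.6,
          'objective': 'reg:squarederror'}

# scikit-learn style wrapper: n_estimators is an explicit constructor argument.
sk_model = xgb.XGBRegressor(n_estimators=1000, **params)
sk_model.fit(X, y)

# Native API: data goes through a DMatrix, and the number of boosting rounds is
# passed separately; leaving num_boost_round at its default is a common source
# of large timing differences between the two code paths.
dtrain = xgb.DMatrix(X, label=y)
native_model = xgb.train(params, dtrain, num_boost_round=1000)
```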
14
votes
4 answers

Looking for example infrastructure stacks/workflows/pipelines

I'm trying to understand how all the "big data" components play together in a real-world use case, e.g. Hadoop, MongoDB/NoSQL, Storm, Kafka, ... I know that this is quite a wide range of tools used for different types, but I'd like to get to know…
13
votes
2 answers

Is FPGrowth still considered "state of the art" in frequent pattern mining?

As far as I know, the development of algorithms to solve the Frequent Pattern Mining (FPM) problem has passed some main checkpoints along its road of improvements. Firstly, the Apriori algorithm was proposed in 1993 by Agrawal et al., along with the…
Rubens
  • 4,117
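As a point of reference for what any FPM algorithm ultimately computes, here is a naive support-counting sketch over a toy set of transactions (Apriori and FP-Growth differ in how they prune and traverse this search space, not in what "frequent" means).

```python
# Toy sketch (hypothetical transactions): naive support counting for small
# itemsets, the quantity that Apriori and FP-Growth compute efficiently.
from itertools import combinations
from collections import Counter

transactions = [
    {'bread', 'milk'},
    {'bread', 'diapers', 'beer', 'eggs'},
    {'milk', 'diapers', 'beer', 'cola'},
    {'bread', 'milk', 'diapers', 'beer'},
    {'bread', 'milk', 'diapers', 'cola'},
]
min_support = 0.6  # fraction of transactions an itemset must appear in

counts = Counter()
for t in transactions:
    for k in (1, 2):  # 1- and 2-itemsets only, for brevity
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1

frequent = {s: c / len(transactions) for s, c in counts.items()
            if c / len(transactions) >= min_support}
for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, round(support, 2))
```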
12
votes
3 answers

Best languages for scientific computing

It seems as though most languages have some number of scientific computing libraries available. Python has SciPy, Rust has SciRust, C++ has several including ViennaCL and Armadillo, and Java has Java Numerics and Colt as well as several others. Not to…
ragingSloth
  • 1,854
12
votes
2 answers

Tradeoffs between Storm and Hadoop (MapReduce)

Can someone kindly tell me about the trade-offs involved when choosing between Storm and MapReduce in a Hadoop cluster for data processing? Of course, aside from the obvious one, that Hadoop (processing via MapReduce in a Hadoop cluster) is a batch…
mbbce
  • 347
10
votes
3 answers

How do various statistical techniques (regression, PCA, etc) scale with sample size and dimension?

Is there a known general table of statistical techniques that explains how they scale with sample size and dimension? For example, a friend of mine told me the other day that the computation time of simply quick-sorting one-dimensional data of size n…
Bridgeburners
  • 229
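Absent such a table, a rough empirical check is easy to run. The sketch below (synthetic data, illustrative sizes) times a sort, expected to grow as O(n log n), and a thin SVD, roughly O(n·d²) for an n×d matrix with n ≥ d, as n doubles.

```python
# Rough empirical sketch (synthetic data): time two operations as n grows to
# observe their scaling in practice -- sorting (O(n log n)) and a thin SVD,
# which for an n x d matrix with n >= d costs roughly O(n * d^2).
import time
import numpy as np

d = 50
for n in (10_000, 20_000, 40_000, 80_000):
    x = np.random.rand(n)
    A = np.random.rand(n, d)

    t0 = time.perf_counter()
    np.sort(x)
    t_sort = time.perf_counter() - t0

    t0 = time.perf_counter()
    np.linalg.svd(A, full_matrices=False)
    t_svd = time.perf_counter() - t0

    print(f"n={n:6d}  sort={t_sort:.4f}s  svd={t_svd:.4f}s")
```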
10
votes
2 answers

Kernel trick explanation

In support vector machines, I understand it would be computationally prohibitive to calculate a basis function at every point in the data set. However, it is possible to find this optimal solution due to the so-called kernel trick. Other answers to…
user1717828
  • 245
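A small numeric sketch of the idea: for a homogeneous degree-2 polynomial kernel on R², K(x, z) = (x·z)² equals the inner product of explicit quadratic feature maps, so the feature mapping never has to be computed.

```python
# Numeric sketch of the kernel trick for a degree-2 polynomial kernel in R^2:
# K(x, z) = (x . z)^2 matches the dot product of the explicit feature maps
# phi(v) = (v1^2, sqrt(2) v1 v2, v2^2), so phi itself never needs materialising.
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-d vector."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel_value = np.dot(x, z) ** 2          # computed in the original 2-d space
explicit_value = np.dot(phi(x), phi(z))   # computed in the 3-d feature space

print(kernel_value, explicit_value)       # both print 16.0
```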
10
votes
1 answer

What is the most efficient data indexing technique

As we all know, there are some data indexing techniques used by well-known indexing apps, like Lucene (for Java) or Lucene.NET (for .NET), MurmurHash, B+Tree, etc. For a NoSQL / object-oriented database (which I try to write/play a little around…
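For orientation, a toy sketch of the inverted-index idea behind Lucene-style text indexing (document texts and names below are made up): map each term to the set of documents containing it, so a lookup never scans the whole collection.

```python
# Minimal sketch (toy documents) of an inverted index: each term maps to the
# set of document ids containing it, so term lookups avoid a full scan.
from collections import defaultdict

documents = {
    1: "efficient data indexing with b+trees",
    2: "hashing with murmurhash is fast",
    3: "lucene builds an inverted index over terms",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def search(term):
    """Return the ids of documents containing the exact term."""
    return sorted(index.get(term, set()))

print(search("indexing"))   # [1]
print(search("inverted"))   # [3]
```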
10
votes
4 answers

Why is it hard to grant efficiency while using libraries?

Any small database processing task can be easily tackled by Python/Perl/... scripts that use libraries and/or even utilities from the language itself. However, when it comes to performance, people tend to reach for C/C++/low-level languages. The…
Rubens
  • 4,117
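A small illustration of the overhead in question, using synthetic data: the same reduction written as an interpreted Python loop and as a single call into NumPy's compiled code.

```python
# Small illustration (synthetic data): an interpreted per-element loop versus
# one call into NumPy's compiled code for the same sum.
import time
import numpy as np

values = np.random.rand(2_000_000)

t0 = time.perf_counter()
total = 0.0
for v in values:          # each iteration goes through the interpreter
    total += v
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
total_np = values.sum()   # a single call into compiled code
t_numpy = time.perf_counter() - t0

print(f"python loop: {t_loop:.3f}s  numpy sum: {t_numpy:.3f}s")
```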
8
votes
2 answers

Filtering spam from retrieved data

I once heard that filtering spam by using blacklists is not a good approach, since some users searching for entries in your dataset may be looking for particular information from the blocked sources. It would also become a burden to continuously validate…
Rubens
  • 4,117
8
votes
3 answers

How to compare experiments run over different infrastructures

I'm developing a distributed algorithm, and to improve efficiency, it relies both on the number of disks (one per machine) and on an efficient load-balancing strategy. With more disks, we're able to reduce the time spent on I/O; and with an…
Rubens
  • 4,117
8
votes
3 answers

Levenshtein distance vs simple for loop

I have recently begun studying different data science principles, and have had a particular interest as of late in fuzzy matching. As a preface, I'd like to include smarter fuzzy searching in a proprietary language named "4D" at my workplace, so…
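For reference, a standard dynamic-programming Levenshtein implementation (sketched in Python rather than 4D), kept to two rolling rows so memory stays proportional to the shorter string.

```python
# Reference sketch: dynamic-programming Levenshtein distance with two rolling
# rows, i.e. O(len(a) * len(b)) time and O(min(len(a), len(b))) memory.
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a                     # ensure b is the shorter string
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```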
6
votes
2 answers

How to best accomplish high speed comparison of like data?

I frequently attack this problem inefficiently because it's always pretty low on the priority list and my clients are resistant to change until things break. I would like some input on how to speed things up. I have multiple datasets of…
Steve Kallestad
  • 3,208